Introduction to Hive Commands
Hive command is a data warehouse infrastructure tool that sits on top Hadoop to summarize Big data. It processes structured data. It makes data querying and analyzing easier.
Hive command is also called as “schema on reading;” Hive doesn’t verify data when it is loaded, verification happens only when a query is issued. This property of Hive makes it fast for initial loading. It’s like copying or simply moving a file without putting any constraints or checks.
The hive was first developed by Facebook. Apache Software Foundation took it up later and developed it further.
Here are the components of Hive command:
Fig 1. Components of Hive
Here are the Features of Hive command listed below:
- Hive stores are raw and processed dataset in Hadoop.
- It is designed for OnLine Transaction Processing (OLTP). OLTP is the systems that facilitate high volume data in very less time with no reliance on the single server.
- It is fast, scalable and reliable.
- The SQL type querying language provided here is called HiveQL or HQL. This makes ETL tasks and other analysis easier.
Fig 2. Hive Properties
Sources images:- Google
There are few limitations of Hive command as well, which are listed below:
- Hive doesn’t support subqueries.
- Hive surely supports over-writing, but unfortunately, it doesn’t support deletion and updates.
- Hive is not designed for OLTP, but it is used for it.
To enter the Hive’s interactive shell:
Basic Hive Commands
This will create the new database in Hive.
The drop will remove a table from Hive
Alter command will help you renaming the table or table columns.
hive> ALTER TABLE employee RENAME TO employee1;
Show command will show all the databases residing in Hive.
Describe command will help you with the information about the schema of the table.
Intermediate Hive Commands
Hive divides a table into variously related partitions based on columns. Using these partitions, it gets easier to query data. These partitions further get divided into buckets, to run query efficiently on to data.
In other words, buckets distribute data into the set of clusters by calculating the hash code of key mentioned in the query.
Adding partition can be accomplished by altering the table. Say you have table “EMP”, with fields such as Id, Name, Salary, Dept, Designation, and yoj.
hive> ALTER TABLE employee
> ADD PARTITION (year=’2012’)
hive> ALTER TABLE employee PARTITION (year=’1203’)
RENAME TO PARTITION (Yoj=’1203’);
hive> ALTER TABLE employee DROP [IF EXISTS]
> PARTITION (year=’1203’);
Relational operators consist of a certain set of operators, that helps in fetching relevant information.
For example: Say your “EMP” table look like this:
Let’s execute Hive query which will fetch us the employee whose salary is greater than 30000.
hive> SELECT * FROM EMP WHERE Salary>=40000;
These are operators which help in help in executing arithmetic operations on the operands, and in turn, always return number types.
For example: To add two number such as 22 & 33
hive> SELECT 22+33 ADD FROM temp;
These operators are to execute logical operations, which in return always return True/False.
hive> SELECT * FROM EMP WHERE Salary>40000 && Dept=TP;
Advanced Hive Commands
View concept in Hive is similar like in SQL. The view can be created at the time of executing a SELECT statement.
hive> CREATE VIEW EMP_30000 AS
SELECT * FROM EMP
Loading Data into Table
Hive> Load data local inpath ‘/home/hduser/Desktop/AllStates.csv’ into table States;
Here “States” is the already created table in Hive.
Hive has some built-in functions which help you in fetching your result in a better way.
Like round, floor, BIGINT etc.
Join clause can help in joining two tables based on the same column name.
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
All kinds of joins are supported by Hive: Left outer join, right outer join, full outer join.
Tips and Tricks to Use hive commands
Hive makes data processing that easy, straightforward and extensible, that user pay less attention towards optimizing the Hive queries. But paying attention towards few things while writing Hive query, will surely bring great success in managing the workload and saving money. Below are few tips regarding that:
- Partitions & Buckets: Hive is a big data tool, which can query on large datasets. However, writing the query without understanding the domain can bring great partitions in Hive.
If the user is aware of the dataset, then relevant and highly used columns could be grouped into the same partition. This will help in running the query faster and inefficient way.
Ultimately the no. of mapper and I/O operations will also be reduced.
Fig 3. Partitioning
Sources images:Google image
Fig 4 Bucketing
Sources images:- Google image
- Parallel Execution: Hive runs the query in multiple stages. In some cases these stages may depend on other stages, a hence can’t get started, once the previous stage is completed. However, independent tasks can run parallelly to save overall run time. To enable the parallel run in Hive:
Hence, this will enhance the cluster utilization.
- Block Sampling: Sampling data from a table will allow exploration of queries on data.
Despite bucking, we rather want to sample dataset more randomly. Block sampling comes with various powerful syntax, which helps in sampling the data in a various way.
Sampling can be used for finding approx. info from dataset like the average distance between origin and destination.
Querying 1% of big data will give near to the perfect answer. Exploration gets way easier & effective.
Conclusion – Hive commands
Hive is a higher level abstraction on top of HDFS, which provides flexible query language. It helps in querying and processing data in an easier way.
Hive can be clubbed with other Big data elements, to harness its functionality in a full-fledged way.
This has been a guide to Hive Commands. Here we have discussed basic as well as advanced Hive Commands and some immediate Hive commands. You may also look at the following article to learn more –