Introduction to Hive Commands
Hive is a data warehouse infrastructure tool that sits on top of Hadoop to summarize big data. It processes structured data and makes querying and analyzing that data easier. Hive follows a “schema on read” approach: it doesn’t verify data when it is loaded; verification happens only when a query is issued. This makes initial loading fast, since it amounts to copying or moving a file without applying any constraints or checks. Hive was first developed by Facebook; the Apache Software Foundation took it up later and developed it further.
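As a rough illustration of schema on read, the sketch below declares a table over files that already sit in HDFS. The table name page_views, its columns and the /data/page_views path are hypothetical, chosen only for this example.

```sql
-- Hypothetical table and path, used only to illustrate schema on read.
-- Creating the table records the schema in the metastore; the files under
-- LOCATION are not parsed or validated at this point.
CREATE EXTERNAL TABLE IF NOT EXISTS page_views (
  user_id   INT,
  url       STRING,
  view_time STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/page_views';

-- The schema is applied only now, when the data is read. A field that
-- cannot be parsed as the declared type typically comes back as NULL.
SELECT user_id, url FROM page_views LIMIT 10;
```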
Components of Hive Commands
The components of Hive are shown below:
Image source: https://www.developer.com/
Features of Hive Commands
A few of Hive’s features are listed below:
- Hive stores raw and processed datasets in Hadoop.
- It is designed for Online Analytical Processing (OLAP), that is, analytical queries over high volumes of data with no reliance on a single server, rather than for Online Transaction Processing (OLTP).
- It is fast, scalable and reliable.
- The SQL-like query language it provides is called HiveQL or HQL. This makes ETL tasks and other analysis easier.
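Hive also has a few limitations, which are listed below:
- Hive has limited subquery support: subqueries in the FROM clause are supported, while subqueries in the WHERE clause (IN, NOT IN, EXISTS) are available only from Hive 0.13 onward and come with restrictions.
- Hive supports overwriting data, but row-level updates and deletes are not available on ordinary tables; they work only on transactional (ACID) tables, from Hive 0.14 onward.
- Hive is not designed for OLTP; it is used for Online Analytical Processing (OLAP).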
To enter Hive’s interactive shell, run the hive command, for example $HIVE_HOME/bin/hive.
Basic Hive Commands
The basic commands are explained below.
1. Create: The CREATE command creates a new database in Hive.
2. Drop: The DROP command removes a table from Hive.
3. Alter: The ALTER command renames a table or its columns.
ALTER TABLE employee RENAME TO employee1;
4. Show: The SHOW command lists all the databases in Hive.
5. Describe: The DESCRIBE command gives information about the schema of a table.
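The five basic commands above can be sketched together as follows. The database name company and the columns of the employee table are assumptions made for illustration; they are not defined in the text.

```sql
-- Hypothetical database and table definitions, for illustration only.

-- Create: add a new database.
CREATE DATABASE IF NOT EXISTS company;

-- Show: list all databases.
SHOW DATABASES;

USE company;

CREATE TABLE IF NOT EXISTS employee (
  id     INT,
  name   STRING,
  salary DOUBLE
);

-- Describe: print the column names and types of the table.
DESCRIBE employee;

-- Alter: rename a column (old name, new name, type).
ALTER TABLE employee CHANGE salary monthly_salary DOUBLE;

-- Drop: remove the table.
DROP TABLE IF EXISTS employee;
```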
Intermediate Hive Commands
Hive divides a table into partitions based on the values of one or more columns. Partitions make the data easier and cheaper to query. Partitions can be divided further into buckets so that queries run more efficiently on the data.
In other words, bucketing distributes rows into a fixed number of buckets by calculating the hash code of the bucketing column declared for the table.
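A minimal sketch of a table that is both partitioned and bucketed is shown below. The table name emp_bucketed, the column types and the choice of 4 buckets are assumptions for illustration.

```sql
-- Hypothetical table: one partition per year of joining (yoj),
-- and within each partition rows are hashed on id into 4 buckets.
CREATE TABLE IF NOT EXISTS emp_bucketed (
  id          INT,
  name        STRING,
  salary      DOUBLE,
  dept        STRING,
  designation STRING
)
PARTITIONED BY (yoj STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC;
```

The partition column (yoj) is declared in PARTITIONED BY and not in the regular column list; Hive stores it as part of the directory path.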
1. Adding Partition
A partition is added by altering the table. Say you have a table “employee” with fields such as Id, Name, Salary, Dept and Designation, partitioned by year of joining (the year column below).
ALTER TABLE employee
ADD PARTITION (year='2012');
2. Renaming Partition
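Renaming a partition changes its value; the partition column itself stays the same. For example, to correct a mistyped value:
ALTER TABLE employee PARTITION (year='1203')
RENAME TO PARTITION (year='2013');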
3. Drop Partition
ALTER TABLE employee DROP [IF EXISTS] PARTITION (year='1203');
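To check the effect of partition changes, the partitions of a table can be listed. A minimal sketch, assuming the employee table is partitioned by a year column as in the statements above:

```sql
-- List every partition of the table.
SHOW PARTITIONS employee;

-- List only the partitions that match a partition specification.
SHOW PARTITIONS employee PARTITION (year='2012');
```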
4. Relational Operators
Relational operators compare two operands and help in fetching the relevant rows.
For example, say your “EMP” table has a Salary column.
The following Hive query fetches the employees whose salary is at least 40000:
SELECT * FROM EMP WHERE Salary>=40000;
5. Arithmetic Operators
These operators perform arithmetic operations on their operands and always return number types.
For example, to add two numbers such as 22 and 33:
SELECT 22+33 ADD FROM temp;
6. Logical Operators
These operators perform logical operations and always return True or False.
SELECT * FROM EMP WHERE Salary>40000 AND Dept='TP';
Advanced Hive Commands
The advanced commands are explained below.
1. View
The concept of a view in Hive is similar to that in SQL. A view is created from a SELECT statement and stores the query definition, not the data.
CREATE VIEW EMP_30000 AS
SELECT * FROM EMP
WHERE Salary>30000;
2. Loading Data into Table
LOAD DATA LOCAL INPATH '/home/hduser/Desktop/AllStates.csv' INTO TABLE States;
Here, “States” is a table that already exists in Hive.
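For completeness, here is a sketch of how such a table might be created before loading, and how OVERWRITE changes the load. The two columns are hypothetical, since the layout of AllStates.csv is not given in the text.

```sql
-- Hypothetical schema: the real columns of AllStates.csv are not known.
CREATE TABLE IF NOT EXISTS States (
  state_code STRING,
  state_name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- OVERWRITE replaces the table's existing contents instead of appending to them.
LOAD DATA LOCAL INPATH '/home/hduser/Desktop/AllStates.csv'
OVERWRITE INTO TABLE States;
```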
3. Built-in Functions
Hive has built-in functions that help shape query results, such as round and floor. Values can also be converted to types such as BIGINT with cast.
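A short sketch of these functions applied to the EMP table follows. It assumes EMP has a numeric Salary column, as in the earlier queries; the output column names are made up for the example.

```sql
-- round(x, d) rounds to d decimal places, floor(x) rounds down to a whole
-- number, and cast(x AS BIGINT) converts the value to a BIGINT.
SELECT round(Salary / 12, 2)  AS monthly_salary,
       floor(Salary / 1000)   AS salary_in_thousands,
       cast(Salary AS BIGINT) AS salary_whole
FROM EMP;
```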
4. Join
A join clause combines rows from two tables based on a related column.
SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
Hive supports inner joins as well as left outer, right outer and full outer joins.
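As an example of an outer join, the query above can be changed to keep customers that have no orders. This is a sketch using the same CUSTOMERS and ORDERS tables:

```sql
-- Keep every customer, including those with no matching order;
-- o.AMOUNT is NULL for customers without orders.
SELECT c.ID, c.NAME, o.AMOUNT
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
  ON (c.ID = o.CUSTOMER_ID);
```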
Tips and Tricks
Hive makes data processing so easy, straightforward and extensible that users tend to pay little attention to optimizing their Hive queries. Paying attention to a few things while writing a Hive query, however, helps in managing the workload and saving money. Below are a few tips:
1. Partitions & Buckets
Hive is a big data tool that can query large datasets. However, writing queries without understanding the domain can lead to poorly chosen partitions in Hive.
If the user is aware of the dataset, the relevant and frequently used columns can be grouped into the same partition. This helps queries run faster and in a more efficient way.
Ultimately, the number of mappers and I/O operations is also reduced.
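The benefit shows up when a query filters on the partition column. A minimal sketch, assuming the employee table is partitioned by year as in the earlier partition examples:

```sql
-- The filter on the partition column lets Hive read only the year=2012
-- partition directory instead of scanning the whole table.
SELECT Name, Salary
FROM employee
WHERE year = '2012';

-- EXPLAIN prints the query plan without running the query.
EXPLAIN
SELECT Name, Salary FROM employee WHERE year = '2012';
```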
2. Parallel Execution
Hive runs a query in multiple stages. In some cases a stage depends on other stages and hence can’t start until the previous stage has completed. However, independent stages can run in parallel to reduce the overall run time. To enable parallel execution in Hive, set the hive.exec.parallel property to true.
This improves cluster utilization.
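In a Hive session, parallel execution can be switched on with SET. The second setting is optional and is shown with its default value:

```sql
-- Allow independent stages of a query to run at the same time.
SET hive.exec.parallel=true;

-- Optional: upper limit on how many stages may run in parallel (the default is 8).
SET hive.exec.parallel.thread.number=8;
```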
3. Block Sampling
Sampling data from a table allows queries to be explored on a subset of the data.
Rather than relying on bucketing, we often want to sample a dataset more randomly. Block sampling offers several syntax variants for sampling data in different ways.
Sampling can be used to find approximate information from a dataset, such as the average distance between origin and destination.
Querying 1% of a big dataset often gives an answer close to the exact one, which makes exploration easier and more effective.
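The block sampling variants can be sketched with TABLESAMPLE as follows. The EMP table and its Salary column are reused from the earlier examples; the sample sizes are arbitrary. Percentage and size-based sampling work at the granularity of input blocks, so the actual sample can be larger than requested.

```sql
-- Sample at least 1 percent of the table's data size.
SELECT * FROM EMP TABLESAMPLE(1 PERCENT) e;

-- Sample about 100 megabytes of input.
SELECT * FROM EMP TABLESAMPLE(100M) e;

-- Take the first 10 rows from each input split.
SELECT * FROM EMP TABLESAMPLE(10 ROWS) e;

-- Approximate an aggregate from a 1 percent sample.
SELECT avg(Salary) FROM EMP TABLESAMPLE(1 PERCENT) e;
```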
Hive is a higher-level abstraction on top of HDFS that provides a flexible query language. It makes querying and processing data easier.
It can be combined with other big data tools to make full use of its functionality.