What is Hive?
Apache Hive is a data warehouse system built on top of the open source Hadoop platform. It is used for data summarization, querying large data sets, and data analysis.
Hive was developed by Facebook and later taken over by the Apache Software Foundation, which developed it further as an open source project under the name Apache Hive.
Hive is not a relational database, so it is not suited to online transaction processing (OLTP) or to real-time queries with row-level updates. It is designed for online analytical processing (OLAP) and is best suited to batch jobs. Hive provides a SQL-like query language called HiveQL and converts HiveQL queries into MapReduce jobs, which makes it easy to process large amounts of data stored in Hadoop. It is scalable, fast for large batch workloads, and extensible.
Hive is one of the Hadoop components normally used by data analysts. Apache Pig is used for the same kind of task, but more often by researchers and programmers. Hive is mainly used for creating reports, is generally deployed on the server side, and supports structured data. It also integrates with JDBC and BI tools.
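As a rough illustration of the SQL-like style, the query below is the kind of batch report HiveQL is typically used for. The table page_views and its columns are hypothetical and only serve the example.

```sql
-- Hypothetical table (an assumption for this sketch):
--   page_views(user_id BIGINT, country STRING, view_date STRING)
-- Count page views per country for one day. Hive compiles this
-- statement into one or more MapReduce jobs, so expect batch-style
-- latency rather than an interactive response.
SELECT country,
       COUNT(*) AS views
FROM page_views
WHERE view_date = '2020-01-01'
GROUP BY country
ORDER BY views DESC
LIMIT 10;
```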
Below are the major components of Hive:
Metastore:
The repository that stores the metadata is called the Hive metastore. The metadata describes the tables: their location, their schema, and information about their partitions, which helps to track how data is distributed across the cluster. The metastore also keeps track of the data and replicates it, which provides a backup in case of emergencies such as data loss. The metadata is kept in a relational database, not in the Hadoop file system.
Driver:
When a HiveQL statement is executed, the driver receives the statement and controls it for the full execution cycle. Along with executing the statement, the driver stores the metadata generated by the execution. It also creates sessions to monitor the progress and life cycle of different executions. After the MapReduce job completes its reduce operation, the driver collects the data and results of the query.
Compiler:
The compiler translates HiveQL into input for MapReduce. It produces the steps and tasks that MapReduce needs to carry out to generate the output of the HiveQL statement.
Optimizer:
The main task of the optimizer is to improve efficiency and scalability, for example by creating a task that transforms the data before the reduce operation. It also performs transformations such as aggregation and pipeline conversion, for example converting multiple joins into a single join.
Executor:
After the compilation and optimization steps, the executor runs the tasks. It interacts with the Hadoop job tracker to schedule the tasks that are ready to run.
UI, Thrift server and CLI:
The Thrift server is used by other clients to interact with the Hive engine. The user interface and the command-line interface let external users submit queries, monitor processes, and send instructions to Hive.
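The metadata that the metastore holds can be inspected from any of these interfaces. The statements below are a small sketch; the table name page_views is hypothetical and is assumed to be an existing partitioned table.

```sql
-- Assumes a hypothetical partitioned table named page_views exists.
-- List the tables registered in the current database.
SHOW TABLES;

-- Show the schema, storage location, and other metadata of one table.
DESCRIBE FORMATTED page_views;

-- List the partitions recorded for the table.
SHOW PARTITIONS page_views;
```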
Below are the steps showing how Hive interacts with the Hadoop framework:
Executing the query:
The query is sent to the driver from a Hive interface such as the command line or the web UI. The driver may be any database driver, such as JDBC or ODBC.
Getting the plan:
The driver invokes the query compiler, which parses the query to check its syntax and to work out the query plan, that is, the requirements of the query.
Getting the metadata:
The compiler sends a metadata request to the metastore, which can reside in any database.
Sending the metadata:
In response to the compiler's request, the metastore sends the metadata.
Sending the plan:
After checking the requirements, the compiler sends the plan back to the driver. This step completes the parsing and compiling of the query.
Executing the plan:
The execution plan is sent to the execution engine by the driver.
Executing the job:
Internally, executing the job means running a MapReduce job in the backend. It follows the normal convention of the Hadoop framework: the execution engine sends the job to the job tracker, which resides on the name node, and the job tracker assigns the job to the task trackers, which reside on the data nodes. The MapReduce job is executed there.
While executing the job, the execution engine can also perform metadata operations with the metastore.
Fetching the result:
After processing completes, the data nodes pass the result to the execution engine.
Sending the result:
The driver receives the result from the execution engine.
Finally, the Hive interfaces receive the result from the driver.
Together, these steps make up a complete query execution in Hive.
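The plan produced in the "Getting the plan" and "Sending the plan" steps can be viewed with the EXPLAIN statement. The sketch below assumes a hypothetical table page_views with a country column.

```sql
-- Hypothetical table (an assumption for this sketch):
--   page_views(user_id BIGINT, country STRING)
-- EXPLAIN asks Hive to compile the query and print the resulting
-- plan (its stages and their dependencies) without running the job.
EXPLAIN
SELECT country,
       COUNT(*) AS views
FROM page_views
GROUP BY country;
```

Running the same SELECT without EXPLAIN sends the plan on to the execution engine as described above.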
How does Hive make working so easy?
Hive is a data warehousing framework built on top of Hadoop that helps users perform data analysis, querying, and summarization on large volumes of data. HiveQL is a distinctive feature: it looks like SQL and is used to run extensive analysis on the stored data. Hive is capable of reading data at a very high speed and writing it into the data warehouse, and it can manage large data sets distributed across multiple locations. Hive also provides structure to the stored data, and users can connect to Hive using the command-line tool or the JDBC driver.
Major organizations working with big data use Hive, including Facebook, Amazon, Walmart, and many others.
What can you do with Hive?
Hive offers many functions, such as data query, data summarization, and data analysis. It supports a query language called HiveQL, or Hive Query Language. HiveQL queries are translated into MapReduce jobs, which are processed on the Hadoop cluster. Apart from this, HiveQL also allows custom map and reduce scripts to be plugged into queries. In this way, HiveQL increases schema design flexibility, and it also supports data serialization and deserialization.
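One way to plug a script into a query is the TRANSFORM clause. The following is only a sketch: the script extract_domain.py, its path, and the table page_views are hypothetical names chosen for illustration.

```sql
-- Hypothetical script and table names, for illustration only.
-- ADD FILE makes the script available to the tasks that run the query.
ADD FILE /tmp/extract_domain.py;

-- Each input row is passed to the script as a tab-separated line on
-- standard input; the tab-separated lines the script prints become
-- the output columns named in the AS clause.
SELECT TRANSFORM (user_id, url)
       USING 'python extract_domain.py'
       AS (user_id, domain)
FROM page_views;
```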
Working with Hive:
Below are some of the operational details in Hive. Hive data types are broadly classified into four types as given below:
- Column Types
- Literals
- Null Values
- Complex Types
1. Column types:
These are the column data types of Hive. They are classified as below:
- Integral types: Integer data is represented using an integral data type. The default is INT. Data that exceeds the range of INT has to be assigned the BIGINT data type. In the same way, data that fits within a smaller range than INT can be assigned SMALLINT. There is another data type called TINYINT, which is even smaller than SMALLINT.
- String types: String values are written in single quotes (') or double quotes ("). The string type comes in two variants: VARCHAR and CHAR.
- Timestamp: Hive timestamps support the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff", with up to nine digits of fractional seconds.
- Date: Dates are represented in Hive in the format YYYY-MM-DD, meaning year-month-day.
- Decimals: Decimals in Hive follow the Java BigDecimal format, which represents immutable arbitrary-precision numbers. The syntax is DECIMAL(precision, scale).
- Union types: A union is used in Hive to hold a value from a collection of heterogeneous data types. An instance can be created using create_union.
Below is an example:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
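To show how these column types look in practice, here is a minimal sketch of a table definition. The table orders_demo and all of its columns are hypothetical.

```sql
-- Hypothetical table, used only to illustrate the column types above.
CREATE TABLE IF NOT EXISTS orders_demo (
  order_id     BIGINT,          -- values beyond the INT range
  quantity     INT,
  warehouse_id SMALLINT,        -- a smaller range is enough here
  priority     TINYINT,         -- smallest integral type
  customer     VARCHAR(100),    -- variable length, up to 100 characters
  country_code CHAR(2),         -- fixed length
  created_at   TIMESTAMP,       -- e.g. '2020-01-01 12:30:00.123456789'
  ship_date    DATE,            -- e.g. '2020-01-05'
  amount       DECIMAL(10, 2)   -- precision 10, scale 2
);
```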
2. Literals:
A few literals are used in Hive. They are as below:
- Floating point type: Floating point literals are numbers with a decimal point. They are handled much like the DOUBLE data type.
- Decimal type: Decimal literals are floating point values with a higher range than the DOUBLE data type. The range of the decimal type is approximately -10^-308 to 10^308.
3. Null value:
The special value NULL represents missing values in Hive.
4. Complex types:
Below are the different complex types found in Hive:
- Arrays: Arrays in Hive are used in the same way as in Java. The syntax is ARRAY<datatype>.
- Maps: Maps in Hive are used in the same way as in Java. The syntax is MAP<primitivetype, datatype>.
- Structs: Structs in Hive group named fields, each with an optional comment. The syntax is
STRUCT<columnname : datatype [COMMENT columncomment], ...>.
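The complex types above can be combined in one table and read back with index, key, and dot notation. This is a small sketch; the table employees_demo and its columns are hypothetical.

```sql
-- Hypothetical table, used only to illustrate the complex types.
CREATE TABLE IF NOT EXISTS employees_demo (
  name    STRING,
  skills  ARRAY<STRING>,                       -- ordered list of values
  phones  MAP<STRING, STRING>,                 -- key-value pairs
  address STRUCT<street:STRING, city:STRING>   -- named fields
);

-- Array elements are read by position (starting at 0), map values
-- by key, and struct fields with dot notation.
SELECT name,
       skills[0]      AS first_skill,
       phones['home'] AS home_phone,
       address.city   AS city
FROM employees_demo;
```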
Besides all these, we can create databases and tables, partition them, and perform many other operations.
- Databases: They are namespaces containing a collection of tables. Below is the syntax to create a database in Hive.
CREATE DATABASE [IF NOT EXISTS] sampled;
The databases can also be dropped if not needed anymore. Below is the syntax to drop a database.
DROP DATABASE [IF EXISTS] sampled;
- Tables: Tables are created in Hive to store data. Below is the syntax for creating a table.
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name
[(col_name data_type [COMMENT col_comment], ...)] [COMMENT table_comment]
[ROW FORMAT row_format] [STORED AS file_format]
A table can also be dropped if not needed anymore. Below is the syntax to drop a table.
DROP TABLE [IF EXISTS] table_name;
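Filled in with concrete values, the syntax templates above might look like the following sketch. The database demo_db, the table web_logs, and its columns are hypothetical; the PARTITIONED BY clause is not part of the template shown above and is included to illustrate partitioning.

```sql
-- Hypothetical database and table names, for illustration only.
CREATE DATABASE IF NOT EXISTS demo_db;
USE demo_db;

-- A comma-delimited text table, partitioned by date so that queries
-- filtering on log_date only read the matching partitions.
CREATE TABLE IF NOT EXISTS web_logs (
  user_id BIGINT COMMENT 'visitor identifier',
  url     STRING COMMENT 'requested address'
)
COMMENT 'raw web logs, one row per request'
PARTITIONED BY (log_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Clean up when the objects are no longer needed.
DROP TABLE IF EXISTS web_logs;
DROP DATABASE IF EXISTS demo_db;
```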
The main advantage of Apache Hive lies in data querying, summarization, and analysis. Hive is designed for developer productivity, which comes at the cost of higher latency and lower efficiency. Apache Hive provides a wide range of user-defined functions, which can be interlinked with other Hadoop packages such as RHIPE and Apache Mahout. This helps developers to a great extent when working with complex analytical processing and multiple data formats. Hive is mainly used for data warehousing, meaning a system used for reporting and data analysis.
Data analysis involves cleansing, transforming, and modeling data to provide useful information about various business aspects, which in turn benefits the organization. It has many aspects and approaches, encompassing diverse techniques under a variety of names in different business models, social science domains, and so on. Hive is very user-friendly and allows users to access the data simultaneously, which improves the response time. Compared with other kinds of queries on huge data sets, Hive's response time is much faster. It is also flexible in terms of performance: as more data is added, it scales by increasing the number of nodes in the cluster.
Why should we use Hive?
Along with data analysis, Hive provides a wide range of options for storing data in HDFS. Hive supports different file formats: flat files or text files, sequence files consisting of binary key-value pairs, and RC files, which store the columns of a table in a columnar layout. Nowadays the file format best suited to Hive is ORC, or Optimized Row Columnar.
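The file format is chosen per table with the STORED AS clause. As a minimal sketch, the statements below copy data from a text table into an ORC table; both table names and their columns are hypothetical, and web_logs_text is assumed to exist already.

```sql
-- Hypothetical tables: web_logs_text is assumed to hold raw text data
-- with the columns user_id and url.
CREATE TABLE IF NOT EXISTS web_logs_orc (
  user_id BIGINT,
  url     STRING
)
STORED AS ORC;

-- Rewrite the text data into the ORC table. Hive runs this as a
-- batch job and replaces any existing contents of web_logs_orc.
INSERT OVERWRITE TABLE web_logs_orc
SELECT user_id, url
FROM web_logs_text;
```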
Why do we need Hive?
In today's world, Hadoop is among the most widespread technologies used for big data processing. It comes with a very rich collection of tools and technologies for data analysis and other big data processing, and Hive is one of them.
Who is the right audience for learning Hive technologies?
Mainly, people with a background as developers, Hadoop analysts, system administrators, data warehousing professionals, SQL professionals, or Hadoop administrators can master Hive.
How will this technology help you in career growth?
Hive is one of the hot skills in the market nowadays, and it is one of the best tools for data analysis in the big data Hadoop world. Big enterprises that analyze large data sets are always looking for people with the right skills to manage and query huge volumes of data. Among big data technologies, Hive is one of the best tools currently available to help organizations around the world with their data analysis.
Apart from the functions described above, Hive has many more advanced capabilities. Its power to process a large number of data sets with great accuracy makes Hive one of the best tools for analytics on a big data platform. It also has great potential to emerge as one of the leading big data analytics tools in the coming days, thanks to periodic improvements and its ease of use for the end user.
This has been a guide to What is Hive. Here we discussed the working, skills, career growth, advantages of Hive and top companies that implement this technology.