How To Install Hive?
Apache Hadoop is a framework that allows distributed processing of big data across a cluster. Apache Hive is a data warehouse software project built on top of Apache Hadoop for data query and analysis. It provides a SQL-like interface, called HQL (Hive Query Language), for querying and processing large amounts of data. Hive runs on top of the Hadoop ecosystem, and its data is stored as files in the Hadoop Distributed File System (HDFS). Hive lets users access and operate on that data in the form of tables, and it offers optimization techniques to improve query performance. Making queries fast on big data is challenging, and it matters in a production environment.
Behind the scenes, the Hive compiler converts each HQL query into MapReduce jobs, which are then submitted to the Hadoop framework for execution.
Difference between Hive and SQL
Apache Hive is very similar to SQL, but because Hive runs on top of the Hadoop ecosystem and internally converts queries into MR (MapReduce) jobs, there are some differences between Hive and SQL.
Hive is not the best approach for applications that need very fast responses. It is better suited to batch processing over very large sets of immutable data, and it is not a regular RDBMS. Last but not least, Hive is schema on read: when data is loaded into a Hive table, Hive does not check for data type mismatches, but when the data is read, any value that does not match its column's data type is shown as NULL.
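The schema-on-read behaviour described above can be tried out once Hive is installed. The sketch below is only an illustration: the file names people_demo.csv and schema_on_read.hql and the table people_demo are hypothetical. It loads two rows, one of which has a non-numeric id, and then reads them back; the id of the second row should come back as NULL.

```shell
# Two rows; the id in the second row ("abc") is not a valid INT.
printf '1,alice\nabc,bob\n' > people_demo.csv

# HQL script: create a comma-delimited table, load the file, read it back.
cat > schema_on_read.hql <<'EOF'
CREATE TABLE IF NOT EXISTS people_demo (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH 'people_demo.csv' INTO TABLE people_demo;
SELECT * FROM people_demo;
EOF

# Run the script only if the hive launcher is on the PATH.
if command -v hive >/dev/null 2>&1; then
  hive -f schema_on_read.hql
else
  echo "hive command not found; run this after the installation steps" >&2
fi
```

The load itself succeeds even though "abc" is not an integer; the mismatch only becomes visible when the SELECT runs.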
Prior Requirement to Install Hive
As I said earlier, Apache Hive runs on top of the Hadoop ecosystem, so Hadoop should be up and running with all of its daemons.
Some of the basic Hadoop daemons are as follows:
- Name node
- Data node
- Resource manager
- Node manager
To check the Hadoop version, use the command below:
Command: hadoop version (prints the installed version of Hadoop.)
To check the Hadoop cluster report, use the command below:
Command: hdfs dfsadmin -report (prints the whole cluster report if your cluster is running.)
If Hadoop is not installed on your machine, please follow the Apache instructions to install it on your system first.
Java must be installed on your system as well. To check the Java version, run java -version.
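The prerequisite checks above can be collected into one small script. This is a minimal sketch: check_tool is a hypothetical helper name, and jps (a tool shipped with the JDK that lists running Java processes) is used here as one way to see which Hadoop daemons are up.

```shell
# Report whether a command is available on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
    return 1
  fi
}

# Print versions only for the tools that are actually installed.
if check_tool java; then java -version; fi
if check_tool hadoop; then hadoop version; fi

# On a running single-node cluster, the jps output should include
# NameNode, DataNode, ResourceManager and NodeManager.
if check_tool jps; then jps; fi
```

If any line reports "missing", fix that prerequisite before continuing with the Hive installation.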
Steps To Install Hive on Ubuntu
The steps to install Hive on Ubuntu are as follows:
Step 1: Download the Hive tar file. This can be done directly from the terminal.
Step 2: Extract the downloaded Hive tar file using the command below.
Command: tar -xzf apache-hive-2.1.0-bin.tar.gz
Use the ls command to verify that the Hive directory was extracted.
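The download command for Step 1 is not shown in this guide. As a rough sketch, assuming version 2.1.0 (to match the tar file name in Step 2) and the Apache archive site as the download location, Steps 1 and 2 could be combined like this. The function name fetch_and_extract_hive is hypothetical.

```shell
# Assumed values: release 2.1.0 and the Apache archive as the source.
HIVE_VERSION="2.1.0"
HIVE_TARBALL="apache-hive-${HIVE_VERSION}-bin.tar.gz"
HIVE_URL="https://archive.apache.org/dist/hive/hive-${HIVE_VERSION}/${HIVE_TARBALL}"

# Download the tarball (Step 1), unpack it (Step 2), and list the result.
fetch_and_extract_hive() {
  wget "$HIVE_URL" &&
  tar -xzf "$HIVE_TARBALL" &&
  ls -d "apache-hive-${HIVE_VERSION}-bin"
}
```

Call fetch_and_extract_hive from the directory where Hive should live, for example the home directory. If you use a different release, change HIVE_VERSION accordingly.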
Step 3: Edit the “.bashrc” file to update the environment variables for the user.
Command: nano ~/.bashrc (any text editor will do; sudo is not needed because the file belongs to your user.)
Add the following at the end of the file:
# Set HIVE_HOME
Step 4: Run the command below to apply the changes in the current terminal.
Command: source .bashrc
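The lines that belong under the "# Set HIVE_HOME" comment are not reproduced above. A minimal sketch of what they might look like, assuming the tar file was extracted in your home directory (adjust the path if you extracted it elsewhere):

```shell
# Set HIVE_HOME
# Assumption: apache-hive-2.1.0-bin sits directly in the home directory.
export HIVE_HOME="$HOME/apache-hive-2.1.0-bin"

# Put the hive and schematool launchers on the PATH.
export PATH="$PATH:$HIVE_HOME/bin"
```

After sourcing .bashrc, echo $HIVE_HOME should print the Hive directory.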
Step 5: Create the Hive directories in HDFS. The 'warehouse' directory is where Hive stores the data of its tables (the table metadata itself is kept in the metastore), and /tmp is used for temporary data.
- hdfs dfs -mkdir -p /user/hive/warehouse
- hdfs dfs -mkdir /tmp
Step 6: Set read and write permissions on the Hive directories.
The commands below give write permission to the user group:
- hdfs dfs -chmod g+w /user/hive/warehouse
- hdfs dfs -chmod g+w /tmp
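Steps 5 and 6 can be run together and verified in one go. The sketch below assumes the hdfs client is on the PATH and the cluster is running; setup_hive_dirs is a hypothetical helper name. It uses -mkdir -p for both directories so that rerunning it does not fail when a directory already exists.

```shell
# Create each HDFS directory Hive needs and give the group write access.
setup_hive_dirs() {
  for dir in /user/hive/warehouse /tmp; do
    hdfs dfs -mkdir -p "$dir" || return 1
    hdfs dfs -chmod g+w "$dir" || return 1
    hdfs dfs -ls -d "$dir"   # show the directory entry with its permissions
  done
}

# Only touch HDFS when the hdfs client is actually available.
if command -v hdfs >/dev/null 2>&1; then
  setup_hive_dirs
else
  echo "hdfs command not found; is Hadoop on the PATH?" >&2
fi
```

In the listing, the group permission column of each directory should now include w.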
Configuring Hive: An important part of installing Hive is configuring it to work with Hadoop. This is done in hive-env.sh, a file that lives in the $HIVE_HOME/conf directory. Go to that directory and create hive-env.sh by copying the template file hive-env.sh.template.
Step 7: Set the Hadoop path in hive-env.sh
Edit the hive-env.sh file and append a line that sets HADOOP_HOME to your Hadoop installation directory.
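Neither the copy command nor the appended line is reproduced in this guide. A minimal sketch of Step 7 follows, where configure_hive_env is a hypothetical helper that takes the Hive conf directory and the Hadoop installation directory as arguments.

```shell
# configure_hive_env CONF_DIR HADOOP_DIR
# Creates hive-env.sh from the template (if needed) and appends HADOOP_HOME.
configure_hive_env() {
  conf_dir=$1
  hadoop_dir=$2

  # Start from the shipped template unless hive-env.sh already exists.
  if [ ! -f "$conf_dir/hive-env.sh" ]; then
    cp "$conf_dir/hive-env.sh.template" "$conf_dir/hive-env.sh" || return 1
  fi

  # Tell Hive where Hadoop is installed.
  echo "export HADOOP_HOME=$hadoop_dir" >> "$conf_dir/hive-env.sh"
}
```

For example, configure_hive_env "$HIVE_HOME/conf" /usr/local/hadoop, where /usr/local/hadoop is only a placeholder for wherever Hadoop is installed on your machine.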
With this, the Hive installation itself is almost complete. What remains is the metastore. It can be configured to use an external database server, but by default Apache Hive uses a Derby database. Initialize the Derby database with the command below, run from the $HIVE_HOME directory.
Command: bin/schematool -initSchema -dbType derby
Step 8: Launch Hive.
Command: hive (type hive in the terminal; the Hive shell opens within a few seconds.)
Working with Hive: Now we will look at some basic operations. To see which tables exist in the default database, run show tables; on a fresh installation it lists nothing, because the default database has no tables yet.
Before creating a table in Hive, select the database it belongs to; otherwise the table is created in the default database.
Important commands in Hive
1: show databases; (lists all databases that have been created so far).
2: create database if not exists mydb; (creates a database named 'mydb' if it does not already exist; if 'mydb' already exists, the command does nothing and gives no error).
3: use <database>; (selects the database that the following DDL commands apply to. Since we have already created 'mydb', the command in our case is use mydb;).
Important Hive DDL commands
CREATE, DROP, TRUNCATE, SHOW, DESCRIBE.
- CREATE creates a database or a table in Hive.
Example: hive> create database Company; (creates the database)
hive> create table employee (id int, name String, salary String); (creates the table employee in the database currently selected with the use command, for example Company after running use Company;)
- DESCRIBE provides information about the schema of a table.
hive> describe employee; (shows the column names and data types of the employee table)
- TRUNCATE deletes all the data in a table but keeps the table itself.
hive> truncate table employee;
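The commands from this section can also be run non-interactively by placing them in a script file and passing it to hive -f. The sketch below assumes hive is on the PATH; the file name hive_demo.hql is hypothetical, and it uses the mydb database from the earlier examples rather than Company.

```shell
# Write the statements from this section to a script file.
cat > hive_demo.hql <<'EOF'
SHOW DATABASES;
CREATE DATABASE IF NOT EXISTS mydb;
USE mydb;
CREATE TABLE IF NOT EXISTS employee (id INT, name STRING, salary STRING);
SHOW TABLES;
DESCRIBE employee;
TRUNCATE TABLE employee;
EOF

# Run the script only if the hive launcher is on the PATH.
if command -v hive >/dev/null 2>&1; then
  hive -f hive_demo.hql
else
  echo "hive command not found; check HIVE_HOME and PATH" >&2
fi
```

Because the script uses IF NOT EXISTS, it can be rerun without errors about the database or table already existing.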
Hive can be installed on Windows as well, but as a best practice I prefer Ubuntu: it gives a better picture of a production environment, and as your data grows it will be easier to manage.
This has been a guide to installing Hive. Here we have discussed the different steps to install Hive, the basic DDL commands, and so on.