Introduction to Impala Shell
Impala is a massively parallel processing SQL like engine based on top of the Hadoop ecosystem. Impala Shell is used to processing a huge volume of data stored in the Hadoop cluster. Impala is developed by Cloudera distribution to overcome the slow processing of hive queries. Impala works better in comparison to a hive when a dataset is not huge. It can be used to share the database of the hive as it can connect hive metastore easily. Also, it can be integrated with HBASE or Amazon S3.
Impala is not map-reduce based, as it stores intermediate results in-memory, unlike map-reduce. It supports various HDFS supported file formats(like parquet, Avro, delimited text, etc) and compression codecs like snappy, gzip, etc. Impala-Shell is a command-line tool used to create a database, tables and also to run interactive queries to fetch data, run Adhoc queries or execute the script files. Impala shell can be started using an impala-shell command on the terminal. Impala queries can be executed using shell, Perl or python scripts.
Basic Commands in Impala Shell
Below are the basic commands to execute a command on impala-shell:
1. Connect: This command is used to connect to running impala instance. The default port connected is 21000.
2. Show databases: This command will list existing databases in impala.
3. Create a database: This command is used to create a database.
create database <database_name>
create database if not exists <database name>;
4. Create table: This command is used to create tables.
create table <table_name>
create table if not exists <database_name>.<table_name>(field_name1 data_type, field_name2 data_type);
5. Describe: This command is used to describe the table and see the schema of the table like datatypes, number of columns and constraints, etc.
6. Drop: This command is used to drop tables, database, functions.
drop database <database_name>
7. Insert: This command is used to insert records in either overwrite or append record in tables.
Insert into <database_name>.<table_name> values(field_value1, field_value2);
8. Select: This command is used to fetch data from the table or view.
Select column1, column2 from <database_name>.<table_name>;
9. Use: This command is used to select the database. Statements will be executed over the tables inside the database.
10. Help: This command is used to list all the commands.
11. History: Impala stores the history of all the commands used in impala-shell. This command is used to display all the commands used in an impala-shell.
12. Alter: Alter table statement is used to alter the properties of the table. Also, an alter table can be used to change the structure of the table. Change in the structure like adding or dropping columns, keys, constraints can be achieved using alter command. Also, it can be used to change the partitions of the table. Also, you can rename the table and can change the properties of a table like an owner, etc. This command is mainly used to alter partition or alter the location where the table is pointing.
Alter table <database_name>.<table_name> drop partition(partition_name);
13. Invalidate Metadata: This command is used to refresh metadata for the tables. As impala shares metastore with hive so if there are any changes made in metadata of the tables outside of impala invalidate metadata is needed to run. As it will invalidate existing metadata of the tables. It will reload metadata for tables before querying the table. This command is very expensive when compared to refresh. Invalidate metadata needs to be executed when there is any change in jars, or when some table will not be queried or when new tables are added.
Invalidate metadata hive_db_name.table_name;
14. Refresh: This command is used to reload metadata about the table from metastore whenever there is a change in metadata outside of impala. Refresh will remove the inconsistency between hive metastore and impala. After refresh metadata will be broadcasted to all impala coordinators. The command needs to be executed in scenarios like whenever there is any addition or deletion of file from the HDFS directory where the Impala table is pointing. Or if there is any change in partitioning made outside of impala like addition or deletion of partition by the hive.
Refresh <db_name>.<table_name> partition(key1=value1,key2=val2);
15. Explain: this command is used to derive the execution plan of a statement, like how impala will read data from hdfs, will divide work among hdfs nodes, etc. This command can be used to select or create a table as or insert statements to understand the execution plan of a query. With the help of execution plan query can be changed to perform better like join clauses can be adjusted or where conditions can be adjusted accordingly. The execution plan should be read from bottom to top.
Explain <query can be either select or insert or CTAS>
16. Compute stats: This command is used to get information about data in a table and will be stored in the metastore database, later will be used by impala to run queries in an optimized way. This query gets information about data distribution or partitioning etc.
Compute stats <database_name>.<table_name>(column1, column2 etc)
This is a guide to Impala Shell. Here we discuss the introduction, 16 different basic commands in Impala Shell along with Syntax. You can also go through our other related articles to learn more–