Introduction to Indexes in Hive
Indexes are a pointer or reference to a record in a table as in relational databases. Indexing is a relatively new feature in Hive. In Hive, the index table is different than the main table. Indexes facilitate in making query execution or search operation faster. However, storing indexes require disk space and creating an index involves cost. So, the use of indexes may not always be of any benefit. “EXPLAIN” query must be checked to evaluate the benefit through a query execution plan. Indexing in hive makes large dataset analysis relatively quicker by better query performance on operations.
Why do we need Indexes?
With the petabytes of data that needs to be analyzed, querying Hive tables with millions of records and hundreds of columns becomes time-consuming. Indexing a table helps in performing any operation faster. First, the index of the column is checked and then the operation is performed on that column only. Without an index, queries involving filtering with the “WHERE” clause would load an entire table and then process all the rows. Indexing in Hive is present only for ORC file format, as it has a built-in index with it.
There are two types of indexing in Hive:
- Bitmap Indexing: This is used with columns having a few distinct values. It is known to store both the indexed column’s value and the list of rows as a bitmap. From Hive V0.8.0 onwards, the bitmap index handler is built-in in Hive.
- Compact Indexing: This type of indexing is known to store the column value and storage blockid.
There is some configuration property that enables better execution of indexing.
- hive.optimize.index.filter: The default value for this property is false. Setting this property to TRUE will enable automatic use of indexes.
- hive.index.compact.query.max.entries: The default value for this property is 10000000. This property is used to set the maximum number of index entries that use compact indexing during query execution.
- hive.optimize.index.filter.compact.minsize: The default value for this property is 5368709120.
This property is used to set the minimum bytes of input, on which compact indexing can be automatically triggered.
Different Operations to Perform on Hive
Different Operations to Perform on HIVE indexes are:
1. Create an Index
The general syntax for creating an index for a column of a table
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
[WITH DEFERRED REBUILD] [IDXPROPERTIES (property_name=property_value, ...)] [IN TABLE index_table_name] [ [ ROW FORMAT ...] STORED AS ...
| STORED BY ... ] [LOCATION hdfs_path] [TBLPROPERTIES (...)] [COMMENT "index comment"]
- index_name will be the name of the table’s index name.
- Base_table_name and the columns in bracket is the table for which index is to be created.
- Index_type will specify the type of indexing to use. If we want to use the built-in compact index handler, below clause will replace index_type
If we want to create a bitmap index, then index_type will be “bitmap”. It specifies the java class for implementing indexing.
It is important to note that we can create any number and types of indexes for one table. But if the same column has multiple indexes created for it, then the index created first will be considered.
- WITH DEFERRED REBUILD: This statement must be present while creating an index because while altering the index later, we will be using this clause.
Let’s consider an example of table mark_sheet with 3 columns: roll_id, class, days_attended.
We are creating an index on the column days_attended using compact indexing with below query:
create index index_attendance on table mark_sheet(days_attended)
WITH DEFERRED REBUILD
Output in HUE editor is:
2. Show an Index
The command used to see the details of an index created on a base table is using below, which shows the details like index table name, base table and column name on which index is created, and also the type of index handler used ( whether compact or Bitmap)
Show formatted index on base_table_name
3. Alter an Index
The syntax for altering an index is as seen below.
ALTER INDEX index_name ON table_name [PARTITION partition_spec] REBUILD
This command ALTER INDEX….REBUILD is used to rebuild an index, which was already built on a table. Partition details should also be provided if the base table has partitions. Indexes need to be rebuilt if the underlying table is overwritten or appended.
4. Drop an Index
The below command is used to drop the index and also the index table storing indexes.
DROP INDEX IF EXISTS index_name ON base_table_name
The index and the indexed table is deleted automatically if the table on which index was built is dropped. Similarly, if a partitioned table is indexed then on dropping the partitions, the indexes are also automatically deleted.
Advantages of Indexes in Hive
- Indexing in Hive provides a good replacement for partitioning, when the number of partitions or logical sub-segments will be too many and small in size, to be of any worth.
- Indexes reduce the query execution time.
- Indexing in Hive helps in case of traversing large data sets and also while building a data model.
- Indexing can be done for external tables or views, except when data is present in S3.
In this tutorial, we got to learn about when and how to use indexes in the Hive. We also came across the advantage it provides over conventional query building in Hive. However, it is important to note that indexes should be preferred if a frequent operation is performed on any column. Indexes can also degrade the query performance if used recklessly, because of the disk space involved in creating an index table. In RDMS, the primary key acts as an index to a table, but Hive indexes have their inherent advantages.
This is a guide to Indexes in Hive. Here we discuss the basic concept, why do we need Indexes, different operations perform on Hive and advantages. You can also go through our other related articles to learn more–