Introduction to Bucketing in Hive
Bucketing is a technique offered by Apache Hive to decompose data into more manageable parts, known as buckets, and it can significantly improve query performance. Partitioning can be followed by bucketing: partitions can be further divided into buckets. Bucketing comes into play when partitioning a Hive data set into segments is not effective on its own, and it helps avoid over-partitioning. The CLUSTERED BY clause is used to bucket a table in Hive, while the SORTED BY clause ensures local ordering by keeping the rows within each bucket sorted on one or more columns. Bucketing should be favoured when:
- When the number of partitions is limited, or
- When all partitions are of almost equal size
Features of Bucketing in Hive
- To read and store data in buckets, a hashing function is applied to the bucketed column value (the simplest hashing scheme is a modulus). For example, if we decide on a total of 10 buckets, a row is stored in bucket hash(column value) % 10, so bucket numbers range from 0 to 9 (0 to n-1).
- Bucketing is preferred for high-cardinality columns, as files are physically split into buckets. Each bucket in Hive is created as a file. Internally, bucket files are numbered from 0 (matching the 0 to n-1 hash range), while the TABLESAMPLE clause refers to buckets 1 through n.
- If bucketing is done on partitioned tables, query optimization happens at two layers, known as partition pruning and bucket pruning.
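The bucket assignment described above can be sketched directly in HiveQL using the built-in hash() and pmod() functions (the 10-bucket count here is purely illustrative):

```sql
-- Which of 10 buckets would each row land in?
-- pmod() keeps the result non-negative, matching the 0 to n-1 range.
SELECT roll_number,
       pmod(hash(roll_number), 10) AS bucket_id
FROM student;
```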
Example of Bucketing in Hive
Taking an example, let us create a partitioned and bucketed table named “student”:
CREATE TABLE student (
  roll_number INT,   -- bucketing column
  name STRING        -- illustrative data column; the original listing omitted the column list
)
PARTITIONED BY (class INT)
CLUSTERED BY (roll_number) INTO 15 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Using ALTER TABLE student CLUSTERED BY (roll_number) INTO 32 BUCKETS, data can be re-bucketed only for new insertions. There is no way to reorganize or rewrite the existing data: the ALTER TABLE command modifies only the metadata and has no effect on files that already exist.
The optimizer applies these pruning strategies when the WHERE clause is evaluated. In our case there are 10 directories (one per class) each holding 15 files (roll_number buckets), so 150 files back the “student” table. If the WHERE clause contains only class = 10, partition pruning kicks in and only the 15 files of that partition are scanned. If it contains only roll_number = 101, bucket pruning kicks in and only one bucket file in each of the 10 directories, i.e. 10 files, is scanned. If both class and roll_number appear in the WHERE clause, only a single file is scanned. This is how query-layer optimization works with these features in HiveQL.
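The three pruning cases can be written out as queries (file counts assume the 10-partition, 15-bucket layout above):

```sql
-- Partition pruning only: one class directory, all 15 bucket files scanned
SELECT * FROM student WHERE class = 10;

-- Bucket pruning only: one bucket file in each of the 10 class directories
SELECT * FROM student WHERE roll_number = 101;

-- Both prunings: a single bucket file in a single directory
SELECT * FROM student WHERE class = 10 AND roll_number = 101;
```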
To insert data into a bucketed table, we have to set the below property in Hive:

SET hive.enforce.bucketing = true;
This property enforces bucketing while data is being loaded, in the same way that dynamic partitioning is enabled with:

SET hive.exec.dynamic.partition = true;
With bucketing enforced, the number of reduce tasks is set equal to the number of buckets mentioned in the table definition.
SET hive.optimize.bucketmapjoin = true;
This enables the bucket map join operation, reducing scan cycles when executing joins on bucketed tables. Alternatively, bucketing can be enforced by setting the reducer count explicitly:
SET mapred.reduce.tasks = 64; -- set to the number of buckets
After setting these properties, the data is populated in the table, as in the below query (student_staging is a hypothetical unbucketed source table):

INSERT OVERWRITE TABLE student PARTITION (class = 10)
SELECT name, roll_number
FROM student_staging
WHERE class = 10;
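With dynamic partitioning enabled as above, all classes can be loaded in one statement rather than one partition at a time (again assuming a hypothetical student_staging source table):

```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.enforce.bucketing = true;

-- The dynamic partition column (class) must come last in the SELECT list.
INSERT OVERWRITE TABLE student PARTITION (class)
SELECT name, roll_number, class
FROM student_staging;
```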
How to Decide the Number of Buckets?
To decide the number of buckets, we need to know the data characteristics and the queries we want to run. Buckets can be created in Hive with or without partitions. Depending on what we are trying to achieve, the following three considerations can help:
- Sampling performance: Bucketing improves performance for some sampling operations.
- Join performance: Joins become more efficient if the two tables being joined are bucketed on the join keys. However, with the advent of cost-based optimization in recent versions of Hive, the optimizer can choose between a shuffle join and a map-side join, whichever is better; in that case, forcing a bucket join may prevent it from picking the better plan.
- Select performance / predicate pushdown: Bucketing in Hive helps with predicate pushdown. If we filter on one value of a bucketing column, we only need to look in one bucket.
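As a sketch of the sampling point, the TABLESAMPLE clause can read a single bucket instead of scanning the whole table (bucket counts follow the 15-bucket “student” table above; note that TABLESAMPLE numbers buckets from 1):

```sql
-- Scan only the first of the 15 buckets
SELECT * FROM student TABLESAMPLE (BUCKET 1 OUT OF 15 ON roll_number);
```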
How does Data Distribution Happen in Buckets?
Records are distributed into buckets based on the hash value produced by a defined hashing function, and the hash depends on the data type of the bucketed column.
For the int data type, the hash value is simply an integer. For strings and other complex data types, the hash is computed from the characters of the string (a Java-style hash) or from type-specific logic. Using a hashing function for data distribution helps spread the data evenly across buckets.
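Assuming the 15-bucket “student” table from earlier, the resulting distribution can be inspected with a query like this:

```sql
-- How many rows fall into each of the 15 buckets?
SELECT pmod(hash(roll_number), 15) AS bucket_id,
       COUNT(*)                    AS rows_in_bucket
FROM student
GROUP BY pmod(hash(roll_number), 15)
ORDER BY bucket_id;
```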
Advantages of Bucketing
- Bucketed tables allow faster execution of map-side joins, as the data is stored in roughly equal-sized buckets/parts.
- Sampling is more efficient on bucketed tables than on non-bucketed ones.
- As with partitioning, bucketing also offers faster query performance.
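As a sketch of the first advantage, a join between two tables bucketed on the same key can use a bucket map join (student_marks is a hypothetical second table, assumed CLUSTERED BY (roll_number) with the same, or a multiple of the same, bucket count):

```sql
SET hive.optimize.bucketmapjoin = true;

-- Matching buckets from the two tables are joined directly on the map
-- side, avoiding a full shuffle of either table.
SELECT s.roll_number, s.name, m.score
FROM student s
JOIN student_marks m
  ON s.roll_number = m.roll_number;
```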
Limitations of Bucketing
- Loading data into buckets correctly is the programmer’s responsibility, as the CREATE TABLE command only takes care of the metadata.
- Bucketing adds no incremental value if the data is small enough for partitioning alone to suffice.
Conclusion – Bucketing in Hive
In this article, we have highlighted the features of bucketing for large datasets, how to decide on the number of buckets, and the performance enhancements to expect from it. Both partitioning and bucketing are essential features of Hive that make testing and debugging of large data sets more efficient: a table is broken into partitions, and partitions are further segmented into buckets. The CLUSTERED BY and SORTED BY clauses make bucketing easy to implement.
This is a guide to Bucketing in Hive. Here we discuss the features, examples, advantages, and limitations of Bucketing in Hive.