EDUCBA

EDUCBA

MENUMENU
  • Free Tutorials
  • Free Courses
  • Certification Courses
  • 360+ Courses All in One Bundle
  • Login
Home Data Science Data Science Tutorials Hive Tutorial Bucketing in Hive
Secondary Sidebar
Hive Tutorial
  • Basics
    • Hive JDBC Driver
    • What is a Hive
    • Hive Architecture
    • Hive Installation
    • How To Install Hive
    • Hive Versions
    • Hive Commands
    • Hive Data Types
    • Hive Built-in Functions
    • Hive Function
    • Hive String Functions
    • Date Functions in Hive
    • Hive Table
    • Hive Drop Table
    • Hive Show Tables
    • Hive Group By
    • Hive Order By
    • Hive Cluster By
    • Joins in Hive
    • Hive Inner Join
    • Map Join in Hive
    • Hive nvl
    • Hive UDF
    • Dynamic Partitioning in Hive
    • HiveQL
    • HiveQL Queries
    • HiveQL Group By
    • Partitioning in Hive
    • Bucketing in Hive
    • Views in Hive
    • Indexes in Hive
    • External Table in Hive
    • Hive TimeStamp
    • Hive Database
    • Hive Interview Questions
    • Hive insert into

Related Courses

Hive Certification Course

Hadoop Course Training

All in One Data Science Course

Bucketing in Hive

By Priya PedamkarPriya Pedamkar

Bucketing in Hive

Introduction to Bucketing in Hive

Bucketing is a technique offered by Apache Hive to decompose data into more manageable parts, also known as buckets. This concept enhances query performance. Bucketing can be followed by partitioning, where partitions can be further divided into buckets. Bucketing comes into play when partitioning hive data sets into segments is not effective and can overcome over partitioning. “CLUSTERED BY” clause is used to do bucketing in Hive. The SORTED BY clause ensures local ordering in each bucket, by keeping the rows in each bucket ordered by one or more columns. The plausible cases when bucketing should be favoured are:

  • When the number of partitions is limited, or
  • When all partitions are of almost equal size

Features of Bucketing in Hive

  • To read and store data in buckets, a hashing algorithm is used to calculate the bucketed column value (simplest hashing function is modulus). For example, if we decide to have a total number of buckets to be 10, data will be stored in column value % 10, ranging from 0-9 (0 to n-1) buckets.
  • Bucketing is preferred for high cardinality columns as files are physically split into buckets. Each bucket in the Hive is created as a file. Bucket numbering is 1- based.
  • Query optimization happens in two layers known as bucket pruning and partition pruning if bucketing is done on partitioned tables.

Example of Bucketing in Hive

Taking an example, let us create a partitioned and a bucketed table named “student”,

Start Your Free Data Science Course

Hadoop, Data Science, Statistics & others

CREATE TABLE student (
Student name,
Roll_number int,
Class int
) PARTITIONED BY (class int) CLUSTERED BY (roll_number) into 15 buckets row format delimited fields terminated by ',' stored as TEXTFILE

Using the ALTER TABLE students CLUSTERED BY (class) into 32 buckets, data can be bucketed only for the new insertions. There is no way that we can reorganize or reformat the existing data. ALTER TABLE command can only modify the metadata, with no effect on existing ones.

Optimizer comes into play when the “where” clause has been executed, for pruning strategies to kick in. In our case, there are 10 directories (classes) multiplied by 15 files (roll buckets), so 150 files are present for the “student” table. Partition pruning will happen, if the where clause has only class = 10, as then 15 files will be searched into. Bucket pruning will happen if the where clause has roll_number = 101, then only 10 directories need to be searched. If both class and roll_number are included in where clause, only one file will be searched. This is how to query layer optimization happens along with these features in HiveQL.

To insert values or data in a bucketed table, we have to specify below property in Hive,

set hive.enforce.bucketing =True

This property is used to enable dynamic bucketing in Hive, while data is being loaded in the same way as dynamic partitioning is set using this:

All in One Data Science Bundle(360+ Courses, 50+ projects)
Python TutorialMachine LearningAWSArtificial Intelligence
TableauR ProgrammingPowerBIDeep Learning
Price
View Courses
360+ Online Courses | 50+ projects | 1500+ Hours | Verifiable Certificates | Lifetime Access
4.7 (85,992 ratings)

set hive.exec.dynamic.partition = True

On setting

hive.enforce.bucketing =true

several reduce tasks is set equal to the number of buckets that are mentioned in the table.

Set hive.optimize.bucketmapjoin = True

This enables the bucket to join operation, leading to reduced scan cycles while executing queries on bucketed tables. Also, bucketing can be achieved by setting this property,

Set mapred.reduce.tasks=64 (number of buckets)

After setting this property, the data is populated in the table, as seen in the below query:

INSERT OVERWRITE TABLE student
PARTITION (class=10)
SELECT student, roll_number, class where class=10;

How to Decide the Number of Buckets?

To decide the number of buckets to be specified, we need to know the data characteristics and the query we want to execute. Buckets can be created in Hive, with or without partitions created on them. Basis what we are trying to do, below 3 ways can help us:

  • Sampling Performance: Bucketing improves performance for some sampling operation.
  • Join Performance: Join Performance becomes more effective if the two tables that are to be joined are basis the join keys which are bucket columns. But with the advent of cost-based optimization in the recent versions of Hive, the optimizer has the ability to choose between shuffle or map side join, whichever is better. In this case, the bucket join may hinder with choosing the better one.
  • Select Performance or Predicate pushdown: Bucketing in Hive helps in predicate pushdown. If we filter on one value from a bucketing column, we will look in only one bucket.

How does Data Distribution Happen in Buckets?

Records get distributed in buckets based on the hash value from a defined hashing algorithm. The hash value obtained from the algorithm varies with the data type of the bucketed column.

For the Int data type, the hash values are also an integer. However, for string or another complex data type, the hash value is computed basis each character present in the string or some other logic which is difficult to decode. Using a hashing algorithm for data distribution enables even distribution of data across buckets.

Advantages of Bucketing

  • Bucketed tables allow faster execution of map side joins, as the data is stored in equal-sized buckets/parts.
  • Efficient sampling happens for bucketed tables when compared to non – bucketed ones.
  • As in partitioning, the Bucketing feature also offers faster query performance.

Limitations of Bucketing

  • Loading data into buckets needs to be done manually by programmers, as the “Create table” command only takes care of the metadata.
  • Bucketing is not of any incremental value if data is small enough for partitioning to suffice for.

Conclusion – Bucketing in Hive

In this article, we have highlighted the features of creating buckets in a query for large datasets. We can also decide on the number of buckets to be specified and performance enhancements we are expecting from it. Both Partitioning and Bucketing are essential features of Hive, making efficient testing and debugging tasks while handling large data-sets. Breaking a table into partitions and then further segmenting partitions into buckets. Using, clustered by and sort by clause makes bucketing easy to implement.

Recommended Articles

This is a guide to Bucketing in Hive. Here we discuss the features, examples, advantages, and limitations of Bucketing in Hive. You may also look at the following articles to learn more –

  1. Top 6 Features of Hadoop Tools
  2. Data Manipulation of HiveQL
  3. Hive Architecture with Components
  4. How to Install Hive?
  5. Comprehensive Guide to Partitioning in Hive
Popular Course in this category
Hive Training (2 Courses, 5+ Projects)
  2 Online Courses |  5 Hands-on Projects |  25+ Hours |  Verifiable Certificate of Completion
4.5
Price

View Course

Related Courses

Hadoop Training Program (20 Courses, 14+ Projects, 4 Quizzes)4.9
All in One Data Science Bundle (360+ Courses, 50+ projects)4.8
0 Shares
Share
Tweet
Share
Primary Sidebar
Footer
About Us
  • Blog
  • Who is EDUCBA?
  • Sign Up
  • Live Classes
  • Corporate Training
  • Certificate from Top Institutions
  • Contact Us
  • Verifiable Certificate
  • Reviews
  • Terms and Conditions
  • Privacy Policy
  •  
Apps
  • iPhone & iPad
  • Android
Resources
  • Free Courses
  • Database Management
  • Machine Learning
  • All Tutorials
Certification Courses
  • All Courses
  • Data Science Course - All in One Bundle
  • Machine Learning Course
  • Hadoop Certification Training
  • Cloud Computing Training Course
  • R Programming Course
  • AWS Training Course
  • SAS Training Course

ISO 10004:2018 & ISO 9001:2015 Certified

© 2022 - EDUCBA. ALL RIGHTS RESERVED. THE CERTIFICATION NAMES ARE THE TRADEMARKS OF THEIR RESPECTIVE OWNERS.

EDUCBA
Free Data Science Course

SPSS, Data visualization with Python, Matplotlib Library, Seaborn Package

*Please provide your correct email id. Login details for this Free course will be emailed to you

By signing up, you agree to our Terms of Use and Privacy Policy.

EDUCBA Login

Forgot Password?

By signing up, you agree to our Terms of Use and Privacy Policy.

EDUCBA
Free Data Science Course

Hadoop, Data Science, Statistics & others

*Please provide your correct email id. Login details for this Free course will be emailed to you

By signing up, you agree to our Terms of Use and Privacy Policy.

EDUCBA

*Please provide your correct email id. Login details for this Free course will be emailed to you

By signing up, you agree to our Terms of Use and Privacy Policy.

Let’s Get Started

By signing up, you agree to our Terms of Use and Privacy Policy.

This website or its third-party tools use cookies, which are necessary to its functioning and required to achieve the purposes illustrated in the cookie policy. By closing this banner, scrolling this page, clicking a link or continuing to browse otherwise, you agree to our Privacy Policy

Loading . . .
Quiz
Question:

Answer:

Quiz Result
Total QuestionsCorrect AnswersWrong AnswersPercentage

Explore 1000+ varieties of Mock tests View more