EDUCBA

EDUCBA

MENUMENU
  • Free Tutorials
  • Free Courses
  • Certification Courses
  • 360+ Courses All in One Bundle
  • Login
Home Data Science Data Science Tutorials Hive Tutorial Partitioning in Hive
Secondary Sidebar
Hive Tutorial
  • Basics
    • Hive JDBC Driver
    • What is a Hive
    • Hive Architecture
    • Hive Installation
    • How To Install Hive
    • Hive Versions
    • Hive Commands
    • Hive Data Types
    • Hive Built-in Functions
    • Hive Function
    • Hive String Functions
    • Date Functions in Hive
    • Hive Table
    • Hive Drop Table
    • Hive Show Tables
    • Hive Group By
    • Hive Order By
    • Hive Cluster By
    • Joins in Hive
    • Hive Inner Join
    • Map Join in Hive
    • Hive nvl
    • Hive UDF
    • Dynamic Partitioning in Hive
    • HiveQL
    • HiveQL Queries
    • HiveQL Group By
    • Partitioning in Hive
    • Bucketing in Hive
    • Views in Hive
    • Indexes in Hive
    • External Table in Hive
    • Hive TimeStamp
    • Hive Database
    • Hive Interview Questions
    • Hive insert into

Partitioning in Hive

By Priya PedamkarPriya Pedamkar

Partitioning in Hive

Introduction to Partitioning in Hive

Partitioning is a feature in Hive similar to RDBMS, making querying large datasets much faster and cost-effective. Partitioned tables are logical segments of large data tables based on one or more columns. This makes analyzing data much easier as only relevant subsets can be further investigated for deriving insights. This notion of partitioning is an old one, distributing the load horizontally and moving data closer to the user. Both external and managed (or internal) tables can be partitioned in Hive. Further, bucketing can be done using CLUSTERED by columns on these tables for improved query performance for certain queries.

Creating Data into Hive Tables

Let us create a table to manage “Wallet expenses”, which any digital wallet channel may have to track customers’ spend behaviour, having the following columns:

Start Your Free Data Science Course

Hadoop, Data Science, Statistics & others

Month   String
Spender   String
Merchant   String
Mode   String
Amount   Float

To track monthly expenses, we want to create a partitioned table with columns month and spender.

CREATE TABLE expenses (Month String,
Spender String,
Merchant String,
Mode String,
Amount Float
)
PARTITIONED BY (Month STRING, Spender STRING)
Row format delimited fields terminated by ",";

We get to know the partition keys using the below commands.

Commands:

describe formatted expenses;
show partitions expenses;

Partition keys behave like regular columns, once created, where users need not care whether it is a partitioned column or not unless optimization is required. Also, table schema need not have partition columns specified again as partitions create pseudo columns to query on.

Inserting Data into Hive Tables

Data insertion in HiveQL table can be done in two ways:

1. Static Partitioning

In static partitioning mode, we insert data individually into partitions. Each time data is loaded, the partition column value needs to be specified.

To insert value to the “expenses” table, using the below command in strict mode.

Command:

INSERT INTO TABLE expenses PARTITION (month= ‘201901’, spender = ‘PAY1001’)
SELECT month, spender, merchant, mode, amount
FROM expenses WHERE month=‘201901’ and spender = ‘PAY1001’;

Considering the table “expenses”, if there are 12 months and 100 spenders, then 12*100 = 1200 single insert statements will be written to insert all the table values.

There are certain types of query which are not allowed to run in MapReduce strict mode, i.e. when hive. mapred.mode = strict. These include:

  • Order by without limits
  • Cartesian product
  • Comparing bigints and doubles
  • Comparing bigints and strings
  • No partition key being picked up in a query.

2. Dynamic Partitioning

In dynamic partitioning mode, data is inserted automatically in partitions. It identifies the partition column values to be inserted. By default, Hive allows static partitioning, to prevent creating partitions for tables by accident. To set Hive to dynamic/unstrict mode, certain properties need to be explicitly defined.

Properties

  • hive> set hive.exec.dynamic.partition=true;
  • hive> set hive.exec.dynamic.partition.mode=nonstrict;
  • hive> set hive.exec.max.dynamic.partitions.pernode=1000; //sets the maximum number of dynamic partitions which a mapper or reducer can create, default value is 100.

After the dynamic properties are set as above, to insert value to the “expenses” table, below is the command.

Command:

INSERT INTO TABLE expenses PARTITION (month, spender) stored as sequencefile
SELECT month, spender, merchant, mode, amount
FROM expenses;

OVERWRITE command is used to overwrite the partition column values and replace them with new content. The whole table will be dropped on using overwrite if it is a non-partitioned table. INTO command will append to an existing table and not replace it from HIVE V0.8.0 and later.

Command:

INSERT OVERWRITE TABLE expenses PARTITION (month, spender) stored as sequence file
SELECT month, spender, merchant, mode, amount
FROM expenses;

Commands Used on Partitions in Hive

Below are some of the important commands used on partitions:

1. ALTER Partitions

There can be instances where the partitions created in a table need to be renamed or deleted or added ( same as an insert).

Command:

ALTER TABLE expenses PARTITION (spender = PAY1001) RENAME TO PARTITION( spender = PAYP1001)

We can verify this change by running the “SHOW PARTITIONS” command on the table. Partition column value changes; however, the metadata underlying it remains the same.

Command:

ALTER TABLE expenses DROP IF EXISTS PARTITION (month = 201902)

The partition gets deleted using this command. Both sub-directory and metadata are deleted in case of internal or managed tables.

2. EXCHANGE Partitions

If we have created partition in one table in expenses, it can be moved to another table customer with the same scheme does not have this partition present.

Command:

ALTER TABLE customer EXCHANGE PARTITION (spender) WITH TABLE expenses

3. TOUCH Partitions

The purpose of using this command is to read the metadata and write it back. It is widely used to log or fire hooks in case the table or partition is modified.

Command:

ALTER TABLE expenses TOUCH PARTITION (month, spender)

4. TRUNCATE Partitions

Truncate is used to remove a table or partition even from Trash, as this is similar to using PURGE.

Command:

TRUNCATE TABLE expenses PARTITION (month, spender)

5. PARTIAL Partitions

To change any existing partitions at once by using a single ALTER table statement, so that we don’t need to write multiple such statements, partial partitioning can be used.

We need to set hive.exec.dynamic.partition = true, to enable partial partitioning specifications.

Command:

ALTER TABLE expenses PARTITION (month, spender) CHANGE COLUMN amount amount DECIMAL(38,18)

Advantage and Limitation of Partitioning in Hive

Here are the advantage and limitation of Partitioning in hive explained below:

Advantages: Tables are stored in parts/segments making query response time faster as manipulation or search is required on a small segment rather than traversing the whole table.

Limitation: Too many partitions increase the overhead on name nodes as all metadata is stored in memory only. Each MapReduce job may end up having a huge volume of tasks (running in separate JVMs) due to a large number of partitions in the MapReduce execution engine.

Conclusion

We have got a fair idea of why partitioned tables will be more useful for large data sets with logical segments to be delved into. Widespread use case of partitions is analyzing time-series trends for customers, spending behaviour on specific Merchant categories, industry-wise profit trends, etc. Hive makes partitioning easy by abstracting the details for the users.

Recommended Articles

This is a guide to Partitioning in Hive. Here we discuss creating, inserting and commands used for partitioning in Hive along with their advantages and limitations. You may also look at the following articles to learn more –

  1. What is a Data Scientist?
  2. Hive Data Types
  3. Hive Installation
  4. Hive Versions | Top 7
Popular Course in this category
Hive Training (2 Courses, 5+ Projects)
  2 Online Courses |  5 Hands-on Projects |  25+ Hours |  Verifiable Certificate of Completion
4.5
Price

View Course

Related Courses

Hadoop Training Program (20 Courses, 14+ Projects, 4 Quizzes)4.9
All in One Data Science Bundle (360+ Courses, 50+ projects)4.8
Primary Sidebar
Footer
About Us
  • Blog
  • Who is EDUCBA?
  • Sign Up
  • Live Classes
  • Corporate Training
  • Certificate from Top Institutions
  • Contact Us
  • Verifiable Certificate
  • Reviews
  • Terms and Conditions
  • Privacy Policy
  •  
Apps
  • iPhone & iPad
  • Android
Resources
  • Free Courses
  • Database Management
  • Machine Learning
  • All Tutorials
Certification Courses
  • All Courses
  • Data Science Course - All in One Bundle
  • Machine Learning Course
  • Hadoop Certification Training
  • Cloud Computing Training Course
  • R Programming Course
  • AWS Training Course
  • SAS Training Course

ISO 10004:2018 & ISO 9001:2015 Certified

© 2023 - EDUCBA. ALL RIGHTS RESERVED. THE CERTIFICATION NAMES ARE THE TRADEMARKS OF THEIR RESPECTIVE OWNERS.

EDUCBA

*Please provide your correct email id. Login details for this Free course will be emailed to you

Let’s Get Started

By signing up, you agree to our Terms of Use and Privacy Policy.

EDUCBA

*Please provide your correct email id. Login details for this Free course will be emailed to you
EDUCBA

*Please provide your correct email id. Login details for this Free course will be emailed to you
EDUCBA Login

Forgot Password?

By signing up, you agree to our Terms of Use and Privacy Policy.

This website or its third-party tools use cookies, which are necessary to its functioning and required to achieve the purposes illustrated in the cookie policy. By closing this banner, scrolling this page, clicking a link or continuing to browse otherwise, you agree to our Privacy Policy

Loading . . .
Quiz
Question:

Answer:

Quiz Result
Total QuestionsCorrect AnswersWrong AnswersPercentage

Explore 1000+ varieties of Mock tests View more