EDUCBA Logo

EDUCBA

MENUMENU
  • Explore
    • EDUCBA Pro
    • PRO Bundles
    • Featured Skills
    • New & Trending
    • Fresh Entries
    • Finance
    • Data Science
    • Programming and Dev
    • Excel
    • Marketing
    • HR
    • PDP
    • VFX and Design
    • Project Management
    • Exam Prep
    • All Courses
  • Blog
  • Enterprise
  • Free Courses
  • Log in
  • Sign Up
Home Data Science Data Science Tutorials Hive Tutorial Partitioning in Hive
 

Partitioning in Hive

Priya Pedamkar
Article byPriya Pedamkar

Updated March 24, 2023

Partitioning in Hive

 

 

Introduction to Partitioning in Hive

Partitioning is a feature in Hive similar to RDBMS, making querying large datasets much faster and cost-effective. Partitioned tables are logical segments of large data tables based on one or more columns. This makes analyzing data much easier as only relevant subsets can be further investigated for deriving insights. This notion of partitioning is an old one, distributing the load horizontally and moving data closer to the user. Both external and managed (or internal) tables can be partitioned in Hive. Further, bucketing can be done using CLUSTERED by columns on these tables for improved query performance for certain queries.

Watch our Demo Courses and Videos

Valuation, Hadoop, Excel, Mobile Apps, Web Development & many more.

Creating Data into Hive Tables

Let us create a table to manage “Wallet expenses”, which any digital wallet channel may have to track customers’ spend behaviour, having the following columns:

Month   String
Spender   String
Merchant   String
Mode   String
Amount   Float

To track monthly expenses, we want to create a partitioned table with columns month and spender.

CREATE TABLE expenses (Month String,
Spender String,
Merchant String,
Mode String,
Amount Float
)
PARTITIONED BY (Month STRING, Spender STRING)
Row format delimited fields terminated by ",";

We get to know the partition keys using the below commands.

Commands:

describe formatted expenses;
show partitions expenses;

Partition keys behave like regular columns, once created, where users need not care whether it is a partitioned column or not unless optimization is required. Also, table schema need not have partition columns specified again as partitions create pseudo columns to query on.

Inserting Data into Hive Tables

Data insertion in HiveQL table can be done in two ways:

1. Static Partitioning

In static partitioning mode, we insert data individually into partitions. Each time data is loaded, the partition column value needs to be specified.

To insert value to the “expenses” table, using the below command in strict mode.

Command:

INSERT INTO TABLE expenses PARTITION (month= ‘201901’, spender = ‘PAY1001’)
SELECT month, spender, merchant, mode, amount
FROM expenses WHERE month=‘201901’ and spender = ‘PAY1001’;

Considering the table “expenses”, if there are 12 months and 100 spenders, then 12*100 = 1200 single insert statements will be written to insert all the table values.

There are certain types of query which are not allowed to run in MapReduce strict mode, i.e. when hive. mapred.mode = strict. These include:

  • Order by without limits
  • Cartesian product
  • Comparing bigints and doubles
  • Comparing bigints and strings
  • No partition key being picked up in a query.

2. Dynamic Partitioning

In dynamic partitioning mode, data is inserted automatically in partitions. It identifies the partition column values to be inserted. By default, Hive allows static partitioning, to prevent creating partitions for tables by accident. To set Hive to dynamic/unstrict mode, certain properties need to be explicitly defined.

Properties

  • hive> set hive.exec.dynamic.partition=true;
  • hive> set hive.exec.dynamic.partition.mode=nonstrict;
  • hive> set hive.exec.max.dynamic.partitions.pernode=1000; //sets the maximum number of dynamic partitions which a mapper or reducer can create, default value is 100.

After the dynamic properties are set as above, to insert value to the “expenses” table, below is the command.

Command:

INSERT INTO TABLE expenses PARTITION (month, spender) stored as sequencefile
SELECT month, spender, merchant, mode, amount
FROM expenses;

OVERWRITE command is used to overwrite the partition column values and replace them with new content. The whole table will be dropped on using overwrite if it is a non-partitioned table. INTO command will append to an existing table and not replace it from HIVE V0.8.0 and later.

Command:

INSERT OVERWRITE TABLE expenses PARTITION (month, spender) stored as sequence file
SELECT month, spender, merchant, mode, amount
FROM expenses;

Commands Used on Partitions in Hive

Below are some of the important commands used on partitions:

1. ALTER Partitions

There can be instances where the partitions created in a table need to be renamed or deleted or added ( same as an insert).

Command:

ALTER TABLE expenses PARTITION (spender = PAY1001) RENAME TO PARTITION( spender = PAYP1001)

We can verify this change by running the “SHOW PARTITIONS” command on the table. Partition column value changes; however, the metadata underlying it remains the same.

Command:

ALTER TABLE expenses DROP IF EXISTS PARTITION (month = 201902)

The partition gets deleted using this command. Both sub-directory and metadata are deleted in case of internal or managed tables.

2. EXCHANGE Partitions

If we have created partition in one table in expenses, it can be moved to another table customer with the same scheme does not have this partition present.

Command:

ALTER TABLE customer EXCHANGE PARTITION (spender) WITH TABLE expenses

3. TOUCH Partitions

The purpose of using this command is to read the metadata and write it back. It is widely used to log or fire hooks in case the table or partition is modified.

Command:

ALTER TABLE expenses TOUCH PARTITION (month, spender)

4. TRUNCATE Partitions

Truncate is used to remove a table or partition even from Trash, as this is similar to using PURGE.

Command:

TRUNCATE TABLE expenses PARTITION (month, spender)

5. PARTIAL Partitions

To change any existing partitions at once by using a single ALTER table statement, so that we don’t need to write multiple such statements, partial partitioning can be used.

We need to set hive.exec.dynamic.partition = true, to enable partial partitioning specifications.

Command:

ALTER TABLE expenses PARTITION (month, spender) CHANGE COLUMN amount amount DECIMAL(38,18)

Advantage and Limitation of Partitioning in Hive

Here are the advantage and limitation of Partitioning in hive explained below:

Advantages: Tables are stored in parts/segments making query response time faster as manipulation or search is required on a small segment rather than traversing the whole table.

Limitation: Too many partitions increase the overhead on name nodes as all metadata is stored in memory only. Each MapReduce job may end up having a huge volume of tasks (running in separate JVMs) due to a large number of partitions in the MapReduce execution engine.

Conclusion

We have got a fair idea of why partitioned tables will be more useful for large data sets with logical segments to be delved into. Widespread use case of partitions is analyzing time-series trends for customers, spending behaviour on specific Merchant categories, industry-wise profit trends, etc. Hive makes partitioning easy by abstracting the details for the users.

Recommended Articles

This is a guide to Partitioning in Hive. Here we discuss creating, inserting and commands used for partitioning in Hive along with their advantages and limitations. You may also look at the following articles to learn more –

  1. What is a Data Scientist?
  2. Hive Data Types
  3. Hive Installation
  4. Hive Versions | Top 7

Primary Sidebar

Footer

Follow us!
  • EDUCBA FacebookEDUCBA TwitterEDUCBA LinkedINEDUCBA Instagram
  • EDUCBA YoutubeEDUCBA CourseraEDUCBA Udemy
APPS
EDUCBA Android AppEDUCBA iOS App
Blog
  • Blog
  • Free Tutorials
  • About us
  • Contact us
  • Log in
Courses
  • Enterprise Solutions
  • Free Courses
  • Explore Programs
  • All Courses
  • All in One Bundles
  • Sign up
Email
  • [email protected]

ISO 10004:2018 & ISO 9001:2015 Certified

© 2025 - EDUCBA. ALL RIGHTS RESERVED. THE CERTIFICATION NAMES ARE THE TRADEMARKS OF THEIR RESPECTIVE OWNERS.

EDUCBA

*Please provide your correct email id. Login details for this Free course will be emailed to you
Loading . . .
Quiz
Question:

Answer:

Quiz Result
Total QuestionsCorrect AnswersWrong AnswersPercentage

Explore 1000+ varieties of Mock tests View more

EDUCBA

*Please provide your correct email id. Login details for this Free course will be emailed to you
EDUCBA
Free Data Science Course

Hadoop, Data Science, Statistics & others

By continuing above step, you agree to our Terms of Use and Privacy Policy.
*Please provide your correct email id. Login details for this Free course will be emailed to you
EDUCBA

*Please provide your correct email id. Login details for this Free course will be emailed to you

EDUCBA Login

Forgot Password?

🚀 Limited Time Offer! - 🎁 ENROLL NOW