
Spark Repartition

Article by Priya Pedamkar

Updated February 27, 2023

Introduction to Spark Repartition

The repartition() method is used to increase or decrease the number of partitions of an RDD or DataFrame in Spark. It performs a full shuffle of data across all the nodes and creates partitions of roughly equal size. This makes it a costly operation, since it involves data movement across the network. Partitions play an important role in the degree of parallelism: the number of parallel tasks running in each stage is equal to the number of partitions. Thus, we can control parallelism using the repartition() method. It also decides the number of files generated in the output.
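As a hedged sketch of this idea (assuming a running spark-shell, which provides `spark` and `sc`, and write access to a hypothetical path `/tmp/repartition-demo`), the partition count directly controls both task parallelism and the number of output files:

```scala
// Assumes a spark-shell session; `spark` (SparkSession) is provided by the shell.
val df = spark.range(0, 1000000) // a Dataset[Long] with some default partition count
println(df.rdd.getNumPartitions) // depends on spark.default.parallelism

// Each partition becomes one task per stage, and one part file on write,
// so repartition(4) here should yield 4 part files under the output path.
df.repartition(4)
  .write.mode("overwrite")
  .csv("/tmp/repartition-demo") // hypothetical output path
```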

Syntax:

obj.repartition(numPartitions)

Here, obj is an RDD or DataFrame, and numPartitions is the number of partitions we want to create.
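A minimal illustration of the syntax in spark-shell (the collection and the partition counts are arbitrary choices for demonstration):

```scala
// `sc` (SparkContext) and `spark` (SparkSession) are provided by spark-shell.
val rdd = sc.parallelize(1 to 100, 2) // start with 2 partitions
val rdd10 = rdd.repartition(10)       // full shuffle into 10 partitions
println(rdd10.getNumPartitions)       // 10

// The same method exists on DataFrames:
val df10 = spark.range(100).toDF("n").repartition(10)
println(df10.rdd.getNumPartitions)    // 10
```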

How to Use Spark Repartition?

In order to use the method, the following steps have to be followed:

  1. We need to create an RDD or DataFrame on which to call the method.
    We can create an RDD/DataFrame by a) loading data from external sources like HDFS or databases like Cassandra, or b) calling the parallelize() method on a Spark context object and passing a collection as the parameter (and then invoking toDF() if we need a DataFrame).
  2. Next, we decide an appropriate value of numPartitions. We cannot choose a very large or very small value. If we choose a very large value, a large number of files will be generated, and it will be difficult for HDFS to maintain the metadata. On the other hand, if we choose a very small value, the data in each partition will be huge and will take a lot of time to process.
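The steps above can be sketched as follows (a toy collection stands in for real data, and 4 is an arbitrary but moderate choice of numPartitions):

```scala
// Step 1: create an RDD (here via parallelize) and turn it into a DataFrame.
case class Sale(state: String, amount: Long) // hypothetical record type
val data = Seq(Sale("CA", 10L), Sale("NY", 20L), Sale("CA", 5L))
val rdd = sc.parallelize(data)
val df = rdd.toDF() // toDF() needs spark.implicits._, auto-imported in spark-shell

// Step 2: pick a moderate numPartitions -- large enough for parallelism,
// small enough that each partition (and output file) stays a reasonable size.
val repartitioned = df.repartition(4)
println(repartitioned.rdd.getNumPartitions) // 4
```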

Examples of Spark Repartition

Following are examples of Spark repartition:

Example #1 – On RDDs

The dataset us-counties.csv represents the number of corona cases at the county and state level in the USA, in a cumulative manner. The dataset is obtained from https://www.kaggle.com/, and the latest data available is for 10th April. Let’s find the number of corona cases up to the 10th of April in various states of the USA.

Code:

sc.setLogLevel("ERROR")
val myRDD = sc.textFile("/home/hadoop/work/arindam/us-counties.csv")
myRDD.take(5).foreach(println) // Printing to show what the data looks like
println("Number of partitions in myRDD: " + myRDD.getNumPartitions) // Printing no of partitions
val head = myRDD.first()
val myRDD1 = myRDD.filter(x => x != head && x.split(",")(0) == "2020-04-10") // Filtering out the header and taking the latest data available
val myRDD2 = myRDD1.repartition(10) // Repartitioning to 10 partitions
println("Number of partitions in myRDD2: " + myRDD2.getNumPartitions) // Printing no of partitions after repartition
val myRDD3 = myRDD2.map(x => (x.split(",")(2), x.split(",")(4).toLong)) // Creating a pair RDD of (state, no of cases)
val rslt = myRDD3.reduceByKey((x, y) => x + y).collect().sortBy(x => x._2)(Ordering[Long].reverse) // Summing up the cases per state
rslt.foreach(println)

Output:


We sort the output by the number of cases in descending order, so that the most affected states appear at the top.

Spark UI:


As we created 10 partitions, the last two stages spawn 10 tasks each.

Example #2 – On Dataframes

Let’s consider the same problem as in example 1, but this time we are going to solve it using DataFrames and Spark SQL.

Code:

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
val inputSchema = StructType(Array(
  StructField("date", StringType, true),
  StructField("county", StringType, true),
  StructField("state", StringType, true),
  StructField("fips", LongType, true),
  StructField("cases", LongType, true),
  StructField("deaths", LongType, true)
))
val inpPath = "/home/hadoop/work/arindam/us-counties.csv"
val df = spark.read.option("header", true).schema(inputSchema).csv(inpPath)
df.show(5, false) // Printing 5 rows
println("No of partitions in df: " + df.rdd.getNumPartitions)
val newDf = df.repartition(10)
println("No of partitions in newDf: " + newDf.rdd.getNumPartitions)
newDf.createOrReplaceTempView("tempTable")
spark.sql("select state, SUM(cases) as cases from tempTable where date='2020-04-10' group by state order by cases desc").show(10, false)

Here we created a schema first. Then, while reading the CSV file, we imposed the defined schema in order to create a DataFrame. Next, we called the repartition method and changed the number of partitions to 10. After this, we registered the DataFrame newDf as a temp table. Finally, we wrote a Spark SQL query to get the required result. Please note that there is no method to get the number of partitions from a DataFrame directly; we have to convert the DataFrame to an RDD and then call the getNumPartitions method on it.

Output:


We print only the top ten states here, and the results match those calculated in the previous example.

Recommended Articles

This is a guide to Spark Repartition. Here we discuss the introduction and how to use Spark repartition, along with different examples and their code implementation. You may also have a look at the following articles to learn more –

  1. Spark Versions
  2. Longitudinal Data Analysis