Spark Repartition

By Priya Pedamkar


Introduction to Spark Repartition

The repartition() method is used to increase or decrease the number of partitions of an RDD or dataframe in Spark. This method performs a full shuffle of data across all the nodes. It creates partitions of more or less equal size. This is a costly operation given that it involves data movement all over the network. Partitions play an important role in the degree of parallelism: the number of parallel tasks running in each stage is equal to the number of partitions. Thus, we can control parallelism using the repartition() method. It also plays a role in deciding the number of files generated in the output.

Syntax:

obj.repartition(numPartitions)

Here, obj is an RDD or data frame and numPartitions is a number signifying the number of partitions we want to create.
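For instance, the following minimal sketch (assuming a spark-shell session where sc is available; the collection and the partition counts are made-up values for illustration) shows how repartition() changes the partition count of an RDD:

// Create a small RDD with 2 partitions (illustrative values only)
val nums = sc.parallelize(1 to 100, 2)
println("Partitions before: " + nums.getNumPartitions) // 2
// Repartition to 4 partitions; this triggers a full shuffle across the cluster
val numsRepart = nums.repartition(4)
println("Partitions after: " + numsRepart.getNumPartitions) // 4

The same call works on a dataframe; only the object on which repartition() is invoked changes.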


How to Use Spark Repartition?

In order to use the method, the following steps have to be followed:

  1. We need to create an RDD or dataframe on which we can call the method. We can create an RDD/dataframe by a) loading data from external sources like HDFS or databases like Cassandra, or b) calling the parallelize() method on a Spark context object and passing a collection as the parameter (and then invoking toDF() if we need a dataframe).
  2. Next is to decide an appropriate value of numPartitions. We cannot choose a very large or very small value of numPartitions. This is because if we choose a very large value, a large number of files will be generated and it will be difficult for the HDFS system to maintain the metadata. On the other hand, if we choose a very small value, the data in each partition will be huge and will take a lot of time to process. A sketch illustrating both steps is given after this list.
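A rough sketch of these two steps (assuming a spark-shell session with spark.implicits._ in scope; the sample records and the heuristic for numPartitions are illustrative assumptions, not fixed rules):

// Step 1: create a dataframe, here from an in-memory collection via parallelize() and toDF()
val sample = Seq(("2020-04-10", "New York", 100L), ("2020-04-10", "California", 80L))
val sampleDf = sc.parallelize(sample).toDF("date", "state", "cases")
// Step 2: decide numPartitions; falling back to the cluster's default parallelism
// avoids the extremes of too many small files or too few huge partitions
val numPartitions = sc.defaultParallelism
val sampleRepart = sampleDf.repartition(numPartitions)
println("Partitions: " + sampleRepart.rdd.getNumPartitions)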

Examples of Spark Repartition

Following are the examples of Spark repartition:

Example #1 – On RDDs

The dataset us-counties.csv represents the number of corona cases at the county and state level in the USA in a cumulative manner. This dataset is obtained from https://www.kaggle.com/, and the latest data available is for 10th April. Let's find the number of corona cases till the 10th of April in the various states of the USA.

Code:

sc.setLogLevel("ERROR")
valmyRDD = sc.textFile("https://cdn.educba.com/home/hadoop/work/arindam/us-counties.csv")
myRDD.take(5).foreach(println) //Printing to show how the data looks like
println("Number of partitions in myRDD: "+myRDD.getNumPartitions) //Printing no of partitions
val head = myRDD.first()
val myRDD1 = myRDD.filter(x=>x!=head &&x.split(",")(0)=="2020-04-10") //Filtering out header and taking latest data available
val myRDD2 = myRDD1.repartition(10) // repartitioning to 10 partitions
println("Number of partitions in myRDD: "+myRDD2.getNumPartitions) //Printing partitions after repartition
val myRDD3 = myRDD2.map(x=>(x.split(",")(2),x.split(",")(4).toLong)) //Creating pairWise RDD with State and no of cases
valrslt = myRDD3.reduceByKey((x,y)=>x+y).collect().sortBy(x=>x._2)(Ordering[Long].reverse) //Summing up all the values of cases
rslt.foreach(println)

Output:

[Output screenshot]

We are sorting the output by the number of cases in descending order so that the most-affected states fit in the output shown.

Spark UI:

[Spark UI screenshot]

As we created 10 partitions, the last two stages each spawn 10 tasks.

Example #2 – On Dataframes

Let’s consider the same problem as in example 1, but this time we are going to solve it using dataframes and Spark SQL.

Code:

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
val inputSchema = StructType(Array(
  StructField("date", StringType, true),
  StructField("county", StringType, true),
  StructField("state", StringType, true),
  StructField("fips", LongType, true),
  StructField("cases", LongType, true),
  StructField("deaths", LongType, true)
))
val inpPath = "/home/hadoop/work/arindam/us-counties.csv"
val df = spark.read.option("header", true).schema(inputSchema).csv(inpPath)
df.show(5, false) // Printing 5 rows
println("No of partitions in df: " + df.rdd.getNumPartitions)
val newDf = df.repartition(10)
println("No of partitions in newDf: " + newDf.rdd.getNumPartitions)
newDf.createOrReplaceTempView("tempTable")
spark.sql("select state, SUM(cases) as cases from tempTable where date='2020-04-10' group by state order by cases desc").show(10, false)

Here we created a schema first. Then, while reading the CSV file, we imposed the defined schema in order to create a dataframe. Then, we called the repartition method and changed the number of partitions to 10. After this, we registered the dataframe newDf as a temp table. Finally, we wrote a Spark SQL query to get the required result. Please note that we don’t have any method to get the number of partitions from a dataframe directly. We have to convert the dataframe to an RDD and then call the getNumPartitions method to get the number of partitions available.

Output:

[Output screenshot]

We are printing only the top ten states here, and the results match those calculated in the previous example.
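For reference, the same aggregation can also be written with the DataFrame API instead of registering a temp table. This is just an alternative sketch that assumes the newDf dataframe created above and produces the same top-ten listing:

import org.apache.spark.sql.functions.{sum, desc}
// Equivalent of the Spark SQL query above, expressed with DataFrame operations
val rslt2 = newDf.filter(newDf("date") === "2020-04-10").groupBy("state").agg(sum("cases").as("cases")).orderBy(desc("cases"))
rslt2.show(10, false)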

Recommended Articles

This is a guide to Spark Repartition. Here we discuss the introduction and how to use Spark repartition, along with different examples and their code implementation. You may also have a look at the following articles to learn more –

  1. Spark Versions
  2. Longitudinal Data Analysis