EDUCBA Logo

EDUCBA

MENUMENU
  • Explore
    • EDUCBA Pro
    • PRO Bundles
    • All Courses
    • All Specializations
  • Blog
  • Enterprise
  • Free Courses
  • All Courses
  • All Specializations
  • Log in
  • Sign Up
Home Data Science Data Science Tutorials Spark Tutorial Spark Shuffle
 

Spark Shuffle

Priya Pedamkar
Article byPriya Pedamkar

Spark Shuffle

Introduction to Spark Shuffle

In Apache Spark, Spark Shuffle describes the procedure in between reduce task and map task. Shuffling refers to the shuffle of data given. This operation is considered the costliest. Parallelising effectively of the spark shuffle operation gives performance output as good for spark jobs. Spark data frames are the partitions of Shuffle operations. The original data frame partitions differ with the number of data frame partitions. The data moving from one partition to the other partition process in order to mat up, aggregate, join, or spread out in other ways is called a shuffle.

 

 

Syntax

The syntax for Shuffle in Spark Architecture:

Watch our Demo Courses and Videos

Valuation, Hadoop, Excel, Mobile Apps, Web Development & many more.

rdd.flatMap { line => line.split(' ') }.map((_, 1)).reduceByKey((x, y) => x + y).collect()

Explanation: This is a Shuffle spark method of partition in FlatMap operation RDD where we create an application of word count where each word separated into a tuple and then gets aggregated to result.

How Spark Architecture Shuffle Works

Spark shuffle1

Data is returned to disk and is transferred all across the network during a shuffle. The shuffle operation number reduction is to be done or consequently reduce the amount of data being shuffled.

By default, Spark shuffle operation uses partitioning of hash to determine which key-value pair shall be sent to which machine.
More shufflings in numbers are not always bad. Memory constraints and other impossibilities can be overcome by shuffling.

In RDD, the below are a few operations and examples of shuffle:
– subtractByKey
– groupBy
– foldByKey
– reduceByKey
– aggregateByKey
– transformations of a join of any type
– distinct
– cogroup

These above Shuffle operations built in a hash table perform the grouping within each task. This is often huge or large. This can be fixed by increasing the parallelism level and the input task is so set to small.

These are a few series in Spark shuffle operation –
One partition – One executor – One core
Four partitions – One executor – Four core
Two partition – Two executor – Two core
Skewed keys.

Examples to Implement Spark Shuffle

Let us look into an example:

Example #1

( customerId: Int, destination: String, price: Double) case class CFFPurchase

Let us sat that we consist of an RDD of user purchase manual of mobile application CFF’s which has been made in the past one month.

Val purchasesRdd: RDD[CFFPurchase] = sc.textFile(…)

Goal: Let us calculate how much money has been spent by each individual person and see how many trips he has made in a month.

Code:

val buyRDD: RDD[ADD_Purchase] = sc.textFile()
// Return an array - Array[(Int, (Int, Double))] // Pair of RDD
//group By Key returns RDD [(K, iterable[V])] val purchasesForAmonth = buyRDD.map( a=> (a.IdOfCustomer, a.cost))
.groupByKey()
.map(p=> (a._1. (a._2.size, a._2.add)))
.collect()

sample1 – sample1.txt:

val Buy = List (ADDPurchase (100, “Lucerne”, 31.60))
(100, “Geneva”, 22.25))
(300, “Basel”, 16.20))
(200, “St. Gallen”, 8.20))
(100, “Fribourg”, 12.40))
(300, “Zurich”, 42.10))

With the data distribution given above, what must the cluster look like?

Output:

Spark shuffle2

Spark shuffle3

Explanation: We have concrete instances of data. To create collections of values to go with each unique key-value pair we have to move key-value pairs across the network. We have to collect all the values for each key on the node that the key is hosted on. In this example, we have assumed that three nodes, each node will be home to one single key, So we put 100, 200, 300 on each of the nodes shown below. Then we move all the key-value pairs so that all purchase by customer number 100 on the first node and purchase by customer number 200 on second node and purchase by customer number 300 on the third node and they are all in this value which is a collection together. groupByKey part is where all of the data moves around the network. This operation is considered as Shuffle in Spark Architecture.

Important points to be noted about Shuffle in Spark

1. Spark Shuffle partitions have a static number of shuffle partitions.
2. Shuffle Spark partitions do not change with the size of data.
3. 200 is an overkill for small data, which will lead to lowering the processing due to the schedule overheads.
4. 200 is smaller for large data, and it does not use all the resources effectively present in the cluster.
And to overcome such problems, the shuffling partitions in spark should be done dynamically.

Conclusion

We have seen the concept of Shuffle in Spark Architecture. Shuffle operation is pretty swift and sorting is not at all required. Sometimes no hash table is to be maintained. When included with a map, a small amount of data or files are created on the map side. Random Input-output operations, small amounts are required, most of it is sequential read and writes.

Recommended Articles

This is a guide to Spark Shuffle. Here we discuss introduction to Spark Shuffle, how does it work, example, and important points. You can also go through our other related articles to learn more –

  1. Spark Versions
  2. Spark Stages
  3. Spark Broadcast
  4. Spark Commands
Primary Sidebar
Footer
Follow us!
  • EDUCBA FacebookEDUCBA TwitterEDUCBA LinkedINEDUCBA Instagram
  • EDUCBA YoutubeEDUCBA CourseraEDUCBA Udemy
APPS
EDUCBA Android AppEDUCBA iOS App
Blog
  • Blog
  • Free Tutorials
  • About us
  • Contact us
  • Log in
Courses
  • Enterprise Solutions
  • Free Courses
  • Explore Programs
  • All Courses
  • All in One Bundles
  • Sign up
Email
  • [email protected]

ISO 10004:2018 & ISO 9001:2015 Certified

© 2025 - EDUCBA. ALL RIGHTS RESERVED. THE CERTIFICATION NAMES ARE THE TRADEMARKS OF THEIR RESPECTIVE OWNERS.

EDUCBA

*Please provide your correct email id. Login details for this Free course will be emailed to you
EDUCBA

*Please provide your correct email id. Login details for this Free course will be emailed to you
EDUCBA

*Please provide your correct email id. Login details for this Free course will be emailed to you

Loading . . .
Quiz
Question:

Answer:

Quiz Result
Total QuestionsCorrect AnswersWrong AnswersPercentage

Explore 1000+ varieties of Mock tests View more

EDUCBA
Free Data Science Course

Hadoop, Data Science, Statistics & others

By continuing above step, you agree to our Terms of Use and Privacy Policy.
*Please provide your correct email id. Login details for this Free course will be emailed to you
EDUCBA Login

Forgot Password?

🚀 Limited Time Offer! - 🎁 ENROLL NOW