Spark Broadcast

Article by Priya Pedamkar

Updated February 28, 2023

Introduction to Spark Broadcast

Apache Spark is a widely used open-source cluster computing framework that ships with features such as machine learning, streaming APIs, and graph processing algorithms. Spark uses shared variables: when the driver sends a task to an executor, each node of the cluster receives a copy of the shared variables. Spark supports two basic types of shared variables – accumulators and broadcast variables. Broadcast variables are generally used when the same data is required across several stages. A broadcast variable is created by calling SparkContext.broadcast(v), where v is the value to be broadcast.


Syntax

The code below shows the signature of PySpark's Broadcast class.

class pyspark.Broadcast(
    sc = None,
    value = None,
    pickle_registry = None,
    path = None
)

Explanation: Broadcast variables are used to keep a copy of a large dataset across all nodes. The variable is cached on every machine rather than being shipped to the machines with tasks.

Why is Spark Broadcast Used?

Broadcast variables are used when a huge dataset needs to be cached in the executors. Imagine we need to look up pin codes or zip codes while performing a transformation: it is neither feasible to query the huge database every time nor practicable to ship the large lookup table to the executors with every task. The solution is to convert the lookup table into a broadcast variable, which Spark caches in every executor for future reference.

Note: Once a value has been broadcast to the nodes, it should not be changed; this ensures that every node has an exact copy of the same data. A modified value might later be shipped to a different node, which would give unexpected results.

How does Spark Broadcast Work?

Broadcast variables allow Spark developers to keep a read-only variable cached on each node rather than shipping a copy of it with every task. They can be used to give every node a copy of a large input dataset without wasting time on network input and output. Spark distributes broadcast variables using a variety of efficient broadcast algorithms, which reduces the communication cost.

Spark actions are executed through a set of stages, separated by shuffle operations. Within each stage, Spark automatically broadcasts the common data that tasks need; it is cached in serialized form and deserialized by every node before each task runs. For this reason, explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data. Such a broadcast variable is created by wrapping the value with SparkContext.broadcast.

Code:

val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
broadcastVar.value
res2: Array[Int] = Array(1, 2, 3)
package org.spark.broacast.crowd.now.aggregator.sample2

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.hammerlab.coverage.histogram.JointHistogram.Depth

trait CanDownSampleRDD[V] {
  def rdd: RDD[((Depth, Depth), V)]
  def filtersBroadcast: Broadcast[(Set[Depth], Set[Depth])]

  // Read the broadcast filter sets via .value inside the tasks.
  @transient lazy val filtered =
    (for {
      ((d1, d2), value) <- rdd
      (d1Filter, d2Filter) = filtersBroadcast.value
      if d1Filter(d1) && d2Filter(d2)
    } yield
      (d1, d2) -> value
    )
    .collect
    .sortBy(_._1)
}

A broadcast variable exposes an attribute called value, which stores the user data and is used to access the broadcast value.
from pyspark import SparkContext

sc = SparkContext("local", "Broadcast app")
words_new = sc.broadcast(["scala", "java", "hadoop", "spark", "akka"])
data = words_new.value
print("Stored data -> %s" % (data))
elem = words_new.value[2]
print("Printed element -> %s" % (elem))

The command line to run the above broadcast application is given below:

$SPARK_HOME/bin/spark-submit broadcast.py

Output:

Stored data -> ['scala', 'java', 'hadoop', 'spark', 'akka']
Printed element -> hadoop

Advantages and Uses of Spark Broadcast

The advantages and uses are mentioned below:

  • Direct memory access.
  • Reduced garbage-collection overhead during processing.
  • A compact, columnar memory format.
  • Catalyst query optimization.
  • Whole-stage code generation.
  • The compile-time type safety that Datasets offer over DataFrames.

Conclusion

We have seen the concept of Spark broadcast. Spark uses shared variables for parallel processing. Accumulator variables are used for aggregating information through commutative and associative operations; in a map-reduce job, for example, an accumulator can be used to sum a counter. Broadcast variables, by contrast, are read-only and cached on every node.

Recommended Articles

This is a guide to Spark Broadcast. Here we discuss an introduction to Spark Broadcast, its syntax, why it is used, how it works, and its advantages. You can also go through our other related articles to learn more –

  1. Spark Versions
  2. Spark Commands
  3. PySpark SQL
  4. Spark Stages