Spark Accumulator

By Priya Pedamkar


Introduction to Spark Accumulator

Apache Spark uses shared variables: when the driver sends a task to the executors on the cluster, each node receives a copy of these variables. Spark supports two basic types of shared variables, accumulators and broadcast variables. Apache Spark itself is a widely used, open-source cluster-computing framework that comes with machine-learning, streaming, and graph-processing APIs. Accumulators are variables that are only added to through an associative operation; implementing sums and counters is a typical accumulator task, among many others. Spark supports accumulators of numeric types out of the box, and programmers can add support for other types.
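
As a minimal sketch of adding support for a non-numeric type (assuming an existing SparkContext named sc and the classic Spark 1.x accumulator API used throughout this page), the following accumulates a set of strings, for example to collect the distinct error codes seen across all executors:

Code:

import org.apache.spark.AccumulatorParam

// Defines what an "empty" value looks like and how two partial results are merged.
object StringSetParam extends AccumulatorParam[Set[String]] {
  def zero(initial: Set[String]): Set[String] = Set.empty[String]
  def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = s1 ++ s2
}

// sc is an existing SparkContext.
val errorCodes = sc.accumulator(Set.empty[String])(StringSetParam)
sc.parallelize(Seq("404", "500", "404")).foreach(code => errorCodes += Set(code))
println(errorCodes.value) // e.g. Set(404, 500)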

Syntax:

val acc = sc.accumulator(v)

The accumulator acc is created from the SparkContext with an initial value v; v is most often zero, as when performing a sum or a count operation. PySpark exposes the equivalent functionality through its Accumulator class.


Why do we Use Spark Accumulator?

We use a Spark accumulator when we want to perform commutative or associative operations on the data. Accumulators can be created with or without a name; named accumulators are shown in the Spark UI, which is useful for understanding the progress of running stages. An accumulator with initial value v is created by calling SparkContext.accumulator(v), much as a broadcast variable is created from the SparkContext. Accumulators are used to implement sums and counters, as in MapReduce. Named accumulators, however, are not yet supported in Python.
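
As a quick sketch (assuming an existing SparkContext sc), an accumulator can be created with or without a name; only the named one appears in the Spark UI. The example that follows then shows what happens when a plain local variable is used instead of an accumulator.

Code:

// Unnamed accumulator: works the same, but is not displayed in the Spark UI.
val unnamed = sc.accumulator(0)

// Named accumulator: shows up in the Spark UI, which helps track the progress of running stages.
val blankLines = sc.accumulator(0, "blank lines")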


Code:

package org.spark.accumulator.crowd.now.aggregator.sample2

var lalaLines: Int = 0
sc.textFile("some log file", 4)
  .foreach { line =>
    if (line.length() == 0) lalaLines += 1
  }
println(s"Lala Lines are from the above code = $lalaLines")

In the above code, the value printed for blank lines will be zero. Because Spark ships the code to each executor, the variable becomes local to that executor, and its updated value is never sent back to the driver. Making the blank-line counter an accumulator solves this problem, since updates made on every executor are relayed back to the driver.

Rewritten with an accumulator, the code looks like this:

Code:

package org.spark.accumulator.crowd.now.aggregator.sample2

val lalaLines = sc.accumulator(0, "lala Lines")
sc.textFile("some log file", 4)
  .foreach { line =>
    if (line.length() == 0) lalaLines += 1
  }
println(s"\tlala Lines are from the above code = ${lalaLines.value}")

This code makes sure that the lalaLines accumulator is updated on every executor and that its value is relayed back to the driver.
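
Note that sc.accumulator is the classic API; from Spark 2.0 onward it is deprecated in favour of SparkContext.longAccumulator and doubleAccumulator. A minimal sketch of the same blank-line counter on the newer API (the file path is only a placeholder):

Code:

// Spark 2.x style: a named LongAccumulator registered with the SparkContext.
val blankLines = sc.longAccumulator("lala Lines")

sc.textFile("some log file", 4).foreach { line =>
  if (line.isEmpty) blankLines.add(1)
}

println(s"lala Lines = ${blankLines.value}")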

How Does Spark Accumulator Work?

Broadcast variables allow Spark developers to keep a read-only variable cached on each node, rather than shipping a copy of it with every task. They can be used to give every node a copy of a large input dataset without wasting time on network input and output. Spark distributes broadcast variables using efficient broadcast algorithms, which reduces the communication cost.
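
As a short illustration of this broadcast behaviour (a sketch, assuming an existing SparkContext sc):

Code:

// Ship a read-only lookup table to every node once, instead of with every task.
val statusNames = sc.broadcast(Map(200 -> "OK", 404 -> "Not Found", 500 -> "Server Error"))

val codes = sc.parallelize(Seq(200, 404, 200, 500))
val labelled = codes.map(code => statusNames.value.getOrElse(code, "Unknown")).collect()
// labelled: Array(OK, Not Found, OK, Server Error)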

Spark actions are executed through a series of stages, separated by shuffle operations. Within each stage, Spark automatically broadcasts the common data that tasks need; this data is cached in serialized form and deserialized on each node before each task runs. Explicitly creating broadcast variables is therefore useful only when tasks across multiple stages need the same data.

As mentioned above, an accumulator is created by wrapping the initial value in the SparkContext.accumulator function; the code for it is:

Code:

val accum = sc.accumulator(0)
accum: spark.Accumulator[Int] = 0
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
accum.value
res2: Int = 10

package org.spark.accumulator.crowd.now.aggregator.sample2

import org.apache.spark.{SparkConf, SparkContext}

object Boot {
  def main(args: Array[String]): Unit = {
    val sparkConfiguration = new SparkConf()
      .setMaster("local[4]")
      .setAppName("Analyzer Spark")
    val sparkContext = new SparkContext(sparkConfiguration)
    // Broadcast the list of known HTTP status codes to every executor
    // (populateHttpStatusList is assumed to be defined elsewhere).
    val httpStatusList = sparkContext.broadcast(populateHttpStatusList)
    // One named accumulator per HTTP status class.
    val Info_http = sparkContext.accumulator(0, "HTTP a")
    val Success_http = sparkContext.accumulator(0, "HTTP b")
    val Redirect_http = sparkContext.accumulator(0, "HTTP c")
    val ClientError_http = sparkContext.accumulator(0, "HTTP d")
    val ServerError_http = sparkContext.accumulator(0, "HTTP e")
    // Read the access log and update the accumulators (parsing elided).
    val logLines = sparkContext.textFile(getClass.getResource("log").getPath)
    println("THE_START")
    println("HttpStatusCodes are going to be printed in result from access the log parse")
    println("Http Status Info")
    println("Status Success")
    println("Status Redirect")
    println("Client Error")
    println("Server Error")
    println("THE_END")
    sparkContext.stop()
  }
}

The data placed in a broadcast variable is accessed through its value method, which returns the broadcast content. The same value accessor is used to read an accumulator back on the driver, as in the following PySpark example:

accum = sc.accumulator(0)
rdd = sc.parallelize([1, 2, 3, 4])

def f(x):
    global accum
    accum += x

rdd.foreach(f)
print(accum.value)  # 10

The command line for submitting the accumulator example is given below:

$SPARK_HOME/bin/spark-submit accumulator.py

Output:

10

Advantages and Uses of Spark Accumulator

  1. Memory access is direct.
  2. Little garbage-collection overhead during processing.
  3. A compact, columnar in-memory format.
  4. Catalyst query optimization.
  5. Whole-stage code generation.
  6. Compile-time type safety, an advantage Datasets have over DataFrames.

Conclusion

We have seen the concept of the Spark accumulator. Spark uses shared variables for parallel processing. Accumulator variables are used to aggregate information through commutative and associative operations; in MapReduce terms, they serve the same purpose as counters, for example when summing values or counting records. In Spark these variables are mutable, but their value can be read only by the driver program, not by the executors. This is similar to counters in Java MapReduce.

Recommended Articles

This is a guide to Spark Accumulator. Here we discuss the introduction to Spark Accumulator, how it works, and its advantages and uses. You can also go through our other suggested articles to learn more –

  1. JavaScript Math Functions (Examples)
  2. Top 9 Types of Java Compilers
  3. Android Architecture | What is Android Architecture?
  4. Top 9 Android Ad Blocker