Spark Parallelize

By Priya Pedamkar

Introduction to Spark Parallelize

Parallelize is a method to create an RDD from an existing collection (e.g., an Array) present in the driver. The elements of the collection are copied to form a distributed dataset on which we can operate in parallel. In this topic, we are going to learn about Spark Parallelize.

Parallelize is one of the three methods of creating an RDD in Spark, the other two being:

  • From an external data-source like a local filesystem, HDFS, Cassandra, etc.
  • By running a transformation operation on an existing RDD.

Syntax:

sc.parallelize(seq: Seq[T], numSlices: Int)


Here, sc is the SparkContext object,

seq is a collection object which is present in the driver program,

numSlices is an optional parameter that denotes the number of partitions that will be created for the dataset. Spark runs one task for each partition, so numSlices governs the number of parallel operations performed on the RDD.
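The effect of numSlices can be checked with getNumPartitions. A minimal sketch (assuming sc is an already-initialized SparkContext; its creation is shown in the next section):

val rddDefault = sc.parallelize(1 to 100)    // numSlices defaults to spark.default.parallelism
val rddSliced = sc.parallelize(1 to 100, 4)  // explicitly request 4 partitions
println(rddDefault.getNumPartitions)         // 1 on a plain "local" master
println(rddSliced.getNumPartitions)          // 4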

How to Use the Method?

In order to use the parallelize() method, the first thing that has to be created is a SparkContext object.

It can be created in the following way:

1. Import the following classes:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

2. Create a SparkConf object:

val conf = new SparkConf().setMaster("local").setAppName("testApp")

Master and AppName are the minimum properties that have to be set in order to run a Spark application.

3. Create a SparkContext object using the SparkConf object created in the above step:

val sc = new SparkContext(conf)

The next step is to create a collection object.

Let’s see some commonly used collections which can be parallelized to form an RDD:

Array: It is a special type of collection in Scala. It is of fixed size and can store elements of the same type. The values stored in an Array are mutable.

An array can be created in the following ways:

var arr=new Array[dataType](size)

After creating the variable arr, we have to insert the values at each index.

var arr=Array(1,2,3,4,5,6)

Here we provide the elements of the array directly; the data type and size are inferred automatically.

Sequence: Sequences are iterable collections of class Iterable. But unlike a generic iterable, a sequence always has a defined order of elements. A sequence can be created by:

var mySeq=Seq(1,2,3,4,5,6)

List: Lists are similar to Arrays in that they can hold elements of only one type, but there are two significant differences: 1) unlike an Array, the elements of a List cannot be modified, and 2) a List represents a linked list.

A list can be created by:

val myList=List(1,2,3,4,5,6)

Now that we have all the required objects, we can call the parallelize() method available on the SparkContext object and pass the collection as the parameter, as in the sketch below.
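A minimal sketch, reusing the arr, mySeq, and myList objects defined above (and the sc object created in step 3):

val arrRdd = sc.parallelize(arr)     // RDD from the Array
val seqRdd = sc.parallelize(mySeq)   // RDD from the Seq
val listRdd = sc.parallelize(myList) // RDD from the List
println(arrRdd.count())              // 6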

Examples of Spark Parallelize

Here are a few examples:

Example #1

Code:

val conf= new SparkConf().setMaster("local").setAppName("test")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")
val rdd1= sc.parallelize(Array(1,2,3,4,5))
println("elements of rdd1")
rdd1.foreach(x=>print(x+","))
println()
val rdd2 = sc.parallelize(List(6,7,8,9,10))
println("elements of rdd2")
rdd2.foreach(x=>print(x+","))
println()
val rdd3=rdd1.union(rdd2)
println("elements of rdd3")
rdd3.foreach(x=>print(x+","))

Output:

[Output screenshot: elements of rdd1, rdd2, and rdd3 printed to the console]

In the above code, we first created the SparkContext object (sc) and then created rdd1 by passing an Array to the parallelize method. Next, we created rdd2 by passing a List, and finally we merged the two RDDs by calling the union method on rdd1 with rdd2 as the argument.
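One caveat: foreach runs on the executors, so on a master with more than one core (e.g. local[4]) the printed order may interleave across partitions. For small RDDs, a common workaround (a sketch, not part of the original example) is to collect the elements to the driver before printing:

// Collect to the driver for deterministic, partition-ordered printing.
// Safe only for RDDs small enough to fit in driver memory.
rdd3.collect().foreach(x => print(x + ","))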

Example #2

Code:

import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setAppName("test").setMaster("local")
val sc = new SparkContext(conf)
val spark = SparkSession.builder().config(conf).getOrCreate()
sc.setLogLevel("ERROR")
val line1 = "live life enjoy detox"
val line2 = "learn apply live motivate"
val line3 = "life detox motivate live learn"
val rdd = sc.parallelize(Array(line1, line2, line3))
val rdd1 = rdd.flatMap(x => x.split(" "))
import spark.implicits._
val df = rdd1.toDF("word")
df.createOrReplaceTempView("tempTable")
val rslt = spark.sql("select word, COUNT(1) from tempTable GROUP BY word")
rslt.show(1000, false)

Output:

[Output screenshot: word counts produced by the query]

The above code is the classic word-count program, implemented here with Spark SQL. To use SQL, we converted rdd1 into a DataFrame by calling the toDF method, which requires importing spark.implicits._. We then registered the DataFrame (df) as a temp table and ran the query on top of it. An RDD-only alternative is sketched below.
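For comparison, here is a sketch of the same word count expressed purely with the RDD API, with no temp table (it assumes rdd1 from the code above):

// Classic map/reduceByKey word count over the RDD of words.
val counts = rdd1.map(word => (word, 1)).reduceByKey(_ + _)
counts.collect().foreach { case (word, n) => println(word + ": " + n) }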

Example #3

Code:

val conf= new SparkConf().setAppName("test").setMaster("local")
val sc =new SparkContext(conf)
sc.setLogLevel("ERROR")
val myRdd=sc.parallelize(Seq(1,1,1,10,10,5,100,100,100,200,400),10)
println("Printing myRdd: ")
myRdd.foreach(x=>print(x+" "))
println()
val newRdd= myRdd.distinct()
println("Printing newRdd: ")
newRdd.foreach(x=>print(x+" "))

Output:

[Output screenshot: elements of myRdd followed by the distinct elements of newRdd]

In the above code, we created an RDD (myRdd) using a sequence and passed numSlices as 10. Then we called the distinct method, which returns the distinct elements in the RDD. As expected, the output prints the distinct elements. The partition count can be verified as shown below.
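As a quick sanity check (a sketch meant to run right after the code above), the numSlices argument should be reflected in the partition count:

println(myRdd.getNumPartitions)   // 10, as requested via numSlices
println(newRdd.getNumPartitions)  // distinct keeps the parent's partition count by default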

Conclusion

In this article, we have learned how to create RDDs using the parallelize() method. This method is mostly used by beginners who are learning Spark for the first time; in a production environment, it is typically used for writing test cases.

Recommended Articles

This is a guide to Spark Parallelize. Here we discuss how to use the Spark Parallelize method, along with examples for better understanding. You may also look at the following articles to learn more –

  1. Spark Functions
  2. Spark Versions
  3. Spark Components
  4. Spark Tools