Spark Shell Commands

Article by Priya Pedamkar

Updated March 20, 2023


What are Spark Shell Commands?

Spark Shell Commands are the command-line interfaces used to operate Spark processing. They are useful for running ETL and analytics workloads, including machine learning, on high-volume datasets in very little time. There are mainly three types of Spark shells: spark-shell for Scala, pyspark for Python, and SparkR for the R language. The spark-shell requires Scala and Java to be set up in the environment as a prerequisite. Specific Spark shell commands are available to perform Spark actions such as checking the installed version of Spark and creating and managing Resilient Distributed Datasets (RDDs).
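Each of these shells is launched from its own launcher script in Spark's bin directory. As a quick sketch, assuming that directory is on the PATH:

$ spark-shell   # Scala shell
$ pyspark       # Python shell
$ sparkR        # R shell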


Types of Spark Shell Commands

The various kinds of Spark-shell commands are as follows:

1. To check whether Spark is installed and to find its version, the command below is used. (All shell commands hereafter are shown starting with the symbol "$".)

$ spark-shell

The following output is displayed if Spark is installed:

$ spark-shell
SPARK_MAJOR_VERSION is set to 2, using Spark2
Setting the default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://10.113.59.34:4040
Spark context available as 'sc' (master = local[*], app id = local-1568732886588).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.2.6.3.0-235
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

2. The basic data structure of Spark is the RDD (Resilient Distributed Dataset), an immutable collection of objects used for distributed computation on records. The data of an RDD is partitioned logically across multiple nodes of a cluster.

An RDD can be created by reading data from a file system, by parallelizing an existing collection in the driver program, or by transforming an existing RDD.

a) To create a new RDD we use the following command:

scala> val examplefile = sc.textFile("file.txt")

Here sc is the SparkContext object.

Output:

examplefile: org.apache.spark.rdd.RDD[String] = file.txt MapPartitionsRDD[3] at textFile at <console>:24

b) An RDD can also be created from a parallelized collection as follows:

scala> val oddnum = Array(1, 3, 5, 7, 9)

Output:

oddnum: Array[Int] = Array(1, 3, 5, 7, 9)
scala> val value = sc.parallelize(oddnum)

Output:

value: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:26

c) To create a new RDD by transforming an existing RDD:

scala> val newRDD = value.map(x => (x * 2))

Output:

newRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[5] at map at <console>:27

scala> newRDD.collect

Output:

res0: Array[Int] = Array(2, 6, 10, 14, 18)

3. There are two types of Spark RDD Operations which can be performed on the created datasets:

  • Actions
  • Transformations

Actions: These are used to perform certain required operations on the existing datasets and return a result to the driver. The following are a few commands that can be used to perform actions on the created datasets:

a) count() function to count the number of elements in the RDD:

scala> value.count()

Output:

res3: Long = 5

b) collect() function to display all the elements of the RDD:

scala> value.collect()

Output:

res5: Array[Int] = Array(1, 3, 5, 7, 9)

c) first() function used to display the first element of the dataset:

scala> value.first()

Output:

res4: Int = 1

d) take(n) function displays the first n elements of the RDD:

scala> value.take(3)

Output:

res6: Array[Int] = Array(1, 3, 5)

e) takeSample(withReplacement, num, [seed]) function displays a random sample of "num" elements, where the optional seed is for the random number generator:

scala> value.takeSample(false, 3, System.nanoTime.toInt)

Output:

res8: Array[Int] = Array(3, 1, 7)

f) saveAsTextFile(path) function saves the dataset as text files at the specified path, such as an HDFS location:

scala> value.saveAsTextFile("/user/valuedir")

g) partitions.length function can be used to find the number of partitions in the RDD:

scala> value.partitions.length

Output:

res1: Int = 8
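The count returned here (8 in this run) reflects the default parallelism of the local[*] master, which typically equals the number of available cores. As a minimal sketch, the number of partitions can also be requested explicitly when parallelizing a collection (the variable name value4 below is just an illustration):

scala> val value4 = sc.parallelize(oddnum, 4) // value4 is a hypothetical name; 4 partitions requested
scala> value4.partitions.length

Output:

res2: Int = 4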

RDD Transformations

Transformations are used to form a new RDD from existing ones. Since RDDs are immutable, a transformation does not modify its input; instead, it produces one or more new RDDs as output.

There are two types of transformations:

  • Narrow transformations
  • Wide transformations

Narrow Transformations – Each partition of the child RDD depends on only one partition of the parent RDD, so no shuffling of data across the cluster is required.

Example: map() and filter() are two basic narrow transformations; like all transformations, they are evaluated lazily and are computed only when an action is called.

  • map(func) function applies the function func to each element of the dataset "value" to produce the output RDD.

Example: In this example, we are adding the value 10 to each element of the dataset "value" and displaying the transformed output with the help of the collect function.

scala> val mapfunc = value.map(x => x+10)
mapfunc: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[3] at map at <console>:28

scala> mapfunc.collect
res2: Array[Int] = Array(11, 13, 15, 17, 19)

  • filter(func) function returns a new dataset containing only the elements that satisfy the condition specified by the function.

Example: In this example, we are retrieving all the elements of the dataset "value" except the number 5 and fetching the output via the collect function.

scala> val fill = value.filter(x => x != 5)
fill: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at filter at <console>:28

scala> fill.collect
res8: Array[Int] = Array(1, 3, 7, 9)

Wide Transformations – A single parent RDD partition may be shared by multiple child RDD partitions, so the data has to be shuffled across the cluster.

Example: groupByKey and reduceByKey are examples of wide transformations.

  • groupByKey function groups the values of a key-value dataset according to their keys. This involves a shuffle, since the groupByKey operation collects the data associated with a particular key and stores it as a single key-value pair.

Example: In this example, we are assigning the integers 5 and 6 to the string key "key" and the integer 8 to the key "val", and the results are displayed as key-value pairs in the output.

scala> val data = spark.sparkContext.parallelize(Array(("key",5),("val",8),("key",6)),3)
data: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[13] at parallelize at <console>:23

scala> val group = data.groupByKey().collect()
group: Array[(String, Iterable[Int])] = Array((key,CompactBuffer(5, 6)), (val,CompactBuffer(8)))

scala> group.foreach(println)
(key,CompactBuffer(5, 6))
(val,CompactBuffer(8))

  • reduceByKey function also operates on key-value pairs. It merges the values of each key into a single element by applying the specified function.

Example: In this example, the array "letters" is first parallelized and each letter is mapped to the count 10. reduceByKey then adds the values having the same key and saves the result in the variable value2. The output is then printed using foreach(println).

scala> val letters = Array("A","B","C","D","B","C","E","D")
letters: Array[String] = Array(A, B, C, D, B, C, E, D)

scala> val value2 = spark.sparkContext.parallelize(letters).map(w => (w,10)).reduceByKey(_+_)
value2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[20] at reduceByKey at <console>:25

scala> value2.foreach(println)
(C,20)
(E,10)
(D,20)
(B,20)
(A,10)

Along with the above-mentioned operations, such as partitioning an RDD and performing actions and transformations on it, Spark also supports caching, which is helpful when the same data is needed repeatedly.
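As a minimal sketch of caching, reusing the RDD "value" defined above, cache() marks the RDD to be kept in memory after it is first computed, so that later actions can reuse it instead of recomputing it:

scala> value.cache()   // mark the RDD for in-memory caching
scala> value.count()   // the first action computes and caches the RDD
scala> value.collect() // later actions reuse the cached data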

With the help of all these properties, Apache Spark can process huge volumes of data through both batch and stream processing. Spark's in-memory computation is responsible for its extremely fast application processing. Hence, Spark is a go-to tool thanks to its support for programming in different languages, its ease of use, and its integration capabilities.

Recommended Articles

This is a guide to Spark Shell Commands. Here we discuss the various types of Spark shell commands for different programming languages. You may also look at the following articles to learn more –

  1. Shell Scripting Commands
  2. How to Install Spark
  3. Spark Interview Questions
  4. Spark Commands
  5. Adhoc Testing
  6. Random Number Generator in JavaScript
  7. Guide to the List of Unix Shell Commands
  8. PySpark SQL | Modules and Methods of PySpark SQL
  9. For Loop in Shell Scripting | How for loop works?
  10. Batch Scripting Commands with Examples
  11. Complete Overview of Spark Components
