Spark Shell Commands

By Priya Pedamkar


What are Spark Shell Commands?

Spark Shell commands are command-line interfaces used to drive Spark processing interactively. They are useful for running ETL, analytics and machine learning workloads on high-volume datasets in very little time. There are three main shells: spark-shell for Scala, pyspark for Python and SparkR for R. The Scala shell requires Scala and Java to be set up on the environment as a prerequisite. Specific shell commands are available for common Spark tasks, such as checking the installed version of Spark and creating and managing Resilient Distributed Datasets (RDDs).
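Each of these shells is started from the command line. A minimal sketch, assuming the Spark bin directory is on the PATH:

$ spark-shell   # Scala shell
$ pyspark       # Python shell
$ sparkR        # R shell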

Types of Spark Shell Commands

The various kinds of Spark-shell commands are as follows:

1. To check whether Spark is installed and to find its version, use the command below (all shell commands hereafter are shown prefixed with the symbol “$”):

$ spark-shell

The following output is displayed if Spark is installed:


$ spark-shell

SPARK_MAJOR_VERSION is set to 2, using Spark2

Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

Spark context Web UI available at http://10.113.59.34:4040

Spark context available as 'sc' (master = local[*], app id = local-1568732886588).

Spark session available as 'spark'.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.2.6.3.0-235
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)

Type in expressions to have them evaluated.

Type :help for more information.

scala>
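If only the version is needed, a full interactive session is not required. A minimal sketch, using commands that ship with the Spark distribution:

$ spark-submit --version    # prints the Spark and Scala versions and exits

Inside a running shell, the same information is available from the Spark context or session:

scala> sc.version      // returns the installed Spark version as a string
scala> spark.version   // same, via the SparkSession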

2. The basic data structure of Spark is the RDD (Resilient Distributed Dataset), an immutable collection of objects that can be computed on in parallel. The data of an RDD is partitioned logically across multiple nodes of a cluster.

An RDD can be created by reading data from a file system, by parallelizing an existing collection in the driver program, or by transforming an existing RDD.

a) To create a new RDD from a text file, we use the following command:

scala> val examplefile = sc.textFile("file.txt")

Here sc is the SparkContext object that the shell creates automatically.

Output:

examplefile: org.apache.spark.rdd.RDD[String] = file.txt MapPartitionsRDD[3] at textFile at <console>:24
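Note that textFile is lazy: the file is not actually read until an action is invoked on the RDD. A quick sanity check, assuming file.txt exists in the working directory, is to count its lines:

scala> examplefile.count()   // returns the number of lines in file.txt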

b) An RDD can also be created by parallelizing a collection, as follows:

scala> val oddnum = Array(1, 3, 5, 7, 9)

Output:

oddnum: Array[Int] = Array(1, 3, 5, 7, 9)
scala> val value = sc.parallelize(oddnum)

Output:

value: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:26

c) To create an RDD from an existing RDD, apply a transformation to it. Here map doubles each element of the RDD value created above:

scala> val newRDD = value.map(x => x * 2)

scala> newRDD.collect()

Output:

res2: Array[Int] = Array(2, 6, 10, 14, 18)
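The parallelize call used in example (b) also accepts an explicit number of partitions as a second argument. A minimal sketch, reusing the oddnum array and an illustrative variable name value3:

scala> val value3 = sc.parallelize(oddnum, 2)   // distribute the collection across 2 partitions
scala> value3.partitions.length                 // returns 2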

3. There are two types of Spark RDD operations that can be performed on the created datasets:

  • Actions
  • Transformations

Actions: An action computes a result from a dataset and returns it to the driver program (or writes it to storage). Following are a few of the commands that can be used to perform actions on the created datasets:

a) count() function counts the number of elements in the RDD:

scala> value.count()

Output:

res3: Long = 5

b) collect() function returns all the elements of the RDD to the driver as an array:

scala> value.collect()

Output:

res5: Array[Int] = Array(1, 3, 5, 7, 9)

c) first() function returns the first element of the dataset:

scala> value.first()

Output:

res4: Int = 1

d) take(n) function returns the first n elements of the RDD:

scala> value.take(3)

Output:

res6: Array[Int] = Array(1, 3, 5)

e) takeSample(withReplacement, num, [seed]) function returns a random sample of “num” elements, where the optional seed initializes the random number generator:

scala> value.takeSample(false, 3, System.nanoTime.toInt)

Output:

res8: Array[Int] = Array(3, 1, 7)

f) saveAsTextFile(path) function saves the dataset as text files at the specified path, for example an HDFS location:

scala> value.saveAsTextFile("/user/valuedir")
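The target directory must not already exist; Spark writes one part file per partition of the RDD. A minimal sketch for reading the saved output back, assuming the same path and an illustrative variable name reloaded:

scala> val reloaded = sc.textFile("/user/valuedir")   // yields an RDD[String]
scala> reloaded.collect()                             // the saved numbers come back as strings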

g) partitions.length can be used to find the number of partitions in the RDD:

scala> value.partitions.length

Output:

res1: Int = 8
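Another commonly used action is reduce(func), which aggregates all elements of the RDD with the given function. A minimal sketch on the same value RDD:

scala> value.reduce(_ + _)   // sums 1 + 3 + 5 + 7 + 9 and returns 25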

RDD Transformations

A transformation forms a new RDD from an existing one. Since RDDs are immutable, the input is never modified; the transformation instead produces one or more new RDDs as output.

There are two types of transformations:

  • Narrow transformations
  • Wide transformations

Narrow Transformations – Each partition of the parent RDD is used by at most one partition of the child RDD, so no shuffle of data across the cluster is required.

Example: map() and filter() are two basic narrow transformations; like all transformations, they are executed only when an action is called.

  • map(func) function applies the function func to each of the elements in the dataset “value” to produce the output RDD.

Example: In this example, we add the value 10 to each of the elements of the dataset value and display the transformed output with the help of the collect function.

scala> val mapfunc = value.map(x => x+10)
mapfunc: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[3] at map at <console>:28

scala> mapfunc.collect
res2: Array[Int] = Array(11, 13, 15, 17, 19)

  • filter(func) function returns a new dataset containing only the elements that satisfy the condition specified by the function.

Example: In this example, we retrieve all the elements of the dataset “value” except the number 5 and fetch the output via the collect function.

scala> val fill = value.filter(x => x != 5)
fill: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at filter at <console>:28

scala> fill.collect
res8: Array[Int] = Array(1, 3, 7, 9)

Wide Transformations – A single partition of the parent RDD may be used by multiple partitions of the child RDD, so the data has to be shuffled across the cluster.

Example: groupByKey and reduceByKey are examples of wide transformations.

  • groupByKey function groups the values of a key-value RDD by key, collecting all values that share the same key into a single key-value pair. This involves a shuffle, because the data associated with a particular key has to be gathered from different partitions and stored together.

Example: In this example, the integers 5 and 6 are assigned to the key “key” and the integer 8 to the key “val”; after grouping, the values belonging to each key are displayed together in the output.

scala> val data = spark.sparkContext.parallelize(Array(("key",5),("val",8),("key",6)),3)
data: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[13] at parallelize at <console>:23

scala> val group = data.groupByKey().collect()
group: Array[(String, Iterable[Int])] = Array((key,CompactBuffer(5, 6)), (val,CompactBuffer(8)))

scala> group.foreach(println)
(key,CompactBuffer(5, 6))
(val,CompactBuffer(8))

  • reduceByKey function also operates on key-value pairs. It merges the values of each key into a single value by repeatedly applying the given function.

Example: In this example, the array “letters” is first parallelized and each letter is mapped to a count of 10. reduceByKey then adds the values that share the same key and stores the result in the variable value2. The output is then displayed using foreach(println).

scala> val letters = Array("A","B","C","D","B","C","E","D")
letters: Array[String] = Array(A, B, C, D, B, C, E, D)

scala> val value2 = spark.sparkContext.parallelize(letters).map(w => (w,10)).reduceByKey(_+_)
value2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[20] at reduceByKey at <console>:25

scala> value2.foreach(println)
(C,20)
(E,10)
(D,20)
(B,20)
(A,10)
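For comparison, the same totals can be produced with groupByKey followed by summing each group, but reduceByKey is generally preferred because it combines values on each partition before the shuffle, moving less data across the network. A minimal sketch, reusing the letters array and an illustrative variable name viaGroup:

scala> val viaGroup = spark.sparkContext.parallelize(letters).map(w => (w, 10)).groupByKey().mapValues(_.sum)
scala> viaGroup.foreach(println)   // prints the same (letter, total) pairs, possibly in a different order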

Along with the operations shown above, such as partitioning RDDs and performing actions and transformations on them, Spark also supports caching, which is helpful when the same data is needed repeatedly.
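A minimal caching sketch, reusing the value RDD: cache() marks the RDD to be kept in memory, so after the first action computes it, subsequent actions reuse the cached partitions instead of recomputing the lineage.

scala> value.cache()     // mark the RDD to be kept in memory
scala> value.count()     // first action computes and caches the partitions
scala> value.collect()   // subsequent actions read from the cache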

With the help of all these features, Apache Spark can process huge volumes of data in both batch and streaming workloads. Spark's in-memory computation is responsible for its extremely fast processing of applications. Hence, Spark is a popular choice because of its support for programming in different languages, its ease of use and its integration capabilities.

Recommended Articles

This is a guide to Spark Shell Commands. Here we discuss the various types of Spark Shell commands for different programming languages. You may also look at the following articles to learn more –

  1. Shell Scripting Commands
  2. How to Install Spark
  3. Spark Interview Questions
  4. Spark Commands
  5. Adhoc Testing
  6. Random Number Generator in JavaScript
  7. Guide to the List of Unix Shell Commands
  8. PySpark SQL | Modules and Methods of PySpark SQL
  9. For Loop in Shell Scripting | How for loop works?
  10. Batch Scripting Commands with Examples
  11. Complete Overview of Spark Components