
Spark Commands

By Priya Pedamkar


Introduction to Spark Commands

Apache Spark is a framework built on top of Hadoop for fast computations. It extends the concept of MapReduce to run tasks efficiently in a cluster-based scenario. Spark commands are written in Scala.

Hadoop can be utilized by Spark in the following ways:


  1. Standalone: Spark is deployed directly on top of Hadoop, and Spark jobs run in parallel on Hadoop and Spark.
  2. Hadoop YARN: Spark runs on YARN without the need for any pre-installation.
  3. Spark in MapReduce (SIMR): Spark in MapReduce is used to launch Spark jobs in addition to standalone deployment. With SIMR, one can start Spark and use its shell without any administrative access.

Components of Spark

Spark comprises the following parts:

  1. Apache Spark Core
  2. Spark SQL
  3. Spark Streaming
  4. MLlib
  5. GraphX

Resilient Distributed Datasets (RDDs) are considered the fundamental data structure of Spark commands. RDDs are immutable and read-only in nature. All computations in Spark are done through transformations and actions on RDDs.


The Spark shell provides a medium for users to interact with Spark’s functionalities. It offers many different commands that can be used to process data in the interactive shell.

Basic Spark Commands

Let’s take a look at some of the basic commands which are given below:

1. To start the Spark shell

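From the Spark installation directory, the shell can typically be started like this (the path is illustrative and depends on where Spark is installed):

./bin/spark-shell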

2. Read a file from the local system

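For example, assuming a file named data.txt:

val data = sc.textFile("data.txt")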

Here “sc” is the Spark context. Assuming “data.txt” is in the home directory, it is read like this; otherwise, one needs to specify the full path.

3. Create RDD through parallelizing

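A minimal sketch (the variable name NewData and the list contents are illustrative):

val NewData = sc.parallelize(List("apple", "orange", "banana", "grapes"))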

NewData is now the RDD.

4. Count Items in RDD

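Continuing with the illustrative NewData RDD from above:

NewData.count()   // res: Long = 4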

5. Collect

This function returns the entire content of the RDD to the driver program. It is helpful for debugging at various steps while writing the program.
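For example:

NewData.collect()   // Array(apple, orange, banana, grapes) returned to the driver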


6. Read first 3 Items from RDD

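Using the same illustrative RDD:

NewData.take(3)   // Array(apple, orange, banana)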

7. Save output/processed data into a text file

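For example (the folder name “output” is illustrative; the call fails if the folder already exists):

NewData.saveAsTextFile("output")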

Here the “output” folder is created in the current path.

Intermediate Spark Commands

Let’s take a look at some of the intermediate commands which are given below:

1. Filter on RDD

Let’s create a new RDD for items that contain “yes”.

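A sketch, assuming data is the RDD read from data.txt earlier (the variable name yesRDD is illustrative):

val yesRDD = data.filter(line => line.contains("yes"))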

The filter transformation needs to be called on the existing RDD to filter on the word “yes”, which will create a new RDD with the new list of items.

2. Chain Operation

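For example, filtering and counting in a single expression (again assuming the data RDD from above):

data.filter(line => line.contains("yes")).count()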

Here the filter transformation and the count action act together; this is called a chain operation.

3. Read the first item from RDD

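For example:

data.first()   // returns the first line of data.txt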

4. Count RDD Partitions

As we know, an RDD is made of multiple partitions, so there is often a need to count the number of partitions, since this helps in tuning and troubleshooting while working with Spark commands.

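One way to check, using the RDD API:

data.partitions.size   // e.g. Int = 2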

By default, the minimum number of partitions is 2.

5. Join

This function joins two tables (where each element is a key-value pair) based on a common key. In a pairwise RDD, the first element is the key and the second element is the value.
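A minimal sketch with two illustrative pairwise RDDs keyed by an employee id:

val employees = sc.parallelize(Seq((1, "Alice"), (2, "Bob")))
val salaries = sc.parallelize(Seq((1, 50000), (2, 60000)))
employees.join(salaries).collect()   // Array((1,(Alice,50000)), (2,(Bob,60000)))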

6. Cache a File

Caching is an optimization technique. Caching an RDD means the RDD will reside in memory, and all future computations on it will be done from memory. This saves disk read time and improves performance; in short, it reduces the time needed to access the data.

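For example, caching the data RDD from earlier:

data.cache()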

However, the data will not be cached just by running the function above, because caching is lazy. This can be verified by visiting the web page:

http://localhost:4040/storage

The RDD will be cached once an action runs on it. For example:

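data.count()   // any action (count is assumed here) triggers evaluation and materializes the cache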

Another function that works similarly to cache() is persist(). persist() gives users the flexibility to pass an argument that controls whether data is cached in memory, on disk, or in off-heap memory. persist() without any argument works the same as cache().

Advanced Spark Commands

Let’s take a look at some of the advanced commands which are given below:

1. Broadcast a variable

A broadcast variable helps the programmer keep a read-only variable cached on every machine in the cluster, rather than shipping a copy of that variable with tasks. This helps reduce communication costs.

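For example (the array values are illustrative):

val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value   // Array(1, 2, 3)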

In short, there are three main features of a broadcast variable:

  1. Immutable
  2. Fit in memory
  3. Distributed over cluster


2. Accumulators

Accumulators are variables that get added to through associative operations. There are many uses for accumulators, such as counters and sums.

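A sketch using the Spark 2.x accumulator API (the accumulator name and values are illustrative):

val accum = sc.longAccumulator("My Accumulator")
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
accum.value   // 10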

The name given to an accumulator in the code can also be seen in the Spark UI.

3. Map

The map function helps iterate over every element in an RDD. The function passed to map is applied to every element of the RDD.

For example, in RDD {1, 2, 3, 4, 6} if we apply “rdd.map(x=>x+2)” we will get the result as (3, 4, 5, 6, 8).
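The same example in the shell:

val rdd = sc.parallelize(Seq(1, 2, 3, 4, 6))
rdd.map(x => x + 2).collect()   // Array(3, 4, 5, 6, 8)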

4. flatMap

flatMap works similarly to map, but while map returns only one element per input, flatMap can return a list of elements. Hence, splitting sentences into words needs flatMap.
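A sketch contrasting the two (the sentences are illustrative):

val lines = sc.parallelize(Seq("spark is fast", "spark is simple"))
lines.map(line => line.split(" ")).collect()       // Array(Array(spark, is, fast), Array(spark, is, simple)) -- one element per input
lines.flatMap(line => line.split(" ")).collect()   // Array(spark, is, fast, spark, is, simple) -- flattened into words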

5. Coalesce

This function helps avoid shuffling data. It merges existing partitions so that less data is shuffled. In this way, we can restrict the number of nodes used in the cluster.
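For example, assuming the data RDD currently has more than two partitions:

val fewer = data.coalesce(2)   // merges existing partitions without a full shuffle
fewer.partitions.size          // Int = 2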

Tips and Tricks to Use Spark Commands

Below are the different tips and tricks of Spark commands:

  1. Beginners of Spark may use the Spark shell. Since Spark is built on Scala, using the Scala Spark shell is a natural choice; however, a Python Spark shell is also available, so those well versed in Python can use that instead.
  2. The Spark shell has a lot of options to manage the resources of the cluster. The command below can help you with that:

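For example (the resource values are illustrative; these are standard spark-shell options):

./bin/spark-shell --master yarn --num-executors 4 --executor-cores 2 --executor-memory 4g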

  3. In Spark, working with long datasets is the usual thing, but things go wrong when bad input is taken. It’s always a good idea to drop bad rows using Spark’s filter function; a clean set of input rows will make the job run smoothly.
  4. Spark chooses a good partitioning on its own for your data, but it’s always good practice to keep an eye on partitions before you start your job. Trying out different partition counts will help with the parallelism of your job.

Conclusion

Spark is a revolutionary and versatile big data engine that can handle batch processing, real-time processing, caching of data, etc. Spark has a rich set of machine learning libraries that enable data scientists and analytical organizations to build strong, interactive, and fast applications.

Recommended Articles

This has been a guide to Spark commands. Here we have discussed basic, intermediate, and advanced Spark commands, along with tips and tricks for using them effectively. You may also look at the following articles to learn more –

  1. Types of Joins in Spark SQL (Examples)
  2. Spark Components | Overview and Top 6 Components
  3. Spark Tools
  4. Spark Versions
