How to Install Spark

By Priya Pedamkar

Introduction to Spark

Spark is an open-source framework for running analytics applications. It is a data processing engine hosted at the vendor-independent Apache Software Foundation that works on large data sets or big data. It is a general-purpose cluster computing system that provides high-level APIs in Scala, Python, Java, and R. It was developed to overcome the limitations of Hadoop's MapReduce paradigm. Spark can run some workloads up to 100 times faster than MapReduce because it caches data in memory, whereas MapReduce relies on reading from and writing to disk. This in-memory processing makes it both more powerful and faster.

Spark does not have its own file system. It processes data from diverse data sources such as the Hadoop Distributed File System (HDFS), Amazon S3, Apache Cassandra, MongoDB, Alluxio, and Apache Hive. It can run on Hadoop YARN (Yet Another Resource Negotiator), on Mesos, on EC2, on Kubernetes, or in standalone cluster mode. It uses RDDs (Resilient Distributed Datasets) to distribute workloads to individual nodes, which supports iterative applications. Thanks to RDDs, programming is easier than with Hadoop MapReduce.
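
As a minimal sketch of how RDDs are used (typed at the spark-shell prompt that Step #11 below sets up; the values are purely illustrative):

val data = sc.parallelize(Seq(1, 2, 3, 4, 5))   // distribute a local collection as an RDD
val squares = data.map(n => n * n)              // transformation: lazy, runs on the cluster nodes
squares.collect()                               // action: triggers computation, returns Array(1, 4, 9, 16, 25)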

Spark Ecosystem Components

  • Spark Core: The foundation of a Spark application, on which all other components directly depend. It provides a platform for a wide variety of tasks such as scheduling, distributed task dispatching, in-memory processing, and data referencing.
  • Spark Streaming: The component that works on live streaming data to provide real-time analytics. The live data is ingested into discrete units called micro-batches, which are executed on Spark Core.
  • Spark SQL: The component that works on top of Spark Core to run SQL queries on structured or semi-structured data. DataFrames are the main way to interact with Spark SQL (see the sketch after this list).
  • GraphX: The graph computation engine that allows processing of graph data and provides various graph algorithms to run on Spark.
  • MLlib: A machine learning library that provides a machine learning framework in a memory-based distributed environment. It executes iterative algorithms efficiently thanks to in-memory data processing.
  • SparkR: An R package that lets you run and analyze data sets from the R shell.
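
To make the Spark SQL component concrete, here is a minimal sketch in the Scala shell; the file name people.json and the age column are assumptions for illustration, not part of the original article:

val df = spark.read.json("people.json")   // spark is the SparkSession predefined in spark-shell
df.printSchema()                          // inspect the schema inferred from the JSON
df.filter(df("age") > 21).show()          // run a query through the DataFrame API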

Three ways to deploy Spark

  • Standalone Mode in Apache Spark
  • Hadoop YARN / Mesos
  • SIMR (Spark in MapReduce)

Let’s see the deployment in Standalone mode.

1. Spark Standalone Mode of Deployment

Step #1: Update the package index

This updates the package index so that the latest versions of all packages are available on your machine.

Use command:

$ sudo apt-get update
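
On newer Ubuntu releases, the shorter apt front end does the same job:

$ sudo apt update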

Step #2: Install Java Development Kit (JDK)

This installs the JDK on your machine, which is required to run Java applications such as Spark.
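
The original article showed this step only as a screenshot; on Ubuntu a typical choice is the default-jdk package (an assumption; any JDK 8 or later works for Spark 2.4):

$ sudo apt-get install default-jdk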

Step #3: Check that Java is installed properly

Java is a prerequisite for running Apache Spark applications.

Use command:

$ java -version

The output shows the Java version and confirms that Java is present on the machine.

Step #4: Install Scala on your machine

Since Spark is written in Scala, Scala must be installed to run Spark on your machine.

Use Command:

$ sudo apt-get install scala

Step #5: Verify if Scala is properly installed

This verifies that Scala was installed successfully on your system.

Use Command:

$ scala -version

Step #6: Download Apache Spark

Download Apache Spark according to your Hadoop version from https://spark.apache.org/downloads.html

When you open the link, the download page appears.

Step #7: Select the package type that matches your Hadoop version and click the download link.

Another page appears.

Step #8: Click the mirror link, and Apache Spark will be downloaded to your system.

Verify that the downloaded .tgz archive is available in the Downloads folder.
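
Alternatively, the same archive can be fetched from the terminal. The URL below follows the Apache archive layout; adjust it to the version you selected:

$ wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz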

Step #9: Install Apache Spark

For the installation of Spark, the tar file must be extracted.

Use Command:

$ tar xvf spark-2.4.0-bin-hadoop2.7.tgz

Change the version in the command to match your download; in this example, we downloaded spark-2.4.0-bin-hadoop2.7.
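
Note that Step #10 below adds /usr/local/spark/bin to the PATH, so it assumes the extracted directory has been moved to /usr/local/spark. That location is a convention, not a requirement; any path works as long as the PATH entry matches:

$ sudo mv spark-2.4.0-bin-hadoop2.7 /usr/local/spark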

Step #10: Set up the environment variable for Apache Spark

Add the following line to your ~/.bashrc file:

export PATH=$PATH:/usr/local/spark/bin

Then reload the file so the change takes effect:

$ source ~/.bashrc

Step #11: Verify the installation of Apache Spark

Use Command:

$ spark-shell

If the installation was successful, the Spark shell starts and prints its welcome banner with the Spark version.

This signifies the successful installation of Apache Spark on your machine; the shell starts a Scala REPL with Spark preloaded.
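
As a quick sanity check (a sketch; sc is the SparkContext that spark-shell creates automatically), run a small job at the scala> prompt:

sc.parallelize(1 to 100).reduce(_ + _)   // distributed sum of 1..100; should print res0: Int = 5050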

2. Deployment of Spark on Hadoop YARN

There are two modes to deploy Apache Spark on Hadoop YARN.

  1. Cluster mode: YARN manages the Spark driver, which runs inside an application master process on the cluster. The client can disconnect once the application has been submitted.
  2. Client mode: The Spark driver runs in the client process, and the application master is used only to request resources from YARN.

To deploy a Spark application in cluster mode, use the command:

$ spark-submit --master yarn --deploy-mode cluster mySparkApp.jar

The above command starts a YARN client program, which launches the default Application Master.
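
A fuller invocation usually names the application's main class and resources explicitly. The class name com.example.MySparkApp and the executor settings below are illustrative placeholders, not values from the original article:

$ spark-submit --master yarn --deploy-mode cluster \
    --class com.example.MySparkApp \
    --num-executors 4 --executor-memory 2g \
    mySparkApp.jar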

To deploy a Spark application in client mode, use the command:

$ spark-submit --master yarn --deploy-mode client mySparkApp.jar

You can run spark-shell in client mode by using the command:

$ spark-shell --master yarn --deploy-mode client
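
To confirm that a submitted application has registered with YARN, the Hadoop yarn CLI can list the running applications:

$ yarn application -list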

Tips and Tricks

  1. Ensure that Java is installed on your machine before installing Spark.
  2. If you use the Scala language, ensure that Scala is installed before using Apache Spark.
  3. You can also use Python instead of Scala for programming in Spark, but like Scala it must be pre-installed.
  4. You can also run Apache Spark on Windows, but it is suggested to create a virtual machine and install Ubuntu using Oracle VirtualBox or VMware Player.
  5. Spark can run without Hadoop (i.e. in standalone mode), but if a multi-node setup is required, then a resource manager like YARN or Mesos is needed.
  6. When using YARN, it is not necessary to install Spark on every node; you only have to install Apache Spark on one node.
  7. When using YARN, if you are on the same local network as the cluster you can use client mode, whereas if you are far away, cluster mode is more suitable.

Recommended Articles

This has been a guide on how to install Spark. Here we have seen how to deploy Apache Spark in standalone mode and on top of the resource manager YARN, along with some tips and tricks for a smooth installation. You may also look at the following articles to learn more –

  1. How to use Spark Commands
  2. A career in Spark – You Must Try
  3. Differences of Splunk vs Spark
  4. Spark Interview Questions and Answers
  5. Advantages of Spark Streaming
  6. Types of Joins in Spark SQL (Examples)
