EDUCBA

How to Install Spark

Article by Priya Pedamkar

Updated April 4, 2023

How to Install Spark: Introduction

The following article describes in detail how to install Spark on the system.

Spark is an open-source framework for running analytics applications. It is a data processing engine hosted by the vendor-independent Apache Software Foundation and is designed to work with large data sets, or big data. Spark is a general-purpose cluster computing system that provides high-level APIs in Scala, Python, Java, and R. It was developed to overcome the limitations of the MapReduce paradigm in Hadoop. Spark is often cited as running up to 100 times faster than MapReduce for some workloads, because it can cache data in memory, whereas MapReduce works primarily by reading from and writing to disk.

Spark does not have its own file system. Instead, it processes data from diverse data sources, such as Hadoop Distributed File System (HDFS), Amazon’s S3 system, Apache Cassandra, MongoDB, Alluxio, and Apache Hive. It can run on Hadoop YARN (Yet Another Resource Negotiator), on Mesos, on EC2, on Kubernetes, or in standalone cluster mode. It uses RDDs (Resilient Distributed Datasets) to distribute workloads across individual nodes, which suits iterative applications well. RDDs also make programming easier than writing MapReduce jobs in Hadoop.

Spark Ecosystem Components

  • Spark Core: It is the foundation of a Spark application on which the other components directly depend. It provides core services such as scheduling, distributed task dispatching, in-memory processing, and data referencing.
  • Spark Streaming: It is the component that works on live streaming data to provide real-time analytics. The live data is divided into discrete units called batches, which are executed on the Spark Core.
  • Spark SQL: It is the component that works on top of the Spark Core to run SQL queries on structured or semi-structured data. DataFrames are the way to interact with Spark SQL.
  • GraphX: It is the graph computation engine or framework that allows processing graph data. It provides various graph algorithms to run on Spark.
  • MLlib: It contains machine learning algorithms that provide a machine learning framework in a memory-based distributed environment. It performs iterative algorithms efficiently due to the in-memory data processing capability.
  • SparkR: Spark provides an R package to run or analyze datasets using the R shell.

Three ways to deploy/install Spark

  • Standalone Mode in Apache Spark
  • Hadoop YARN/Mesos
  • SIMR (Spark in MapReduce)

Let’s see the deployment in Standalone mode.

1. Spark Standalone Mode of Deployment

Step #1: Update the package index

This refreshes the list of available packages so that the installations in the following steps fetch current versions.

Use command:

$ sudo apt-get update

Step #2: Install Java Development Kit (JDK)

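This installs the JDK on the machine so that Java applications, including Spark, can run.

Use command (assuming Ubuntu’s default-jdk package; any supported JDK package can be substituted):

$ sudo apt-get install default-jdk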

Step #3: Check if Java is installed properly

Java is a prerequisite for using Apache Spark Applications.

Use command:

$ java -version

The command prints the installed Java version, which confirms the presence of Java on the machine.
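For scripted setups, the version check can be automated. The following is a minimal sketch, assuming the first line of the java -version output has the usual form containing version "X.Y.Z"; the function name parse_java_version is hypothetical and not part of Java or Spark. Note that java -version writes to standard error, which is why the usage line redirects it.

```shell
# Hypothetical helper: reads `java -version` style output on stdin and
# prints the quoted version string found on the first line.
parse_java_version() {
  sed -n '1s/.*version "\([^"]*\)".*/\1/p'
}

# Usage (java -version prints to stderr, hence 2>&1):
#   java -version 2>&1 | parse_java_version
```

If nothing is printed, either Java is missing or its output does not match the assumed format.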

Step #4: Install Scala on your machine

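Spark is written in Scala. The pre-built Spark packages bundle the Scala libraries they need, so a separate Scala installation is mainly useful for writing and compiling your own Scala applications.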

Use Command:

$ sudo apt-get install scala

Step #5: Verify if Scala is properly installed

This ensures the successful installation of Scala on the system.

Use Command:

$ scala -version

Step #6: Download Apache Spark

Download the Apache Spark package that matches your Hadoop version from https://spark.apache.org/downloads.html

Opening the link displays the Spark download page.

Step #7: Select the Spark release and the package type that matches your Hadoop version, then click on the download link.

Another page then opens with the actual download link.

Step #8: Click on the download link on that page, and Apache Spark will download to your system.

Verify that the downloaded .tgz archive is available in the ‘Downloads’ folder.
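The check for the downloaded file can also be done from the terminal. The sketch below assumes the archive follows the spark-*.tgz naming used later in this article; the function name find_spark_archives is hypothetical.

```shell
# Hypothetical helper: prints every Spark archive found in the given directory.
find_spark_archives() {
  dir="$1"
  for f in "$dir"/spark-*.tgz; do
    # When nothing matches, the pattern stays unexpanded, so test for existence.
    if [ -e "$f" ]; then
      printf '%s\n' "$f"
    fi
  done
}

# Usage:
#   find_spark_archives "$HOME/Downloads"
```

An empty result means no matching archive is in that folder, for example because the browser saved it elsewhere.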

Step #9: Install Apache Spark

Extract the tar file in order to install Spark.

Use Command:

$ tar xvf spark-2.4.0-bin-hadoop2.7.tgz

Change the version mentioned in the command according to the downloaded version. Here, the downloaded version is spark-2.4.0-bin-hadoop2.7.
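To avoid editing the file name by hand, the version numbers can be kept in one place. This is a minimal sketch, assuming the archive name follows the spark-VERSION-bin-hadoopVERSION.tgz pattern seen above and that the archive sits in the current directory; the function name spark_archive_name is hypothetical.

```shell
# Hypothetical helper: builds the archive name from the Spark and Hadoop versions.
spark_archive_name() {
  spark_version="$1"
  hadoop_version="$2"
  printf 'spark-%s-bin-hadoop%s.tgz\n' "$spark_version" "$hadoop_version"
}

# Versions used in this article; change them to match the downloaded file.
archive="$(spark_archive_name 2.4.0 2.7)"

# Extract only if the archive is present in the current directory.
if [ -f "$archive" ]; then
  tar xvf "$archive"
else
  echo "Archive $archive not found; check the downloaded version." >&2
fi
```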

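Step #10: Set up the environment variable for Apache Spark

Add the following line to the ~/.bashrc file. The path assumes that the extracted Spark directory has been moved to /usr/local/spark; adjust it if Spark is located elsewhere.

Add line: export PATH=$PATH:/usr/local/spark/bin

Then reload the file so that the change takes effect in the current shell.

Use Command:

$ source ~/.bashrc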
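The PATH entry can also be added from a script. The sketch below appends the export line only if it is not already present, so running it twice does not duplicate the entry. The function name add_spark_to_path is hypothetical, and /usr/local/spark in the usage line is the install location assumed by this article.

```shell
# Hypothetical helper: appends the Spark PATH export to a shell startup file
# unless the exact line is already there.
add_spark_to_path() {
  rc_file="$1"
  spark_home="$2"
  # \$PATH keeps the variable unexpanded in the line that gets written.
  line="export PATH=\$PATH:${spark_home}/bin"
  touch "$rc_file"
  if ! grep -qxF "$line" "$rc_file"; then
    printf '%s\n' "$line" >> "$rc_file"
  fi
}

# Usage:
#   add_spark_to_path "$HOME/.bashrc" /usr/local/spark
#   source ~/.bashrc
```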

Step #11: Verify the installation of Apache Spark

Use Command:

$ spark-shell

If the installation is successful, the Spark startup banner is displayed, followed by the scala> prompt.

This indicates the successful installation of Apache Spark on the machine. Now, Apache Spark will start in Scala.

2. Deployment of Spark on Hadoop YARN

There are two modes to deploy Apache Spark on Hadoop YARN:

  1. Cluster mode: In this mode, YARN on the cluster manages the Spark driver that runs inside an application master process. Once the application is initiated, the client can leave.
  2. Client mode: In this mode, the application master requests resources from YARN, and the Spark driver runs in the client process.

To deploy a Spark application in cluster mode, use the following command:

$ spark-submit --master yarn --deploy-mode cluster mySparkApp.jar

This command will start a YARN client program, which will initiate the default Application Master.

To deploy a Spark application in client mode, use the following command:

$ spark-submit --master yarn --deploy-mode client mySparkApp.jar

One can run spark-shell in client mode using the following command:

$ spark-shell --master yarn --deploy-mode client
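Since the two spark-submit invocations differ only in the deploy mode, a small wrapper can build the command and reject any other mode. This is a minimal sketch; the function name yarn_submit_cmd is hypothetical, and it only prints the command rather than running it. The options --master and --deploy-mode are written with two plain hyphens.

```shell
# Hypothetical helper: prints the spark-submit command for a YARN deployment.
# Accepts only the two deploy modes described above.
yarn_submit_cmd() {
  mode="$1"
  app_jar="$2"
  case "$mode" in
    cluster|client) ;;
    *) echo "deploy mode must be 'cluster' or 'client'" >&2; return 1 ;;
  esac
  echo "spark-submit --master yarn --deploy-mode $mode $app_jar"
}

# Usage:
#   yarn_submit_cmd cluster mySparkApp.jar
```

Printing the command first makes it easy to review before running it on a cluster.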

Tips and Tricks

  1. Before installing Spark, ensure that Java is installed on your machine.
  2. If you plan to use the Scala language with Apache Spark, ensure that Scala is also installed on your machine.
  3. Python can also be used for programming with Spark, but it must also be pre-installed like Scala.
  4. While Apache Spark can run on Windows, it is highly recommended to create a virtual machine and install Ubuntu using Oracle VirtualBox or VMware Player.
  5. Spark can run in standalone mode without Hadoop. For a multi-node setup, a cluster manager is needed: Spark’s own standalone manager, YARN, or Mesos.
  6. When using YARN, it is not necessary to install Spark on every node of the cluster. One only needs to install Apache Spark on one node.
  7. When using YARN, one can use client mode if one is on the same local network as the cluster, and cluster mode if one is far away.
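The first three tips can be checked in one pass before starting the installation. The sketch below is a hedged example that reports which commands are not found on the PATH; the function name missing_prereqs is hypothetical, and the command name python3 in the usage line is an assumption about how Python is installed on the machine.

```shell
# Hypothetical helper: prints each given command that is not found on PATH.
missing_prereqs() {
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      printf '%s\n' "$cmd"
    fi
  done
}

# Usage:
#   missing_prereqs java scala python3
```

No output means every listed command was found.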
