Introduction to Spark
Spark is an open-source framework for running analytics applications. It is a data processing engine, hosted at the vendor-independent Apache Software Foundation, built to work on large data sets or big data. It is a general-purpose cluster computing system that provides high-level APIs in Scala, Python, Java, and R. It was developed to overcome the limitations of Hadoop's MapReduce paradigm. Spark is often cited as running certain workloads up to 100 times faster than MapReduce because it can cache data in memory, whereas MapReduce relies heavily on reading and writing to disk. This in-memory processing is what makes it so powerful and fast.
Spark does not have its own file system. It processes data from diverse sources such as the Hadoop Distributed File System (HDFS), Amazon S3, Apache Cassandra, MongoDB, Alluxio, and Apache Hive. It can run on Hadoop YARN (Yet Another Resource Negotiator), on Mesos, on EC2, on Kubernetes, or in standalone cluster mode. It uses RDDs (Resilient Distributed Datasets) to distribute workloads across individual nodes, which makes iterative applications efficient. RDDs also make programming simpler than Hadoop's MapReduce API.
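For example, caching an RDD in memory is a one-line operation. The snippet below is a minimal sketch you could run later in spark-shell once Spark is installed (the data and numbers are arbitrary):

val nums = sc.parallelize(1 to 1000000)       // distribute a local collection as an RDD
val evens = nums.filter(_ % 2 == 0).cache()   // mark the filtered RDD to be cached in memory
println(evens.count())                        // first action computes the RDD and caches it
println(evens.sum())                          // second action reuses the cached data instead of recomputing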
Spark Ecosystem Components
- Spark Core: It is the foundation of a Spark application, on which the other components directly depend. It provides core functionality such as scheduling, distributed task dispatching, in-memory processing, and referencing data in external storage systems.
- Spark Streaming: It is the component that works on live streaming data to provide real-time analytics. The live data is ingested into small, discrete units called batches, which are then executed on Spark Core.
- Spark SQL: It is the component that works on top of Spark Core to run SQL queries on structured or semi-structured data. DataFrames are the main way to interact with Spark SQL (see the short example after this list).
- GraphX: It is the graph computation engine or framework for processing graph data. It provides various graph algorithms to run on Spark.
- MLlib: It is Spark's machine learning library, providing a machine learning framework in a memory-based distributed environment. It performs iterative algorithms efficiently thanks to in-memory data processing.
- SparkR: Spark provides an R package to analyze data sets from an R shell.
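As a quick illustration of the Spark SQL component, here is a minimal sketch of DataFrame usage in spark-shell (the file name people.json and its columns are hypothetical; the spark session object is provided automatically by the shell):

val people = spark.read.json("people.json")   // load a JSON file into a DataFrame (hypothetical file)
people.printSchema()                          // inspect the schema Spark inferred
people.createOrReplaceTempView("people")      // register the DataFrame as a temporary SQL view
spark.sql("SELECT name, age FROM people WHERE age > 30").show()   // run a SQL query over the view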
Three ways to deploy Spark
- Standalone Mode in Apache Spark
- Hadoop YARN / Mesos
- SIMR (Spark in MapReduce)
Let’s see the deployment in Standalone mode.
1. Spark Standalone Mode of Deployment
Step #1: Update the package index
This updates the package index so that the latest versions of packages can be installed on your machine.
Use command:
$ sudo apt-get update
Step #2: Install Java Development Kit (JDK)
This installs the JDK on your machine, which is required to run Java applications.
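The exact package name depends on your distribution; on Ubuntu, the default-jdk package is a common choice (the package name here is an assumption, so pick the JDK version your applications need).
Use command:
$ sudo apt-get install default-jdk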
Step #3: Check if Java has installed properly
Java is a pre-requisite for using or running Apache Spark Applications.
Use command:
$ java -version
The output shows the Java version and confirms that Java is present on the machine.
Step #4: Install Scala on your machine
Since Spark is written in Scala, Scala must be installed to run Spark on your machine.
Use Command:
$ sudo apt-get install scala
Step #5: Verify if Scala is properly installed
This will confirm the successful installation of Scala on your system.
Use Command:
$ scala -version
Step #6: Download Apache Spark
Download Apache Spark according to your Hadoop version from https://spark.apache.org/downloads.html
When you open the link, the download page will appear.
Step #7: Select the package type that matches your Hadoop version and click the download link.
A download page for the selected release will open.
Step #8: Click the download link, and Apache Spark will be downloaded to your system.
Verify if the .tar.gz file is available in the Downloads folder.
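You can also check from a terminal (this assumes the file was saved to the default Downloads directory):
Use Command:
$ ls ~/Downloads | grep spark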
Step #9: Install Apache Spark
For the installation of Spark, the tar file must be extracted.
Use Command:
$ tar xvf spark-2.4.0-bin-hadoop2.7.tgz
Change the version in the command to match the version you downloaded. In this example, we downloaded spark-2.4.0-bin-hadoop2.7.
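Since the next step adds /usr/local/spark/bin to the PATH, it is convenient to move the extracted directory there (the target location is an assumption; adjust both this command and the PATH entry if you keep Spark somewhere else):
Use Command:
$ sudo mv spark-2.4.0-bin-hadoop2.7 /usr/local/spark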
Step #10: Setup environment variable for Apache Spark
Add the following line to your ~/.bashrc file:
export PATH=$PATH:/usr/local/spark/bin
Then reload the file so the change takes effect in the current shell.
Use Command:
$ source ~/.bashrc
Step #11: Verify the installation of Apache Spark
Use Command:
$ spark-shell
If the installation was successful, the Spark shell will start and leave you at a scala> prompt.
This signifies the successful installation of Apache Spark on your machine, with the shell running in Scala.
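As an optional sanity check, you can run a tiny job inside the shell (the numbers are arbitrary; the sum of 1 to 100 should come back as 5050):

scala> sc.parallelize(1 to 100).reduce(_ + _)   // returns 5050 if Spark is working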
2. Deployment of Spark on Hadoop YARN
There are two modes to deploy Apache Spark on Hadoop YARN.
- Cluster mode: In this mode, the Spark driver runs inside an application master process that is managed by YARN on the cluster, so the client can exit after submitting the application.
- Client mode: In this mode, the Spark driver runs in the client process, and the application master is used only to request resources from YARN.
To deploy a Spark application in cluster mode use command:
$ spark-submit --master yarn --deploy-mode cluster mySparkApp.jar
The above command will start a YARN client program which will start the default Application Master.
To deploy a Spark application in client mode use command:
$ spark-submit --master yarn --deploy-mode client mySparkApp.jar
You can run spark-shell in client mode by using the command:
$ spark-shell --master yarn --deploy-mode client
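If you need to control how many resources the application gets on YARN, spark-submit accepts additional flags; the values below are arbitrary examples, and mySparkApp.jar is the placeholder application name used above:
$ spark-submit --master yarn --deploy-mode cluster --num-executors 4 --executor-memory 2g --executor-cores 2 mySparkApp.jar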
Tips and Tricks
- Ensure that Java is installed on your machine before installing Spark.
- If you program in Scala, ensure that Scala is already installed before using Apache Spark.
- You can use Python also instead of Scala for programming in Spark but it must also be pre-installed like Scala.
- You can also run Apache Spark on Windows, but it is suggested to create a virtual machine and install Ubuntu using Oracle VirtualBox or VMware Player.
- Spark can run without Hadoop (i.e. Standalone mode) but if a multi-node setup is required then resource managers like YARN or Mesos are needed.
- When using YARN, it is not necessary to install Spark on every node of the cluster; installing Apache Spark on one node is enough.
- When using YARN, prefer client mode if you are on the same local network as the cluster, and cluster mode if you are submitting from far away.
Recommended Articles
This has been a guide on how to install Spark. Here we have seen how to deploy Apache Spark in standalone mode and on top of the YARN resource manager, along with some tips and tricks for a smooth installation. You may also look at the following articles to learn more –