Introduction to Apache Spark Architecture
This article explains the architecture of Apache Spark and its components. Spark is a top-level data processing project of the Apache Software Foundation and is widely regarded as an alternative to Hadoop MapReduce.
Spark is an open-source framework with well-defined components that help in analyzing large volumes of data and in distributed computing. Spark is written in Scala, provides APIs in several languages, including R, and has a rich set of libraries that make data processing easier. Apache Spark has its own cluster computing facility for faster accessibility and runs on different platforms.
Apache Spark has a wide range of features: fast in-memory processing, support for multiple programming languages, enhanced caching techniques, and support for SQL queries and machine learning. It runs on platforms such as Microsoft Windows and Linux, and it can be deployed on a single machine or in cluster mode.
Apache Spark Architecture
The architecture of Apache Spark is built from loosely coupled components. Spark follows a master/worker model, and its tasks can run on top of the Hadoop Distributed File System. Apache Spark can make use of Hadoop for data storage and data processing. Spark is an in-memory data processing engine, which lets applications on Hadoop clusters run much faster than purely disk-based processing; in-memory processing avoids much of the overhead of disk I/O. Spark allows heterogeneous jobs to work with the same data. Spark divides its data into partitions, and the size of the split partitions depends on the given data source.
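The idea of splitting a data source into partitions can be sketched in plain Python. This is a conceptual illustration only, not Spark's actual implementation, and the partition count is a made-up parameter:

```python
def partition(records, num_partitions):
    """Split a list of records into roughly equal partitions,
    similar in spirit to how Spark partitions a data source."""
    size = len(records)
    return [
        records[i * size // num_partitions:(i + 1) * size // num_partitions]
        for i in range(num_partitions)
    ]

data = list(range(10))
parts = partition(data, 3)
# Each partition can now be processed independently on a different node.
print(parts)  # → [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

Because each partition is independent, tasks can work on them in parallel without coordinating with each other.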
Below are the two main implementations of Apache Spark Architecture:
1. Resilient Distributed Datasets (RDD)
It provides an API for controlling caching and partitioning and is the fundamental abstraction for data computation. An RDD is immutable, can recompute its elements in case of failures, and acts as the interface to the data. Transformations and actions are the two kinds of operations performed on an RDD.
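The distinction between transformations (lazy, lineage-recording) and actions (which actually run the computation) can be illustrated with a minimal plain-Python sketch. The `FakeRDD` class below is a toy stand-in, not part of Spark:

```python
class FakeRDD:
    """Toy stand-in for an RDD: transformations only record lineage,
    actions actually run the computation."""
    def __init__(self, data, ops=()):
        self._data = data
        self._ops = ops          # recorded lineage of transformations

    def map(self, fn):           # transformation: returns a new, lazy RDD
        return FakeRDD(self._data, self._ops + (("map", fn),))

    def filter(self, pred):      # transformation: nothing computed yet
        return FakeRDD(self._data, self._ops + (("filter", pred),))

    def collect(self):           # action: replay the lineage and materialize
        out = self._data
        for kind, fn in self._ops:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

rdd = FakeRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# Nothing has run yet; collect() triggers the whole computation.
print(rdd.collect())  # → [20, 30, 40]
```

The recorded lineage is also what makes recomputation after a failure possible: a lost result can be rebuilt by replaying the transformations from the source data.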
2. Directed Acyclic Graph (DAG)
It forms a sequence of connections from one node to another. The driver converts the program into a DAG for each job. The Apache Spark ecosystem has various components, such as the core API, Spark SQL, Spark Streaming for real-time processing, MLlib, and GraphX. Some terminology to be learned here: the Spark shell, which helps in working with large volumes of data interactively; the Spark context, which can run or cancel a job; a task (a unit of work); and a job (a computation).
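How a job expressed as a DAG of stages can be executed in dependency order can be sketched in plain Python. The stage names here are hypothetical, and Spark's real DAG scheduler is far more involved:

```python
def topo_order(dag):
    """Return the stages of a DAG in an order where every stage
    runs only after all of the stages it depends on."""
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in dag.get(node, ()):
            visit(dep)
        order.append(node)

    for node in dag:
        visit(node)
    return order

# Hypothetical job: two map stages feed a join, which feeds a save.
dag = {"save": ["join"], "join": ["map_a", "map_b"], "map_a": [], "map_b": []}
print(topo_order(dag))  # → ['map_a', 'map_b', 'join', 'save']
```

The acyclic property is what guarantees such an order exists; a cycle would mean a stage depends, directly or indirectly, on its own output.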
Components of Apache Spark Architecture
The four main components of Spark are given below, and it is necessary to understand them to grasp the complete framework.
- Spark Driver
- Executors
- Cluster Manager
- Worker Nodes
The following diagram shows the architecture and components of Spark:
Fig: Standalone mode of Apache Spark Architecture
The execution flow begins as follows:
1. Spark Driver
The driver's responsibility is to coordinate the tasks and to manage the workers. It is the application's JVM process and is considered the master node. The driver splits the Spark application into tasks and schedules them to execute on executors in the cluster. In the diagram, the driver program invokes the main application and creates a Spark context (which acts as a gateway); together they monitor the job working within the given cluster and connect to the Spark cluster. All the functionalities and commands are issued through the Spark context.
The Spark context is the entry point for each session. The Spark driver has more components to execute jobs in the cluster. Spark clusters connect to different types of cluster managers, and at the same time the context acquires worker nodes to execute and store data. When we execute a process in the cluster, its job is subdivided into stages, and the stages are further divided into scheduled tasks.
2. Executors
An executor is responsible for the execution of a job and stores data in a cache. At the very initial stage, executors register with the driver. An executor has a number of slots to run tasks of the application concurrently. Executors perform read/write operations on external sources. An executor runs a task once it has loaded the data, and executors are removed when they become idle. With dynamic allocation enabled, executors are constantly added and removed depending on the workload. During the execution of the tasks, the executors are monitored by the driver program. Executors run the users' tasks in Java processes.
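The idea of an executor running tasks concurrently in a fixed number of slots can be sketched with Python's standard thread pool. This is a conceptual analogy only; real Spark executors are JVM processes managed by the cluster manager:

```python
from concurrent.futures import ThreadPoolExecutor

def run_on_executor(tasks, num_slots):
    """Run tasks concurrently, at most num_slots at a time,
    like tasks filling an executor's slots."""
    with ThreadPoolExecutor(max_workers=num_slots) as pool:
        return list(pool.map(lambda t: t(), tasks))

# Six small tasks, but only two can occupy a slot at any moment.
tasks = [lambda i=i: i * i for i in range(6)]
print(run_on_executor(tasks, num_slots=2))  # → [0, 1, 4, 9, 16, 25]
```

The slot limit is what the driver's scheduler works against: it hands an executor no more concurrent tasks than it has free slots.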
3. Cluster Manager
It helps in managing the cluster, which has one master and a number of slaves. There are two common types of cluster managers, YARN and standalone; in YARN these roles are handled by the ResourceManager and NodeManagers, while a standalone cluster requires a Spark master and worker nodes for these roles. The responsibility of the cluster manager is to allocate resources and to schedule the execution of the tasks.
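The cluster manager's job of handing out resources can be sketched as a toy allocator. The worker names and slot counts here are hypothetical, and real YARN or standalone allocation also accounts for memory, cores, and data locality:

```python
def allocate(workers, requested):
    """Grant up to `requested` executor slots, taking free slots
    from workers in order; returns (grants, updated_workers)."""
    grants, remaining = [], requested
    updated = dict(workers)
    for name, free in workers.items():
        if remaining == 0:
            break
        take = min(free, remaining)
        if take:
            grants.append((name, take))
            updated[name] = free - take
            remaining -= take
    return grants, updated

workers = {"worker-1": 2, "worker-2": 3}
grants, left = allocate(workers, 4)
print(grants)  # → [('worker-1', 2), ('worker-2', 2)]
print(left)    # → {'worker-1': 0, 'worker-2': 1}
```

Once slots are granted, the driver launches executors on those workers and the remaining free slots stay available for other applications.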
4. Worker Nodes
They are the slave nodes; their main responsibility is to execute the tasks, and their output is returned to the Spark context. They communicate with the master node about the availability of their resources. The Spark context issues the tasks, and the worker nodes execute them. Each worker node is assigned one Spark worker process for monitoring. The computation is made very simple by increasing the number of worker nodes (from 1 to n workers), so that all the tasks are performed in parallel by dividing the job into partitions across multiple systems. A task is a unit of work assigned to one executor; for each partition, Spark runs one task.
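The "one task per partition, results returned to the driver" flow can be sketched in plain Python, with a thread pool standing in for the worker nodes. This is a conceptual analogy, not Spark code:

```python
from concurrent.futures import ThreadPoolExecutor

def sum_partition(part):
    """The task: runs once per partition, on a worker."""
    return sum(part)

def run_job(partitions):
    """Driver side: launch one task per partition in parallel,
    then combine the per-partition results."""
    with ThreadPoolExecutor() as workers:
        partial_sums = list(workers.map(sum_partition, partitions))
    return sum(partial_sums)   # results come back to the 'driver'

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(run_job(partitions))  # → 45
```

Adding more workers lets more partitions be processed at once, which is why increasing the number of worker nodes speeds up the job.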
Therefore, by understanding the Apache Spark architecture, we can see how it makes working with big data easy. Ultimately, we have learned how its components fit together and the roles they play, which is very beneficial for cluster computing and big data technology. Spark computes the desired results in a straightforward way and is preferred for batch processing.
Spark's distinctive features, like Datasets and DataFrames, help optimize the users' code. An important feature like the SQL engine promotes execution speed and makes this software versatile. We have also seen that Spark applications can run locally or be distributed across a cluster. Apache Spark is considered a great complement in a wide range of industries dealing with big data. To sum up, Spark helps in resolving high computational tasks.
This is a guide to Apache Spark Architecture. Here we discussed the introduction to the Apache Spark architecture along with its components and the block diagram of Apache Spark. You can also go through our other suggested articles to learn more.