Introduction to Spark Commands
Apache Spark is a fast cluster-computing framework that can run on top of Hadoop. It extends the MapReduce model to run tasks efficiently across a cluster. Spark itself is written in Scala, and the commands in this guide use the Scala shell.
Spark can make use of Hadoop in the following ways:
- Standalone: Spark is deployed directly on top of Hadoop, and Spark jobs run in parallel with Hadoop jobs on the same cluster.
- Hadoop YARN: Spark runs on YARN without the need for any pre-installation.
- Spark in MapReduce (SIMR): In addition to standalone deployment, SIMR can launch Spark jobs inside MapReduce. With SIMR, one can start Spark and use its shell without any administrative access.
Components of Spark
Spark comprises the following components:
- Apache Spark Core
- Spark SQL
- Spark Streaming
- MLlib
- GraphX
The Resilient Distributed Dataset (RDD) is the fundamental data structure behind Spark commands. RDDs are immutable and read-only in nature; all computation in Spark is expressed as transformations and actions on RDDs.
The Spark shell provides a medium for users to interact with Spark’s functionality, with many commands available for processing data interactively.
Basic Spark Commands
Let’s take a look at some of the basic commands which are given below:
1. To start the Spark shell
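With a standard Spark download, the shell is launched from the Spark home directory (the path is an assumption; adjust for your installation):

```
$ ./bin/spark-shell
```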
2. Read file from local system:
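A minimal sketch (the file name data.txt is illustrative):

```scala
val data = sc.textFile("data.txt")
```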
Here “sc” is the Spark context. “data.txt” is assumed to be in the home directory; otherwise, the full path needs to be specified.
3. Create RDD through parallelizing
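For example, parallelizing a small local collection (the values are arbitrary):

```scala
val NewData = sc.parallelize(List(1, 2, 3, 4, 6))
```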
NewData is now the RDD.
4. Count Items in RDD
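Continuing with the RDD created above:

```scala
NewData.count()
// res: Long = 5
```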
5. Collect
This function returns all of the RDD’s contents to the driver program, which is helpful for debugging at various steps while writing a program.
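For example (note that collect() pulls the entire RDD to the driver, so use it only on small datasets):

```scala
NewData.collect()
// res: Array[Int] = Array(1, 2, 3, 4, 6)
```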
6. Read first 3 Items from RDD
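Using take(n) on the RDD from above:

```scala
NewData.take(3)
// res: Array[Int] = Array(1, 2, 3)
```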
7. Save output/processed data into the text file
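A sketch using the RDD from above:

```scala
NewData.saveAsTextFile("output")
```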
Here the “output” folder is created under the current path.
Intermediate Spark Commands
Let’s take a look at some of the intermediate commands which are given below:
1. Filter on RDD
Let’s create a new RDD containing only the items that include “yes”. The filter transformation is called on the existing RDD, creating a new RDD with the filtered list of items, as shown below.
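A sketch, assuming data is the RDD read from data.txt earlier (the name yesData is arbitrary):

```scala
// Keep only the lines that contain the word "yes"
val yesData = data.filter(line => line.contains("yes"))
```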
2. Chain Operation
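For example, filtering and counting in one expression (again assuming the data RDD from earlier):

```scala
data.filter(line => line.contains("yes")).count()
```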
Here the filter transformation and the count action act together; this is called a chain operation.
3. Read the first item from RDD
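For example, on the data RDD from earlier:

```scala
data.first()
```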
4. Count RDD Partitions
Since an RDD is made of multiple partitions, there is often a need to count the number of partitions; this helps in tuning and troubleshooting while working with Spark commands.
By default, the minimum number of partitions is 2.
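The partition count can be checked like this:

```scala
data.getNumPartitions
```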
5. Join
This function joins two pair RDDs (tables whose elements are key-value pairs) based on a common key. In a pair RDD, the first element is the key and the second element is the value.
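A minimal sketch with two illustrative pair RDDs (the keys and values are made up):

```scala
val ages   = sc.parallelize(Seq(("alice", 25), ("bob", 30)))
val cities = sc.parallelize(Seq(("alice", "Paris"), ("bob", "Berlin")))
ages.join(cities).collect()
// e.g. Array((alice,(25,Paris)), (bob,(30,Berlin)))
```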
6. Cache a File
Caching is an optimization technique. Caching an RDD means the RDD will reside in memory, and all future computation on it will be done from memory, saving disk-read time and improving performance. In short, it reduces the time needed to access the data.
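For example, marking the data RDD from earlier for caching:

```scala
data.cache()
```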
However, the data is not actually cached when you run the above function; caching is lazy. This can be verified on the Storage page of the Spark web UI (http://localhost:4040/storage by default).
The RDD is cached only once an action has run on it.
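For example, running count() materializes the cache:

```scala
data.count()  // the first action computes the RDD and stores it in memory
```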
Another function that works similarly to cache() is persist(). persist() gives users the flexibility to pass an argument that controls whether the data is cached in memory, on disk, or in off-heap memory. persist() without any argument works the same as cache().
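A sketch of persist() with an explicit storage level (MEMORY_AND_DISK is one of several options):

```scala
import org.apache.spark.storage.StorageLevel

// Keep the RDD in memory, spilling to disk if it does not fit
data.persist(StorageLevel.MEMORY_AND_DISK)
```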
Advanced Spark Commands
Let’s take a look at some of the advanced commands which are given below:
1. Broadcast a variable
A broadcast variable helps the programmer keep a read-only variable cached on every machine in the cluster, rather than shipping a copy of that variable with each task. This helps reduce communication costs.
In short, there are three main features of a broadcast variable:
- Immutable
- Fits in memory
- Distributed over the cluster
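A minimal example: sc.broadcast wraps a value, and .value reads it back on any node:

```scala
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value
// res: Array[Int] = Array(1, 2, 3)
```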
2. Accumulators
Accumulators are variables that are added to through an associative operation. They have many uses, such as counters and sums.
The name given to an accumulator in code is also visible in the Spark UI.
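A sketch using the Spark 2.x longAccumulator API; the name "My Accumulator" is arbitrary and is what appears in the Spark UI:

```scala
val accum = sc.longAccumulator("My Accumulator")
sc.parallelize(Seq(1, 2, 3, 4)).foreach(x => accum.add(x))
accum.value
// res: Long = 10
```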
3. Map
The map function iterates over every element in an RDD: the function passed to map is applied to each element, producing a new RDD.
For example, applying rdd.map(x => x + 2) to the RDD {1, 2, 3, 4, 6} gives (3, 4, 5, 6, 8).
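The same example as a runnable snippet:

```scala
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 6))
rdd.map(x => x + 2).collect()
// res: Array[Int] = Array(3, 4, 5, 6, 8)
```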
4. Flatmap
flatMap works similarly to map, but map returns exactly one element per input, whereas flatMap can return a list of zero or more elements. Hence, splitting sentences into words needs flatMap.
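For example, splitting sentences into words (the sentences are illustrative):

```scala
val lines = sc.parallelize(Seq("spark is fast", "rdd is immutable"))
lines.flatMap(line => line.split(" ")).collect()
// res: Array[String] = Array(spark, is, fast, rdd, is, immutable)
```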
5. Coalesce
This function helps avoid shuffling data. It merges data into the existing partitions so that less data is moved across the cluster, which also lets us restrict how many nodes in the cluster are used.
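A sketch, reducing the data RDD to two partitions (the target count is arbitrary):

```scala
val fewer = data.coalesce(2)  // merges existing partitions without a full shuffle
fewer.getNumPartitions
// res: Int = 2
```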
Tips and Tricks to Use Spark Commands
Below are the different tips and tricks of Spark commands:
- Beginners may start with the Spark shell. Since Spark is built on Scala, the Scala Spark shell is a natural choice; however, a Python shell (PySpark) is also available for those who are well versed in Python.
- The Spark shell has many options for managing the resources of the cluster; launch-time flags can help with that (see the sketch after this list).
- In Spark, working with large datasets is the norm, but things go wrong when bad input is taken in. It’s always a good idea to drop bad rows using Spark’s filter function; starting from a clean set of input is a great help.
- Spark chooses a good partitioning on its own for your data, but it’s always good practice to keep an eye on partitions before you start your job. Trying out different partition counts will help you tune the parallelism of your job.
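As referenced above, resources can be set when launching the shell with flags such as these (the values are illustrative):

```
$ ./bin/spark-shell --master yarn --executor-memory 4g --num-executors 10
```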
Conclusion
Spark is a revolutionary and versatile big data engine that can handle batch processing, real-time processing, caching of data, and more. It has a rich set of machine learning libraries that enable data scientists and analytics organizations to build strong, interactive, and fast applications.
This has been a guide to Spark commands. Here we discussed the concepts along with basic, intermediate, and advanced Spark commands, as well as tips and tricks for using them effectively.