EDUCBA

What is RDD?

By Aanchal Singh

Introduction to RDD

A Resilient Distributed Dataset (RDD) is the basic data abstraction in Spark. Each dataset is divided into logical partitions that can be computed on different nodes of the cluster, so RDDs can be operated on in parallel and are fault-tolerant. An RDD can hold Python, Java, or Scala objects, including instances of user-defined classes. Spark relies on RDDs to produce results quickly and efficiently. RDDs can be created in two ways: by parallelizing an existing collection in the driver program, or by referencing a dataset in an external storage system such as HDFS, HBase, or any other data source that offers a Hadoop InputFormat.

To understand what an RDD does, it helps to know the basics of Spark, of which the RDD is a major component. Spark is a data processing engine built for fast, easy analytics. It processes data in memory with the help of RDDs, which means it can cache most of the data in memory. RDDs also help manage the distributed processing of the data and the transformations applied to it. Each dataset in an RDD is first partitioned into logical portions, which can then be computed on different nodes of the cluster.

Understanding

To understand RDDs better, we need to know what sets them apart. Below are a few of the factors that distinguish RDDs.

1. In Memory: This is the most important feature of RDD. The collection of objects that is created is stored in memory rather than on disk. This increases the execution speed of Spark because the data is fetched from memory, so operations on an in-memory RDD do not need to read it from disk again.

2. Lazy Evaluation: Transformations in Spark are lazy. The transformations defined on an RDD are not executed until an action is performed on it. For example, calling the count() action on an RDD triggers the computation and returns a result.

3. Cacheable: Because an RDD is lazily evaluated, every action performed on it has to evaluate the transformations that lead up to it, which means the intermediate RDDs for all of those transformations are created again. To avoid this repeated work, the data can be persisted in memory or on disk.
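
Lazy evaluation and caching can be illustrated with a small toy model in plain Python. This is only a sketch and not Spark's implementation: the class LazyDataset, the helper load_numbers, and the reads counter are hypothetical names introduced here, and the toy keeps its cache only on the dataset that was persisted.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source    # callable that returns the input records
        self._steps = steps      # recorded transformations, not yet applied
        self._persist = False    # set by persist()
        self._cache = None       # filled by the first action after persist()

    def map(self, fn):
        # Transformation: returns a new dataset and computes nothing.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new dataset and computes nothing.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def persist(self):
        # Ask for the computed records to be kept after the next action.
        self._persist = True
        return self

    def _compute(self):
        if self._cache is not None:
            return self._cache
        data = list(self._source())
        for kind, fn in self._steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        if self._persist:
            self._cache = data
        return data

    def count(self):
        # Action: forces the computation and returns a plain number.
        return len(self._compute())

    def collect(self):
        # Action: forces the computation and returns a plain list.
        return list(self._compute())


reads = []  # one entry is appended every time the source is read


def load_numbers():
    reads.append(1)
    return range(10)


evens_squared = (
    LazyDataset(load_numbers)
    .filter(lambda x: x % 2 == 0)
    .map(lambda x: x * x)
    .persist()
)
reads_before_action = len(reads)   # still zero: nothing has run yet
n = evens_squared.count()          # first action reads the source
values = evens_squared.collect()   # second action is served from the cache
reads_after_actions = len(reads)
```

Without the persist() call, the second action in this toy would read the source again, which is the repeated work that caching is meant to avoid.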

How does RDD Make Working So Easy?

RDDs let you work with all your input files like any other variable in your program, which is not possible with MapReduce. These RDDs are automatically distributed across the cluster in partitions. Whenever an action is executed, one task is launched per partition. This encourages parallelism: the more partitions there are, the more parallelism is possible. By default, Spark determines the number of partitions automatically. Once this is done, two kinds of operations can be performed on RDDs: transformations and actions.
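
The idea of one task per partition can be sketched in plain Python with a thread pool standing in for the cluster. The helper names split_into_partitions and run_action are hypothetical and are not part of Spark. The sketch only shows how work on separate partitions can proceed independently and then be combined.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(records), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(records[start:end])
        start = end
    return partitions


def run_action(partitions, task):
    """Launch one task per partition and gather the per-partition results in order."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(task, partitions))


records = list(range(1, 11))
partitions = split_into_partitions(records, 4)
partial_sums = run_action(partitions, sum)   # one sum per partition
total = sum(partial_sums)                    # combine the partial results
```

With four partitions, up to four tasks can run at the same time. With a single partition there would be only one task, however many workers were available.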

What Can You do with RDD?

As mentioned in the previous section, an RDD supports two kinds of operations: transformations and actions. A transformation creates a new dataset from an existing one. Each element of the dataset is passed through a function, and the result is returned as a new RDD.

Actions, on the other hand, return a value to the program. An action runs the computation on the required dataset, and no new dataset is created when it is performed. Actions can therefore be described as RDD operations that return non-RDD values. These values are either returned to the driver or written to an external storage system.

Working with RDD

To work with RDDs efficiently, it is important to follow the steps below. Start by getting the data files, which can be obtained with an import command. The next step is to create the RDD. Data is commonly loaded into an RDD from a file, but an RDD can also be created with the parallelize command. Once this is done, users can start performing different tasks. Transformations include the filter transformation and the map transformation, where map can also be used with pre-defined functions. Different actions can be performed as well, including the collect, count, and take actions. Once the RDD is created and the basic transformations are done, the RDD can be sampled by using the sample transformation and the takeSample action. The sample transformation can be chained with further transformations, while the takeSample action retrieves the sample itself.
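
The steps above can be walked through with ordinary Python lists. This is a sketch under stated assumptions: the functions sample and take_sample are hypothetical stand-ins written for this example, and the code is eager, whereas transformations on a real RDD are lazy. The comments name the RDD operation that each line imitates.

```python
import random

numbers = list(range(1, 101))   # imitates creating an RDD with parallelize

# Transformations: each one builds a new list and leaves `numbers` unchanged.
multiples_of_five = [n for n in numbers if n % 5 == 0]   # filter
doubled = [n * 2 for n in multiples_of_five]             # map

# Actions: each one returns a plain Python value.
all_values = list(doubled)   # collect
how_many = len(doubled)      # count
first_three = doubled[:3]    # take(3)


def sample(records, fraction, seed):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]


def take_sample(records, num, seed):
    """Return exactly `num` records chosen without replacement."""
    return random.Random(seed).sample(records, num)


subset = sample(doubled, fraction=0.5, seed=42)     # size varies around half
exact_four = take_sample(doubled, num=4, seed=42)   # always four records
```

The difference between the two sampling helpers mirrors the distinction in the text: sampling by fraction yields another dataset of varying size that can be transformed further, while taking a fixed-size sample returns the records themselves.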

Advantages

The following are the major properties or advantages:

1. Immutable and Partitioned: All records are partitioned, and the partition is the basic unit of parallelism in an RDD. Each partition is a logical division of the data and is immutable. This helps in achieving consistency of data.

2. Coarse-Grained Operations: These are operations that are applied to all elements of a dataset. To elaborate, when a map, a filter, or a group-by operation is run on a dataset, it is performed on every element in each partition.

3. Transformations and Actions: RDDs can be created only by reading data from stable storage, such as HDFS, or by applying transformations to existing RDDs. Actions can then be performed on them, and their results can be saved separately.

4. Fault Tolerance: This is the major advantage of using RDDs. Because an RDD is built through a set of transformations, those transformations are logged rather than the actual data being changed, so lost data can be recomputed from that log.

5. Persistence: An RDD can be persisted and reused across operations.
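
The fault-tolerance property can be sketched as follows: if the transformations are kept as a log and the source data is never modified, a lost partition can be rebuilt by replaying the log on the matching source partition. The names apply_lineage, build_partitions, and lineage are hypothetical and chosen for this illustration. This is a toy model, not Spark's recovery mechanism.

```python
def apply_lineage(partition, lineage):
    """Replay every logged transformation, in order, on one partition."""
    for fn in lineage:
        partition = fn(partition)
    return partition


def build_partitions(source_partitions, lineage):
    """Compute every partition from the unchanged source data."""
    return [apply_lineage(p, lineage) for p in source_partitions]


# Source data in stable storage, already split into partitions.
source_partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# The log of transformations; the source data itself is never changed.
lineage = [
    lambda part: [x * 10 for x in part],        # map
    lambda part: [x for x in part if x > 20],   # filter
]

computed = build_partitions(source_partitions, lineage)

computed[1] = None   # simulate losing partition 1, for example on a failed node
computed[1] = apply_lineage(source_partitions[1], lineage)   # rebuild only that one
```

Only the lost partition is recomputed. The other partitions are left untouched, which is possible because each partition is immutable and depends only on its own source data in this example.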

Required Skills & Scope

To work with RDDs, you need a basic idea of the Hadoop ecosystem. With that in place, you can easily understand Spark and its concepts. RDDs have a lot of scope, as Spark is one of the emerging technologies. By understanding RDDs, you can learn how huge amounts of data are processed and stored. Because data is the building block of these systems, RDDs are likely to stay relevant.

Why Should We Use RDD?

RDDs are the talk of the town mainly because of the speed with which they process huge amounts of data. RDDs are persistent and fault-tolerant, which keeps the data resilient.

Need for RDD

RDDs are used to perform data operations quickly and efficiently. The in-memory concept helps in getting the data fast, and reusability makes them efficient.

Career Growth

RDDs are widely used in data processing and analytics. Once you learn RDDs, you will be able to work with Spark, which is in high demand in technology these days. This can help you ask for a raise and apply for high-paying jobs.

Conclusion

To conclude, if you want to stay in the data and analytics industry, knowing RDDs is surely a plus point. It will help you work with the latest technologies with agility and efficiency.

© 2022 - EDUCBA. ALL RIGHTS RESERVED. THE CERTIFICATION NAMES ARE THE TRADEMARKS OF THEIR RESPECTIVE OWNERS.
