Spark DataFrame

By Priya Pedamkar

Introduction to Spark DataFrame

A Spark DataFrame is a distributed collection of data organized into named columns. It provides operations such as filtering, aggregation, and grouping, and it can be used with Spark SQL. DataFrames can be created from structured data files, existing RDDs, external databases, and Hive tables. A DataFrame is essentially an abstraction layer built on top of RDDs; it was followed by the Dataset API, introduced in later versions of Spark (2.0+). Datasets were introduced only in the Scala (and Java) API, not in PySpark, whereas DataFrames are available in both. DataFrames, popularly known as DFs, are logical columnar formats that make working with RDDs easier and more convenient, while exposing the same kinds of operations. Conceptually, a DataFrame is equivalent to a relational table, but with richer optimization features and techniques behind it.

How to Create a DataFrame?

A DataFrame can be created by any one of the following methods: from Hive tables, external databases, structured data files, or existing RDDs. All of these ways produce the named columns known as DataFrames, which are used for processing in Apache Spark. Applications create DataFrames using a SQLContext or a SparkSession.
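
As an illustration, here is a minimal sketch of creating a DataFrame from an in-memory collection through a SparkSession; the application name and the student records are made up for the example:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DataFrameExample") // hypothetical application name
  .master("local[*]")          // run locally for the example
  .getOrCreate()

// toDF requires the implicits of the active session to be in scope
import spark.implicits._

// Build a DataFrame from an in-memory sequence; the name and age
// columns mirror the student.json examples used below
val students = Seq(("Ankit", 25), ("Riya", 22)).toDF("name", "age")
students.show()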

Spark DataFrame Operations

In Spark, a DataFrame is a distributed collection of data organized into named columns. It is equivalent to a table in a relational database or a data frame in a language such as R or Python, but with a richer level of optimizations under the hood. DataFrames provide a domain-specific language for structured data manipulation.

Listed below are some basic operations for structured data processing using DataFrames.

1. Reading a JSON document: We make use of the command sqlContext.read.json.

Example: Suppose our file is named student.json; then our code will look like this:

val dfs = sqlContext.read.json("student.json")

Output: The field names are taken automatically from the file student.json.

2. Showing the data: To see the data in a Spark DataFrame, use the command:

dfs.show()

Example: Suppose our file is named student.json; then our code will look like this:

val dfs = sqlContext.read.json("student.json")
dfs.show()

Output: The student data is displayed in a tabular format.

3. Using the printSchema method: If you want to see the structure, i.e. the schema, of the DataFrame, use the following command: dfs.printSchema()

Example: Suppose our file is named student.json; then our code will look like this:

val dfs = sqlContext.read.json("student.json")
dfs.printSchema()

Output: The structure, i.e. the schema, is displayed.

4. Using the select method: The select method is used to fetch the values of a column from the DataFrame.

dfs.select("column-name").show()

Example: Suppose our file is named student.json; then our code will look like this:

val dfs = sqlContext.read.json("student.json")
dfs.select("name").show()

Output: The values of the name column are displayed.
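
select also accepts column expressions, not just column names. A small sketch, assuming the same name and age fields in student.json; the derived column label is made up:

val dfs = sqlContext.read.json("student.json")
// Select the name column alongside a derived column (age plus one),
// labelled via alias
dfs.select(dfs("name"), (dfs("age") + 1).alias("age_next_year")).show()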

5. Using a filter on age: The following command finds the students whose age is greater than 23.

dfs.filter(dfs("column-name") > value).show()

Example: Suppose our file is named student.json; then our code will look like this:

val dfs = sqlContext.read.json("student.json")
dfs.filter(dfs("age") > 23).show()

Output: Only the rows with age greater than 23 appear in the results.

6. Using the groupBy method: The following method counts the number of students who have the same age.

dfs.groupBy("column-name").count().show()

Example: Suppose our file is named student.json; then our code will look like this:

val dfs = sqlContext.read.json("student.json")
dfs.groupBy("age").count().show()

Output: The number of students for each distinct age is displayed.
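
groupBy also composes with other aggregate functions through agg, which lets you name the result columns. A brief sketch, again assuming name and age fields in student.json:

import org.apache.spark.sql.functions.count
val dfs = sqlContext.read.json("student.json")
// Count the names in each age group and label the result column via alias
dfs.groupBy("age").agg(count("name").alias("num_students")).show()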

7. Using the sql function on a SparkSession: This enables the application to execute SQL queries programmatically and returns the result as a DataFrame.
spark.sql(query)

Example: Suppose we have to register the DataFrame as a temporary view; then:

dfs.createOrReplaceTempView("student")
val sqlDF = spark.sql("SELECT * FROM student")
sqlDF.show()

Output: A temporary view named student is created, and spark.sql is applied on top of it to return the result as a DataFrame.

8. Using the sql function on a SparkSession for a global temporary view: This also enables the application to execute SQL queries programmatically and returns the result as a DataFrame; unlike an ordinary temporary view, a global temporary view lives in the global_temp database and is shared across sessions.
spark.sql(query)

Example: Suppose we have to register the DataFrame as a global temporary view; then:

dfs.createGlobalTempView("student")
spark.sql("SELECT * FROM global_temp.student").show()
spark.newSession().sql("SELECT * FROM global_temp.student").show()

Output: A global temporary view named student is created under global_temp, and it remains visible even from a new Spark session.

Advantages of Spark DataFrame

  1. A DataFrame is a distributed collection of data, organized into named columns.
  2. DataFrames are similar to tables in relational databases and come with a rich set of optimizations.
  3. DataFrames power queries written in SQL as well as the DataFrame API.
  4. They can be used to process structured as well as semi-structured data.
  5. The Catalyst optimizer makes optimization easy and effective (see the sketch after this list).
  6. DataFrame libraries are available in many languages, such as Python, Scala, Java, and R.
  7. DataFrames provide strong compatibility with Hive and can run unmodified Hive queries on an existing Hive warehouse.
  8. They scale very well, from a few kilobytes on a personal system to many petabytes on large clusters.
  9. They integrate easily with other big data technologies and frameworks.
  10. The abstraction they provide over RDDs is efficient and makes processing faster.
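
To see the Catalyst optimizer at work, the explain method prints the plans Spark derives for a query. A minimal sketch, assuming the same student.json file as in the examples above:

val dfs = sqlContext.read.json("student.json")
// explain(true) prints the parsed, analyzed, and optimized logical plans
// as well as the physical plan chosen by the Catalyst optimizer
dfs.filter(dfs("age") > 23).select("name").explain(true)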

Conclusion – Spark DataFrame

In this post, you have learned about a very important feature of Apache Spark: DataFrames, their use in today's applications, and their operations and advantages. We hope you have liked our article. Stay tuned for more articles like this.

Recommended Articles

This has been a guide to Spark DataFrame. Here we discuss the steps to create a DataFrame, its advantages, and the different operations on DataFrames, along with appropriate sample code. You can also go through our other suggested articles to learn more –

  1. Spark Streaming
  2. How to Install Spark
  3. Career in Spark
  4. Spark Interview Questions
  5. Data Frames in R
  6. 7 Different Types of Joins in Spark SQL (Examples)
  7. PySpark SQL | Modules and Methods of PySpark SQL
  8. Spark Components | Overview of Components of Spark
  9. Complete Guide to Top Spark Tools