EDUCBA

EDUCBA

MENUMENU
  • Free Tutorials
  • Free Courses
  • Certification Courses
  • 360+ Courses All in One Bundle
  • Login

PySpark Union

Home » Data Science » Data Science Tutorials » Spark Tutorial » PySpark Union

PySpark Union

Introduction to PySpark Union

PySpark UNION is a transformation in PySpark that is used to merge two or more data frames in a PySpark application.  The union operation is applied to spark data frames with the same schema and structure. This is a very important condition for the union operation to be performed in any PySpark application. The union operation can be carried out with two or more PySpark data frames and can be used to combine the data frame to get the defined result. It returns a new Spark Data Frame that contains the union of rows of the data frames used.

Syntax:

Start Your Free Data Science Course

Hadoop, Data Science, Statistics & others

The syntax  for the PYSPARK UNION  function is:

Df  = df1.union(df2)
Df = DataFrame post union.
Df1 = DataFrame1 used for union operation.
Df2 = DataFrame2 used for union operation.
.union :- The union transformation
c=a.union(b).show()

Output:

PySpark Union Example

Working of UnionIN PySpark

Let us see how the UNION function works in PySpark:

  • The Union is a transformation in Spark that is used to work with multiple data frames in Spark.  It takes the data frame as the input and the return type is a new data frame containing the elements that are in data frame1 as well as in data frame2.
  • This transformation takes out all the elements whether its duplicate or not and appends them making them into a single data frame for further operational purposes.
  • We can also apply the union operation to more than one data frame in a spark application. The physical plan for union shows that the shuffle stage is represented by the Exchange node from all the columns involved in the union and is applied to each and every element in the data Frame.

Example of PySpark Union

Let us see some Example of how the PYSPARK  UNION function works:

Example #1

Let’s start by creating a simple Data Frame over we want to use the Filter Operation.

Code:

Creation of DataFrame:

a= spark.createDataFrame(["SAM","JOHN","AND","ROBIN","ANAND"], "string").toDF("Name")

Lets create one more Data Frame b over union operation to be performed on.

Popular Course in this category
Sale
Apache Spark Training (3 Courses)3 Online Courses | 13+ Hours | Verifiable Certificate of Completion | Lifetime Access
4.5 (9,132 ratings)
Course Price

View Course

Related Courses
PySpark Tutorials (3 Courses)Apache Storm Training (1 Courses)

b= spark.createDataFrame(["DAN","JACK","AND"], "string").toDF("Name")

Let’s start a basic union operation.

c = a.union(b).show()

The output will append both the data frames together and the result will have both the data Frames together.

Output:

PySpark Union Example 1

Example #2

The union operations deal with all the data and doesn’t handle the duplicate data in it. To remove the duplicates from the data frame we need to do the distinct operation from the data frame. The Distinct or Drop Duplicate operation is used to remove the duplicates from the Data Frame.

Code:

c.dropDuplicates()
c.distinct()
c.distinct().show()

Output:

PySpark Union Example 2

Example #3

We can also perform multiple union operations over the PySpark Data Frame. The same union operation can be applied to all the data frames.

a= spark.createDataFrame(["SAM","JOHN","AND","ROBIN","ANAND"], "string").toDF("Name")
b= spark.createDataFrame(["DAN","JACK","AND"], "string").toDF("Name")
c= spark.createDataFrame(["DAtN","JACKj","AND"], "string").toDF("Name")
d=a.union(b).union(c)
d.show
()

This will union all three Data frames.

Output:

PySpark Union Example 3 a

d.distinct().show()

The distinct will remove all the duplicates from the Data Frame created.

PySpark Union Example 3 b

Example #4

The same union operation can be done with Data Frame with Integer type as the data type also.

c= spark.createDataFrame([23,98,67], "integer").toDF("Number")
b= spark.createDataFrame([22,12,12], "integer").toDF("Number1")
d = c.union(b)
d.show()

This is of type Integer and post applying union will append all the data frames together.

d.distinct()
d.distinct().show()

Output:

Data Frame with Integer type Example 4

Example #5

Let us see some more examples over the union operation on data frame.

data1  = [{'Name':'Jhon','ID':2,'Add':'USA'},{'Name':'Joe','ID':3,'Add':'MX'},{'Name':'Tina','ID':4,'Add':'IND'}] data2  = [{'Name':'Jhon','ID':21,'Add':'USA'},{'Name':'Joes','ID':31,'Add':'MX'},{'Name':'Tina','ID':43,'Add':'IND'}] rd1 = sc.parallelize(data1)
rd2 = sc.parallelize(data2)
df1 = spark.createDataFrame(rd1)
df2 = spark.createDataFrame(rd2)
df1.show()

Let perform union operation over the Data Frame and analyze.

df3 = df1.union(df2)
df3.show()
df3.distinct()
df3.distinct().show()

Output:

Data Frame and analyze Example 5

These are some of the Examples of Union Transformation in PySpark.

Conclusion

From the above article, we saw the use of Union Operation in PySpark. From various examples and classification, we tried to understand how the  UNION method works in PySpark and what are is used in the programming level. We also saw the internal working and the advantages of having Union in Spark Data Frame and its usage in various programming purposes. Also, the syntax and examples helped us to understand much precisely the function.

Recommended Articles

This is a guide to PySpark Union. Here we discuss the introduction to PySpark Union, its syntax and the use of Union Operation along with Working and code implementation. You may also have a look at the following articles to learn more –

  1. PySpark Join
  2. PySpark SQL
  3. Spark Versions
  4. Spark flatMap

All in One Data Science Bundle (360+ Courses, 50+ projects)

360+ Online Courses

50+ projects

1500+ Hours

Verifiable Certificates

Lifetime Access

Learn More

0 Shares
Share
Tweet
Share
Primary Sidebar
Spark Tutorial
  • PySpark
    • PySpark version
    • PySpark list to dataframe
    • PySpark MLlib
    • PySpark RDD
    • PySpark Write CSV
    • PySpark Orderby
    • PySpark Union DataFrame
    • PySpark apply function to column
    • PySpark Count
    • PySpark GroupBy Sum
    • PySpark AGG
    • PySpark Select Columns
    • PySpark withColumn
    • PySpark Median
    • PySpark toDF
    • PySpark partitionBy
    • PySpark join two dataframes
    • PySpark?foreach
    • PySpark when
    • PySPark Groupby
    • PySpark OrderBy Descending
    • PySpark GroupBy Count
    • PySpark Window Functions
    • PySpark Round
    • PySpark substring
    • PySpark Filter
    • PySpark Union
    • PySpark Map
    • PySpark SQL
    • PySpark Histogram
    • PySpark row
    • PySpark rename column
    • PySpark Coalesce
    • PySpark parallelize
    • PySpark read parquet
    • PySpark Join
    • PySpark Left Join
    • PySpark Alias
    • PySpark Column to List
    • PySpark structtype
    • PySpark Broadcast Join
    • PySpark Lag
    • PySpark count distinct
    • PySpark pivot
    • PySpark explode
    • PySpark Repartition
    • PySpark SQL Types
    • PySpark Logistic Regression
    • PySpark mappartitions
    • PySpark collect
    • PySpark Create DataFrame from List
    • PySpark TimeStamp
    • PySpark FlatMap
    • PySpark withColumnRenamed
    • PySpark Sort
    • PySpark to_Date
    • PySpark kmeans
    • PySpark LIKE
    • PySpark?groupby multiple columns
  • Basics
    • What is Apache Spark
    • Career in Spark
    • Spark Commands
    • How to Install Spark
    • Spark Versions
    • Apache Spark Architecture
    • Spark Tools
    • Spark Shell Commands
    • Spark Functions
    • RDD in Spark
    • Spark DataFrame
    • Spark Dataset
    • Spark Components
    • Apache Spark (Guide)
    • Spark Stages
    • Spark Streaming
    • Spark Parallelize
    • Spark Transformations
    • Spark Repartition
    • Spark Shuffle
    • Spark Parquet
    • Spark Submit
    • Spark YARN
    • SparkContext
    • Spark Cluster
    • Spark SQL Dataframe
    • Join in Spark SQL
    • What is RDD
    • Spark RDD Operations
    • Spark Broadcast
    • Spark?Executor
    • Spark flatMap
    • Spark Thrift Server
    • Spark Accumulator
    • Spark web UI
    • Spark Interview Questions

Related Courses

Spark Certification Course

PySpark Certification Course

Apache Storm Course

Footer
About Us
  • Blog
  • Who is EDUCBA?
  • Sign Up
  • Live Classes
  • Corporate Training
  • Certificate from Top Institutions
  • Contact Us
  • Verifiable Certificate
  • Reviews
  • Terms and Conditions
  • Privacy Policy
  •  
Apps
  • iPhone & iPad
  • Android
Resources
  • Free Courses
  • Database Management
  • Machine Learning
  • All Tutorials
Certification Courses
  • All Courses
  • Data Science Course - All in One Bundle
  • Machine Learning Course
  • Hadoop Certification Training
  • Cloud Computing Training Course
  • R Programming Course
  • AWS Training Course
  • SAS Training Course

© 2022 - EDUCBA. ALL RIGHTS RESERVED. THE CERTIFICATION NAMES ARE THE TRADEMARKS OF THEIR RESPECTIVE OWNERS.

EDUCBA
Free Data Science Course

Hadoop, Data Science, Statistics & others

*Please provide your correct email id. Login details for this Free course will be emailed to you

By signing up, you agree to our Terms of Use and Privacy Policy.

EDUCBA
Free Data Science Course

Hadoop, Data Science, Statistics & others

*Please provide your correct email id. Login details for this Free course will be emailed to you

By signing up, you agree to our Terms of Use and Privacy Policy.

EDUCBA Login

Forgot Password?

By signing up, you agree to our Terms of Use and Privacy Policy.

Let’s Get Started

By signing up, you agree to our Terms of Use and Privacy Policy.

EDUCBA

*Please provide your correct email id. Login details for this Free course will be emailed to you

By signing up, you agree to our Terms of Use and Privacy Policy.

This website or its third-party tools use cookies, which are necessary to its functioning and required to achieve the purposes illustrated in the cookie policy. By closing this banner, scrolling this page, clicking a link or continuing to browse otherwise, you agree to our Privacy Policy

Loading . . .
Quiz
Question:

Answer:

Quiz Result
Total QuestionsCorrect AnswersWrong AnswersPercentage

Explore 1000+ varieties of Mock tests View more

Independence Day Offer - Spark Certification Course Learn More