Introduction to PySpark Join
The PySpark join operation is a way to combine DataFrames in a Spark application.
A join combines, merges, or extracts data from two different DataFrames or sources. It combines rows of two DataFrames based on one or more relational columns: rows that satisfy the join condition appear in the result, while the rest are dropped or filled with nulls, depending on the join type. Joins are essential when dealing with bulk or related data spread across two DataFrames in Spark, and a join operation can combine multiple DataFrames or work over many rows of a DataFrame in a PySpark application.
PySpark supports various join types with which we can join DataFrames and work over the data as per need. Some of the join types are:
Inner Join, Outer (Full) Join, Left Join, Right Join, Left Semi Join, Left Anti Join, etc.
These operations are needed for data manipulation in a Spark application. Let us check some examples of this operation in a PySpark application.
Examples of PySpark Joins
Let us see some examples of how PySpark Join operation works:
Before starting, let's create two DataFrames in PySpark from which the join examples will start. Create one dataset with the name data1 and another with the name data2. The createDataFrame function is used in PySpark to create a DataFrame.
Code:
data1 = [{'Name': 'Jhon', 'ID': 2, 'Add': 'USA'}, {'Name': 'Joe', 'ID': 3, 'Add': 'MX'}, {'Name': 'Tina', 'ID': 4, 'Add': 'IND'}]
data2 = [{'Name': 'Jhon', 'ID': 21, 'Add': 'USA'}, {'Name': 'Joes', 'ID': 31, 'Add': 'MX'}, {'Name': 'Tina', 'ID': 43, 'Add': 'IND'}]
Create RDDs from the sample data (sc is the SparkContext provided by the PySpark shell):
rd1 = sc.parallelize(data1)
rd2 = sc.parallelize(data2)
Create DataFrames from the RDDs (spark is the SparkSession):
df1 = spark.createDataFrame(rd1)
df2 = spark.createDataFrame(rd2)
df1.show()
df2.show()
Output:
The sample DataFrames are created; now let's see the join operations and their usage.
Inner Join
In an inner join, only the matching records from both DataFrames are selected.
The operation is performed on the given columns, and rows whose values match in both frames are joined, with the combined rows displayed as the output.
Code:
df_inner = df1.join(df2, on=['Name'], how='inner')
df_inner.show()
df1.join(df2, on=['ID'], how='inner').show()
Rows that satisfy the join condition appear in the result; rows that do not are dropped.
Non-matching rows produce no output at all.
Output:
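Since the output screenshots are not reproduced here, the matching behaviour can be sketched in plain Python. This is an illustration of inner-join semantics only, not PySpark itself; the inner_join helper and the left_ID/right_ID field names are invented for the sketch, and only the Name and ID columns of the sample data are used.

```python
# Plain-Python sketch of inner-join semantics (illustrative only, not PySpark).
# The inner_join helper and 'left_ID'/'right_ID' names are made up for this sketch.
left = [{'Name': 'Jhon', 'ID': 2}, {'Name': 'Joe', 'ID': 3}, {'Name': 'Tina', 'ID': 4}]
right = [{'Name': 'Jhon', 'ID': 21}, {'Name': 'Joes', 'ID': 31}, {'Name': 'Tina', 'ID': 43}]

def inner_join(left_rows, right_rows, key):
    # Keep a combined row only when the key column matches on both sides.
    return [
        {key: l[key], 'left_ID': l['ID'], 'right_ID': r['ID']}
        for l in left_rows for r in right_rows
        if l[key] == r[key]
    ]

print(inner_join(left, right, 'Name'))  # 'Jhon' and 'Tina' match; 'Joe'/'Joes' do not
print(inner_join(left, right, 'ID'))    # no IDs overlap, so the result is empty
```

This mirrors why the join on 'Name' above returns two rows while the join on 'ID' returns none: the sample IDs never coincide across the two frames.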
Outer Join
An outer (full) join selects all the rows from both DataFrames.
The operation is performed on the given columns; matched rows are combined and returned as the result, and values missing on either side are filled with null.
Code:
df1.join(df2, on=['ID'], how='outer').show()
df1.join(df2, on=['Name'], how='outer').show()
Rows matching the condition are combined as in an inner join.
Rows that do not match are still kept, with the missing side's columns filled with null, and the full result is displayed.
Output:
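The full-outer behaviour can also be sketched in plain Python (illustrative only, not PySpark; the full_outer_join helper and the left_ID/right_ID names are invented for the sketch):

```python
# Plain-Python sketch of full-outer-join semantics (illustrative only, not PySpark).
left = [{'Name': 'Jhon', 'ID': 2}, {'Name': 'Joe', 'ID': 3}, {'Name': 'Tina', 'ID': 4}]
right = [{'Name': 'Jhon', 'ID': 21}, {'Name': 'Joes', 'ID': 31}, {'Name': 'Tina', 'ID': 43}]

def full_outer_join(left_rows, right_rows, key):
    result = []
    for l in left_rows:
        matches = [r for r in right_rows if r[key] == l[key]]
        if matches:
            # Matched rows combine values from both sides.
            result += [{key: l[key], 'left_ID': l['ID'], 'right_ID': r['ID']} for r in matches]
        else:
            # Unmatched left rows are kept; the right side becomes None (null).
            result.append({key: l[key], 'left_ID': l['ID'], 'right_ID': None})
    for r in right_rows:
        if not any(l[key] == r[key] for l in left_rows):
            # Unmatched right rows are also kept, with the left side as None.
            result.append({key: r[key], 'left_ID': None, 'right_ID': r['ID']})
    return result

rows = full_outer_join(left, right, 'Name')
print(len(rows))  # 4 rows: Jhon, Joe (right side null), Tina, Joes (left side null)
```

Joining on 'ID' instead would yield six rows, every one half-filled with nulls, since no ID matches across the frames.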
Left Join
In a left join, all the rows from the left DataFrame are selected, and where the condition matches, the corresponding right-side values fill the record. The operation is performed on the given columns, and right-side columns with no match are filled with null.
Code:
df1.join(df2, on=['Name'], how='left').show()
df1.join(df2, on=['ID'], how='left').show()
Rows matching the condition are combined with values from both sides.
Every row of the left DataFrame appears in the result: where the condition is satisfied the right-side values are filled in, and where it is not they are null.
Output:
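A plain-Python sketch of the left-join behaviour (illustrative only, not PySpark; the left_join helper and the left_ID/right_ID names are invented for the sketch):

```python
# Plain-Python sketch of left-join semantics (illustrative only, not PySpark).
left = [{'Name': 'Jhon', 'ID': 2}, {'Name': 'Joe', 'ID': 3}, {'Name': 'Tina', 'ID': 4}]
right = [{'Name': 'Jhon', 'ID': 21}, {'Name': 'Joes', 'ID': 31}, {'Name': 'Tina', 'ID': 43}]

def left_join(left_rows, right_rows, key):
    result = []
    for l in left_rows:
        matches = [r for r in right_rows if r[key] == l[key]]
        if matches:
            # Matched left rows pick up the right-side values.
            result += [{key: l[key], 'left_ID': l['ID'], 'right_ID': r['ID']} for r in matches]
        else:
            # Unmatched left rows are still kept, with None (null) on the right.
            result.append({key: l[key], 'left_ID': l['ID'], 'right_ID': None})
    return result

rows = left_join(left, right, 'Name')
print([r['Name'] for r in rows])  # every left name appears: ['Jhon', 'Joe', 'Tina']
```

Note that unmatched rows from the right frame ('Joes') never appear, which is the difference from the outer join above.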
Right Join
In a right join, all the rows from the right DataFrame are selected, and where the condition matches, the corresponding left-side values fill the record.
The operation is performed on the given columns, and left-side columns with no match are filled with null.
Code:
df1.join(df2, on=['Name'], how='right').show()
df1.join(df2, on=['ID'], how='right').show()
Rows matching the condition are combined with values from both sides.
Every row of the right DataFrame appears in the result: where the condition is satisfied the left-side values are filled in, and where it is not they are null.
Output:
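The right join is simply the mirror image of the left join, which a plain-Python sketch makes explicit (illustrative only, not PySpark; the right_join helper and field names are invented for the sketch):

```python
# Plain-Python sketch of right-join semantics (illustrative only, not PySpark).
left = [{'Name': 'Jhon', 'ID': 2}, {'Name': 'Joe', 'ID': 3}, {'Name': 'Tina', 'ID': 4}]
right = [{'Name': 'Jhon', 'ID': 21}, {'Name': 'Joes', 'ID': 31}, {'Name': 'Tina', 'ID': 43}]

def right_join(left_rows, right_rows, key):
    # A right join keeps every right row; unmatched left values become None (null).
    result = []
    for r in right_rows:
        matches = [l for l in left_rows if l[key] == r[key]]
        if matches:
            result += [{key: r[key], 'left_ID': l['ID'], 'right_ID': r['ID']} for l in matches]
        else:
            result.append({key: r[key], 'left_ID': None, 'right_ID': r['ID']})
    return result

rows = right_join(left, right, 'Name')
print([r['Name'] for r in rows])  # every right name appears: ['Jhon', 'Joes', 'Tina']
```

Here 'Joes' is kept with a null left side, while the unmatched left row 'Joe' disappears, exactly the opposite of the left join.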
Left Semi Join
In a left semi join, only the matching records from the left DataFrame are selected.
The operation behaves like an inner join, except that only the left DataFrame's columns are returned.
Code:
df1.join(df2, on=['Name'], how='left_semi').show()
df1.join(df2, on=['ID'], how='left_semi').show()
Only rows matching the condition appear in the result, and only with the left DataFrame's columns.
Non-matching rows produce no output.
Output:
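A plain-Python sketch of the left-semi behaviour (illustrative only, not PySpark; the left_semi_join helper is invented for the sketch):

```python
# Plain-Python sketch of left-semi-join semantics (illustrative only, not PySpark).
left = [{'Name': 'Jhon', 'ID': 2}, {'Name': 'Joe', 'ID': 3}, {'Name': 'Tina', 'ID': 4}]
right = [{'Name': 'Jhon', 'ID': 21}, {'Name': 'Joes', 'ID': 31}, {'Name': 'Tina', 'ID': 43}]

def left_semi_join(left_rows, right_rows, key):
    # Keep left rows whose key appears on the right; right-side columns are
    # never returned, and each left row appears at most once.
    right_keys = {r[key] for r in right_rows}
    return [l for l in left_rows if l[key] in right_keys]

print(left_semi_join(left, right, 'Name'))  # only the left-side 'Jhon' and 'Tina' rows
print(left_semi_join(left, right, 'ID'))    # empty: no ID matches across the frames
```

Because only left-side columns survive, the left-semi result keeps the left frame's IDs (2 and 43 never mix), which is how it differs from the inner join above.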
Left Anti Join
A left anti join returns the difference between the two DataFrames: it selects the rows of DataFrame1 whose join key has no match in DataFrame2, keeping only the left frame's columns.
Code:
df1.join(df2, on=['Name'], how='left_anti').show()
df1.join(df2, on=['ID'], how='left_anti').show()
Output:
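A plain-Python sketch of the left-anti behaviour, the exact complement of the left semi join (illustrative only, not PySpark; the left_anti_join helper is invented for the sketch):

```python
# Plain-Python sketch of left-anti-join semantics (illustrative only, not PySpark).
left = [{'Name': 'Jhon', 'ID': 2}, {'Name': 'Joe', 'ID': 3}, {'Name': 'Tina', 'ID': 4}]
right = [{'Name': 'Jhon', 'ID': 21}, {'Name': 'Joes', 'ID': 31}, {'Name': 'Tina', 'ID': 43}]

def left_anti_join(left_rows, right_rows, key):
    # The complement of a left semi join: left rows with NO match on the right.
    right_keys = {r[key] for r in right_rows}
    return [l for l in left_rows if l[key] not in right_keys]

print(left_anti_join(left, right, 'Name'))  # only 'Joe', the left name absent on the right
print(left_anti_join(left, right, 'ID'))    # all three left rows: no ID ever matches
```

Together, the semi and anti joins partition the left frame: every left row lands in exactly one of the two results.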
These were the various join operations in PySpark.
Conclusion – PySpark Join
From the above article, we saw the use of the join operation in PySpark. Through the examples and classifications, we tried to understand how each join type works in PySpark and how it is used at the programming level. We also saw the behaviour of joins on PySpark DataFrames and their usage for various programming purposes. The syntax and examples helped us understand the function more precisely.
Recommended Articles
This is a guide to PySpark Join. Here we discuss the introduction, the syntax, and how the PySpark join operation works, with code implementations. You may also have a look at the following articles to learn more –