Join in Spark SQL

By Priya Pedamkar


Introduction to Join in Spark SQL

Join in Spark SQL is the functionality to join two or more datasets, similar to a table join in SQL-based databases. Spark represents data in tabular form as Datasets and DataFrames. Spark SQL supports several types of joins: inner join, cross join, left outer join, right outer join, full outer join, left semi join, and left anti join. Which join to use depends on the business use case, and some joins demand significant resources and computation. To manage such scenarios, Spark provides the SQL optimizer and the cross-join enabler flag.

Types of Join in Spark SQL

Following are the different types of Joins:


  • INNER JOIN
  • CROSS JOIN
  • LEFT OUTER JOIN
  • RIGHT OUTER JOIN
  • FULL OUTER JOIN
  • LEFT SEMI JOIN
  • LEFT ANTI JOIN

Example of Data Creation

We will use the following data to demonstrate the different types of joins:
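The snippets in this guide assume a SparkSession available as spark, as in spark-shell, together with its implicits for the toDS() conversions; this setup is an assumption of the examples, not something fixed by the article:

import org.apache.spark.sql.SparkSession

// In spark-shell this session already exists as `spark`.
val spark = SparkSession.builder()
  .appName("JoinInSparkSQL")
  .master("local[*]")
  .getOrCreate()

// Required for the .toDS() calls below.
import spark.implicits._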

Book Dataset:

case class Book(book_name: String, cost: Int, writer_id: Int)

val bookDS = Seq(
  Book("Scala", 400, 1),
  Book("Spark", 500, 2),
  Book("Kafka", 300, 3),
  Book("Java", 350, 5)
).toDS()

bookDS.show()


Writer Dataset:

case class Writer(writer_name: String, writer_id: Int)

val writerDS = Seq(
  Writer("Martin", 1),
  Writer("Zaharia", 2),
  Writer("Neha", 3),
  Writer("James", 4)
).toDS()

writerDS.show()


Types of Joins

The seven types of joins are described below with examples:

1. INNER JOIN

The INNER JOIN returns a dataset containing only the rows that have matching values in both datasets, i.e. rows where the value of the common field is the same on both sides.

val BookWriterInner = bookDS.join(writerDS, bookDS("writer_id") === writerDS("writer_id"), "inner")
BookWriterInner.show()
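The same inner join can also be written in SQL syntax by registering the Datasets as temporary views; the view names book and writer are illustrative choices, not part of the original example:

bookDS.createOrReplaceTempView("book")
writerDS.createOrReplaceTempView("writer")

// Equivalent SQL formulation of the inner join above.
val BookWriterInnerSql = spark.sql(
  "SELECT * FROM book b INNER JOIN writer w ON b.writer_id = w.writer_id")
BookWriterInnerSql.show()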


2. CROSS JOIN

The CROSS JOIN returns a dataset whose number of rows equals the number of rows in the first dataset multiplied by the number of rows in the second dataset. Such a result is called the Cartesian product.
Prerequisite: To use a cross join, spark.sql.crossJoin.enabled must be set to true; otherwise, an exception is thrown.

spark.conf.set("spark.sql.crossJoin.enabled", true)
val BookWriterCross = bookDS.join(writerDS)
BookWriterCross.show()
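Alternatively, the explicit crossJoin method (available in Spark 2.1 and later) states the intent directly and does not rely on the configuration flag; a minimal sketch with the same Datasets:

// Explicit Cartesian product of the two Datasets.
val BookWriterCrossExplicit = bookDS.crossJoin(writerDS)
BookWriterCrossExplicit.show()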


3. LEFT OUTER JOIN

The LEFT OUTER JOIN returns a dataset that has all rows from the left dataset and the matched rows from the right dataset; right-side columns are null where there is no match.


val BookWriterLeft = bookDS.join(writerDS, bookDS("writer_id") === writerDS("writer_id"), "leftouter")
BookWriterLeft.show()
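Because both Datasets have a writer_id column, the joined result carries that column twice. A common way to keep a single copy is to join on the column name with a Seq; this variant is a sketch, not part of the original example:

// Joining on the column name keeps only one writer_id column.
val BookWriterLeftUsing = bookDS.join(writerDS, Seq("writer_id"), "left_outer")
BookWriterLeftUsing.show()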


4. RIGHT OUTER JOIN

The RIGHT OUTER JOIN returns a dataset that has all rows from the right dataset and the matched rows from the left dataset; left-side columns are null where there is no match.

val BookWriterRight = bookDS.join(writerDS, bookDS("writer_id") === writerDS("writer_id"), "rightouter")
BookWriterRight.show()


5. FULL OUTER JOIN

The FULL OUTER JOIN returns a dataset containing all rows from both datasets; where a row has no match on the other side, the missing columns are filled with null.

val BookWriterFull = bookDS.join(writerDS, bookDS("writer_id") === writerDS("writer_id"), "fullouter")
BookWriterFull.show()
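In the full outer join result, each of the two writer_id columns may be null, so they are often merged back into one key. The following is a sketch using coalesce; the alias and selected columns are illustrative:

import org.apache.spark.sql.functions.coalesce

// Take the non-null writer_id from either side of the join.
val BookWriterFullMerged = BookWriterFull.select(
  coalesce(bookDS("writer_id"), writerDS("writer_id")).alias("writer_id"),
  bookDS("book_name"),
  writerDS("writer_name"))
BookWriterFullMerged.show()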


6. LEFT SEMI JOIN

The LEFT SEMI JOIN returns only the rows from the left dataset that have a match in the right dataset. Unlike the LEFT OUTER JOIN, the result of a LEFT SEMI JOIN contains only columns from the left dataset.

val BookWriterLeftSemi = bookDS.join(writerDS, bookDS("writer_id") === writerDS("writer_id"), "leftsemi")
BookWriterLeftSemi.show()
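In SQL syntax, a left semi join behaves like an IN subquery. This sketch reuses the book and writer temporary views registered in the inner join example above:

// Books whose writer_id appears in the writer dataset; only book columns are returned.
val BookWithKnownWriterSql = spark.sql(
  "SELECT * FROM book WHERE writer_id IN (SELECT writer_id FROM writer)")
BookWithKnownWriterSql.show()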


7. LEFT ANTI JOIN

The LEFT ANTI JOIN returns all rows from the left dataset that have no match in the right dataset. Like the LEFT SEMI JOIN, it contains only columns from the left dataset.

val BookWriterLeftAnti = bookDS.join(writerDS, bookDS("writer_id") === writerDS("writer_id"), "leftanti")
BookWriterLeftAnti.show()
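A typical use of the left anti join is spotting unmatched records, for example listing only the names of books whose writer is missing from the writer dataset; the column selection here is just an illustration:

// Only left-side columns exist in the anti join result.
BookWriterLeftAnti.select("book_name").show()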


Conclusion

Joining data is one of the most common and important operations for fulfilling business use cases. Spark SQL supports all the fundamental types of joins. While joining, we also need to consider performance, as joins may require large network transfers or even create datasets beyond our capacity to handle. For performance improvement, Spark's SQL optimizer re-orders operations and pushes down filters. Spark also restricts the dangerous join, i.e. the CROSS JOIN: to use it, spark.sql.crossJoin.enabled must be set to true explicitly.
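To see how Spark actually plans a join, for example whether it broadcasts the smaller side, the physical plan can be printed with explain(); a broadcast can also be hinted explicitly. The hint below is an illustration under the assumption that writerDS is small, not something the guide prescribes:

import org.apache.spark.sql.functions.broadcast

// Print the physical plan chosen by the optimizer for the inner join above.
BookWriterInner.explain()

// Hint that the writer Dataset is small enough to ship to every executor.
val BookWriterBroadcast = bookDS.join(broadcast(writerDS),
  bookDS("writer_id") === writerDS("writer_id"), "inner")
BookWriterBroadcast.show()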

Recommended Articles

This is a guide to Join in Spark SQL. Here we discuss the different types of joins available in Spark SQL with examples. You may also look at the following articles.

  1. Types of Joins in SQL
  2. Table in SQL
  3. SQL Insert Query
  4. Transactions in SQL
  5. PHP Filters | How to validate User Input using Various Filters?
