Updated April 19, 2023

Introduction to PySpark Versions

PySpark is the Python API released by the Apache Spark community to support Spark from Python. It is widely used by data scientists and researchers to work with RDDs in the Python programming language. PySpark is a boon to data engineers who need to process large data sets, analyze them, and run computations at scale. Spark itself also offers APIs for Java, Scala, and R, which makes it attractive to a broad range of users. Because of its speed and its ability to handle big data, it has gained strong community support. Many versions of PySpark have been released since May 2017, each bringing new changes.
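
Because behavior differs from one release to the next, it can help to check which version is installed before relying on a feature. The following is a minimal sketch: pyspark.__version__ is the attribute that holds the installed version string, while the helper names version_tuple, at_least, and installed_pyspark_version are hypothetical and introduced here only for illustration.

```python
def version_tuple(version):
    """Turn a version string such as '3.0.1' into a comparable tuple (3, 0, 1)."""
    numbers = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as '-preview2'
            digits += ch
        numbers.append(int(digits) if digits else 0)
    # Pad short versions such as '2.4' so tuples always have three parts.
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def at_least(version, minimum):
    """Return True if `version` is the same as or newer than `minimum`."""
    return version_tuple(version) >= version_tuple(minimum)


def installed_pyspark_version():
    """Return the installed PySpark version string, or None if PySpark is absent."""
    try:
        import pyspark
    except ImportError:
        return None
    return pyspark.__version__


version = installed_pyspark_version()
if version is None:
    print("PySpark is not installed")
elif at_least(version, "3.0.0"):
    print("Running a 3.x or later release:", version)
else:
    print("Running a pre-3.0 release:", version)
```

Comparing tuples of integers rather than raw strings avoids mistakes such as treating "2.10.0" as older than "2.4.7".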

Versions of PySpark

Many versions of PySpark have been released and are publicly available. Some of the recent Spark releases that support Python and introduced major changes are described below:

1. Spark Release 2.3.0

This is the fourth release in the 2.x line of Apache Spark. It includes a number of PySpark performance enhancements, along with updates to the DataSource and Structured Streaming APIs.

Some of the important features and updates introduced in this release are given below:

The performance and interoperability of Python were improved through vectorized execution and fast data serialization.
A new Spark History Server backend was added to provide better scalability for large applications.
The register* methods for UDFs in SQLContext and Catalog were deprecated in PySpark.
The Python na.fill() function now also accepts boolean values and replaces nulls with them (earlier versions of PySpark ignored the boolean and returned the original DataFrame).
Timestamp behavior for pandas-related functionality was changed to respect the session timezone.
From this release on, pandas 0.19.2 or later is required to use pandas-related functionality.
Many documentation changes were made, and test scripts were revised, for the Python language.
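
The na.fill() change is easiest to see on a small DataFrame. The sketch below assumes Spark 2.3.0 or later and a local Spark installation; na.fill(value, subset=...) is the PySpark call described above, while the function names fill_missing_flags and demo, the column names, and the sample rows are hypothetical.

```python
def fill_missing_flags(df, columns, default=False):
    """Replace nulls in the given boolean columns with `default`.

    `df` is expected to be a pyspark.sql.DataFrame. On Spark 2.3.0 and later
    a boolean fill value is applied to boolean columns; earlier releases
    ignored it and returned the DataFrame unchanged.
    """
    return df.na.fill(default, subset=list(columns))


def demo():
    # Imported here so the helper above can be loaded without starting Spark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("na-fill-demo")  # hypothetical application name
        .getOrCreate()
    )
    df = spark.createDataFrame(
        [("alice", True), ("bob", None)],
        schema="name string, active boolean",
    )
    # The null in the 'active' column is replaced with False.
    fill_missing_flags(df, ["active"]).show()
    spark.stop()
```

Calling demo() on a machine with Spark available prints the filled DataFrame; the helper itself only needs an object that exposes the na.fill() method.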

2. Spark Release 2.4.7

This was a maintenance release containing bug fixes that preserve the stability and security of the 2.4 line. No major feature specific to the Python API of PySpark was introduced in this release. Some of the notable changes are given below:

A bug that caused the jobs UI page to take about 40 seconds to load was fixed.
Python scripts that failed in certain environments in previous releases were fixed.
Two DataFrames with the same schema (except for the nullable property) can now be compared.
In the release Dockerfile, the R version was upgraded to 4.0.2.
Support for R versions below 3.5 was dropped.
Exception messages were improved in various places.
Error messages were locked when failing in interpreter mode.
Several documentation changes were made to fix inconsistent AWS variables.

3. Spark Release 3.0.0

This is the first release of the 3.x line. It builds on many of the innovations from the 2.x releases, bringing new ideas while continuing long-term projects already in development. It was officially released in June 2020. Spark SQL is the top component in this release, accounting for more than 45% of the resolved tickets. These improvements benefit all the higher-level libraries and APIs, including DataFrames and SQL. At this stage, Python is the most widely used language on Apache Spark, and PySpark has been downloaded millions of times. The major changes and features introduced in this release are given below:

Functionality and usability were improved, including a redesign of the pandas UDF APIs.
Error handling was made more Pythonic in various places.
Python 2 support was deprecated.
PySpark SQL exceptions were made more Pythonic.
Test coverage and documentation for Python UDFs were improved.
For the K8s (Kubernetes) Python bindings, Python 3 was made the default.
Support for fitting with a validation set was added to Gradient Boosted Trees in Python.
Parity of ML functions between Python and Scala was maintained.
Exception messages in Python UDFs were improved in response to complaints from Python users.
Multiclass logistic regression in PySpark now correctly returns a LogisticRegressionSummary.
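
The redesigned pandas UDF API lets the kind of UDF be expressed through Python type hints. The following is a minimal sketch, assuming Spark 3.0 or later with pandas and PyArrow installed; pandas_udf is the decorator from pyspark.sql.functions, while the function names, the column name, and the sample values are hypothetical.

```python
def celsius_to_fahrenheit(celsius):
    """Vectorized conversion; works on a pandas Series or a plain number."""
    return celsius * 9.0 / 5.0 + 32.0


def demo():
    # Imported here so the pure function above can be used without Spark.
    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    # Spark 3.0 style: the Series-to-Series UDF type is inferred from the
    # type hints, so only the return data type is passed to the decorator.
    @pandas_udf("double")
    def to_fahrenheit(celsius: pd.Series) -> pd.Series:
        return celsius_to_fahrenheit(celsius)

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("pandas-udf-demo")  # hypothetical application name
        .getOrCreate()
    )
    df = spark.createDataFrame([(0.0,), (100.0,)], schema="celsius double")
    df.select(to_fahrenheit("celsius").alias("fahrenheit")).show()
    spark.stop()
```

Keeping the arithmetic in a plain function makes it testable without a Spark session, while the decorated wrapper is what Spark runs on batches of rows.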

4. Spark Release 3.0.1

Spark Release 3.0.1 was a maintenance release containing stability fixes. It was officially released on September 8, 2020. No major changes related to PySpark were introduced in this release. It is based on the maintenance branch of Spark 3.0. Other related changes and fixes made in this release are given below:

Double caching was fixed in KMeans and BiKMeans.
Apache Arrow 1.0.0 is now supported in SparkR.
Silent data changes in timestamp parsing on overflow were fixed.
Keywords were revisited based on the ANSI SQL standard.
A regression in handling NaN values in SQL COUNT was fixed.
A bug that caused Spark to produce incorrect results in a GROUP BY clause was fixed.
Grouping columns are now resolved according to case sensitivity in pandas UDFs.
The MLlib acceleration docs were improved.
A regression from 3.0.0 that caused LEFT JOIN to produce unexpected results was resolved.

5. Spark Release 3.1.1

Spark Release 3.1.1 is the official release of the 3.1 line of Apache Spark, including bug fixes and new features. Although it was planned for early January 2021, it was officially released in March 2021.

Conclusion

The description above covers the major versions of PySpark. Apache Spark is widely used in the IT industry, and Python is a high-level, general-purpose language that is among the most widely used. By bringing the key features of Python to the Spark framework and exposing Spark's building blocks to Python, PySpark is a valuable contribution from Apache Spark to the IT industry.