Overview of Spark Versions
Apache Spark is a cluster computing engine for processing large datasets. It was started by Matei Zaharia in 2009 at UC Berkeley's AMPLab, open-sourced in 2010, and donated to the Apache Software Foundation in 2013.
Since being open-sourced, Spark has gone through many changes. Growing data volumes and increasing data complexity kept pushing the tool to evolve, and each successive release made it more stable and more capable.
Various Versions of Spark
Let us look at the Spark versions released so far:
1. Version 0.5
Version 0.5, released in June 2012, is the earliest release covered here. It runs on Mesos 0.9 and contains usability and stability improvements. Logs from old jobs became easier to access. New operators such as sortByKey and takeSample were introduced, along with support for the new Hadoop API.
The latest version available is 0.5.1.
2. Version 0.6
In October 2012, a few months after 0.5, Spark 0.6 brought several new features, architectural changes, and performance enhancements. A standalone deploy mode was introduced, making it easy to launch a cluster without installing an external cluster manager. RDDs gained a persist() method, and more join operators were added. Spark was also published to Maven Central, so it can be used as a dependency in Maven projects.
The latest version available is 0.6.2.
3. Version 0.7
Version 0.7 was released at the start of 2013. It was a major release because it introduced the Python API, known as PySpark, which makes it possible to use Spark from Python, including native libraries such as NumPy and SciPy. The EC2 scripts now read S3 credentials from AWS_ACCESS_KEY and AWS_SECRET_KEY, which made it easier to access S3. Shuffle operations and overall performance were also improved.
The latest version available is 0.7.3.
4. Version 0.8
This version of Spark was released in September 2013. A monitoring UI with dashboards was introduced, served by default on port 4040. It shows information about running, completed, and failed jobs. The machine learning library (MLlib) was introduced, and support for YARN was added, so Spark jobs can run on a YARN cluster. Deploying applications also became easier thanks to extended EC2 capabilities.
The latest version available is 0.8.1.
5. Version 0.9
Version 0.9 was a major release at the start of 2014. It moved Spark to Scala 2.10 and added or improved various libraries. It includes the first version of GraphX, a powerful tool for graph processing: we can use the graph library to build graphs from RDDs, then transform them and extract subgraphs. The SparkConf class became the preferred way to configure advanced settings on the SparkContext.
Spark Streaming was improved as well: streaming listeners were introduced, and window operators were sped up by up to 50%.
The latest version available is 0.9.2.
6. Version 1.0
Spark 1.0 was the start of the 1.X line. Released in 2014, it was a major release because it added a major new component, Spark SQL, for loading and working with structured data in Spark. With Spark SQL, it became easier to query large datasets and run operations on them. Java and Python support was extended, including the new lambda syntax in the Java bindings.
The latest version available is 1.0.2.
7. Version 1.1
Version 1.1 was the first minor release on the 1.X line. Building on Spark SQL, this release added a JDBC/ODBC server so that many different applications can connect to Spark SQL. Support for JSON data was introduced. There were performance and usability improvements, and accumulator values were now displayed in the Spark UI.
The latest version available is 1.1.1.
8. Version 1.2
Released at the end of 2014, this version brings performance and usability improvements to the Spark core engine. The communication layer used during bulk transfers was improved and the shuffle mechanism was upgraded.
The latest version available is 1.2.2.
9. Version 1.3
Spark 1.3 was the fourth release on the 1.X line. It introduced the DataFrame API along with improvements to the Spark SQL API. Multi-level aggregation trees were added to Spark Core to help speed up reduce operations. SSL encryption was introduced, and the Kafka integration and its documentation were extended.
The latest version available is 1.3.1.
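The idea behind a multi-level aggregation tree can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark's implementation: the function name tree_reduce and the fan_in parameter are hypothetical, chosen only for this example. Each "partition" is reduced locally first, and the partial results are then combined in small groups, level by level, instead of being sent to a single place all at once.

```python
from functools import reduce


def tree_reduce(partitions, op, fan_in=2):
    """Reduce a list of partitions level by level (illustrative only).

    `tree_reduce` and `fan_in` are hypothetical names for this sketch;
    they are not part of the Spark API.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    # Level 0: reduce every non-empty partition locally,
    # as each worker would do on its own data.
    partials = [reduce(op, part) for part in partitions if part]
    if not partials:
        raise ValueError("cannot reduce an empty collection")
    # Higher levels: combine partial results in groups of `fan_in`
    # until a single value remains.
    while len(partials) > 1:
        partials = [
            reduce(op, partials[i:i + fan_in])
            for i in range(0, len(partials), fan_in)
        ]
    return partials[0]


partitions = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
total = tree_reduce(partitions, lambda a, b: a + b)
print(total)
```

With four partitions and a fan-in of 2, the partial sums are combined in two rounds rather than in one final step, which is the effect the aggregation tree aims for on large clusters.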
10. Version 1.4
Released in 2015, this version introduced the SparkR package and expanded MLlib and Spark Streaming. Visualization of Spark DAGs was added, as was Docker support on Mesos.
The latest version available is 1.4.1.
11. Version 1.5
This version of Spark mainly improves existing APIs such as RDD and DataFrame. Join execution on DataFrames was improved, and memory management was improved as well.
The latest version available is 1.5.2.
12. Version 1.6
Version 1.6 arrived in 2016 and was the last feature release on the Spark 1.X line.
It introduced Datasets, a new Spark API that helps when working with custom objects. Reading of non-standard JSON files was introduced. Null-safe joins were added, along with improvements for working with Parquet files using a columnar approach.
The latest version available is 1.6.3.
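To see what a null-safe join changes, here is a small plain-Python sketch that uses None to stand in for SQL NULL. It is an illustration of the semantics only, not Spark code; the helper names sql_equal, null_safe_equal, and inner_join are made up for this example. In Spark SQL the null-safe comparison is written with the <=> operator.

```python
def sql_equal(a, b):
    """Standard SQL equality: comparing with NULL yields NULL (None here)."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_equal(a, b):
    """Null-safe equality: NULL matches NULL and the result is never NULL."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def inner_join(left, right, key_equal):
    """Nested-loop inner join over (key, value) rows (hypothetical helper).

    A pair is kept only when the key comparison is exactly True.
    """
    return [
        (lk, lv, rv)
        for lk, lv in left
        for rk, rv in right
        if key_equal(lk, rk) is True
    ]


left = [(1, "a"), (None, "b")]
right = [(1, "x"), (None, "y")]

standard = inner_join(left, right, sql_equal)
null_safe = inner_join(left, right, null_safe_equal)
print(standard)
print(null_safe)
```

With standard equality, the rows whose key is NULL never match anything, so only the row with key 1 survives. With null-safe equality, the two NULL-keyed rows are also paired with each other.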
13. Version 2.0
This was the first release on the 2.X line, published in mid-2016. Hive-style bucketing, performance improvements, and SQL improvements were added in this version, and a native SQL parser was introduced. SparkR gained many new functions such as dapply, gapply, and lapply.
The latest version available is 2.0.2.
14. Version 2.1
This was the second release in the 2.X family, with a focus on streaming improvements, including Kafka support. The API was updated, making the data type APIs stable. JSON parsing functions were introduced, and PageRank was made available in R.
Apart from this, performance and memory management were improved, and features such as random forests and faster regression were introduced.
The latest version available is 2.1.3.
15. Version 2.2
The third release in the 2.X family came in 2017, with support for creating Hive tables using the DataFrameWriter and the Catalog. Broadcast hints (BROADCAST, BROADCASTJOIN, and MAPJOIN) were introduced for SQL queries. Parsing of multi-line JSON and CSV files was introduced.
The latest version available is 2.2.3.
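The broadcast (map-side) join mentioned for version 2.2 rests on a simple idea: when one table is small, copy it to every worker as a lookup table and join each partition of the large table locally, avoiding a shuffle of the large side. The plain-Python sketch below shows that idea under simplifying assumptions; broadcast_hash_join and the sample data are hypothetical and only stand in for what a cluster would do in parallel.

```python
def broadcast_hash_join(large_partitions, small_table):
    """Inner join of a partitioned large table with a small table.

    Hypothetical helper for illustration; not a Spark API.
    """
    # Build the lookup once from the small side. On a cluster, this
    # dictionary is what would be shipped ("broadcast") to every worker.
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)

    joined = []
    # Each partition of the large side is joined independently,
    # probing the lookup instead of shuffling rows by key.
    for partition in large_partitions:
        for key, value in partition:
            for small_value in lookup.get(key, []):
                joined.append((key, value, small_value))
    return joined


# Large side: (customer_id, item) rows split across two partitions.
orders = [[(1, "book"), (2, "pen")], [(1, "lamp"), (3, "desk")]]
# Small side: (customer_id, country) rows.
countries = [(1, "US"), (2, "DE")]

result = broadcast_hash_join(orders, countries)
print(result)
```

The order for customer 3 is dropped because the small table has no matching row, as expected for an inner join.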
16. Version 2.3
This was the fourth release in the 2.X family. It introduced Spark on Kubernetes, which supports submitting jobs to a cluster managed by Kubernetes. The history server was improved, PySpark performance was improved, and Hive partitioning was enhanced with the introduction of dynamic partitions.
The latest version available is 2.3.3.
17. Version 2.4
At the time of writing, this is the latest stable release of Spark.
It adds experimental support for Scala 2.12, so application owners can write their programs in Scala 2.12. A built-in Avro data source was introduced for better performance and usability. SQL syntax for PIVOT was added, along with COALESCE and REPARTITION hints for SQL queries. It is a stable release and is widely used to build Spark applications.
The latest version available is 2.4.4.
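For readers unfamiliar with pivoting, the following plain-Python sketch shows what a pivot does to a table: the distinct values of one column become new columns, and a measure is aggregated into each cell. It is only an illustration of the operation, not Spark SQL; the function pivot_sum and the sample sales rows are invented for this example.

```python
from collections import defaultdict


def pivot_sum(rows, group_key, pivot_key, value_key):
    """Pivot `rows` so each distinct `pivot_key` value becomes a column.

    Hypothetical helper for illustration. Cells with no source rows
    are None, mirroring NULL in a SQL result.
    """
    pivot_values = sorted({row[pivot_key] for row in rows})
    sums = defaultdict(lambda: defaultdict(int))
    for row in rows:
        sums[row[group_key]][row[pivot_key]] += row[value_key]
    return [
        {group_key: group, **{p: sums[group].get(p) for p in pivot_values}}
        for group in sorted(sums)
    ]


sales = [
    {"year": 2018, "quarter": "Q1", "amount": 10},
    {"year": 2018, "quarter": "Q2", "amount": 20},
    {"year": 2018, "quarter": "Q1", "amount": 5},
    {"year": 2019, "quarter": "Q1", "amount": 7},
]

pivoted = pivot_sum(sales, "year", "quarter", "amount")
for row in pivoted:
    print(row)
```

Each output row corresponds to one year, with one column per quarter holding the summed amount for that year and quarter.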
In this blog we walked through the Spark versions released to date and some of the changes each release brought. As data patterns and volumes keep changing, we can expect new releases over time that further improve Spark's functionality.