Differences Between Spark SQL and Presto
Presto, in simple terms, is a SQL query engine initially developed for Apache Hadoop. It is an open-source, distributed SQL query engine designed for running interactive analytic queries against data sets of all sizes.
Spark SQL is a distributed in-memory computation engine with a SQL layer on top of structured and semi-structured data sets. Because it processes data in memory, Spark SQL is fast.
Head to Head Comparison Between Spark SQL and Presto (Infographics)
Below are the top 7 comparisons between Spark SQL and Presto:
Key Differences Between Spark SQL and Presto
Below are the key differences between Presto and Spark SQL:
- Apache Spark provides a programming module for processing structured data called Spark SQL. Spark SQL includes a programming abstraction called DataFrame and can act as a distributed SQL query engine.
- Presto was created to enable interactive analytics that approach the speed of commercial data warehouses while scaling to the size of organizations such as Facebook.
- Spark SQL, in contrast, is a component on top of Spark Core. It introduced a data abstraction called SchemaRDD (renamed DataFrame in Spark 1.3), which provides support for structured and semi-structured data.
- Presto was designed as an alternative to tools that query HDFS data using MapReduce jobs, such as Hive or Pig, but Presto is not limited to HDFS.
- Spark SQL processes data in memory, which increases processing speed. Spark is designed to handle a wide range of workloads, such as batch queries, iterative algorithms, interactive queries, and streaming.
- Presto can execute federated queries. Below is an example of a Presto federated query:
Assume an RDBMS (MySQL in this example) with a table sample1,
and Hive with a table sample2.
‘Testdb’ is the database in both Hive and MySQL. Once their connectors are configured correctly, Presto can query data from both sources in a single query, as shown below:
presto> <function (SELECT, GROUP BY, etc.)> hive.Testdb.sample2
        <function (SELECT, GROUP BY, etc.)> mysql.Testdb.sample1
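To make the template above more concrete, here is a minimal sketch of what such a federated query could look like. It only builds and prints the SQL text; it does not talk to a Presto server. The join column `id` and the aliases are assumptions made for illustration, since the article does not list the columns of either table. The catalog names `hive` and `mysql` are whatever names the connectors were configured under.

```python
def federated_join_sql(hive_table: str, mysql_table: str, key: str = "id") -> str:
    """Build one Presto query that joins a Hive table with a MySQL table.

    Tables are addressed as catalog.schema.table, so the catalog prefix
    (hive or mysql) tells Presto which connector serves each side.
    The join column `key` is hypothetical; replace it with a real column.
    """
    return (
        f"SELECT h.{key}, COUNT(*) AS matches\n"
        f"FROM {hive_table} AS h\n"
        f"JOIN {mysql_table} AS m ON h.{key} = m.{key}\n"
        f"GROUP BY h.{key}"
    )


# Table names taken from the example in the text.
query = federated_join_sql("hive.Testdb.sample2", "mysql.Testdb.sample1")
print(query)
```

The printed statement can be pasted at the `presto>` prompt once both catalogs are available and the column name has been adjusted to match the real tables.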
- The Spark SQL architecture consists of Spark SQL, Schema RDD, and DataFrame.
- A DataFrame is a collection of data organized into named columns. Conceptually, it is equivalent to a table in a relational database.
- Schema RDD: Spark Core contains a special data structure called the RDD. Spark SQL works on schemas, tables, and records, so a user can register a Schema RDD as a temporary table and query it. The Schema RDD is what is now called a DataFrame.
- DataFrame capabilities: DataFrames can process data ranging from kilobytes to petabytes, on anything from a single-node cluster to multi-node clusters.
- DataFrames support different data formats (CSV, Elasticsearch, Cassandra, etc.) and storage systems (HDFS, Hive tables, MySQL, etc.). They can be integrated with other big data tools and frameworks via Spark Core, and they provide APIs for Python, Java, Scala, and R.
- Presto, in contrast, is a distributed engine that runs on a cluster. Its architecture is simple to understand and extensible. The Presto client (CLI) submits SQL statements to a coordinator, the master daemon that manages query processing.
- Companies using Presto: Facebook, Netflix, Airbnb, Dropbox, etc.
- Apache Spark use cases can be found in industries such as finance, retail, healthcare, and travel. Companies such as eBay, Alibaba, and Pinterest use Spark SQL to analyze hundreds of petabytes of data on their platforms.
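The DataFrame and temporary-table ideas from the list above can be sketched in PySpark roughly as follows. This is an illustrative sketch, not a reference implementation: the column names, the view name `employees`, and the function name are all hypothetical. The Spark session is passed in as a parameter, so the snippet itself does not start Spark.

```python
def top_salary_by_dept(spark, rows, min_salary):
    """Register rows as a temporary view and query them with Spark SQL.

    `spark` is an existing pyspark.sql.SparkSession supplied by the caller.
    `rows` is a list of (name, dept, salary) tuples (hypothetical schema).
    """
    # A DataFrame organizes the data into named columns.
    df = spark.createDataFrame(rows, ["name", "dept", "salary"])
    # Expose the DataFrame to SQL under a temporary name.
    df.createOrReplaceTempView("employees")
    # min_salary is coerced to int so the generated SQL stays well-formed.
    return spark.sql(
        "SELECT dept, MAX(salary) AS top_salary "
        "FROM employees "
        f"WHERE salary >= {int(min_salary)} "
        "GROUP BY dept"
    )


# Typical use (requires a Spark installation):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("demo").getOrCreate()
#   top_salary_by_dept(spark, [("Ann", "eng", 120), ("Bob", "ops", 90)], 100).show()
```

The same pattern applies whatever the source of the rows is: once a DataFrame is registered as a temporary view, it can be queried with ordinary SQL alongside other tables.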
Comparison Table: Spark SQL and Presto
Below is the comparison between Spark SQL and Presto.
| Basis of comparison | Presto | Spark SQL |
|---|---|---|
| Ecosystems / platforms | Hadoop, big data processing, etc. | Spark framework, big data processing, etc. |
| Purpose | Presto is designed for running SQL queries over big data (huge workloads). It was designed by Facebook to process its huge workloads. | Spark SQL is a component of Apache Spark built on top of Spark Core, the fundamental execution engine of the Spark platform. |
| Capabilities / features | Presto allows querying data across many data sources. For example, data might reside in Hive, Cassandra, an RDBMS, or other proprietary data stores. | Spark SQL integrates flexibly with other data sources through DataFrames and JDBC connectors. |
| Support for connectors | Presto supports pluggable connectors, which provide the data sets for queries. Several pre-built connectors are available, and Presto can also work with custom connectors. | The DataFrame interface allows different data sources to work with Spark SQL. Spark SQL also includes a server mode with industry-standard JDBC and ODBC connectivity. |
| Federated queries | Presto supports federated queries. It can be configured to connect to different databases, and once configured, its CLI can be used to launch federated queries. A single Presto query can combine data from multiple data sources. | Spark SQL has a built-in feature for connecting to other databases over JDBC ("JDBC to other databases"), which aids federation. Spark can create DataFrames from JDBC sources through the Scala or Python API, and the same feature works with the Spark SQL Thrift server, so users can query external JDBC tables like other Hive or Spark tables. |
| Who uses it? | Data analysts, data engineers, data scientists, etc. | Data analysts, data engineers, data scientists, Spark developers, etc. |
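
The "JDBC to other databases" feature mentioned in the table can be sketched as below. The option keys `url`, `dbtable`, `user`, and `password` are the ones Spark's JDBC data source expects; the host name, credentials, and helper names are placeholders invented for this example. Reading from MySQL also assumes the MySQL JDBC driver is on Spark's classpath.

```python
def jdbc_options(host, database, table, user, password, port=3306):
    """Build the options for Spark's built-in JDBC data source (MySQL URL form)."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def load_jdbc_table(spark, options):
    """Load an external JDBC table as a DataFrame.

    `spark` is an existing pyspark.sql.SparkSession supplied by the caller.
    """
    # format("jdbc") selects the JDBC source; load() returns a DataFrame.
    return spark.read.format("jdbc").options(**options).load()


# Placeholder connection details; only the table and database names come from the text.
opts = jdbc_options("db.example.com", "Testdb", "sample1", "reader", "secret")
print(opts["url"])
```

A DataFrame loaded this way can be registered as a temporary view and joined with Hive or other Spark tables in a single Spark SQL statement, which is how Spark approximates federation.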
Spark SQL and Presto are both distributed SQL engines available in the market.
Presto is very helpful for BI-type queries, while Spark SQL leads on performance for large analytics queries. In terms of configuration, Presto is easier to set up than Spark SQL. Both hold comparable positions in the market and solve different kinds of business problems.