Difference Between Spark SQL and Presto

Presto, in simple terms, is a ‘SQL Query Engine’ that was initially developed at Facebook for querying data in Apache Hadoop. It is an open-source distributed SQL query engine designed for running interactive analytic queries against data sets of all sizes. Spark SQL is the SQL layer of Apache Spark, a distributed in-memory computation engine, and it works on structured and semi-structured data sets. Because Spark processes data in memory where it can, queries in Spark SQL tend to run fast.

Head to Head Comparison Between Spark SQL and Presto (Infographics)

Below are the top 7 comparisons between Spark SQL and Presto:

[Infographic: Spark SQL vs Presto]

Key Differences Between Spark SQL and Presto

Below are the key differences between Presto and Spark SQL:

Apache Spark provides a module for processing structured data called Spark SQL. Spark SQL includes a programming abstraction called DataFrame, and it can also act as a distributed SQL query engine.
Presto was created to enable interactive analytics that approach the speed of commercial data warehouses, while scaling to the size of organizations like Facebook.
Spark SQL, by contrast, is a component on top of Spark Core. It introduced a data abstraction called SchemaRDD (later renamed DataFrame), which supports structured and semi-structured data.
Presto was designed as an alternative to tools that query HDFS data using MapReduce jobs, such as Hive or Pig, but Presto is not limited to HDFS.
Spark SQL relies on in-memory processing, which increases processing speed. Spark is designed to handle a variety of workloads, such as batch queries, iterative algorithms, interactive queries, and streaming.
Presto is capable of executing federated queries. Below is an example of a Presto federated query:

Let us assume an RDBMS (MySQL in this example) with a table sample1

and Hive with a table sample2.

‘Testdb’ is the database name in both Hive and MySQL. Once both connectors are configured correctly, Presto can evaluate the data in both systems with a single query, as shown below:

presto> <Function (SELECT / GROUP BY, etc.)> hive.Testdb.sample2
        <Function (SELECT / GROUP BY, etc.)> mysql.Testdb.sample1
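As a rough illustration of the single-query idea, the following Python sketch builds one Presto SQL statement that joins the Hive table with the MySQL table through their fully qualified catalog.schema.table names. The TableRef class, the federated_join helper, and the join column id are hypothetical and exist only for this example; the article does not list the columns of either table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRef:
    """A Presto table address: catalog (connector), schema (database), table."""
    catalog: str
    schema: str
    table: str

    def qualified(self) -> str:
        return f"{self.catalog}.{self.schema}.{self.table}"


def federated_join(left: TableRef, right: TableRef, key: str) -> str:
    """Build one SQL statement that joins tables from two different catalogs."""
    return (
        f"SELECT l.{key}, count(*) AS matches\n"
        f"FROM {left.qualified()} AS l\n"
        f"JOIN {right.qualified()} AS r ON l.{key} = r.{key}\n"
        f"GROUP BY l.{key}"
    )


hive_table = TableRef("hive", "Testdb", "sample2")
mysql_table = TableRef("mysql", "Testdb", "sample1")

# "id" is an assumed join column, chosen only for illustration.
query = federated_join(hive_table, mysql_table, key="id")
print(query)
```

The printed statement could then be run at the presto> prompt, assuming both connectors are configured and both tables really contain an id column.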

The Spark SQL architecture consists of Spark SQL itself, SchemaRDD, and DataFrame.

- A DataFrame is a collection of data organized into named columns. Conceptually, it is equivalent to a table in a relational database.
- SchemaRDD: Spark Core contains a core data structure called RDD (Resilient Distributed Dataset). Spark SQL works on schemas, tables, and records, so a user can register a SchemaRDD as a temporary table. The SchemaRDD was later renamed DataFrame, so the two terms refer to the same abstraction.
DataFrame capabilities: A DataFrame can process data ranging in size from kilobytes to petabytes, on anything from a single-node cluster to a multi-node cluster.
DataFrames support different data formats and sources (CSV, Elasticsearch, Cassandra, etc.) and storage systems (HDFS, Hive tables, MySQL, etc.). They can be integrated with other Big Data tools and frameworks via Spark Core, and the API is available for Python, Java, Scala, and R.
Presto, by contrast, is a distributed engine that runs on a cluster. Its architecture is simple to understand and extensible. The Presto client (CLI) submits SQL statements to a master daemon, the coordinator, which manages query processing.
Companies using Presto: Facebook, Netflix, Airbnb, Dropbox, etc.
Apache Spark use cases can be found in industries such as finance, retail, healthcare, and travel. Many web companies, such as eBay, Alibaba, and Pinterest, use Spark SQL to analyze hundreds of petabytes of data on their platforms.
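The coordinator-managed processing described for Presto above can be pictured with a toy model. The Python sketch below is not Presto code. It assumes a coordinator that splits a query's input among workers, lets each worker compute a partial result, and then merges the partials. The function names and the sample data are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def plan(rows, num_workers):
    """Coordinator step: divide the input into one chunk ("split") per worker."""
    return [rows[i::num_workers] for i in range(num_workers)]


def worker_partial_sum(split):
    """Worker step: compute a partial aggregate over its own split."""
    return sum(split)


def run_query(rows, num_workers=3):
    """Coordinator: plan, fan out to the workers, then merge partial results."""
    splits = plan(rows, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker_partial_sum, splits))
    return sum(partials)


# Toy stand-in for "SELECT sum(amount) FROM t"; the values are invented.
amounts = list(range(1, 101))
total = run_query(amounts)
print(total)  # 5050
```

Threads stand in for worker nodes here only to keep the example self-contained; in a real cluster the workers are separate machines.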

Comparison Table of Spark SQL vs Presto

Below is the comparison between Spark SQL and Presto.

Basis of comparison	Presto	Spark SQL
Eco-Systems / Platforms	Hadoop, Big Data processing, etc.	Spark framework, Big Data processing, etc.
Purpose	Presto is designed for running SQL queries over Big Data (huge workloads). It was designed at Facebook to process the company's huge workloads.	Spark SQL is one of the components of Apache Spark, built on top of Spark Core. Spark Core is the fundamental execution engine for the Spark platform.
Set up	Presto is a distributed SQL query engine for processing petabytes of data, and it runs on a cluster of machines. A full Presto cluster includes a coordinator (manager node) and multiple workers. The user submits queries from a client, such as the Presto CLI, to the coordinator. The coordinator parses, analyzes, and plans the query execution, and then distributes the query processing to the workers.	Spark SQL is available out of the box once an Apache Spark cluster is installed and configured. Apache Spark is a top-level Apache project that can run on Hadoop or standalone; it is a cluster-based Big Data processing technology designed for fast computation.
Capabilities/Features	Presto allows querying data across many data sources; for example, the data might reside in Hive, Cassandra, an RDBMS, or a proprietary data store.	Spark SQL offers flexible integration with other data sources through DataFrames and JDBC connectors.
Support for Connectors	Presto supports pluggable connectors, which provide the data sets for queries. Several pre-built connectors are available, and Presto can also be extended with custom connectors. Supported connectors include Hadoop/Hive, Cassandra, Teradata, PostgreSQL, Oracle, etc.	The DataFrame interface allows different data sources to work with Spark SQL. Spark SQL also includes a server mode with industry-standard JDBC and ODBC connectivity.
Federated Queries	Presto supports federated queries. It can be configured to connect to different databases, and once configured, its CLI can be used to launch federated queries. In a single Presto query, a user can combine data from multiple data sources.	Spark SQL comes with a built-in feature for connecting to other databases over JDBC (“JDBC to other Databases”), which helps with federation. Spark can create DataFrames from JDBC sources through the Scala or Python API. The same feature also works with the Spark SQL Thrift server, so users can query external JDBC tables much like other Hive or Spark tables.
Who Uses?	Data Analysts, Data Engineers, Data Scientists, etc.	Data Analysts, Data Engineers, Data Scientists, Spark Developers, etc.
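To make the “JDBC to other Databases” feature in the Federated Queries row more concrete, the sketch below assembles the options that Spark's JDBC data source reads (url, dbtable, user, password). The host name, port, and credentials are placeholders, jdbc_options is a hypothetical helper, and the table sample1 in database Testdb is borrowed from the federated-query example earlier in the article. The actual load call appears only as a comment, because it needs a running Spark session and a MySQL JDBC driver.

```python
def jdbc_options(host, port, database, table, user, password):
    """Collect the options Spark's JDBC data source expects for one table."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
    }


# All connection details below are placeholders, not real endpoints.
opts = jdbc_options(
    host="db.example.com",
    port=3306,
    database="Testdb",
    table="sample1",
    user="reader",
    password="change-me",
)
print(opts["url"])

# With PySpark installed and a MySQL JDBC driver on the classpath, the
# options could be passed to the DataFrame reader like this:
#   df = spark.read.format("jdbc").options(**opts).load()
#   df.createOrReplaceTempView("sample1")  # then query it with spark.sql(...)
```

Once the external table is registered as a view in this way, it can be joined with Hive or other Spark tables in ordinary Spark SQL.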

Conclusions

Presto is well suited to BI-type queries, while Spark SQL tends to lead performance-wise on large analytics queries. In terms of configuration, Presto is easier to set up than Spark SQL. Both Spark SQL and Presto are well established in the market and solve different kinds of business problems.