Difference Between Hive and Impala
Hive is a data warehouse software project built on top of Apache Hadoop. It was developed by Jeff’s team at Facebook, and version 2.3.0, released in July 2017, was the current stable version at the time of writing. It is used for summarising big data and makes querying and analysis easy. Apache Hive is a de facto standard for SQL-in-Hadoop.
Impala is a massively parallel processing (MPP) SQL query engine that runs on Apache Hadoop and is used to process data stored in HBase (the Hadoop database) and the Hadoop Distributed File System (HDFS). It is an open source product for querying data stored in a computer cluster running Apache Hadoop. It was originally developed by Cloudera and is now an Apache Software Foundation project, with a stable version of 2.10 at the time of writing. So let’s study both Hive and Impala in detail.
- Apache Hive helps in analyzing huge datasets stored in the Hadoop Distributed File System (HDFS) and other compatible file systems.
- Hive QL is used for querying data stored in a Hadoop cluster.
- It exploits the scalability of Hadoop by translating queries into Hadoop jobs.
- Hive is not a full database.
- It does not provide record-level updates.
- Hadoop is a batch-oriented system.
- Hive queries have high latency because they run as MapReduce jobs.
- Hive does not provide the features of an OLTP system. It is closer to OLAP.
- Best suited for Data Warehouse Applications.
- Queries are executed via MapReduce.
- The query language can be extended with custom scalar functions (UDFs), aggregations (UDAFs), and table functions (UDTFs).
- Hive also provides indexing to accelerate queries. Index types include compact and bitmap indexes as of version 0.10, with more index types planned.
- Storage types supported by Hive include RCFile, HBase, ORC, and plain text.
- SQL-like queries (Hive QL) are implicitly converted into MapReduce, Tez, or Spark jobs.
- By default, Hive stores metadata in an embedded Apache Derby database.
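To illustrate the point above that SQL-like queries are implicitly converted into MapReduce jobs, here is a minimal Python sketch of how a query such as `SELECT dept, COUNT(*) FROM employees GROUP BY dept` could be expressed as map, shuffle, and reduce phases. This is not Hive's actual planner: the table, the rows, and all function names are hypothetical and only meant to show the shape of the translation.

```python
from collections import defaultdict

# Hypothetical rows standing in for a table stored in HDFS (illustration only).
EMPLOYEES = [
    {"name": "asha", "dept": "sales"},
    {"name": "ben", "dept": "engineering"},
    {"name": "chen", "dept": "sales"},
    {"name": "dana", "dept": "engineering"},
    {"name": "eli", "dept": "support"},
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["dept"], 1


def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Aggregate each key's values; summing the 1s implements COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_dept(rows):
    """Run the three phases in sequence for the GROUP BY query."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_dept(EMPLOYEES))
```

On a real cluster each phase runs across many machines and writes intermediate data to disk, which is one reason the latency of such queries is high.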
- Impala is a query engine that runs on Hadoop. Its public beta test distribution was announced in October 2012, and it became generally available in May 2013.
- It supports HDFS, Apache HBase, and Amazon S3 storage.
- It reads Hadoop file formats, including text, Parquet, Avro, RCFile, LZO, and SequenceFile.
- Supports Hadoop security (Kerberos authentication).
- Uses metadata, ODBC driver, and SQL syntax from Apache Hive.
- It supports multiple compression codecs:
(a) Snappy (Recommended for its effective balance between compression ratio and decompression speed),
(b) Gzip (Recommended when the highest level of compression is required),
(c) Deflate (not supported for text files), Bzip2, LZO (for text files only);
- It allows you to query nested structures, including maps, structs, and arrays.
- It allows multi-user concurrent queries and provides admission control based on prioritization and queuing of queries.
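The codec list above describes a trade-off between compression ratio and decompression speed. As a rough way to observe that trade-off, the following Python sketch compresses some synthetic CSV-like data with the codecs available in the Python standard library (Gzip, Deflate via zlib, and Bzip2) and reports the ratio and decompression time for each. Snappy and LZO are not included because they require third-party packages. The sample data and function names are assumptions made for illustration, and the numbers say nothing about Impala's own performance.

```python
import bz2
import gzip
import time
import zlib


def sample_text_data(rows=20000):
    """Build repetitive CSV-like bytes, a rough stand-in for a text table."""
    lines = (
        f"{i},user_{i % 100},region_{i % 7},{i * 3 % 1000}\n" for i in range(rows)
    )
    return "".join(lines).encode("utf-8")


# Codec name -> (compress, decompress), standard library only.
# "deflate" here is the zlib-wrapped DEFLATE stream produced by zlib.compress.
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "deflate": (zlib.compress, zlib.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}


def measure(data):
    """Return {codec: (compression_ratio, decompress_seconds)} for the data."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        packed = compress(data)
        start = time.perf_counter()
        restored = decompress(packed)
        elapsed = time.perf_counter() - start
        if restored != data:
            raise ValueError(f"{name} round trip failed")
        results[name] = (len(data) / len(packed), elapsed)
    return results


if __name__ == "__main__":
    for name, (ratio, seconds) in measure(sample_text_data()).items():
        print(f"{name:8s} ratio={ratio:6.2f} decompress={seconds * 1000:7.2f} ms")
```

Timings vary from machine to machine, so the output is best read as a relative comparison between codecs on the same data.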
Head To Head Comparisons Between Hive vs Impala (Infographics)
Below are the top 20 comparisons between Hive and Impala.
Key Difference Between Hive vs Impala
The differences between Hive and Impala are explained in the points below:
- Hive was developed by Jeff’s team at Facebook, whereas Impala was developed by Cloudera and later donated to the Apache Software Foundation.
- Hive supports the Optimized Row Columnar (ORC) file format with Zlib compression, whereas Impala supports the Parquet format with Snappy compression.
- Hive is written in Java, whereas Impala is written in C++.
- Query processing in Hive is slow, whereas Impala is reported to be 6 to 69 times faster than Hive.
- Latency is high in Hive and low in Impala.
- Hive supports RCFile and ORC storage, whereas Impala storage support covers Hadoop (HDFS) and Apache HBase.
- Hive generates query expressions at compile time, whereas Impala generates code for “big loops” at runtime.
- Hive does not use massively parallel processing (MPP), whereas Impala does.
- Hive supports MapReduce but Impala does not support MapReduce.
- Hive has no security feature in its basic setup (Kerberos requires configuring HiveServer2), whereas Impala supports Kerberos authentication.
- When upgrading an existing project where compatibility and speed are both important, Hive is an ideal choice, whereas for a new project Impala is the ideal choice.
- Hive is fault tolerant, whereas Impala does not support fault tolerance.
- Hive supports complex types broadly, whereas Impala’s support for complex types (maps, structs, and arrays) is more limited.
- Hive is batch-based and runs on Hadoop MapReduce, whereas Impala is an MPP database.
- Hive does not support interactive computing, whereas Impala does.
- Hive queries have a “cold start” problem, whereas Impala daemon processes are started at boot time.
- Hive’s resource manager is YARN (Yet Another Resource Negotiator), whereas Impala’s resource manager is native *YARN.
- Hive is available in all Hadoop distributions, including Hortonworks (Tez, LLAP), whereas Impala is available in the Cloudera and MapR distributions (*Amazon EMR).
- Hive’s audience is data engineers, whereas Impala’s audience is data analysts and data scientists.
- Hive throughput is high, whereas Impala throughput is low.
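The point above about code generation at runtime can be made more concrete with a small sketch. Impala itself uses LLVM to generate native code. The Python below is only an analogy, with hypothetical function names, that contrasts a generic filter (which looks up the column and operator for every row) with a function generated at runtime and specialized to a single predicate.

```python
import operator

# Operators supported by both the generic and the generated path.
OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def interpreted_filter(rows, column, op, value):
    """Generic path: resolve the operator and column again for every row."""
    return [row for row in rows if OPS[op](row[column], value)]


def generate_filter(column, op, value):
    """Runtime code generation: build a function specialized to one predicate.

    The column name, operator, and constant are baked into the generated
    source, so the inner loop contains no lookups in OPS.
    """
    if op not in OPS:
        raise ValueError(f"unsupported operator: {op}")
    source = (
        "def specialized(rows):\n"
        f"    return [row for row in rows if row[{column!r}] {op} {value!r}]\n"
    )
    namespace = {}
    exec(source, namespace)  # compile the generated source at runtime
    return namespace["specialized"]


if __name__ == "__main__":
    rows = [{"price": 5}, {"price": 15}, {"price": 25}]
    fast_filter = generate_filter("price", ">", 10)
    print(fast_filter(rows))
    print(interpreted_filter(rows, "price", ">", 10))
```

Both paths return the same rows. The generated version simply removes per-row dispatch, which is the kind of saving that matters when a loop runs over very large numbers of rows.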
Hive vs Impala Comparison Table
|Serial No.||Basis For Comparison||Hive||Impala|
|1.||Developed By||Jeff’s team at Facebook (now an Apache Software Foundation project)||Cloudera (now an Apache Software Foundation project)|
|2.||File Format||ORC with Zlib compression||Parquet with Snappy compression|
|3.||Language||Written in Java||Written in C++|
|4.||Processing Speed||Hive is slow||Impala is fast|
|5.||Latency||High||Low|
|6.||Storage Support||RCFile, ORC||Hadoop (HDFS), Apache HBase|
|7.||Code Conversion||Generates query expressions at compile time||Code generation happens at runtime|
|8.||Massively Parallel Processing (MPP)||No||Yes|
|9.||MapReduce Support||Yes||No|
|10.||Hadoop Security||Not in the basic setup||Supports Kerberos authentication|
|11.||Usage||Ideal for upgrading an existing project||Ideal for starting a new project|
|12.||Fault Tolerance||Hive is fault tolerant||Does not support fault tolerance|
|13.||Complex Types||Hive supports complex types broadly||Impala’s support for complex types is more limited|
|14.||Database Type||Hive is batch-based Hadoop MapReduce||It is an MPP database|
|15.||Interactive Computing||Does not support interactive computing||Supports interactive computing|
|16.||Execution||Hive queries have a “cold start” problem||Impala daemon processes are started at boot time|
|17.||Resource Management||YARN||Native *YARN|
|18.||Distributions||All Hadoop distributions, Hortonworks (Tez, LLAP)||Cloudera, MapR (*Amazon EMR)|
|19.||Audience||Data engineers||Data analysts/data scientists|
|20.||Throughput||High throughput||Low throughput|
Conclusion – Hive vs Impala
In this article, we have tried to showcase what the two technologies, Hive and Impala, are and what the basic differences between them are. In practical terms, Hive and Impala are not competitors: both are built on the same Hadoop foundation, and it is how each one is used that makes the difference. Depending on our needs, we can use them together or choose the one that best fits our compatibility, requirements, and performance. Hive’s query language, Hive QL, is a versatile and universal language, while Impala is memory intensive and does not work well for heavy data operations such as join queries. If your project involves batch processing of a large amount of data, Hive will be the better choice, and if your work involves real-time processing of ad hoc queries on data, Impala will be the better choice.