Introduction to HBase Interview Questions and Answers
HBase is a popular column-oriented, NoSQL database management system that runs on top of the Hadoop Distributed File System (HDFS). It is well suited for sparse data sets, which are common in many big data use cases.
Part 1 – HBase Interview Questions (Basic)
This first part covers basic HBase Interview Questions And Answers.
1. When should you use HBase?
HBase is not suitable for every use case. The following checks help identify a scenario where it fits well:
i. Data volume: HBase pays off only for very large data sets (up to petabytes, typically hundreds of millions of rows or more) that have to be processed in a distributed environment.
ii. Application: HBase is not suitable for OLTP (Online Transaction Processing) systems that require complex multi-statement transactions. It also lacks the rich SQL support that relational analytics requires. It is preferred when you have a huge amount of data whose schema varies slightly from record to record.
iii. Cluster hardware: HBase runs on top of HDFS, and HDFS works efficiently only with a reasonably large number of nodes (at least 5). HBase is therefore a good choice only when enough hardware is available.
iv. Not a traditional RDBMS: HBase does not support use cases that depend on traditional relational features such as joins across multiple tables or complex SQL with nested queries or window functions.
v. Quick random access to data: If you need random, real-time access to your data, then HBase is a suitable candidate. It is also a good fit for storing large tables with multi-structured data.
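To make the idea of sparse, multi-structured data concrete, here is a minimal Python sketch. It is a toy model only: the table contents, the `get_cell` helper, and the `columns_used` helper are made up for illustration and are not part of any HBase API.

```python
# Toy model of a sparse, wide-column table: row key -> {"family:qualifier": value}.
# Illustration only; this is not how HBase lays data out on disk.
table = {
    "user#001": {"info:name": "Ada", "info:email": "ada@example.com"},
    "user#002": {"info:name": "Linus", "stats:logins": 42},
    "user#003": {"info:name": "Grace"},
}


def get_cell(table, row_key, column):
    """Random access by row key; a column the row never stored is simply absent."""
    return table.get(row_key, {}).get(column)


def columns_used(table):
    """Collect every column that appears in at least one row."""
    return sorted({col for row in table.values() for col in row})
```

Each row stores only the columns it actually has, so rows with slightly different shapes can live in the same table without padding the missing columns with nulls.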
2. What is the difference between Cassandra and HBase?
Both HBase and Cassandra are distributed NoSQL databases for big data, but they were built for different use cases. HBase is part of the Hadoop ecosystem, while Cassandra runs independently of Hadoop.
HBase has a master-slave architecture with several components, such as ZooKeeper, the HDFS NameNode, the HBase Master (HMaster), and DataNodes. Cassandra treats all nodes as peers, which means all nodes are equal and can perform all functions.
HBase is optimized for reads and offers strong consistency: each region is served by a single region server at a time, so every write to a row goes through that one server and a read after a write returns the latest value. Cassandra has excellent single-row read performance if eventual consistency is selected.
HBase does not natively support secondary indexes, whereas Cassandra supports secondary indexes on column families (tables) where the column name is known.
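HBase began as an open-source implementation of the design described in Google's Bigtable paper, and Google Cloud Bigtable still offers an HBase-compatible client API. Cassandra originated at Facebook; its distributed design draws on Amazon's Dynamo paper, the system that also inspired DynamoDB, the NoSQL database from AWS.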
Let us move on to the next HBase interview question.
3. What are the Major Components of HBase?
HBase has three important components: HMaster, Region Server, and ZooKeeper.
i. HBase Master: HBase tables are divided into regions. At startup, the Master decides which region to assign to which region server (a region server is a node in the cluster). It also handles table metadata operations such as creating a table or changing its schema. This component also plays an important role in failure recovery.
ii. Region Server: This is where the actual data reads and writes happen. Region servers are the worker nodes of the cluster. Each one hosts regions from many tables, and each region is defined by its start and end row keys. A typical region server can serve up to a thousand regions.
iii. ZooKeeper: ZooKeeper is a cluster coordination framework widely used in the Hadoop ecosystem. It tracks all servers (the Master and the region servers) present in the cluster. The HMaster contacts ZooKeeper, and notifications are produced in case of errors.
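The way a row key maps to a region, and a region to a region server, can be sketched in a few lines of Python. This is a simplified illustration, not HBase's actual lookup code: the region boundaries, the server names (`rs1`, `rs2`, `rs3`), and the `locate_region` function are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical region layout: each region covers [start_key, next start_key).
# The empty string marks the start of the first region.
REGION_START_KEYS = ["", "g", "n", "t"]

# Hypothetical assignment of regions (identified by start key) to region servers.
REGION_TO_SERVER = {"": "rs1", "g": "rs2", "n": "rs1", "t": "rs3"}


def locate_region(row_key):
    """Return (region_start_key, server) for the region whose range holds row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before that position contains the row.
    idx = bisect_right(REGION_START_KEYS, row_key) - 1
    start = REGION_START_KEYS[idx]
    return start, REGION_TO_SERVER[start]
```

Because regions are contiguous, sorted key ranges, locating a row is a binary search over start keys, and one server can host several regions of the same or different tables.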
4. What is HBase Bloom Filter?
This is a common HBase interview question. An HBase Bloom filter is an efficient mechanism to test whether a store file contains a specific row or row-column cell. (When something is written to HBase, it first goes to an in-memory store, the MemStore; once the MemStore reaches a certain size, it is flushed to disk as a store file.) Normally, the only way to decide whether a row key might be present in a store file is to check the file's block index, which holds the start row key of each block in the store file. A Bloom filter is an in-memory data structure that reduces disk reads to only the files likely to contain that row, rather than all store files. It can return a false positive but never a false negative, so a "no" answer means the file can be skipped safely. In effect, it acts like an in-memory index that indicates the probability of finding a row in a particular store file.
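The following Python sketch shows the general Bloom filter technique and how it could be used to skip store files. It is a generic textbook-style filter, not HBase's implementation: the class, the bit-array size, the SHA-256 based hashing, and the `files_to_read` helper are assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: k hash positions per key over a fixed bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def files_to_read(store_files, row_key):
    """store_files: list of (name, BloomFilter). Return the names worth reading."""
    return [name for name, bloom in store_files if bloom.might_contain(row_key)]
```

A key that was added always tests as possibly present, while a key that was never added usually, but not always, tests as absent. That asymmetry is why the filter can only be used to rule files out.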
5. What is Compaction? Explain different types of it.
HBase first stores all incoming operations in an in-memory area called the MemStore. When this memory buffer is full, it is flushed to disk. Because this can create many small files in HDFS, HBase periodically selects files to be compacted together into a bigger one. A compaction is called minor when HBase selects only some of the HFiles to be compacted, but not all. In a major compaction, all the files are selected to be compacted together. A major compaction works like a minor one, except that delete markers can be removed after they have been applied to all the related cells, and all extra versions of the same cell are also dropped.
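The difference between the two compaction types can be sketched with a toy model in Python. Everything here is a simplification made for illustration: a "file" is a list of `(row, timestamp, value)` cells with one column per row, a value of `DELETE` stands in for a delete marker, and the function names are hypothetical.

```python
DELETE = None  # a cell whose value is DELETE stands in for a delete marker


def merge(files):
    """Merge store files (lists of (row, ts, value) cells) into one sorted file."""
    cells = [cell for f in files for cell in f]
    # Sort by row, newest timestamp first; at equal timestamps markers come first.
    return sorted(cells, key=lambda c: (c[0], -c[1], c[2] is not DELETE))


def minor_compact(files, pick):
    """Merge only the files at the indexes in `pick`; keep markers and versions."""
    picked = [files[i] for i in pick]
    rest = [f for i, f in enumerate(files) if i not in pick]
    return rest + [merge(picked)]


def major_compact(files, max_versions=1):
    """Merge all files, apply delete markers, and drop extra versions."""
    out, kept, deleted_at = [], {}, {}
    for row, ts, value in merge(files):
        if value is DELETE:
            deleted_at.setdefault(row, ts)  # newest marker for the row is seen first
            continue
        if row in deleted_at and ts <= deleted_at[row]:
            continue  # masked by a delete marker, so physically dropped now
        if kept.get(row, 0) >= max_versions:
            continue  # older than the allowed number of versions
        kept[row] = kept.get(row, 0) + 1
        out.append((row, ts, value))
    return [out]
```

In this model a minor compaction only reduces the number of files, while a major compaction is the point at which deleted cells and surplus versions actually disappear.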
Part 2 – HBase Interview Questions (Advanced)
Let us now have a look at the advanced HBase Interview Questions.
6. How does HBase version data?
When a piece of data is inserted, updated, or deleted, HBase creates a new version for that column, identified by a timestamp, instead of modifying the existing value in place. Actual deletion happens only during compaction. If a particular cell exceeds the number of versions allowed, the extra versions are dropped during compaction.
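A small Python sketch can illustrate versioning for a single cell. It is a toy model under stated assumptions: the `VersionedCell` class and its methods are hypothetical, and versions are keyed by an integer timestamp supplied by the caller.

```python
class VersionedCell:
    """Toy cell that keeps versions keyed by timestamp, newest first on read."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = {}  # timestamp -> value

    def put(self, ts, value):
        # An update never overwrites in place; it adds another version.
        self.versions[ts] = value

    def get(self, n=1):
        """Return up to n newest versions, never more than max_versions."""
        newest = sorted(self.versions.items(), reverse=True)
        return newest[: min(n, self.max_versions)]

    def compact(self):
        """Physically drop versions beyond max_versions, as compaction would."""
        keep = sorted(self.versions, reverse=True)[: self.max_versions]
        self.versions = {ts: self.versions[ts] for ts in keep}
```

Note that surplus versions are hidden from reads immediately in this model but only removed from storage when `compact` runs, which mirrors the lazy cleanup described above.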
7. What is the difference between Get and Scan?
Get returns only a single row from an HBase table, based on the given row key. Scan returns a set of rows, depending on the given search condition. A Get is usually faster than a Scan, so prefer it whenever possible.
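The contrast between the two operations can be sketched over a sorted in-memory table in Python. This is an illustration, not the HBase client API: the `SortedTable` class is hypothetical, and the scan here takes a start row (inclusive) and a stop row (exclusive) as its only search condition.

```python
from bisect import bisect_left


class SortedTable:
    """Toy table whose rows are kept sorted by row key."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.keys = sorted(self.rows)

    def get(self, row_key):
        """Point lookup: exactly one row, or None if the key is absent."""
        return self.rows.get(row_key)

    def scan(self, start_row, stop_row):
        """Range read: all rows with start_row <= key < stop_row, in key order."""
        lo = bisect_left(self.keys, start_row)
        hi = bisect_left(self.keys, stop_row)
        return [(k, self.rows[k]) for k in self.keys[lo:hi]]
```

A get touches one row, while a scan walks every row in the requested range, which is why a narrow row-key range matters for scan performance.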
Let us move on to the next HBase interview question.
8. What happens when deleting a row?
When a delete command is issued, the data is not physically deleted from the file system; instead, it is made invisible by setting a marker. Physical deletion happens during compaction.
There are three kinds of delete markers: Column, Version, and Family delete markers, which mark the deletion of a column, a single version of a column, and a column family, respectively.
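How the three marker types hide cells at read time can be sketched in Python. The tuple layouts and the `visible_cells` function are assumptions made for this illustration; the masking rules follow the usual semantics, where a version marker hides one exact timestamp and column and family markers hide everything at or before the marker's timestamp.

```python
def visible_cells(cells, markers):
    """Filter out cells hidden by delete markers.

    cells:   list of (family, qualifier, ts, value)
    markers: list of (kind, family, qualifier, ts), kind in
             {"version", "column", "family"}; qualifier is ignored for "family".
    """

    def masked(cell):
        fam, qual, ts, _ = cell
        for kind, m_fam, m_qual, m_ts in markers:
            if m_fam != fam:
                continue
            if kind == "family" and ts <= m_ts:
                return True  # whole column family up to the marker's timestamp
            if kind == "column" and m_qual == qual and ts <= m_ts:
                return True  # all versions of one column up to the timestamp
            if kind == "version" and m_qual == qual and ts == m_ts:
                return True  # exactly one version of one column
        return False

    return [c for c in cells if not masked(c)]
```

The cells stay in storage in every case; only the read path filters them, until a compaction removes them for good.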
9. Explain the difference between HBase and Hive.
This is an advanced HBase interview question. HBase and Hive are completely different Hadoop-based technologies for data processing. Hive is a relational-like, SQL-compatible data warehouse framework, while HBase is a NoSQL key-value store. Hive acts as an abstraction layer on top of Hadoop with SQL support. HBase's data access pattern is very limited, with two primary read operations: get and scan. HBase is ideal for real-time data processing, whereas Hive is an ideal choice for batch data processing.
10. What are HLog and HFile?
HLog is the write-ahead log (WAL) file, and HFile is the actual data storage file. Data is first written to the write-ahead log and then to the MemStore. Once the MemStore is full, its contents are flushed to disk into HFiles.
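The write path described here can be sketched as a toy Python class. All names (`ToyRegion`, `flush_threshold`, `recover`) are hypothetical, the flush trigger is a simple row count rather than a size in bytes, and the recovery step is included only to show why the log is written first.

```python
class ToyRegion:
    """Toy write path: write-ahead log first, then MemStore, then flush to HFiles."""

    def __init__(self, flush_threshold=3):
        self.wal = []          # append-only log of every edit (stands in for the HLog)
        self.memstore = {}     # in-memory buffer: row -> value
        self.hfiles = []       # each flush produces one immutable, sorted "HFile"
        self.flushed_upto = 0  # number of WAL entries already persisted in HFiles
        self.flush_threshold = flush_threshold

    def put(self, row, value):
        self.wal.append((row, value))  # 1. log first, for durability
        self.memstore[row] = value     # 2. then buffer in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the MemStore out as a sorted file and start a fresh buffer."""
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.flushed_upto = len(self.wal)

    def recover(self):
        """After a crash, rebuild the MemStore by replaying unflushed WAL entries."""
        self.memstore = dict(self.wal[self.flushed_upto:])
```

If the process dies before a flush, the edits that existed only in memory can be rebuilt from the log, which is the reason every write goes to the log before the MemStore.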
This has been a guide to HBase interview questions and answers. Here we have covered a few commonly asked interview questions, with detailed answers, to help candidates crack interviews with ease.