Difference Between HBase and Cassandra
HBase is a database that uses the Hadoop Distributed File System (HDFS) for its storage. It is an important part of the Hadoop ecosystem and runs on top of a Hadoop cluster. HBase is not a traditional relational database, so it requires a different data modeling approach. Cassandra works on a data replication model, so if a node becomes unavailable, no data is lost. Cassandra is a distributed database, which means a client can access data through any node in the cluster.
Cassandra was started at Facebook to meet its always-on application requirements and was made available to the public in 2008. It was developed for always-on applications such as social networks like Facebook and Twitter.
Cassandra has an “always-on” architecture with an Active-Active node model, so there is no SPoF (Single Point of Failure). CQL (Cassandra Query Language) is Cassandra’s query language, and its syntax is similar to SQL. Cassandra supports all major operating systems, including Linux, Unix, OS X, and Windows.
Cassandra is a distributed database in which all nodes in the cluster are the same. Data is replicated to a configurable number of nodes, so the failure of some nodes does not result in data loss.
(Figure 1, Always-on model)
In Figure 1, all four nodes are in sync with each other and replicate the data within the cluster. All of them work in an Active-Active model, so the failure of any node does not result in data loss. A client can read the data from the remaining available nodes.
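The replication behaviour described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cassandra's actual implementation: the `Cluster` class, its method names, and the simple "hash the key, then take the next nodes in ring order" placement are all hypothetical and chosen only to show why a read still succeeds after a node fails.

```python
import hashlib


class Cluster:
    """Toy peer-to-peer cluster in which every node plays the same role."""

    def __init__(self, node_names, replication_factor):
        self.order = sorted(node_names)                  # fixed ring order
        self.nodes = {name: {} for name in node_names}   # node -> local key/value data
        self.alive = set(node_names)
        self.rf = replication_factor

    def replicas_for(self, key):
        # Hash the key to pick a starting node, then take the next rf nodes on the ring.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        start = int(digest, 16) % len(self.order)
        return [self.order[(start + i) % len(self.order)] for i in range(self.rf)]

    def write(self, key, value):
        # Store the value on every replica that is currently alive.
        for name in self.replicas_for(key):
            if name in self.alive:
                self.nodes[name][key] = value

    def read(self, key):
        # Any live replica holding the key can serve the read.
        for name in self.replicas_for(key):
            if name in self.alive and key in self.nodes[name]:
                return self.nodes[name][key]
        raise KeyError(key)

    def fail(self, name):
        self.alive.discard(name)


cluster = Cluster(["node1", "node2", "node3", "node4"], replication_factor=3)
cluster.write("user:42", "Alice")
cluster.fail(cluster.replicas_for("user:42")[0])  # take down one replica
print(cluster.read("user:42"))                    # prints "Alice"
```

With a replication factor of 3, the value survives the loss of up to two of its replicas in this sketch. The data becomes unreachable only when every replica of a key is down.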
HBase is a NoSQL database designed for processing queries on large tables with billions of rows and millions of columns, running across a cluster of commodity hardware. It provides real-time query capabilities with the speed of a “key/value store”.
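HBase is based on a four-dimensional data model: a value is addressed by its row key, column family, column qualifier, and version (timestamp). The main building blocks are:
- Row ID/Row Key
- Column Family
- Key-value pairs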
(Figure 2, Example schema of the table in HBase.)
In Figure 2, a table is a collection of column families, and a column family is a collection of columns. Columns are collections of key-value pairs.
(Figure 3, Sample Table in HBase)
In Figure 3, the column families hold the alumni students’ data, and the Row IDs (Row Keys) contain the students’ roll numbers.
In fact, each row key is a unique value that identifies a row of column family data. Using the row key, one can retrieve the entire row, which is one reason column-oriented databases can be much faster than traditional databases for key-based lookups.
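The table structure described above can be pictured as nested maps. The following Python sketch is only an in-memory illustration, not the HBase client API: the `ToyTable` class, the `personal` and `academic` column families, and the sample roll number and values are all made up for the example. It assumes the usual HBase layout of row key, then column family, then column qualifier, then timestamped versions.

```python
import time
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for an HBase-style table (illustration only)."""

    def __init__(self, column_families):
        self.column_families = set(column_families)
        # row key -> column family -> column qualifier -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

    def put(self, row_key, family, qualifier, value, timestamp=None):
        # Column families are fixed when the table is created; qualifiers are free-form.
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        ts = timestamp if timestamp is not None else time.time_ns()
        self.rows[row_key][family][qualifier][ts] = value

    def get(self, row_key):
        """Return the newest version of every cell stored under row_key."""
        row = self.rows.get(row_key, {})
        return {
            family: {q: versions[max(versions)] for q, versions in columns.items()}
            for family, columns in row.items()
        }


# Hypothetical alumni table keyed by roll number.
alumni = ToyTable(["personal", "academic"])
alumni.put("roll-001", "personal", "name", "Asha", timestamp=1)
alumni.put("roll-001", "academic", "degree", "B.Sc", timestamp=1)
alumni.put("roll-001", "academic", "degree", "M.Sc", timestamp=2)

print(alumni.get("roll-001"))
# {'personal': {'name': 'Asha'}, 'academic': {'degree': 'M.Sc'}}
```

A single lookup by row key returns all of that row's column families at once, and the older `B.Sc` value is still stored as an earlier version of the same cell.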
Apache HBase can be used for random read/write access, and it provides failure support. It also supports replication and works on a distributed database model.
Head to Head Comparison of HBase and Cassandra (Infographics)
Below are the top 9 differences between HBase and Cassandra:
Key Differences Between HBase and Cassandra
Below is a list of points that describe the key differences between HBase and Cassandra:
1) For internal node communication, Cassandra uses the Gossip protocol, while HBase relies on ZooKeeper. The Gossip protocol services are integrated into Cassandra, whereas ZooKeeper is an entirely separate distributed application.
2) In the Cassandra architecture, all nodes work as active nodes, while the HBase architecture follows a Master-Slave node model. In the Active-Active node model, there is no SPoF (Single Point of Failure). In HBase, if the Master node goes down and no second Master is configured, the entire cluster will not be accessible.
3) HBase keeps rows sorted by row key and supports a B-Tree-style searching model, while Cassandra doesn’t support the B-Tree model. Without a B-Tree, you can’t search a Users column family for everyone with an anniversary in April on its own, although you can search for everyone who lives in Beijing with an anniversary in April, because that query also contains an exact match on the city.
5) HBase has a feature called coprocessors, while Cassandra has no such feature as of now. Coprocessors provide a library and run-time environment for executing user code within the HBase region server and master processes.
8) Managing Cassandra is much easier than managing HBase. In Cassandra, a single Java process needs to run per node, while HBase requires a fully operational HDFS, several HBase processes, and a ZooKeeper system.
9) HBase does end-to-end checksums and automatic rebalancing, while Cassandra doesn’t support rebalancing of the cluster as a whole.
10) Based on the “CAP theorem”, Cassandra follows the AP model, while HBase follows the CP model.
The CAP theorem applies to distributed systems. C stands for Consistency, A for Availability, and P for Partition Tolerance. Each is explained below:
C (Consistency): Consistency means that once someone has written a value to the database, others can immediately read that same value.
A (Availability): Availability means that if some nodes in the cluster are unavailable (down or not live because of some issue), the rest of the cluster is not affected, and the distributed system or database remains available for accessing the data. The cluster stays accessible for all kinds of tasks.
P (Partition Tolerance): Partition tolerance means that the system keeps operating even when network communication between nodes is broken, for example when one data center goes down or becomes unreachable. This should not affect the data present on the remaining nodes, which should stay accessible, and it is why replicating data to other data centers within the cluster matters.
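The trade-off between a CP system and an AP system can be illustrated with a deliberately simplified simulation. The `TwoNodeStore` class below is hypothetical and does not describe how HBase or Cassandra handle partitions internally. It assumes just two replicas and a single `partitioned` flag, and shows only the choice each model makes when the replicas cannot reach each other.

```python
class TwoNodeStore:
    """Two replicas of one value; `partitioned` means they cannot communicate."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.values = [None, None]   # one slot per replica
        self.partitioned = False

    def write(self, replica_index, value):
        if not self.partitioned:
            # Healthy network: the write reaches both replicas.
            self.values = [value, value]
            return True
        if self.mode == "CP":
            # CP: refuse the write rather than let the replicas disagree.
            return False
        # AP: accept the write locally; the other replica is now stale.
        self.values[replica_index] = value
        return True

    def read(self, replica_index):
        return self.values[replica_index]


for mode in ("CP", "AP"):
    store = TwoNodeStore(mode)
    store.write(0, "v1")             # replicated to both nodes
    store.partitioned = True         # network partition begins
    accepted = store.write(0, "v2")  # client writes to replica 0
    print(mode, accepted, store.read(0), store.read(1))
# CP False v1 v1
# AP True v2 v1
```

In the CP run both replicas still agree, but the write was rejected. In the AP run the write succeeded, but a reader on the other replica sees the old value until the partition heals.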
HBase and Cassandra Comparison Table
Following is the comparison table between HBase and Cassandra.
|Basis of Comparison|HBase|Cassandra|
|---|---|---|
|CAP Theorem|Consistency and Partition Tolerance|Availability and Partition Tolerance|
|Rebalancing|HBase provides automatic rebalancing within a cluster.|Cassandra also provides rebalancing, but not for the overall cluster.|
|Architecture Model|It is based on a Master-Slave architecture model.|Cassandra is based on an Active-Active node model.|
|Base of Database|It is based on Google BigTable.|Cassandra is based on Amazon Dynamo.|
|SPoF (Single Point of Failure)|If the Master node is not available, the entire cluster will not be accessible.|All nodes have the same role within the cluster, so there is no SPoF.|
|DR (Disaster Recovery)|DR is possible if two Master nodes are configured.|Yes, as all nodes have the same role.|
|HDFS Compatibility|Yes, as HBase stores its data in HDFS.|No|
|Consistency|Strong|Not as strong as HBase|
Applications that put availability first, such as social networking sites, would tend to prefer Cassandra (Facebook, which earlier used Cassandra, later moved to HBase; refer to the Facebook post), whereas the banking sector, which needs consistency and security for every financial transaction, would tend to select HBase over Cassandra.
Cassandra’s key characteristics are high availability, minimal administration, and no SPoF (Single Point of Failure), whereas HBase is good for fast reads and writes with linear scalability.
Companies like Verizon, Bloomberg, Bank of America, and many more are using HBase, while Cassandra is used by major social networking sites such as Twitter and Facebook.
We can’t conclude which one is best; HBase and Cassandra each have their own advantages and disadvantages. The actual performance of both databases can only be seen in a production environment.