What is NoSQL
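Before going into Cassandra, let’s first understand what NoSQL is. A NoSQL database stores and retrieves data that comes in various forms, not only the tabular form used by relational databases. NoSQL databases typically offer a flexible schema, a simple design, horizontal scaling and high availability, and they can handle very large amounts of data. Many of them, Cassandra included, relax strict consistency in favour of availability.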
What is Cassandra
Apache Cassandra is a free, open source, distributed database system used to manage very large amounts of structured, semi-structured and unstructured data. It is highly scalable and can be used as a real-time operational data store as well as a read-intensive database for large-scale businesses. Cassandra is designed so that there is no single point of failure, because all of its nodes are symmetric peers. Cassandra became popular in a short period of time because of its technical features and consistent performance. It can store hundreds of terabytes of data and offers a flexible data model.
Features of Cassandra
Some of the key features of Cassandra are listed below:
- Fault tolerant
- Architecture that has no single point of failure
- Linear scale performance
- Flexible data storage
- Easy data distribution
- Lightweight transaction support (row-level atomicity and isolation rather than full multi-table ACID transactions)
- Fast writes
After successful completion of this course you will be able to:
- Understand the basics of Cassandra
- Install and configure Cassandra
- Understand the architecture of Cassandra
- Learn about various Cassandra monitoring and administration techniques
- Understand the nodes in the cluster and how they work
- Understand the various modelling techniques and the key components of reading and writing data
Prerequisites for taking this course
This tutorial does not require many skills. You only need a basic knowledge of Java programming. Prior experience with database concepts, SQL syntax or Linux commands is an added advantage.
Target Audience for this course
This tutorial is useful for software professionals, and anyone with an interest in Cassandra can also take this course.
Section 1: Cassandra Developer
Agenda – Cassandra
This chapter gives an overall introduction to Cassandra. It covers the difference between relational databases and NoSQL databases, the history of Cassandra, its features, and its applications and uses in various fields.
The architecture of Cassandra is designed to handle big data workloads across multiple nodes. Cassandra has a peer-to-peer distribution system, which is elegant and easy to set up and maintain. There is no concept of a master node in Cassandra. Cassandra’s architecture is capable of handling petabytes of information. The components of Cassandra explained in this chapter are
- Data centre
- Commit Log
- Bloom Filter
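To make the Bloom filter component more concrete, here is a minimal sketch in Python. It is an illustration only and not Cassandra's implementation: the class name, the bit-array size and the use of salted SHA-256 hashes are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter()
for row_key in ("user:1", "user:2", "user:3"):
    bloom.add(row_key)

print(bloom.might_contain("user:2"))   # True: added keys are always found
```

A negative answer lets a reader skip a data file entirely, while a positive answer still has to be confirmed by looking in the file.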
Cassandra’s architecture also provides automatic distribution of data across all the nodes in a ring. In addition, Cassandra provides built-in, customizable replication, which stores redundant copies of data on multiple nodes. Replication in Cassandra is easy to configure. This chapter deals with distributing and replicating data in Cassandra.
The other topics included in this chapter are
- Multi Datacenter and Cloud Support
- Reading and Writing Data
- Data Consistency
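The idea of distributing rows around a ring and replicating them on neighbouring nodes can be sketched as follows. This is a simplified model under stated assumptions: one token per node, a 32-bit token space derived from SHA-256, and hypothetical node names. Cassandra's real partitioners and placement strategies are more elaborate.

```python
import bisect
import hashlib


def token_for(key):
    """Map a key to a position on the ring (hypothetical 32-bit token space)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class Ring:
    def __init__(self, node_names, replication_factor):
        if replication_factor > len(node_names):
            raise ValueError("replication factor exceeds number of nodes")
        self.replication_factor = replication_factor
        # Each node owns one token here; sorting gives the ring order.
        self.tokens = sorted((token_for(name), name) for name in node_names)

    def replicas_for(self, partition_key):
        """Walk clockwise from the key's token and collect the replica nodes."""
        key_token = token_for(partition_key)
        start = bisect.bisect_left(self.tokens, (key_token, ""))
        count = len(self.tokens)
        return [self.tokens[(start + i) % count][1]
                for i in range(self.replication_factor)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
replicas = ring.replicas_for("user:42")
print(len(replicas))   # 3
```

Because every node can compute the same mapping, no master is needed to decide where a row lives.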
Introduction to Data Model
The data model of Cassandra is very different from the relational model used by an RDBMS. It is designed to distribute data on a very large scale. In this chapter we take a bottom-up approach to understanding Cassandra’s data model. In Cassandra, the keyspace is the container for your data, and a keyspace contains one or more column families. The topics covered in this chapter are
- Keyspace – Replication factor, Replica placement strategy, Column families
- Column Family – keys_cached, rows_cached, preload_row_cache
- Super Column
- Difference between data models of Cassandra and RDBMS
Data Model Queries
Planning a data model in Cassandra involves different design considerations from a relational design. The data model depends on the data you want to capture and how you want to access it. The best way to design a data model in Cassandra is to start with the queries: think about what actions need to be taken and how the data will be accessed, and only then design the column families.
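A query-first design can be sketched in plain Python, with one dictionary standing in for each column family. The two queries, the table names and the sample data are hypothetical; the point is only that the same record is written once per query it must serve.

```python
from collections import defaultdict

# Hypothetical queries the application must answer:
#   Q1: all videos uploaded by a given user
#   Q2: all videos carrying a given tag
videos_by_user = defaultdict(list)   # partition key: user
videos_by_tag = defaultdict(list)    # partition key: tag


def add_video(user, title, tags):
    """Write the same video into every table that serves a query."""
    videos_by_user[user].append(title)
    for tag in tags:
        videos_by_tag[tag].append(title)


add_video("alice", "Intro to CQL", ["cassandra", "cql"])
add_video("bob", "Ring basics", ["cassandra"])

print(videos_by_user["alice"])       # ['Intro to CQL']
print(videos_by_tag["cassandra"])    # ['Intro to CQL', 'Ring basics']
```

Each read then touches a single partition of a single table, at the cost of storing the data more than once.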
In Depth CQL
This chapter gives an introduction to the Cassandra Query Language (CQL) and explains how to use its commands. The CQL shell, cqlsh, can be used to define a schema, insert data and execute queries. It is started with the command cqlsh.
- Start cqlsh options – cqlsh --help, cqlsh --version, cqlsh --color, cqlsh --debug, cqlsh --execute cql_statement, cqlsh --file="file name"
- Documented shell commands – HELP, CAPTURE, CONSISTENCY, COPY, DESCRIBE, EXPAND, EXIT, PAGING, SHOW, SOURCE, TRACING
- CQL Data Definition Commands – CREATE KEYSPACE, USE, ALTER KEYSPACE, DROP KEYSPACE, CREATE TABLE, ALTER TABLE, DROP TABLE, TRUNCATE, CREATE INDEX, DROP INDEX
- CQL Data Manipulation Commands – INSERT, UPDATE, DELETE, BATCH
- CQL Clauses – SELECT, WHERE, ORDER BY
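As a small illustration of the commands listed above, the following Python sketch holds a few CQL statements and builds the matching non-interactive cqlsh command lines using the --execute option. The keyspace name demo and the table users are hypothetical, and the script only prints the commands; it does not contact a cluster.

```python
import shlex

# Hypothetical keyspace and table names, used only for illustration.
STATEMENTS = [
    "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 1};",
    "CREATE TABLE IF NOT EXISTS demo.users "
    "(user_id int PRIMARY KEY, name text);",
    "INSERT INTO demo.users (user_id, name) VALUES (1, 'alice');",
    "SELECT name FROM demo.users WHERE user_id = 1;",
]


def cqlsh_command(statement):
    """Build the argument list for one non-interactive cqlsh call."""
    return ["cqlsh", "--execute", statement]


for statement in STATEMENTS:
    # Print a copy-pasteable shell line; nothing is executed here.
    print(shlex.join(cqlsh_command(statement)))
```

The same statements can also be typed directly at the cqlsh prompt, or saved to a file and passed with the --file option.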
Data modelling is one of the most important steps in ensuring the performance of Cassandra applications. Data modelling is the process of identifying the data access patterns and the queries that have to be performed. The other topics included in this section are
- Rules of Cassandra Data Modelling
- Data Modelling concepts, principles and methodology
- Time series data modelling
- Examples of data modelling
In a relational database, combining data from many tables requires complex SQL joins, which can make queries slow as the data grows. Cassandra does not support joins. Instead, data is denormalized so that each query can be answered from a single table with a simple statement, which keeps reads fast. This section deals with how such complex queries are handled in Cassandra.
A whiteboard-friendly design is one where the data model fits the database’s constraints exactly as it was first sketched, with no translation steps needed after the initial mapping of the data model.
Section 2: Cassandra Administration
Introduction to Architecture
The Cassandra architecture is sophisticated and relies on several theoretical constructs. The topics included in this chapter are briefly described here:
- System Keyspace – Cassandra has an internal keyspace called system which is used to store the metadata. The metadata of Cassandra includes the node’s token, cluster name, keyspace and schema definitions and migration data
- Peer to Peer – Cassandra has a peer to peer distribution model and so there is no master node. This peer to peer model improves the database availability. This design also makes it easy to add new nodes
- Gossip and Failure detection – A gossip protocol is used in Cassandra to support decentralization and partition tolerance. In this section you will learn about the working process of gossip.
- Anti Entropy and Read Repair – Anti Entropy is the replica synchronization mechanism in Cassandra. This topic also contains details about read repair and how it is performed.
- Memtables, SSTables and Commit Logs – The commit log is the crash recovery mechanism in Cassandra: every write is appended to it first. The write is then applied to a memory-resident data structure called a memtable. When a memtable reaches its size threshold, its contents are flushed to disk as an immutable file called an SSTable. The compaction operation merges several SSTables into one.
- Bloom Filters – These are probabilistic data structures that boost read performance by letting Cassandra skip SSTables that cannot contain the requested key.
- Staged Event Driven Architecture (SEDA) – SEDA is an architecture for highly concurrent internet services.
- Managers and Services
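The write path through the commit log, memtable and SSTables can be sketched with a toy storage engine. This is a simplified in-memory model, not Cassandra's code: the class name, the flush threshold of two entries and the use of dictionaries in place of on-disk files are assumptions made for illustration.

```python
class Node:
    """Toy storage engine: commit log, one memtable, immutable SSTables."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.commit_log = []   # append-only record used for crash recovery
        self.memtable = {}     # in-memory writes not yet flushed
        self.sstables = []     # flushed tables, oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. durable append
        self.memtable[key] = value             # 2. in-memory update
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # 3. Persist the memtable as a sorted table that is never modified.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replay

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable wins
            if key in table:
                return table[key]
        return None

    def compact(self):
        # Merge all SSTables into one, keeping the newest value per key.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [dict(sorted(merged.items()))]


node = Node()
node.write("a", 1)
node.write("b", 2)   # memtable reaches the limit and is flushed
node.write("a", 3)
node.write("c", 4)   # second flush
node.compact()
print(node.read("a"), len(node.sstables))   # 3 1
```

Writes are fast in this design because they only append to a log and update memory; the cost of reorganising data is deferred to flushes and compaction.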
When planning a Cassandra cluster deployment, you should first have an idea of the amount of data you need to store and an estimate of the workload. This chapter deals with selecting the hardware: RAM, CPU, disk and network. The other topics included in this section are
- Planning an Amazon EC2 cluster
- Capacity Planning
- Choosing Node Configuration Options
- Snitch Settings
- Choosing Keyspace Replication Options
Replication is the process of storing multiple copies of data on different nodes. When you create a keyspace, the replica placement strategy determines how the replicas are distributed across the cluster, and the replication factor is the total number of replicas kept for each row. The other topics included in this chapter are
- Replica Placement Strategy – SimpleStrategy, NetworkTopologyStrategy
- Snitches – Simple Snitch, DSE Simple Snitch, Rack Inferring Snitch, Property File Snitch, EC2 Snitch, EC2 Multi Region Snitch, Dynamic Snitch
- Client requests
- About Write Requests
- About Read Requests
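The relationship between the replication factor and read and write requests can be made concrete with a little arithmetic. The sketch below assumes the usual quorum rule, a majority of the replicas, and the overlap condition that reads see the latest write when the number of replicas read plus the number written exceeds the replication factor. The function names are hypothetical.

```python
def quorum(replication_factor):
    """Replicas that must respond at consistency level QUORUM."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set overlaps every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
print(q)                                   # 2
print(reads_see_latest_write(rf, q, q))    # True
print(reads_see_latest_write(rf, 1, 1))    # False
```

With a replication factor of 3, quorum reads and writes tolerate one unavailable replica while still overlapping on at least one node.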
Sharding is a technique used to scale a relational database horizontally. To shard your data, you first need a way to order or partition your records. Sharding can be considered a kind of shared-nothing architecture, in which the system is decentralized and each node is independent. How well sharding scales depends on the strategy you select. There are three basic strategies for determining the shard structure:
- Feature Based Shard or Functional Segmentation
- Key Based Sharding
- Lookup Table
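The three strategies above can be contrasted in a short sketch. The shard names, keys and the choice of CRC32 as the hash are assumptions for illustration; a real system would pick these to suit its data.

```python
import zlib

SHARDS = ["shard-0", "shard-1", "shard-2"]   # hypothetical shard names


def key_based_shard(record_key):
    """Key-based sharding: a stable hash of the key selects the shard."""
    return SHARDS[zlib.crc32(record_key.encode("utf-8")) % len(SHARDS)]


# Lookup-table sharding: an explicit directory maps keys to shards,
# so individual keys can be moved without rehashing everything.
lookup_table = {"customer:1": "shard-2", "customer:2": "shard-0"}


def lookup_shard(record_key):
    return lookup_table[record_key]


# Feature-based sharding: each functional area gets its own shard.
feature_shards = {"orders": "shard-0", "profiles": "shard-1"}

print(key_based_shard("customer:1") in SHARDS)   # True
print(lookup_shard("customer:1"))                # shard-2
print(feature_shards["orders"])                  # shard-0
```

Key-based sharding needs no directory but makes it harder to move individual records, while a lookup table trades an extra lookup for that flexibility.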
Performance Monitoring Strategies
Performance is a major reason for Cassandra’s popularity. One of Cassandra’s hallmarks is its performance in read and write operations, and it is widely regarded as scaling well compared with other NoSQL databases. By monitoring the performance of Cassandra you can identify weak spots and resource limitations. The areas where performance monitoring is essential in Cassandra are
- Read and write requests
- Read and write latency
- Disk usage
- Garbage collection frequency and duration
- Errors and Overruns
Performance Monitoring Strategies continued
Many performance metrics are available, and they can be collected through a variety of tools. A few of them are explained in detail in this chapter:
- Throughput – Read throughput, Write Throughput
- Latency – Read Latency, Write Latency, Key Cache Latency
- Disk Usage – Load, Total disk space used, Complete Compaction tasks, Pending Compaction tasks
- Garbage collection – ParNew count, ParNew time, ConcurrentMarkSweep count, ConcurrentMarkSweep time
- Errors and Overruns – Timeout Exceptions, Unavailable Exceptions, Pending Exceptions, Currently Blocked Tasks
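To show how throughput and latency figures relate to raw measurements, here is a minimal sketch that derives them from a list of request timings. The sample values and the two-second window are invented for illustration; in practice these numbers come from a monitoring tool rather than a hand-written list.

```python
import statistics

# Hypothetical samples: latency in milliseconds for each read request
# completed during a 2-second observation window.
read_latencies_ms = [1.0, 0.5, 1.5, 32.0, 1.0, 0.75, 1.25, 0.5, 1.5, 1.0]
window_seconds = 2.0

throughput = len(read_latencies_ms) / window_seconds   # requests per second
mean_latency = statistics.mean(read_latencies_ms)
median_latency = statistics.median(read_latencies_ms)
worst_latency = max(read_latencies_ms)

print(throughput)       # 5.0
print(median_latency)   # 1.0
print(mean_latency)     # 4.1
print(worst_latency)    # 32.0
```

The single slow request pulls the mean well above the median, which is why latency is usually tracked with medians or percentiles in addition to averages.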
FAQs: General Questions
- Why should I get certified in Apache Cassandra?
The demand for Apache Cassandra and NoSQL skills is growing rapidly, and Cassandra developers are among the better-paid specialists in database technology. Getting certified in Cassandra will increase your confidence in your knowledge of Cassandra and support your career growth. You can also add your name to Certified Cassandra Developers groups on professional networking sites such as LinkedIn.
- What benefits will I get from this course?
In this course you will learn how to perform common database operations such as creating, inserting and deleting data. You will learn how to monitor your databases and how cluster nodes work in Cassandra. After this course you will be able to perform database operations in Cassandra with confidence.
It was a wonderful experience learning from eduCBA. This is a great course with all the basics of Cassandra development and administration. Each topic in the course is well structured and well explained, with examples wherever needed. This course helped me to enhance my knowledge of Cassandra and start my career in Cassandra with great confidence. I am greatly impressed with this course and would definitely recommend it to others.
I took this course two weeks back and it is an amazing course. The course starts with the basic concepts and flows into deeper concepts of Cassandra. This is suited for both beginners as well as professionals. One can start up working with Cassandra like an expert after taking this course. The topics of the course are self explanatory and easy to understand. Overall an excellent course for Cassandra developers.
Where do our learners come from?
Professionals from around the world have benefited from eduCBA’s Comprehensive Cassandra Developer & Administration Training courses. Some of the top places that our learners come from include New York, Dubai, San Francisco, Bay Area, New Jersey, Houston, Seattle, Toronto, London, Berlin, UAE, Chicago, UK, Hong Kong, Singapore, Australia, New Zealand, India, Bangalore, New Delhi, Mumbai, Pune, Kolkata, Hyderabad and Gurgaon, among many others.