Diving into Cassandra - Lamborghini of NoSQL

Shubham Kumar
5 min read · Jan 15, 2023

Apache Cassandra® is a distributed NoSQL database used by many of the world's largest companies, including Apple, Facebook, and Netflix. These companies process large volumes of fast-moving data reliably and at scale, and Cassandra is a big part of what makes that possible.

Why do we need NoSQL?

Relational database management systems (RDBMS) have dominated the market for ages. Then data demand skyrocketed, growing nearly 15 times, and relational databases started to run into performance issues. This gave birth to NoSQL.

Growing Data Demand

NoSQL not only solved the problem of handling massive data volumes; it also supports higher speed requirements and rapidly changing data types. Cassandra was originally developed at Facebook, was open-sourced in 2008, and became a top-level Apache project in 2010.

Why Cassandra? What’s so special?

Cassandra scales horizontally with practically no upper limit and delivers excellent performance at scale, which makes it the Lamborghini of NoSQL. It is a peer-to-peer system with no leader node.

Features that make Cassandra so powerful:

  1. Infinite Scaling: Data is partitioned and distributed across nodes, so the database can handle datasets of practically any size, up to petabytes. We can scale further by adding more nodes.
  2. Excellent performance: A single node is very performant, but a cluster with multiple nodes and data centers brings throughput to the next level. Decentralization (leaderless architecture) means that every node can deal with any request, read or write.
  3. Linear scalability: There are no built-in limits on volume or velocity, and new nodes add capacity without extra overhead. Cassandra scales with your needs.
  4. Highest availability: Theoretically, you can achieve 100% uptime thanks to replication, decentralization, and a topology-aware placement strategy.
  5. Fault Tolerant And Healing: Operating a huge cluster can be exhausting. Cassandra clusters relieve a lot of that headache because they can scale, rebalance data placement, and recover largely automatically.
  6. Global Distribution: Multi-data center deployments grant an exceptional capability for disaster tolerance while keeping your data close to your clients, wherever they are in the world.
  7. Platform Independent: Cassandra is not bound to any platform or service provider, which allows you to build hybrid-cloud and multi-cloud solutions with ease.
  8. Open Source: Cassandra doesn’t belong to any commercial vendor. It is maintained under the non-profit Apache Software Foundation, which ensures both open availability and continued development.

Some of the major engineering feats achieved using Cassandra are:

  1. Netflix runs 30 million ops/second on its most active single cluster, and 98% of its streaming data is stored on Cassandra.
  2. Google Cloud Platform achieved one million writes per second using Cassandra on 330 VMs and, as the cherry on top, it took only 70 minutes to get there from scratch.
  3. Apple runs 160,000+ Cassandra instances with thousands of clusters.

How Does Cassandra Work?

Traditional databases use a leader-follower architecture, whereas Cassandra uses a leaderless peer-to-peer architecture, which removes the single point of failure and makes it fault-tolerant. Cassandra distributes the load across all the nodes in a cluster (a ring).

Distributed Data across the Ring

Each node in the ring works as a peer and stores a portion of the data. Nodes exchange information about themselves to keep a consistent view of the cluster. This also keeps the data available, in theory 100% of the time, if any one node fails.

Data replication is another of its features. Replication in Cassandra is controlled by the replication factor (RF). RF=3 means each piece of data is stored on 3 nodes, and it is widely treated as the industry standard.
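To make the replication factor concrete, here is a minimal sketch of how replicas could be picked on a ring: find the node that owns the row's token, then keep walking clockwise until RF distinct nodes are chosen. The node names, the tiny token values, and the `replicas_for` helper are all made up for illustration; this mimics the simplest placement idea and is not Cassandra's actual code.

```python
from bisect import bisect_left

def replicas_for(token, ring, rf):
    """Return up to rf distinct nodes that would store a row with this token.

    ring is a list of (token, node_name) pairs sorted by token. In this
    sketch each node owns the range that ends at its own token.
    """
    tokens = [t for t, _ in ring]
    distinct_nodes = {name for _, name in ring}
    # First node whose token is >= the row token, wrapping around to index 0.
    index = bisect_left(tokens, token) % len(ring)
    chosen = []
    while len(chosen) < min(rf, len(distinct_nodes)):
        node = ring[index % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
        index += 1  # keep walking clockwise around the ring
    return chosen

# Hypothetical 4-node ring with small tokens so the walk is easy to follow.
ring = [(0, "node-a"), (25, "node-b"), (50, "node-c"), (75, "node-d")]
print(replicas_for(30, ring, rf=3))
```

With RF=3 on this 4-node ring, any single node can go down and two copies of every row are still reachable.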

Is Cassandra AP or CP?

Before answering that, let's look at the characteristics involved. In case of a failure, a distributed database can guarantee only two out of three characteristics: Consistency, Availability, and Partition Tolerance. You can read about them here.

Being a distributed database, Cassandra should continue to work in case of system failures or data losses. Once partition tolerance is a given, a database can be either highly available or consistent. Cassandra prioritizes availability over consistency, which makes it an AP system. But we can fine-tune the balance between these characteristics.
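The fine-tuning comes down to simple arithmetic on how many replicas must acknowledge a read or a write. The sketch below uses Cassandra's consistency level names (ONE, TWO, THREE, QUORUM, ALL), but the helper functions are hypothetical and written only for this article. The rule of thumb it checks: a read is guaranteed to see the latest acknowledged write when read replicas + write replicas > RF.

```python
def quorum(rf):
    """A quorum is a majority of the replicas."""
    return rf // 2 + 1

def replicas_required(level, rf):
    """Number of replicas that must respond for a given consistency level."""
    required = {"ONE": 1, "TWO": 2, "THREE": 3, "QUORUM": quorum(rf), "ALL": rf}
    return required[level]

def read_sees_latest_write(read_level, write_level, rf):
    """True when the read set and the write set must share at least one replica."""
    return replicas_required(read_level, rf) + replicas_required(write_level, rf) > rf

rf = 3
for read_level, write_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
    overlap = read_sees_latest_write(read_level, write_level, rf)
    print(f"read={read_level} write={write_level} overlap={overlap}")
```

Lower levels such as ONE favor availability and latency, while QUORUM on both sides trades some of that for stronger consistency.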

Cassandra Core Working

In Cassandra, every node can serve both reads and writes, so an application can contact any server to process data. Cassandra uses key-based partitioning.

The main components of the Database are:

  • Keyspace: A container of data, similar to a schema, which contains several tables.
  • Table: A set of columns, primary keys, and rows storing data in partitions.
  • Partition: A group of rows together with the same partition token (a base unit of access in Cassandra).
  • Row: A single, structured data item in a table.

Data Structure of Cassandra

Cassandra organizes data into partitions, each of which represents a set of rows in a table across a cluster. Each row includes a partition key, which is one or more columns that are hashed to define how data is divided throughout the cluster’s nodes.
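As a rough mental model, a table can be pictured as rows grouped by the value of their partition key. The sketch below does this with plain Python dictionaries; the `sensor_readings`-style rows, the column names, and the `group_into_partitions` helper are hypothetical and only show the grouping idea, not how Cassandra stores data on disk.

```python
from collections import defaultdict

# Hypothetical rows of a sensor readings table (illustrative names only).
rows = [
    {"sensor_id": "s1", "reading_time": 1, "temperature": 20.5},
    {"sensor_id": "s2", "reading_time": 1, "temperature": 18.0},
    {"sensor_id": "s1", "reading_time": 2, "temperature": 20.7},
]

def group_into_partitions(rows, partition_key_columns):
    """Group rows by the values of their partition key columns."""
    partitions = defaultdict(list)
    for row in rows:
        # The partition key may be one column or several, so use a tuple.
        key = tuple(row[column] for column in partition_key_columns)
        partitions[key].append(row)
    return dict(partitions)

partitions = group_into_partitions(rows, ["sensor_id"])
for key, partition_rows in partitions.items():
    print(key, len(partition_rows), "row(s)")
```

All rows that share a partition key end up in the same partition, which is why a query that supplies the partition key can be answered from one place.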

After you choose a partition key for your table, a partitioner converts each partition key value to a token (a process known as hashing), and each node is allocated a range of tokens.

Then, Cassandra automatically distributes each row of data across the cluster by its token value. Simply add a new node whenever you need to scale up, and your data will be redistributed in accordance with the new token range allocations. You can scale back down just as easily.
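The token idea can be sketched in a few lines. Assumptions to keep in mind: MD5 is used here only as a convenient stand-in hash (it is not claimed to be the partitioner your cluster uses), each node gets a single token derived from its name, and the node names and the `token_for`, `build_ring`, and `owner` helpers are invented for this example.

```python
import hashlib
from bisect import bisect_left

def token_for(partition_key):
    """Hash a partition key to an integer token (MD5 as a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def build_ring(node_names):
    """Give each node one token derived from its name, sorted by token."""
    return sorted((token_for(name), name) for name in node_names)

def owner(token, ring):
    """The owner is the first node whose token is >= the row token (wrapping)."""
    tokens = [t for t, _ in ring]
    return ring[bisect_left(tokens, token) % len(ring)][1]

keys = [f"user-{i}" for i in range(1000)]
old_ring = build_ring(["node-a", "node-b", "node-c"])
new_ring = build_ring(["node-a", "node-b", "node-c", "node-d"])

# Keys whose owner changes after node-d joins the ring.
moved = [k for k in keys
         if owner(token_for(k), old_ring) != owner(token_for(k), new_ring)]
print(f"{len(moved)} of {len(keys)} keys moved after adding node-d")
```

In this sketch, the only keys that change owner are the ones that land in the new node's token range; everything else stays where it was, which is what keeps scaling up cheap.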

Before creating a data model, data architects need to know how to choose a partition key that lets queries return accurately and quickly. Once you’ve set a primary key for your table, it cannot be changed. To use a different partition key, you need to create a new table and migrate all the data.

Conclusion

I hope you loved reading this article. Here are some of the interesting articles you can check out:

  1. How to build your own CDN
  2. AngularJS Vs ReactJS Vs EmberJS

Follow, clap, and share to get the latest article updates.

You can also connect with me on LinkedIn or Instagram.

Written by Shubham Kumar

SDE-2 Expedia, ex-Airtel, Nagarro | OSSF London Scholar 2021 | HGF Scholar 2020 | Facebook F8 Scholar 2019 | Fossasia Finalist Winner 2019