The Cassandra Data Model Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

If you are a Facebook user then you must have seen the Inbox search feature. This is where Cassandra started. Cassandra is a distributed storage system which is highly scalable, fault tolerant,

decentralized, elastic, durable and proven. Why proven? It is being currently used in many solution some of them being Facebook, Digg, Rackspace, Raddit and many more to store data. It basically targets solutions that need to scale incrementally while remaining cost effective. This means solution for loads of data where there are countless incoming request resulting in random reads and writes. To achieve this Cassandra weakens the ACID properties while embracing the CAP theorem. Cassandra is a blend of Amazon's Dynamo and Google's Big Table.

The data model in Cassandra is a distributed multi-dimensional map indexed by a key. The value is a highly structured object. The row key in a table is a string with typically 16 to 36 bytes long. Every operation for a single row key is atomic for every replica regardless of how many columns are being read or written into.

Columns are grouped together into sets of columns called column families. In Cassandra there are two kinds of columns families, Simple and Super column families. Super column families can be visualized as a column family within a column family. Applications can specify the sort order of columns within a Super Column or Simple Column family. The system allows columns to be sorted either by time or by name. Any column within a column family is accessed using the convention column family: column and any column within a column family that is of type super is accessed using the convention column family: super column: column.

Mapping Cassandra Data Model to Java Class


A column in Cassandra is a node\set of name, value and a timestamp and is referred to as a tuple or triplet. This is the smallest structure or container of data that exists in Cassandra.

If the structure is represented in Java it would be as follows:

public class AColumn {

  Byte[] c_name;

  Byte[] c_value;

  Long c_timestamp;



A Super column is Cassandra is a container of the Columns. It is a tuple with a name and value where the value is a Map containing Columns entries. In this Map, the key has the value of the Column name and the value is the column entry itself. The super column does not have a timestamp like the Column entry.

SuperColumn in Java would be something as follows

public class ASuperColumn {

  Byte[] name;

  Map<Byte[] c_name, Column> value;


Example for the same would be:

ASuperColumn sc = new SuperColumn(); = "person1";sc.put("firstname", new Column("firstname", "Sagar");

sc.put("familyname", new Column("familyname", "Bhosale");


The Column Family is analogous to the table in RDBMS. The Column family is a container for the Super Columns. The Column family maintains a Map where the keys represent the row identifiers and values are Super Columns i.e. Maps containing Columns.

In Java a Column family would be represented as

public class AColumnFamily {

Byte[] name;

Map<Byte[] row_identifier, Map<Byte[] c_name, Column>> value;


Suppose we have an address book, the name of the ColumnFamily is Addressbook.

AColumnFamily cf = new ColumnFamily(); = "AddressBook";

Map<Byte[], Column> super_c = new HashMap<Byte[], Column>();

super_c.put("firstname", new Column("firstname", "Sagar");

super_c.put("familyname", new Column("familyname", "Bhosale");

super_c.put("city", new Column("city", "SJ");

cf.put("person1", row);

Map<Byte[], Column> super_c2 = new HashMap<Byte[], Column>();

super_c2.put("firstname", new Column("firstname", "Meetu");

super_c2.put("familyname", new Column("familyname", "Kour");

super_c2.put("city", new Column("city", "Pune");

cf.put("person2", row);


Keyspace is analogous to Schema in RDBMS. Normally there is one keyspace per application. A keyspace contains the ColumnFamilies. There may not be any relation between the column families within a keyspace.

So if we view the data model as a class diagram or a hierarchical diagram it would look as follows:

Architecture Overview

CAP theorem

C: Consistency

A: Availability

P: Partition

Ideally any system will thrive to have consistent data that is always available on a network irrespective of its partitions.

CAP theorem states that to achieve acceptable latency keeping in mind the requirements of the solution the designers\architects of the solutions should pick two of Consistency, Availability and Partition tolerance.

Cassandra gives Availability and Partitioning priority keeping Consistency and latency tunable. Consistency is achieved in Cassandra with increase in latency. Cassandra does not believe in ROW LOCKING.

Cassandra is adopted from the fine combination of Amazon Dynamo and Google Big Table.

Partitioning and replication in Cassandra is inspired from Amazon's Dynamo

Log-structured ColumnFamily data model is inspired from Google's Bigtable.

Cassandra highlights

High availability

Incremental scalability

Eventually consistent

Tunable tradeoffs between consistency and latency

Minimal administration

No SPF (Single Point of Failure)

Architecture employs an architecture with O(1) node lookup through explicit replication and eventual consistency.

Architecture layers:

The Architecture of Cassandra is separated into three layers namely Top Layer, Middle layer and Core Layer.

Writes in Cassandra

Write path

In Cassandra the write operation starts with writing the log entry to the Commit Log. The Commit Log is a kind of Crash Recovery Log. The write won't return unless the log entry is at least written to the commit log. When the log entry data is written to the Commit Log it is durable as is it is on the disc. The Commit Log manages the Commit Log Segments, each of which corresponds to a file on disk containing a fixed-size Commit Log Header followed by serialized Row Mutation objects. This write operation on the commit log is sequential.

After the log entry is made in the Commit Log the write is sent to the appropriate nodes. When each node receives the write it first records it in a local log, then makes update to appropriate Memtables. A Memtable is Cassandra is in-memory data structure or representation of key/value pairs before the data gets flushed to disk as an SSTable. There is one memtable for each column family.

Now when the Memtables reach a threshold the content of the memtable is flushed to the disk i.e. to the SSTables. The scenarios under which memtables are flushed are:

Memtables go out of space

There are too many keys (128 is default)

The time duration (client provided - no cluster clock)

Once the data is flushed to the SSTable a new memtable is started. When memtables are written out, two files are outputted:

Data File (SSTable). A SSTable stands for Sorted Strings Table and is a file of key/value string pairs, sorted by keys.

Index File (SSTable Index).

(Key, offset) pairs (points into data file)

Bloom filter (all keys in data file). A Bloom filter, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. Cassandra uses bloom filters to save IO when performing a key lookup: each SSTable has a bloom filter associated with it that Cassandra checks before doing any disk seeks, making queries for keys that don't exist almost free.

The commit log is deleted when all the column families are written to the disk.

Compaction: When data files are accumulated over time they are merged and sorted into new compact files with new indexes. This happens periodically.

Compaction of SSTables

Read in Cassandra

Read path

Read request on any node

Partitioner delegates the request to respective coordinator

Perform read repair if the replicas are in conflict

Cassandra read properties

Read multiple SSTables

Slower than writes (but still fast)

Seeks can be mitigated with more RAM

Scales to billions of rows


To achieve high scalability and scale incrementally data from the nodes in the cluster are partitioned in such a way that entry of any node or exit of any node in the cluster does not have an effect on the entire cluster. Cassandra achieves this through Consistent Hashing. The Hash function in consistent hashing returns values that fall on a circular space or ring where the largest hash value wraps around the smallest hash value. Now the nodes in the cluster are assigned random values within the key space (hash range generated from the hash function). This value represents the position of the node on the ring.

When a data item is put in the cluster, its position on the ring is determined by the hash function by hashing the key of the data item, and it is assigned a node. Now which node is it assigned to? If we walk the ring from the data items newly assigned position in a clock wise direction we will reach an node. Now this node is the node that is assigned to the data item. Thus each node is responsible for the region between it and its predecessor node on the ring. The main advantage of consistent hashing is that only immediate nodes affected by arrival or departure of a node

Partitioning by Consistent Hashing


To achieve high availability and durability, Cassandra replicates its data on multiple hosts. Each data item is replicated on N nodes where N is the number of nodes on which the data item should be replicated on. The co-ordinator is the node responsible for replicating the data item if it falls in its range on the ring. It replicates the data item in N-1 nodes ahead of it in clock wise fashion (Its successors).

The preference list is the list of nodes that store a particular key. These are the N adjacent nodes. This list is maintained so that each node can determine which node should be in the list for any particular key.


Failure Handling

There can be 'x' number of nodes in the cluster. In Cassandra data is replicated over 'N' number of node where N is the Replication factor.

Consistency Level

Consistency level determines how many replicas must respond in order to declare success. If W=2 then a success is returned when data is written to 2 nodes. If R=2 then success is returned if 2 nodes having the replica have the exact data.

Cassandra consistency levels

Consistency level in Cassandra is based on Replication factor

Cassandra provides consistency when R + W > N (read replica count + write replica count > replication factor).

How data is kept consistent over the nodes if it is replicated?

All replicas are queried when the read operation is performed on a node. Data is read from one replica and the checksum/timestamp from the other nodes is compared. If there is a mismatch data from the node having the latest time stamp is returned and the nodes with stale data are updated with the latest value keeping all the nodes in sync.

Weak Consistency

In weak consistency reads perform repairs after returning the results

Strong Consistency

In strong consistency reads perform repairs before returning the results

Hinted hand off

Read Repair happens only at the end of the get operation to retrieve the latest copy of data item by comparing the timestamps of the replicas. Now the replicas can be spread across different data center and reconciling them at the end of the get/read operation gets expensive. Hinted Handoff provides additional consistency by making the replicas of data item consistent before read repair takes place i.e. before the read operation.

Cassandra uses hinted handoff as a way to

1. Reduce the time required for a temporarily failed node to become consistent again with live ones

2. Provide extreme write availability when consistency is not required

Consider the example of Cassandra with two nodes X, Y and N=1. In this example, if node X is temporarily down or unreachable during a write operation and has consistency level ANY then the write operation is performed on the Y node, but this is not the normal write, a replica of the data item is sent to node Y. This guarantees availability and durability. The replica sent to Y will have a hint in its metadata that suggests which node was the intended recipient of the replica (in this case X). This hinted replica is stored locally by the Y node and is scanned periodically. The periodic scan detects if the hinted data item's original node is up. If it is up the hinted replica is written back to the original node where it is supposed to be and deleted from the local store.

Hinted handoff ensures that the read and write operations will be successful and not failed due to temporary node or network failures.

Membership and failure detection

In distribute environment node outages due to failures and maintenance tasks are often transient but may last for extended intervals. A node outage does not mean that the node is permanently unavailable and therefore should not result in rebalancing of partition assignment. Similarly an manual error may result in addition of a node to the cluster.

For these reasons, it was an explicit mechanism for addition and removal of nodes from a Cassandra ring is initiated. In Cassandra decentralization and partition tolerance, is supported by a gossip protocol. An administrator uses a command line tool or a browser to connect to a Cassandra node and sends a membership change to join a node to a ring or remove a node from a ring. The node that gets the request logs the membership change and the timestamp to a persistent store. Over a time these changes form a history as nodes are added and removed from the cluster. The gossip protocol propagates these changes like gossip and maintains an eventual consistent view of membership. Each node on the ring contacts any peer on random basis every second and the two node reconcile the membership information they have.

Partitioning and placement information also propagates by the gossip-based protocol and each storage node is aware of the token ranges handled by its peers. This allows each node to forward a key's read/write operations to the right set of nodes directly.

Gossip Protocol

Thrift API

This is the API that is used by Client Library Developers. It is a low level API for accessing Cassandra.


Keyspace: Consists of multiple Column Families.

CF: Represents Column Family.

SCF: Represents Column Family of whose type is "Super".

Key: Key is a String that uniquely identifies particular row in column Family. Thrifts java code assumes that strings are in UTF8 format.

Column: A triplet (tuple) of name, value, and timestamp; names are unique within rows.

Method Calls

login: Checks for appropriate rights to perform operations on intended key space

get: Returns the Column or Super Column at the given column_path.

get_slice: Returns the group of columns contained by column_parent 

multiget_slice: Retrieves slices for column_parent and predicate on each of the given keys in parallel

get_count: Returns the total number of columns present in column parent.

get_range_slices: Returns a list of slices for the keys within the specified KeyRange. This applies the given predicate to all keys in the range, not just those with undeleted matching data.

Insert: Insert a triplet consisting of (column_path.column, value, timestamp) at the given column_path.column_family.

batch_mutate: Executes the specified mutations on the keyspace. 

remove: Given a key removes data from the row specified by key.

truncate: Remove all rows from given column family.

describe_keyspaces: Retrives a list containing all key spaces configured for the particular cluster.

describe_cluster_name: Returns the cluster name.

describe_version: Gives back Thrift API version.

Cassandra Hector API for Java

This is a high level API. Application developers this API in case of java.

Hector is a Java Based API for Cassandra. The Java API that comes with the distribution is generated from Thrift and has fallbacks of enterprise features such as connection pooling, monitoring etc. Hector fills these gaps providing a enterprise level java API for Cassandra.

To a new user, this may seem like an oversight, but it is in fact a design decision made to easily support numerous programming languages with a common wire protocol. Specifically, Cassandra's client API is currently based on Thrift - a service framework with code generation capabilities for numerous languages.


Pools of pools per server

Failover Detection

Load Balancing

Pooling Configuration


To sum up Cassandra is a distributed storage system that is highly scalable, available with no single point failure. It embraces fault tolerance techniques to keep data eventually consistent.