Literature Survey About The Apache Cassandra Database Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

This paper is intended to provide a detailed description of Apache Cassandra, an open source distributed database management system, including the services that the system provides, the architecture and the problems with the system. We also give an overview of how Cassandra compares with HBase and Riak and how Cassandra supports data storage and retrieval in Facebook.


Cassandra is a distributed storage system which is suitable for systems that need particularly large and scalable storage. It is an open-source system that has been described as a Bigtable model running on an Amazon Dynamo-like infrastructure[8]. It is capable of managing structured data that can scale to a very large extent across many commodity servers, with no single point of failure. Cassandra runs on top of an infrastructure of hundreds of nodes that may be spread across many datacenters. It aims to provide reliable and scalable storage. When there is large scale data that is growing continuously in size, small and large components fail continuously. Cassandra manages to provide reliability, high performance, high availability and applicability in such situations. Cassandra shares many design strategies with databases. But instead of supporting a full relational data model; it provides a simple data model that allows us to dynamically control data layout and format.


Each table in Cassandra is a map consisting of rows of keys and their respective values. Every row has a unique key which is typically 16-36 bytes long. The value is a column family which may consist of one or more columns. These columns can be defined by the user. The number of column families can vary from user to user but they must be fixed when the cluster is started. Each column consists of a name, a value and a user-defined timestamp. The number of columns that can be contained in a column family is very large. Each column family can contain one of two structures: supercolumns or columns. Supercolumns have a name, and an infinite number of columns associated with them. Just like columns, the number of supercolumns associated with any column family could be infinite and of a variable number per key. The data model is represented in figures 1 and 2.

Fig 1: Cassandra Column Family[10]

Fig 2: Cassandra Super Column Family [10]

Consistent Hashing and Data Replication

Cassandra partitions data across the cluster using consistent hashing but uses an order preserving hash function to do so. Consistent hashing technique provides a hash table functionality wherein the addition or removal of one slot does not significantly change the mapping of keys to slots. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped. By using consistent hashing, only K / n keys need to be remapped on average, where K is the number of keys, and n is the number of slots[11]. In consistent hashing, the nodes in the system are aligned to form a circular space or a ring. Each node is assigned a random value in this space which is the output of a hash function and usually of the size of 128 bits. This value represents the position of the node in the ring. Each data item is identified by a key and the data item is assigned to a node in the ring. The data item's key is hashed, the nodes in the ring are traversed in a clockwise direction, and the item is assigned to the first node whose position is larger than the key of the data item. This node is deemed the coordinator for this key. This key is provided to the application and is then used by Cassandra to route requests. Thus, each node becomes responsible for the region in the ring between it and its predecessor node on the ring. The principal advantage of consistent hashing is that when a node leaves the ring only its predecessor and successor are affected and not the other nodes in the ring. Also, assuming that hashing is uniformly random, the load between the nodes is well balanced. It is also easy to trace the data items in the ring and access them. In order to ensure availability, each data item is replicated at N hosts, where N is the replication factor configured per-instance. Each key is assigned to a coordinator node and the coordinator is given charge of the replication of the data items that fall within its range. In addition to locally storing each key within its range, the coordinator replicates these keys at the N-1 nodes in the ring. The replication takes place as follows: - After the value is written to the coordinator node, a duplicate value is written to another node in the same cluster. Duplicates are then written to at least two other clusters. Conflicts if any, are resolved by examining the timestamp.

Gossip style membership algorithm

Cluster membership in Cassandra is based on Scuttlebutt[19], an efficient anti-entropy Gossip based mechanism. Scuttlebutt offers a very

efficient CPU utilization and very efficient utilization of the gossip channel. Within the Cassandra system Gossip is not only used for membership but also to disseminate other system-

related control state. When a node needs to enter a system, it is first assigned a hash value within the key space. Then a search is conducted in clockwise fashion to find the nearest nodes to the new node. The predecessor and successor nodes are located and the node inserts itself between the two nodes on the ring. The Cassandra system elects a leader amongst its nodes using a system called Zookeeper[13]. All nodes on joining the cluster contact the leader who tells them for what ranges they are responsible for. The metadata about the different nodes in the ring and the regions a node is responsible is cached locally at each node inside Zookeeper. This helps when a node crashes or leaves the ring suddenly and comes back up after a period of time. The cache can be used by the node to get information about its position and the ring and what ranges it was responsible for.


The major problem with using the consistent hashing scheme is that there is no real-world auditability. If a datacenter fails, it is impossible to tell when the required number of replicas will be up-to-date. This according to[13], "can be extremely painful in a live situation when one of the data centers goes down and you want to know exactly when to expect data consistency so that recovery operations can go ahead smoothly". Also, Cassandra relies on high-speed fiber between data centers. When a data center goes out and there is a need to revert to a secondary data center which is 20 miles away, the high latency will lead to write timeouts and highly inconsistent data. Also, there will be a cluster downtime associated with even minor schema changes.


Deinspanjer[12] provides a good comparison between Cassandra, Hbase and Riak. In HBase, the data is split into regions.  When a region file is split, HBase will determine which machines should be the owners of the newly split files.  Eventually, the new node will store a reasonable portion of the new and newly split data. In Riak, the data is divided into partitions that are distributed among the nodes.  When a node is added, the distribution of partition ownership is changed and both old and new data will immediately begin migrating over to the new data. In Cassandra, nodes claim ranges of data.  By default, when a new machine is added, it will receive half of the largest range of data. In terms of cost, Cassandra and Riak are much lighter on memory requirement compared to Hbase and hence much cheaper. To implement a firewall that simply focuses on payload inspection, a separate layer on the front end will have to be built in Cassandra and Hbase, This adds to the requirements and implementation time of the custom front-end layer. In Riak, this is much easier to implement within the Riak business logic itself. In HBase, Schema changes involving adding or altering column families require disabling the table, in Cassandra it requires a rolling restart of the nodes, whereas in Riak, new buckets and schema changes are completely dynamic. In terms of reliability, Cassandra and Riak have no single point of failure and most configuration changes can be handled using rolling restarts, whereas in HBase upgrades require restart of the entire cluster.


Many organizations are moving from relational databases to Cassandra due to the increasing difficulty of building a high performance, write intensive, application on a data set that is may grow by leaps and bounds. This growth has forced organizations to think about horizontal and vertical partitioning strategies that will be more suitable to providing this high scalability and reliability.


Digg[15] is a place for people to discover and share content from anywhere on the web. Digg builds lists of popular stories being shared across the web, each of which has a digg button. Along with that button is a number of people from the Digg community who have said they like or digg that story.

In an attempt to scaling their database infrastructure, Digg chose Cassandra to since it provides a highly available peer to peer cluster with more than simple key to column mapping. They use Cassandra to deploy their green badges functionality. Badges appear on the Digg icon for a story when a friend has dug it. They created a set of buckets per (user, item) pair with a list of friends who dugg the story.


Twitter migrated partly to Cassandra to support the growth rate of data. According to [16], Twitter uses Cassandra to store and query their database of places of interest. Also, the research team uses it to store the results of data mining done over their entire user base. Also, Twitter's analytics, operations and infrastructure teams are working on a system that uses Cassandra for large-scale real time analytics for use both internally and externally.


A social networking site like Facebook has lots of relationships between users and their data. In such services, data often needs to be denormalized to prevent having to do lots of joins when performing queries. However, denormalization causes increased write traffic. Cassandra has several optimizations to make writes cheaper. When a write operation occurs, it does not immediately write to the disk. Instead the record is updated in memory, the write operation is added to the commit log and the write is also updated on one machine. Periodically the list of pending writes is processed and all the write operations are flushed to disk on the other replicas. Additionally, the writes are sorted so that the disk is written to sequentially thus significantly improving seek time on the hard drive and reducing the impact of random writes to the system. Cassandra is also used to facilitate ease of search within a Facebook inbox. There might be a need to search in two ways - by term and by name of persons. For term search, the key is the user id and the super columns are the words that make up the message. Message identifiers that are related to that word make up the columns in the super column. For person-based search, the user id forms the key for each row of data, the recipient ids are the super columns and the message identifiers form the columns within the respective super column. According to [7], the term search ranges from 7.78 to 44.41 milliseconds, whereas the person-based search ranges from 7.69 and 26.13 milliseconds.