The next level below the keyspace is the column family. Many rows belong to a column family and share the same column family name. The column family is similar to a table in a relational database. However, the similarities between Cassandra and a relational database stop at this level. A relational database defines columns directly in a table, but Cassandra has another level, the row, connecting column families and columns. As we mentioned before, the number of columns can differ from row to row.
A collection of columns is grouped into a row and assigned a row name. The columns are stored sorted by name. Therefore, slice queries can be performed in Cassandra: for a specific row, we can easily locate the range of columns we want to retrieve.
The smallest unit of the Cassandra data model is the column. Each column has a name and a value. The name and value can be set up front, or set dynamically by user applications.
Figure 2b shows an example of a Cassandra keyspace, where the Blog keyspace has the column families users, blog entries, subscribes_to, subscribers_of, and time_ordered_blogs_by_user. The column families have their own rows, and the rows have their own columns.
Besides the regular column, there is a special kind of column: the super column. A super column is also a name/value pair, but its value is a map of subcolumns; the subcolumns, like regular columns, store byte arrays. Figure 2a shows the difference between a regular column and a super column. Super columns do not support secondary indexes, so their use is discouraged.
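Because the columns of a row are kept sorted by name, a slice query reduces to two binary searches. The following is a minimal sketch of that idea, not Cassandra's actual storage code; the Row class, its method names, and the date-like column names are all hypothetical.

```python
from bisect import bisect_left, bisect_right


class Row:
    """A row whose columns are kept sorted by column name (illustrative only)."""

    def __init__(self):
        self._names = []    # column names, always sorted
        self._values = {}   # column name -> value

    def insert(self, name, value):
        if name not in self._values:
            # Insert the new name at its sorted position.
            self._names.insert(bisect_left(self._names, name), name)
        self._values[name] = value

    def slice(self, start, end):
        """Return (name, value) pairs with start <= name <= end."""
        lo = bisect_left(self._names, start)
        hi = bisect_right(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


row = Row()
for name, value in [("2013-03", "c"), ("2013-01", "a"), ("2013-04", "d"), ("2013-02", "b")]:
    row.insert(name, value)

# Retrieve only the columns whose names fall inside the requested range.
result = row.slice("2013-02", "2013-03")
```

The columns were inserted out of order, yet the slice comes back ordered by name, which is what makes range retrieval within a single row cheap.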
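The difference between a regular column and a super column can be pictured with nested maps. This is only a sketch with plain Python dictionaries; the users data, the row name, and the read helper are hypothetical and are not part of any Cassandra client API.

```python
# A column family maps row names to rows; a row maps column names to values.
# A regular column holds a byte array; a super column holds a map of subcolumns.
users = {
    "alice": {                                  # row name
        "email": b"alice@example.com",          # regular column
        "address": {                            # super column
            "city": b"Austin",
            "zip": b"78701",
        },
    },
}


def is_super_column(value):
    """In this sketch, a super column is any column whose value is a map."""
    return isinstance(value, dict)


def read(column_family, row_name, column_name, subcolumn_name=None):
    """Look up a value by row name and column name (plus subcolumn, if any)."""
    value = column_family[row_name][column_name]
    if subcolumn_name is None:
        return value
    if not is_super_column(value):
        raise KeyError(f"{column_name!r} is not a super column")
    return value[subcolumn_name]


email = read(users, "alice", "email")
city = read(users, "alice", "address", "city")
```

Note that a lookup always starts from a row name and a column name; a super column simply adds one more level of naming beneath the column.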
Design Differences Between RDBMS and Cassandra
In a relational model, you retrieve data by using the column name of a table. In Cassandra, a query against a column family needs two names: the row name and the column name.
No Referential Integrity
Unlike traditional relational databases, Cassandra does not include referential integrity in its design. Since there are no foreign key constraints, join operations do not exist in Cassandra. The purpose of this design is to keep Cassandra's schema flexible and easy to scale out.
The schema of Cassandra is more flexible than that of the relational model. A relational model defines its schema up front, and the schema cannot easily be changed afterwards. In contrast, Cassandra does not require an up-front schema, and users can change the schema dynamically.
Unlike relational models, Cassandra has an additional storage level, the row, between the column family (which corresponds to a table in a relational model) and the column. As we discussed before, the columns of a row are stored in sorted order, which makes slice queries within a row very convenient.
Figure 2c: Cassandra Cluster
The main application of Cassandra is processing big data. It does this by using multiple nodes, and it also needs to ensure that there is no single point of failure. Cassandra adopts architectural aspects of Amazon Dynamo. The design is based on the acknowledgement that failure is inevitable. Cassandra handles failure by implementing a peer-to-peer system. The system is distributed, and all of the nodes are used equally to store data. There is no master or slave node. Figure 2c shows the architecture of a Cassandra cluster. The nodes form a ring, and each node is in charge of one region of the ring, taken clockwise. Let's look in detail at how the nodes communicate with each other, how data is partitioned and replicated, how data is saved, and how consistency works in Cassandra.
In a Cassandra cluster, a participating node needs to know where the other nodes are and what state they are in. A gossip-based protocol is used to exchange this information among nodes. The information is transmitted in a peer-to-peer way. Each node gossips its own state information, as well as the information it knows about other nodes, to nearby nodes. Gradually, the topology of the cluster and the state information of all nodes spread to every node in the ring. Every second, a node exchanges state information with other nodes. Old news is replaced by the latest news. Therefore, with the gossip protocol, every node in the cluster quickly obtains up-to-date information.
One essential feature of Cassandra is the ability to scale out gradually. This is extremely important for big data workloads. In order to scale out, Cassandra needs to partition the data among the nodes and thus store the data in a distributed fashion. The approach Cassandra adopts for partitioning is consistent hashing, which works in the following way. Cassandra hashes the row name (the row key), and the hash value determines which node stores the row. As Figure 2c shows, each node is in charge of one region of the ring, taken clockwise. The advantage of consistent hashing is that when a new node is added or a node fails, data movement is limited to the region that belongs to the added or removed node.
Since system and hardware failures are inevitable, Cassandra employs replication so that the system stays reliable and keeps working normally when failures occur. Cassandra stores redundant copies of data, and the copies (replicas, in Cassandra's terminology) are distributed among the nodes according to the replication strategy. Cassandra sets the granularity of replication at the row level. When users create a keyspace, they can select the number of replicas. The replication strategy determines how the replicas are distributed among the nodes in the Cassandra system.
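The way state spreads through gossip can be simulated in a few lines. The sketch below is a deliberately simplified, deterministic model: each node exchanges its view with its next neighbor on the ring, and version numbers decide which entry is newer. The node names, the (version, status) tuples, and the neighbor-selection rule are assumptions made for illustration and do not reproduce Cassandra's actual gossip messages.

```python
def merge(mine, theirs):
    """For every node mentioned in `theirs`, keep the entry with the higher version."""
    for node, (version, status) in theirs.items():
        if node not in mine or mine[node][0] < version:
            mine[node] = (version, status)


def gossip_round(views):
    """Each node exchanges its whole view with its next neighbor on the ring."""
    names = sorted(views)
    for i, name in enumerate(names):
        peer = names[(i + 1) % len(names)]
        merge(views[peer], views[name])   # peer learns what this node knows
        merge(views[name], views[peer])   # and this node learns from the peer


# Every node starts out knowing only its own state: {node: (version, status)}.
views = {name: {name: (1, "UP")} for name in ["A", "B", "C", "D"]}

# Spread the initial state around the ring.
for _ in range(len(views)):
    gossip_round(views)

# Node C publishes newer information about itself; old news must be replaced.
views["C"]["C"] = (2, "LEAVING")
for _ in range(len(views)):
    gossip_round(views)
```

After the final rounds every node holds the same view, and the stale entry for C has been overwritten everywhere by the entry with the higher version.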
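The limited data movement of consistent hashing can be checked with a small experiment. This is a minimal sketch under stated assumptions: MD5 is used as the hash function purely for illustration, each node gets a single position on the ring derived from its name, and the Ring class and the node and row names are hypothetical.

```python
import hashlib
from bisect import bisect_left


def token(name):
    """Map a string to a position on the ring (MD5 is an assumption of this sketch)."""
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest(), "big")


class Ring:
    def __init__(self, nodes):
        self._tokens = []   # node tokens, always sorted
        self._owner = {}    # token -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        t = token(node)
        self._tokens.insert(bisect_left(self._tokens, t), t)
        self._owner[t] = node

    def remove(self, node):
        t = token(node)
        self._tokens.remove(t)
        del self._owner[t]

    def owner(self, row_key):
        # First node token at or after the key's token, wrapping around the ring.
        i = bisect_left(self._tokens, token(row_key)) % len(self._tokens)
        return self._owner[self._tokens[i]]


ring = Ring(["node-A", "node-B", "node-C"])
keys = [f"row-{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}

# Add a node and see which rows changed owner.
ring.add("node-D")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every row in moved now belongs to node-D; rows owned by the other nodes stay where they were, and removing node-D again restores the original assignment.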
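One simple way to place replicas is to start at the node that owns a row and continue clockwise until enough nodes have been collected. The sketch below assumes that placement rule, a hand-picked ring of four nodes, and small integer tokens; the RING table and the replicas function are hypothetical, and other replication strategies would place copies differently.

```python
from bisect import bisect_left

# Hypothetical ring: token -> node, with tokens chosen by hand for readability.
RING = {0: "A", 25: "B", 50: "C", 75: "D"}
TOKENS = sorted(RING)


def replicas(row_token, replication_factor):
    """Return the nodes that hold copies of a row, walking clockwise from its owner."""
    if not 1 <= replication_factor <= len(TOKENS):
        raise ValueError("replication factor must be between 1 and the number of nodes")
    # The owner is the first node whose token is at or after the row's token.
    start = bisect_left(TOKENS, row_token) % len(TOKENS)
    return [RING[TOKENS[(start + i) % len(TOKENS)]] for i in range(replication_factor)]


three_copies = replicas(30, 3)   # owner C, then the next two nodes clockwise
wrapped = replicas(80, 2)        # past the last token, so placement wraps to A
```

With a replication factor of 3 on this four-node ring, any single node can fail and two copies of every row remain reachable.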
Commit Logs, MemTables, and SSTables
Figure 2d shows how Cassandra stores data for a write operation. A write operation uses three data structures: the commit log, the MemTable, and the SSTable. First, a write operation is written to the commit log. The commit log ensures the durability of Cassandra. Then the value is recorded in the MemTable, which resides in memory. A read operation can retrieve data from the MemTable. When the data stored in the MemTable exceeds its limit, it is written to an SSTable file on disk (where the commit log also resides), and the MemTable is cleared. The data in an SSTable can also be read.
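The write path described here can be sketched as a small in-memory model. This is an illustration only: Python lists and dictionaries stand in for the on-disk commit log and SSTable files, the limit is counted in entries rather than bytes, the log is never truncated, and the Store class and its method names are hypothetical.

```python
class Store:
    """Toy model of the commit log -> MemTable -> SSTable write path."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []      # append-only list standing in for the on-disk log
        self.memtable = {}        # in-memory table of recent writes
        self.sstables = []        # flushed tables, oldest first, never modified
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. record the write for durability
        self.memtable[key] = value             # 2. apply it in memory
        if len(self.memtable) > self.memtable_limit:
            self.flush()                       # 3. spill to "disk" once over the limit

    def flush(self):
        # Write the MemTable out sorted by key, then start a fresh MemTable.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Newest data first: the MemTable, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None


store = Store(memtable_limit=2)
store.write("k1", "v1")
store.write("k2", "v2")
store.write("k3", "v3")      # third entry exceeds the limit and triggers a flush
store.write("k1", "v1b")     # a newer value for k1 lands in the fresh MemTable
```

Reads consult the MemTable before the SSTables, so the newer value of k1 shadows the flushed one, while k2 is still served from the SSTable.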
Figure 2d: Memtables, SSTables, and Commit Logs
Cassandra satisfies two guarantees: availability and partition tolerance. It relaxes the consistency requirement and is an eventually consistent database. On top of eventual consistency, Cassandra also provides a tunable consistency option that lets users balance consistency against latency. For a write operation, users can set the consistency level to ANY, ONE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, or ALL, from weak to strong; for a read operation, the consistency level can be any of these values except ANY. Cassandra uses the consistency level to control the number of replicas that must be contacted. For instance, the "QUORUM" level requires that a majority (more than half) of the replicas be contacted for a read or write operation.
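The trade-off behind tunable consistency can be made concrete with a little arithmetic. The sketch below covers only ONE, QUORUM, and ALL, since the other levels depend on hinted writes or on data center layout; it treats a quorum as a majority of the replicas, that is, floor(RF / 2) + 1. The function names are hypothetical.

```python
def required_replicas(level, replication_factor):
    """Number of replicas that must respond at a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1   # a majority of the replicas
    if level == "ALL":
        return replication_factor
    raise ValueError(f"level not covered by this sketch: {level}")


def read_sees_latest_write(read_level, write_level, replication_factor):
    """True when every read set must overlap every write set in at least one replica."""
    r = required_replicas(read_level, replication_factor)
    w = required_replicas(write_level, replication_factor)
    return r + w > replication_factor


quorum_of_three = required_replicas("QUORUM", 3)
quorum_both = read_sees_latest_write("QUORUM", "QUORUM", 3)
one_both = read_sees_latest_write("ONE", "ONE", 3)
```

With three replicas, QUORUM reads and writes each touch two nodes and therefore always share at least one replica, whereas ONE on both sides is faster but can return stale data until the replicas converge.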
Apache Cassandra 1.2 Documentation, DataStax, 2013.
Cassandra: The Definitive Guide, Eben Hewitt, O'Reilly, 2010.
Consider the Apache Cassandra database, Srinath Perera, 2012.
Cassandra - A Decentralized Structured Storage System, Avinash Lakshman and Prashant Malik, Facebook, ACM SIGOPS Operating Systems Review, 2010.
Mahalo.com is a web directory (or human search engine) and Internet-based knowledge exchange (question and answer site). Mahalo is a top 200 website with 12 million monthly visitors.
Mahalo.com has two data centers. Initially, the relational database management system MySQL was used for Mahalo's data storage. Because the crucial parts of the Mahalo.com application are write-intensive and its workload fluctuates constantly, it was difficult for administrators to tune performance. Moreover, many queries needed to join multiple MySQL tables, and the response time for those queries was moving outside the acceptable range. Given the performance and functionality limitations of MySQL, Mahalo decided to switch to Cassandra, a modern key/value store, to support its growing enterprise.
Remarkably, the migration of Mahalo.com from MySQL to Cassandra was completed within two months. Cassandra processed the large amount of data without problems from the beginning. With a small amount of effort, Mahalo.com used Cassandra to establish a data infrastructure that supports its future throughput and scalability needs.
Mahalo.com uses Cassandra to record user activity logs and topics for its knowledge exchange website. Cassandra supports near-real-time use of the data and has been able to absorb the heavy write load created by all of these activities. Using Cassandra, Mahalo.com can quickly work out the correlations between its hundreds of thousands of tagged topics.
 Why Migrate from MySQL to Cassandra?, DataStax white paper, July 2012.