Cache And Consistency In Nosql Storage Systems Computer Science Essay


Recently a new type of database, NoSQL, has received a significant amount of attention from big internet companies, and these so-called non-relational databases have become very popular. They provide high availability and scalability for both reads and writes. In these NoSQL databases, caching and consistency are the two main concerns. This article covers cluster-layer caching and data consistency using Paxos. It explains how the node-split technique, in conjunction with consistent hashing, can be used to reduce server load by partitioning client requests, and it describes how Paxos works.


With the increasing amount of data in the Web 2.0 era, relational databases face significant challenges of scalability and reliability. Big companies like Google, Amazon, LinkedIn and Twitter have tried to solve the problem of how to efficiently store and retrieve ever-growing data. The notion of NoSQL databases has been around for many years, but it became popular after Amazon published a paper on Dynamo, its implementation of key-value storage. Google's Bigtable is also a well-known and successful implementation of a NoSQL database. Relational databases suit specific use cases that mostly involve structured data; unstructured data is where they face challenges.

Caching is used to solve problems related to network efficiency; it saves a significant amount of bandwidth. There are many caching solutions, such as IBM solidDB and the Hibernate L2 cache (Ehcache). These caching mechanisms solve the network bandwidth problem but pose another problem in concurrent-access systems: they do not work as expected there. Voldemort is a very popular NoSQL database, a distributed key-value storage system that is fault tolerant and provides a tremendous amount of scaling by sacrificing the notion of querying and the convenience of SQL (or inconvenience?). Cassandra is another NoSQL database, which stores data in the form of columns and column families. There are also document stores and graph-based storage databases.

In the traditional relational-database scenario, a set of N servers handles the traffic. These setups do not scale well, and if any server crashes the system faces serious issues accessing data. Memcached is a distributed memory object caching system used to reduce server load [1].

This paper is structured in two sections. The first section explains the distributed caching system used by Memcached (and by various other NoSQL storages); Memcached uses a distributed ms-mc (ms: server, mc: client) model. The second section introduces a data consistency mechanism, Paxos, which is used in Google Chubby, which in turn is used by Google Bigtable, and explains how the Paxos consistency algorithm works.

Cluster Layer Cache

Modern applications that must be highly scalable and available use an in-memory distributed cache. Traditionally, data is replicated over multiple servers to provide high availability, but this approach requires low-latency consistency protocols to propagate data among the different servers. Memcached has two core components:

ms - server

mc - client

When a Memcached query is fired, the client first calculates the hash of the given key to determine which ms holds the key-value pair. Once the location is determined, the mc sends a request to that ms, which finds the corresponding value and returns it to the client. Because the servers do not need to communicate with one another, this ms-mc interaction has minimal impact on the network. The figure below shows the structure of the Memcached ms and mc system.

Figure 1 Structure of Memcached Distributed system
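The client-side mapping from key to ms described above can be sketched as follows. This is a minimal illustration using simple modulo hashing over a stable digest; the server addresses are made up, and this is not Memcached's actual client code:

```python
import hashlib

def pick_server(key: str, servers: list) -> str:
    """Map a key to one ms (server) by hashing it on the mc (client) side.

    A stable hash (MD5, not Python's randomized hash()) is used so that
    every client maps the same key to the same server.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    index = int(digest, 16) % len(servers)
    return servers[index]

servers = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]
print(pick_server("user:42", servers))  # always the same server for this key
```

Because the mapping is computed locally, no ms-to-ms coordination is needed to locate a key, which is what keeps network impact minimal.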

The ms can be divided into several virtual nodes. Each ms can be considered a Physical Node (PN) hosting a variable number of Virtual Nodes (VNs). The structure is shown below.

Figure 2 MS structure

Using a distributed cache greatly increases query efficiency. When the client mc wants a value, it asks a proxy server, which checks whether the requested content is in its ms. If the content is not found, the query is forwarded to the next proxy server for processing. Some NoSQL storage systems use a Distributed Hash Table (DHT) scheme, which enables proxy servers to deal with the query directly: the query is broadcast to all nodes on the DHT, each node searches for the queried content locally, and the results are returned in parallel.
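The broadcast-and-search-locally pattern can be sketched as below. The "nodes" here are plain in-memory dictionaries standing in for real DHT nodes, and the thread pool stands in for parallel network requests; all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# Each "node" is just a local dict standing in for one DHT node's storage.
nodes = [
    {"k1": "v1"},
    {"k2": "v2"},
    {"k3": "v3"},
]

def local_search(node: dict, key: str):
    """Each node searches its own store; returns None on a miss."""
    return node.get(key)

def broadcast_query(key: str):
    """Broadcast the query to all nodes and collect results in parallel."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        results = pool.map(lambda n: local_search(n, key), nodes)
    return [r for r in results if r is not None]

print(broadcast_query("k2"))  # ['v2']
```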


When thinking of consistency, there is a tradeoff between availability and consistency. NoSQL storage systems tend to sacrifice consistency a little to achieve high availability for writes; they are designed to be available 99.9% of the time for reads and writes.

The data in NoSQL databases needs to be partitioned and replicated using a good consistent hashing algorithm [2]. Let's look at one such algorithm. In this consistent hashing mechanism, the range of the hash function's output is treated as a circular space, or ring: the highest hash value wraps around to the smallest. Each node included in the system is assigned a random hash value that determines its position on the ring. A hash is also calculated for the key of every data item to be stored; the item's position is determined by walking the ring clockwise from the key's hash and finding the first node whose hash value is larger. [3]
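A minimal sketch of such a ring, assuming MD5 as the hash function (the concrete hash and node names are illustrative, not taken from any particular system):

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    # Stable hash so that all clients agree on ring positions.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring: nodes sit at hashed positions, and a
    key is served by the first node found walking clockwise from its hash."""

    def __init__(self, nodes):
        self._points = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        h = ring_hash(key)
        # bisect finds the first node position >= the key's hash;
        # wrapping to index 0 makes the space circular.
        i = bisect.bisect(self._points, (h, ""))
        if i == len(self._points):
            i = 0
        return self._points[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("some-key"))
```

The payoff of the ring layout is that adding or removing a node only remaps the keys between that node and its predecessor, rather than rehashing everything.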

To minimize server load, let's revisit the ms architecture discussed above in caching. We can use n servers, each consisting of virtual nodes (VNs). Each physical node (PN) is connected in a ring, with its virtual nodes attached to the ring through the network. On a user query, the first node on the ring is accessed; if it fails, the next virtual node is tried. If all of a PN's virtual nodes fail, the virtual nodes of the next physical node are accessed. As shown in the figure, if the hash value lands on VN a and that node fails, the value is accessed from VN b; if all the VNs fail, a VN of the next PN is accessed.

Figure 3 How hashing search works
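The VN-then-PN failover walk can be sketched as below. The node representation (callables that raise `ConnectionError` when down) is a hypothetical simplification for illustration:

```python
def lookup_with_failover(key, physical_nodes):
    """Try the virtual nodes of each physical node in ring order.

    physical_nodes: a list of PNs, each a list of VN callables that either
    return a value or raise ConnectionError when the VN is down.
    (Illustrative structure, not a real client API.)
    """
    for pn in physical_nodes:
        for vn in pn:
            try:
                return vn(key)
            except ConnectionError:
                continue  # this VN is down: try the next VN, then the next PN
    raise LookupError(f"all nodes failed for {key!r}")

def dead(key):
    raise ConnectionError("node unreachable")

def live(key):
    return "value-for-" + key

# VN a and VN b of PN 1 are down, so the walk falls through to PN 2.
print(lookup_with_failover("k", [[dead, dead], [live]]))  # value-for-k
```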


Let's now look at a very well-known consensus algorithm: Paxos. Paxos is used in Google Chubby, which in turn is used by Google Bigtable. The Paxos algorithm is used to implement fault-tolerant distributed systems. Paxos is a simple consensus algorithm which ensures that a single value is chosen and learned by the learners. The safety requirements Paxos must satisfy to function properly are as follows:

Only a value that has actually been proposed may be chosen,

No more than one value is chosen, and

A learner process never learns that a value has been chosen unless it actually has been chosen. [4]

The goal of the system is to ensure that a value is chosen and, once a value is chosen, that it is consequently learned by the processes. There are three main agents in Paxos:

Proposers - The starting point. They propose any new value.

Acceptors - Typically there is a quorum of acceptors who accept or reject values proposed by proposers.

Learners - They are the ones who eventually learn the value accepted by acceptors. They can be acceptors as well as non-acceptors.

The algorithm consists of two phases. In Phase 1, proposers send prepare requests to a set of acceptors; in Phase 2, proposers send accept requests to acceptors, and learners learn the chosen value. The algorithm can be summarized as follows:

Phase 1:

The proposer first selects a proposal number n and sends a prepare request to a majority of acceptors.

Upon receiving a prepare request with proposal number n, an acceptor accepts it if n is greater than the number of any request it has already responded to. It responds to the proposer with a promise not to accept any request numbered less than n, together with the highest-numbered proposal it has accepted so far, if any.

Phase 2:

If the proposer receives responses to its prepare request numbered n from a majority of acceptors, it sends an accept request with value v to that set of acceptors, where v is the value of the highest-numbered proposal among the responses, or any value of the proposer's choosing if the responses contained no proposals.

Upon receiving an accept request numbered n, an acceptor accepts it unless it has already responded to a prepare request with a number greater than n. [4]

A proposer can propose multiple times, as long as it follows the algorithm and gives every new request a new number. An acceptor can accept any number of requests, as long as each number it accepts is higher than that of the last request it accepted.
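The two phases above can be sketched as a toy single-decree implementation. Message passing is replaced by direct method calls, so this illustrates only the protocol logic, not a fault-tolerant deployment; all class and function names are illustrative:

```python
class Acceptor:
    """One Paxos acceptor: remembers its promise and the highest
    proposal it has accepted so far."""

    def __init__(self):
        self.promised_n = -1
        self.accepted_n = -1
        self.accepted_v = None

    def prepare(self, n):
        # Phase 1b: promise not to accept anything numbered below n,
        # and report the highest-numbered proposal accepted so far.
        if n > self.promised_n:
            self.promised_n = n
            return True, self.accepted_n, self.accepted_v
        return False, None, None

    def accept(self, n, v):
        # Phase 2b: accept unless a higher-numbered prepare was promised.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_v = v
            return True
        return False

def propose(acceptors, n, v):
    """Single-decree proposer: run Phase 1, adopt the value of the
    highest-numbered accepted proposal among the responses (if any),
    then run Phase 2."""
    majority = len(acceptors) // 2 + 1
    promises = [a.prepare(n) for a in acceptors]
    granted = [(an, av) for ok, an, av in promises if ok]
    if len(granted) < majority:
        return None  # could not get a majority of promises
    prior = max(granted, key=lambda p: p[0])
    value = prior[1] if prior[0] >= 0 else v  # keep any already-accepted value
    accepted = sum(a.accept(n, value) for a in acceptors)
    return value if accepted >= majority else None

accs = [Acceptor() for _ in range(3)]
print(propose(accs, n=1, v="x"))  # x is chosen
print(propose(accs, n=2, v="y"))  # still x: Paxos preserves the chosen value
```

The second call shows the safety property in action: once "x" is chosen, a later proposer with a higher number is forced to re-propose "x" rather than its own value.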

Learning a value

There are various ways in which a value can be learned by the learners. One way is for each acceptor to send the proposal to all learners as soon as it accepts it. This requires an ample amount of communication: the number of messages can be the product of the number of acceptors and the number of learners.

An alternative approach is to have a distinguished learner: the acceptors send accepted proposals to the distinguished learner, which then passes them on to all the other learners. This method has a single point of failure: the system fails if the distinguished learner goes down.

Another way is to have more than one distinguished learner accepting proposals from acceptors and communicating them to the rest of the learners. This involves a more complex communication mechanism but gives a higher probability that the messages actually reach the learners and that a value is successfully learned.

In another scenario, if a message is lost, a learner may never learn a value. In that case the learner can query the acceptors about the values they have accepted. If an acceptor goes down, the learner has to wait until the next request is accepted by an acceptor and communicated to the learners.

Problems Paxos can face and solution

This algorithm can easily fall into an infinite loop of prepare requests sent by two different proposers. One proposer issues a request; the acceptors accept it, after which the proposer fails. The acceptors then accept another request with a higher number, and the second proposer fails. The first proposer recovers and proposes again, then fails again; proposer 2 recovers and sends another request, and so on. The scenario is shown below.

Figure 4 Paxos Infinite loop

This situation can be solved by having a single distinguished proposer (a leader) propose values. If the distinguished proposer can communicate successfully with a majority of acceptors, the value can be accepted. Only the distinguished proposer issues requests. The leader can be elected by a suitable algorithm, and safety is ensured regardless of failures in leader election, for example by using timeouts. [5]


Amazon Dynamo

Amazon Dynamo sacrifices consistency for availability and reliability; it offers eventual consistency. All nodes will eventually see an update at some point in time and will eventually agree on all values; updates are not atomic. [6] The system allows inconsistent reads, which may return different values, but guarantees that eventually every replica will hold the same value: the replicas infer the updates and converge to a consistent state. One problem is determining which of two concurrent writes was executed first. Dynamo follows a different design approach and allows inconsistent values from concurrent writes to be stored in the system side by side by versioning the objects. It then forces the client to reconcile the values and write the result back to Dynamo.

The rationale behind forcing the client to do the reconciliation is that Dynamo is not aware of the object's structure and thus does not know how to merge conflicting versions; the responsibility is passed back to the client, which knows the object well. [6]
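Dynamo's object versioning is commonly described in terms of vector clocks. A minimal sketch of the test that decides whether two versions must be kept side by side (the server names are illustrative, and this is a simplification of the mechanism, not Dynamo's code):

```python
def descends(a: dict, b: dict) -> bool:
    """True if vector clock a descends from (has seen all of) clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a: dict, b: dict) -> bool:
    """Two versions are concurrent when neither descends from the other;
    the store keeps both and hands them to the client to reconcile."""
    return not descends(a, b) and not descends(b, a)

v1 = {"sx": 2, "sy": 1}          # written via servers sx and sy
v2 = {"sx": 2, "sz": 1}          # a concurrent write via server sz
print(concurrent(v1, v2))        # True: both versions are returned on a read
print(concurrent(v1, {"sx": 1})) # False: v1 supersedes the older version
```

When `concurrent` is true for two stored versions, a read returns both, and the client's reconciled write carries a clock that descends from each of them, collapsing the branches.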


Voldemort follows an approach similar to Amazon Dynamo's, allowing inconsistent writes and resolving the inconsistencies at read time, a mechanism better known as "read repair". Voldemort has an additional mechanism known as hinted handoff, which allows consistency to be reached before read repair occurs. The reason for this additional step is to reduce expensive cross-data-center communication.

How it works

Say a put/delete fails due to some kind of failure, such as a node being unavailable or a timeout.

The list of nodes on which the writes have failed is kept in the pipeline.

For every failed write, take a live node somewhere on the network and perform a synchronous write to the slop store. The slop store is storage separate from the main store and holds only these hints.

If the write still fails after the handoff occurs, an exception is returned to the client indicating that the handoff occurred but the write failed.

The hints are periodically checked, replayed as writes to their original destinations, and removed from the slop store.
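The steps above can be sketched as follows. All names here (`put_with_handoff`, `slop_store`, the dict-based node model) are illustrative, not Voldemort's actual API:

```python
def put_with_handoff(key, value, preferred, live_nodes, slop_store):
    """Sketch of hinted handoff: if a preferred node is down, write a
    'hint' (destination plus the write itself) to the slop store on
    some live node instead of failing the whole put."""
    failed = []
    for node in preferred:
        if node in live_nodes:
            live_nodes[node][key] = value      # normal write to a live node
        else:
            failed.append(node)                # keep failed nodes in the pipeline
    for node in failed:
        # Pick any live node and store the hint there synchronously.
        handoff = next(iter(live_nodes))
        slop_store.setdefault(handoff, []).append((node, key, value))
    return failed

live = {"n1": {}, "n2": {}}
slop = {}
failed = put_with_handoff("k", "v", preferred=["n1", "n3"],
                          live_nodes=live, slop_store=slop)
print(failed)  # ['n3']: its write was handed off as a hint
print(slop)    # {'n1': [('n3', 'k', 'v')]}
```

A periodic job would then replay each hint against its original destination and delete it from the slop store once the write succeeds.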

This differs in implementation from Dynamo, whose paper describes hinted handoff in terms of a "sloppy quorum" rather than a dedicated slop store.


In the age of Web 2.0 there is a continuous flow of unstructured information, and users want to access more and more data every day. Traditional relational databases cannot handle unstructured data efficiently and also cannot scale far enough. Relational databases can be distributed, but compared to NoSQL storage systems they handle consistency and availability trade-offs less flexibly. It is also difficult to add or remove nodes/servers in relational database systems, because they are designed with a limited node structure in mind. NoSQL storage systems cannot and will not completely replace traditional relational database systems, but they suit some specific use cases better. This paper explained a popular caching system and described the data consistency approaches followed by big NoSQL storage systems such as Google Bigtable, Voldemort and Amazon Dynamo.

Future work includes exploring variations of Paxos such as Ring Paxos, Fast Paxos and Multi-Paxos. I also plan to explore the different querying mechanisms available for NoSQL storage systems and to develop new options for existing storage systems.