Auto Sharding of MongoDB Computer Science Essay

Sharding means that when a single machine cannot store or handle a data set because of its size, the data is spanned, or distributed horizontally, across several machines. MongoDB shards in order to scale horizontally for data size, index size, write throughput and consistent reads. To enable horizontal scaling across multiple nodes, MongoDB supports an automated sharding architecture, which also takes care of migrating data from one shard to another. For applications that exhaust the resources of a single database server, MongoDB can be converted to a cluster of shards that automatically manages failover and the balancing of nodes, without affecting the original application code. MongoDB allows the distribution of databases, collections or objects in collections.

MongoDB's Auto-Sharding

Sharding is the partitioning of data among multiple nodes/machines in such a manner that the order of the data is preserved.

For example, let's imagine sharding a collection of users by their email address. If we have three shard servers, the users will be divided up and assigned to shards in the following manner.

Shard 1                   Shard 2                     Shard 3
Andy → andy@abc.com       Dawne → Dawne@abc.com       King → King@abc.com
Andrew → andrew@abc.com   Deshp → Deshp@abc.com       Ricky → Ricky@abc.com
Berry → berry@abc.com     Eline → Eline@abc.com       Sagar → Sagar@abc.com
Bason → bason@abc.com     Germane → Germane@abc.com   Sandy → Sandy@abc.com
Candy → candy@abc.com     Ganesh → Ganesh@abc.com     Raj → Raj@abc.com
                          Happy → Happy@abc.com       Tarun → Tarun@abc.com

Table 1: Sharding done on shard key as email-id

It is observed that each machine stores chunks of data based on the email address of the user. This data is stored in a way that it is evenly distributed amongst the available machines.

One of the key principles of MongoDB's sharding is that it is range based: data is divided into ranges according to the shard key. If the shard key is the email address, then emails starting with A to C could lie in shard 1, D to H in shard 2 and I to Z in shard 3 (just an example). The collection is broken down into chunks by range. In the release described here, a chunk defaults to a maximum of 200 MB or 100,000 objects; the limit is configurable, and later MongoDB releases use a smaller default chunk size.
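The range lookup described above can be sketched in a few lines of Python. This is only an illustration, not MongoDB's implementation: the range boundaries, the shard names and the function `shard_for` are assumptions made up for the example (A to C, D to H, and the rest of the alphabet).

```python
from bisect import bisect_right

# Hypothetical range table: the lower bound of each key range and the shard
# that owns it. Real clusters keep this mapping on the config servers.
RANGE_STARTS = ["a", "d", "i"]
RANGE_OWNERS = ["shard1", "shard2", "shard3"]


def shard_for(email: str) -> str:
    """Return the shard whose key range contains the given email address."""
    key = email.lower()
    # bisect_right finds the first range start greater than the key; the
    # range just before that position is the one containing the key.
    index = bisect_right(RANGE_STARTS, key) - 1
    if index < 0:
        raise ValueError(f"no range covers key {email!r}")
    return RANGE_OWNERS[index]


for address in ("andy@abc.com", "Dawne@abc.com", "King@abc.com"):
    print(address, "->", shard_for(address))
```

Because the ranges are kept in sorted order, a lookup is a binary search, and keys that sort next to each other land on the same shard.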

The good thing here is that the sharding mechanism only starts when the amount of data stored in a single shard reaches a threshold. Once the threshold is reached, data is distributed evenly amongst the shards.

Key things in MongoDB Sharding are:

One can choose how to partition the data.

One can convert the system from a single master to a sharded system with no downtime. If you realize at a later time that the data stored in a single-master system is growing substantially and needs to be sharded, sharding can be introduced easily and with no or insignificant downtime, and there is no need to change a single line of code after switching from the single-master system to the sharded system.

A sharded system has the same features as a non-sharded single-master system.

It is fully consistent.

The mongos process is the interface between the application and MongoDB. It routes operations to the appropriate shards according to the partitioning information defined in the metadata. To the application, a sharded MongoDB cluster looks like a single-node database, and the application need not be concerned with which shard the data lies in. Queries are efficient because they, too, are distributed: a query on the shard key hits only those shards that contain the requested shard key values.

Sharding occurs not at the level of the database but at the level of collections. This is because it is likely that some collections in a database grow extensively while others hold only a few entries. For instance, consider a database with three collections: A, a collection of small, fixed data; B, a collection of small data that hardly changes; and C, a collection that grows extensively because of the heavy volume of operations on it. It is logical for MongoDB to shard collection C and keep collections A and B on a single shard, and this is exactly what it does.

Architectural Overview

A MongoDB shard cluster consists of:

Two or more shards: hold the data

One or more config servers: hold the metadata

Any number of routing processes to which the application servers connect

http://www.mongodb.org/download/attachments/2097393/sharding.PNG?version=2&modificationDate=1267724627656

Figure 1: MongoDB Shard Cluster

Shards

A shard can be a single master, a master with slaves, or a replica set:

A single shard holds one unique piece of the data; data is not duplicated across shards. A shard is usually a replica set. If there are 10 shards, each shard holds roughly 10% of the data. If there are 3 servers in each replica set, a 10-shard system has 30 servers in total, and each shard keeps three copies (replicas) of its data. Scaling is therefore achieved by adding more shards, and availability by adding more servers to each replica set.

Each shard stores its data using mongod processes (mongod is the core MongoDB database process). The set of servers/mongod processes within a shard makes up a replica set.
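The arithmetic behind the 10-shard example can be written out as a small helper. The function name `cluster_layout` and its result keys are hypothetical; the sketch also assumes that data is spread perfectly evenly, which a real cluster only approximates.

```python
def cluster_layout(shards: int, replicas_per_shard: int) -> dict:
    """Summarize a cluster in which every shard is a replica set."""
    if shards < 1 or replicas_per_shard < 1:
        raise ValueError("both counts must be at least 1")
    return {
        # Every shard contributes one server per replica.
        "total_servers": shards * replicas_per_shard,
        # Data is partitioned, not duplicated, across shards.
        "share_of_data_per_shard": 1 / shards,
        # Within a shard, each replica holds a full copy of that shard's data.
        "copies_of_each_document": replicas_per_shard,
    }


layout = cluster_layout(shards=10, replicas_per_shard=3)
print(layout["total_servers"])  # 30
```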

Shard Keys

Shard keys are the keys used to partition a collection: a single field, or a combination of fields, on which the data in the collection is distributed.

Example shard key patterns include the following:

{ name : 1 }

{ _id : 1 }

{ email : 1}

{ timestamp : -1 }
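As a rough illustration of how such a pattern is used, the sketch below builds the two admin command documents, `enableSharding` and `shardCollection`, that ask a cluster to shard a collection on a key pattern. The database name `app`, the collection name `users` and the helper `shard_collection_commands` are assumptions for the example. A driver such as PyMongo could send each document with `client.admin.command(doc)`, and the exact requirements vary between server versions.

```python
def shard_collection_commands(database: str, collection: str, key_pattern: dict) -> list:
    """Build the admin command documents for sharding one collection.

    Nothing is sent to a server here; the function only assembles the
    documents so that their shape can be inspected.
    """
    namespace = f"{database}.{collection}"
    return [
        # Mark the database as one whose collections may be sharded.
        {"enableSharding": database},
        # Shard the collection on the given key pattern.
        {"shardCollection": namespace, "key": dict(key_pattern)},
    ]


# Hypothetical names: database "app", collection "users", sharded on email.
for command in shard_collection_commands("app", "users", {"email": 1}):
    print(command)
```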

MongoDB's sharding is order-preserving; that is, if the data is sorted by shard key, adjacent data tends to fall on the same server. The config database stores the metadata indicating the location of data by range:

Collection   Min-key               Max-key               Location
users        Email starts with A   Email starts with G   Shard1
users        Email starts with H   Email starts with M   Shard2
users        Email starts with N   Email starts with R   Shard3
users        Email starts with S   Email starts with Z   Shard4

Table 2: Distribution of shard keys over shards

Chunks

A chunk is a contiguous range of data from a particular collection. A chunk is described by a triple of collection, min-key and max-key.

Chunks have a maximum size, which is usually 200 MB. Once a chunk reaches its maximum size it splits into two chunks. When a shard holds an excess of data, chunks from that shard migrate to other shards in the system. If a new shard is added to the system, chunks migrate to the new shard, keeping the data in all shards proportionate.

It is very important to choose a proper shard key because it influences the size of chunks. With a poorly chosen shard key, a chunk may hold only a single entry, or it may become too large and find itself unable to split. The key should be chosen so that its values are granular enough to distribute data evenly across the shards.

Thus, if it's possible that a single value within the shard key range might grow exceptionally large, it's best to use a compound shard key instead so that further discrimination of the values will be possible.

Config Servers

The config servers (generally three or more) store the cluster's metadata, that is, the basic metadata about each shard server and the chunks. In other words, they store the mapping that states which chunk goes to which shard server.

Because they play such a critical role, each config server holds a complete copy of all the metadata. A two-phase commit is used to keep the configuration data consistent among the config servers. If any one of the config servers fails, the config servers can no longer be written to and the cluster's metadata becomes read-only. This does not stop read/write operations on the data stored in MongoDB, though.

Routing Processes (mongos)

Mongos is a router: it takes each request from the application and figures out what to do with it. The mongos process can be thought of as a routing and coordination process that makes the various components of the cluster look like a single system. It routes each request to the appropriate shard and is also responsible for merging the results of a query from multiple shards if required.

The mongos processes have no persistent state, so they pull their state from the config servers on startup. Any changes that occur on the config servers are propagated to each mongos process.

Operation Types

Operations on a sharded system fall into one of two categories: global and targeted.

When the mongos communicates with a very small number of shards, the operation is targeted. These operations are efficient. In the example above, where the email address is the shard key, getting the users whose email address starts with 'A' would be a targeted operation, as the mongos process would only hit the servers responsible for that range, which it knows from the config servers.

When the mongos communicates with all the shards, the operation is global. There are two types of global operation. In the first, data is loaded serially from the shards, one by one; an example would be getting all the users with age > 30, where data is retrieved from each shard in turn and returned to the client. In the second, data is retrieved from all the shards simultaneously and then merged. These operations are less efficient than targeted ones, as the results from all the shards must be merged before being sent back to the client. An example would be getting all the users with age > 30 in sorted order: the mongos process needs to hit all the shards, get the result set from each one, merge the result sets and send the merged result back to the client.
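The difference between targeted and global operations can be sketched with a toy in-memory router. Everything here is an assumption made for illustration: the `SHARDS` and `RANGES` tables, the sample users and ages, and the function names. It is not how mongos is implemented, but it shows why a query on the shard key touches one shard while a query on another field touches all of them.

```python
import heapq

# Hypothetical shards, each holding the users whose email falls in its range.
SHARDS = {
    "shard1": [{"email": "andy@abc.com", "age": 41}, {"email": "candy@abc.com", "age": 25}],
    "shard2": [{"email": "dawne@abc.com", "age": 35}, {"email": "eline@abc.com", "age": 52}],
    "shard3": [{"email": "king@abc.com", "age": 33}, {"email": "raj@abc.com", "age": 28}],
}
# Half-open key ranges [low, high); "{" is the character after "z" in ASCII.
RANGES = {"shard1": ("a", "d"), "shard2": ("d", "i"), "shard3": ("i", "{")}


def find_by_email(email):
    """Targeted: only the shard whose range covers the key is queried."""
    for shard, (low, high) in RANGES.items():
        if low <= email < high:
            return [shard], [u for u in SHARDS[shard] if u["email"] == email]
    return [], []


def find_older_than(age, sort_by_age=False):
    """Global: every shard is queried because age is not the shard key."""
    per_shard = [[u for u in users if u["age"] > age] for users in SHARDS.values()]
    if not sort_by_age:
        # Unsorted: results are simply returned shard after shard.
        return [u for part in per_shard for u in part]
    # Sorted: each shard sorts locally, then the router merges the streams.
    sorted_parts = [sorted(part, key=lambda u: u["age"]) for part in per_shard]
    return list(heapq.merge(*sorted_parts, key=lambda u: u["age"]))


print(find_by_email("dawne@abc.com")[0])
print([u["age"] for u in find_older_than(30, sort_by_age=True)])
```

The sorted variant relies on a merge of already sorted per-shard results, which is cheaper than sorting everything again at the router but still requires an answer from every shard.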

Server Layout

Machines in a MongoDB architecture can be organized in any manner. You can have separate machines for the config server processes, the mongod processes and the mongos processes. However, dedicating separate servers to the config servers is not a good idea, because the load on them is much lower than on the other servers. It is therefore recommended that config server processes share physical servers with other processes. Here is an example where some sharing of physical machines is used to lay out a cluster.

http://www.mongodb.org/download/attachments/2097393/machines.PNG?version=1&modificationDate=1251314800364

Figure 2: Configuring servers in combination to distribute processing load

However, more configurations are imaginable, especially when it comes to mongos. For example, it's possible to run mongos processes on all of servers 1-6. Alternatively, as suggested earlier, the mongos processes can exist on each application server (server 7). There is some potential benefit to this configuration, as the communication between the app server and mongos can then occur over the localhost interface.

Balancing and Failover

Balancing, i.e. distributing data evenly amongst the shards, is managed by MongoDB. It is necessary because the load on one shard node can grow out of proportion to the other nodes. In these situations data is redistributed so that the load is equalized across the shards.
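A much simplified view of balancing is to move one chunk at a time from the fullest shard to the emptiest one until their chunk counts are close. The sketch below assumes exactly that policy; the function `balance`, the `threshold` parameter and the sample chunk counts are hypothetical, and MongoDB's actual balancer uses its own thresholds and rules.

```python
def balance(chunk_counts: dict, threshold: int = 2):
    """Even out chunk counts by repeated single-chunk migrations.

    Returns the final counts and the list of (source, destination) moves.
    """
    if threshold < 2:
        # With a threshold of 1 a chunk could bounce back and forth forever.
        raise ValueError("threshold must be at least 2")
    counts = dict(chunk_counts)
    migrations = []
    while True:
        fullest = max(counts, key=counts.get)
        emptiest = min(counts, key=counts.get)
        if counts[fullest] - counts[emptiest] < threshold:
            break
        # Migrate a single chunk from the fullest to the emptiest shard.
        counts[fullest] -= 1
        counts[emptiest] += 1
        migrations.append((fullest, emptiest))
    return counts, migrations


final_counts, moves = balance({"shard1": 8, "shard2": 1, "shard3": 0})
print(final_counts, len(moves))
```

Moving one chunk per step keeps each migration small, which is the reason data is managed in chunks rather than as whole shards.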

It is desired that each shard node is always available and never down. In practice, each shard node has more than one replica, together known as the replica set. Normally there are n >= 3 servers in the replica set, each containing a replica of the entire data set for the given shard. One of the servers in the replica set is the master. If the master fails, the remaining replicas elect a new master from among themselves.

Splitting and Migration

The figure below shows how the splitting of chunks and migration take place in MongoDB's auto-sharding. Splitting does not necessarily mean that there is migration. Chunks of data split when they exceed the maximum size. The resulting chunks can remain on the same server or be migrated to another server; this is decided by the balancer.

Entries initially fall in the first shard. As there is little data, sharding is not yet required and all the data falls in the same chunk.

When the chunk exceeds approximately 200 MB, it splits into two chunks. The split happens on the median value: if the shard key is an email address ranging from A to Z, the chunk will not simply be split into A-M and M-Z. Instead, MongoDB finds the median key and splits on it. This handles the case where all the email addresses start with 'C'.

Documents whose shard key values are close together in the ordering tend to be in the same chunk and therefore on the same server.

If, depending on the shard key, the data now starts falling in the second shard, the balancer observes that there is a load on the second shard. The chunk on the second shard splits after it reaches its maximum size, and the new chunk is migrated to the third shard, keeping the shards balanced.
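The median split in the walkthrough above can be sketched as follows. The function `split_chunk` and the sample addresses are made up for the example, and the sketch works on a plain list of keys, whereas a real chunk is defined by its min-key and max-key.

```python
def split_chunk(keys: list):
    """Split a chunk's shard-key values into two halves at the median key."""
    ordered = sorted(keys)
    if len(ordered) < 2:
        raise ValueError("a chunk needs at least two keys to split")
    split_key = ordered[len(ordered) // 2]
    left = [k for k in ordered if k < split_key]
    right = [k for k in ordered if k >= split_key]
    if not left:
        # The median equals the smallest key, so this simple sketch cannot
        # separate the values; a finer-grained shard key would be needed.
        raise ValueError("keys are not granular enough to split at the median")
    return split_key, left, right


# Every address starts with "c", yet the median still gives two even halves.
emails = [
    "cody@abc.com", "carl@abc.com", "cindy@abc.com",
    "cathy@abc.com", "clara@abc.com", "chris@abc.com",
]
split_key, left, right = split_chunk(emails)
print(split_key, len(left), len(right))
```

Splitting at a fixed letter such as M would leave all six of these addresses in one chunk, while the median divides them three and three.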

Conclusion

Sharding enables MongoDB to scale horizontally. This is similar to the scaling model of Bigtable and PNUTS. Data is split into ranges based on the shard key and distributed across multiple shards.