Survey On Distributed Hash Table Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

BitTorrent is one of the most common large file sharing protocols peer to peer which gained significance with increased sharing of data between the users over internet. Initially file sharing over the internet was done through client/server model where in client establish a connection with server (which is the central point) for the network and exchange data with it. Single server always causes serious bottleneck when the number of clients increases vastly. However, replacing single server with number of cloned servers may reduce the bottleneck to some extent but there are other issues like replication, distribution of the load and transaction recovery [1].

So, in order to provide a better model for supporting large file transfers, peer to peer sharing was introduced where in each client also act as server. Here the file is divided into chunks of data and peers (clients) share these data among themselves by uploading and downloading these chunks. For this trading of file chunks, each peer must know discover other peers which are downloading the same file in the overlay network. There are two approaches for

discovering the peers 1) Centralized Tracker 2) Using decentralized discovery approach which uses either Distributed Hash Tables (DHTs) or Peer Exchange (PEX) [2].


Peer to peer sharing using centralized tracker

In this section, I will discuss the approach how peer to peer sharing using a centralized tracker is implemented which was used for file sharing prior to DHTs. The following are the entities present in the BitTorrent architecture using centralized tracker. A static metainfo file or a "torrent file", a "tracker" , an Original Downloader (usually called "seed") and an End User Downloader (usually called "leech") [4].

In order to publish a file using BitTorrent, the basic step is to create a metainfo file for that file which is called "torrent". This is the basic file needed by any user to download the actual file needed. It has the extension .torrent and it contains the information like filename, size of the file, hashing information and the URL of the "tracker" [4].

While creating the torrent file, the original file is broken into small chunks usually of the size

256kb or 512kb which also contains SHA-1 hash codes used for verification. There are several BitTorrent clients which support a standard BitTorrent protocol, available to create this files which help users finding the seeds. The download of the file starts when the torrent file is opened in the client [4].

Figure1: BitTorrent with central Tracker [10] Tracker is the central server like system which

does not have the whole copy of the file but contains the information about the peers and also the information needed to help the peers finding other peers who are downloading the same file. The exchange of information between the tracker and the downloading peers is done using a simple protocol on top of HTTP. Here, a group of users who are downloading the same file

usually represent a swarm. First, the users send the information about the port it is using , the file it is downloading and other information required to the tracker. The tracker, in response, sends the peer list to that user with all the peers downloading the same file [4].

The download of the file for a user will only be initiated after the "seed" starts. A seed is the user having the entire file. Users who have only parts of the file to be downloaded are called "leeches". The seed that have the entire file must upload at least once so that whole of the file is distributed once and is available in the swarm. Once the entire copy of the file is distributed, the seed may stop uploading, depending on the numbers of users who are downloading that file. If the number of leeches is more, more number of seeds who keep uploading entire file is needed or else the seeds may stop after uploading once. Even if the seed stops uploading, the download of the peers continues as they exchange pieces depending on their requirement [4].

The SHA-1 hash codes in the torrent file and the corresponding pieces are compared and checked for the correctness of the chunks received. Once a user gets a chunk, it informs all other peers about the availability of the chunk it received so that they can request the download to this particular peer. So, which chunk to select from which peer must be decided efficiently as it makes a huge impact on the performance of the protocol. There are several policies for the

selection of the chunks that are to be replicated quickly and distributed among the peers as soon as possible to make them easily available for download. Some of the policies include rarest first, random first piece etc [4].

Among the peers to decide upon which peer should upload, BitTorrent makes use of tit-for- tat strategy. The peers who have high upload rate are allowed by maximum number of peers to download. Few peers who do not upload are penalized by not allowing them to download for a small period of time. This is called choking. Each user may choke only a fixed or limited number of users. After 10 or 20 seconds, the choked users are again unchoked [4].

From above, we can analyze that the bandwidth needed by the publisher is less if there are many peers who are downloading the same file. It is also less even when there is low number of peers when compared to the traditional client-server model. Choking and unchoking are the striking features provided by the BitTorrent protocol which assist only those users who are willing to both upload and download but not only just download as it penalizes the users who just download without uploading [4].

Now we will discuss how BitTorrent protocol is implemented with decentralized approach using distributed hash tables.

BitTorrent protocol using Distributed

Hash Tables (DHT)

The main problem with a centralized tracker is it is not very fault tolerant. There is no other way the peers can communicate other that the tracker. So, if the tracker goes down, then whole of the peers and network will fall into jeopardy. Another disadvantage with the centralized tracker is the network becomes more vulnerable to denial of service attacks where in a particular peer may never get a chance to download the chunk it needed. So, a better approach which uses a decentralized tracker was needed.

Figure2: BitTorrent with decentralized

Tracker [4]

Now, instead of a single centralized tracker, we use a kind of distributed tracker where in, each node acts as a lightweight tracker. This approach

uses distributed hash tables (DHTs). The DHT system consists of set of nodes and keys where in it performs the function of a hash table. The key-value pairs are stored and values are looked up given a key. The distinctive feature of distributed has tables is that the storage and lookups are also distributed among the nodes and the nodes can join and leave the network at their wish without disturbing whole of the process.

The DHT structure consists of keyspace and the overlay network. Keyspace partitioning: Each node is assigned with a key with which it is identified in the network. Consistent hashing technique is used to lookup the nodes with the keys given. Consistent hashing technique helps the system to be in consistent state even though the nodes are added and removed at any point of time. Very less reorganizing is needed which makes it possible to add or remove as many nodes as possible. The scalability of this type of hashing is further improved by avoiding the requirement that each node stores the information about the all other nodes in the network.

The following are two widely used approaches that use the DHTs. 1) Kademlia 2)Chord DHT. Here we will discuss about how the DHTs are implemented in Kademlia protocol and how indexing is done on DHTs.


Kademlia protocol does not assume any super nodes as it considers all nodes to be equal. Kademlia uses XOR metric in order to assign the <key, value> pairs to the nodes in the overlay network. So, the distance between two nodes x and y can be defined as d(x, y)=x XOR y where XOR is a bitwise XOR. Considering a binary-tree based structure system, the distance between two nodes in a fully grown tree would be the height of the smallest sub tree containing both the nodes. In an incomplete tree, the closest node would be one which shares longest common prefix with it.

Each node stores a list of triples containing IP address, UDP port and Node ID for the purpose of querying messages. Each list constitute as k buckets. Each node keeps such list for nodes of distance 2^i to 2^(i+1) where i is less than 160. Each k bucket is always stored in a sorted order in such a way that the head contains least recently seen nodes and the tail with most recently seen. Here k is system-wide replication parameter which is chosen in such a way that the probability of k nodes failing within an hour of each other is very low. The k buckets effectively implement the least recently seen policy and provide good resistance to denial of service attack.

The information in the buckets is updated as follows. When any node receives a message other, one of the following applies. 1) If the

receiver already contains the sender's node ID in the k-bucket it moves that node id to the tail of the bucket indicating it is most recently used. 2) If the receiver does not contain the sender's node ID in the appropriate k-bucket and the node has fewer than k buckets then the new node information is inserted into the bucket and it is moved to the tail. 3) If the receiver does not contain the sender's node ID and the k-bucket is full then, receiver pings the least recently used node that is at head. If it responds, the new node is discarded or else, the least recently node is deleted and the new node added at the tail.

The structure of routing table in Kademlia protocol follows a binary tree structure whose leaves are k-buckets. The position in the binary tree is defined by the prefix of the node ID and each k-bucket contains the nodes with common prefix and covers some range of ID space. So, overall k-buckets cover the 160 bit ID space.

Figure 3: Kademlia Tree [11]

Kademlia protocol uses the following remote procedure calls for the sake of communication and lookup between the nodes in the overlay

network. It uses PING, STORE, FIND_NODE, and FIND_VALUE [3].

Any node can send a PING message to another node which may or may not respond. Whenever a sender sends a ping message to the receiver, the receiver must update its bucket as described above and if the receiver responds then the sender must also update its corresponding bucket. Any node can call STORE rpc along with a key and block of data and the receiver stores it and makes it available for the later retrieval. FIND_NODE call includes a 160-bit key where in the receiver must return k or fewer triples of <IP address, UDP port, Node ID> which are closest to the key sent. FIND_VALUE also includes 160-bit key but the receiver returns only the data corresponding to that value. If that is not present, it returns k triples similar to FIND_NODE [3].

The node lookup procedure in Kademlia uses an iterative approach that returns k nearest nodes to the given key. First, the search finds alpha nodes from the k-bucket which is closest to the given key and if the k-bucket contains fewer than alpha nodes, it selects nodes from other buckets. Then it selects and stores the closest node of these alpha nodes. Then the node sends an asynchronous FIND_NODE to these alpha nodes in parallel and each may respond with k triples. Different parallel methods can be employed like strict parallelism, bounded parallelism, loose parallelism etc. If any of the nodes doesn't respond the sender discards that

node. Then, it updates the closest node depending on the nodes it received. After this, the initiating node again queries other k nodes and repeats the same procedure. The end result it would get is the list of k active nodes with their IP address, port and Node ID. The lookups are performed in hourly interval in general so us to update the buckets every time in order to know which are present and which nodes have left from the network [7].

Chord DHT

The key distinctive features of chord DHT which have made one of the solutions for using it as a protocol in peer to peer systems are improved performance, correctness, simplicity and robustness [8].

The nodes for the Chord ring are assigned with IDs which are formed using consistent hashing where the IP addresses are hashed with their port numbers. Keys are assigned and distributed uniformly such that the clustering of the keys is minimized using the above generated hashes which are known to be SHA-1 hashes and the values that depend on the applications. Here, consistent hashing SHA-1 is being to have well distributed keys which are random so that the load balances over the several nodes in the ring.A Chord ring has 2n nodes and number of

keys range from 0 to 2m-1. Once the IDs and keys

are arranged in the key address space, a node with ID that closely follows key k is responsible for that key and is called successor node [8].

Each node in the chord ring uses a data structure called finger table which is typically the routing table that contains the information about its successor where in it contains the node ID and an offset. Each finger table has only small number of entries as each node should have just have the information about its adjacent successor [8]. In effect, the finger table contains information about logN nodes. When a lookup is performed, the query is forwarded to only those nodes which are power of two (inverses) far from the key it is being looked up from. So, the lookups take O(logN) time where N is the number of nodes in the chord ring [9].

The node joins and departures are done using stabilization where the finger tables are updated to guarantee the integrity and correctness of the routing information. It is the simplicity of the finger tables that enable the lookups quick and also the node joins and departures very easy.

Although Chord DHT is one of the successful peer to peer sharing protocols using

Figure 4: Chord ring showing nodes with finger tables [12]

decentralized approach, it has few limitations. Since the Chord DHT uses the IP addresses and port numbers of the nodes, it does not allow the client/nodes to be anonymous. The query send is not reliable enough to reach the destination in few cases. If the update of finger tables take place after the query, then there is a chance of query leaving from the Chord ring and miss the destination. Although Chord can handle node failures, large number of node failures at the same time can make it unstable.

Suggestive improvements for Chord DHT

Use of SHA-1 hashing is not proved to be completely secure and it may also lead to collisions. So a better hashing mechanism, probably SHA-2 hashing is suggested. But it causes the overhead of carrying messages which are larger in size.

By increasing the number of predecessors in the finger table, large node failures can be handled better.

Building an overlay over the Chord ring can make it possibility to provide the feature of anonymity with improved security but it makes it little complex and adds overhead [8].

Implementation of BitTorrent protocol using DHTs

Many clients have emerged using this BitTorrent protocol to enjoy the benefits of decentralized approach. Generally, for the present day client, peers that implement BitTorrent protocol listen on TCP nodes while the DHT nodes which implement DHT protocol listens on the UDP port. Whenever a node sends a request, it is responded with the nodes IDs that are the closest to the SHA-1 (or the hash value generated) value that is being sent. Here, it makes a iterative search until the node finds closest node with the hash value as specified in the above two protocols.

The protocol uses tokens along with the SHA-1 hash values in order prevent the malicious nodes competing for the torrents. These values are changed at small intervals and are sent to the receiver node every time the message is send to ensure security. Current BitTorrent protocol is extended to assist the start of the download when there are no nodes in the routing table initially, by exchanging UDP port numbers between the peers [15].

Challenges for Distributed Hash Tables

Although, DHT based system solves the problems posed by the centralized tracker, it has got few challenges to face it while implementation. The first challenge it search the torrent file that is needed by the user. Since, it is difficult to implement keyword search for querying, the user must specify almost the exact information. It is difficult because it may cause load imbalance and also extra effort for storage and redundancy for multiple keywords [13].

The latest BitTorrent clients overcome this problem using various protocols. For example, Tribler which is the first to implement decentralized keyword search uses gossip protocol which uses sharing important information or most discussed information between two peers. Another BitTorrent client BitComment uses Torrent Exchange feature for implementing keyword searches by making use of previously downloaded torrents [2].

The second challenge the decentralized approach poses is against the malicious nodes which may cause various kinds of attacks such as 1) Eclipse attacks where the identities of malicious nodes are filled in the routing table by the attacker 2) Sybil attacks where in a single node pretends to be many nodes with different identities causing redundancy and 3) Routing and Storage attacks where it the malicious node corrupts the routing

table information and redirects to malicious routes or nodes [5].

Various programs implementing Kademlia make use of secured authenticated IDs given to nodes to prevent the eclipse attacks and routing and storage attacks. But, most of the current DHT systems still suffer from Sybil attacks [11].


Using a centralized tracker system for peer to peer file sharing was an improvement over traditional client server architecture but posed numerous problems of loading the tracker with high load. So, a distributed tracker was much needed to share the load where in it used distributed hash tables (DHTs) were used and each client acted as a light weight tracker. Kademlia and Chord are the two trackerless protocols implemented using DHTs and are most widely used now. Chord DHT had few problems like making use of rigid finger tables and does not learn useful information from the queries received which is unlike how kademlia protocol was implemented. DHTs implemented initially had few drawbacks like they were easily prone to the malicious nodes and there were problems with searching the torrent files. But the current implementations have solved most of the problems ensuring a reliable and quick file sharing. Current ongoing research suggests that the clients are trying to implement a new feature Security Enhanced Transfer (SET) which may

improve the speed of file sharing and making content distribution systems share [14].