With the growing demand for computing devices and automation, most applications and servers need to capture very large amounts of data. Such data is usually stored on a distributed file system so that it can be shared among servers. The data must remain reliable and its integrity must be maintained. This paper discusses the Hadoop Distributed File System (HDFS), which stores file data in chunks called blocks. It was designed to store very large data sets reliably and consistently, and it became popular for streaming those data sets at high bandwidth. HDFS divides incoming data into blocks that are kept on distributed servers. Because computation runs in parallel across the servers, storage and load can grow together as servers are added. This paper covers the architecture and main concepts of HDFS, along with its advantages and disadvantages for applications.
A distributed file system (DFS) allows files to be accessed from multiple hosts, so that many users can share files across multiple systems. Internet search engines are a prominent example of applications built on a distributed file system. HDFS is a DFS designed to run on low-cost commodity hardware. Although it has much in common with traditional distributed file systems, it differs in its very high fault tolerance. HDFS stores metadata and application data separately: the metadata is kept on a dedicated server called the NameNode, and the application data is kept on other servers called DataNodes. The nodes communicate using TCP-based protocols. Because file contents are replicated on multiple DataNodes, data reliability in HDFS is high compared with other distributed file systems.
ASSUMPTIONS AND GOALS
Hardware failure is an important concern for any distributed file system. A typical HDFS instance consists of hundreds or thousands of server machines, each storing part of the file system's data. Because there is a huge number of components and each has a non-trivial probability of failure, some part of HDFS is non-functional at any given time. Detecting such faults and recovering the affected data quickly is therefore a core architectural goal.
Significantly Large Datasets
HDFS supports applications with large data sets. Files are typically gigabytes to terabytes in size. HDFS scales to hundreds of nodes in a single cluster and thus provides high aggregate data bandwidth. This architectural goal lets a single instance support thousands of files.
Basic Coherent Model
HDFS supports a write-once, read-many access model for files. This model simplifies data coherency issues and thereby enables high-throughput data access. The model also leaves room for supporting appending writes to files in the future.
Cheaper data computation
When the size of the data set is huge, it is desirable to move the computation closer to the data set. This improves the overall throughput of the system. HDFS allows the application to move the requested computation closer to the data sets by providing appropriate interfaces.
Interoperability across various Hardware and Software Platforms:
Portability is an important feature of any distributed file system. Data sets now grow many-fold, and heterogeneous applications often run on the same machine. To meet this need, a DFS must be easily portable from one platform to another. HDFS provides such interoperability and is therefore widely accepted as a platform for a large set of applications.
HDFS has a typical master/slave architecture in which a single NameNode acts as the master server and multiple DataNodes act as slave servers. The hierarchy of files and directories forms the HDFS namespace. Files and directories are represented on the NameNode by inodes, which record attributes such as permissions and modification and access times. The content of a file is divided into blocks, and each block is typically replicated three times on different DataNodes. The NameNode manages the file system namespace, regulates client access to files, and maintains the mapping of blocks to DataNodes. An HDFS cluster has a star-like topology, with a single NameNode per cluster and thousands of DataNodes and clients. The NameNode handles namespace operations such as opening, closing and renaming files and directories.
An important property of the DataNode is that the size of the data file holding a block equals the actual length of the block. Unlike traditional file systems, a DataNode does not require extra space to round the file up to the nominal block size. When a DataNode first connects to the NameNode, the two perform a handshake in which the namespace ID and the software version of the DataNode are verified. If either does not match that of the NameNode, the DataNode is barred from communicating with the NameNode. This prevents incompatible nodes from joining the cluster and protects the file system from corruption.
When a DataNode registers with a NameNode, a storage ID is assigned to the DataNode, and this ID never changes once it has been assigned. When a file system instance is formatted, a new namespace ID is assigned to all of its nodes. Nodes with the same namespace ID can communicate with each other, while nodes with a different namespace ID are rejected, which protects the integrity of the file system. A newly created DataNode can join any cluster, and it then receives that cluster's namespace ID.
The DataNode serves read and write requests from clients. On instruction from the NameNode, it also performs block creation, replication and deletion.
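The handshake and namespace ID rules can be illustrated with a minimal Python sketch. The names `NodeInfo` and `handshake` are hypothetical and chosen only for this illustration; they are not part of any HDFS API, and a real handshake involves more state than is shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeInfo:
    # None models a newly created DataNode that has not joined a cluster yet.
    namespace_id: Optional[int]
    software_version: str


def handshake(namenode: NodeInfo, datanode: NodeInfo) -> bool:
    """Return True if the DataNode may join (hypothetical helper)."""
    if datanode.software_version != namenode.software_version:
        return False  # version mismatch: the DataNode is refused
    if datanode.namespace_id is None:
        # A new DataNode adopts the namespace ID of the cluster it joins.
        datanode.namespace_id = namenode.namespace_id
        return True
    # An existing DataNode must carry the same namespace ID as the NameNode.
    return datanode.namespace_id == namenode.namespace_id
```

In this sketch a fresh DataNode is accepted and stamped with the cluster's namespace ID, while a node formatted for another namespace is rejected.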
Fig 1. HDFS ARCHITECTURE
Connectivity between the NameNode and the DataNodes is another important factor in the smooth operation of HDFS, and it can be lost for several reasons. The NameNode must know the status of each DataNode so that incoming client requests can be handled effectively. Each DataNode therefore sends periodic heartbeats to its NameNode to confirm that it is still active and that the block replicas it hosts are available. By default, a DataNode sends a heartbeat every three seconds. If the NameNode does not receive a heartbeat from a DataNode for about ten minutes, it considers that DataNode inactive (a dead node) and stops forwarding client requests to it. The NameNode then schedules re-replication of the blocks held on the dead node to other active DataNodes.
Heartbeats typically carry information such as the storage capacity of the DataNode and the number of data transfers currently in progress. These statistics help the NameNode manage block replication and balance the load of incoming client requests.
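The liveness rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `HeartbeatMonitor` is a hypothetical class, timestamps are plain seconds, and the constants take the three-second and ten-minute values given in the text.

```python
HEARTBEAT_INTERVAL_S = 3      # default heartbeat interval stated in the text
DEAD_TIMEOUT_S = 10 * 60      # about ten minutes without a heartbeat


class HeartbeatMonitor:
    """Tracks the last heartbeat per DataNode (hypothetical class)."""

    def __init__(self):
        self.last_seen = {}  # DataNode id -> time of last heartbeat, seconds

    def record(self, datanode_id, now):
        """Called whenever a heartbeat arrives from a DataNode."""
        self.last_seen[datanode_id] = now

    def dead_nodes(self, now):
        """Return DataNodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > DEAD_TIMEOUT_S
        )
```

A NameNode-like component would call `record` on every heartbeat and periodically call `dead_nodes` to decide which blocks need re-replication.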
Fig 2. HDFS HEARTBEAT PROCESS
Relationship between the NameNode and DataNodes:
The NameNode and DataNode are software components that can be installed independently, on the same or on different servers, across heterogeneous operating systems. HDFS is built in Java, so any machine that supports Java can run it. A typical HDFS cluster has one server dedicated to the NameNode (possibly also running a DataNode) and several servers that run one DataNode each.
The NameNode does not call DataNodes directly. Instead, it sends instructions in its replies to the heartbeats sent by DataNodes. These instructions include replicating a particular block to another node, removing a particular block replica, or shutting down the node.
The HDFS client is a library that exports the HDFS file system interface and gives user applications access to HDFS. Like a traditional file system, HDFS supports creating, deleting, reading, writing and renaming files, and creating and deleting directories. A user refers to files and directories by their paths in the namespace. The user application does not need to know how the file system manages these files on different servers or where it keeps the replicas.
Process request flow
When a user application reads a file, the HDFS client first asks the NameNode for the list of DataNodes that host replicas of the file's blocks. This list is sorted by network distance from the client. The client then contacts a DataNode directly and requests the transfer of the desired block. When the application writes a file, the client asks the NameNode to choose DataNodes to host replicas of the first block. The client then organizes a pipeline from node to node and sends the data. Once the first block is full, the client asks the NameNode to choose DataNodes to host the next block. HDFS lets an application choose the replication factor. The default is three, but for critical files the factor can be increased, which improves fault tolerance and increases read bandwidth.
Fig 3. Process request flow between the HDFS client, NameNode and the DataNode
In addition to its primary role of serving client requests and managing the DataNodes, a NameNode can alternatively act as a CheckpointNode or a BackupNode. The role is specified when the node starts up.
The checkpoint and the journal together record the latest state of the namespace on persistent storage. The CheckpointNode periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal. This keeps the journal from growing without bound, and it is good practice to create a checkpoint daily.
The BackupNode can be viewed as a read-only NameNode. Like the CheckpointNode it creates periodic checkpoints, but it also maintains an up-to-date image of the file system namespace that is always synchronized with the state of the active NameNode. The BackupNode also stores the journal, which records all namespace transactions. It holds all file system metadata except block locations. If the NameNode fails, the BackupNode's up-to-date image can be used to restore the latest state of the namespace, which improves the fault tolerance of HDFS.
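One way to picture checkpointing is as a replay of journal records onto the previous namespace image. The sketch below is a toy model under stated assumptions: the checkpoint is a Python dict, the journal is a list of `(op, path, value)` tuples with hypothetical operations "put" and "delete", and `apply_journal` is a name invented for this illustration. The real on-disk formats are far richer.

```python
def apply_journal(checkpoint, journal):
    """Replay journal records onto a copy of the checkpoint.

    checkpoint: dict mapping path -> metadata (a toy namespace image)
    journal: list of (op, path, value) tuples; op is "put" or "delete"
    Returns the new checkpoint and a fresh, empty journal.
    """
    image = dict(checkpoint)  # leave the old checkpoint untouched
    for op, path, value in journal:
        if op == "put":
            image[path] = value
        elif op == "delete":
            image.pop(path, None)
        else:
            raise ValueError(f"unknown journal op: {op}")
    return image, []
```

Because the old checkpoint is copied rather than modified, it remains available until the new one has been written successfully.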
File I/O OPERATIONS AND REPLICA MANAGEMENT
File read and write
HDFS uses a single-writer, multiple-reader model for its read and write operations. Write access is governed by a lease. When a client opens a file for writing, it is granted the lease for that file. The client renews the lease periodically by sending a heartbeat to the NameNode. When the file is closed, the lease is revoked and another client can acquire it to write to the file. The duration of a lease is bounded by two limits:
Soft limit: Until the soft limit expires, the writer has exclusive access to the file. If the soft limit expires and the client has neither closed the file nor renewed the lease, another client can acquire the lease.
Hard limit: If the hard limit expires and the client has still not renewed the lease, HDFS assumes that the client has quit, closes the file automatically on the writer's behalf, and recovers the lease.
Even while a writer holds the lease on a file, other clients can read it, since HDFS allows concurrent readers. A file in HDFS consists of blocks. When a client needs a new block while writing, the NameNode allocates a block with a unique block ID and determines the list of DataNodes that will host its replicas.
The DataNodes form a pipeline whose order minimizes the total network distance from the client to the last DataNode. Data is sent through this pipeline as a sequence of packets. Although each packet is acknowledged, the client does not wait for the acknowledgement of the previous packet before sending the next one.
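The two limits can be summarised as a small classification function. This is a sketch only: `lease_state` and its state names are hypothetical, and the limit values are passed in as parameters because the text does not state them.

```python
def lease_state(now, last_renewal, soft_limit, hard_limit):
    """Classify a write lease by the time since its last renewal.

    All arguments are in seconds; soft_limit must be below hard_limit.
    """
    idle = now - last_renewal
    if idle < soft_limit:
        return "exclusive"    # the writer keeps exclusive access
    if idle < hard_limit:
        return "preemptible"  # another client may acquire the lease
    return "expired"          # the file is closed and the lease recovered
```

For example, with a soft limit of 60 seconds and a hard limit of 3600 seconds (illustrative values), a lease last renewed 120 seconds ago is classified as preemptible.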
Fig 4. Representation of data nodes using pipeline 
The client application must call the hflush operation if data it has written needs to be visible to readers before the file is closed. To detect data corruption that may arise when data is replicated across DataNodes, HDFS generates checksums for each block when it is created. The checksums are sent to the DataNode along with the data, and the DataNode stores them in a metadata file separate from the block data. When a file is read, the client computes the checksums of the data it receives and compares them with the checksums sent by the DataNode. If they do not match, the client notifies the NameNode and fetches a replica of the block from a different DataNode.
Each block has a target number of replicas. The NameNode tracks the replicas of each block from the block reports sent by the DataNodes. When it detects that a block is under- or over-replicated, it adds or removes a replica accordingly. When removing a replica, the NameNode prefers not to reduce the number of racks that host replicas; instead it removes a replica from a DataNode in a rack that holds more than one. The primary goal is to balance storage utilization across DataNodes without reducing block availability.
When a block is under-replicated, it is placed in a replication priority queue. A block with the fewest replicas has the highest priority. A background thread repeatedly scans the head of the queue to decide where the new replica should be placed. If the NameNode finds that the two existing replicas are on the same rack, it places the third replica on a different rack; otherwise it places the replica on a different node in the same rack as an existing replica.
If the NameNode finds that all replicas of a block are located on one rack, it marks the block as mis-replicated and replicates it to a different rack. Once the new replica has been created, the block becomes over-replicated, and the over-replication policy deletes one of the old replicas rather than reducing the number of racks.
When a large cluster joins many nodes to form a file system, it is not advisable to place them in one flat topology. The nodes should be spread across multiple racks. The nodes of a rack share a switch, and the rack switches are in turn connected by core switches. When a node in rack 1 communicates with a node in rack 2, the traffic has to traverse several switches. Communication between two nodes in the same rack therefore has higher network bandwidth than communication between nodes in different racks. The figure below illustrates a cluster topology with two racks of three nodes each.
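The read-side verification can be sketched as follows. This is a simplified illustration, not the HDFS implementation: it keeps a single CRC32 per block using Python's `zlib.crc32`, and the function names `write_block` and `read_block` are hypothetical.

```python
import zlib


def write_block(data: bytes):
    """Return the block with its checksum, as a DataNode would store them."""
    return data, zlib.crc32(data)


def read_block(replicas):
    """Return the first replica whose checksum verifies.

    replicas: list of (data, stored_checksum) pairs from different DataNodes.
    Also returns the indices of corrupt replicas, which a client would
    report to the NameNode.
    """
    corrupt = []
    for index, (data, stored) in enumerate(replicas):
        if zlib.crc32(data) == stored:
            return data, corrupt
        corrupt.append(index)
    raise IOError("all replicas failed checksum verification")
```

If the first replica is corrupt, the reader falls through to the next one and still returns valid data, which mirrors the recovery path described above.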
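The priority rule for re-replication, fewest replicas first, can be sketched with a heap. The function name `replication_order` and the dictionary input are assumptions made for this illustration; a real NameNode maintains its queues incrementally rather than rebuilding them.

```python
import heapq


def replication_order(replica_counts, target=3):
    """Return under-replicated block IDs, fewest live replicas first.

    replica_counts: dict mapping block ID -> current number of live replicas.
    target: desired replication factor (three by default, as in the text).
    """
    heap = [(count, block) for block, count in replica_counts.items()
            if count < target]
    heapq.heapify(heap)
    order = []
    while heap:
        _, block = heapq.heappop(heap)
        order.append(block)
    return order
```

A block with a single remaining replica is therefore handled before a block that still has two.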
Fig 5. Cluster Topology 
HDFS estimates the network bandwidth between two nodes from their distance. The distance between two nodes is the sum of their distances to their closest common ancestor in the topology. The shorter the distance between two nodes, the greater the bandwidth available for transferring data between them. The HDFS administrator can configure a script that returns a node's rack identification. The NameNode is the central place that keeps track of the rack location of each DataNode.
The placement of replicas determines data reliability, availability and read/write performance. The HDFS block placement policy aims to provide good data reliability, availability and read bandwidth at minimum write cost. If the first replica of a block is placed on one node, the second and third replicas are placed on different nodes, in the same rack or preferably in a different rack. To reduce inter-rack write traffic, the three replicas are placed in two racks rather than three; this is acceptable because rack failures are far less likely than node failures.
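The distance rule can be written down directly if each node is identified by a path in the topology tree. The sketch below assumes hypothetical location strings of the form "/rack1/node1"; the function name `distance` is chosen for illustration.

```python
def distance(path_a: str, path_b: str) -> int:
    """Sum of the hops from each node up to their closest common ancestor."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1  # one more shared level of the topology tree
    return (len(a) - common) + (len(b) - common)
```

Under this convention, a node is at distance 0 from itself, two nodes in the same rack are at distance 2, and nodes in different racks are at distance 4.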
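The placement constraints can be expressed as a simple validity check. This sketch assumes two rules: no DataNode holds more than one replica of a block, and no rack holds more than two, the latter inferred from the two-rack layout described above. The name `placement_ok` and the `(rack, node)` tuples are hypothetical.

```python
from collections import Counter


def placement_ok(replicas, max_per_rack=2):
    """Check a replica placement given as a list of (rack, node) pairs."""
    # No DataNode may hold more than one replica of the same block.
    if len(set(replicas)) != len(replicas):
        return False
    # Limit replicas per rack so a single rack failure cannot lose them all.
    per_rack = Counter(rack for rack, _ in replicas)
    return all(count <= max_per_rack for count in per_rack.values())
```

A placement with one replica in one rack and two replicas on different nodes of a second rack passes the check, while three replicas in a single rack do not.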
PROS of HDFS
HDFS deals with large amounts of data, with files usually gigabytes or terabytes in size. It is therefore very important for HDFS to have a systematic fault tolerance mechanism that minimizes data loss. Replication management addresses this issue and makes the system robust.
HDFS has cluster rebalancing schemes. These can move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. If there is a sudden surge in demand for a particular block, a scheme can create additional replicas and rebalance other data in the cluster.
The HDFS client implements checksum checking on the contents of HDFS files and stores the checksums separately from the data. When data is received from a DataNode, the client computes its checksum, compares it with the stored checksum and thereby detects any corruption. Corruption can occur because of network errors or faulty storage devices. If the checksums do not match, the client can retrieve the block from another DataNode that holds a replica of the same block.
CONS of HDFS
Name Node Failure or One point failure
The HDFS architecture depends heavily on the NameNode, which is the master server that serves client requests and manages all the DataNodes. The NameNode is therefore a single point of failure for an HDFS cluster. If the NameNode fails, manual intervention is required. At present, HDFS does not support automatic restart of the NameNode or failover to another NameNode. In such cases the BackupNode can be used to retrieve the recent state of the file system namespace.
COMPARISON OF HDFS WITH OTHER DFS 
Cluster-based, asymmetric, parallel, object-based
Cluster-based, asymmetric, parallel, object-based
Cluster-based, asymmetric, parallel, object-based
Cluster-based, asymmetric, parallel, object-based
Cluster-based, asymmetric, parallel, object-based
Cluster-based, symmetric, parallel
RPC/TCP and UDP
RPC/UDP Multi RPC
RPC/TCP and UDP
Central metadata server
Central Metadata server
Central metadata server
Central metadata server
Central metadata server
Metadata distributed in all server
Write-once-read-many, give locks on objects to clients, using leases
Write-once-read-many, give locks on objects to clients, using leases
Write-once-read-many, give locks on objects to clients, one-copy serializability
Hybrid locking mechanism, using leases
Write-once-read-many, multiple producer/single consumer, give locks on objects to clients, using leases
Write-once-read-many, give locks on objects to clients, using leases
Consistency and Replication
Server side replication, Asynchronous replication, checksum
Client side caching, Server side replication, Asynchronous replication, checksum
Client Side caching, Server side replication
Server side replication, Only metadata replication, Client side caching
Server side replication, Asynchronous replication, Checksum, relax consistency among replications of data objects
Server side Replication, Only Metadata Replication
No dedicated security mechanism
Use of Kerberos for authentication and implementation of ACL
Authentication, Encryption and ACL
Security in the form of authentication, authorization and privacy
No dedicated security mechanism
Security in form of ACL
APPLICATION OF HDFS AT YAHOO
Yahoo has implemented a large HDFS cluster with 3500 nodes. Each cluster node has two quad-core Xeon processors at 2.5 GHz, 4 directly attached SATA drives, 16 GB of RAM and 1-gigabit Ethernet, and runs Red Hat Enterprise Linux with Sun Java JDK 1.6.0_13-b03. HDFS occupies 70% of the disk space; the remainder is taken by the operating system and the output of map tasks.
Forty nodes in a single rack share an IP switch, and the rack switches are connected to each of eight core switches. The core switches provide connectivity between the racks and out-of-cluster resources. The BackupNode and NameNode of each cluster are equipped with 64 GB of RAM, and these hosts do not run application tasks. With 4000 nodes in a cluster there are 11 PB of storage available as blocks, which are replicated three times to yield 3.7 PB of storage for user applications. The cluster nodes have improved over the years along with the underlying technology: newer nodes have better processors, higher-capacity disks and faster RAM. Older nodes are used for development and testing of Hadoop.
Consider a cluster of 4000 nodes with 65 million files and 80 million blocks. Since each block is replicated three times, every DataNode holds about 60,000 block replicas. Each day, user applications create two million new files on the cluster. In total, the 40,000 nodes in Hadoop clusters provide 40 PB of online data storage.
Becoming a key component of Yahoo's technology suite meant dealing with the technical problems that separate a research project from the custodian of many petabytes of corporate data. Robustness and durability of data are the foremost issues. Other issues that must also be considered are economical performance, provision for resource sharing among members of the user community, and ease of administration by the system operators.
To conclude, the Hadoop Distributed File System is a storage platform that keeps large data sets on many machines while maintaining data integrity and consistency. HDFS shares several features with classic distributed file systems, from which it inherits its architecture and synchronization methods. Its main distinguishing feature is the write-once, read-many model: multiple concurrent readers satisfy the requirements of concurrency and parallel processing, while single-writer access provides coherency and better throughput. HDFS is a fault-tolerant and scalable distributed system, and it can also be integrated with cloud software. HDFS thus covers most of the requirements expected of a distributed file system.
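The figures quoted for the 4000-node cluster can be checked with a few lines of arithmetic. The variable names below are chosen for illustration; the input values are the ones given in the text.

```python
nodes = 4000
blocks = 80_000_000
replication = 3
raw_storage_pb = 11  # storage available as blocks, in petabytes

# Every block exists three times, spread evenly over the DataNodes.
replicas_per_node = blocks * replication // nodes

# Usable application storage is the raw capacity divided by the
# replication factor.
usable_pb = raw_storage_pb / replication
```

The first result matches the stated figure of 60,000 block replicas per DataNode, and the second rounds to the stated 3.7 PB of application storage.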