This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.
Distributed system consists of multiple autonomous computers that are connected via a communication network. Due to wide availability of high performance, low-cost computer and network devices, distributed system have become commonplace. When a distributed system is large, management infrastructure always does not scale well. Therefore choice of the network topology and method used to build distributed system are consideration so as to lower the cost of linking nodes together , minimize the message delay in transmission, optimize the reliability and scalability of the system, and simplify system resource management.
In my report I will describe a new distributed management system that is able to recover, detect the unexpected failure of system service, handle the dynamic increase of system size, and manage system resource. Tree-structured network and ring-structured network are the topologies that i used in the system.
A distributed system is a collection of autonomous computers that connected through a communication network. Generally it is difficult to provide software in large-scale distributed system that is reliable, manageable, fault-tolerant and easy to use compare to small-scale distributed system. U.S. Department of Energy has established Scale System Software Center  to solve problem of lack of software for utilization and effective management of computational resources. The goals are to develop an integrated suite of machine independent, scalable system software components need for the Scientific Discovery through Advance Computing (SciDAC) to provide open source solutions which work in small and large systems. Actually my work is a part of SciDAC project.
The distributed management system that I described have one master node and a lots of slave nodes. The master node as manager in the system that responsible for construct and recover the system. And each slave node's role is resource usage, detecting, reporting its working status, reporting its neighboring node's failure, and follow instruction from master node to dynamically adjusting its position. In this report will show the result that tested recovery and construction performance, and communication performance in the ring-structured network and tree-structured network.
In this report, background is presented in section 2. System components that including the different connection types and different node types used in the system is described in section 3. In section 4 is about system design and implementation. Test environment and result is presented in section 5. Finally, section 6 will conclude with summary.
2.1. Distributed systems
The definition properties are commonly used are there are several autonomous computational entities, each of which has its own local memory, and the entities communicate with each other by message passing. In this report, the computational entities are called nodes or computers. A computer accesses remote resource via the communication network. Because of the problem CPU overhead to process communication protocols and communication delays, access the remote resources more expensive than access to local resource. Figure 1 show that architecture of the distributed system.
Figure 1: Architecture of distributed system
The motivation to development of distributed system includes high performance, low cost computers and devices. Computing power is enormous when few powerful computers communicate and connected each other. Some system's performance/price ration is higher than a single supercomputer.
2.2. The distributed system communication network
There are two connected network can be used to construct distributed system, fully connected network and partially connected network.
In a fully connected network, all pair of computer is linked directly. Adding a new node to the system result in the increase of each node's degree is problem which result more file descriptors opened and increase complexity to implement the connection of each node. Thus capacity of each node to open file descriptors and ability to handle the new connection is limit the scalability of the system.
In a partially connected network, there are not all pair of computer is link directly each other. Only some pair of computer linked directly. Multi-access bus network, ring structured network, star structured network, and tree-structured network are partially connected network.
Some of the traditional distributed systems use a star with temporary connection as their network topology. The central node becomes the bottleneck and the establishment and termination of connection are a significant overhead of the system is a problem.
In a multi-access bus network, all the nodes in the system are connected to a shared bus link. The bus link becomes the system bottleneck.
In my report topologies that I used are the tree-structured network and the ring-structured network. I will show result of testing construction and recovery performance, and communication performance in the system.
Construction event: The construction event happens when a new node wants to join the system. The new node send s registration request to the master node and the master node computers the position for the new node.
Recovery event: The recovery event happens when a slave node fails and the neighboring nodes of the failed node report the failure to the master node. The master node computer a method for recovery the system structure.
My report focuses on the design and implementation of a distributed system that can detect and recovery a system failure automatically, can handle the dynamic increase of the system size, simplify the management of the system, and lower the communication delay and system overhead. An optimal method to construct and recovery the tree-structured and the ring-structured and to minimize the data transmission latency and network traffic are aimed in this report.
3. System structure and components
3.1. System Structure
As stated before, my work exams the construction and recovery performance, and the communication performance in the tree-structured network and the ring-structured network.
Complete n-ary tree: a complete n-ary tree is a tree in which all the nodes have at most n child nodes and all the level are full except for the bottom level and the bottom level is filled from left to right. We call the node with the maximum ID in the tree the last node. Figure 2 shows an example of a complete ternary (n-3) tree.
Ring: In a ring, all nodes are connected to each other in the shape of a closed loop, so each node is connected directly to two other node, one on either side of it.
In both the ring-structured network and tree-structured network, there are two types of nodes in the system, one master node and many slave nodes. In the tree-structured network, the master node of the system is the root of the tree. In the ring-structured network, the master node of the system is the head of the ring.
Figure 2: An example of a complete ternary tree
3.2. System Components
To provide reliable service to the end users, both the master node and slave nodes are responsible for resource management and system structure maintenance.
To monitor and manage the system status and the system resource, the master node gathers the working status and resource usage information of all slave nodes and sends instructional messages to slave nodes periodically. Each slave node is responsible for generating messages to report its working status and resource usage. These messages are called the resource management messages. Each slave node is also responsible for forwarding the message from its parent node to its child nodes and merging the resource management messages from its child nodes and forwarding the merged message to its parent node.
To maintain the system structure, the master node has data storage to store the system structure ad it is responsible for keeping the data storage up-to-date. It is also responsible for computing optimal methods to construct and recover the distributed system and give slave nodes instructions to maintain the given system structure. Each slave node is responsible for its registration to be added to the system, detecting and reporting it's neighboring node's failure to the master node, and dynamically adjusting its position in the system according to the instruction form the master node.
3.3. System Connection
To manage the system resource and maintain the system structure, there are two types of messages transmitted along the network links, the resource management messages and the system structure maintenance messages. Since both of these messages require reliable delivery services, TCP is used as the transport protocol. To transmit there two kinds of messages, there are two types of connections in the system: permanent connections and temporary connections.
The permanent connections are used to transmit the resource management messages. All nodes in the system are connected to their neighboring nodes by permanent connections. The reason I use the permanent connections to transmit the resource management messages is the overhead of TCP three-way handshake and four-way termination. The resource management messages are periodically exchanged between the neighboring nodes. Once a connection is established between the neighboring nodes, it is persistent. There is no need to establish a new connection for each resource management message. Using permanent connection to transmit the resource management message decreases the overhead of the establishing and closing of the network connection.
The temporary connections are used to maintain the system structure. For both the connection and recovery event, there is a group of system structure maintenance messages transmitted between the master node and a slave node. A temporary connection is for transmitting a given group of messages. After the transmitting a given group of message, it is not necessary for the temporary connection to exist. Using temporary connections to transfer the system structure maintenance messages can release the load burden of the master node.
4. System design and implementation
4.1. System Design
Figure 3: Software modules
As discussed in section 3, the system consists of a master node and up to hundreds of slave nodes. Nodes in the system have to communicate with other to accomplish the system goals. The messages transmitted in the system are in XML format. Network communication and XML document processing module are needed for the master node and slave nodes to accomplish their goals. The network communication module and XML document processing module were abstracted out and shared by both the master node and slave nodes. Figure 3 shows the software modules and their relationship in this system.
4.2. System Implementation
4.2.1. The XML Document Processing Module
XML stands for eXtensible Markup Language. XML documents are used in this system as the data storage in the tree-structured network and message formats transmitted in the system. The reason why I use XML documents in my work is that the markup tags in XML are used to describe and store data, and allow the application to store structure data in XML documents and extract data from XML documents.
The XML document processing module provides all the method needed by the master node and slave nodes to process the XML messages. The Xerces-C++ parser is used in our system as the XML parser. There are 2 C++ classes in the XML document processing module, the BuildMsg class and the ParseMsg class. The BuildMsg class provides methods to create and modify XML documents. The ParseMsg class provides methods to extract information from XML documents.
4.2.2. The Construction and Recovery of the Tree-Structure Network
In the tree-structured network, the network is constructed as a complete n-ary tree. Each node in the system has a position ID that is determined by the browse order by breadth first search while the root of the tree has position ID 1.
Construction: When the system starts, there is only the master node that is the root of the tree. New nodes are added to the system dynamically in the order of their registration requests. The master node is the manager of the system. When it starts, the master node initializes the system structure database, and opens a port to listen for requests from slave nodes.
The following illustrates the steps involved in constructing the tree-structured network.
The new node sends a registration request to the master node. In the registration request, the new node sends its network-based information.
The master node queries the structure database, and computes the parent node for the new node.
The master node sends a response message that contains the parent information to the requesting node.
The new node tries to connect to the parent node specified in the response from the master node.
Figure 4: The process of adding a new node to the system
If the new node successfully connects to its parent node, it sends an acknowledgement message to the master node.
After receiving the acknowledgement message from the new node, the master node adds the new node to the system structure database.
If the new node cannot connect to the assigned parent node, the master node will not add the new node to the system database. The new node will send another registration request to the master node.
Recovery: When a node in the tree-structured network system fails, all its neighbors will report the failure to the master node. To minimize the number of nodes involved in the recovery event , when the master node receives the first report message of the event, it chooses the node with the maximum ID that is alive to replace the position of the failure node (see section3). After the parent node reports the failure, it will close the connection to the master node while after a child node reported the failure, it will expect the new parent information from the master node. The process of recovering a tree-structured network can be illustrated as figure 5.
Figure 5: The process of recovery the system
The reason why all the neighbors instead of only a single neighbor will report this failure can be illustrated as follows:
Figure 6: A segment of a tree when node B fails
The neighbors of the failed node do not know each other. There is no connection between any pair of the failed node's neighbors, when a node reports this failure, it cannot notify other nodes. In figure 6, there are no connections between node A and node C, node A and node D, and node C and node D. When node A reports the failure of node B, it cannot notify node C and node D.
It is possible that two adjacent nodes fail simultaneously, so one node cannot rely on others to report the failure. In figure 6, if node A and node B fail at the same time, and node C and node D rely on node A to report the failure node B, node B's failure will not be reported to the master node.
All the child nodes of the failed node need to get instructions from the master. It is reasonable for a node to get instructions from the master node after it reports the failure. In figure 6, node C and node D get new parent information after they report the failure to the master node.
4.2.3. The Construction and Recovery of the Ring-Structure Network
Figure 7: The construction of the ring-structured network
Construction: As in the tree-structured network, when the system starts, there is only the master node that is the head of the ring. New nodes are added to the system dynamically according to the order of their registration requests. The steps involved in a construction event are the same as in the tree-structured network. In the ring-structured network, the new node is always added as the tail of the ring. Figure 7 shows the construction of the ring-structured network
Figure 8: The recovery of the ring-structured network
Recovery: During the execution of the system, when a node in the system fails, both its previous node and next node will report this failure to the master node. Since it is possible that during the recovery process the failed node's previous node and next node may fail, the master node will find two nearest neighbors of the failed node that are alive and link them together. Figure 8 shows the recovery of the ring-structured network.
4.2.4. The Program Design of the Master Node
The master node is a concurrent server that can accomplish multiple tasks simultaneously. The concurrence of the master node is implemented using POSIX Threads. There are four types of threads in the master node: the main thread, the resource management thread, the system structure management thread, and the connection handling threads.
Main thread: It is the entry point of the program. It is responsible for creating working threads to handle the different functions of the master node.
Resource management thread: It is used to accomplish the resource management tasks of the master node and sends the instructional messages to its child nodes.
System structure management thread: It is the most important part in the system. It is responsible for computing an optimal method to construct and recover the system so that the system has good scalability and reliability.
Since each node in the system may have more than one neighbor, in case of one node failure, all of its neighbors need to report the failure to the master node. In the process of handling a recovery event, it is possible for new requests, including new registration requests and new failure report requests, to be received. To clearly describe these situations, we define the following terminologies.
Normal State: The system is running normally.
In process State: The master node is handling a recovery event.
Process Time: The beginning time when the master node starts to handle a new recovery event.
Maximum Delay: The longest time that the system can be in "In process state".
Timeout State: The system stays in the In process State longer than the Maximum Delay
For example, here is a description of the system state transformation from the normal state. In the normal state, the master node is expecting both the registration and report request. After handling the registration request, the system stays in the normal state. If the incoming request is a report request and the failed node only has one neighboring node, after handling this request, the system stays in the normal state. If the incoming request is a report request and the failed node has more than one neighboring node, after handling this request, the system changes to the in process sate. In the in process state, the master node only accepts report requests that report the current failure. If the incoming report request is the last request to report the current failure, after handling this request, the system changes back to the normal state. If the incoming request is not the last report request to report the current failure, the system stays in the in process state. If the system stays in the in process state longer than the maximum delay and not all the neighbors of the current failed node report the current failure, the system will change to the timeout state. In the time out state, the master node will compute a method to wrap up the current recovery event.
The steps involved in wrapping up the current recovery event are as follows:
The master node detects if the neighboring node that should have but has not reported the current failure is still active.
If it is active, the master node will send an instructional message to tell it to connect the new parent node.
Otherwise, a node in the system will be chosen to replace the position of the unreported node. In the tree-structured network, the node with the maximum position ID that is still active will be chosen while in the ring-structured network, the nearest neighbor node that is still active will be chosen to replace the unreported node.
Connection handling thread: The main thread creates a connection handling thread for each directly connected child node. It is responsible for handling the communication with the child node, detecting the status of the child nodes. Since the resource management messages are transmitted along the permanent TCP connections periodically, while the peer is down, the sender will get a failure signal while it is trying to send message to the receiver.
4.2.5. The Program Design of the Slave Node
Similar to the master node, the slave nodes are concurrent servers. The concurrence of the slave node is implemented using POSIX Threads. There are five kinds of threads in the slave node: the main thread, the resource management thread, the client thread, connection handling threads, and the message merge handling thread.
Main thread: The main thread is the entry point of the program. It is responsible for creating working threads to handle the different tasks of the slave node
Resource management thread: It is responsible for generating and reporting its working status and the resource usage information.
Client thread: It is used to handle the communication with the parent node. It is responsible for detecting and reporting the failure of the parent node, and connecting to the new parent node according the instructions from the master node.
Connection-handling thread: It is used to handle the communication with the child node. It is responsible for detecting and reporting the failure of the child node.
Message-merging thread: The slave node uses this thread to merge the resource management messages from all the child nodes and its own resource management message.
5. Test environment and result
5.1. Test Environment
My report were conducted on the PowerPC G4 cluster in the Scalable Computing Laboratory, Ames Laboratory of U.S. Department of Energy. The G4 Cluster is a 32 node "Beowulf" style cluster computer consisting of 16 single processor G4s with 512 MB RAM and 16 dual processor G4s with 1GB RAM, all running Debian Linux. They use Ethernet and Myrinet for network access.
The master node is the manager of the system. It requires more system resources than a slave node. To test the construction and the recovery performance of the system, each slave node starts at a different time and runs for a random time period. This is implemented in a script file that is submitted to the batch system.
5.2. Test Result
We tested two aspects of system performance. First, we tested the time used for a new node to be added to the system and the time used to recover to the given structure in case a node fails in the system. Second, we tested the Round Trip Time (RTT) for messages transmitted in the system. The time unit used in the following results is milliseconds.
Figure 9 shows the average time used to add a new node to the system with different network topologies and different system sizes. We can see that there is no significant difference in the time used to add a node to system. The reason is that the steps involved in adding a new node to the system are fixed. The system structure management thread of the master node is an iterative server; it handles the construction and recovery event in a sequential way. Only after it finishes handling a registration request, will it handle a new registration or report request.
Figure 9: The construction performance
Figure 10: The recovery performance
Figure 10 shows that the average time used to recover the system when there is a node failure in the system. We can see that the time used for a recovery event is related to the network topologies. The time used grows with the degree of the node. The reason is that when a node fails, all its neighbors have to report this failure to the master node. The more neighbors one node has, the more report requests the master node has to process. The time used to process a recovery event in a 5-ary tree-structured network is much longer than that used in a ring-structured network. In the tree-structured network, the time used also grows with the number of slave nodes in the system. When a node fails in a tree-structured network, the master node has to find the active node with the maximum position ID to replace the failed node. The more nodes one system has the more complex it is to find a node to replace the failed node.
Figure 11: The maximum RTT with zero payload
Figure 11 and Figure 12 show the maximum RTT for zero payload and XML payload messages (252 bytes) transmitted with different network topologies and different system sizes. From these figures we can see, the RTT for XML payload (252 bytes) messages is much higher than that of zero payload messages. This is because when receiving an XML message, each node has to process the XML message and processing an XML message is a time consuming task. Another point we can see from these figures is that the maximum RTT in a ring-structured
Figure 12: The maximum RTT with XML payload
network grows significantly with the system size. The reason is that the diameter of the system grows linearly with the system size. The diameter of an n-ary tree with m nodes is log m(n 1) + 1 1 n . Thus in an n-ary treestructured network, the maximum RTT grows much more slowly than in a ring-structured network. In our tests, when the system has 100 nodes, the longest RTT for XML payload (252 bytes) messages in the 5-ary complete tree is 10% of that in the ring-structured network. The growth rate decreases as the degree of the tree increases.
This paper compared the network topologies used to construct distributed systems, and presented test results for systems using tree-structured networks and ring-structured networks as the network topologies. From the discussions in the previous sections and the test results in section 5, I draw the following conclusions:
It is easier to maintain the system structure in a ring-structured network. However, the longest RTT grows linearly with the system size. For large systems this can be a significant limitation and thus limits the overall scalability of the system. Ring-structured networks are suitable for small to medium sized systems with small messages.
The scalability of a tree-structured network system is related to the node's capacity and the height of the tree. We can carefully choose an appropriate degree n to construct a complete n-ary tree-structured network such that the height of the tree balances the growth in the tree size versus the resource requirements for a node to communicate with additional child nodes. Tree-structured network is superior to the ring-structured network when the system size and the messages transmitted in the system have large payloads.
6.2. Future Work
In our system, there is only one master node. It is responsible for managing the system structure. If the master node fails, all the information about the system structure will be lost and there will be no node left to manage the system. One of our future tasks is to construct a backup master node that will backup the system structure information continuously. In case that the master node fails, the backup master node can be used to assume control and manage the system structure.