Fault Tolerance And Recovery Techniques Supercomputers Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Supercomputing systems come in the form of large numbers of systems linked together into a computing cluster just like any distributed system, can have large numbers of independent hardware components cooperating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these various fault-tolerance techniques.


Computational tasks continue to become more complex and require more amount of processing time. At the same time, high performance computer systems are composed of number of failure-prone components. Additionally, when an application fails, the cost is high since because more computation is lost. It is imperative that both distributed applications and parallel systems support mechanisms for fault-tolerance to ensure that large-scale environments are usable. It focus specifically on cluster systems and the applications that run in these environments. Cluster systems are composed of a large number of identical, centrally managed computation nodes that are linked by one or more network infrastructures such as Ethernet or Myrinet. Nodes employ software such as MPI to facilitate their integration into a larger, unified, cluster system. Often, some nodes are dedicated to management, user interaction, and storage. In this environment, there are many possibilities for failure. Any component in any compute node could fail. It is not limited only to the processor, disk, memory, or network interface on the node. A hardware or software failure on a management node could affect the entire system and synchronization data is lost. Failures external to individual nodes are also possible. Many failures could remove a large number of nodes from the system, such as an air conditioning failure or a network switch failure. Any of these failures will cause the applications running on the affected nodes to crash or produce incorrect results. An overview of the theoretic models used to describe the types of faults are explained in Section II. There are two fundamental classes of faults that can occur in cluster systems. First, a centralized component such as a storage node or management software can fail as due to a software bug or a hardware fault. These centralized components are typically few in number, spanning only a small percentage of cluster nodes. Protecting against these failures typically involves redundancy. Critical functionality is replicated over several nodes such that if one node fails, a backup can step in, to take over the responsibilities of the primary. The techniques for handling failures of centralized components are discussed in Section III. The second class of failure is a crash or hang of software on one of the many computation nodes in the cluster. This is due to a software bug in an application, a hardware fault on the node, or a problem in the operating system local to the node. All of these failures gives the same end result: the application running on the node can no longer function properly, but the other nodes participating in the same computation can continue unaffected excepting that they will no longer receive output from the failed node. The standard technique for handling application failures is to periodically checkpoint the computational state of the application such that it can be restored in the event of a system failure. Other nodes participating in the computation may need to be rolled back to earlier checkpoints in order to provide consistency with the recovery state of the failed node. Rollback recovery techniques and other methods of protecting against failures and faults on compute nodes are provided in Section IV.


There are numerous ways which leads the computing systems and applications to fail. These failures can be categorized by abstract models that explain how a system will behave in the presence of faults. A fault tolerance technique assumes a certain model of failure when discussing about the types of faults it can handle. We consider the two most common failure models and a relative new model that is accurate and complex representation of its work.

Byzantine Faults

The Byzantine fault model represents model of failure. This model allows nodes that are considered as failed nodes to continue interacting with the other nodes of the system, which can lead to generate inconsistent output , and these failed nodes produces more malicious output which is not consistent. The correct operating nodes cannot detect the failure by itself; even if they can , they do not know which nodes in particular have failed and what should be done in order to overcome it. This model explains the random system failures and also the malicious attacks made by an intruder. It has been proven that no guarantees can be made concerning correct operation of a system of 3m + 1 nodes if more than m nodes are experiencing Byzantine failures [2].

Fail-stop Faults

The fail-stop fault model is much simpler than the Byzantine model. This model allows any node to fail at any time, during the failure it stops generating output and interacting with the other nodes of the system .Therefore, all other nodes automatically can know that the node has failed. This model represents common modes of failure like a system hang or crash, but can not handle more subtle failures like random memory corruption.

Fail-stutter Faults

The Byzantine fault model is extremely broad, and is difficult to analyze. The fail-stop model is commonly used when fault-tolerance techniques are considered, but it is considered as over all simple since it has failed to represent many other types of real-world failures. The fail-stutter fault model is just an attempt to establish a middle ground model between these two models. The fail-stutter model is an extension of the fail-stop model. It helps to maintain the tractability or a record of that model in order to expand the set of real-world faults that it includes. The fail-stutter model includes all properties of the fail-stop model, but it also allows for performance faults. A performance fault is a situation in which the performance of a component is low, but continues to operate correctly by considering its output. This extension allows the model to include faults like poor latency performance of a network switch due to high traffic load.

Fault-tolerance of centralized components

Cluster systems rely on many centralized components to function properly. Management nodes handle job scheduling and node monitoring responsibilities. Storage nodes provide access to high-capacity disk arrays. Head nodes allow users to interact with the system. It is feasible to dedicate extra attention to these resources and possibly allocate additional hardware to ensure they are robust in the event of system failures due to the critical importance of these components and the small fraction of the total system they compose.


The most common way of providing fault-tolerance for centralized system components is to use replication. It provides the use of redundant resources to improve reliability, fault-tolerance, or performance This can be of two forms. In active replication, a secondary machine receives a copy of all inputs to the primary node and generates a similar system state by running its own copy of all necessary software. In addition, the backup node monitors the primary node to check the incorrect behavior. If any unexpected behavior is observed like system crash, the backup node promotes itself to primary status and takes charge over the critical functionality for the system. Since its system state is already similar to that of the primary, this change requires a little amount of time. This type of replication is infeasible for compute nodes since duplicating the number of these nodes would leads to high cost of the system. All replicas are provided with input messages. When the final output is obtained, all replicas compare their results using a Byzantine algorithm in order to find out the correct output . If more than one node generates incorrect output beyond a threshold number of times, it can be considered as faulty and ignored until they can be repaired by maintenance procedures in order to be back to its original position. It is capable of handling Byzantine faults; where as the earlier implementation of active replication can only tolerate faults in the fail-stop model. In passive replication, a “cold spare” machine is maintained as a backup system to the primary. This system will continue in an idle state, but it will be provided a copy of necessary system software used by the primary. If the primary machine fails, the cold spare takes control over the primary. This may result in interruption of service. Unless and until an extra feature checkpointing is added, the internal state cannot be easily recovered. Instead of duplicating the entire machine it is enough to replicate components. By carefully monitoring the correctness of each component, the component can be automatically disabled if there occurs any failure and can be replaced with the backup in order to maintain the performance level of the system.

Reliable Communication

In order to have reliable communication and to guarantee that all replicas receive the same input we need to use any multicast protocol. One of the multicast protocol uses a token-based system. A virtual token is passed from node to node, and only the node having the token is allowed to send messages where as other nodes are not allowed to generate any kind of messages just to avoid generation of multiple messages at the same time. Messages are sent from node -to â€"node in order to avoid the confusions regarding the order of messages. Sequence numbers are provided to maintain the order of messages even if multiple tokens are used. There are techniques to maintain the sequence of order, first technique, all messages are sent to the sequencer, which then multicasts the message to all the other intended recipients. The sequencer acts as an intermediate node which maintains the order of all messages. In the second technique, a sender node multicasts a message directly to all recipients, and also sends the message to the sequencer. The sequencer then multicasts a second message to all recipients of the first message indicating the order in which the first message should be received. Reliability in this protocol is provided using a Byzantine agreement protocol to guarantee that all recipients receive the same message.


In order to maintain the system properly and to activate the backup during the component failure, we need to monitor the system which provide the common approaches ,basically monitoring is considered as a simple heartbeat system. A monitor process listens for periodic messages and indicates that the component continues to function correctly incase if it fails to receive a message from a component within a certain time, it considers the component as failed. This kind of monitoring is necessary for detecting failures in a fail-stop model. More subtle faults like erroneous computation are not detected . Messages that are not sent to a single monitor, but are gossiped throughout the system or are forwarded up to a hierarchy of nodes. Byzantine consensus among component replicas is another form of monitoring. The replicas vote on the correct output based on the provided inputs. This method is used for the detection of Byzantine faults, resulting in a much powerful monitor.


The real value of a cluster system lies in the execution of large-scale parallel applications that would require intractable amounts of time to run in a serial environment. It uses the natural redundancy existing in networks of workstations to offer a high level of fault tolerance. Fault management is transparent to the supported parallel applications.

A. Checkpointing and Rollback Recovery

Checkpoint is probably the most typical approach to tolerate failures in parallel and distributed systems. By having checkpoints into stable storage periodically, the check-point approach is able to tolerate the failure of the whole system. Checkpoint could be performed either from a system level or from an application level. Each process is also independently checkpointed at times during its execution. Each checkpoint remains in a stable storage until no longer needed for recovery from any possible failure of the system. The rollback-recovery part of the protocol combines the low failure-free overhead of optimistic rollback-recovery with the advantages of pessimistic rollback-recovery, such as fast output commit, limited rollback, and failure-containment. The process replication part of the protocol features a new multicast protocol designed specifically to support process replication.Processes are assumed to have access to some kind of stable storage that will survive even in the event that the process fails. Periodically during the execution of an application, the system . Checkpointing and rollback recovery techniques generally assume a fail-stop model of faults. To recover from a process, the recovery procedure must ensure that the internal state of the recovered process conforms to the observed state of the system before the failure. This is obtained by observing the most recent set of consistent checkpoints and restoring the system to the state recorded in this set. Rollback recovery techniques can be subdivided into two broad categories: checkpoint-based, and log-based. Checkpoint-based rollback recovery is popular of fault tolerance techniques, which is of having a simple idea that is the system state with error-free portions of the system execution is observed and saved, if any error occurs, use the saved state to rollback the system in order get back to consistent state. After recovery the system need not execute from the first state but rather it can start its execution from that recovered state. Independent Checkpoint explains the method where the components of a system have checkpoints without synchronizing or depending with each other. The Coordinated Checkpoint explains the method where the components of a system take checkpoints

after synchronizing with each other.

B. Virtual Processors

A fault-tolerant computer system has primary and backup computers. Primary and backup virtual machines running on the computers are controlled by the respective virtual machine monitors. The virtual machines exhibit user-mode instructions. Each system has a recovery register which produces an interrupt. When a failure occurs, the backup virtual processor re-connects in order to reissues an operations so that it will be free of failures.Virtual processors allow parallel application execution to be subdivided across a larger number of processors. In these systems, each virtual processor is mapped to a physical processor, which handles one or more virtual processors. This technology has applications to load balancing and grid computing, and also to fault tolerance. The key to the fault tolerance application is the migration feature which means if a computation node should suffer impaired performance, the virtual processors can be

migrated to other nodes that remain fully operational.


This survey helped to have a clear understanding on different types of faults, fault-tolerance methods and techniques that can be implemented in large scale cluster computing systems. This helped to understand and to provide protection for components or nodes and also for hardware and software infrastructure. Replication provides the redundant resources to improve reliability, fault-tolerance and high performance of the system. Checkpoints and rollback are used to avoid faults. The variety of protocols has been developed to determine when processes should record checkpoints and how to restore the application state.