Comparison Based System Level Fault Diagnosis Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Mobile ad hoc networks (MANETs) are increasingly becoming the basis for civilian as well as military communication with a wide range of applications, and therefore, there is an urgent need to diagnose, manage, and deal with faults in such networks. System-level fault diagnosis, a one step of fault tolerant, is focused on detecting and identifying faulty nodes in MANET. Many approaches were proposed to achieve this goal. The Comparison approach, in particular, is believed to be one of the most effective and promising. In this approach, all nodes in the network are assigned the same task. Each node performs the task and reports the result. By comparing the results from the different nodes, faulty nodes are detected. Although there are many models that have been proposed based on this approach, Chessa and Santi model seems to be the most relevant as it is the only model for ad-hoc networks which considered the communication shared nature of MANET. In this paper, we provide a review of the known protocols and algorithms in the literature, which are based on Chessa and Santi model. We also provide a comprehensive comparison of these protocols and illustrate their limitations. The diagnosis of dynamic faults in MANET is well known in the literature to be a challenge due to MANET characteristics. In future work, we are going to propose and implement a new protocol that able to diagnose the dynamic faults in MANET using OMNeT++ simulator.

Keywords: MANET, System-level fault diagnosis, the Comparison approach, OMNeT++.


Day by day, computer networks are becoming more complex and the need to manage them is a necessity. Network management, which is concerned with managing the network component in order to deliver the expected services from the network [24] [12] [25] [26] [27] [29], have multi-tasks to do. There are five functional areas defined by TMN standard: Fault, Configuration, Account, Performance and Security Management, which are known by their acronym, FCAPS. Every management' area has a list of functions and tasks as shown in table 1, see [16] [17] [18] [19].

Table 1: FCAPS Model functions

Management Area


Fault Management

Detect, isolate, and correct malfunctions in a telecommunications network.

Configuration Management

Maintain an accurate inventory in a changing environment and consistency of operating parameters.

Accounting Management

Measure network usage and compute charges for that usage.

Performance Management

Concerned with gathering statistics, measuring utilization of resources, tuning, etc.

Security Management

Control access to services, network infrastructure and the management system itself.

MANET is a special type of ad-hoc networks where the nodes of a network are mobile and therefor the topology of the network changes dynamically [20]. MANETs were extensively used for many critical applications in different environments such as military environment, emergency operations, and civilian environment. MANET, due to some of its characteristics as shown in Table 2, is exposed to many challenges such as security issues, performance decade, and faults occurring [28].

Table 2: MANET's Characteristics



Dynamic topology

The topology of a network may change as its nodes are moving in different directions with varying speeds.

Constrained Resources

The nodes' resources (, battery power, CPU, and storages) have limitations.

Constraint and varying bandwidth

The bandwidth is varying and usually low due to wireless link nature and the effect of noise, interference and congestion.

Limited physical security

Due to their wireless links, MANETs are exposed to physical layer attacks.

The nodes mobility, dynamic topology, resources limitations, and bandwidth scarce increase the probability of faults occurring in MANET. Therefore, the dependability of communication, i.e., the ability of a network to deliver the required service successfully, will be affected [13] [30]. So, needless to say, there is an urgent need for efficient fault management protocols to diagnose, deal, and manage faults as they occur in a MANET.

Fault management is considered as one of the most important component in FCAPS model as shown in Table 1. In general, fault management has many functions such as fault detection, fault isolation, fault correction, error handling, error logging, alarm handling, alarm filtering, and alarm generation. One of the main functions of fault management is fault diagnosis, which is responsible of detecting, identifying, and localizing faults in MANET [31].

A fault is any defect in a system which interrupts its communication or degrades its performance [32] [13]. Faults usually lead to errors where a system component gives an incorrect output. Error also may lead to failure when a component no longer meets its specification, which could magnify the fault or led to yet another fault. This flow of events is depicted in Figure 1. In general, fault management pays attention for faults because they are the source of other problems in system.

Figure 1: fault, error, failure flow of events

Faults can be classified based on three criteria, as shown in Table 3. On the basis of the time duration of faults, a fault is classified as transient or permanent. Transient faults disappear eventually without any intervention. Permanent faults, however, need external intervention to remove them. The second criterion is the behaviour of the failed component. If a node cannot communicate with other nodes, the fault is called hard. If a node can communicate but its behaviour is altered, the fault is called soft. The Final criterion is based on the occurring time of the fault. A fault is considered static if it is allowed to occur during diagnosis sessions; otherwise it is considered as a dynamic fault [23].

Table 3: Fault classification

Fault class

Fault type

Based on duration

Transient fault

Permanent fault

Based on how a failed component behaves

Hard fault

Soft fault

Occurring during the diagnosis session

Static fault

Dynamic fault

Comparison-based system-level fault diagnosis

As we mention above, due to the nature and characteristics of MANAT, the probability of faults-occurring is a real problem as it could diminish the dependability of MANET significantly. A fault tolerant is a method used to enable the system to deliver the services even while faults are there [13] [30], System-level fault diagnosis, a one step in fault tolerance that was introduced by Preparata, Metze and Chien [8], aims to automatically detect and identify faulty and faulty-free nodes. Once the system detects and identifies faulty nodes, it isolates them and minimizes their effects. It is clear that for MANET a system-level fault diagnosis is needed to build automatic fault diagnosis and dependable systems [10] [22].

Many models were proposed to solve the system-level fault diagnosis problem, the comparison model is considered as the most effective one, "since it relies on more realistic assumptions about the system under consideration" [21]. In a comparison model, in general, the same task is assigned to each node, and the output results from the nodes are compared. The outcome of comparison tests is used to identify the faulty nodes.

The earliest known model is called the asymmetric comparison model and was proposed by Malek [1] in 1980. A year later, Hikimi and Chwa [2] proposed the symmetric comparison model. The main difference between asymmetric and symmetric models is that in computing the result of comparing test for two faulty nodes; the asymmetric model reports a mismatching result while the symmetric model reports a unreliable result. In both cases, the reported results are presented by 0 or 1 as shown in Table 4. Both models assume that there is a central observer to diagnosis the system as a whole.

Table 4: The Symmetric vs. Asymmetric comparison model

Asymmetric comparison

Symmetric comparison

Faulty-free tests faulty-free



Faulty-free tests faulty



Faulty tests faulty-free

X (0 or 1)

X (0 or 1)

Faulty tests faulty


X (0 or 1)

Meang and Malik [7] extended the asymmetric model to what is now known as the MM model, where it is assumed that there is a distinct comparator node which compares the results produced by the nodes under comparison. The comparison is distributed and done by the nodes themselves, however, the diagnosis are done by a central node. In [9], The MM model was generalized to gMM model, which assumes that the comparator node is actually one of the nodes being compared. In [14], Broadcast comparison model BCM was proposed by Blough and Brown, BCM is a distributed diagnosis model where each node acts as a comparator of the other nodes and sends the comparison results to all nodes using a weak reliable broadcast protocol.

The mentioned above models were designed for wired networks where the communication paradigm is one-to-one. Recently, Chessa and Santi [3] proposed a comparison-based diagnosis model for wireless ad-hoc networks. The proposed model exploits the communication shared nature of ad-hoc networks. In the next section we present a detailed analysis of Chessa and Santi model.

Chessa and Santi model

In Ad-hoc networks, the communication paradigm is one-to-many, i.e., if node A transmits then all other nodes in its transmission range, namely, the nodes B, C, and D, will receive the message, see Figure 2. The previous models, reviewed in Section 2 above, did not take that into account, which is precisely why they are not suitable for ad-hoc networks as they were designed for network with one-to-one communication paradigm, which is clearly not the case in Ad-hoc networks.

Figure 2: communication paradigm in Ad-hoc networks

To the best of our knowledge Chessa and Santi model is the only comparison model for Ad-hoc networks in the literature that assumes the one-to-many communication paradigm. In this model, a system consists of N nodes that communicate with each other using a 1-hop broadcast. At any given time t, the communication can be represented by an undirected graph G (V, L (t)), where the vertices, V, are always the nodes, and L is the set of logical links between the nodes at time t. If there is a logical link between two nodes then they are considered neighbours and are connected by an edge at that time. Each node in the system can be either faulty or faulty-free. The judgment of the node status is based on the invalidation rule of the gMM model that is shown in Table 5.

Table 5: The invalidation rule of gMM model


Nodes under comparison

Comparison outcome


None is faulty



One is faulty



Both are faulty




X (0or1)

In order to diagnose the system, a fault-free node X sends a test request to its neighbours and starts the timer. The timeout should be chosen carefully to give all neighbours enough time to respond. Once the nodes responded and the timeout elapsed, the diagnosis process is accomplished using the invalidation rule in Table 5. In particular, none responding nodes are considered faulty.

The model takes advantage of the communication shared nature of Ad-hoc networks in two situations; the first one, when it sends the test request to all neighbours of the testing node; and the second one, when the neighbours respond; they send the test result to their neighbours as well.

Ad-hoc comparison-based system-level diagnosis protocols

In this section, we are going to describe, in detail, all protocols which were proposed based on Chessa and Santi model as shown in Figure 3.

Figure 3: Chessa and Santi model 'protocols

Fixed topology comparison-based diagnosis protocol

Chessa and Santi [3] proposed a diagnosis protocol for fixed topology Ad-hoc networks based on their model. This protocol has three stages: generation, testing, and dissemination.

The generation stage: in this stage, the protocol does many things; first, a faulty-free node X generates a test, sends a test request to its neighbours, computes the task results, and finally, starts the timer. A faulty-free node may start a diagnosis protocol either periodically or when any diagnosis message (test request, test response, dissemination message, and timeout message) is received from other nodes.

The testing stage. A testing node X diagnoses the state of its neighbours based on their test's responses and timeout messages. Then it disseminates its local diagnosis result. At the end of this stage, the node should have the status of its neighbours.

The dissemination stage. A testing node X receives the dissemination messages from other faulty-free nodes and completes its diagnosis. At the end of the dissemination stage, the node X should know the status of all nodes in the system. As the mode of the network are mobile, nodes which are far apart at one time might end up neighbours at another time.

Notice that the network topology is assumed to be fixed during the diagnosis session. This, however, does not mean the nodes are static, it actually just requires that the nodes do not change their neighbours during the diagnosis session. To overcome this rather strong assumption, Chessa and Santi proposed to minimize the diagnosis latency by using flooding in the protocol to propagate the diagnosis messages.

Dynamic Distributed Comparison-Based Protocol

In [4], a comparison-based diagnosis protocol was proposed based on Chessa and Santi model. This protocol has two phases, namely, the testing and dissemination phases. It is called dynamic just because it uses a dynamically built spanning tree in the dissemination phase, and has nothing to do with its ability to deal with dynamic faults or dynamic MANET.

Testing phase: Testing may start periodically by any node or as a response to receiving a diagnosis message from another node. In particular, once a test request is received by a node, it responds by generating its own test request. The main contribution in this phase is that a node responds to at most σ+1 tests only (where σ <=KG-1, where KG is the minimum connectivity of graph), so a node informs at least one faulty-free node about its status. In fact, this contribution leads to decreasing the number of broadcast response messages. At the end of this phase, all responded neighbours will be diagnosed based on the invalidation rule of gMM, and none responding neighbours will be diagnosed as suspect.

Dissemination phase: A spanning tree, which consists of all faulty-free nodes in the network, is constructed using a spanning tree message. After construction is complete, the leaves in this tree start sending their local diagnosis to their parents and so on until all local diagnosis reach the root of the tree. At the end of this phase, all faulty-free nodes will know the status of all nodes in the networks.

Observe that, in this protocol, the number of response messages is small as every node sends messages only to its parent in the spanning tree, leaving out all other neighbours in the network. In fact, using flooding to response messages leads to redundant broadcast messages. Also, only faulty-free nodes participate in the dissemination phase. Furthermore, using a spanning tree to send messages decreases the number of broadcasts during the diagnosis session, at the same time; however, it increases the diagnosis latency as building a spanning tree adds more overhead.

Adaptive Distributed Self-Diagnosis Protocol

In [5], an adaptive comparison-based diagnosis protocol was proposed. This protocol is called adaptive because it uses a spanning tree which is constructed in advance with the MANET. It consists of five phases: maintaining phase, testing phase, gathering phase, repairing phase, and disseminating phase. The spanning tree is used to achieve two goals. It maintains the connections in a MANET while the nodes are moving or in case a problem arises affecting the communications in a MANET. At the same time, the tree is used to propagate the local diagnosis of nodes.

Maintaining phase: During this phase, the protocol maintains the connections of a spanning tree of MANET's nodes. In order to keep the spanning tree connected the nodes periodically check their connection to the spanning tree. If a node discovers that it is no longer connected to the tree, then it connects to one of its neighbours in the spanning tree by sending a broadcast message to its neighbours. If any new node joins the MANET then it will connect to the spanning tree, by sending a broadcast message to its neighbours. This phase terminate once the diagnosis session started. In particular, no new node is allowed to connect during the diagnosis session.

Testing phase: It starts either periodically or if an altered behaviour is detected in MANET. Then a node sends a test message to its neighbours, start a timer, and wait for the response messages from the neighbouring nodes. Actually, nodes should reply to all test requests.

Gathering phase: Once a response message has arrived, a node diagnoses the responding node as faulty or faulty-free based on the invalidation rules of gMM model. When a time out message is received, a node terminates the gathering phase and diagnoses the nodes that did not respond as faulty. After this phase, each node will know the status of a subset of its neighbours.

Repairing phase: Since all test requests should be replied to by nodes, a node should know the status of its neighbours. In this phase, the spanning tree is reconnected based on the status of the nodes where just the faulty-free nodes are considered as nodes now. In particular, faulty nodes are no longer nodes in the tree.

Dissemination phase: The leaves nodes start sending their diagnosis view to their parents. Once the parent collects its children diagnosis view, it sends the aggregated diagnosis view to its parent and so on.

Observe that, Adaptive-CBD uses a prebuilt spanning tree; the spanning tree used to achieve firstly, maintain the connected of MANET. And secondly, propagate the local diagnosis of nodes. However, the tree is repaired often and it could be modified in response to the existence of faulty nodes. This decreases the diagnosis latency time as the numbers of broadcasts is decreased.

Cluster-Based System-Level Fault Diagnosis

In [6], Li used a clustering idea to decrease the number of messages exchanged in diagnosing a MANET, while at the same time can deal with dynamic faults as clusters are more stable than nodes. In this work, the authors suppose that nodes in MANET are grouped into clusters as shown in figure 4, each cluster has a cluster-head, and all nodes in the transmission range of a cluster-head (i.e.,1-hop neighbours) belong to the same cluster. Any node belongs to more than one cluster is called a gateway. Clusters-heads play a main role in the diagnosis process, and therefore they should be faulty-free. To this end, clusters-heads are diagnosed based on a fixed-topology comparison-based diagnosis protocol. This phase is done based on Cluster head Gateway Switch Routing CGSR algorithm [33].

Figure 4: An example of clustering in a MANET

In this algorithm, two phases used to diagnose faults in MANET; Testing phase, and Dissemination phase.

Testing phase: Any node needs the diagnosis state of others sends a diagnosis triggered request to its neighbours, when a cluster-head receives this request it sends a test request to its members. When a gateway receives a test request, it sends a diagnosis triggered request to the clusters-heads of the other clusters where it belongs. Once a cluster-head receives a triggered request, it diagnoses its members. Every clusters-head, after this phase, should maintain the diagnosis information table, which contains two attributes: the host identity HID, and the host state STATE. HID represents the node id, while STATE contains the diagnosis result for the node. STATE may have one of four values: faulty-free, soft-fault, hard-fault, or uncertain. At the end of this phase, every clusters-head should have a diagnosis information table about its members. Furthermore, if its diagnosis information table contains nodes whose status are uncertain, it will send a new test request.

Dissemination phase: each cluster-head propagates its results to its cluster-head neighbours via the gateways and receives other clusters' results. Each cluster-head should have the state of all nodes in the MANET; any cluster member may ask a cluster-head about the states of other nodes.

Observe that, the test request messages and dissemination messages are generated and propagated only by cluster-heads. Without a doubt, this will lead to a decrease in the number of broadcasts during a diagnosis session. Another advantage of this protocol is its ability to diagnose nodes that change their status during the first diagnosis session. This is done by starting a new diagnosis session for the nodes with the uncertain state. Using CGSR algorithm to cluster and group nodes causes some drawbacks in Cluster-CBD such as cluster-head selection, cluster-head bottleneck.

Discussion and limitation of previous works

Chessa and Santi proposed a comparison based fault diagnosis model that takes advantage of a MANET's communication nature to conduct the comparing process with least possible communication overhead and time.

Based on their model, they develop a fixed comparison based diagnosis protocol which we called Fixed-CBD. Here they focused on minimizing the diagnosis latency as much as possibly in order to overcome the assumed static topology of a MANET during the diagnosis time. To this end, Fixed-CBD used the flooding-based in testing, gathering and disseminating phases. Using the flooding to exchange the diagnosis messages results in a minimizing the diagnosis latency and making this protocol more time-efficient. On the other hand, flooding produces redundant broadcasts and increases the number of exchanged messages which may affect the energy efficiency of MANET's nodes.

The Dynamic-CBD protocol was proposed based on Chessa and Santi model, the main objective of this protocol is to minimize the number of exchanged messages as much as possible in order to make this protocol more energy-efficient. To this end, the protocol uses a spanning tree, which contains only faulty free nodes in a MANET, to disseminate the diagnosis results among faulty-free nodes. Building the spanning tree dynamically and before each dissemination phase, however, is time consuming.

To overcome this, an Adaptive-CBD protocol was proposed. This protocol uses an adaptive spanning tree (i.e., a prebuilt spanning tree) in order to disseminate the diagnosis messages. Using an adaptive spanning tree make this protocol time-efficient, compared to the Dynamic-CBD protocol, where in adaptive-CBD the spanning tree is built before the diagnosis session and, during diagnosis session, only repairing for this spanning tree might be done. In Adaptive-CBD, the number of broadcasts (i.e., the number of exchanged messages) is increased where a node has no time-limit on the number of responses for a test request.

Recently, Cluster-CBD protocol, based on Chessa and Santi model, was proposed. It uses clusters to diagnose and disseminates the diagnosis results. Here, cluster-heads initiate test requests and disseminate the diagnosis results among themselves through gateways. The idea of using clusters is not new, clusters were used before in MANET for many purposes such as management and routing [12][33], but indeed, MANET environment should be taken into consideration when clustering the nodes to make the clustering process more useful and less overhead.

As a result of our comparison of the mentioned above protocols we have summarized our findings in Table 5. It shows a comparison among Fixed-CBD, Dynamic-CBD, Adaptive-CBD and cluster-CBD algorithms in terms of numbers of messages and diagnosis latency. In Table 5, n represents the number of nodes, dmax is the maximum of, kG is the connectivity graph, nc is the number of cluster-heads, m is the number of gateways, l is the total number of cluster-members, ζ is the number of the diagnosis periods triggered due to the movements of the active mobile hosts, and η is the number of the active mobile hosts that have triggered the above ζ diagnosis periods. ΔG is the graph diameter, Tgen an upper bound to the elapsed time between the reception of the first diagnostic message and the generation of the test request, Tf is an upper bound to the time needed to propagate a dissemination message, Tout is the timeout time, dST is the depth of the spanning tree used in the disseminating phase. ΔGC is the diameter of the graph which consists of cluster-heads.

Table 5: Comparison among Fixed-CBD, Dynamic-CBD, and Adaptive-CBD and Cluster-CBD protocols





No. of Messages

ndmax + n(n + 1)

nkG + 3n - 1

ndmax + 5n - 6

nc2+ζnc +mnc+ζm+l+ζη

Diagnosis latency

ΔGTgen + ΔGTf+ Tout

ΔGTgen + 3dSTTf + 2Tout

ΔGTgen + (2dST + n − 1)Tf+ 2Tout


Communication complexity


O(nkG )


O (max(nc2,ζnc,mnc,ζm,l,ζη ))*

Time complexity

O(ΔG (Tgen + Tf)+ Tout )

O(Δ GTgen+ dSTTf+ Tout )

O (∆GTgen + (dST + n − 1) Tf+ Tout).


Permanent fault





Transient fault





Static fault





Dynamic fault





Hard fault





Soft fault





*O (max (nc2, mn, l)) is the communication complexity for cluster-CBD if the network topology is fixed

Observe that, based on Table 5, the cluster-CBD protocol outperforms other protocols in term of communication complexity and time complexity. Only Fixed-CBD, Dynamic-CBD, and Adaptive-CBD protocols have succeeded in diagnosing static faults, with varying time and communication efficiency. However, three protocols have did not succeed in diagnosing dynamic faults. The Cluster-CBD protocol succeeded in diagnosing MANET status in static as well as dynamic topologies. In fact, using clustering leads to making the network more stable as clusters as a whole move less than nodes and also allows hierarchical diagnosis (i.e. the diagnosis processes are conducted by cluster-heads and gateways.) Furthermore, as fewer nodes are involved in the diagnosis process, the number of the exchanged messages is smaller and hence a more efficient diagnosis.

Table 5 also, shows the types of faults which could be diagnosed by Fixed-CBD, Dynamic-CBD, Adaptive-CBD and Cluster-CBD. In general, Cluster-CBD outperforms other protocols in the ability to diagnose dynamic faults in finite amount of time.

From the previous overview of the available comparison-based fault diagnosis protocols in MANET, we list some limitations of these protocols below:

They cannot deal with transient faults which actually happen frequently in MANETs.

All protocols but Cluster-CBD assume that the MANET's topology is static during the diagnosis session, and cannot diagnose a MANET if its topology is dynamic. Cluster-CBD can deal with dynamic faults but using CGSR algorithm to cluster the nodes is a drawback and using Fixed-CBD to diagnose cluster-heads is also a limitation.

All protocols assume that all nodes have the same abilities and resources. This assumption limits their usefulness, affects their usage, and hinders their ability to be used on heterogeneous MANETs.

Although Cluster-CBD outperforms other protocols in many aspects, more attention should be given to cluster-head selection and MANET nature and fault management criteria should be considered.

The direction of future work

As we observed above there is an urgent need for diagnosis protocol that can effectively diagnose faults in MANET environments. Such a protocol should take in account the advantages of existed work such as the clustering idea, while at the same time it should be more suitable for MANETs in terms of mobility, energy, and bandwidth.

Toward this goal, and in a forthcoming paper, we propose a new protocol. To evaluate and simulate our protocol we propose three scenarios. In the first scenario, we assume that the network topology is fixed and static during the diagnosis session (i.e., Nodes movement do not affect the topology of network.) This scenario will be used to study the ability of our proposed protocol to diagnose static faults and compare it with existing protocols. In the second scenario, we assume that the network topology is dynamic during the diagnosis session (i.e., node' neighbours are changing during the diagnosis session where new nodes may be joining and other nodes may be disjoining.) This scenario will be used to check the ability of our protocol to diagnose dynamic faults. Finally, in the third scenario, we assume that we have a large-scale network; a dynamic network consisting of a large number of interacting nodes where nodes are joining, and leaving the networks. This scenario will be used to test the scalability of our proposed protocol.

The OMNeT++ simulator will be used to simulate and evaluate our protocol; we already designed the first scenario in the OMNeT++ simulator. In this scenario, we assumed that the network topology is fixed during the diagnosis session, note that the nodes are not necessarily static but their movements don't change the topology. For this scenario we will use the topology which was used in [6] as shown in Figure.5.

In the first scenario, there are 30 nodes, all nodes are allowed to move and change their places with some constraints such as the node should not change its neighbours during the diagnosis sessions.

Figure 5: Static network topology scenario


In conclusion, we discussed system-level fault diagnosis problem in MANET based on the comparison approach. System-level diagnosis is a method to build fault tolerant systems. Chessa and Santi proposed a model for Ad-Hoc networks, and based on their model many protocols and algorithms were proposed. This paper is basically a review of those protocols. In particular, we observed some limitations of these protocols such as dealing with dynamic faults and intermittent faults, and the need for more effective protocol is evident. In a future work, we will develop a comparison-based system-level fault diagnosis protocol; it will take the advantage of clustering idea. Our protocol will be implemented in OMNeT++ simulator where we will use several metrics to evaluate the protocol achievements.