Overview Of A Reverse Engineering Tool Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Abstract-Program understanding is an integral part of software maintenance that can be enhanced using reverse engineering technologies. The understanding process is heavily dependent on both the cognitive abilities of individuals and on the set of facilities provided by the program understanding environment. Rigi is a mature research tool that provides functionality to reverse engineer software systems. By virtue of the visualization techniques provided by Rigi, large systems can be analysed, interactively explored, summarized, and documented. Two contrasting approaches are available for visualizing software structures in the Rigi graph editor. The first

approach displays the structures through multiple, individual

windows. The second approach, Simple Hierarchical Multi-Perspective (SHriMP) views, employs fisheye views of nested graphs. In this paper we describe Rigi's main components and functionalities, and assess its impact on reverse engineering research. This paper also discusses tool extensibility features supported by Rigi and current industry practices. Furthermore, an integration of Rigi with other reverse engineering tools is described.

Keywords- program understanding, software maintenance, reverse engineering, visualization, extensibility

1. Introduction

Manual interpretation of large legacy systems is a cumbersome process. One way of understanding such a process is through computer-aided reverse engineering. Although there are many forms of reverse engineering, the common goal is to extract information from existing software systems to better understand them. The subject software system is represented in a form where many of its structural and functional characteristics can be analysed. This knowledge can then be used to improve subsequent development, ease maintenance and aid project management. This helps defend against brittle software systems that resist graceful change. Problems can be exposed and corrected if reverse engineering is applied preventatively during evolution. As maintenance and re-engineering costs for large legacy software systems increase, the importance of reverse engineering will grow accordingly. Numerous reverse-engineering tools have been developed to structurally assist in program understanding and software maintenance by providing methods to uncover the original (or existing) design of software systems. The usability of these tools is critical to their effectiveness.

Reverse Engineering

The reverse engineering process involves parsing a subject software system, resulting in a graph where nodes represent system artefacts such as functions, data types, and arcs represent dependencies among the artefacts. A hierarchy is then imposed on the flat graph by building subsystem abstractions .Software maintainers can subsequently browse

and annotate these software hierarchies to aid in program


1.2 Program Understanding & Software Maintenance:

The application of reverse engineering technologies to large-scale legacy software systems offers the opportunity to regain a measure of control and understanding of the software. There are diverse program understanding techniques such as structural re-documentation, pattern matching at various levels of abstraction, and run-time analysis. Effective understanding is also necessary for factoring and optimizing code, porting to different platforms, exploiting reusable components, and adding new features. Applications such as medical instrumentation and process control require a level of quality that is only possible with a thorough understanding of the software. Because of changes during evolution, the software architecture often degrades in that the code no longer follows the original design criteria. This is especially true for legacy

software systems, which are typically large, complex, poorly

structured, and resist change. Some software systems are

badly written from the start because of the enormous pressure

on developers to ship the product out the door, even if the bugs and unsound structure later cause maintenance nightmares. Somebody still has to understand these high entropy systems. Thus, program understanding is an important

part of software preservation. Program understanding can thus leverage the software developer's actions and knowledge, and augment the software development and maintenance processes.

Section 2 describes available visualisation techniques in Rigi followed by the extensibility feature of Rigi. Section 3 outlines an integrated reverse engineering tool-kit for program understanding. Section 4 elaborates the current practices in the industry. Section 5 highlights the ongoing research associated with Rigi. Section 6 is conclusion.

2. Rigi

Rigi is a system for extracting, analysing, visualizing

and documenting the structure of evolving software systems.

Software structures are manipulated and explored using a

graph editor. The following two subsections describe two

alternative approaches for exploring software hierarchies in


2.1 Visualisation

Graphs are particularly suitable for visually presenting software structure. Nevertheless, as the size of software systems increase, so do their representations as graphs. Advanced graphics and abstraction techniques are needed to manage the visual complexity of these large graphs. The Rigi reverse engineering system provides two contrasting approaches for presenting software structures in its graph editor. The first displays the structures through multiple, individual windows. The second (newer) approach, Simple Hierarchical Multi-Perspective (SHriMP) views, employs fisheye views of nested graphs. We compare and contrast these two interfaces for visualizing software graphs, and provide results from user experiments.

2.1.1 Multiple Window Approach

In the original Rigi approach, a subsystem containment hierarchy is presented using individual, overlapping windows

that each display a specific portion of the hierarchy. For example, the user can open windows to display a particular level in the hierarchy, a specific neighbourhood around a software artefact, a projection or flattening of the hierarchy, or the overall tree-like structure of the entire hierarchy.

Figure 1 shows the multiple window approach in Rigi for presenting the structure of a small sample program. The program root node, entitled src, is displayed in Fig. 1(a). A user displays the next layer in the hierarchy by double clicking on the src node, see Fig. 1(b). This layer consists of the main

function and two subsystems, List and Elements. Arcs in this

window are called composite arcs and represent one or more

lower level dependencies in the graph. The List subsystem has been opened in Fig. 1(c). Nodes in this window are leaf nodes and directly correspond to functions or datatypes in the software. Arcs in this window represent either call or data dependencies. Figure 1(d) shows an overview of the software hierarchy and provides context for the other windows. Arcs in the overview window are called level arcs as they represent the parent-child relationships in the hierarchy. Finally, Fig. 1(e) shows a projection from the src node. This operation has the effect of flattening the hierarchy and displays all of the lower level dependencies and artifacts in a single window.

However, for larger systems, the hierarchy may be very deep

and many windows may need to be opened. Positioning and

resizing these windows to keep pertinent information visible

can be tedious. Since the relationships between windows are

typically implicit, it is easy to lose context and become disoriented.

2.1.2 SHriMP

SHriMP, a tool for visualizing large graphs, uses the nested graph formalism and a fisheye view algorithm for manipulating large graphs. A basic incentive for writing this tool is to provide a mechanism for visualizing detail of a large information space and at the same time to provide contextual cues concerning its context. When visualizing any large information space, it is necessary to be able to create different views of the information where each one provides a different perspective. SHriMP goes one step further by providing

a mechanism to create views that can show multiple

perspectives concurrently. SHriMP is implemented in the Tcl/Tk language and is currently a library that can easily be integrated into systems that have the Tcl/Tk language available in it. The following subsections provide some background on the nested graph formalism and the fisheye

view paradigm ,used by it. Nested Graphs

Nested graphs were first introduced by David Hare1 in 1988. Nested graphs, in addition to nodes and arcs, contain composite nodes which are used for denoting set inclusion. The containment or nesting feature of composite nodes implicitly communicates the parent-child relationships in a hierarchy. In SHriMP a non-leaf node is open when its children are visible and closed when its (children are hidden from view. Due to limited screen space, nodes and composite nodes need to be resized as information needs change. The following describes an automatic strategy to zoom (enlarge or shrink) nodes which will assist in managing the screen space available. SHriMP FISH-EYE VIEW

The fisheye view used by SHriMP has several nice features as follows. The zooming technique is highly interactive, even for very large graphs. When one node is enlarged, the other nodes smoothly decrease in size to make room for the selected node. The zoom-out operation is the inverse operation of the zoom-in operation. Therefore different areas of the graph may be inspected without permanently altering the layout of the graph. A user may zoom multiple focal points and focal areas in the graph. For visualizations of many information spaces, there is no notion of geometric distance. Nodes that are close to the focal point are no more important than nodes far away. The SHriMP fisheye view uniformly resizes nodes when more or less screen space is requested. The SHriMP fisheye view is flexible in its distortion technique. For a grid or tree layout, nodes that are parallel remain parallel in the distorted view. However, in other layouts, where nodes adjacencies are important the proximity of nodes is maintained. Fig 2 is a graph that visualizes parts of IBM's SQL/DS system code.

Red nodes are variables, yellow nodes are modules, and purple nodes are types. All arcs have been filtered out except for the yellow ones, which represent calls between modules (i.e., Figure 2 shows the call graph). The full graph contains 923 nodes and 2395 arcs, but in this filtered view only 189 yellow arcs are actually visible.

The graph visualization and manipulation are implemented with Tcl/Tk. This implementation exhibits several drawbacks. The rendering speed is rather slow and does not scale up well for larger graphs. Furthermore, the GUI's look and feel is perceived as rather crude compared to the state of the art. Lastly, images cannot be exported (except for taking screenshots). The last point is a severe drawback, because Rigi graphs are the main result of the reverse engineering activity, which become essential system documentation. Even if Rigi screenshots are cumbersomely integrated into the system documentation, they are mere static bitmaps that allow no interactive exploration.

2.1.3 Recommendation

Based on observations, several improvements to the Multiple Window and SHriMP interfaces are recommended. In Multiple Window, users often forget (or never discovered) the context of individual windows. They often open several windows of the same view, failing to recognize that these views were already available. Some way of emphasizing the relationship of the open windows to the corresponding composite nodes is needed. There is also confusion between the interpretation of the general windows and the hierarchy overview. Some users misinterpreted the parent-child relationships in the overview as call or data dependencies. The appearance of the overview window should differ from the general windows. This might be achieved by simply having different background colors for the different window types.

The single most important problem with SHriMP views was the slow response of the interface. Since SHriMP views are based on direct manipulation, users expecting immediacy are disturbed by the slow response. Another problem with SHriMP is that, it is possible to become intimidated by the large number of arcs revealed by opening several composite arcs. Methods to make it easier to identify arcs of interest and filter uninteresting arcs are required.

In general, both the Multiple Window and SHriMP interfaces have advantages and disadvantages. Future versions of Rigi should include the ability to seamlessly switch between the two interfaces when reverse engineering a software system.

Fig. 3 ilustrates a hybrid approach to improve the overall functionality of Rigi. The figure shows a call dependency tree routed at the main function in a small program written in the C language. This program implements a list data structure. One of these nodes, mylistprint has been expanded by the SHriMP layout adjustment algorithm using the hybrid layout strategy suitable for tree layouts. By zooming the node in this fashion, a software engineer can read the source code of the mylistprint function and at the same time maintain his mental map of the location of this function in the call dependency tree.

2.2 Extensibility

Most reverse engineering tools provide a fixed palette of extraction, selection, and organization techniques. However a programmable reverse engineering approach which supports customization of user-interaction highly improves the usability of the system. This approach uses a scripting language that enables users to write their own routines for common reverse engineering activities such as graph layout, metrics, and subsystem decomposition, thereby extending the capabilities of the reverse engineering toolset to better suit their needs. A programmable environment supported by this approach subsumes existing reverse engineering systems by being able to simulate facets of each one. No rigid program understanding environment will ever be suitable for all users in all software maintenance tasks. Users' disparate cognitive abilities and their diverse approaches to program understanding preclude the use of a static suite of extraction, selection, and organization techniques. Thus, we should provide an environment that supports the users' view; tools and interfaces that support the natural process of program understanding- not hinder it.

To meet this goal, a successful reverse engineering environment must provide mechanisms through which users can provide additional functionality. Instead of writing yet another command language, Rigi makes use of Rigi Command Language (RCL).

RCL is based on the Tcl/Tk scripting language. It provides an extendable core language and was specifically written to be embedded into interactive windowing applications. Tcl (Tool Command Language) is a good example of an application-independent embedded "universal scripting language". It provides an extendable core language, and was specifically

written as a command language for interactive windowing applications. It also provides a convenient framework for control integration among Tcl-based tools. Each application

extends the Tcl core by implementing new commands that are indistinguishable from built in commands, but are specific to the application. Tk is a toolkit companion to Tcl

that implements the Motif look-and-feel.

The SHriMP interface is implemented in the Tcl/Tk

language and is currently a library that has been integrated

into the Rigi system. Since SHriMP (through Rigi) is end-user programmable, the layout strategy can be dynamically changed for one or more nodes. The user can experiment with a variety of hybrid strategies based on the graph layout hierarchy. As a result, extending the Rigi editor with new visualization techniques, such as SHriMP, is feasible

3. Integrated Reverse-Engineering Tool Kit

As the amount of legacy code currently in use increases, the importance of program understanding grows accordingly. Program under- standing is the process of developing mental

models of a software system's intended architecture, purpose, and behaviour. There have been numerous research efforts to develop tools that provide assistance during the understanding

process. These tools adopt a number of different approaches, including pattern-matching, visualization, and knowledge-based techniques. Despite successful results from each of these approaches, it is clear that no single tool or technique

is sufficient by itself and that the software engineer can best be served through a collection of tools that complement each other in functionality. Over the past three years, researchers have been developing a toolset, called RevEngE (Reverse

Engineering Environment) based on an open architecture for integrating heterogeneous tools. The toolset is integrated through a common repository specifically designed to support program understanding. Individual tools in the

kit include Ariadne, ART and Rigi.

3.1 Ariadne

Ariadne is a set of pattern matching, design recovery, and program analysis engines. It includes techniques that perform code localization and system clustering. Code localization is achieved by comparing feature vectors calculated for every program fragment of the system and by incorporating a dynamic programming pattern-matching algorithm to compute

the best alignment between two code fragments (one considered as a model and the second as input to be matched). Clustering is achieved by analysing global data flow and data artefacts (data types, variable names, and other informal information) shared by program fragments, suggesting the use of similar concepts. Data flow is analysed by computing the data bindings between two code fragments. This type of analysis examines the use of global variables and variables passed by reference. Data artefacts are examined by analysing variable names and data types defined and used in every function (common references analysis).

In Ariadne, source code is represented within the knowledge repository as an aNnotated abStract syntax Tree (AST). An AST was chosen as the program representation scheme because the AST maintains all relevant information from the source level. Several fine-grained analysis algorithms (for example, slicing, control flow graph analysis, and metrics calculation) can be directly applied. Nodes in the AST represent language components (statements and expressions) and arcs represent relationships between these language components. For example, an IF-statement is represented as an AST node with three arcs pointing to the condition, the THEN-part, and the ELSE-part. During Ariadne analysis, each AST node is annotated with control and data flow information. It can be a

vector of software metrics; a set of data bindings with the rest

of the system; or a set of keywords, variable names, and data types. The support environment is used to analyse and store the AST and its annotations, which are computed for every expression and statement node in the AST.

3.2 Art

ART is a prototype textual redundancy analysis engine. It generates a set of substrings ("snips") that covers the source (that is, every character of text appears in at least one substring). A set of substrings is extracted from each local

context to ensure that redundant matches are not missed due to different generation processes. Matching snips are then identified. The resultant database of raw matches is translated

into a form that more concisely expresses the same information to facilitate deeper analysis. Task-specific data reduction is then performed and the results summarized. This basic process may be enhanced by doing pre-processing of the

source or synthesis of partial matches from short exact matches.

3.3 Rigi

Rigi is a prototype realization of the PHSE (Programmable Hyper Structure Editor): an architecture for a meta reverse engineering environment . It provides a basis upon which users construct domain-specific reverse engineering environments. It is instantiated for a particular application domain by specializing its conceptual model, by extending its core functionality, and by personalizing its user interface.

Rigi is used as part of the RevEngE toolset primarily as a visualization interface for the data produced through analyses by other tools.

However, a wide variety of analyses can also be performed using Rigi itself. This is primarily due to its scripting facility, which enables it to integrate third-party tools to perform specific reverse engineering activities, such as specialpurpose

graph layout, to aid program understanding. This section shows how the reverse engineering capabilities provided by the RevEngE toolset are used to aid program understanding. This is done by applying the integrated toolset to selected reverse engineering analysis scenarios representative of actual program understanding tasks. No prior knowledge of the subject system's source code is assumed.

4. Current Practices in Industry

If one views maintenance as "reuse-oriented software development", reverse engineering can benefit everyone involved in software production (e.g., maintainers, developers, documenters, managers, and testers). There have been several areas identified as critical to improving software maintenance and (re)development, and recapturing technologies are one of them. Simply put, recapturing technologies attempt to recover the original design in an existing software system by using reverse engineering and various program understanding tools. This knowledge can then be reused for further maintenance.

With the cost of software maintenance routinely consuming upwards of 50% of a product's life-cycle and budget, any savings in maintenance will have a significant impact on lowering the overall project cost. It can also affect the quality of the software by reusing tested components, domain knowledge, and information-something that is becoming increasingly important in today's competitive marketplace. Some of the greatest benefits of reverse engineering a software system can be realized by management personnel. Project management and planning at most corporations is a complicated process. The software systems for which they are responsible exist in various life cycle stages: new product development, testing, maintenance of existing code, and different versions. They must also manage the human element of the project: identify the strengths of team members, allocate resources based on various needs (both personal and financial), incorporate new personnel into the project, and compensate for the departure of experienced staff. Other considerations include: funding, experience and talents of the people available, schedules, impact on other products and development groups, and market analysis. All of these things make management very difficult. This problem is exacerbated when the complexity of the project, both technical and organizational, threatens to overwhelm even the most prepared managerial personnel.

Benefits produced by reverse engineering include aiding system comprehension, change analysis, and recovering lost information. Rigi offers a semi-automated alternative to the above problems. The Rigi system uses views to direct the focus on visual data and guide the exploration of spatial data. A view represents a particular state and display of a constructed software model. Different views of the same software model can be used to address a variety of target audiences and applications. Hence the same software system can be used by all those involved in the project, including development, testing, communications, and management.

5. Ongoing Research in Rigi:

Rigi programmable environment can be integrated with a popular World Wide Web browser to support hyperstructure hotlists: an approach to managing link complexity, organizing conceptual themes, and aiding Internet navigation through

the use of multiple virtual webs.

5.1 Hyperstructure Hotlists

When one attempts to understand a large body of information, the overall structure of the information space is just as important as the inner structure of any single artefact if not more so. This is especially true when the number of artefacts in the domain is much larger than the size of each artefact. This process can be termed hyperstructure understanding (HSU): identifying artefacts and understanding their structural understanding relationships in complex information webs. HSU is an

objective, not a well-defined process. The prefix hyper is used to distinguish HSU from the in-the-small activity of understanding the internal structure of any single artefact; the focus is the analysis of overall system structure.

Existing WWW browsers provide limited functionality to address the HSU problem. A hotlist (bookmark) mechanism is typically used to store a semi-ordered collection of links to WWW pages of interest. This mechanism is useful for a small set of links, but it provides no other structuring mechanism. Moreover, few browsers enable users to associate much context with the entries in the hotlists. As the hotlist grows, it

quickly becomes unwieldy. Users forget why a certain page is referenced from their hotlist. More importantly, there is no way of conceptually clustering links to represent common themes (other than authoring HTML documents). Some browsers provide a hierarchical tree-structured hotlist mechanism which allows users to collect entries into meaningful taxonomic groups. However, unless the user puts duplicate entries in the file to associate the same location with

two or more concepts, there is no way to build multiple semantic threads through the hotlists. In addition to the complexity this introduces, duplicate entries are highly undesirable because of the configuration management problem.

An example of a multiple-threaded subject index is the Yahoo directory at http://www.yahoo.com. It presents a main index hierarchy, but the index has many cross-references embedded in it. For example, the main category "Computers" has an entry "Computer Science", which actually is a cross-reference to another branch, "Science: Computer Science." While Yahoo represents a good example of the semantic indexing function that is desired, it achieves this only by extraordinary manual effort and is still based on the judgment of the proprietors, similar to what professional librarians do to

create catalogue entries for books and other publications.

Additionally, Yahoo presents a purely textual view of the data base to users; it does not capitalize on the capabilities of graphical user interfaces that can present a richer view of the spatial interconnection topology of the underlying information space.

Hyperstructure hotlists provide two valuable enhancements to conventional WWW indexing tools. The first is a powerful and easy-to-use method for constructing arbitrary indexing schemes which capture the semantics of the users' interest in any particular set of Web locations. These semantic networks can be arbitrarily large and arbitrarily complex, permitting the user to build as many threads through the information space

as desired. The second is a graphical interface (Rigi), that can

display threaded structures in two-dimensional space with functions for expanding and collapsing sub-nets, and for creating and maintaining cross-reference links among leaf nodes and sub-nets. This interface employs shape and colour to convey details about semantic concepts which would be cumbersome in a purely textual presentation.

5.2 Integration of Rigi with a WWW Browser

The integration of Rigi reverse engineering environment and the Netscape WWW browser provides the support for hyperstructure hotlists. The result is an improvement over the simple hotlist mechanism provided by most browsers and an alternate perspective of the underlying information space. Hyperstructure hotlists are constructed by retargeting Rigi to support HTML. A simple relational model is used to represent HTML at the data level, while a scripting language is used to enable the gathering of this data from HTML documents, and to communicate with Netscape. A conceptual modeling language is used to organize knowledge and reduce cognitive over-head concerning the (potentially very large) information space of the WWW. A graphical representation of the semantic network used to model the information artefacts facilitates the interactive navigation, analysis, and presentation of the information space. In summary, the information space is constructed out of data artefacts, structured through the construction of conceptual models, and interactively explored in a hypertextual manner by representing neighbourhoods as

(parts of) semantic networks. The approach encourages the integration of additional tools into the reverse engineering environment, building on other research work in an evolutionary manner.

6. Conclusion

Rigi is an interactive, visual tool designed to help developers better understand and re-document their software. Rigi includes parsers to read the source code of the subject software and produce a graph of extracted artefacts. However the parsing process still poses the biggest problem to the Rigi tool suite. Better feedback of the parser to the user during the parsing process and tighter integration into the rest of the Rigi tool suite might help to avoid similar problems in the future.