Computational Model For Evolvable Grid Computer Science Essay



Grid computing systems are the latest computing environments and have been gaining popularity for the past few years. They can be considered extensions of distributed computing systems in which the number and heterogeneity of the systems are much higher. By plugging their systems into a grid, users can potentially use the vast number of services available in the grid, much as electrical appliances draw power from the electrical power grid. Grids also provide opportunities for collaborative computing, in which users across the grid collaborate on solving large applications.

The computational model is a blueprint of a computation performed on a particular architecture. Our unit of computation is an object in the sense of the object-oriented paradigm. These objects cooperate, communicate and collaborate with each other. The computational model comprises objects, models and strategy; the objects comprise grids, nodes, data, application software and jobs, while the models consist of communication, termination, timing, failure and migration models.

This chapter describes the computational model's objects in Section 4.2, the models in Section 4.3, and the strategy in Section 4.4.

4.2 Objects

The main objects for the computational grid will be described in this section. These objects include grids, nodes, application software, data, jobs and applications.

4.2.1 Grid

A "grid" is a collection of computing resources that perform tasks. In its simplest form, a grid appears to users as a large system that provides a single point of access to powerful distributed resources. In its more complex form, which is explained later in this section, a grid can provide many access points to users. In all cases, users treat the grid as a "single" computational resource. Resource management software such as N1 Grid Engine 6 software (grid engine software) accepts jobs submitted by users. The software uses resource management policies to schedule jobs to be run on appropriate systems in the grid. Users can submit millions of jobs at a time without being concerned about where the jobs run.

No two grids are alike. One size does not fit all situations. The following three key classes of grids exist, which scale from single systems to supercomputer-class compute farms that use thousands of processors:

Cluster grids are the simplest. Cluster grids are made up of a set of computer hosts that work together. A cluster grid provides a single point of access to users in a single project or a single department.

Campus grids enable multiple projects or departments within an organization to share computing resources. Organizations can use campus grids to handle a variety of tasks, from cyclical business processes to rendering, data mining, and more.

Global grids are a collection of campus grids that cross organizational boundaries to create very large virtual systems. Users have access to compute power that far exceeds resources that are available within their own organization.

In other words, a grid is a collection of nodes connected via a network and managed by a resource broker. It promotes the sharing of services, computing power and resources such as disk storage, databases and software applications. In the computational model and design, a grid can communicate with other grids in order to exchange application software and/or data, and even the job itself if the chosen grid cannot meet the job's requirements within its own domain. In this case it sends the job to another grid for execution. In the model, each grid has an ID and its own policy.

4.2.2 Node

A node is the most basic component in grid computing. It is a collection of work units that can be shared and that can provide some capabilities. The grid nodes usually differ in speed, capacity, architecture and operating systems. Communication between nodes is achieved via network capabilities such as LAN and WAN.

Nodes are responsible for receiving, executing and returning the results of jobs. Each node has its own application software and data that fulfils the user's requirement in order to execute the job. The model assumes that all grid nodes can communicate with each other in grid environments in order to exchange jobs, application software and data that can migrate between grid nodes when required. They must therefore contain middleware application software. Each node in the grid sends a periodic heartbeat to the monitor responsible for managing and controlling these nodes. This heartbeat contains a timestamp, node name, job status (if the node is running a job) and other optional information.

A node can be either a logical node or a physical node; the latter can contain one or more logical nodes [80].

Logical Node

Logical nodes are designed to perform management roles in grid environments; they can manage different kinds of grid components such as Condor Pool, Distributed File System (DFS) or Distributed Computer Pool [37], as well as the relationships between them.

In the model, the resource broker and the grid monitor act as logical nodes.

Physical Node

Physical nodes are categorized according to their function in the grid. The two most common types of physical node are computation and storage nodes. Sometimes these nodes are also called "resources", "members", "donors" or "hosts", among many other names, but they all mean "node".

1. Computation Nodes

Computational nodes are the machines deployed and exploited according to their hardware and processing capabilities. The node could be a cluster, mainframe, high performance computer or a desktop PC. These capabilities are provided mainly by the various kinds of processor architecture. Each computation node has its own processor architecture and its own hardware specifications such as speed, software platform compatibility and internal memory.

Computation nodes can be exploited in many ways; a grid job can be executed on a grid node instead of running on a local machine outside the grid. Parallel jobs can be executed on many processors, whether they are located on the same grid node or on many. Parallel jobs are executed in this way because they have been designed so that their work can be divided into separate parts that run on different processors. Some other jobs may require repeated execution on many computation grid nodes [49]. Parallel jobs by nature play an important role in increasing scalability: the more separate work units a job produces, the more computation grid nodes can be used efficiently for its execution, which reduces the job's execution time. In other words, the parallel nature of the jobs and the role of the grid computation nodes in executing each job's work units improve the scalability of the grid.

2. Storage Nodes

Since the function of the computation node is the processing of jobs, other machines are responsible for storing and providing data. These machines are called storage nodes. The most common storage type is secondary storage using hard disk drives or other permanent storage media such as tape drives.

Implementing secondary storage in a grid has the advantage of improving the capacity, performance, sharing and reliability of data exchange within that grid. Different kinds of file systems handle the storage and organisation of the data across the nodes of the grid network. Popular and common network file systems include Network File System (NFS), Distributed File System (DFS) and General Parallel File System (GPFS). It is advisable that the storage of all the nodes, especially the data storage nodes, be mounted on one particular type of network file system.

3. Special Nodes

Grid computing can use resources of types other than those pertaining to job execution. A grid administrator may sometimes create new artificial resources to execute certain jobs. For example, some machines may be designated for use only for medical research.

4.2.3 Application Software

Application software is a software program or group of software programs installed on grid nodes. These programs are collections of instructions describing a task or set of tasks to be carried out by a node. Application software is loaded into the node's RAM and is executed by the CPU.

The most basic function of application software is to carry out the required task using the available resources on the node. Application software is responsible for executing and running the job. It allows the end user to accomplish one or more specific jobs by employing the capabilities of a computer to fulfil the requirements of the job the user wishes to perform. Each piece of application software has its own requirements for disk space, CPU speed and operating system.

In the model, application software has to be able to receive the job, run it and produce the result. It can also migrate between nodes inside or outside the grid depending on its policy in order to enable the required node to meet the user's requirements for the execution of the job. This capability reduces the number of rejected jobs. The software also may be uninstalled from the node by the grid if it affects its performance; the grid can nonetheless save the original copy of it without installing it. When the need for this application software arises again, the grid reinstalls it. This need may arise when the job requires this application software and no other node has it, or the node that does have it is busy.

4.2.4 Data

Data is used by application software to perform a specific task. Data can be defined as a piece of information stored in a grid node. The application software can access data either from a local node or from a remote node. The data may already be stored on one or more nodes, or it may come with the user job. In the model, each stored data item has a unique name so that users can refer to it. Data can migrate from node to node within or outside the grid, depending on its policy, which may allow the data to migrate, copy itself, be updated and changed, or remain static.

4.2.5 Grid Job

A grid job is typically an assignment submitted for execution by a node on the grid. It may perform a calculation, execute one or more system commands, operate machinery, or move or collect data. A job may also go by other names such as "transaction", "work unit" or "submission", all of which mean the same thing.

In the system, users need to describe their job requirements. These include job name, required software applications, required data, execution time and resource specifications (CPU count, speed, operating system, physical and virtual memory). In the system, jobs can also migrate from node to node within the grid environment in order to complete the job if a failure occurs.
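The job requirements listed above can be pictured as a small record that a resource broker compares against what a node offers. The following is only a sketch under stated assumptions: the class names, field names and units are hypothetical and are not taken from the model, and physical and virtual memory are collapsed into a single field for brevity.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class JobRequirements:
    """Hypothetical job description; all field names are illustrative."""
    name: str
    software: Set[str]        # required application software
    data: Set[str]            # required data, by unique name
    max_runtime_s: int        # expected execution time in seconds
    cpu_count: int
    cpu_speed_ghz: float
    operating_system: str
    memory_mb: int            # assumption: one memory figure only


@dataclass
class NodeOffer:
    """Hypothetical description of what a node can provide."""
    node_id: str
    software: Set[str]
    data: Set[str]
    cpu_count: int
    cpu_speed_ghz: float
    operating_system: str
    memory_mb: int


def satisfies(node: NodeOffer, job: JobRequirements) -> bool:
    """Return True if the node meets every stated requirement of the job."""
    return (
        job.software <= node.software      # all required software installed
        and job.data <= node.data          # all required data present
        and node.cpu_count >= job.cpu_count
        and node.cpu_speed_ghz >= job.cpu_speed_ghz
        and node.operating_system == job.operating_system
        and node.memory_mb >= job.memory_mb
    )
```

In the model, a node that fails this check because of missing software or data need not be rejected outright, since application software and data may migrate to it if their policies allow.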

The collection of jobs that fulfill the whole task is called the grid application. For example, a grid application can be the simulation of business scenarios such as stock market development that require a large amount of data as well as a high demand for computing resources in order to calculate and handle the great number of variables and their effects [49].

The emergence of grids is due to the needs of large-scale computing infrastructures for solving major computing and data-intensive problems in the fields of science, engineering, industry and business. The grid infrastructure provides different kinds of support to a wide range of applications which can be categorised as follows [11]:

Distributed supercomputing support allows applications to use grids in order to reduce their completion time.

High-throughput computing support allows applications to use unused processor cycles in grids for loosely coupled or independent tasks in order to increase aggregate throughput.

On-demand computing support allows applications to use resources in the grid that cannot be cost-effectively or conveniently located locally.

Data-intensive computing support allows applications to use grids to gather information from distributed data repositories and databases.

Collaborative computing support allows applications to use grids to establish human-to-human interactions via a virtual space.

Multimedia computing support allows applications to use grids to deliver content while assuring end-to-end QoS.

4.3 Models

Models manage and control system objects to achieve the desired objectives of the grid system. These are the timing, communication, termination, failure and migration models. The role of each is described as follows.

4.3.1 Timing Model

Time is an important and interesting issue in grid computing, the aim of which is the exploitation of underutilised resources to achieve faster job execution times. Each node in the design has its own internal clock. Every node must therefore periodically synchronise its clock with that of the resource broker using the Network Time Protocol (NTP). When nodes join the grid, they also specify their availability time: the period during which they will be available. All cooperating grids are available continuously from the time they join the grid until they need to leave it. If a participating grid needs to leave, it must first complete all jobs that belong to the original grid.

The timing model is responsible for maintaining and controlling system time, thus preventing jobs from running for longer than they are allowed to, as well as assisting in handling failure. The execution of a job needs a certain period of time; most resources use time as a charging unit because it is easily quantifiable. It is therefore possible for users to provide expected times for job completion.
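As a rough illustration of how the timing model could prevent jobs from running longer than allowed, the sketch below compares each job's elapsed time with its permitted run time. The names `RunningJob` and `overdue_jobs` are hypothetical, and the code assumes that all timestamps are taken from clocks already synchronised with the resource broker.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class RunningJob:
    job_id: str
    started_at: float        # seconds, on the synchronised clock (assumed)
    allowed_runtime: float   # seconds the job is permitted to run


def overdue_jobs(jobs: Iterable[RunningJob], now: float) -> List[str]:
    """Return the ids of jobs that have run longer than they are allowed to."""
    return [j.job_id for j in jobs if now - j.started_at > j.allowed_runtime]
```

What happens to an overdue job (termination, migration or a charge to the user) is a policy decision that this sketch leaves open.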

4.3.2 Communication Model

All objects in a grid system need some form of communication in order to perform their activities with the desired flexibility. Whether it is as simple as transferring a single packet between a client and server, or as complex as advanced coordination issues carried out over a network between hundreds of nodes, communication is essential.

There are two fundamental ways for grid nodes to communicate with each other regarding jobs: by message passing, here in the form of Remote Procedure Calls (RPCs) [79], and by using shared memory. Message passing is a more popular paradigm than shared memory because it allows easier communication between multiple processor architectures, and has a larger number of supporting applications and software tools. When coordination is involved, each of these communication forms has a complementary coordination style: message passing uses control-driven coordination (procedure and function calls are made between two processes, irrespective of whether they are local to a single machine or hosted on different machines), and shared memory uses data-driven coordination (communication is carried out by placing data inside the shared memory).

In the model, communication is responsible for transmitting a message from one object to another. The purpose of this communication is to exchange and transmit information and data between these objects. Information transmission is also necessary when dependencies are present. Communication also helps discover failures. Communication between sending and receiving objects is performed via RPC which may be either synchronous or asynchronous.

The model uses the latter type, which means a non-blocking send. In asynchronous communication, the send and receive operations do not block the caller, unlike in synchronous communication. Asynchronous communication is an alternative form that is useful when an object can retrieve replies later.

Communication provides the means of coordination and cooperation between objects.

Because the design is client/server, the client makes requests to a grid service using a remote procedure call. When the request has been carried out, notification is sent back to the client, who can meanwhile make a new remote procedure call to that same service. RPC is a message-passing protocol that provides high-level communications. It is built on the eXternal Data Representation (XDR) protocol that standardises data representation in remote communications. This protocol converts the parameters and results of each RPC service provided. RPC consists of two distinct structures, the call message and the reply message. In this model, the client makes a procedure call (call message) to request a service from the server. When the request arrives, the server performs the requested service, and sends a reply (reply message) back to the client.
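The call/reply structure and the non-blocking send described above can be sketched in a few lines. This is an in-process illustration only, not an RPC implementation: a thread pool stands in for the network, there is no XDR encoding, and the names `CallMessage`, `ReplyMessage`, `GridService` and `AsyncClient` are hypothetical.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass
class CallMessage:
    call_id: int
    procedure: str
    args: Tuple[Any, ...]


@dataclass
class ReplyMessage:
    call_id: int
    result: Any


class GridService:
    """Stand-in for the server side; a real service would sit behind a network."""

    def __init__(self, procedures: Dict[str, Callable[..., Any]]):
        self._procedures = procedures

    def handle(self, call: CallMessage) -> ReplyMessage:
        # Perform the requested service and wrap the result in a reply message.
        result = self._procedures[call.procedure](*call.args)
        return ReplyMessage(call.call_id, result)


class AsyncClient:
    def __init__(self, service: GridService):
        self._service = service
        self._pool = ThreadPoolExecutor(max_workers=4)
        self._next_id = 0

    def call(self, procedure: str, *args: Any) -> "Future[ReplyMessage]":
        """Non-blocking send: returns at once with a Future for the reply."""
        self._next_id += 1
        message = CallMessage(self._next_id, procedure, args)
        return self._pool.submit(self._service.handle, message)

    def close(self) -> None:
        self._pool.shutdown(wait=True)


if __name__ == "__main__":
    client = AsyncClient(GridService({"add": lambda a, b: a + b}))
    first = client.call("add", 1, 2)
    second = client.call("add", 10, 20)   # issued without waiting for the first reply
    print(first.result().result, second.result().result)
    client.close()
```

The client collects each reply later through `Future.result()`, which corresponds to the "retrieve replies later" behaviour of asynchronous communication.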

4.3.3 Termination Model

After a job has been submitted, it runs on the nodes until termination. A job usually ends in one of three ways: normal completion, failure or termination by the user. The model uses conventional (as opposed to distributed) termination because most tasks in grid computing are executed in parallel, which is the preferred method. It allows objects working in parallel to complete one task; when any object has finished its part, it does not wait for the others to finish theirs. This type of termination saves time and allows objects that have finished their tasks to embark on others while the remaining objects are still completing their jobs.

4.3.4 Failure Model

Failures in distributed systems can be unpredictable in that they can leave the job in one of many possible failed states. Failure means that the desired state or condition is not reached. In the grid computing environment the probability of failure is high because the grid aggregates an immense number of hardware and software components, and this probability increases with the system's size.

The grid consists of extremely heterogeneous objects, which can lead to failure when they interact.

Grid environments are extremely dynamic, with components constantly joining and leaving the system.

Detecting failures in a dynamic and heterogeneous system such as a grid is very difficult.

The failure may be due to software or hardware crashes, process failure, communication delays, network failures or system performance degradation. Failures in grid computing are partial, meaning that some components fail while others continue to function. When hardware or software faults occur, jobs may produce incorrect results or stop before they are complete. The model assumes that all system components, whether hardware or software, may fail at any time. Hardware failure may be in a node or network or in communication. Fault tolerance is an essential characteristic and a necessary function for grid environments in order to avoid the loss of computation time. Two mechanisms provide a fault tolerance model: failure detection and failure handling.

Checkpointing schemes allow a job to continue from the point of failure, avoiding the necessity of rerunning the whole job. This model is mainly used in computing systems to store the current state of the job. By switching to an earlier checkpoint, a system can reload the previous state and resume computation from the point of failure. Besides recovery, checkpointing also enables other features such as job migration, which allows a failed job to continue on another machine from the point of failure. When failure occurs, job migration is the best way to complete the submitted job.
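A toy example may make the checkpoint-and-resume idea concrete. The sketch below is not the scheme used in the design: the job is a simple summation, the checkpoint is a plain dictionary, and the `fail_at` parameter exists only to simulate a crash.

```python
import json
from typing import Dict, List, Optional


def run_with_checkpoints(items: List[int],
                         checkpoint: Dict[str, int],
                         fail_at: Optional[int] = None) -> int:
    """Sum `items`, recording progress in `checkpoint` after every step.

    `checkpoint` looks like {"next_index": 0, "total": 0}. `fail_at` simulates
    a crash just before the item at that index is processed.
    """
    i = checkpoint["next_index"]
    total = checkpoint["total"]
    while i < len(items):
        if fail_at is not None and i == fail_at:
            raise RuntimeError(f"simulated failure at index {i}")
        total += items[i]
        i += 1
        # Save the state reached so far; a real system would persist this.
        checkpoint["next_index"] = i
        checkpoint["total"] = total
    return total


if __name__ == "__main__":
    state = {"next_index": 0, "total": 0}
    try:
        run_with_checkpoints([1, 2, 3, 4], state, fail_at=2)
    except RuntimeError:
        saved = json.dumps(state)   # serialised state could be sent to another node
    # Resume from the saved state instead of rerunning the whole job.
    print(run_with_checkpoints([1, 2, 3, 4], json.loads(saved)))
```

Because the saved state is plain data, it can be handed to a different machine, which is what makes job migration after a failure possible.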

In the design, the system can detect and handle, and thereby recover from, failure. Detection inside the grid relies on the regular heartbeat that each node sends to the monitor. The heartbeat contains a timestamp, node name, job status (if the node is running a job) and other optional information. If the monitor does not receive an expected heartbeat from a node, it puts that node in a SUSPECTED state and sends an "are-you-alive" message to the node. If the node responds, the monitor puts the node back in the ALIVE state and continues as normal; if it does not, the monitor puts the node in a FAILED state. This message helps the monitor detect the failure and recover from it. When the node sends a heartbeat to the monitor, the latter also compares the current state of the job with the last checkpointed state; if it is the same, the monitor assumes that the job has failed. If such a failure occurs, the monitor informs the resource broker involved, which takes one of the following actions:

Migrate the job in its current state to another suitable resource on the list as found by the resource broker in the resource discovery stage, in order to restart execution of the job from its last state.

If only failed resources appear in the list, or if all the listed resources are currently busy, the resource broker will try to find a resource in another participating grid that can complete the job. If no such resource is found, it sends a message to the user to say that the grid cannot execute the job because of failure.
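The heartbeat-based detection described in this section amounts to a small state machine per node. The following sketch shows one possible reading of it. The `Monitor` class, the `interval` threshold and the `probe` callback (standing in for the "are-you-alive" message) are assumptions made for illustration, and the comparison of job state against the last checkpoint is omitted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Optional


class NodeState(Enum):
    ALIVE = "ALIVE"
    SUSPECTED = "SUSPECTED"
    FAILED = "FAILED"


@dataclass
class Heartbeat:
    timestamp: float
    node_name: str
    job_status: Optional[str] = None   # present only while the node runs a job


class Monitor:
    def __init__(self, interval: float, probe: Callable[[str], bool]):
        self.interval = interval       # assumed maximum gap between heartbeats
        self.probe = probe             # sends "are-you-alive"; True if the node replies
        self.last_seen: Dict[str, float] = {}
        self.state: Dict[str, NodeState] = {}

    def receive(self, hb: Heartbeat) -> None:
        """Record a heartbeat and mark the sender as ALIVE."""
        self.last_seen[hb.node_name] = hb.timestamp
        self.state[hb.node_name] = NodeState.ALIVE

    def check(self, now: float) -> None:
        """Suspect nodes whose heartbeat is late, then probe them."""
        for node, seen in list(self.last_seen.items()):
            if self.state[node] is NodeState.FAILED:
                continue
            if now - seen > self.interval:
                self.state[node] = NodeState.SUSPECTED
                if self.probe(node):
                    self.state[node] = NodeState.ALIVE
                    self.last_seen[node] = now
                else:
                    self.state[node] = NodeState.FAILED
```

A node that ends up in the FAILED state would then be reported to the resource broker, which chooses between the two actions listed above.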

In order to detect failure outside the grid, when the original grid needs a resource from another grid (i.e. external evolution), it sends a request message to all participating grids, to which they must all reply. If a particular grid does not do so, the original grid puts it in a SUSPECTED state and sends an are-you-alive message to it. If a reply is received, the original grid returns it to the ALIVE state; if not, the grid is put in a STOPPED state to prevent it from being used. If, however, any other grid is running a job for the original grid (i.e. Just in Time evolution), the model assumes that all these grids have a failure recovery capability and will therefore periodically send a message to the original grid to inform it of the status of its job. If the original grid does not receive this message, it assumes the job has failed. It therefore finds another resource that can execute this job from its last checkpointed state.

4.3.5 Migration Model

Migration is the capability of physical or virtual computational resources (software code, portable notebook PCs, running objects, mobile agents and data) to move from one location to another across local or global networks. This very broad concept is central to distributed computing. It can be subdivided into personal, computer and computational migration. Computational migration addresses the movement of software [29, 83], with which the present work is particularly concerned. It can be further classified as control, data, link or object migration [17].

Control migration supports moving a control thread from one machine to another and back again. Data migration allows the data required by the process to be passed through the network. Link migration, the fundamental concept of distributed objects, transmits the end point of a link. Object migration is the ability to move objects (code) between different servers; it is also a basic concept in distributed computing.

Code migration (mobile computation) has come into popular use because it provides the capacity to link software components at runtime. This means that a software component can move between different servers across the network in order to execute tasks. From the point of view of the execution state, code migration can be weak or strong [18]. Migration is a central concept of grid computing, and this model uses both weak and strong migration. Weak migration allows the code to move across nodes, sometimes with initialization data attached, but never with the execution state (in other words, the state of the computation is lost at the originating node). This happens when the user submits the job to a grid (i.e. to a resource broker), and also when the resource broker submits the job to grid nodes or to any adjoining grids. Strong migration is the capability of a computational environment to migrate both the code and the execution state (the context of execution) so that execution restarts at a new resource. The execution state includes running code, program counters, saved processor registers, return addresses and local variables. This happens between grid nodes and between the cooperating grids in cases of partial failure, to evolve the grid, or when the resources are not available at the current node for any reason. In the model, strong migration can occur when the application software and/or data is migrated from one node to another inside or outside the grid. This migration depends on the resource strategies that apply during both external and internal evolution.
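The difference between weak and strong migration comes down to what travels with the code. The sketch below illustrates that distinction only. The `MigrationPackage` type is hypothetical, and the "execution state" is reduced to a single counter, whereas real strong migration would capture program counters, registers and local variables.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class MigrationPackage:
    code_name: str                                        # identifies the code to run
    init_data: Dict[str, Any] = field(default_factory=dict)
    execution_state: Optional[Dict[str, Any]] = None      # None means weak migration

    @property
    def kind(self) -> str:
        return "weak" if self.execution_state is None else "strong"


def start_or_resume(package: MigrationPackage) -> int:
    """Count up to init_data['limit'] and return the number of steps executed.

    A weak package starts from zero; a strong package resumes from the
    counter saved in its execution state.
    """
    if package.execution_state is None:
        counter = 0
    else:
        counter = package.execution_state["counter"]
    steps = 0
    while counter < package.init_data["limit"]:
        counter += 1
        steps += 1
    return steps
```

With a limit of 5, a weak package performs all five steps at the destination, while a strong package carrying a counter of 3 performs only the remaining two.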

4.4 Strategies

Strategies or policies are sets of rules and principles stated by one or more owners or administrators of grids or resources that stipulate how those grids or resources can be accessed and used. They determine how a job should be completed, how the resources should be used, how security is implemented in the grid and how the grid manages and protects its resources. Policies in the design can apply to both resources and grids.

Grid policies

These are formulated by grid administrators and are enforced by resource brokers. They determine how the grids are accessed and used, and how they manage their resources. They also define how grid resources are selected; for example, a policy may choose the lowest-load resource from the list of suitable resources to execute the job.
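The lowest-load selection rule mentioned above can be sketched as follows. The `Resource` record and its `load`, `busy` and `failed` fields are assumptions made for illustration; the model does not prescribe how load is measured.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resource:
    name: str
    load: float            # assumed load figure, lower is better
    busy: bool = False
    failed: bool = False


def select_resource(suitable: List[Resource]) -> Optional[Resource]:
    """Pick the lowest-load resource that is neither busy nor failed.

    Returns None when no candidate remains, in which case the broker
    could look for a resource in another participating grid.
    """
    candidates = [r for r in suitable if not r.busy and not r.failed]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.load)
```

Other grid policies could be expressed by swapping the `key` function, for example to prefer the fastest processor instead of the lowest load.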

Resource policies

Resource owners have the right to determine their resources' governance policies and define how their resources can be used. These strategies apply to nodes, data and application software. Resource policies are stored in the information service and are used by resource brokers. Resource strategies can be static or dynamic. A static policy is fixed: it does not allow data and/or application software to migrate to other nodes in the grid environment. In other words, the node's data is read-only and the node's application software is use-only. By contrast, a dynamic policy gives resource owners more options: they can decide the policy for each node component (data and application software) separately. This strategy may allow data and application software to migrate from node to node in grid environments.
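One way to picture the static and dynamic resource policies is as a small record consulted before any migration. This is a sketch under assumptions: the `ResourcePolicy` type and the component names "data" and "software" are hypothetical labels, not part of the design.

```python
from dataclasses import dataclass
from enum import Enum


class PolicyKind(Enum):
    STATIC = "static"
    DYNAMIC = "dynamic"


@dataclass
class ResourcePolicy:
    kind: PolicyKind
    data_may_migrate: bool = False        # consulted only for dynamic policies
    software_may_migrate: bool = False    # consulted only for dynamic policies

    def allows_migration(self, component: str) -> bool:
        """Return True if the named component ('data' or 'software') may migrate."""
        if component not in ("data", "software"):
            raise ValueError(f"unknown component: {component}")
        if self.kind is PolicyKind.STATIC:
            return False                  # static: data read-only, software use-only
        if component == "data":
            return self.data_may_migrate
        return self.software_may_migrate
```

A dynamic policy with `data_may_migrate=True` and `software_may_migrate=False`, for instance, lets a node's data move to other nodes while keeping its application software in place.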

4.5 Summary

This chapter described the components of the system, which consists of computational model objects, models and strategies. It also set out the behavioral properties of the system's objects and the interactions between them by describing how those objects interact using the models listed. It described how the components communicate asynchronously using the RPC technique. It also explained how the objects terminate using the conventional termination model, and how these objects can migrate using GridFTP.