Computational Grids combine heterogeneous, distributed resources across geographical and organisational boundaries. Grids may be formed to provide computational power for CPU-intensive simulation, high throughput computing for analysing many small tasks, or capacity for data intensive tasks such as those required by the LHC Experiments. For all of these situations the challenges are the same: how to enable dynamic access to these resources as securely, reliably and efficiently as possible, without central control and omniscience.
This chapter discusses the main concepts and components that combine to make computational Grids possible.
There are four main characteristics that distinguish a Grid from other common distributed systems [REF]:
Heterogeneous: The resources of a Grid may be provided by multiple organisations, which are geographically distributed and apply their own local, autonomous resource utilisation policies. These resources are heterogeneous in both the services they offer and the way in which they are used.
Extendable: A Grid is able to grow from a small set of resources into a huge global infrastructure. The service capacity of a Grid is scalable to cope with the resource demands of various applications.
Dynamic: Since the resources and services are contributed by multiple organisations, each with its own local resource management and utilisation policy, their availability changes constantly, subject to the autonomy of the resource contributors and their utilisation behaviour.
High Communication Latency: Since a Grid may be constructed and distributed across a wide geographical distance, the communication latencies involved are likely to be higher than those of a localised system.
A Grid, according to Ian Foster [REF]:
- Coordinates resources that are not subject to centralised control
- ...using standard, open, general purpose protocols and interfaces
- ...to deliver non-trivial Qualities of Service (QoS).
These three points can be used as a starting point for our discussion. Distributed computing is a well known problem with many potential solutions. CORBA, RMI and more recently Web Services have been developed to cope with the failures and the lack of information and control that are inherent in a distributed system. But these systems are no longer enough to satisfy the needs of the scientific and, increasingly, the business community. By combining distributed resources into a single virtual resource, users are able to access far more computing power at lower cost and higher efficiency. The real cost is in the increased complexity of the system. Resource providers and consumers are dynamic, and both information about the system and control of it are incomplete. Hardware and network failures, power cuts and human error together with different architectures, operating systems and middleware must all be handled.
Without centralised control, networks of trust must be established. Collaborations create Virtual Organisations (VOs), which span traditional organisations and can be formed dynamically. Users and resources can then be authorised based on their membership of a particular VO.
All of the above is only possible through the adoption of standard, open protocols and interfaces. The wide range of hardware and software available on a Grid means that the only hope for interoperability is that an application written for one middleware platform can speak the same language as another. The adoption of a single security infrastructure, based upon the Public Key Infrastructure (PKI), is a good example of this. When every interface supports the same authentication method, users are able to use resources based solely on the ownership of their credentials (and potentially membership of a VO).
The delivery of non-trivial Quality of Service (QoS) provides the motivation to overcome all of these hurdles. As network speeds have increased, it has become feasible to harness massive amounts of computing power across multiple domains utilising resources that might otherwise be idle.
The Grid is typically composed of layers with higher layers making use of the functionality provided by lower layers. This is also referred to as the "hourglass model"[REF], where the neck defines a limited number of key protocols, which can be used by a large number of applications, to access a large number of resources. The key layers that are required in a typical Grid are shown in Figure 3.1 and are discussed in the following sections.
Computational resources, high performance networks, storage devices and scientific instruments all combine to form the underlying fabric of a Grid. The fabric layer provides the resource specific implementations of operations that will be required by the resource layer.
Computational resources take the form of the CPUs upon which the work is performed. Typically existing clusters of centralised, homogeneous computers are attached to the Grid, so that any user on the Grid can use the resources as if they were local to the site. Differences in processor architecture (32 or 64 bit) and operating system need to be accounted for and hidden from the user. A global Workload Management System (WMS) is typically used to communicate with site-local Computing Elements (CE). These CEs submit jobs to the Worker Nodes (WN) that form the cluster via the existing Local Resource Management System (LRMS), such as LSF, PBS or Sun Grid Engine.
The emergence of high-speed, optical networking in particular can be seen as one of the key driving forces without which these distributed, data intensive activities would be impossible. It is now common to find 1 Gb/s links, with 10 Gb/s links becoming increasingly available between the larger centres. These clusters will typically have large Mass Storage Systems for secure, reliable storage of large amounts of data. The underlying technology of this Storage Element (SE) may again be different from site to site and these differences need to be accounted for.
Finally, the instruments from which raw data is obtained must be attached to the Grid. This may be a telescope, a microscope or, for the case of the LHC experiments, a particle detector. These detectors will produce upwards of a petabyte of data per year which must be processed, analysed and stored on the Grid so that any member of the collaboration at any point around the world can obtain access to the data.
The connectivity layer glues the Grid fabric resources together by providing the core communication (e.g. transport, routing and naming) and security (e.g. authentication) protocols that support the exchange of information between Grid resources in the fabric layer. The protocols defined in the connectivity layer make communication among Grid resources easy and secure.
In order to support transparent access to resources, single sign-on is required. Without this, users would be required to authenticate before using each resource that is required in the workflow. Considering that this could encompass hundreds of resources across distinct administrative domains, it is clearly unacceptable that users should have to obtain and access a local account to use the facilities.
The resource layer defines the information protocols for querying the state of a Grid resource, and the management protocols for negotiating access to a shared resource. The protocols in the resource layer are concerned only with the sharing of an individual resource.
The resource layer is concerned with the provision of management and information protocols for individual resources. This forms the 'neck' of the hourglass: a narrow range of protocols which hides the heterogeneity beneath from the rich applications above. Secure connections are established through the connectivity layer to the resources in the fabric layer. No knowledge of any global state is required.
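The role of the resource layer as the neck of the hourglass can be sketched as a small common interface with fabric-specific implementations behind it. This is only an illustration: the class names, method names and the two simulated fabrics below are hypothetical and do not correspond to any particular middleware.

```python
from abc import ABC, abstractmethod


class ComputeResource(ABC):
    """Hypothetical 'neck' interface: the only operations higher layers see."""

    @abstractmethod
    def query_state(self):
        """Information protocol: describe the state of this one resource."""

    @abstractmethod
    def submit(self, executable):
        """Management protocol: start a job and return its identifier."""


class BatchCluster(ComputeResource):
    """Fabric-specific implementation for a simulated batch cluster."""

    def __init__(self, name, free_slots):
        self.name = name
        self.free_slots = free_slots
        self._submitted = 0

    def query_state(self):
        return {"name": self.name, "free_slots": self.free_slots}

    def submit(self, executable):
        if self.free_slots == 0:
            raise RuntimeError(f"{self.name} has no free slots")
        self.free_slots -= 1
        self._submitted += 1
        return f"{self.name}-{self._submitted}"


class DesktopPool(ComputeResource):
    """A different fabric: one job per idle desktop machine."""

    def __init__(self, name, idle_hosts):
        self.name = name
        self.idle_hosts = list(idle_hosts)

    def query_state(self):
        return {"name": self.name, "free_slots": len(self.idle_hosts)}

    def submit(self, executable):
        if not self.idle_hosts:
            raise RuntimeError(f"{self.name} has no idle hosts")
        return f"{self.name}-{self.idle_hosts.pop()}"


def most_free(resources):
    """Higher-layer helper that relies only on the common interface."""
    return max(resources, key=lambda r: r.query_state()["free_slots"])
```

The helper `most_free` never inspects which kind of fabric it is dealing with, and each resource answers only for itself, so no global state is needed at this level.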
The collective layer provides services that combine all of the resources represented by the resource layer into a single global image. Services providing accounting, information, monitoring, security and scheduling would operate at this level. Instead of submitting jobs to a single batch system this layer can orchestrate the execution of jobs across multiple systems. Monitoring and diagnostic information is available, to provide information about the state of the Grid as a whole. Security and policies can be applied at the community level, so that the VO managers can control who can access resources.
The application layer is the one that users of the Grid should interact with. Developers can use the services offered at the lower levels to compose applications that can take advantage of the resources within the Grid.
The applications are able to utilise the implementations (e.g. APIs provided by a Grid middleware) of the protocols defined within each lower layer.
Standards are essential to ensure the interoperability and reuse of components in such a large and complex system. The original body overseeing standards in Grid computing, the Global Grid Forum (GGF), merged with the Enterprise Grid Alliance to form the Open Grid Forum. These bodies are modelled after (and inherit from) existing standards bodies that are involved in Web standardisation such as the Internet Engineering Task Force (IETF) and the Organisation for the Advancement of Structured Information Standards (OASIS). The World Wide Web Consortium (W3C) should also be mentioned as the body behind the standardisation of HTTP, SOAP and XML technologies, on which much of Grid Computing relies. We will discuss two of the most relevant standards here.
The first standard to be proposed was the Open Grid Services Architecture[73, 74] (OGSA) in 2002. OGSA defines standard protocols and interfaces to manage resources as part of a Service Orientated Architecture (SOA). The aim is to promote interoperability and enable reuse and composition of services, by providing low level functionality that is common to many services.
OGSA extends the existing Web Services framework to provide functionality, such as discovery, creation, destruction and notification, which is required in a 'Grid Service'. Web Services are typically persistent and stateless, something that may not be appropriate for a Grid Service. For example, imagine a service that reports on the status of a job. The service needs some concept of a state and the user doesn't want every other user to have access to their results. For this reason Grid Services can be dynamic, transient and stateful.
While OGSA defines the general architecture of a service based Grid, the Open Grid Services Infrastructure (OGSI) describes the 'plumbing' that would make this architecture possible. In 2004 OGSI was superseded by the Web Services Resource Framework (WSRF) which addressed many of the issues in OGSI.
The Web Services Resource Framework (WSRF) has been designed to address the shortcomings of Web Services and the criticisms of OGSI: that it is too large and too different from traditional Web Services. The WSRF retains most of the functionality of OGSI, but it is repackaged and re-factored into a set of six complementary standards more in line with existing Web Service standards. In OGSI stateful service instances have service data, whereas in WSRF stateless services act upon stateful resources with certain properties. A WS-Resource is a named, typed element of service data, which is related to a specific Web Service. The WS-ResourceProperties specification defines methods for querying and updating these resources with the WS-ResourceLifetime specification detailing how the persistence of the service can be controlled. The remaining standards, WS-BaseFaults, WS-ServiceGroup, WS-BaseNotification and WS-BrokeredNotification, provide similar functionality to their OGSI counterparts.
The draft standard was proposed by the Globus Alliance, IBM and HP in January 2004 and was standardised by OASIS. Globus Toolkit 4 is a WSRF compliant implementation along with WSRF.NET, WSRF::Lite and WebSphere.
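The split between a stateless service and the stateful resources it acts upon can be illustrated with a minimal sketch. The class below is hypothetical and is not the API of any WSRF implementation: the dictionary stands in for a persistent resource store, and the method names only loosely mirror the property and lifetime operations described above.

```python
import uuid


class JobStatusService:
    """Stateless front end: every call names the resource it acts on.

    All state lives in the shared store, not in the service object,
    so any instance of the service can answer any request.
    """

    def __init__(self, store):
        self._store = store  # stands in for a persistent resource store

    def create(self, owner):
        """Create a new stateful resource and return its key."""
        key = str(uuid.uuid4())
        self._store[key] = {"owner": owner, "status": "SUBMITTED"}
        return key

    def get_property(self, key, name):
        """Query one property of the named resource."""
        return self._store[key][name]

    def set_property(self, key, name, value):
        """Update one property of the named resource."""
        self._store[key][name] = value

    def destroy(self, key):
        """End the lifetime of the named resource."""
        del self._store[key]


store = {}
front_end_a = JobStatusService(store)
front_end_b = JobStatusService(store)
job_key = front_end_a.create("alice")
front_end_a.set_property(job_key, "status", "RUNNING")
```

Because the service objects hold no job state themselves, `front_end_b` can report on a job that was created through `front_end_a`, as long as the caller presents the resource key.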
In order to make the most of the resources provided in the Grid Fabric, a set of low level services, which perform commonly used operations, are required. This increases the security, performance and reliability of the Grid applications, while reducing the complexity for the developer. The four main areas of Grid Services are outlined in the following section, before a more detailed discussion of the middleware provided by different organisations.
Resources are the fundamental components of any Grid. Whether they are CPUs, storage, network or some form of scientific instrument, they need to be accessible across the Grid according to some policy. Resource management provides the applications and interfaces required to access and control these heterogeneous resources in a consistent manner.
The most common requirement is the management of computing resources. The WMS accepts jobs from the user and allocates them to resources. A Resource Broker (RB) will typically be used to match the user's requirements to an advertised resource. The resources themselves are typically accessed through some form of CE which provides a bridge between the global WMS and the local resources. It will accept and execute jobs on its local infrastructure and report the current status of the jobs to some form of logging system. The CPU nodes that the CE represents are referred to as Worker Nodes (WN), which may form a cluster in their own right or a looser distribution of workstations.
At every stage the global and local components will assess the credentials of the owner of the job, to ensure that they are eligible for access and to determine what, if any, priority they should be given.
Grids are dynamic systems where data, resources and users are transient. Users and services lack knowledge of the status and availability of other services, so methods of discovery and monitoring are required. Information such as the system load and location of data can be used by the WMS to allocate jobs. The provision of accurate and timely monitoring information makes identifying and diagnosing problems across the system possible.
All Grids regardless of their purpose require mechanisms for the discovery, storage and transfer of data. What differs between Grids is the scale of the data that must be managed and hence the performance that is required from the data management middleware.
Users of the Grid will not be aware of the physical location of their data. File lookup services must be available to users and applications to provide the physical file name based on some logical file name or some meta-data. Data confidentiality, integrity and accessibility must be maintained at all times by maintaining replicas across the Grid and using reliable transfer and storage mechanisms. Data is typically represented as files. Access to the data contained within the files is outside of the scope of the middleware and is the responsibility of domain specific applications.
Multiple replicas of a file may exist within the Grid for redundancy and/or efficiency reasons. To keep track of the files and their replicas a File Catalogue is required. Each file that is registered with the catalogue has a Globally Unique ID (GUID); a hexadecimal number which refers to one or more Storage URLs (SURL), which give the location of the file and its replicas. However, a GUID is not a user friendly way of referring to files. The File Catalogue can be used to assign a Logical Filename (LFN) to each GUID to aid comprehension. Metadata, that is 'data about data', can also be added so that files can be selected based on the selection of some attributes.
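The relationship between LFNs, GUIDs, SURLs and metadata can be sketched as a handful of dictionaries. This is a toy in-memory model, not the interface of any real catalogue; the class, its methods, the file names and the `example.org` storage hosts are all made up for illustration.

```python
import uuid


class FileCatalogue:
    """Toy catalogue: LFN -> GUID -> list of SURLs, plus metadata per GUID."""

    def __init__(self):
        self._lfn_to_guid = {}
        self._guid_to_surls = {}
        self._metadata = {}

    def register(self, lfn, surl, **metadata):
        """Register a new file and return its GUID (a hexadecimal string)."""
        guid = uuid.uuid4().hex
        self._lfn_to_guid[lfn] = guid
        self._guid_to_surls[guid] = [surl]
        self._metadata[guid] = dict(metadata)
        return guid

    def add_replica(self, guid, surl):
        """Record a further physical copy of an already registered file."""
        self._guid_to_surls[guid].append(surl)

    def replicas(self, lfn):
        """Resolve a logical name to the locations of all replicas."""
        return list(self._guid_to_surls[self._lfn_to_guid[lfn]])

    def select(self, **criteria):
        """Return the LFNs whose metadata match every given attribute."""
        return sorted(
            lfn
            for lfn, guid in self._lfn_to_guid.items()
            if all(self._metadata[guid].get(k) == v for k, v in criteria.items())
        )


catalogue = FileCatalogue()
guid = catalogue.register(
    "/grid/demo/run1.dat", "srm://se1.example.org/demo/run1.dat", run=1
)
catalogue.add_replica(guid, "srm://se2.example.org/demo/run1.dat")
catalogue.register(
    "/grid/demo/run2.dat", "srm://se1.example.org/demo/run2.dat", run=2
)
```

A user only ever handles the logical name or a metadata query; the GUID and the choice between replicas stay inside the catalogue and the services that use it.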
These files are stored on a SE, which provides a logical abstraction of the underlying storage mechanism. A SE may be disk or tape based with a disk frontend. The Storage Resource Manager (SRM) protocol ensures that there is a standard method of interacting with the data storage. Files are physically transferred to or from a SE using the GridFTP protocol. This GGF standard defines extensions to the FTP protocol to enable high performance, secure and robust data transfers across high bandwidth distributed networks. Higher level services may be implemented, to automate the transfer of files and the interaction with the catalogues.
Security in a Grid is paramount. By its very nature it exposes valuable resources and data across zones of administrative control. Authentication and authorisation of users and services, and the integrity and confidentiality of the data they use, are essential. This is complicated by the requirements of the individual sites that compose the Grid, which want to restrict access to potentially valuable and sensitive resources, and by the requirement that the Grid should support single sign-on. Users should not have to obtain an account on every machine they need to use (in many cases they do not know which machines they are using). The solution chosen is based upon PKI and the concept of VOs.
PKI allows users and services to authenticate one another if they both trust a third party. Users have two keys: a public key, which is used to encrypt messages, and a private key, which is used for decryption and is protected by a pass phrase. For RSA keys, security is based on the difficulty of factorising the product of the two large prime numbers from which the key pair is derived, which is what an attacker would need to do to obtain the private key from the public key. The public key is presented to a service in the form of a digital certificate, which is signed by a mutually trusted third party, the Certificate Authority (CA). If the service trusts the CA, then it can trust that the person or service that presented the certificate is who they say they are. Authentication, confidentiality and integrity are guaranteed without any exchange of the sensitive private key.
However, this does not completely solve the problem. Jobs may have a long run time and need to re-authenticate, or the user may need to delegate responsibility for some action to another service. As the user does not want to expose their private key, another method must be used for authentication. A Grid Proxy is another digital certificate, with a new public and private key (the private key is stored on the filesystem so that only the user can read it), signed by the user's original private key. These proxies are typically short lived to reduce the risk of exposure, given the lower level of protection of the proxy's private key. These proxy certificates provide a chain of trust back to the original owner and to the issuing CA. Proxy renewal and delegation are allowed.
Now that the user has been authenticated, they must be authorised to use the resources. Providing access rights on a case by case basis across the Grid would create a huge, unmaintainable burden on site administrators. Instead users apply for membership of a Virtual Organisation: a group of users, organisations and resources that share a common aim. Membership of the VO entitles the user to use the resources at the VO's disposal and site administrators are free to allocate resources to a single entity.
Delegation allows users to transfer their credentials (or a subset of their credentials) to another service, which will operate on behalf of that user. This ensures that services can be granted the minimum privileges for their task and that every delegated credential is independent. The delegation process consists of several stages, see Figure 3.3. First, the client that owns the certificate and the server that requires a delegated copy create a secure connection. The connection need not be encrypted, as no secrets are passed, but it must ensure integrity. The server then creates a new public and private key pair and inserts the public key into a certificate request, which is returned to the client. The client uses the proxy's private key to sign the certificate request and the complete certificate is returned to the server, where it is stored with the new private key.
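The roles of the two keys can be made concrete with a toy example, assuming RSA keys, where the public modulus is the product of two primes. The primes below are deliberately tiny, so the key can be broken by trial division; real keys use primes hundreds of digits long and padded messages. All function names are illustrative, and `pow(e, -1, phi)` needs Python 3.8 or later.

```python
# Toy RSA with tiny primes: insecure by design, for illustration only.
p, q = 61, 53
n = p * q                  # public modulus
phi = (p - 1) * (q - 1)
e = 17                     # public exponent, coprime with phi
d = pow(e, -1, phi)        # private exponent (modular inverse of e)

public_key = (e, n)
private_key = (d, n)


def encrypt(message, key):
    """Anyone holding the public key can encrypt a small integer message."""
    exponent, modulus = key
    return pow(message, exponent, modulus)


def decrypt(ciphertext, key):
    """Only the holder of the private key can decrypt."""
    exponent, modulus = key
    return pow(ciphertext, exponent, modulus)


def sign(digest, key):
    """Signing applies the private key to a (small integer) digest."""
    exponent, modulus = key
    return pow(digest, exponent, modulus)


def verify(digest, signature, key):
    """Verification applies the public key and compares with the digest."""
    exponent, modulus = key
    return pow(signature, exponent, modulus) == digest


def recover_private_exponent(key):
    """Break the toy key by factorising the modulus with trial division."""
    exponent, modulus = key
    factor = next(f for f in range(2, modulus) if modulus % f == 0)
    other = modulus // factor
    return pow(exponent, -1, (factor - 1) * (other - 1))


ciphertext = encrypt(65, public_key)
signature = sign(123, private_key)
```

The last function shows why the size of the primes matters: once the modulus can be factorised, the private exponent follows directly from the public key. Signing, the operation a CA performs on a certificate, is the same arithmetic with the roles of the keys reversed.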
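The message flow of delegation can be sketched as follows. The sketch models only who creates and who sees which item, not the cryptography: key pairs are random tokens and "signing" is reduced to recording the issuer. It assumes that the new key pair is generated on the server side, which is what allows the delegated certificate to be stored alongside the new private key without that key ever being transmitted. All class and function names, the subject strings and the lifetime field are hypothetical.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Certificate:
    subject: str
    public_key: str
    issuer: str           # subject of the certificate that vouches for this one
    lifetime_hours: int


def new_key_pair():
    """Stand-in for real key generation: returns (public, private) tokens."""
    return secrets.token_hex(8), secrets.token_hex(8)


class Server:
    """The service that needs a delegated credential."""

    def __init__(self):
        self.private_key = None
        self.delegated = None

    def make_request(self):
        public, self.private_key = new_key_pair()
        return {"public_key": public}   # the private key stays on the server

    def store(self, certificate):
        self.delegated = certificate


class Client:
    """The holder of the existing proxy."""

    def __init__(self, proxy, proxy_private_key):
        self.proxy = proxy
        self._proxy_private_key = proxy_private_key  # never sent anywhere

    def sign_request(self, request, lifetime_hours):
        # A real client would sign with self._proxy_private_key; here the
        # issuer field alone records which credential vouches for the new key.
        return Certificate(
            subject=self.proxy.subject + "/CN=proxy",
            public_key=request["public_key"],
            issuer=self.proxy.subject,
            lifetime_hours=min(lifetime_hours, self.proxy.lifetime_hours),
        )


def chain_to_root(certificate, known):
    """Follow issuer links until a self-issued (CA) certificate is reached."""
    chain = [certificate.subject]
    while certificate.issuer != certificate.subject:
        certificate = known[certificate.issuer]
        chain.append(certificate.subject)
    return chain
```

Capping the lifetime of the delegated certificate at that of the proxy that issued it is a design choice of this sketch; it reflects the idea that a delegated credential should never carry more than the credential it was derived from.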
The previous sections discussed the features that are required from any Grid. The following section discusses some of the most common implementations of these features.
The Globus project was formed in the late 1990s from the experience and software that was gained from the I-WAY project in the United States. It is now one of the most well known providers of open-source Grid software. The Globus Toolkit (GT) has produced many of the fundamental standards and implementations that underlie many of today's Grids. It is not intended to provide a complete implementation of a Grid, but rather to provide components which can be integrated as required.
Version 2 of the toolkit released in 2002 provides 'non-WS' C implementations of features such as GridFTP, which still form the basis of many Grids today. Version 3 was the first to introduce an OGSA-compliant Service Orientated Architecture, which was completed when GT4, the WSRF compliant version, was released in 2005. Service implementations are provided respecting the relevant standards where possible. Containers are provided for Java, Python and C which implement many of the standard requirements such as security, discovery and management within which other services can be developed.
Workload management is performed by the Globus Resource Allocation and Management (GRAM) component. GRAM defines a protocol for the submission and management of jobs on remote computational resources. GRAM interacts with the LRMS (either LSF, PBS or Condor) which then executes the task. Data can be staged in and out from the WN. It is important to note that the GT does not provide any brokering functionality, where the most appropriate computational resource is chosen according to some requirements.
Data is transferred using an implementation of the GridFTP protocol called GridFTP. The Replica Location Service (RLS) provides a File Catalogue which may be used in conjunction with the Reliable File Transfer (RFT) to manage third party GridFTP transfers and the interactions with the catalogues.
Monitoring and discovery are provided by the Index, Trigger and WebMDS services. The Index service collects information which is published by other services into a single location. The Trigger service can then be used to perform a defined action when some criterion in the index service is met. WebMDS provides a web based interface to information which is collected from either the index service or another service.
The Globus Security Infrastructure (GSI) is perhaps the most widely used component of the Globus Toolkit. It provides tools for the authorisation and authentication of users using a PKI. Rather than submit their valuable private key, users create a short lived proxy which is then used to authenticate with resources. When jobs arrive at a certain site GSI can map the user onto a local credential appropriate for that site. The MyProxy credential store is also implemented. This provides a secure location to store long lived credentials which can then be retrieved by authorised services. This is required as users' proxies often have a shorter lifetime than the jobs that they submit.
Condor is a distributed batch computing system. Unlike other batch systems, such as LSF or PBS, Condor's main focus is on high-throughput, opportunistic computing. Whereas in high performance computing the goal is to maximise the amount of work which can be performed per second, high throughput computing attempts to maximise the amount of work that can be performed over a longer time period. To enable this, all of the resources of an organisation must be used as effectively as possible. Instead of just using large dedicated clusters, Condor makes it possible to scavenge idle computing resources from all of the resources in an organisation from large clusters to individual desktops.
Failures are handled transparently and jobs can be migrated from one machine to another if, for example, the user begins to use their desktop again or the machine crashes.
Condor consists of three main components: agents, resources and matchmakers. Users submit jobs to agents, which find resources suitable for the jobs via a matchmaker. A machine may simultaneously run an agent and a resource server, so that it can both submit and accept jobs. Jobs and resources can specify requirements using the ClassAd syntax. Using these 'Classified Advertisements' jobs can specify the attributes they wish their execution resource to have (memory, architecture, etc.) while resources can specify their configuration and the type of jobs they are willing to accept. When the agent accepts a job from the user, it publishes the requirements of the job to the matchmaker. The matchmaker then finds all of the resources where the requirements match and ranks them according to some criterion, processor speed for example. The agent and the resource are informed of the match and further verification and notification may take place until both parties are happy.
Once a job has been matched to a resource, a shadow daemon is created on the submit machine to provide the input files, environment and executable required to complete the job to the sandbox daemon on the resource. The sandbox recreates the user's environment, then executes the job and monitors its execution. It also protects the resource from malicious code by executing the job as a user with limited permissions within the resource. If the executable is linked with the Condor libraries, it can use the shadow to read and write files directly from the submission machine and to create checkpoints. With a checkpoint the entire state of the program is saved, so that in the event of the resource becoming unavailable for whatever reason, the job can be migrated to another resource and restarted from the checkpoint.
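The two-sided nature of matchmaking can be sketched in a few lines. This is not ClassAd syntax and not Condor's implementation: advertisements are modelled as plain dictionaries, requirements and rank as Python functions, and every attribute name and value below is invented for the example.

```python
def matches(job, machine):
    """Both sides must accept each other, as in two-way matchmaking."""
    return job["requirements"](machine) and machine["requirements"](job)


def matchmake(job, machines):
    """Return the acceptable machines, best rank first."""
    candidates = [m for m in machines if matches(job, m)]
    return sorted(candidates, key=job["rank"], reverse=True)


job = {
    "owner": "alice",
    # What the job demands of its execution resource.
    "requirements": lambda m: m["arch"] == "x86_64" and m["memory_mb"] >= 512,
    # How the job orders the acceptable resources (faster is better).
    "rank": lambda m: m["mips"],
}

machines = [
    {"name": "node1", "arch": "x86_64", "memory_mb": 2048, "mips": 3000,
     "requirements": lambda j: True},
    # A desktop whose owner refuses jobs from one particular user.
    {"name": "desk7", "arch": "x86_64", "memory_mb": 1024, "mips": 4000,
     "requirements": lambda j: j["owner"] != "alice"},
    # Fast, but the wrong architecture for this job.
    {"name": "old3", "arch": "i686", "memory_mb": 4096, "mips": 5000,
     "requirements": lambda j: True},
    {"name": "node2", "arch": "x86_64", "memory_mb": 1024, "mips": 3500,
     "requirements": lambda j: True},
]

ranked = [m["name"] for m in matchmake(job, machines)]
```

Here the fastest machine is rejected by the job and the second fastest rejects the job, which leaves `node2` ahead of `node1`. In the real system the match is only a recommendation: the agent and the resource still have to confirm it between themselves.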
Every community of agents and resources that is served by a matchmaker is referred to as a pool. Each pool will typically be administered by a separate department or institution. However, users are not limited to a single matchmaker. The flocking process allows agents to interact with multiple matchmakers across organisational boundaries provided that they have permission. The user can then utilise resources from multiple pools to complete their tasks.
Condor-G is the combination of Condor and Globus, see Figure 3.6. Condor is used for local job management, while Globus is used to perform secure inter-domain communication. Condor-G communicates with a remote GRAM server which can then communicate with the LRMS. This could be another Condor pool, in which case the process is referred to as Condor-C. Even if the batch system is not Condor, the process of gliding in can be used to create a Condor pool. The first job that is submitted to the batch system starts the Condor servers which can then become part of the pool of the original user.
Figure 3.7 shows Condor-C, which allows for the transfer of jobs in one agent's queue to another agent's queue. A single shadow server is started to monitor the jobs while the delegated agent performs the actual job submission. The agent that submits the jobs to the resource needs to have direct contact with the sandbox. Condor is used within several middleware projects including LCG, which is discussed next.
The LHC Computing Grid (LCG) project was created to deploy and manage the infrastructure necessary to satisfy the LHC experiments' requirements. To achieve this, LCG combines middleware from multiple projects such as Globus, Condor and the European DataGrid (EDG) project. EDG ran from 2001 to 2004 and created the base middleware that is required to operate a distributed computing infrastructure of the required scale, not just for high energy physics but also for biological and earth science applications.
The LCG project is also closely related to its successor, the Enabling Grids for E-sciencE (EGEE) project, and shares many of the same components. As of July 2007 version 2.7.0 of the LCG middleware is still the version that is used in production and will be discussed first. A discussion of the differences with the EGEE middleware follows.
The EDG WMS is the main interface between users and the Grid, see Figure 3.8. It accepts job submission requests from users, matches them to appropriate resources, submits them to that resource and monitors their execution.
Using the Job Description Language (JDL) users configure their job and specify any requirements that they may have. Job parameters such as the executable, arguments, environment and input data along with requirements on the resource, such as minimum memory or maximum run time are specified in a text file. When a job is submitted the WMS client contacts the Network Server and transfers the job description, the executable and any other files that are required for the job to the WMS server.
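To give a feel for such a text file, the following sketch renders a job description from a Python dictionary into a JDL-like form of `Name = value;` lines. The attribute names and the `other.` prefix in the requirement follow common JDL usage of the period but should be treated as assumptions to be checked against the JDL documentation of the middleware in use; the helper names `Expr`, `render` and `to_jdl` and the file names are invented here.

```python
class Expr(str):
    """Marks a string that must be written unquoted (an expression)."""


def render(value):
    """Render one Python value in a JDL-like syntax."""
    if isinstance(value, Expr):
        return str(value)
    if isinstance(value, str):
        return f'"{value}"'
    if isinstance(value, (list, tuple)):
        return "{" + ", ".join(render(v) for v in value) + "}"
    return str(value)


def to_jdl(description):
    """Turn a dict of job attributes into 'Name = value;' lines."""
    return "\n".join(f"{k} = {render(v)};" for k, v in description.items())


job = {
    "Executable": "analyse.sh",
    "Arguments": "--events 1000",
    "StdOutput": "std.out",
    "StdError": "std.err",
    "InputSandbox": ["analyse.sh"],
    "OutputSandbox": ["std.out", "std.err"],
    # Assumed name of the published attribute for memory in MB.
    "Requirements": Expr("other.GlueHostMainMemoryRAMSize >= 512"),
}

jdl_text = to_jdl(job)
print(jdl_text)
```

The job parameters become quoted strings or lists, while the requirement is left as an unquoted expression to be evaluated against the attributes that each resource publishes.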
The user's proxy is also delegated to the WMS, so that it can operate on the user's behalf. The job is then processed by the Workload Manager which orchestrates the matching and submission of the job. The RB performs matchmaking using the Condor Matchmaker which matches the requirements of the job with the published resources. The Information Service is used to obtain information about the load on each CE so that resources can be matched with jobs most effectively. If there is a requirement on input data the RB can use the Data Location Interface (DLI) of a specified catalogue to determine the location of the SE containing that data. Once the job has been matched to a resource, the Job Adaptor makes any alterations to the job that are necessary before submission. The Job Controller performs the actual job submission using Condor-G to communicate with the CE.
The CE provides a bridge between the global Grid and the local resources. The WMS sends jobs to the CE using Condor-G. The Globus Gatekeeper authenticates the user that is submitting the job and translates the job so that it can be understood by the Job Manager. The Local Center Authorisation Service (LCAS) and the Local Credential Mapping Service (LCMAPS) are used by the Gatekeeper to authorise the user and to map them to a local account at the site respectively. A Job Manager specific to the underlying LRMS performs the actual job submission and monitors the execution until completion. The logging and bookkeeping system is updated as the job progresses through the system.
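The two steps performed on behalf of the Gatekeeper, authorisation followed by mapping to a local account, can be sketched as below. This is a simplified model and not the LCAS or LCMAPS interface: the class is hypothetical, authorisation is reduced to a check on the VO name, and the use of a pool of pre-created local accounts handed out on first contact is an assumption of the sketch.

```python
class Gatekeeper:
    """Toy model of authorise-then-map at the entrance to a site."""

    def __init__(self, authorised_vos, pool_accounts):
        self.authorised_vos = set(authorised_vos)
        self.free_accounts = list(pool_accounts)
        self.mapping = {}   # certificate subject -> local account

    def authorise(self, vo):
        """Authorisation step: is this VO allowed to use the site at all?"""
        return vo in self.authorised_vos

    def map_to_local(self, subject):
        """Mapping step: reuse an existing mapping or hand out a free account."""
        if subject not in self.mapping:
            if not self.free_accounts:
                raise RuntimeError("no free local accounts")
            self.mapping[subject] = self.free_accounts.pop(0)
        return self.mapping[subject]

    def accept(self, subject, vo):
        """Run both steps and return the local account the job will run as."""
        if not self.authorise(vo):
            raise PermissionError(f"VO {vo!r} is not authorised at this site")
        return self.map_to_local(subject)


gatekeeper = Gatekeeper({"demo"}, ["demo001", "demo002"])
first = gatekeeper.accept("/CN=Alice", "demo")
second = gatekeeper.accept("/CN=Bob", "demo")
again = gatekeeper.accept("/CN=Alice", "demo")
```

Keeping the mapping stable for a returning user means that the local batch system sees a consistent account, even though the user never applied for one at the site.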
Each site will have at least one SE which is used to store large volumes of data. Each SE provides an SRM interface to the data. By providing a common interface to access data, the details of the implementation or the underlying storage mechanism can be ignored. The data could be stored on tape in a MSS, such as Castor or dCache, or on disks managed by the Disk Pool Manager (DPM). Third party transfers of data from one SE to another are supported.
The Grid File Access Library (GFAL) is used to provide a POSIX-like library for input and output. GFAL can be given LFNs, GUIDs, SURLs or TURLs and can resolve the file by contacting a catalogue if necessary and then contacting the SE.
Monitoring and accounting services gather information from the individual components of the Grid and publish it in a consistent manner.
The information system is a hierarchy of LDAP databases which can be queried to obtain information, see Figure 3.9. At the resource level a Generic Information Provider (GIP) is used to provide the information. Each time the GIP is run it obtains static information about the resource from a file and dynamic information from the resource via the appropriate plugin. This information is used to populate a Generic Resource Information Server (GRIS) for each resource. Each GRIS registers with a site Berkely Database Information Index (BDII) and populates the database with information from its resource. Each site BDII then registers with a central BDII which will then contain complete information about the Grid. For performance reasons the central BDII is actually three BDIIs in a round-robin DNS alias. BDIIs can also cache information for up to twenty minutes, which improves performance but can result in stale information being delivered.
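The trade-off between caching and freshness in such a hierarchy can be illustrated with a small model. It does not use LDAP and is not the BDII implementation: `InfoServer` is a hypothetical class that plays the role of a resource-level server when given a provider function and of an aggregating index when given children, and the injectable clock exists only so that the passage of time can be simulated.

```python
import time


class InfoServer:
    """A node in the hierarchy: a resource-level leaf or an aggregating index."""

    def __init__(self, name, provider=None, children=(), ttl_seconds=0,
                 clock=time.monotonic):
        self.name = name
        self.provider = provider          # callable returning resource info
        self.children = list(children)    # lower-level servers to aggregate
        self.ttl_seconds = ttl_seconds    # 0 means never serve from cache
        self.clock = clock
        self._cache = None
        self._cached_at = None

    def query(self):
        now = self.clock()
        fresh = (self._cache is not None
                 and now - self._cached_at < self.ttl_seconds)
        if fresh:
            return self._cache            # possibly stale, but cheap
        if self.provider is not None:
            entries = {self.name: self.provider()}
        else:
            entries = {}
            for child in self.children:
                entries.update(child.query())
        self._cache, self._cached_at = entries, now
        return entries


# Simulated time and a resource whose load changes between queries.
now = [0.0]
load = {"waiting_jobs": 5}
clock = lambda: now[0]

resource = InfoServer("ce1", provider=lambda: dict(load), clock=clock)
site = InfoServer("site", children=[resource], clock=clock)
top = InfoServer("top", children=[site], ttl_seconds=20 * 60, clock=clock)

first_answer = top.query()
load["waiting_jobs"] = 50
now[0] = 10 * 60
cached_answer = top.query()     # within twenty minutes: served from cache
now[0] = 20 * 60
refreshed_answer = top.query()  # cache expired: the hierarchy is walked again
```

Ten simulated minutes after the load has changed, the top-level server still reports the old value; only once the cache lifetime has passed does a query reach the resource again.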
Another system that can be used for monitoring and information is the Relational Grid Monitoring Architecture (R-GMA). R-GMA implements the GGF GMA specification, which defines three components: producers, consumers and registries, see Figure 3.10. Producers register with the registry, from where consumers can determine which producer can answer their query. R-GMA presents information as if it were contained within a relational database. Clients can connect to consumers and perform SQL queries on the information.
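The interplay of the three GMA components can be sketched in a few lines. The classes and the table name below are illustrative assumptions, not the R-GMA API, and an in-memory SQLite database stands in for each producer's data; the point is only that the consumer asks the registry who can answer, then forwards an SQL query.

```python
import sqlite3


class Producer:
    """Publishes rows for one table; backed here by an in-memory SQLite database."""

    def __init__(self, table, columns, rows):
        self.table = table
        self.db = sqlite3.connect(":memory:")
        self.db.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
        placeholders = ", ".join("?" for _ in columns)
        self.db.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)

    def answer(self, sql):
        return self.db.execute(sql).fetchall()


class Registry:
    """Remembers which producer publishes which table."""

    def __init__(self):
        self._producers = {}

    def register(self, producer):
        self._producers[producer.table] = producer

    def lookup(self, table):
        return self._producers[table]


class Consumer:
    """Answers client queries by locating the right producer via the registry."""

    def __init__(self, registry):
        self.registry = registry

    def query(self, table, sql):
        return self.registry.lookup(table).answer(sql)
```

From the client's point of view the Grid looks like a single relational database, even though each table is served by a different producer.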
The Logging and Bookkeeping system is populated by the WMS and CE as jobs progress through the system. Users can query the status of their jobs via the WMS to obtain the output.
The Virtual Organisation Membership Service (VOMS) allows fine-grained authorisation to be defined. When the user's proxy is created, the VOMS client contacts a central service which assigns additional attributes to the user proxy. The VO administrator can approve membership and add the user to groups, 'Higgs' for example, or assign specific roles such as administrator. Components of the Grid can then check for these attributes and permit or deny users access based on the attributes attached to their proxy.
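An attribute check of this kind might look as follows. The `Proxy` structure and the `is_authorised` function are hypothetical simplifications; real VOMS attributes are carried inside the proxy certificate and are verified cryptographically, which this sketch does not attempt.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Proxy:
    """Simplified view of a user proxy with VOMS attributes attached."""
    subject: str
    vo: str
    groups: frozenset = field(default_factory=frozenset)
    roles: frozenset = field(default_factory=frozenset)


def is_authorised(proxy, vo, required_group=None, required_role=None):
    """Permit access only if the proxy carries the attributes a service requires."""
    if proxy.vo != vo:
        return False
    if required_group is not None and required_group not in proxy.groups:
        return False
    if required_role is not None and required_role not in proxy.roles:
        return False
    return True
```

A storage service could, for instance, allow reads to any member of the VO but restrict writes in a group area to proxies that carry that group.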
The gLite middleware is based upon the experience gained developing and running the EDG middleware and LCG grid, combined with some of the ideas from the AliEn middleware. The gLite middleware is currently being deployed on the EGEE infrastructure and will eventually become the default production middleware for the WLCG. The main difference between gLite and LCG middleware is the increased emphasis on a Service Oriented Architecture.
The gLite WMS uses a similar architecture to the LCG WMS but with different components, see Figure 3.11. The Network Server accepts connections from the UI and passes them on to the Workload Manager. There is also a Web Service interface, WMProxy, to the gLite WMS, with additional functionality including bulk submission. The Workload Manager orchestrates all of the other components to satisfy the job request. The RB uses information from the Information Super Market (ISM) to match jobs with resources that satisfy their requirements. Information can be pushed to the ISM by a CE or pulled from a CE by the ISM. The Workload Manager contains a Task Queue which can be used to hold job requests until suitable resources are found to satisfy the job's requirements. The Job Adaptor creates information that is required by Condor-C during submission and on the WN during execution. Condor-C (see Section 3.5.2 for a description) is used to perform submission to a gLite CE after the job has been processed by the Job Adaptor. DAGMan is used to resolve dependencies between jobs that are submitted as Directed Acyclic Graphs. The log monitor watches the Condor-C log file for changes in the job's status on the WN.
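The matchmaking step and the role of the Task Queue can be sketched as follows. The attribute names (`free_slots`, `memory_mb`, `supported_vos`) and the ranking rule are assumptions made for illustration; the real WMS evaluates requirement and rank expressions written by the user.

```python
from collections import deque


def matches(job, resource):
    """A resource matches if it has a free slot and meets the job's requirements."""
    return (resource["free_slots"] > 0
            and resource["memory_mb"] >= job["min_memory_mb"]
            and job["vo"] in resource["supported_vos"])


def schedule(jobs, ism):
    """Assign each job to the matching resource with most free slots, else queue it.

    `ism` plays the role of the Information Super Market: {ce_name: attributes}.
    Returns (assignments, task_queue).
    """
    assignments, task_queue = {}, deque()
    for job in jobs:
        candidates = [name for name, res in ism.items() if matches(job, res)]
        if candidates:
            best = max(candidates, key=lambda name: ism[name]["free_slots"])
            ism[best]["free_slots"] -= 1      # the slot is now taken
            assignments[job["id"]] = best
        else:
            task_queue.append(job)            # held until resources appear
    return assignments, task_queue
```

Jobs left in the queue would be retried whenever the ISM receives fresh information, whether pushed by a CE or pulled from it.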
Jobs are started on a gLite CE by submitting Condor daemons via the GRAM gatekeeper. Authentication and authorisation are performed by LCAS and LCMAPS at the gatekeeper. These daemons interact with peers on the WMS and are used to submit jobs to the LRMS via the Batch Local ASCII Helper (BLAH) abstraction layer. The gLite CE can operate in push or pull mode.
An alternative architecture is available. The Computing Resource Execution and Management (CREAM) CE is a lightweight Web Service interface to manage jobs on an LRMS.
The major addition to gLite's data management capabilities over LCG is the File Transfer Service (FTS). The FTS uses the concept of a channel as a unidirectional connection between two sites. Transfers performed within a channel are monitored to optimise performance and reliability. Users submit transfer requests to the FTS, which then manages transfers using third party GridFTP or srmcp.
Monitoring with the Experiment Dashboard
The Large Hadron Collider (LHC) is preparing for data taking at the end of 2009. The Worldwide LHC Computing Grid (WLCG) provides data storage and computational resources for the high energy physics community. Operating the heterogeneous WLCG infrastructure, which integrates 140 computing centers in 33 countries all over the world, is a complicated task.
Reliable monitoring is one of the crucial components of the WLCG for providing the functionality and performance that is required by the LHC experiments. The Experiment Dashboard system provides monitoring of the WLCG infrastructure from the perspective of the LHC experiments and covers the complete range of their computing activities.
The Experiment Dashboard monitoring system was developed in the framework of the EGEE NA4/HEP activity. The goal of the project is to provide transparent monitoring of the computing activities of the LHC VOs across several middleware platforms: gLite, OSG, ARC.
Currently the Experiment Dashboard covers the full range of the LHC computing activities: job processing, data transfer and site commissioning. It is used by all 4 LHC experiments, in particular by the two largest, namely ATLAS and CMS. Generic functionality, such as job monitoring, is provided by the Dashboard server to all VOs which submit jobs via the gLite WMS.
The Experiment Dashboard provides monitoring to various categories of users:
- Computing teams of the LHC VOs
- VO and WLCG management
- Site administrators and VO support at the sites
- Physicists running their analysis tasks on the EGEE infrastructure
A Dashboard framework was designed to support the development and building of the components of Dashboard monitoring applications. The framework is currently also used by other projects and development teams, not exclusively for the development of monitoring tools. It is also used to construct the high-level monitoring system, which provides a global view of LHC computing activities across all LHC experiments, both at the level of the distributed infrastructure as a whole and at the level of a single site.
Web monitoring shows heavy use of the Dashboard servers; for example, the dashboard of the CMS VO serves 2300-2500 unique visitors per month, with about 30K pages accessed daily. These numbers are growing steadily.
The future evolution of the project is driven by the requirements of the LHC community which is preparing for LHC data taking at the end of 2009. The main strategy is to concentrate the effort on common applications which are shared by multiple LHC VOs but can also be used outside the LHC and HEP scope.
Reliable monitoring is a necessary condition for the production quality of the distributed infrastructure. Monitoring of the computing activities of the main communities using this infrastructure in addition provides the best estimation of its reliability and performance.
Flexible monitoring tools focusing on the applications have proved essential not only for "power users" but also for single users. For the power users (such as managers of key activities like large simulation campaigns in HEP or drug searches in BioMed) a very important feature is the ability to monitor resource behaviour in order to detect the origin of failures and optimise their system. They also benefit from the possibility to "measure" efficiency and evaluate the quality of service provided by the infrastructure. Single users are typically scientists using the Grid for data analysis, verifying hypotheses on data sets that would not be available to them on other computing platforms. In this case, reliable monitoring is a guide to understanding the progress of their activity and to identifying and solving problems connected to their application.
This is essential to allow efficient user support by "empowering the users" in such a way that only non-trivial issues are escalated to support teams (for example, jobs on hold due to scheduled site maintenance can be identified as such and the user can decide to wait or to resubmit).
Preparation is under way to restart the LHC. The LHC is estimated to produce about 15 petabytes of data per year. This data has to be distributed to computing centres all over the world, with a primary copy being stored on tape at CERN. Seamless access to the LHC data has to be provided to about 5000 physicists from 500 scientific institutions. The scale and complexity of the task briefly described above require complex computing solutions. A distributed, tiered computing model was chosen by the LHC experiments for the implementation of the LHC data processing task.
The LHC experiments use the WLCG distributed infrastructure for their computing activities. In order to monitor the computing activities of the LHC experiments, several specific monitoring systems were developed. Most of them are coupled with the data-management and the workload-management systems of the LHC virtual organizations (VOs), for example PhEDEx, Dirac, Panda and AliEn. In addition, a generic monitoring framework, the Experiment Dashboard, was developed for the LHC experiments. If the source of the monitoring data is not VO-specific, the Experiment Dashboard monitoring applications can be shared by several VOs. Otherwise, the Experiment Dashboard offers experiment-specific monitoring solutions for the scope of a single experiment.
To demonstrate readiness for the LHC data taking, several computing challenges were run on the WLCG infrastructure over the last years. The latest one, Scale Testing for the Experiment Programme'09 (STEP09), took place in June 2009. The goal of STEP09 was the demonstration of the full LHC workflow from data taking to user analysis. The analysis of the results of the STEP09 and of the earlier WLCG computing challenges proved the key role of the experiment-specific monitoring systems, including Experiment Dashboard, in operating the WLCG infrastructure and in monitoring the computing activities of the LHC experiments.
The Experiment Dashboard makes it possible to estimate the quality of the infrastructure and to detect any problems or inefficiencies. Furthermore, it provides the necessary information to conclude whether the LHC computing tasks are accomplished. The WLCG infrastructure is heterogeneous and combines several middleware flavours: gLite, OSG and ARC. The Experiment Dashboard project works transparently across all these different Grid flavours.
The main computing activities of the LHC VOs are data distribution, job processing, and site commissioning. The Experiment Dashboard covers all of these computing activities. In particular, site commissioning aims to improve the quality of every individual site, thereby improving the overall quality of the WLCG infrastructure.
The Experiment Dashboard is intensively used by the LHC community. According to a web statistics tool, the Dashboard server of only one VO, for example CMS, has more than 2500 unique visitors per month and about 30,000 pages are viewed daily. The users of the system can be classified into various roles: managers and coordinators of the experiment computing projects, site administrators, and LHC physicists running their analysis tasks on the Grid.
Experiment Dashboard Framework
The common structure of the Experiment Dashboard service consists of the information collectors, the data repositories, normally implemented in an Oracle database, and the user interfaces. The Experiment Dashboard uses multiple sources of information such as:
- Other monitoring systems, like the Imperial College Real Time Monitor (ICRTM) or the Service Availability Monitoring (SAM)
- gLite Grid services, such as the Logging and Bookkeeping service (LB) or CEMon
- Experiment specific distributed services such as the ATLAS Data Management services or distributed Production Agents for CMS
- Experiment central databases such as the PANDA database for ATLAS
- Experiment client tools for job submission, like Ganga and CRAB
- Jobs instrumented to report directly to the Experiment Dashboard
This list is not exhaustive. Information can be transported from the data sources via various protocols. In most cases, the Experiment Dashboard uses asynchronous communication between the source and the data repository. For several years, in the absence of a messaging system as a standard component of the gLite middleware stack, the MonALISA monitoring system was successfully used as a messaging system for the Experiment Dashboard job monitoring applications. Currently, the Experiment Dashboard is being instrumented to use the Messaging System for the Grid [MSG] for the communication with the information sources.
A common framework providing components for the most usual tasks was established to fulfil the needs of the dashboard applications being developed for all the experiments. The schema of the Experiment Dashboard framework is presented in Figure 1.
The Experiment Dashboard framework is implemented in the Python programming language. The tasks performed on a regular basis are implemented by the Dashboard agents. The framework provides all the necessary tools to manage and monitor these "agents", each focusing on a specific subset of the required tasks, such as the collection of the input data or the computation of the daily statistics summaries.
To ensure a clear design and maintainability of the system, the definition of the actual monitoring application queries is decoupled from the internal implementation of the data repository. Every monitoring application implemented within the Experiment Dashboard framework comes with the implementation of one or more Data Access Objects (DAO), which represent the "data access interface": a public set of methods for the update and retrieval of information. Access to the database is done using a connection pool to avoid the overhead of creating new connections, which reduces the load on the server and improves performance.
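The separation between the data access interface and the repository can be sketched as follows. This is not the Dashboard's own code: the class and method names are invented for illustration, and SQLite stands in for the Oracle repository so that the example is self-contained.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Keeps a fixed set of open connections and lends them out one at a time."""

    def __init__(self, size, database=":memory:"):
        # Note: with SQLite's ":memory:" every connection is a separate database,
        # so use a pool of one or a file path when experimenting with this sketch.
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self):
        conn = self._pool.get()          # blocks until a connection is free
        try:
            yield conn
        finally:
            self._pool.put(conn)         # always hand the connection back


class JobDAO:
    """Hypothetical data access interface: callers never see SQL or connections."""

    def __init__(self, pool):
        self._pool = pool

    def create_schema(self):
        with self._pool.connection() as conn, conn:
            conn.execute("CREATE TABLE IF NOT EXISTS jobs "
                         "(job_id TEXT PRIMARY KEY, status TEXT)")

    def update_status(self, job_id, status):
        with self._pool.connection() as conn, conn:
            conn.execute("INSERT OR REPLACE INTO jobs VALUES (?, ?)", (job_id, status))

    def get_status(self, job_id):
        with self._pool.connection() as conn:
            row = conn.execute("SELECT status FROM jobs WHERE job_id = ?",
                               (job_id,)).fetchone()
        return row[0] if row else None
```

Because applications depend only on methods such as `update_status` and `get_status`, the repository schema or even the database product could change without touching the monitoring application queries.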
The Experiment Dashboard requests are handled by a system following the Model-View-Controller (MVC) pattern. They are handled by the "controller" component, launched by the Apache mod_python extension, which keeps the association between the requested URLs and the corresponding "actions", executing them and returning the data in the format requested by the client. All actions will process the request parameters and execute a set of operations, which may involve accessing the database via the DAO layer. When a response is expected, the action will store it in a Python object, which is then transformed into the required format (HTML page, plain XML, CSV, image) by the "view" components. Applying the view to the data is performed automatically by the controller.
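The request flow described above can be sketched in plain Python. The URL, the action and the sample rows are hypothetical, and the web server integration is omitted; the sketch only shows the controller looking up an action by URL and then applying the view that matches the requested format.

```python
import csv
import io
import xml.etree.ElementTree as ET


def view_csv(rows):
    """Render a list of dicts as CSV text."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=sorted(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def view_xml(rows):
    """Render a list of dicts as plain XML text."""
    root = ET.Element("rows")
    for row in rows:
        item = ET.SubElement(root, "row")
        for key, value in row.items():
            ET.SubElement(item, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


VIEWS = {"csv": view_csv, "xml": view_xml}


def job_summary_action(params):
    """Hypothetical action; a real one would fetch its rows through the DAO layer."""
    rows = [{"site": "T2_A", "running": 12}, {"site": "T2_B", "running": 3}]
    site = params.get("site")
    return [row for row in rows if site is None or row["site"] == site]


ACTIONS = {"/jobsummary": job_summary_action}


def controller(url, params, fmt="xml"):
    """Map the URL to its action, run it, and apply the requested view."""
    data = ACTIONS[url](params)
    return VIEWS[fmt](data)
```

The action returns an ordinary Python object and knows nothing about output formats, so adding a new format only requires registering another view.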
All components are included in an automated build system based on the Python distutils, with additional or customised commands enforcing strict development and release procedures. In total, there are more than fifty modules in the framework, fifteen of which are common modules offering the functionality shared by all applications.
The modular structure of the Dashboard framework enables a flexible approach to implementing the needs of the customers. For example, for the CMS production system, Dashboard provides only the implementation of the data repository. Data retrieved from the Dashboard database in the XML format is presented to the users via a web user interface developed by the CMS production team in the CMS web-tools framework.
LHC job processing and the Experiment Dashboard applications for job monitoring
The LHC job processing activity can be split into two categories: processing raw data and large-scale Monte-Carlo production, and user analysis. The main difference between these categories is that the first one is a large-scale, well-organized activity, performed in a coordinated way by a group of experts, while the second one is chaotic data processing by members of the huge distributed physics community. Users running physics analysis do not necessarily have enough knowledge about the Grid or profound expertise in computing in general. With the restart of the LHC, a considerable increase in the number of analysis users is expected. Clearly, for both categories of job processing, complete and reliable monitoring is a necessary condition for the success of this activity.
The organisation of the workload management systems of the LHC experiments differs from one experiment to another. While in the case of ALICE and LHCb the job processing is organised via a central queue, in the case of ATLAS and CMS, the job submission instances are distributed and there is no central point of control as in ALICE or LHCb. Therefore, the job monitoring for ATLAS and CMS is a more complicated task and it is not necessarily coupled to a specific workload management system.
The Experiment Dashboard provides several job monitoring solutions for various use cases, namely the generic job monitoring applications, monitoring for ATLAS and CMS production systems, and applications focused on the needs of the analysis users. The generic job monitoring, which is provided for all LHC experiments, is described in more detail in the next section. Since the distributed analysis is currently one of the main challenges for the LHC computing, several new applications were built recently on top of the generic job monitoring, mainly for monitoring of the analysis jobs. Chapter 5 gives a closer look at the CMS Task Monitoring as an example of the analysis job monitoring applications.
Experiment Dashboard Generic Job Monitoring Application
The overall success of the job processing depends on the performance and the stability of the Grid services involved in the job processing and on the services and the software which are experiment-specific. Currently, the LHC experiments are using several different Grid middleware platforms and therefore a variety of Grid services. Regardless of the middleware platform, access from the running jobs to the input data as well as saving output files to the remote storage are currently the main reasons for job failures.
Stability and performance of the Grid services, like the storage element (SE), the storage resource management (SRM) and various transport protocols, are the most critical issues for the quality of the data processing. Further on, the success of the user application depends as well on the experiment-specific software distribution at the site, the data management system of the experiment and the access to the alignment and calibration data of the detector known as "conditions data".
These components can have a different implementation for each experiment and they have a very strong impact on the overall success rate of the user jobs. The Dashboard Generic Job Monitoring Application tracks the Grid status of the jobs and the status of the jobs from the application point of view. For the Grid status of the jobs, the Experiment Dashboard relied on Grid-related systems as an information source. In the past, the Relational Grid Monitoring Architecture (R-GMA) and the Imperial College Real Time Monitor were used as information sources for the Grid job status changes.
None of the mentioned systems provided complete and reliable data. The current development aims to improve the publishing of job status changes by the Grid services involved in the job processing, as described later in the chapter. To compensate for the lack of information from the Grid-related sources, the job submission tools of the ATLAS and CMS experiments were instrumented to report job status changes to the Experiment Dashboard system. Every time the job submission tools query the status of the jobs from the Grid services, the status is reported to the Experiment Dashboard. The jobs themselves are instrumented for the runtime reporting of their progress at the worker nodes. The information flow of the generic job monitoring application is described in the next section.
Information flow of the generic job monitoring application
Similar to the common Dashboard structure, the job monitoring system consists of the central repository for the monitoring data (Oracle database), the collectors, and a web server that renders the information in HTML, XML, CSV, or in an image format.
The main principles of the Dashboard job monitoring design are:
- to enable non-intrusive monitoring, which must not have any negative impact on the job processing itself.
- to avoid direct queries to the information sources and to establish asynchronous communication between the information sources and the data repository, whenever possible.
When the development of the job monitoring application started, the gLite middleware did not provide any messaging system, so the Experiment Dashboard was using the MonALISA monitoring as a messaging system. The job submission tools of the experiments and the jobs themselves are instrumented to report needed information to the MonALISA server via the apmon library, which uses the UDP protocol. Every few minutes the Dashboard collectors query the MonALISA server and store job monitoring data in the Dashboard Oracle database. The data related to the same job and coming from several sources is correlated via a unique Grid identifier of the job.
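The correlation step can be illustrated with a short sketch. The report fields and source names below are assumptions, not the Dashboard's message schema; the idea shown is simply that reports from different sources are merged into one record per job, keyed by the unique Grid identifier, with later reports overriding earlier values.

```python
def correlate(reports):
    """Merge reports from several sources into one record per Grid job identifier."""
    jobs = {}
    for report in sorted(reports, key=lambda r: r["timestamp"]):
        record = jobs.setdefault(report["grid_job_id"], {"sources": []})
        record["sources"].append(report["source"])
        for key, value in report.items():
            if key not in ("grid_job_id", "source"):
                record[key] = value       # the most recent report wins
    return jobs
```

Sorting by timestamp before merging matters because reports sent over UDP from different hosts may be collected out of order.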
Following the outcome of the work of the WLCG monitoring working groups, the existing open source solutions for the messaging system were evaluated. As a result of this evaluation, Apache ActiveMQ was proposed to be used for the Messaging System for the Grids (MSG). Currently, the Dashboard job monitoring application is instrumented to use the MSG in addition to the MonALISA messaging system.
The job status shown by the Experiment Dashboard is close to the real-time status. The maximum latency is 5 minutes, which corresponds to the interval between the sequential runs of the Dashboard collectors. Information stored in the central job monitoring repository is being regularly aggregated in the summary tables. The latest monitoring data is made available to the users. For the long term statistics, data is being retrieved from the summary tables, which keep aggregated data with hourly and daily time bin granularity.
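Aggregation into time bins of the kind kept in the summary tables can be sketched as follows. The input format (a finish time and a final status per job) is an assumption; the sketch shows hourly binning, and daily bins would follow by truncating to the start of the day instead.

```python
from collections import Counter
from datetime import datetime


def hourly_summary(events):
    """Count job outcomes per hourly bin.

    `events` is an iterable of (finished_at, status) pairs.
    Returns {bin_start: Counter({status: count})}.
    """
    summary = {}
    for finished_at, status in events:
        bin_start = finished_at.replace(minute=0, second=0, microsecond=0)
        summary.setdefault(bin_start, Counter())[status] += 1
    return summary
```

Long-term plots can then be served from these small pre-aggregated counts instead of scanning every job record.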
Instrumentation of the Grid services for publishing job status information
As mentioned above, information about job status changes provided by the Grid-related sources is currently not complete and covers only a subset of jobs. This undermines the trustworthiness of the Dashboard data. Though some job submission tools are instrumented to report any job status changes at the point when they query the Grid-related sources, this query is done from the user's side. For example, when a user never requests the status of their jobs and the jobs were aborted, there is no way for the Dashboard to be informed that the jobs were aborted. As a result, they can stay in 'running' or 'pending' status until they are turned into the 'terminated' status with an 'unknown' exit code by a so-called 'timeout' Dashboard procedure.
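A timeout procedure of this kind might be sketched as below. The record layout and the 24-hour silence threshold are assumptions chosen for illustration; the source does not state the actual threshold.

```python
from datetime import datetime, timedelta


def apply_timeout(jobs, now, max_silence=timedelta(hours=24)):
    """Mark jobs as terminated/unknown when no update has arrived for too long.

    `jobs` is a list of dicts with 'status' and 'last_update' keys.
    Returns the identifiers of the jobs that were timed out.
    """
    timed_out = []
    for job in jobs:
        silent_for = now - job["last_update"]
        if job["status"] in ("running", "pending") and silent_for > max_silence:
            job["status"] = "terminated"
            job["exit_code"] = "unknown"
            timed_out.append(job["id"])
    return timed_out
```

The procedure keeps the displayed state consistent, but it cannot recover the real outcome of the job, which is why publishing status changes from the Grid services themselves is preferable.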
To overcome this limitation, the ongoing development aims to instrument the Grid services involved in the job processing to publish any job status changes to the MSG. Dashboard collectors consume information from the MSG and store it in the central repository of the job monitoring data. The services which need to be instrumented and the concrete implementation depend on the way the jobs are submitted to the Grid.
The advantages of using the MSG are numerous:
- Common way of publishing information.
- Common way of communicating between different components.
- Monitoring information is publicly available.
- Decreasing the load of the Grid Services.
When the jobs are submitted via the gLite Workload Management System (WMS), the LB service keeps full track of the job processing. The LB provides a notification mechanism which allows clients to subscribe to job status change events and to be notified as soon as events matching the conditions specified by the user happen. A new component, the "LB Harvester", was developed in order to register at several LB servers and to maintain an active notification registration for each one. The output module of the harvester formats the job status message according to the MSG schema and publishes it to the MSG.
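The shape of such a harvester can be sketched as follows. Everything specific here is hypothetical: the event fields, the set of forwarded states, the topic name and the message layout are placeholders, since neither the LB notification API nor the actual MSG schema is reproduced in this chapter. The `publish` callable stands in for a messaging client.

```python
import json

# Assumed set of states worth forwarding; the real filter conditions are user-defined.
FORWARDED_STATES = {"Submitted", "Running", "Done", "Aborted"}


class HarvesterSketch:
    """Keeps registrations with several LB servers and republishes their events."""

    def __init__(self, publish):
        self.publish = publish            # callable taking (topic, message)
        self.registrations = set()

    def register(self, lb_server):
        self.registrations.add(lb_server)

    def on_notification(self, lb_server, event):
        """Format a matching event and publish it; return True if it was forwarded."""
        if lb_server not in self.registrations:
            return False
        if event["state"] not in FORWARDED_STATES:
            return False
        message = json.dumps({
            "jobId": event["job_id"],
            "state": event["state"],
            "timestamp": event["time"],
            "source": lb_server,
        }, sort_keys=True)
        self.publish("job.status", message)
        return True
```

Because the harvester is notified by the LB rather than polling it, status changes reach the messaging system even when the user never queries the job.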
Currently, the LB does not keep track of the jobs submitted directly to the Computing Resource Execution And Management (CREAM) computing element (CE). The CEMon service plays a role similar to the LB but only for jobs submitted to the CREAM CE. A CEMon listener component is being developed in order to enable the publishing of job status changes to the MSG. It subscribes to CEMon for notifications about job status changes and republishes this information to the MSG.
Finally, jobs submitted to Condor-G, as in the previous case, do not use the WMS service and correspondingly do not leave a trace in the LB. The publisher component for job status changes was developed through a collaboration between the Condor and Dashboard teams. Condor developers have added job log parsing functionality to the Condor standard libraries. The publisher of the job status changes reads new events from standard Condor event logs, filters the relevant events, extracts essential attributes and publishes them to the MSG. The publisher runs in the Condor scheduler as a Condor job. In this case, Condor itself takes care of publishing status changes.
Job monitoring user interfaces
The standard job monitoring application provides two types of user interfaces. First, the so called "Interactive User Interface", which enables very flexible access to recent monitoring data and shows the job processing for a given VO at runtime. The interactive UI contains the distribution of active jobs and jobs terminated during a selected time window by their status. Jobs can be sorted by various attributes, for example, the type of activity (production, analysis, test, etc.), site or CE where they are being processed, job submission tool, input dataset, software version and many others. The information is presented in a bar plot and in a table. A user can navigate to a page with very detailed information about a particular job, for example, the exit code and exit reason, important time stamps of processing the job, number of processed events, etc. This application is presented in detail in Chapter 6.
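The table behind such a view is essentially a two-level count: jobs grouped by a chosen attribute and then by status. A minimal sketch, with hypothetical job records and attribute names, could look like this:

```python
from collections import Counter, defaultdict


def distribution(jobs, group_by):
    """Count jobs per status within each value of the chosen attribute."""
    table = defaultdict(Counter)
    for job in jobs:
        table[job[group_by]][job["status"]] += 1
    return {group: dict(counts) for group, counts in table.items()}
```

Calling the same function with `group_by="site"`, `group_by="activity"` or any other recorded attribute yields the different sortings the interface offers.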
Second, the "Historical Interface", which shows job statistics distributed over time. The historical view allows following the evolution of the numeric metrics such as the number of jobs running in parallel, the CPU and the wallclock consumption or the success rate. The historical view is useful for understanding how the job efficiency behaves over time, how resources are shared between different activities, and how various job failures fluctuate as a function of time.
This chapter introduced the major concepts and components that are required to make Grid computing a reality. In a relatively short space of time the Grid has been created and moved past the hype to provide serious computing power. Several scientific Grids led the way but Grids are now increasingly found in commercial organisations as they provide a flexible, adaptive method of managing their computational loads without increasing expenditure.
The major components that form a Grid were discussed and examples given from the major implementations including Condor, Globus, EDG and gLite. Finally, a reliable system to monitor the Grid activities using the Experiment Dashboard was presented.
- LHC homepage, http://lhc.web.cern.ch/lhc/
- WLCG homepage, http://lcg.web.cern.ch/LCG/
- Phedex homepage, http://cmsweb.cern.ch/phedex/
- A. Tsaregorodtsev et al, Dirac: A community grid solution, CHEP07 Conference Proceedings, Victoria, BC, Canada
- P. Nilsson, PanDA System in ATLAS Experiment, ACAT'08, Italy, November 2008
- Saiz, P. et al, AliEn - ALICE environment on the GRID, Nucl. Instrum. Meth., A502 (2003) 437-440
- gLite homepage, http://glite.web.cern.ch/glite/
- Open Science Grid (OSG) Web Page, http://www.opensciencegrid.org/
- Nordugrid homepage, http://www.nordugrid.org/middleware/
- CMS Dashboard stats, http://lxarda18.cern.ch/awstats/awstats.pl?config=lxarda18.cern.ch
- Real Time Monitor home page, http://gridportal.hep.ph.ic.ac.uk/rtm/
- SAM paper (reference pending)
- LB homepage, http://egee.cesnet.cz/cz/JRA1/LB/
- C. Aiftimiei et al, Using CREAM and CEMON for job submission and management in the gLite middleware, to appear in Proc CHEP'09, 17th International Conference on Computing in High Energy and Nuclear Physics, Prague, Czech Republic, March 2009
- J. Moscicki et al, Ganga: a tool for computational-task management and easy access to Grid resources, Computer Physics Communication, arXiv:0902.2685v1
- D. Spiga et al, The CMS Remote Analysis Builder (CRAB), Lect.Notes Comput.Sci.4873:580-586,2007
- I. Legrand, H. Newman, C. Cirstoiu, C. Grigoras, M. Toarta, C. Dobre, "MonALISA: an Agent Based, Dynamic Service System to Monitor, Control and Optimize Grid Based Applications", in Proceedings of Computing for High Energy Physics, Interlaken, Switzerland, 2004
- James Casey, Daniel Rodrigues, Ulrich Schwickerath, Ricardo Silva, Monitoring the efficiency of user jobs, CHEP'09: 17th International Conference on Computing in High Energy and Nuclear Physics, Prague, Czech Republic, March 2009
- Google Web Toolkit, http://code.google.com/webtoolkit/
- S. Metson et al, CMS offline webtools, CHEP07 Conference Proceedings, Victoria, BC, Canada
- R-GMA homepage, http://www.r-gma.org/
- Condor home page, http://www.cs.wisc.edu/condor/
- E. Karavakis, et al, CMS Dashboard for Monitoring of the user analysis activities, CHEP'09: 17th International Conference on Computing in High Energy and Nuclear Physics, Prague, Czech Republic, March 2009
- Agrawal, R., Srikant, R., Fast Algorithms for Mining Association Rules in Large Databases, Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, Santiago, Chile, 487--499 (1994)
- S. Belforte et al, The commissioning of CMS sites: improving the site reliability, 17th International Conference on Computing in High Energy and Nuclear Physics, Prague, Czech Republic, March 2009
- GridMap visualization, http://www.isgtw.org/?pid=1000728
- EDS HP company homepage, http://www.eds.com/
- ALICE monitor, http://pcalimonitor.cern.ch/map.jsp
- QAOES: http://dashb-cms-mining-devel.cern.ch/dashboard/request.py/qaoes
Grid computing embodies the ambition of more powerful, easier and cheaper information processing. However, the reality is that Grid technology is still in its infancy. Besides the challenge of finding technical solutions for the interoperability of resources and the virtualised utilisation of large-scale shared resources, there are also social challenges, such as collaboration management and the coordination of security policies, which are not always within a technical scope.