Chapter 6.
CMS Dashboard Job Summary

The Dashboard Job Summary was the first job monitoring application to be developed in the Dashboard project. Because it was based on vision more than on experience, the emphasis was put on flexibility. The application provides a job-centric view aimed at understanding and debugging what is happening “now”.

This chapter discusses the development of the CMS Dashboard Job Summary application, which was carried out by the author.

6.1 Introduction

The CMS Virtual Organisation (VO) uses various fully distributed job submission methods and execution backends. CMS jobs are processed on several middleware platforms such as gLite, ARC and OSG. Up to 200,000 CMS jobs are submitted daily to the Worldwide LHC Computing Grid (WLCG) infrastructure and this number is steadily growing. These factors increase the complexity of monitoring the user analysis activities within the CMS VO.

Distributed analysis on the WLCG infrastructure is currently one of the main challenges of LHC computing. Reliable monitoring is an aspect of particular importance: it is a vital factor for the overall improvement of the quality of the CMS VO infrastructure. Transparent access to the LHC data has to be provided for more than 5000 scientists all over the world. Users who run analysis jobs on the Grid do not necessarily have expertise in Grid computing. Currently, 100-150 distinct CMS users submit their analysis jobs to the WLCG daily. Effective organisation of the LHC user analysis requires a significant streamlining of operations and a simplification of the end-users' interaction with the Grid. Simple, user-friendly, reliable monitoring of analysis jobs is one of the key components of distributed analysis operations.

The goal of the CMS Dashboard Job Summary is to follow the job processing of the CMS experiment on the distributed infrastructure. The entry point of the application is the number of jobs submitted or terminated in a chosen time period, categorised by activity, such as analysis, production, job robot and simulation jobs.

The CMS Dashboard Job Summary, the so-called “interactive interface”, allows the user to drill down into the set of jobs by various relevant properties, such as the execution site, the grid gateway, the user, various completion statuses, the grid workload management host, the activity type and the dataset used, until all details stored in the database about a chosen job or set of jobs can be accessed. The interface reports success/failure broken down by grid, site or application problem, together with the wall clock and CPU time consumed by the jobs.

Information related to the job processing can be aggregated and presented per user, per site or computing element (CE), per resource broker, per application and per input collection.

The application offers very flexible access to recent monitoring data and shows the job processing at runtime. The interactive UI contains the distribution of active jobs and jobs terminated during a selected time window by their status. Jobs can be sorted by various attributes, for example, the type of activity (production, analysis, test, etc.), site or CE where they are being processed, job submission tool, input dataset, software version and many others. The information is presented in a bar plot and in a table. A user can navigate to a page with very detailed information about a particular job, for example, the exit code and exit reason, important time stamps of processing the job, number of processed events, etc.

The CMS Dashboard Job Summary was the very first monitoring application developed in the Dashboard project. The motivation for this development, which started in the summer of 2005, was twofold: to show whether the Grid was operational, since at that time people were rather pessimistic about the Grid, and to show what is going on in job processing “now”, detecting any problems or inefficiencies, not necessarily with a site but, for example, with a particular dataset, a particular Resource Broker (RB) instance or a particular application version.

This is the reason why the application provides such wide flexibility to its users; a user can sort information by any of the job/task attributes. The application does not offer long-term statistics, since there is no precomputed information in the database. The application works on raw database data, and the database was tuned for good performance at this high level of flexibility.

6.2 Design

The following sections discuss the requirements that shaped the design of the CMS Dashboard Job Summary application.

6.2.1 Objectives

The main objectives for re-developing the CMS Dashboard Job Summary were to provide a more stable, maintainable release aimed at various CMS user communities, such as the VO Management Team, the coordinators and participants of various CMS computing projects such as the Analysis Support Team, and CMS Site Administrators.

The CMS Site Administrators need a monitoring tool to follow the usage of their site: who is using it, the time consumed and the processing efficiency achieved there.

The main beneficiaries of this activity were the users who had come to rely on the functionality that the application provides. The increased stability and performance benefited both them and new users.

6.2.2 Use Cases

A use case analysis was carried out based upon the feedback received from the CMS physicist community. The main use cases are described in Appendix XXX and illustrated in Figure 1.

With the major use cases established it is possible to extract the key requirements that the application has to fulfil. The following points represent the baseline requirements divided into principal areas.

6.2.3 Requirements

Assumptions

1. Users have a grid certificate.

2. Users are members of the CMS VO.

User Interface

1. Users control the application via a web interface using a browser.

2. It should be easy to understand how the tool works and how to navigate through it.

3. Compatible with all recent browsers and operating systems.

4. All of the Grids and the job submission systems that CMS uses will be supported.

5. The user will have access to very detailed information about the job processing, including every single resubmission that he/she might have performed for each individual job.

6. The application will be connected to the CMS Dashboard Task Monitoring for job-centric information.

7. The application will offer consumed time information and plots such as Waiting Time, Running Time, Overall Time, CPU Time, Job Wrapper Time and Processing Efficiency.

8. The user will be able to search for a specific job by entering its Grid Job ID, which is a unique identifier.

9. Updates arrive in 'real time' from the worker nodes where the jobs are running.

10. The user will be able to bookmark his/her favourite tasks for later use or to share them among his/her colleagues.

11. Offer a selection of advanced graphical plots that will visually assist the user.

12. The application will offer success rate calculation.

13. The user will be able to retrieve the results in XML format as well as in the standard HTML/XSL format.

14. The application will be built on top of the CMS Dashboard Job Monitoring Data Repository.

15. Exceptions should be caught by the application and informative error messages will be provided to the users.

16. Verbose logging should be available to identify any problems.

17. Quick access to the application's manual, help, the FAQ and the meanings of the error exit codes should be provided.

Developer's Requirements

1. Variable level of logging will be built in from the start.

2. Logging will write to stdout and to a file to ease debugging.

3. Low coupling between the components is required.

4. Minimum version of Python that is supported is determined by that installed on lxplus.cern.ch (currently 2.3).

6.2.4 Architecture

The CMS Dashboard Job Summary application is part of the Experiment Dashboard system which is widely used by the four LHC experiments. The architecture does not differ from the one of the CMS Dashboard Task Monitoring covered in depth in section 5.2.4.

Job status is reported to the Dashboard from several information sources. The main ones are the CMS job submission systems, such as CRAB and ProdAgent. Status changes of the jobs can be triggered by reports sent from a job submission system's user interface, when the job status is checked, or by reports from the jobs running at the Worker Node (WN). The jobs running on the WN are instrumented to report when they start running and when they finish; the exit status of the job is also reported from the WN. As soon as the job terminates at the WN, it is moved to the “terminated” status in the Dashboard.
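
The chapter does not detail the wire format of these reports. Purely as an illustration, the sketch below shows the kind of instrumentation call a job wrapper might make from the WN; the collector endpoint, the attribute names and the UDP transport are all assumptions, not the actual Dashboard protocol.

    import socket
    import time

    # Hypothetical Dashboard collector endpoint (illustrative only).
    COLLECTOR = ('dashboard-collector.example.org', 8884)

    def report_job_status(task_id, job_id, status, exit_code=None):
        """Send a key=value status report for one job over UDP."""
        params = {
            'taskId': task_id,
            'jobId': job_id,
            'StatusValue': status,              # e.g. 'running' or 'finished'
            'StatusEnterTime': int(time.time()),
        }
        if exit_code is not None:
            params['JobExitCode'] = exit_code
        payload = '\n'.join('%s=%s' % (k, v) for k, v in params.items())
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            sock.sendto(payload.encode('ascii'), COLLECTOR)
        finally:
            sock.close()

    # The wrapper would call this once when the job starts and again when it ends:
    # report_job_status('task_0_090101', 'job_42', 'finished', exit_code=0)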

6.3 Implementation

Python was chosen as the main development language for the CMS Dashboard Job Summary for the reasons outlined in Section 5.3. Apache 2.0.52 (as of November 2009) was chosen to provide the client interface as it has a history of being flexible, secure and performant. PHP was chosen as the implementation language for the interactive plot due to its power and the availability of third-party libraries. JavaScript and AJAX were used to connect the web interface with the database. Finally, a patched version of the Graphtool Python library was used for the creation of the consumed time and failure diagnostics plots.

The relation between the Action and the View Python classes and their generated output files is illustrated in X. All the Action classes access the database to collect the data and, if a calculation on the results is needed, forward the data to the appropriate View class; the data is then returned to the user in the appropriate output format. The output generated by the Generic Histogram View classes is in XSL, containing an image plot and a table with the requested results.
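
A minimal sketch of this Action/View split, assuming plain Python classes; the class and method names below are illustrative rather than the actual Dashboard API.

    class JobSummaryAction(object):
        """Collects raw rows from the DAO and hands them to a View for rendering."""

        def __init__(self, dao, view):
            self.dao = dao
            self.view = view

        def execute(self, context, filters):
            rows = self.dao.jobSummary(filters)          # database access only
            context['output'] = self.view.render(rows)   # calculation / formatting
            return context

    class GenericHistogramView(object):
        """Aggregates (category, count) rows into the structure behind the plot and table."""

        def render(self, rows):
            totals = {}
            for category, count in rows:
                totals[category] = totals.get(category, 0) + count
            return totals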

6.3.1 Filters

The filter classes contain the menu data and all of the sorting parameters. When the user enters the application for the first time in a session, the Filters Python class calls the jobFilters function of the Data Access Object (DAO).

The jobFilters function contains the database queries that fetch the menu data for all the available menu parameters. The DAO executes the queries and the Python class puts the data in a shared area, the ActionContext as defined in Section 5.2.4, to be picked up by the Filters.xsl output file. The flowchart of the Filters request is illustrated in Figure 3.
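
The flow just described might be sketched as follows; jobFilters and the ActionContext are named in the text, while the surrounding scaffolding is an assumption.

    class Filters(object):
        """Action invoked on the first request of a session."""

        def __init__(self, dao):
            self.dao = dao

        def execute(self, action_context):
            # A single DAO call returns the values for every drop-down menu.
            menus = self.dao.jobFilters()
            # The shared area is later picked up by Filters.xsl for rendering.
            action_context['menus'] = menus
            return action_context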

The available filter parameters can be seen in Figure 4. The DAO jobFilters function executes 10 queries to populate the drop-down menus. The user can also choose to view only a selected job status by clicking on any of the check boxes.

The application also offers 18 different sorting parameters. These parameters are held in a single Python list of dictionaries, as illustrated by Listing X.X:

menus['sortbys'] = [
    {'sortby': 'user'}, {'sortby': 'site'}, {'sortby': 'submissiontool'},
    {'sortby': 'submissionui'}, {'sortby': 'dataset'}, {'sortby': 'application'},
    {'sortby': 'rb'}, {'sortby': 'ce'}, {'sortby': 'activity'},
    {'sortby': 'grid'}, {'sortby': 'submissiontype'}, {'sortby': 'task'},
    {'sortby': 'jobtype'}, {'sortby': 'subtoolver'}, {'sortby': 'tier'},
    {'sortby': 'genactivity'}, {'sortby': 'outputse'}, {'sortby': 'appexitcode'},
]

6.3.2 CMS Dashboard Database Schema

The CMS Dashboard Job Summary application is built on top of the CMS Dashboard Job Processing Data Repository. To ensure a clear design and maintainability of the application, the actual monitoring queries are decoupled from the internal implementation of the data storage. The application comes with a Data Access Object (DAO) implementation that represents the data access interface. Access to the database is done using a connection pool to reduce the overhead of creating new connections and therefore, the load on the server is reduced and the performance is increased.
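
The bind-variable syntax of the queries in Section 6.3.3 (e.g. :bv_date1) suggests an Oracle backend. A minimal sketch of pooled database access in that spirit, assuming the cx_Oracle client library and placeholder credentials:

    import cx_Oracle

    # Pool sizing values are illustrative; user, password and dsn are placeholders.
    pool = cx_Oracle.SessionPool(user='dashboard', password='secret',
                                 dsn='dashboard-db.example.org/cmsdb',
                                 min=2, max=10, increment=1)

    def run_filter_query(query):
        """Run one of the filter queries of Section 6.3.3 through the pool."""
        connection = pool.acquire()
        try:
            cursor = connection.cursor()
            cursor.execute(query)
            return [row[0] for row in cursor.fetchall()]
        finally:
            # Returning the connection avoids the cost of re-creating it.
            pool.release(connection)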

Figure 5 illustrates the entity relationship diagram between the most important tables of the database used by the CMS Dashboard Job Summary application. The Job table is the most important one; it contains information regarding the job itself, such as the number of events to be analysed, the task it belongs to, the site where the job is running and various submission timestamps. The Primary Key (P) is the JobId and there are 5 Foreign Keys (F) connecting the Job table with the Site, Task, Resource Broker (RB), Short Computing Element (CE) and Scheduler tables.

The Task table contains task-specific information such as the task creation timestamp, the name of the task, the submission method used, the user that submitted the task, the input collection and the target Computing Element (CE). The Primary Key is the TaskId and there are 8 Foreign Keys connecting the table with the User, Task_Type, Application, Input_Collection, Scheduler, Submission_Tool, Submission_UI and Submission_Tool_Ver tables.

The Site table contains site-specific information such as the site name, the country that the site belongs to, the Computing Elements of the site and the nodes of the site. The Primary Key is the SiteId and the Foreign Key is the SchedulerId connecting the table with the Scheduler table.
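
For illustration only, the relationships just described can be summarised in DDL form. Column types and the exact column lists are assumptions; only the key columns named above and in the queries of Section 6.3.3 are taken from the text.

    create table scheduler (
        "SchedulerId"   number primary key,
        "SchedulerName" varchar2(255)
    );

    create table site (
        "SiteId"      number primary key,
        "VOName"      varchar2(255),
        "Tier"        varchar2(64),
        "SchedulerId" number references scheduler
    );

    create table task (
        "TaskId"            number primary key,
        -- 8 FKs in total: User, Task_Type, Application, Input_Collection,
        -- Scheduler, Submission_Tool, Submission_UI, Submission_Tool_Ver.
        "UserId"            number,
        "TaskTypeId"        number,
        "ApplicationId"     number,
        "InputCollectionId" number,
        "SubmissionToolId"  number,
        "SubToolVerId"      number
    );

    create table job (
        "JobId"                    number primary key,
        -- 5 FKs: Site, Task, RB, Short CE, Scheduler.
        "TaskId"                   number references task,
        "SiteId"                   number references site,
        "RbId"                     number,
        "ShortCEId"                number,
        "SchedulerId"              number references scheduler,
        "DboardStatusId"           char(1),
        "DboardJobEndId"           char(1),
        "JobExecExitCode"          number,
        "TimeOutFlag"              char(1),
        "DboardFirstInfoTimeStamp" date,
        "FinishedTimeStamp"        date
    );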

6.3.3 SQL Queries

In this section, the most important SQL queries of the application are presented. The first set of SQL queries is responsible for fetching the lists of filter values, ordered by value name, for each category.

select distinct "GridName" as "user" from users order by "user"

select distinct "VOName" as "site" from site where "InteractiveInterfaceFlag" = 0 order by "site"

select distinct "ShortCEName" as "ce" from short_ce order by "ce"

select distinct "SubmissionTool" as "submissiontool" from submission_tool order by "submissiontool" select distinct "ApplicationVersion" as "application" from application order by "application" :

select distinct "RbName" as "rb" from rb order by "rb"

select distinct "Type" as "activity" from task_type order by "activity"

select distinct "SchedulerName" as "grid" from scheduler order by "grid"

select distinct "JobType" as "jobtype" from job_type order by "jobtype"

select distinct "Tier" as "tier" from site order by "tier"

The SQL queries for the consumed time information vary according to the selected set of filters. The following query calculates the overall time per site.

select "VOName" as "name", 24*60*60*avg(delay) as "value", 24*60*60*min(delay) as "dmin",

24*60*60*max(delay) as "dmax", 24*60*60*sum(delay) as "total" from (

select ( to_date(to_char("FinishedTimeStamp",'YYYY-MM-DD HH24:MI:SS'),'YYYY-MM-DD HH24:MI:SS') -

to_date(to_char("DboardFirstInfoTimeStamp",'YYYY-MM-DD HH24:MI:SS'),'YYYY- MM-DD HH24:MI:SS') ) as delay, site."VOName" as "VOName"

from job, task, site ,task_type where ("DboardFirstInfoTimeStamp" <= :bv_date2) and ("DboardFirstInfoTimeStamp" >= :bv_date1) and ( TASK."TaskTypeId" = task_type."TaskTypeId" and task_type."Type" = :bv_activity) and ("FinishedTimeStamp" >= "DboardFirstInfoTimeStamp") and ("FinishedTimeStamp" != '01-Jan-70 12.00.00 AM') and ("DboardFirstInfoTimeStamp" != '01-Jan-70 12.00.00 AM')

and job."SiteId" = site."SiteId"

and (job."TaskId" = task."TaskId")

and ("DboardStatusId" in ('T'))

and job."TimeOutFlag"='0' ) group by "VOName" order by "value" desc

The SQL query for the exit code summary calculation also varies according to the selected set of filters. The following query calculates the exit code summary values for a specific site.

with temp as (
    select "exitcode", count("exitcode") as "num", "URLToDoc" as "url",
           "Comment" as "comment", "AppGenericStatusReasonValue" as "value",
           "SiteUserFlag" as "flag"
    from app_generic_status_reason app, (
        select job."DboardStatusId", job."JobExecExitCode" as "exitcode",
               job."DboardJobEndId", task."UserId", site."SiteId",
               job."DboardFirstInfoTimeStamp", site."SchedulerId",
               task."ApplicationId", job."RbId", task_type."Type",
               task_type."GenericType", task."InputCollectionId",
               task."TaskTypeId", task."SubmissionToolId",
               task."TaskId" as "TaskId",
               submission_tool_ver."SubToolVersion"
        from job, task, site, task_type, submission_tool_ver
        where (task."TaskTypeId" = task_type."TaskTypeId")
          and (job."SiteId" = site."SiteId")
          and (job."TaskId" = task."TaskId")
          and ("DboardFirstInfoTimeStamp" <= :bv_date2)
          and ("DboardFirstInfoTimeStamp" >= :bv_date1)
          and ("DboardJobEndId" = 'F' and "DboardStatusId" = 'T')
          and (task_type."Type" = :bv_activity)
          and (site."VOName" = :bv_site)
          and (task."SubToolVerId" = submission_tool_ver."SubToolVerId")
    ) ex
    where (app."AppGenericErrorCode" = ex."exitcode")
    group by "exitcode", "URLToDoc", "Comment", "AppGenericStatusReasonValue", "SiteUserFlag"
    order by "SiteUserFlag" desc
)
select * from (
    (select temp."flag", sum("num") as "sum_n" from temp group by temp."flag") sum_n
    left join temp on temp."flag" = sum_n."flag"
)
order by sum_n."flag"

The SQL query that fetches the data for the plot and the table is listed in the Appendix XXX due to its size.

6.3.4 User Interface

The User Interface of the CMS Dashboard Job Summary is divided into two parts. The graphical plot, the filters with their sorting parameters, the consumed time information buttons and the search field for a specific job can be seen in the upper part of the User Interface, as illustrated in Figure 6.

By clicking on any category on the plot, a “sort-by” menu appears, allowing the user to explore the available information further, as illustrated in Figure 7.

The table with all the available numerical data can be seen in the lower part of the User Interface, as illustrated in Figure 8. The table is categorised by the current status, the grid exit status, the application exit status, the overall status, the number of events processed, and the CPU and Job Wrapper time.

Bars are sorted by the number of jobs in a given category. Since the labels of a category can be rather long, it can be difficult to find a given item in the table. The items in the table are sorted in alphabetical order by default, but by clicking on the table header of any selected column, the user can sort the items by the values in the corresponding column.

The table also offers success rate calculation, as illustrated in Figure 9. The formulas used to calculate the success rates follow:

• Grid Success Rate (Grid%) = Done / (Done + Abort)

• Application Success Rate (App%) = Success / (Success + Fail)

• Overall Success Rate (Overall%) = (Success - (Success & Abort)) / (Terminated - (GridUnknown & AppUnknown))

• Site Success Rate (Site%) = 1 - ((SiteFailed + GridAborted) / (Terminated - (GridUnknown & AppUnknown)))

where

• Done = reported as “Grid success” by the Grid information services.

• Abort = reported as “Grid aborted” jobs by the Grid information services.

• Success = application ran successfully.

• Fail = application failed.

• Terminated = reported as terminated (success or failure) by any of the information sources (grid or application).
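
These formulas reduce to simple ratio arithmetic over the job counts. A minimal sketch in Python, assuming the per-category counts are already available and that GridAborted denotes the same count as Abort:

    def success_rates(done, abort, success, fail, terminated,
                      success_and_abort, unknown, site_failed):
        """Implements the four formulas above. 'unknown' stands for the
        (GridUnknown & AppUnknown) count and 'success_and_abort' for the
        (Success & Abort) count."""
        def ratio(num, den):
            # Guard against empty categories.
            return float(num) / den if den else 0.0
        known = terminated - unknown
        return {
            'Grid%':    ratio(done, done + abort),
            'App%':     ratio(success, success + fail),
            'Overall%': ratio(success - success_and_abort, known),
            'Site%':    1.0 - ratio(site_failed + abort, known),  # abort = GridAborted
        }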

The user can also retrieve the result of the table in the XML format by using the following command:

curl -H 'Accept: text/xml' http://dashb-cms-job.cern.ch/dashboard/request.py/jobsummary-plot-or-table > /tmp/action.xml

The XML output is hard to read because it contains no newline breaks. It can be reformatted using the 'xmllint' command:

xmllint --format /tmp/action.xml
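
Once retrieved, the XML can also be processed programmatically. A minimal sketch using only the Python standard library; the element and attribute names depend on the Dashboard's actual XML schema, so the sketch simply walks the whole tree:

    import xml.etree.ElementTree as ET

    tree = ET.parse('/tmp/action.xml')
    # Print every element's tag and attributes to inspect the structure.
    for element in tree.getroot().iter():
        print('%s %s' % (element.tag, element.attrib))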

By clicking on any consumed time button, a new window appears with a graphical plot and a table. The Waiting Time information can be seen in Figure 10. This functionality offers the average waiting time per job, calculated by subtracting the “Submission time” timestamp from the “Started_Running time” timestamp.

The Running Time information can be seen in Figure 11. This functionality offers the average running time per job, calculated by subtracting the “Started_Running time” timestamp from the “Finished time” timestamp. The timestamps are reported by the jobs themselves and, in case of a job resubmission, only the latest attempt is considered.

The Overall Time information can be seen in Figure 12. This functionality offers the average overall time per job, calculated by subtracting the “Submission time” timestamp from the “Finished time” timestamp. The timestamps are reported by the jobs themselves and, in case of a job resubmission, only the latest attempt is considered.

The CPU Time information can be seen in Figure 13. This functionality offers the average CPU time per job, calculated from the sum of the “CPUTime” field grouped by a category, such as site or user. Currently, only jobs submitted using CRAB report the “CPUTime” value.

The Job Wrapper Time information can be seen in Figure 14. This functionality offers the average Wall Clock time per job as reported by the job wrapper, calculated from the sum of the “WCTime” field grouped by a category, such as site or user. Currently, only jobs submitted using CRAB report the “WCTime” value.

The Processing Efficiency information can be seen in Figure 15. This functionality offers the average processing efficiency per job as reported by the job wrapper, calculated by dividing “CPUTime” by “WCTime”, grouped by a category such as site or user. Currently, only jobs submitted using CRAB report the “CPUTime” and “WCTime” values.
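
All of these metrics reduce to per-job arithmetic on the reported timestamps and counters. A minimal sketch, assuming epoch-second timestamps and the CRAB-reported CPUTime/WCTime fields described above:

    def job_time_metrics(job):
        """job: dict with 'submitted', 'started' and 'finished' epoch-second
        timestamps, plus optional 'CPUTime'/'WCTime' counters (CRAB only)."""
        metrics = {
            'waiting': job['started'] - job['submitted'],
            'running': job['finished'] - job['started'],
            'overall': job['finished'] - job['submitted'],
        }
        if job.get('WCTime'):
            # Processing efficiency: CPU time over wall clock time.
            metrics['efficiency'] = float(job.get('CPUTime', 0)) / job['WCTime']
        return metrics

    # Example: a job that waited 600 s and ran for 3000 s.
    # job_time_metrics({'submitted': 0, 'started': 600, 'finished': 3600,
    #                   'CPUTime': 2400, 'WCTime': 3000})
    # -> {'waiting': 600, 'running': 3000, 'overall': 3600, 'efficiency': 0.8}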

The Exit Code Summary can be seen in Figure 16. This page reports error diagnostics by providing a table with numerical values and a graphical plot showing the distribution of user, application and site failures.

6.4 Experience of the CMS User Community with Job Summary

According to our web statistics [8][9], more than seventy distinct users are using the Job Summary for their everyday work, as illustrated in Figure 17. The Dashboard Applications Usage Statistics programme was developed to count the daily total number of distinct users of a selected set of CMS Dashboard applications.

In order to count the distinct daily users, the daily access_log file of the Apache HTTP web server was used. The following bash commands were used in a Python programme to determine the date of the month and the total number of distinct daily users of the selected applications, according to the total number of unique visitor IPs.

# Command to get the date of the month:
getDate = "zgrep +0 /var/log/httpd/access_log.1.gz | awk '{print $4}' | uniq | head -n 1 | cut -c 2-13"

# Job Summary usage:
JobSum = "zcat /var/log/httpd/access_log.1.gz | grep jobsummary | awk '{print $1}' | sort | uniq | wc -l"

The “JobSum” bash command counts the total number of distinct users of the application. The following cron entry was scheduled to run the programme daily at 06:00 to update the statistics.

0 6 * * * python /usr/share/dashboard-stats/dashb_stats.py 2>&1 >> /var/log/script_output.log
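
A condensed sketch of how such a statistics script might tie the pieces together; the pipelines are those listed above, while the rest of the scaffolding (and the use of subprocess.check_output, which requires Python 2.7 or later) is an assumption.

    import subprocess

    LOG = '/var/log/httpd/access_log.1.gz'

    def run(pipeline):
        """Run a shell pipeline and return its stripped textual output."""
        return subprocess.check_output(pipeline, shell=True).decode().strip()

    date = run("zgrep +0 %s | awk '{print $4}' | uniq | head -n 1 | cut -c 2-13" % LOG)
    job_summary_users = int(run(
        "zcat %s | grep jobsummary | awk '{print $1}' | sort | uniq | wc -l" % LOG))
    print('%s: %d distinct Job Summary users' % (date, job_summary_users))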

The Graphtool library was then used to create the plot of the programme, which is available at http://lxarda18.cern.ch/usage.html.

6.5 Summary

A wide variety of monitoring tools currently provide job monitoring functionality for the CMS Virtual Organisation. Most of them are middleware-specific and are used within the scope of a single middleware. The CMS Dashboard Job Summary provides monitoring functionality regardless of the job submission method or the middleware platform, offering a complete and detailed view of the Grid.

The CMS Dashboard Job Summary was the first monitoring application developed in the Dashboard project. The motivation for this development, which started in the summer of 2005, was to show whether the Grid was operational, since at that time people were rather pessimistic about the Grid, and to show what is going on in job processing “now”, detecting any problems or inefficiencies, not necessarily with a site but, for example, with a particular dataset, a particular RB instance or a particular application version. This is the reason why the application provides such wide flexibility to its users; a user can sort information by any of the job/task attributes recorded in the CMS Dashboard database.

The application offers an appropriate visualisation of the job processing data, providing navigation from the global to a detailed view and taking into account the requirements of the different categories of the users.
