GridPP helps drive prize-winning Dashboard system

A monitoring system that uses GridPP resources is helping to improve the Grid. It also impressed judges at the EGEE'06 conference in Geneva, coming away with a €500 prize for best application of Grid technology.
Development began at CERN a year ago to provide an overview of the distributed computing system for CMS. The result was the Dashboard, which gathers information about job processing, data transfer and data publishing from multiple sources, including R-GMA, the Imperial College Real Time Monitoring Database and the MonALISA monitoring system. The system can display quantities such as the number of jobs, usage of resources, application behaviour and distribution of data. All the monitoring data is recorded in an Oracle database at CERN and can be accessed via a web interface or downloaded in formats such as CSV and XML. During development, support for ATLAS was added, and the Dashboard is now a vital tool used by both experiments.
The work is being carried out by Julia Andreeva and her colleagues Juha Herrala, Benjamin Gaidioz, Ricardo Brito Da Rocha, Catalin Cirstoiu, Pablo Saiz and Craig Munro of GridPP, in collaboration with ASGC in Taipei. "The Dashboard presentation by Julia Andreeva and her team at CERN demonstrated a great device for monitoring the status of Grid resources and the applications using them," said Professor Alan Blatecky, Deputy Director of RENCI and head of the selection committee for best demonstration.
Craig Munro worked on the team developing the user interface, which enables users to examine the relevant information quickly and easily. "The past year, while at CERN, has been an exciting time to be a part of the dynamic group developing the CMS Dashboard, and it's wonderful to be a part of a team that won the EGEE demo prize," says Munro.
At EGEE'06 the team also presented a poster showing how the Dashboard drives another useful application, having been expanded to watch all job failures on the Grid. "We use Dashboard to monitor job failures," says Pablo Saiz, presenting the poster, "then we interpret common error messages so that we can fix them to make data transfer much more stable."
In the last three months, the team has used the monitoring system to analyse more than 400,000 job attempts. One of the most common error messages they found was 'no compatible resources', meaning that the resource broker could not match the requirements the user had specified. The team investigated the source of the miscommunication and found that the problem lay in the BDII storage. This was reported to the BDII developers, who worked on the system to make it more stable.
Andreeva's team also uses the Dashboard to create automatic reports for the CMS and ATLAS virtual organizations. The reports show the performance of individual sites on the Grid and the success rate of the applications at those sites, and indicate the nature of any problems. "This can highlight which sites are not behaving properly, so that problems can be attended to and jobs are more successful," says Saiz.
You can find more information about the prize-winning monitoring system on its web site.
Thanks to Helen Thomson for reporting.