New work from Glasgow has shed light on the efficiency of the Grid for physics analysis, highlighting the progress made and the work still to be done before the LHC switches on. Stuart Paterson, a PhD student, looked at the experience of LHCb physicists submitting jobs to the Grid, both before and during LHCb's latest data challenge (DC06). Job completion efficiencies of 95% for analysis jobs were measured in two different studies using statistics accumulated before DC06 commenced. However, over a six-month period that included the recent DC06 activity, when production work and analysis jobs were running together, the picture became more complex and the efficiency could drop to around 70%.
The LHCb analysis infrastructure is built on the DIRAC workload management system and the Ganga user interface - both GridPP-supported projects. Stuart worked on developing DIRAC to support distributed user analysis and has been investigating how the performance of DIRAC is affected by the simultaneous use of the Grid for production and analysis. He said, "While studies over a short period of time have yielded very high job completion efficiencies, issues which manifest themselves over longer periods of time can have a significant effect. In particular, the recent upsurge in physics analysis being submitted to the Grid has coincided with the 2006 data challenge, providing a good opportunity to study DIRAC under realistic conditions".
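As a rough illustration of what this looks like from the physicist's side, the sketch below shows the general shape of submitting a job to DIRAC through Ganga's Python interface. The class names (Job, Executable, Dirac) come from the Ganga Public Interface, but the application and arguments are placeholders rather than anything used in the study, and exact details vary between Ganga and LHCb software versions.

    # Illustrative sketch only: a trivial job routed to the DIRAC backend via Ganga.
    # Inside the interactive Ganga shell these classes are already in scope;
    # the application below is a placeholder, not a real LHCb analysis application.
    j = Job(name='analysis-example')
    j.application = Executable(exe='/bin/echo', args=['running analysis'])
    j.backend = Dirac()   # hand the job over to the DIRAC workload management system
    j.submit()            # DIRAC then delivers it to a Grid site via its Pilot Agents
    print(j.status)       # e.g. 'submitted', later 'running' and 'completed'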
Figure 1. Breakdown of results from 3,000 analysis jobs distributed over the Grid using the DIRAC Analysis System.
The results in Figure 1 show a drop in efficiency over the six-month period as a whole. Further examination of the sample indicates that the majority of job failures were due to transient effects which have since been resolved. For example, the 10% of jobs that failed to find input data did so because of inconsistencies in the file catalogue, which have since been identified and fixed. Omitting such transient effects from the sample, the job completion efficiency rises to 91%, with the remaining 9% due to factors such as intermittent power cuts and network outages. The matching times in Figure 2 indicate that the DIRAC scheduling approach scales well to the more demanding requirements of distributed data analysis tasks. Glenn Patrick, the UK Computing Co-ordinator for LHCb, emphasised the importance of Stuart's work to the collaboration, saying, "Leading up to the turn-on of the Large Hadron Collider, it is essential that physicists can access data from the LHCb detector in a reliable and transparent way. Stuart's work on DIRAC is proving vital in providing a well tuned analysis system for the experiment".
Figure 2. Matching times from 3,000 analysis jobs in the DIRAC Analysis System (matching time is defined as the time between a Pilot Agent requesting a job from the DIRAC WMS and the job being delivered to the computing resource).
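To make the definition concrete, here is a minimal, self-contained Python sketch of the pull-scheduling pattern that the matching time measures. The MockWMS and pilot_agent names are invented for illustration and are not the real DIRAC classes; the point is simply where the timed interval starts and stops.

    import time
    from collections import deque

    class MockWMS:
        """Hypothetical stand-in for the DIRAC WMS task queue."""
        def __init__(self, jobs):
            self.queue = deque(jobs)

        def match_job(self, site):
            # Match the next waiting job to the requesting resource (trivially, here).
            return self.queue.popleft() if self.queue else None

    def pilot_agent(wms, site):
        requested = time.time()    # Pilot Agent requests a job from the WMS ...
        job = wms.match_job(site)
        delivered = time.time()    # ... and the job is delivered to the computing resource
        print(f"{site}: matching time {delivered - requested:.6f} s, job={job}")

    wms = MockWMS(["analysis-job-001", "analysis-job-002"])
    pilot_agent(wms, "site-A")
    pilot_agent(wms, "site-B")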