ACADEMIA
DOE Taps Stottler Henke to Improve the Reliability of Clusters & Grids
Stottler Henke Associates, Inc. (www.stottlerhenke.com) today announced it has been tapped by the U.S. Department of Energy (DOE) to develop "smart job recovery" software that detects, diagnoses, and recovers from problems encountered by long-running batch programs, improving the quality of service provided by computer clusters and grids.

Stottler Henke, a software development firm specializing in innovative applications that incorporate artificial intelligence and other advanced technologies, has been awarded a $750,000 Small Business Innovation Research (SBIR) contract from the DOE to develop the Agent-Based High Availability (ABHA) system. The goal of ABHA is to let computer clusters process long-running batch jobs more reliably by detecting and diagnosing problems so that the system can determine how best to restart those jobs and, if possible, continue execution. With long-running batch jobs, restarting from the beginning wastes time, computing resources, and money.

Lawrence Berkeley National Laboratory (LBNL) is one of Stottler Henke's partners on the ABHA project. According to Douglas Olson, a nuclear physicist in the lab's Computational Research Division, several of the lab's experiments push the performance limits of the computing infrastructure common to experimental physics: clusters of commodity computers running Linux.

"With commodity hardware, some level of failure is to be expected," Olson said. "As we transition into the widely distributed mode of grid computing, the normal failure rates in such a complex system make efficient use of the resources extremely challenging. The key is to be resilient when a failure occurs; we expect ABHA will afford us that resiliency. Without it, we could not use commodity computers, and our hardware costs would be 10 to 100 times greater."

Charu Chaubal, a grid computing technologist at Sun Microsystems, agrees that devising failure recovery technology is vital to migrating grid computing from the academic world into mainstream business environments. "Business computing has a more stringent requirement for high availability," Chaubal said. "And as the grids get larger, failure recovery becomes more critical. That's why initiatives like ABHA are so timely."

Computer clusters are groups of low-cost processors and storage devices, interconnected by networks and coordinated by workload management software, that present the appearance of a single, high-performance system. Clustered computing provides low-cost parallel processing of resource-intensive jobs that decompose naturally into parallel tasks. Use of computer clusters is growing rapidly within industry, government, and universities to support scalable, cost-effective scientific computing.

Computer jobs can fail for many reasons, however, such as transient and permanent hardware failures; software configuration errors; insufficient computing, storage, or network resources; and ill-behaved applications. As jobs grow in size and exploit larger and more powerful clusters, it becomes increasingly likely that a failure will occur during execution. When these job failures occur, simplistic job recovery policies lead to low quality of service and inefficient use of the cluster's resources. For example, if the original cause of the job failure no longer exists, aborting the entire job is unnecessarily pessimistic: it delays job completion and wastes the computing resources already spent running the job's upstream tasks, reducing the system's throughput. On the other hand, when the cause of the failure still persists, automatically restarting tasks wastes computing resources because the restarted task will fail again for the same reason.
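The contrast can be made concrete with a small sketch. The Python below is not from the announcement and is not ABHA's logic; the fault classes, the diagnose() stub, and the node names are all hypothetical stand-ins. It simply illustrates why a policy that classifies the failure before acting can outperform a blind restart-in-place loop: a persistent fault defeats every in-place retry, while a diagnosis-aware policy moves the task off the bad node.

    import enum
    import random
    import time

    class Fault(enum.Enum):
        """Broad fault classes a diagnosis step might distinguish (illustrative)."""
        TRANSIENT = enum.auto()    # e.g., a momentary network or disk hiccup
        PERSISTENT = enum.auto()   # e.g., a misconfigured or failing node

    def run_task(task_id: str, node: str) -> bool:
        """Stand-in for running one task of a batch job; True means success."""
        return random.random() > 0.4

    def diagnose(task_id: str, node: str) -> Fault:
        """Stand-in for diagnosis; a real system would inspect exit codes,
        logs, and node health to classify the failure."""
        return random.choice(list(Fault))

    def blind_restart(task_id: str, node: str, attempts: int = 3) -> bool:
        """Naive policy: retry in place no matter why the task failed. If the
        cause persists (a bad node, say), every retry repeats the failure."""
        return any(run_task(task_id, node) for _ in range(attempts))

    def diagnosis_aware(task_id: str, nodes: list, attempts: int = 3) -> bool:
        """Diagnose first, then match the response to the fault: back off and
        retry in place for transient faults, move to another node otherwise."""
        node = nodes[0]
        for attempt in range(attempts):
            if run_task(task_id, node):
                return True
            if diagnose(task_id, node) is Fault.TRANSIENT:
                time.sleep(2 ** attempt)      # wait out the transient cause
            else:
                node = random.choice(nodes)   # avoid the problematic node
        return False

    if __name__ == "__main__":
        print(diagnosis_aware("job-42/task-7", ["node-a", "node-b", "node-c"]))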
To provide high throughput and high reliability, it is necessary to determine the cause of a task failure in enough detail to select and execute the appropriate recovery. Unfortunately, automated job recovery is currently impractical because it is difficult, time-consuming, and error-prone for end users to implement the specific fault detection, diagnosis, and recovery algorithms needed by each job.

Stottler Henke's ABHA system will monitor the execution of each task, detect task failures, diagnose their cause, and recover intelligently by applying knowledge of the cluster's configuration and topology, knowledge of each job's decomposition into parallel and sequential tasks, and optional knowledge of each task's resource requirements. Possible job recovery actions include aborting the task or rescheduling it for immediate or future execution, possibly modifying job parameters to avoid problematic portions of the cluster.

Stottler Henke is working with LBNL (Berkeley, Calif.) to design and prototype the ABHA system to support the lab's clusters of Linux-based computers managed by both the Condor (developed at the University of Wisconsin) and LSF (developed by Platform Computing) workload management systems. The company expects that core technologies developed during this project will also support clusters and grids built on other hardware platforms and operating systems. Stottler Henke is also seeking interest and participation from vendors and end-user organizations that provide or employ clustered computing.
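The announcement lists failure causes and recovery actions but, understandably, not ABHA's internal decision logic. Purely as an illustration of the cause-to-action mapping described above, the sketch below pairs the failure causes named in this article with the recovery actions it mentions; every name and every rule here is a hypothetical example, not ABHA's actual policy.

    import dataclasses
    import enum

    class Cause(enum.Enum):
        """Failure causes named in the article (classification is illustrative)."""
        TRANSIENT_HARDWARE = enum.auto()
        PERMANENT_HARDWARE = enum.auto()
        MISCONFIGURATION = enum.auto()
        INSUFFICIENT_RESOURCES = enum.auto()
        ILL_BEHAVED_APPLICATION = enum.auto()

    class Action(enum.Enum):
        """Recovery actions the article describes."""
        RETRY_NOW = enum.auto()             # reschedule for immediate execution
        RETRY_LATER = enum.auto()           # reschedule for future execution
        RESCHEDULE_ELSEWHERE = enum.auto()  # avoid problematic parts of the cluster
        ABORT = enum.auto()                 # a restart would fail the same way

    @dataclasses.dataclass
    class TaskFailure:
        task_id: str
        node: str
        cause: Cause

    def choose_recovery(failure: TaskFailure) -> Action:
        """Map a diagnosed cause to a recovery action: an ill-behaved application
        fails everywhere, so abort; a resource shortage may clear with time, so
        requeue; node-specific faults call for moving the task elsewhere; a
        transient fault merits an immediate retry."""
        if failure.cause is Cause.ILL_BEHAVED_APPLICATION:
            return Action.ABORT
        if failure.cause is Cause.INSUFFICIENT_RESOURCES:
            return Action.RETRY_LATER
        if failure.cause in (Cause.PERMANENT_HARDWARE, Cause.MISCONFIGURATION):
            return Action.RESCHEDULE_ELSEWHERE
        return Action.RETRY_NOW

    if __name__ == "__main__":
        f = TaskFailure("job-42/task-7", "node-03", Cause.PERMANENT_HARDWARE)
        print(choose_recovery(f))  # Action.RESCHEDULE_ELSEWHERE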