Ames Laboratory scientist receives 2006 IBM Faculty Award

"This is a very competitive program, so this award demonstrates that Brett's research topic of 'Control Systems for Peta-Scale Computing' has attracted considerable interest within the corporation," said Sam Ellis, IBM Program Manager for BlueGene System Software Development and Campus Relationship Manager for Iowa State University.

Petascale computing is the push toward developing a supercomputer with a peak performance of more than one petaflop. The DOE's Office of Science has made this goal a high priority, spearheading it through its Advanced Scientific Computing Program. Bode has been a key player in the program's efforts to reach that objective through his involvement in the Scalable Systems Software Project, which aims to design a fully integrated systems software suite that the nation's largest scientific computing centers can use for the cost-effective management and use of their computational resources.

A petascale supercomputer would be able to perform one quadrillion mathematical calculations per second. Today's high-end supercomputers operate at teraflop speeds, where one teraflop is one trillion calculations per second, so a petaflop is one thousand times a teraflop. Imagine the potential benefits that level of computing power would afford in advancing science and engineering!

"All of the real high-end systems today are based on some sort of distributed computing architecture that includes multiple independent operating system nodes," said Bode. "The fastest supercomputer in the world is IBM's BlueGene located at Lawrence Livermore National Lab. It's a 280-teraflop system made up of 64 thousand nodes containing 131,072 processors – just above a quarter of a petaflop – which means there's still a factor of four to go before we reach the petaflop level. So it looks like the petascale system will be one with an enormous number of processing elements."

Bode said one of the challenges then becomes finding a way to manage all of those elements.
How do you locate faults in a system that big, and how do you maintain system uptime when there are faults? The IBM Faculty Award will provide Bode with a means to work on those problems. He will be collaborating with IBM researchers on a project to design fault-tolerant control systems that can identify and manage faults as they occur, with the goal of keeping total system uptime as high as possible.

"The project with IBM will be somewhat of a follow-on to the work we did with the Scalable Systems Software Project, and it will extend the collaboration with IBM on the BlueGene architecture that we've had in place for the past several years," said Bode. "We know we can't prevent faults per se, so we have to manage them as they occur. The ideal is to predict them ahead of time. If we know where a component is going to die, we can get the running job off of it. But that's a tough problem – not all faults are predictable," he added.

"You'll always have some faults that will take out at least the jobs running on those particular nodes, but you want to isolate that to just the minimum number of jobs that can be affected so you can maximize the investment in the machine. That's what we'll be working on with IBM."