Petascale chemistry

By J. William Bell, NCSA -- Computational chemists are rising to the challenge of developing applications for petascale computing.

Petascale computing is rapidly approaching. The Department of Energy's Office of Science plans to install a computing system with a peak performance of one petaflop at Oak Ridge National Laboratory in 2009, and the National Science Foundation plans to install a computing system with a sustained performance of one petaflop by 2011. Potential breakthroughs from petascale computing abound -- the NSF solicitation for the sustained-petaflop system listed more than 30 science and engineering problems expected to benefit.

"Chemistry is one area of science that needs petascale computing to improve current molecular models and tackle more complex molecular phenomena, including nanoscale systems. Prediction of the structures, energetics, and reactivity of molecules is computationally intensive, and chemists have long been in the vanguard of high-performance computing," says Thom Dunning, NCSA's director.

But tapping the power of petascale computers will require a new generation of scalable, parallel chemistry codes, because petascale machines will obtain their power by harnessing hundreds of thousands of processors, not the thousands found in today's high-end systems. Chemists tackled a similar problem in the 1990s, when "massively" parallel computers had a few hundred processors. That effort identified a few key concepts that will again be critical as applications scale to sustained petascale performance, says Robert J. Harrison, who leads the Computational Chemical Sciences Group at Oak Ridge National Laboratory and is a member of the chemistry faculty at the University of Tennessee, Knoxville.

One of these concepts is modularity. Codes being built or revised to reach petascale performance will need to accommodate new models and algorithms, and the cost of development -- in person-hours and in dollars -- is so high that tomorrow's application scientists must be able to leverage investments that have already been made. That is far easier when codes are modular.

As an example of modularity, Harrison points to NWChem, a chemistry application for parallel computers initially developed in the early 1990s at the Department of Energy's Environmental Molecular Sciences Laboratory and widely used by NCSA researchers. Harrison was NWChem's chief architect; Dunning instigated and oversaw the project. "NWChem creates a path that allows smaller groups to implement codes more quickly," Harrison says. "It's important to think of NWChem as a framework for chemical computation rather than as just an application," Dunning says. "That's an ideal that's been with NWChem from the outset and that similar projects -- now and going forward -- will benefit from."

Modularity alone, however, does not solve the problem of scaling to hundreds of thousands of processors; that will require new algorithms. Chemistry codes use standard mathematical libraries (for example, BLAS, ScaLAPACK, and PeIGS) wherever possible, so they can leverage the investments being made in the scalability of those packages. Other algorithms, though, are specific to chemistry, and advancing their scalability will require dedicated teams of chemists, mathematicians, and computer scientists. In some cases, entirely new algorithms may be required.
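To make the modularity idea concrete, the sketch below shows -- in Python, with class names and interfaces invented for illustration rather than taken from NWChem -- how a small driver can be written against abstract "integral provider" and "eigensolver" interfaces, so either piece can be swapped without touching the rest. The default solver simply delegates to LAPACK via SciPy, the same kind of leverage chemistry codes get from BLAS and ScaLAPACK.

```python
# Hypothetical sketch of a modular quantum-chemistry driver. The driver
# knows nothing about how integrals are computed or how the eigenproblem
# is solved; each piece can be replaced independently. These interfaces
# are invented for illustration and are not NWChem's APIs.
from abc import ABC, abstractmethod

import numpy as np


class IntegralProvider(ABC):
    """Supplies the matrices a method needs; how they are computed is hidden."""

    @abstractmethod
    def core_hamiltonian(self) -> np.ndarray: ...

    @abstractmethod
    def overlap(self) -> np.ndarray: ...


class Eigensolver(ABC):
    """Solves the generalized eigenproblem H C = S C e."""

    @abstractmethod
    def solve(self, h: np.ndarray, s: np.ndarray): ...


class LapackEigensolver(Eigensolver):
    """Delegates to LAPACK via SciPy: improvements to the standard library
    benefit every module written against this interface."""

    def solve(self, h, s):
        from scipy.linalg import eigh
        return eigh(h, s)  # eigenvalues (ascending) and eigenvectors


class ToyIntegrals(IntegralProvider):
    """A stand-in two-orbital model, just so the driver can be exercised."""

    def core_hamiltonian(self) -> np.ndarray:
        return np.array([[-1.0, -0.2], [-0.2, -0.5]])

    def overlap(self) -> np.ndarray:
        return np.eye(2)


def one_electron_energy(integrals: IntegralProvider,
                        solver: Eigensolver, n_occ: int) -> float:
    """A toy driver: build H and S, diagonalize, sum the occupied levels."""
    h, s = integrals.core_hamiltonian(), integrals.overlap()
    eps, _ = solver.solve(h, s)
    return 2.0 * float(np.sum(eps[:n_occ]))  # closed-shell double occupancy


print(one_electron_energy(ToyIntegrals(), LapackEigensolver(), n_occ=1))
```

The point is not the toy numerics but the seam: a group that writes a faster eigensolver or a new integral package plugs it in behind the same interface, which is the kind of reuse Harrison and Dunning describe.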
Another concept for application developers to keep in mind as they conceive of petascale codes is object-based virtualization. Virtualization refers to representing multiple resources or processes as single entities. That might mean automating the way applications are spread across hundreds of thousands of processor cores so that users don't have to manage the mapping themselves, or it might mean hiding the complexity of how data is fetched from memory during a calculation.

Here, Harrison points to Charm++. "We need to place a greater emphasis on changing programming models. The complexity of a petascale machine is much greater. [Those using petascale applications] will need to be working at much higher levels of abstraction than they're allowed to today," Harrison says.

Charm++ is a parallel programming system used by a variety of high-performance codes in disciplines such as computational fluid dynamics, biophysics, and cosmology, and it runs on some of the largest systems at NCSA and around the world. Charm++ supports virtualization techniques such as dynamic load balancing, which migrates calculations among processors so that they work together more efficiently. It also provides a means of hiding memory latency.

With Charm++, "a processor is allocated to an object only when a message for the object is received. This means when a processor is waiting for a message, another object may execute on it. It also means that a single processor may wait for any number of distinct messages and will be awakened when any of these messages arrives. Thus it's an effective way of scheduling a processor in the presence of potentially large latencies," explains Laxmikant Kale, a computer science professor at the University of Illinois who leads the development of Charm++.

The performance advantages of using Charm++ in chemistry codes can be substantial. The LeanCP project (also known as OpenAtom), a collaboration with Glenn Martyna of IBM's T.J. Watson Research Center and Mark Tuckerman of New York University, is devoted to extreme scaling of Car-Parrinello ab initio molecular dynamics simulations. LeanCP runs on various NCSA computers and efficiently scales 256-molecule water simulations to all 40,960 Blue Gene/L processors of the machine at the Watson lab. The team plans extensions for petascale computers.

Kale works closely with NCSA in planning for the deployment and use of emerging petascale systems and applications. His research team is also developing BigSim, a system simulator and emulator that allows scientists to develop, debug, and predict the performance of applications on petascale machines before the machines are available. "This way applications can be ready when the machine first becomes operational," Kale says.
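Kale's description of message-driven scheduling can be illustrated with a small, self-contained simulation. The sketch below is plain Python, not the Charm++ API, and its class and method names are invented. It shows the core idea of overdecomposition: many small work objects share one processor, and the processor runs whichever object has a message waiting, so it never sits idle waiting on any single object's data. Charm++'s dynamic load balancing, which migrates such objects between processors at run time, is not shown.

```python
# Illustrative simulation of message-driven scheduling in the Charm++ style.
# Not the Charm++ API; names are invented for this sketch.
from collections import deque


class WorkObject:
    """One of many small, migratable work units (a 'chare' in Charm++ terms)."""

    def __init__(self, name):
        self.name = name
        self.partial_sum = 0.0

    def on_message(self, value):
        # An entry method: it runs only when data addressed to this object arrives.
        self.partial_sum += value
        print(f"{self.name} received {value}, partial sum is now {self.partial_sum}")


class Processor:
    """A physical core hosting several virtualized objects."""

    def __init__(self):
        self.objects = {}
        self.inbox = deque()  # queue of (target object, payload) messages

    def deliver(self, target, value):
        self.inbox.append((target, value))

    def run(self):
        # The core is given to an object only when that object has a message;
        # objects still waiting for data simply stay idle and cost nothing.
        while self.inbox:
            target, value = self.inbox.popleft()
            self.objects[target].on_message(value)


# Overdecomposition: four objects share a single (simulated) processor.
p = Processor()
for i in range(4):
    p.objects[f"obj{i}"] = WorkObject(f"obj{i}")

# Messages arrive in arbitrary order; the processor stays busy as long as
# any of its objects has work, rather than blocking on one object's data.
p.deliver("obj2", 1.5)
p.deliver("obj0", 2.0)
p.deliver("obj2", 0.5)
p.run()
```

In a real Charm++ program the messages arrive over the network from objects on other processors, which is what lets this style of scheduling hide large communication latencies.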