INTERCONNECTS
NASA Study: Cell Processor Shows Promise for Climate Modeling
By Jarrett Cohen

A feasibility study funded by NASA's High-End Computing (HEC) Program shows that the IBM Cell Broadband Engine is a promising platform for climate modeling applications. The study is a collaboration among Goddard Space Flight Center's Software Integration and Visualization Office (SIVO), HEC's NASA Center for Computational Sciences (NCCS), and other organizations.

The Cell was originally developed for the Sony PlayStation 3 gaming console released in November 2006. The processor has since made its way into scientific computers, including the fastest supercomputer on Earth.

"If you really want to continue increasing performance, you have to deal with something beyond a single powerful processor," said study initiator and lead Shujia Zhou, a Northrop Grumman Corp. senior computer scientist working in SIVO. Zhou presented the study results at the 2008 International Supercomputing Conference (ISC '08), held June 18–20 in Dresden, Germany.

Up until a few years ago, scientific computing clusters linked chips with just one processor embedded in the silicon. The current trend, also seen in desktop computers and laptops, is to use multicore processors with two to four processing units each. Most multicore processors are homogeneous, having identical cores. The Cell is a heterogeneous multicore processor. It incorporates one standard PowerPC processor, named the Power Processing Element (PPE), and eight Synergistic Processing Elements (SPEs).
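As a rough illustration of what eight SPEs can deliver in aggregate, the sketch below multiplies out a theoretical single-precision peak. The clock rate, SIMD width, and flops-per-lane values are assumptions about the first-generation Cell rather than figures from the study, and the function name `peak_gflops` is hypothetical. Under those assumptions the result comes out close to the roughly 205 gigaflops commonly quoted for the chip.

```python
# Back-of-the-envelope peak throughput for the Cell's eight SPEs.
# Every hardware parameter here is an assumption made for illustration.
CLOCK_GHZ = 3.2        # assumed SPE clock rate
NUM_SPES = 8           # eight SPEs, as described in the article
SIMD_LANES_SP = 4      # assumed: four 32-bit floats per 128-bit SIMD register
FLOPS_PER_LANE = 2     # assumed: one fused multiply-add counts as two flops


def peak_gflops(clock_ghz, units, lanes, flops_per_lane):
    """Theoretical peak in gigaflops: clock x units x lanes x flops per lane."""
    return clock_ghz * units * lanes * flops_per_lane


per_spe = peak_gflops(CLOCK_GHZ, 1, SIMD_LANES_SP, FLOPS_PER_LANE)
all_spes = peak_gflops(CLOCK_GHZ, NUM_SPES, SIMD_LANES_SP, FLOPS_PER_LANE)

if __name__ == "__main__":
    print(f"per SPE:    {per_spe:.1f} gigaflops")
    print(f"eight SPEs: {all_spes:.1f} gigaflops")
```

A peak figure like this is an upper bound. Sustained performance depends on how well the code keeps the SIMD lanes and the memory system busy.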
The PlayStation 3 puts significant speed demands on the Cell, which can perform 205 gigaflops (billion floating-point operations per second) peak using single-precision math. Notably, the processor also has high bandwidth between processing units and memory: 25.6 gigabytes per second.

With a massive market keeping the Cell cost-effective, scientists and engineers have begun exploring its usefulness for their applications. "Because the Cell was designed for gaming, we really have to fill a gap with programming skills and software tools to take advantage of its dramatic performance," Zhou said.

Porting Yields Large Performance Gains
Rather than work with an entire climate model, the NASA study team decided to focus on one component from a production model used in day-to-day computing. They chose the solar radiation component from the Goddard Earth Observing System Model, Version 5 (GEOS-5), the flagship atmospheric model of the Global Modeling and Assimilation Office (GMAO). "Solar radiation is the ultimate energy source driving the climate," Zhou said, noting that it is absorbed or reflected by Earth's atmosphere and surface.

The solar radiation component and a related component for infrared radiation coming out from Earth together consume at least 20 percent of GEOS-5 computing time. Besides being computationally intensive, the solar radiation component was an effective study subject because of its relatively small size (~2,000 Fortran lines) and because its vertical columns can be computed independently, which eliminates the need for communication.

Porting the solar radiation component to the Cell environment required several modifications. Because there was no Cell Fortran compiler at the time, the study team's first step was to convert the code to C. An unexpectedly time-intensive modification was inserting "library calls" to use the Cell's Direct Memory Access to transfer data between the main memory and the SPEs, which perform the calculations. The team also had to determine how best to map the calculations across the eight SPEs. They ultimately put four columns of data onto each SPE. Zhou described this last step as "SIMDizing" the code (SIMD stands for Single Instruction, Multiple Data).

Initial development occurred on an IBM Cell simulator provided through the NCCS. The team later got access to an IBM BladeCenter QS20 system at the University of Maryland, Baltimore County's (UMBC) Multicore Computational Center. Using UMBC's BladeCenter, they ran the new C version of the solar radiation component on a single Cell processor to gauge performance.
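The mapping described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the study team's code. It reads "four columns per SPE" as batches of four columns handed to each SPE in turn and assumes a round-robin assignment. The names `make_batches`, `assign_batches`, `placeholder_column_physics`, and `run_columns` are hypothetical, and the per-column computation is a stand-in for the radiation physics. On real hardware each batch would arrive by a DMA transfer and the four columns would occupy the four SIMD lanes; here everything runs serially.

```python
NUM_SPES = 8            # eight SPEs, as in the article
COLUMNS_PER_BATCH = 4   # assumed reading of "four columns ... onto each SPE"


def make_batches(num_columns, batch_size=COLUMNS_PER_BATCH):
    """Split column indices 0..num_columns-1 into consecutive batches."""
    return [
        list(range(start, min(start + batch_size, num_columns)))
        for start in range(0, num_columns, batch_size)
    ]


def assign_batches(batches, num_workers=NUM_SPES):
    """Deal batches out round-robin; returns one list of batches per worker."""
    work = [[] for _ in range(num_workers)]
    for i, batch in enumerate(batches):
        work[i % num_workers].append(batch)
    return work


def placeholder_column_physics(column):
    """Stand-in for the per-column calculation (not the GEOS-5 code)."""
    return sum(column)


def run_columns(columns):
    """Process every column; no column needs data from any other column."""
    results = [None] * len(columns)
    for worker_batches in assign_batches(make_batches(len(columns))):
        # On a Cell, each worker (SPE) would run concurrently with the others.
        for batch in worker_batches:
            # One batch stands in for one DMA transfer into an SPE.
            for idx in batch:
                # SIMD code would process the columns of a batch together.
                results[idx] = placeholder_column_physics(columns[idx])
    return results


if __name__ == "__main__":
    work = assign_batches(make_batches(1024))
    print("batches per SPE for 1,024 columns:", [len(w) for w in work])
```

Because the columns are independent, the result does not depend on which worker handles which batch. That independence is what lets the work spread across the SPEs without communication between them.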
Using all eight SPEs, the Cell consistently calculated more than 3,000 columns per second. The study team made comparative runs of the original Fortran component on three Intel processors being used in the NCCS' Discover and Explore computing systems. These runs used one core per processor because, unlike with the Cell processor, performance does not scale linearly inside the processor as cores are added. For the largest case of 1,024 columns, the Cell outperformed the Intel processor cores as follows:

- 6.76 times faster than Intel Xeon Woodcrest (2.66 GHz),
- 8.91 times faster than Intel Xeon Dempsey (3.2 GHz), and
- 9.85 times faster than Intel Itanium2 (1.5 GHz)

"I am happy about these results because normally if you only do cache tuning on conventional processors like those from Intel or AMD, you get a 20 to 30 percent increase," Zhou said. "Now, with the IBM Cell, even against the Fortran baseline I see more than six-fold performance improvement. We could see even greater performance after optimizing the code. That really changes the way people look at disruptive technology."

Exploiting Next-Generation Hardware and Software

The Department of Energy's Los Alamos National Laboratory (LANL) is tapping the power of the Cell for its Roadrunner supercomputer. This IBM BladeCenter QS22/LS21 Cluster combines 12,240 of the newest-generation Cell chips with 6,562 AMD dual-core Opteron chips. This May, Roadrunner achieved a milestone in supercomputing: it ran the widely used Linpack benchmark at 1.026 petaflops, more than 1,000 trillion (1,000,000,000,000,000) flops.

At ISC '08 in Germany, Zhou spoke with the Roadrunner technical manager from LANL, who was excited about the study results and invited Zhou to use Roadrunner when it opens to researchers in October.

The new Cell processor in Roadrunner is more versatile for scientific and engineering computing. It performs 102.4 gigaflops peak for the double-precision math used by most such applications and maintains this speed while being energy- and space-efficient. UMBC expects to gain access to this new Cell in the coming months.

In addition, IBM has made several software improvements to support research applications. Zhou said that an "auto-SIMD" feature would reduce porting costs. The most significant development was the release of new Fortran compilers in January 2008.

The C language is widely used throughout the computer industry and is the first to be supported by new processors such as Cell and NVIDIA's GPGPU (General Purpose Graphics Processing Unit). However, "a considerable number of high-performance computing applications are still in Fortran," Zhou said. For instance, climate and weather models are mostly written in Fortran.

Zhou is currently working towards exploiting the Cell Fortran compilers with a hybrid version of the solar radiation component. "I found a way to combine the C and Fortran code to make the Fortran code run on the Cell," he explained. IBM and LANL personnel at ISC '08 were particularly impressed with this pioneering aspect of the study.

More broadly, the GEOS-5 atmospheric model has more than a dozen physics components for representing turbulence, moisture, chemistry, and other processes, which collectively take about 50 percent of GEOS-5 computing time. Because these components are similar to the radiation component, Zhou believes that the Cell processor could provide comparable performance benefits to a significant fraction of the model.

The GMAO and other Earth system modelers are looking to use more sophisticated components, which will have much greater computational demands. For instance, cloud-resolving models will require more than a 10-fold increase in computing power. "Using disruptive acceleration technologies such as the Cell is a good candidate for achieving these goals," Zhou said.

On July 18, a session organized by Zhou and collaborators on "Emerging Multicore Computing Technology in Earth and Space Sciences" was accepted for the 2008 American Geophysical Union Fall Meeting, December 15–19, San Francisco, CA. Presenters will discuss issues and solutions with multicore and many-core processors.

The feasibility study team consisted of Shujia Zhou, Carlos Cruz, and Bruce Van Aartsen, SIVO/Northrop Grumman; Daniel Duffy, Mike Chyatte, and Theresa Held, NCCS/Computer Sciences Corp.; Tom Clune, SIVO; Max Suarez, GMAO; Samuel Williams, University of California, Berkeley; and Milt Halem, UMBC.

-----------------------------------------------------------

A PDF version of the ISC '08 poster is available at: isc08_cell_poster.pdf
For more information about the Cell feasibility study, contact: Shujia Zhou, shujia.zhou@nasa.gov
For more information about the Software Integration and Visualization Office, visit: sivo.gsfc.nasa.gov/
For more information about the NASA Center for Computational Sciences, visit: www.nccs.nasa.gov/
For more information about the GEOS-5 model, visit: gmao.gsfc.nasa.gov/systems/geos5/