IBM to build Opteron-Cell supercomputer

According to a report by CNET News.com, IBM has won the contract to build a supercomputer called “Roadrunner” at Los Alamos National Laboratory that will combine Opteron chips with the Cell processor used in the Sony PS3. U.S. Senator Pete Domenici welcomed official action on the LANL bid to acquire what will eventually be the world’s fastest supercomputer. Domenici, chairman of the Senate Energy and Water Development Appropriations Subcommittee, provided $35 million in FY2006 funding to begin a three-phase program to acquire a supercomputer able to run at a sustained performance level of 1 petaflop, or one thousand trillion (10^15) calculations per second. LANL, through the National Nuclear Security Administration, issued the request for proposals in May to begin phase one of the effort. “LANL currently has some of the most limited computational capabilities of all the DOE laboratories. That will change with this new petaflop computer, which will fill an immediate need to increase the lab’s computing capabilities,” Domenici said.

Separately, Japan's Institute of Physical and Chemical Research, known as RIKEN, announced that it had completed its Protein Explorer supercomputer. RIKEN claims the Protein Explorer reached the petaflop level, although that figure was not measured with the conventional Linpack supercomputing speed test and has not been confirmed by SC Online. The Roadrunner system, along with the Protein Explorer and the seventh-fastest supercomputer, Tokyo Institute of Technology's Tsubame system built by Sun Microsystems (SC Online READERS' CHOICE PRODUCT OF 2005: Sun Microsystems Sun Fire servers), illustrates a new trend in supercomputing: combining general-purpose processors with special-purpose accelerator chips. IBM's BladeCenter systems are amenable to this hybrid approach: a single chassis can accommodate both general-purpose Opteron blade servers and Cell-based accelerator systems.
The BladeCenter chassis includes high-speed communication links among the servers, and one source said the blades will be used in Roadrunner. Advanced Micro Devices' Opteron processor is used in supercomputing "cluster" systems that spread computing work across numerous small machines joined by a high-speed network. In the case of Roadrunner, the Cell processor, designed jointly by IBM, Sony and Toshiba, provides the special-purpose accelerator. Cell originally was designed to improve video game performance in the PS3 console. The single chip's main processor core is augmented by eight special-purpose processing cores that can help with calculations such as simulating the physics of virtual worlds. Those engines also are amenable to scientific computing tasks, IBM has said.

Power is a consideration as well: a watt of continuous draw costs about a dollar a year, so a 10-megawatt system equates to roughly $10 million a year in operating expenses.

The Los Alamos-IBM alliance is newsworthy for another reason. The Los Alamos lab has traditionally favored supercomputers from manufacturers other than IBM, including SGI, Compaq and Linux Networx. Its sister lab and sometime rival, Lawrence Livermore, has had the Big Blue affinity, housing the current top-ranked supercomputer, Blue Gene/L. It has a sustained performance of 280 teraflops, just more than one-fourth of the way to the petaflop goal.

In June, Appro announced the award of the Peloton supercomputing project from Lawrence Livermore National Laboratory. The deployment will consist of three Appro 1U Quad XtremeServer Clusters (SC Online EDITORS' CHOICE PRODUCT OF 2005) with a total of 16,128 cores based on the latest Dual-Core AMD Opteron Processors. The High-Performance Computing (HPC) solution includes a two-stage 20 Gb/s 4x Double Data Rate (DDR) InfiniBand fabric featuring Voltaire edge and spine switches and Mellanox dual-port DDR InfiniBand HCAs.
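The power-cost rule of thumb above (one watt of continuous draw is about a dollar a year) can be checked with a quick calculation. The electricity rate below is an assumption chosen to make the rule come out roughly even, about $0.114 per kilowatt-hour:

```python
# Rough annual electricity cost for a system running around the clock.
# The rate is an assumed illustrative price, ~$0.114/kWh, which makes
# "1 watt of continuous draw ~ $1 per year" come out about right.
HOURS_PER_YEAR = 24 * 365   # 8,760 hours
RATE_PER_KWH = 0.114        # assumed electricity price, $/kWh

def annual_cost_dollars(watts):
    kwh_per_year = watts / 1000.0 * HOURS_PER_YEAR
    return kwh_per_year * RATE_PER_KWH

print(annual_cost_dollars(1))     # ~ $1 for one watt-year
print(annual_cost_dollars(10e6))  # 10 MW -> ~ $10 million per year
```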
In order to affordably and efficiently provide this production-quality computing capacity, the cluster of servers will be deployed in Scalable Unit (SU) groups.

The U.S. government has become an avid supercomputer customer, using the machines for simulations to ensure nuclear weapons will continue to work even as they age beyond their original design lifespan. Such physics simulations have grown increasingly sophisticated, moving from two dimensions to three, but more detail is always better. Los Alamos expects Roadrunner will increase the detail of simulations by a factor of 10.

IBM and petaflop computing are no strangers. Although customers can buy the current Blue Gene/L systems or rent their processing power from IBM, Blue Gene actually began as a research project in 2000 to reach the petaflop supercomputing level.

Though it was designed as the heart of the upcoming Sony PlayStation 3 game console, the processor dubbed STI Cell has created quite a stir in the computational science community, where its potential as a building block for high-performance computers has been widely discussed and speculated upon. To evaluate that potential, Berkeley Lab computer scientists measured the processor's performance on several scientific-application kernels, then compared this performance against other processor architectures. The results of the group's evaluation were presented at the ACM International Conference on Computing Frontiers, held in May 2006 in Ischia, Italy, in a paper by Samuel Williams, Leonid Oliker, Parry Husbands, Shoaib Kamil, and Katherine Yelick of the Future Technologies Group in Berkeley Lab's Computational Research Division, and by John Shalf from DOE's National Energy Research Scientific Computing Center (NERSC). "Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency," the authors report.
"We also conclude that Cell's heterogeneous multicore implementation is inherently better suited to the HPC" -- high-performance computing -- "environment than homogeneous commodity multicore processors."

Cell, designed by a partnership of Sony, Toshiba, and IBM (the STI in STI Cell), couples a high-performance, software-controlled memory hierarchy with the considerable floating-point resources required for demanding numerical algorithms. Cell takes a radical departure from conventional multiprocessor or multicore architectures. Instead of using identical cooperating processors, it uses a conventional high-performance PowerPC core that controls eight single-instruction, multiple-data (SIMD) cores called synergistic processing elements (SPEs), each of which contains a synergistic processing unit, a local memory, and a memory-flow controller.

Beyond its departure from mainstream general-purpose processor designs, Cell is particularly compelling because the intended game market means it will be produced in high volume, making it cost-competitive with commodity central processing units. Moreover, the growth of commodity microprocessor clock rates is slowing as chip power demands increase, and these worrisome trends have motivated the community of computational scientists to consider alternatives like STI Cell.

Playing the science game

The authors examined the potential of using the STI Cell processor as a building block for future high-end parallel systems by investigating performance across several key scientific computing kernels: dense matrix multiplication, sparse matrix-vector multiplication, stencil computations on regular grids, and one- and two-dimensional fast Fourier transforms. According to the authors, the current implementation of Cell is noted for its extremely high-performance single-precision (32-bit) floating-point resources.
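One of the kernels named above, a stencil computation on a regular grid, is simple enough to sketch. This tiny pure-Python version applies a three-point averaging stencil in one dimension; it is an illustration of the kernel's structure only, not the authors' hand-optimized Cell code, and the weighting scheme is a generic choice:

```python
# 1-D three-point stencil on a regular grid: each interior point is
# replaced by a weighted average of itself and its two neighbors.
# Illustrative sketch only; real stencil kernels are typically 2-D or
# 3-D and heavily tuned for the target memory hierarchy.
def stencil_1d(grid, alpha=0.5):
    out = list(grid)  # boundary points are copied through unchanged
    for i in range(1, len(grid) - 1):
        out[i] = alpha * grid[i] + (1 - alpha) * 0.5 * (grid[i - 1] + grid[i + 1])
    return out

data = [0.0, 0.0, 4.0, 0.0, 0.0]
print(stencil_1d(data))  # the spike at the center spreads to its neighbors
```

Kernels like this touch every grid point but perform only a few flops per point, which is exactly why the paper's memory-bandwidth arguments matter for them.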
The majority of scientific applications, however, require double precision (64 bits). Although Cell's peak double-precision performance is still impressive compared with its commodity peers (eight SPEs running at 3.2 gigahertz yield 14.6 billion floating-point operations per second), the group showed how a design with modest hardware changes, which they named Cell+, could improve double-precision performance.

The authors developed a performance model for Cell and used it to make direct comparisons of Cell against the AMD Opteron, Intel Itanium 2, and Cray X1 architectures. The performance model was then used to guide implementation development that was run on IBM's Full System Simulator, in order to provide even more accurate performance estimates.

The authors argue that Cell's three-level memory architecture, which decouples main memory accesses from computation and is explicitly managed by software, provides several advantages over mainstream cache-based architectures. First, performance is more predictable, because the load time from an SPE's local store is constant. Second, long block transfers from off-chip DRAM (dynamic random access memory) can achieve a much higher percentage of peak memory bandwidth than individual cache-line loads. Finally, for predictable memory-access patterns, communication and computation can effectively be overlapped by careful scheduling in software.

"Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency," the authors state. While their current analysis uses hand-optimized code on a set of small scientific kernels, the results are striking. On average, Cell is eight times faster and at least eight times more power-efficient than current Opteron and Itanium processors, despite the fact that Cell's peak double-precision performance is fourteen times slower than its peak single-precision performance.
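The flavor of such a performance model can be sketched in a few lines. This is a deliberately minimal bound-based model, far simpler than the authors' actual one: a kernel's time is the larger of its compute bound and its memory bound, assuming computation and DMA transfers overlap perfectly. The peak numbers are Cell's widely cited figures (14.6 Gflop/s double-precision, 25.6 GB/s main-memory bandwidth); the example kernel's flop and byte counts are made-up illustrative values:

```python
# Minimal bound-based performance model in the spirit described above.
# A kernel is limited either by peak flop rate or by memory bandwidth,
# whichever bound takes longer (perfect compute/transfer overlap assumed).
PEAK_DP_FLOPS = 14.6e9  # Cell double-precision peak, flop/s
PEAK_BW = 25.6e9        # Cell main-memory bandwidth, bytes/s

def predicted_time(flops, bytes_moved):
    """Return the larger of the compute-bound and memory-bound times."""
    return max(flops / PEAK_DP_FLOPS, bytes_moved / PEAK_BW)

# Illustrative kernel: it moves far more bytes than it computes flops,
# so under this model it comes out memory-bound.
flops, bytes_moved = 2e8, 1.2e9
limit = "memory-bound" if bytes_moved / PEAK_BW > flops / PEAK_DP_FLOPS else "compute-bound"
print(predicted_time(flops, bytes_moved), limit)
```

Low flop-per-byte kernels such as sparse matrix-vector multiplication land on the memory-bound side of this model, which is why the paper stresses Cell's ability to sustain a high fraction of its memory bandwidth with long block transfers.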
If Cell were to include at least one fully usable pipelined double-precision floating-point unit, as proposed in the Cell+ implementation, these performance advantages would easily double.