Roskies Comments on Last Week’s Delivery of PSC’s Terascale Processors

By Steve Fisher, Editor In Chief -- When fully installed later this year, the Pittsburgh Supercomputing Center’s new system will be used for large-scale modeling in areas that include the life sciences, weather forecasting and climate change, and will be the most powerful system in the world available for public research. To learn more, Supercomputing Online interviewed Ralph Roskies, Co-Scientific Director, PSC.

Supercomputing: You just received the first high performance processors for PSC's Terascale Computing System. This has to be an exciting time for everyone at PSC. Please tell the readers a bit about the new processors.

ROSKIES: The processors are the latest generation of Compaq Alphas, called EV68s. They run at a gigahertz, with two floating point instructions per clock cycle, so they each have a peak speed of two gigaflops. Altogether we will have 3000 of these processors, for a 6-teraflop machine. The processors come in 4-way SMP nodes, each of which will have 4 GB of memory, for a total of 3 TB of memory. That’s three times as much memory as our current fastest supercomputer, the T3E, has disk! What also distinguishes the Compaq processors, apart from their speed, is their bandwidth to memory, which provides better sustained performance, which is what really matters for the scientific applications we will enable.

Supercomputing: The full Terascale System is slated to be installed by October first. When do you anticipate the system being fully operational?

ROSKIES: Our style is to allow ‘friendly users’ on the system rather early. These are people who can get a lot of scientific work done and understand that the machine may not be entirely stable, or that we may have to take it down on very short notice. We plan to have NSF-allocated users on the machine after January 1, and to have the whole machine in full production mode, to allocated users, by April 1.

Supercomputing: The sheer size of this system (3000 processors) surely presents quite a challenge.
What aspects of the installation do you anticipate as being the toughest? The easiest (relatively, of course)?

ROSKIES: It’s really hard to say what will be tough and what will be easy. For months, in collaboration with Compaq and Westinghouse Electric, we have been preparing the infrastructure for power, cooling, etc. for this machine room. As we speak, teams of people from PSC, Compaq, and Westinghouse, under the direction of Ray Scott, Assistant Director for Systems and Operations, are testing machines and cabling them together as they arrive. Perhaps the toughest challenge for such a large system is what to do in the case of component failure. We will enable calculations where the mean time to completion of the job is larger than the mean time to failure of a component. We have designed redundancy into the complement of compute nodes, into the file system and into the network to mitigate the effect of failure. We have also developed a simple checkpoint/restart interface for users, which will enable them to restart their jobs very simply when a component fails.

Supercomputing: Can you provide us with a little information about the software the Terascale System will use to take advantage of the new processors? How about memory, storage and interconnect/networking equipment?

ROSKIES: The system software is called Alphaserver SC, built on Compaq’s TRU64 Unix and Quadrics Supercomputer World’s Resource Management System. We expect most of the message passing to take place under MPI. The processors are connected by a very high speed network from Quadrics. The topology is a fat tree, which has the property that any 1500 processors can talk to the other 1500 processors (1 to 1) without contention. For redundancy and increased bandwidth, we will have two rails of Quadrics, which means two completely independent fat trees. There will be about 40TB of disk storage local to the nodes, and about 30GB of global file system storage.
As usual, users will access the machine and their data over national networks, including special high-speed networks like Internet2. Both the machine room and wide area networking are managed by our Networking group, under the direction of Wendy Huntoon, Assistant Director for Networking.

Supercomputing: What research areas do you feel will benefit the most from the heavy-duty modeling capabilities of this new system?

ROSKIES: The areas with the largest allocations on the initial version of the Terascale system are astrophysics/cosmology, molecular dynamics, materials science, heart modeling, fluid dynamics, geophysics, quantum chromodynamics, and so on. We expect that these application areas will also benefit greatly from the much larger Terascale machine. Our user consultants, many with PhDs in the relevant fields, will work intensively with users in all these areas, under the direction of Sergiu Sanielevici, Assistant Director for Scientific Applications and User Support, and David Deerfield, Assistant Director for Biomedical Applications, whose group specializes in the application of high performance computing to biomedical problems. We are also working on some exciting new real-time applications, such as TeleImmersion, which would use the TCS machine as a real-time signal processor.

Supercomputing: Lots of news in the HPC community the last couple of weeks. What are your thoughts on the recent DOE SciDAC funding and the NSF's DTF (Distributed Terascale Facility) announcements?

ROSKIES: This is a wonderful time for computational scientists. They will soon have substantially more resources available to them than ever before. We have the challenge of putting together machines that will be fast, balanced and stable. Computer scientists have the challenge of developing tools that make these machines easier to use, and more efficient.
Users have the challenge of developing the important applications that will utilize these machines to make significant breakthroughs in science and engineering. We work with people in each of these communities, no matter the origin of their funding, to advance the state of the art.

Supercomputing: Anything else you'd like to add?

ROSKIES: The Terascale machine we are assembling is much larger than the scale for which the original components were designed. So you just can’t extrapolate from small systems in building such machines. You have to fundamentally rethink things as mundane as cabling, and as important as I/O. We have been working very hard with Compaq to field a machine that we think will be extremely productive for the scientific community.

----------
Supercomputing Online wishes to thank Ralph Roskies for his time and insights. It would also like to thank PSC’s Cheryl Begandy for her assistance.
----------
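
As a rough check on the hardware figures quoted in the interview, the short Python sketch below recomputes the per-processor peak, the system peak, the node count and the total memory from the numbers Roskies gives. The constant and function names are illustrative only, and the memory total assumes decimal units (1 TB = 1000 GB), which the interview does not specify.

```python
# Figures quoted in the interview (names here are illustrative, not PSC's).
CLOCK_HZ = 1e9           # "They run at a gigahertz"
FLOPS_PER_CYCLE = 2      # two floating point instructions per clock cycle
PROCESSORS = 3000        # total EV68 processors
PROCS_PER_NODE = 4       # 4-way SMP
MEM_PER_NODE_GB = 4      # 4 GB of memory per SMP node


def peak_gflops_per_processor():
    """Peak speed of one processor in gigaflops."""
    return CLOCK_HZ * FLOPS_PER_CYCLE / 1e9


def system_peak_tflops():
    """Peak speed of the whole machine in teraflops."""
    return peak_gflops_per_processor() * PROCESSORS / 1000


def node_count():
    """Number of 4-way SMP nodes implied by the processor count."""
    return PROCESSORS // PROCS_PER_NODE


def total_memory_tb():
    """Total memory in TB, assuming decimal units (1 TB = 1000 GB)."""
    return node_count() * MEM_PER_NODE_GB / 1000


if __name__ == "__main__":
    print("Peak per processor (GF):", peak_gflops_per_processor())
    print("System peak (TF):", system_peak_tflops())
    print("SMP nodes:", node_count())
    print("Total memory (TB):", total_memory_tb())
```

These are peak figures only; as Roskies notes, sustained performance depends on memory bandwidth and is what matters for real applications.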