ORNL Researchers Discuss Recent IBM-Related Agreements

By Steve Fisher, Editor In Chief -- The folks at Oak Ridge National Lab have been busy. Over the past two weeks they announced not only a major Cooperative Research and Development Agreement with IBM but also the acquisition of a new supercomputer. To learn more, Supercomputing Online interviewed Thomas Zacharia, director of ORNL’s Computer Science and Mathematics Division, and Al Geist, ORNL’s PI for the CRADA with IBM.

Supercomputing: The main focus of this interview was going to be the cooperative research and development agreement between ORNL and IBM announced last week, but only yesterday you folks announced the new supercomputer, so let’s talk a little about that. Can you tell us why you selected it, how you specifically hope to use it, etc.?

ZACHARIA: The Center for Computational Sciences at Oak Ridge is, as you know, a Department of Energy high-performance computing center, and one of the clear mandates for CCS is early evaluation. Right from its inception we have looked at the first of the promising new architectures on the horizon, evaluating them for DOE science applications and working in collaboration with the vendors to make them successful architectures. This particular machine, again, is clearly a distinctly new architecture from IBM’s point of view and indeed our point of view, in that its processors are by far the most powerful among the supercomputers currently in use of sort of the commodity type. It provides incredible memory bandwidth, even as compared to, say, the Alpha processor, so it is a very interesting architecture. From a CPU point of view it is also built on a fat SMP node, which is again somewhat unique. So these are 32-processor nodes, and as we look to the future the machine also offers a fairly good, balanced interconnect between the nodes in the (federation) switch which is put up...all of these are sort of future products, if you will.
It’s a future technology, so it’s a fairly interesting opportunity.

Supercomputing: I agree. Let me ask you…did I read something recently about what IBM is calling cellular computing, cellular processing?

ZACHARIA: Yes.

Supercomputing: Can you talk a little about that?

ZACHARIA: Well, so that’s referring to the CRADA (cooperative research and development agreement). If you look at these machines, even including the 4-teraflop machine that was announced recently, they are built, put together, from commodity components, if you will. They are massively parallel machines. And as you look to the future, to build ever more powerful computers to satisfy the needs of science, these large machines become prohibitively expensive; they require a tremendous amount of space, they also require a tremendous amount of power, and a variety of things. So we are at a point where the need for supercomputing power requires a transition, and IBM has been working on the cellular architecture, which is essentially a departure from the current way of assembling large supercomputers. It is based on technology much more specific to supercomputing, where the computer will have large numbers of these cells, if you will. So you’re looking at literally hundreds of thousands of processors, where each cell or processor will be an integrated processor-and-memory type of architecture.

Supercomputing: If you would, please tell our readers a bit about the Cooperative R&D agreement you announced last week between Oak Ridge and IBM.

ZACHARIA: This is built on a long-standing relationship with IBM, a research partnership with IBM in a number of areas. This particular collaboration focuses on joining forces in developing the cellular architecture. Clearly IBM’s strength and their emphasis is on the hardware side, and our strength and our emphasis is on the applications side.
We feel that it’s sort of a marriage made in heaven, in that we have our people who are interested in applications, systems software, scalable systems software, as well as algorithms, working with IBM’s software and hardware people in developing this new architecture.

Supercomputing: Thomas, you and I were chatting about this briefly, but Al, would you mind sharing with the readers your thoughts about the cellular servers and why some people see them as the next step in the evolution of HPC systems?

GEIST: I think the real key is trying to get to a better performance point. The existing architectures out there are simply getting to a point where, if we’re going to continue to meet the computational power requirements that the users have, we really need to think of a new way of designing those servers, and this cellular architecture is giving us that opportunity.

Supercomputing: Can you tell us about the software that would/will run on servers built on this cellular architecture?

ZACHARIA: Well, that is the crux of this CRADA. If you think about these cellular servers, and I will admit that I’m picking a date out of thin air, let us say that they are available by the year 2006, just for the sake of a particular timeframe. These servers represent a significant departure in terms of how you would program in an environment that has literally hundreds of thousands of processors. The whole approach is that we want to co-develop applications and systems software that could effectively take advantage of these servers when they become available in that timeframe. So the crux of this CRADA is to begin to develop some of this software. Al?

GEIST: That’s a good point. The kinds of issues that come up when you have this kind of server, this many processors, are issues like fault tolerance. How do you build the systems software so it continues to run?
How do you build the applications so that they can still get the right answer even though there may be a very, very small percentage of processors that actually fail during the operation of this computer? You’ve got a hundred thousand processors; if one or two drop out, that would be an almost insignificant fraction of the actual computer. On the other hand, most algorithms developed today are very intolerant of any sort of failures like that. If something fails you end up re-booting the computer, and that’s something that these new architectures really need to take into consideration. This is a whole new arena of how you do fault tolerance and how you make algorithms scale to such large numbers of processors.

Supercomputing: What kinds of research at Oak Ridge and other institutions stand to benefit the most from the agreement with IBM and the technology that will be born of it?

GEIST: Biology, and the whole genomics area, is probably the one that has the most direct and initial impact, but there are a number of applications here at Oak Ridge that are very eager to also become involved in this CRADA project. That includes the work on climate and climate modeling, global climate modeling that was recently in the news for CO2 build-up in the global economy based on that. Another one we have here at Oak Ridge National Laboratory is materials science. There’s a very strong emphasis here on doing nanotechnology, for example, and being able to model more and more complex structures for nanotechnology is something that these types of computers are going to allow us to do.

Supercomputing: Is there anything else that either of you gentlemen would like to add?

ZACHARIA: Well, nothing in particular other than the fact that this is a pretty exciting time, both in terms of computing and the opportunity it presents, but also in terms of the commitment that Oak Ridge National Laboratory and the DOE have made.
In fact, the contractor that manages ORNL, UT-Battelle, a consortium of the University of Tennessee and Battelle Memorial Institute, has just announced that it is going to privately finance and construct two major buildings in support of this computing agenda. One building, which is called the Computational Sciences Building, will have an ASCI-class 40,000-square-foot computer room, offices to house two hundred researchers, brand new visualization capabilities and so forth. The other building is the Joint Institute for Computational Sciences, which is aimed at bringing graduate students and academic researchers, collaborators, to Oak Ridge so that we can interact and build on this. It’s always exciting because, as you know, buildings are the hardest things to construct. What is really interesting and innovative is that the management contractor is taking the initiative and financing the construction of these two buildings in support of the DOE program, which is very unusual and of course very gratifying.

----------

Supercomputing Online wishes to thank Thomas Zacharia and Al Geist for their time and insights. It would also like to thank ORNL’s Betsy Riley for her assistance.

----------