TeraGrid Preparing for Friendly Users

CHAMPAIGN, IL -- Recent hardware and software upgrades have positioned the TeraGrid to meet another milestone as it moves toward friendly-user mode this month. The target date for production is January 1, 2004, at which point its overall capability will be 256 compute nodes and 40 terabytes of storage at NCSA, 128 nodes and 225 terabytes of storage at SDSC, 55 nodes and 94 terabytes of storage at Caltech, and 112 nodes and 20 terabytes of storage at Argonne.
The hardware upgrade involved replacing one-gigahertz processors with new Itanium 2 Madison 1.3-gigahertz processors. This upgrade increases the peak performance of the 256 compute nodes by 30 percent. "This is a pretty significant piece," said Rob Pennington, Senior Associate Director at NCSA's Computing and Data Management and NCSA site lead for TeraGrid. "Initial tests ... have shown that we are now able to sustain over 2 teraflops on this initial system with HPL."
The Myrinet software upgrade began this past month and took three days to complete. The new Myrinet 9.10 interconnect interfaces will provide lower latency and allow software compatibility between the Phase 1 machines and the Phase 2 machines that are due this fall. "This way," said Pennington, "we will have ... [a] slightly different clock rate but [the] same processors, and the same Myrinet fabric all the way across the entire cluster." Testing on the Myrinet upgrade began on August 7.
On September 2, a system with a frozen software stack was placed in service for testing by initial application users and for moving the environment toward a more general friendly-user period to begin in October. This environment will also support the Phase 2 hardware that will be shipped over the next few months. There will be one software stack on both sides of the final cluster configuration.
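As a quick check on the figures quoted above, the short Python sketch below tallies the per-site production targets and works out the clock-rate gain. The site table is copied from the numbers in this article; the names SITES, totals, and clock_uplift are illustrative only, and the uplift calculation assumes that peak performance scales linearly with clock rate, which is a simplification and not a description of how the TeraGrid team measured it.

```python
# Per-site production targets quoted in the article:
# site name -> (compute nodes, storage in terabytes).
SITES = {
    "NCSA": (256, 40),
    "SDSC": (128, 225),
    "Caltech": (55, 94),
    "Argonne": (112, 20),
}


def totals(sites):
    """Return (total compute nodes, total storage in TB) across all sites."""
    nodes = sum(n for n, _ in sites.values())
    storage_tb = sum(tb for _, tb in sites.values())
    return nodes, storage_tb


def clock_uplift(old_ghz, new_ghz):
    """Fractional gain in peak rate, ASSUMING peak scales linearly with clock."""
    return new_ghz / old_ghz - 1.0


total_nodes, total_storage_tb = totals(SITES)
uplift = clock_uplift(1.0, 1.3)  # 1.0 GHz parts replaced by 1.3 GHz parts

print(f"compute nodes: {total_nodes}")
print(f"storage (TB):  {total_storage_tb}")
print(f"peak uplift:   {uplift:.0%}")
```

Under those assumptions the four sites sum to 551 compute nodes and 379 terabytes of storage, and the move from 1.0 to 1.3 gigahertz accounts for the 30 percent peak gain quoted for the NCSA nodes.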
During the month of September, the environment at NCSA and the partner sites will be prepared to allow Friendly Users on the systems, a stage during which users are allowed access with the caveat that the system is not considered "production-ready." This period has always benefited both the center and the users: users get early access for code porting at no cost against an allocation, while center staff can debug the system under reasonable loads and a wide variety of usage modes.
Early User Feedback
Before the hardware and software upgrades, a limited number of users outside of NCSA had been on the machine sporadically since March. During that period, frequent configuration changes were made to the machine, which meant that the quality of users' time on the machine was not very high. During this experimental stage, the administrators had the authority to take the machine down without notice if they needed to make repairs or adjustments. On September 1, however, a new policy went into effect: if the administrators take the machine down, they must provide notice and do so only for well-justified reasons that have been pre-approved. "There is this watershed," said Pennington, "that we're working toward where it goes out [of] the experimental stage ... [when center staff] are still working on making it function as it should and into the stage where applications people will be trying to get useful work out of it."
So far, feedback from users has indicated that the performance is good but the machine is still somewhat unstable. According to Pennington, good performance had been hoped for and a certain amount of instability had been expected. The instability centered on the software stack, which is being integrated to run on the TeraGrid machines without relying on a vendor to do the integration, while still providing a common environment across all the TeraGrid resources.
So it has been a slow but steady process to ensure that all drivers work with the current version of the operating system and that there are no outstanding problems with the operating system itself. An early problem involved the operating system, which was producing numerically incorrect answers. With help from the vendor partners, this problem has been resolved. The TeraGrid group has also been working closely with IBM to make GPFS (General Parallel File System) available at NCSA to all the nodes in the cluster by this fall.
TeraGrid Partner Collaboration
The TeraGrid software stack is a collaborative effort of the four participating sites: the National Center for Supercomputing Applications (NCSA), San Diego Supercomputer Center (SDSC), Argonne National Laboratory, and the Center for Advanced Computing Research (CACR) at Caltech. Because all four sites have had previous experience with software for clusters, each site brings something different to the table. (The fifth TeraGrid partner, Pittsburgh Supercomputing Center (PSC), is also being integrated into the TeraGrid with both the TCS machine and under Phase 2 funding with an HP Marvel system.)
For the TeraGrid project to work as designed, all sites had to agree to use the same software stack on their TeraGrid machines. The collaboration process involved looking at user requirements, communicating among partner sites, agreeing on the right packages on technical grounds, and then making the packages available. Once that was done, each site had to integrate the packages into its own software stack and provide automated reporters to tell which version is actually running. "We're working on test code that will go back and make sure that each component of the environment is operating correctly," said Pennington. The monitoring software, which is installed at all sites, resides at a layer above the software stack.
It will indicate, for every site, what version is installed, whether it is working, and whether there are any problems with it. For example, a user whose code is broken can check the information from the monitoring software to find out whether anything has changed.
Last but not least, collaboration involves trust. The sites have different security requirements for their computing environments; therefore, the packages that go onto the system have to be vetted beforehand. Each site has to be able to trust the other sites in terms of their security policies. If an Argonne user is using a Caltech machine with an NCSA certificate, the Caltech system has to be able to trust that NCSA knows that user.
Consistency from Machine to Machine
An important TeraGrid goal is to maintain consistency from machine to machine. For example, instead of learning how to use four different machines, users only need to know how to use the TeraGrid machine, which has four instantiations. Users should be able to take their makefiles to any of the TeraGrid sites, compile their code, and run that same binary at any of the TeraGrid sites. To make things more complex, the same version of the software has to be installed in paths where users can find it in exactly the same way. If there is a path difference, it has to be reworked in such a way that the user is unaware of it. When the user invokes the code, all the executables should be found, and everything should function in the same way. The intent is that a user will be presented with a consistent environment across all the TeraGrid resources.
With the recent upgrades in place and the ongoing collaborative efforts of its partners, TeraGrid is well positioned to move into its friendly-user phase. "This project, in a real sense," said Pennington, "is taking advantage of the technologies and the collaborations that the sites have been developing over a number of years."
Story submitted by Herbert Morgan, NCSA