University of Alberta wins the Cluster Challenge at SuperComputing 2007

The team from the University of Alberta took the checkered flag at the first-ever Cluster Challenge at SuperComputing 2007. The team consisted of six student members: Antoine Filion, Paul Greidanus (Student Lead), Gordon Willem Klok, Chris Kuethe, Andrew Nisbet, and Stephen Portillo. Dr Paul Lu, an associate professor from UA's Department of Computing Science, coached the team.

All teams, each made up of six undergraduates and one University staff member, were allowed a single 19-inch rack of equipment that was assembled at the conference center. Over the first three days of the conference, a combination of industry-standard benchmarks and current scientific modeling problems was run on the clusters. The teams were also limited to a single 30 A, 110 V power circuit, and penalties were given for excess draw. Results were displayed throughout the competition on a 42-inch display, which was itself a factor in the power budget. Participants were judged on benchmark performance and the throughput of the scientific applications.

The University teams were paired with corporate sponsors that provided the computers and networking equipment used for the competition. SGI supplied the University of Alberta team with five Altix XE310 nodes, on which the team installed Scientific Linux 4.5 and OSCAR 5.0 (Open Source Cluster Application Resources, http://oscar.openclustergroup.org). The 1U XE310 consists of two dual-socket motherboards running 2.66 GHz quad-core processors, for a total of 16 cores per chassis, and 16 GB of RAM. The team used a total of 48 cores for the HPCC/LINPACK runs and 64 for the rest of the competition (LINPACK draws heavily on the competition's 26 A power budget). A Voltaire ISR-9024 InfiniBand switch provided a high-performance network for the cluster.

The software stack used by the University of Alberta team had OSCAR, SystemImager, C3, Sun Grid Engine, Ganglia, and MVAPICH2 as its core elements.
Each piece was chosen for its usability, manageability and performance. "OSCAR allowed us to deploy the cluster quickly, and get onto the important work of benchmarking and characterizing our applications. It also sets up applications to allow our jobs to run, and have status displays with no effort," said Paul Greidanus, the team's Student Lead. OSCAR includes all the software necessary for building an HPC cluster; no prior experience is necessary, and its intuitive interfaces allow anyone to quickly deploy a supercomputer.

According to Bernard Li, an open source developer who has worked with a number of open source clustering-related applications, including OSCAR, "Open source clustering software has come a long way since they started appearing in the late 90's. Now these softwares have moved out of research/academia and into the enterprise as the code gets more mature and stable. These open source software will continue to contribute greatly in bringing HPC to the masses in the years to come."

However, as in all competitions, people and strategy are also important, and this is one place where the team's approach and effort before the competition paid off. Team member Chris Kuethe had this to say about the team's preparations: "Before the competition we tested each application thoroughly. This allowed us to ensure that the applications worked correctly, and gave us processes for managing runtime errors. The testing also revealed some critical properties about the application scaling: sometimes the cost of off-node communication was offset by access to a larger number of cores, in other cases a smaller [number] of cores would suffice if slow interconnects could be avoided. We were pleased to find that POV-Ray could be used to consume any unallocated CPU time without adversely impacting other running processes."

Gordon Willem Klok also thought that preparation was important to the team's victory. "In particular [preparation] was emphasized by our coach Dr Paul Lu.
Understanding the importance of speedup curves and correctly characterizing the applications ahead of time allowed us [to] find a good mix of workloads to maximize utilization."

Power was the major limiting factor for the teams and a constant source of challenges. The teams were given two metered power bars with 13 A current limits, which they were not permitted to exceed. Power management was also very important in the team's victory. "We arrived very well prepared having conducted a considerable number of tests to develop a power profile and gauge this profile against the contest limitations. We chose to characterize not just the cluster power consumption in aggregate, but attempted to ascertain what the cost of each of the removable hardware components was and weighed this carefully against its utility in terms of performance or ease of use," said team member Gordon Willem Klok.

A power outage hit the section of Reno where the conference was being held, causing the entire convention center to lose power for a time and interrupting the competition. The University of Alberta team recovered quickly from the outage, due mostly to the team's preparation and to their scheduler, which allocated jobs onto the nodes as soon as they came back up.

The team spent most of the setup day running the challenge applications with the datasets provided by the challenge organizers. They were surprised to learn that one benchmark, the popular LINPACK benchmark, would consistently exceed the contest's power limit if they used the 56 cores they had planned on in Edmonton. The team decided to risk losing points on the LINPACK benchmark by lowering the number of cores used during that portion of the competition rather than risk exceeding the power limit.
They also discovered during the setup day that they could run the other applications on all 64 cores if they used the built-in throttling mechanism of the XE310 to even out the load on the processors and handle occasional power spikes caused by changing workloads. "It was a risky gamble that in the end paid off; we were drawing precisely 26 amps for a good portion of the competition," said team member Gordon Willem Klok.
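
The core and power figures quoted in this article can be tied together with a little arithmetic. The Python sketch below is only an illustration, not the team's actual method. The hardware counts and the 13 A limits come from the article. The function `max_cores_within_budget`, its linear power model, the idea of enabling cores one whole motherboard at a time, and the per-board and overhead current values are all hypothetical placeholders, not measurements from the competition.

```python
# Figures stated in the article.
SOCKETS_PER_BOARD = 2     # each motherboard is dual-socket
CORES_PER_SOCKET = 4      # quad-core processors
BOARDS_PER_CHASSIS = 2    # two motherboards per 1U XE310
POWER_BARS = 2            # two metered power bars
AMPS_PER_BAR = 13.0       # 13 A limit on each bar

CORES_PER_BOARD = SOCKETS_PER_BOARD * CORES_PER_SOCKET      # 8
CORES_PER_CHASSIS = CORES_PER_BOARD * BOARDS_PER_CHASSIS    # 16
BUDGET_AMPS = POWER_BARS * AMPS_PER_BAR                     # 26.0


def max_cores_within_budget(amps_per_board, overhead_amps,
                            total_boards=8, budget_amps=BUDGET_AMPS):
    """Return the largest core count, in whole motherboards, within budget.

    ASSUMPTION: a simple linear model in which every active motherboard
    draws the same current for a given workload, plus a fixed overhead
    for shared equipment. Both inputs are hypothetical; the article
    gives no per-component figures.
    """
    if amps_per_board <= 0:
        raise ValueError("amps_per_board must be positive")
    available = budget_amps - overhead_amps
    if available <= 0:
        return 0
    boards = min(total_boards, int(available // amps_per_board))
    return boards * CORES_PER_BOARD


if __name__ == "__main__":
    print("cores per chassis:", CORES_PER_CHASSIS)
    print("power budget (A):", BUDGET_AMPS)
    # Placeholder draws chosen only to show the trade-off.
    print("lighter workload:", max_cores_within_budget(3.0, 2.0))
    print("heavier workload:", max_cores_within_budget(4.0, 2.0))
```

Under a model like this, a workload that draws more current per board leaves room for fewer active cores within the same 26 A, which is the kind of trade-off the team describes making for LINPACK.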