SCIENCE
Purdue puts the brakes on datacenter computers to avoid crash
At a research institution like Purdue University, even a brief shutdown often means millions of hours of lost computing time.
Computation for scientific research can run for hours, or much longer. For example, the complex atmospheric climate research by Purdue professor Joe Francisco can sometimes take up to four months of continuous computing time.
If the computer stops running in the middle of the massive calculation, the job has to be run again from the beginning, like landing on the wrong square in a computational game of Sorry.
"Past a certain temperature, our only option has been to start shutting down racks of computers," says Mike Shuey, high-performance computing systems manager at Purdue. "That has ripple effects on the research efforts of the university for weeks afterward."
But staff at Purdue recently developed a way to keep the machines running at all times - even during a loss of cooling - and save the research computation already completed. They accomplish this by slowing down the computers in a datacenter, which reduces energy use and cuts the heat produced by the machines by as much as 30 percent.
Purdue's campus research center contains two supercomputers that Top500.org ranks among the world's largest. More than 15,000 processors, along with other computers, robotic data storage systems and other hardware, together produce enough heat to make the room miserable for man and machine if there is an interruption in the 50-degree water that cools the facility.
If temperatures in the large datacenter room exceed 82 degrees, alarms sound, lights flash and e-mail warnings are automatically sent to staff across campus. At 90 degrees the machines begin shutting themselves down, if the technology works as planned. If it does not, parts of the multimillion-dollar machines can be ruined.
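The thresholds described above amount to a small decision rule. The sketch below is illustrative only: the function and constant names are hypothetical, and the temperatures are assumed to be in degrees Fahrenheit, which the article does not state.

```python
# Thresholds taken from the article; the Fahrenheit unit is an assumption.
ALARM_ABOVE_F = 82    # alarms, flashing lights and e-mail warnings above this
SHUTDOWN_AT_F = 90    # machines begin shutting themselves down at this point


def datacenter_action(room_temp_f: float) -> str:
    """Return the response the article describes for a given room temperature."""
    if room_temp_f >= SHUTDOWN_AT_F:
        return "shutdown"
    if room_temp_f > ALARM_ABOVE_F:
        return "alarm"
    return "normal"
```

A throttling program like the one Purdue describes would act somewhere in the window between the alarm and the shutdown; the article does not say at what temperature it is triggered.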
During a recent cooling interruption in one of the hottest summers on record - a datacenter emergency that can become a computing catastrophe if the machines fail to shut down automatically - Purdue staff slowed the computers enough to keep the datacenter running through the outage.
It's believed to be the first time a datacenter was throttled back.
"The program worked, and the datacenter didn't overheat, so the process was a success. We actually were a bit surprised it worked so seamlessly," Shuey says. "It's much better to have jobs run slowly for an hour than to throw away everyone's work in progress and mobilize staff to try to fix things."
The program works like the power-management systems used in laptop computers to reduce battery drain. However, in this case the program slows down nearly 8,000 processors simultaneously - a bit like getting 8,000 cars on a highway to suddenly slam on their brakes without causing a pileup.
Patrick Finnegan, a Unix systems administrator at Purdue, first came up with the idea of using the built-in capability of the Linux operating system to slow the machines in an emergency, and he prepared the program. However, because of the nature of the work being done on the supercomputers, there was no way to test the program until an actual cooling emergency occurred.
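The article does not say which Linux facility Finnegan's program uses. One standard built-in mechanism is the kernel's CPU frequency scaling (cpufreq) interface, exposed as files under /sys/devices/system/cpu. The sketch below is an assumption about how such throttling could be done on a single machine, not Purdue's actual program; the function names are hypothetical, and writing to these files requires root privileges.

```python
from pathlib import Path

# Standard Linux cpufreq location. Whether Purdue's program uses this
# interface is an assumption; the article only says "built-in capability".
CPU_ROOT = Path("/sys/devices/system/cpu")


def _set_frequency_cap(cpu_root: Path, source_file: str) -> int:
    """Copy source_file into scaling_max_freq for every CPU; return the count."""
    changed = 0
    for policy in sorted(cpu_root.glob("cpu[0-9]*/cpufreq")):
        freq_khz = (policy / source_file).read_text().strip()
        (policy / "scaling_max_freq").write_text(freq_khz)
        changed += 1
    return changed


def throttle_cpus(cpu_root: Path = CPU_ROOT) -> int:
    """Cap each CPU at its lowest supported frequency to cut heat output."""
    return _set_frequency_cap(cpu_root, "cpuinfo_min_freq")


def restore_cpus(cpu_root: Path = CPU_ROOT) -> int:
    """Lift the cap once cooling is back, allowing full speed again."""
    return _set_frequency_cap(cpu_root, "cpuinfo_max_freq")
```

Running jobs keep their state under a frequency cap and simply progress more slowly, which matches the trade-off Shuey describes. Applying the change across thousands of processors at once would also need some way to run it on every node, which this single-machine sketch does not cover.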
When the program worked, Finnegan became the toast of the datacenter.
"I was a bit overwhelmed by all of the positive responses I got," he says.
Purdue is now making the procedure available to other institutions and corporations on the Folio Direct website at http://www.FolioDirect.net
"The program includes all of the notes on the implementation, so a datacenter manager can see the implementation from someone who has brought the process into production," Shuey says.
The "High Performance Computing Saving Device" is available for download for $250.
"We're offering it to other organizations so they don't have to do all of the discovery in-house," Shuey says. "This is a process that we know works."