PSC Scientists Patent Software for Protecting Results Against System Failures

Scientists at Pittsburgh Supercomputing Center (PSC) have patented ZEST, a piece of software that takes a rapid “snapshot” of a supercomputer’s calculations as it works. ZEST greatly speeds the storing of in-progress calculations as a hedge against a system failure, saving precious supercomputing time and slowing calculations far less than current methods do.

PSC co-inventors of ZEST included Paul Nowoczynski, Jason Sommerfield, Nathan Stone, and Jared Yanovich.

Just as we all hit “save” as we work, scientists carrying out vast computations such as those required for detailed weather predictions or earthquake science need to periodically store — “checkpoint” — the machine’s computational state. In the case of a system malfunction, this allows them to avoid having to start from the beginning after hours or days of work.

The problem, according to J. Ray Scott, Director of Systems and Operations at PSC, is that retrieving and storing these data takes time away from calculation, and computing time is carefully rationed among researchers using highly in-demand supercomputers. In fact, he adds, over the last seven years the memory available in the largest machines has increased about 25-fold, while the capacity for moving the contents of that memory out to storage has increased only about four-fold.
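
As a rough illustration of what those two growth figures imply, the short Python sketch below assumes that checkpoint time scales as the amount of memory to save divided by the rate at which it can be written out. That scaling rule and the function name are assumptions made for this example, not something stated by PSC.

```python
def relative_checkpoint_time(memory_growth: float, io_growth: float) -> float:
    """Factor by which a full-memory checkpoint gets slower.

    Assumes (for illustration only) that checkpoint time is proportional
    to memory size divided by the rate at which memory can be written out.
    """
    return memory_growth / io_growth


# Figures quoted above: memory up about 25-fold, I/O capacity up about four-fold.
slowdown = relative_checkpoint_time(25, 4)
print(slowdown)  # 6.25
```

Under that assumption, saving the full memory of the largest machines would take roughly six times longer than it did seven years earlier.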

“If you have a large job, checkpointing the run often means writing out tens of terabytes of data” — enough to fill about a thousand new iPads, Scott says. “This takes a nontrivial amount of time. The whole time you’re doing the checkpoint, you’re not using the computer.”

The ZEST software works by tightly managing the supercomputer’s disk drives, continuously routing checkpoint storage to disks that aren’t being used for computation.

“Every disk drive holds up its hand and says, ‘I can take these data now,’” Scott explains. This “pull-based model” ensures the checkpointing conflicts as little as possible with the computer’s own use of the drives. “You’re always writing to whomever’s the most available.”
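
The difference between handing data to drives in a fixed order and letting the most available drive take the next piece can be sketched with a toy Python simulation. This is not ZEST's actual implementation: the function names, the equal-sized chunks, and the fixed write time per chunk are all assumptions chosen to keep the example small.

```python
import heapq


def push_makespan(num_chunks, disk_ready_at, write_time=1.0):
    """Push model: chunk i is assigned to disk i % n in advance,
    whether or not that disk is ready to receive it."""
    finish = list(disk_ready_at)
    for i in range(num_chunks):
        finish[i % len(finish)] += write_time
    return max(finish)


def pull_makespan(num_chunks, disk_ready_at, write_time=1.0):
    """Pull model: whichever disk becomes free first takes the next chunk."""
    free_at = list(disk_ready_at)
    heapq.heapify(free_at)
    last_done = 0.0
    for _ in range(num_chunks):
        start = heapq.heappop(free_at)   # earliest-available disk
        done = start + write_time
        last_done = max(last_done, done)
        heapq.heappush(free_at, done)    # disk is free again once the write ends
    return last_done


# Hypothetical scenario: four disks, one of them busy until t = 6.
ready = [0.0, 0.0, 0.0, 6.0]
print(push_makespan(8, ready))  # 8.0
print(pull_makespan(8, ready))  # 3.0
```

In this made-up scenario the push approach waits on the busy drive and finishes at time 8, while the pull approach routes all eight chunks to the three idle drives and finishes at time 3.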

ZEST is far more efficient than current methods, which “push” data to disks whether or not they’re ready to receive it. In tests, ZEST has achieved 90 percent of the theoretical maximum efficiency of writing data to drives; currently available commercial systems achieve 25 percent or less.