The surprising usefulness of sloppy arithmetic

High tolerance

So in May 2010, with funding from the U.S. Office of Naval Research, Bates came to MIT as a visiting professor, working with Roy’s group to determine whether video algorithms could be retooled to tolerate sloppy arithmetic. George Shaw, a graduate student in Roy’s group, began by evaluating an algorithm, commonly used in object-recognition systems, that distinguishes foreground and background elements in frames of video.

To simulate the effects of a chip with imprecise arithmetic circuits, Shaw rewrote the algorithm so that the result of every numerical calculation was randomly raised or lowered by up to 1 percent. Then he compared its performance to that of the standard implementation of the algorithm. “The difference between the low-precision and the standard arithmetic was trivial,” Shaw says. “It was about 14 pixels out of a million, averaged over many, many frames of video.” “No human could see any of that,” Bates adds.
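The article doesn’t reproduce Shaw’s code, but the flavor of the experiment is easy to sketch: perturb each arithmetic result by a random amount of up to 1 percent and count how many output pixels change. The Python snippet below does this for a deliberately simplified background-subtraction routine; the frame size, threshold, and segmentation rule are illustrative stand-ins, not the algorithm Shaw actually evaluated.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(x, max_error=0.01):
    """Raise or lower each value by a random amount of up to max_error (1 percent)."""
    noise = rng.uniform(-max_error, max_error, size=np.shape(x))
    return x * (1.0 + noise)

def foreground_mask(frame, background, threshold=30.0, sloppy=False):
    """Toy background subtraction: a pixel counts as foreground if it differs
    from the background model by more than the threshold."""
    diff = np.abs(frame.astype(float) - background.astype(float))
    if sloppy:
        diff = perturb(diff)  # simulate imprecise arithmetic circuits
    return diff > threshold

# Synthetic one-megapixel grayscale frame and background model.
background = rng.integers(0, 256, (1000, 1000)).astype(float)
frame = background + rng.normal(0, 20, (1000, 1000))

exact_mask = foreground_mask(frame, background, sloppy=False)
sloppy_mask = foreground_mask(frame, background, sloppy=True)

# Pixels that flip between foreground and background, per million pixels.
mismatches = np.count_nonzero(exact_mask != sloppy_mask)
print(f"{mismatches / exact_mask.size * 1e6:.1f} differing pixels per million")
```

In a sketch like this, the only pixels that can flip are the borderline cases whose difference from the background sits within about 1 percent of the threshold, which is why the disagreement between the two versions stays tiny.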

Of course, a really useful algorithm would have to do more than simply separate foregrounds and backgrounds in frames of video, and the researchers are exploring what tasks to tackle next. But Bates’ chip design looks to be particularly compatible with image and video processing. Although he hasn’t had the chip manufactured yet, Bates has used standard design software to verify that it will work as anticipated. Where current commercial computer chips often have four or even eight “cores,” or separate processing units, Bates’ chip has a thousand; since they don’t have to provide perfectly precise results, they’re much smaller than conventional cores. 

But the chip has another notable idiosyncrasy. In most commercial chips, and even in many experimental chips with dozens of cores, any core can communicate with any other. But sending data across the breadth of a chip consumes much more time and energy than sending it locally, so in Bates’ chip, each core can communicate only with its immediate neighbors. That makes the design much more efficient: a chip with 1,000 cores can come close to a genuine 1,000-fold speedup over a single conventional core, because the cores are not held up by long-distance communication. But it also limits the chip’s use. Any computation that runs on it has to divide easily into subtasks whose results matter mainly to a small cluster of related subtasks, the ones running on adjacent cores.
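To make that constraint concrete, here is a rough Python illustration, a toy rather than anything from Bates’ design: a stencil computation in which each grid cell is updated using only its four immediate neighbors, exactly the kind of locality that maps well onto cores that can talk only to the cores next to them.

```python
import numpy as np

def neighbor_average_step(grid):
    """One step of a nearest-neighbor stencil: each cell is replaced by the
    average of the cells directly above, below, left, and right of it.
    A computation with this structure needs no data from across the chip."""
    up    = np.roll(grid, -1, axis=0)
    down  = np.roll(grid,  1, axis=0)
    left  = np.roll(grid, -1, axis=1)
    right = np.roll(grid,  1, axis=1)
    return (up + down + left + right) / 4.0

# Imagine each entry of this grid living on its own core; updating it requires
# values from at most four adjacent cores, never from the far side of the chip.
grid = np.random.default_rng(1).random((32, 32))
for _ in range(10):
    grid = neighbor_average_step(grid)
```

Many image- and video-processing steps, such as smoothing or comparing a pixel with its surroundings, have this neighbor-only structure, which is part of why the design looks like a good fit for that kind of work.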