FPGAs and the Quest for Petaflop Systems

By Steve Fisher, Editor In Chief -- One innovative and perhaps often misunderstood method of attaining impressive computing results is the use of FPGAs (field programmable gate arrays). How is it possible to build a supercomputer with this technology? For the answer to this and other questions, Supercomputing Online interviewed Richard Loosemore, Director of Research, Starbridge Systems.

Supercomputing: I think there may be some degree of misunderstanding about how FPGAs could possibly be used to build a supercomputer. Can you shed some light on this subject and tell us about the benefits of this process?

LOOSEMORE: First, a quick summary of the domain. An FPGA is just a reprogrammable array of gates - think of it as a large grid of cells, each of which contains some logic that we can configure, together with wide highways of connecting wires running north-south and east-west between the cells. Every place the lines intersect there is a crosspoint switch, so we can program the wires to connect any cell output to any other cell input. We can reconfigure the connections and the logic in each cell as many times as we want - in fact, we can do this about a thousand times a second.

Our goal here at Star Bridge is to turn a piece of conventional software into a logic circuit, then run that circuit on a computer that consists of a massive array of FPGAs. Our claim is that this yields a dramatic performance improvement over what you would have gotten by running the original software on a conventional processor. Some people get this idea straight away, but there are plenty who don't, and they often base their reasoning on myths that are easy to debunk. For example:

1) FPGAs are slow chips. Well, yes, FPGAs are slower if you look at clock speed - 200 to 300 MHz is typical. But this is almost meaningless when all the other factors are taken into account. It is the same story that allows Steve Jobs to stand up in front of an Apple crowd once or twice a year and watch his latest Macintosh design thrash a Pentium in a real-world test, even when the Pentium has maybe twice the clock speed.

2) People have tried implementing processors on FPGAs and the performance has been dismal. There are processor emulators that run on FPGAs, but for the most part this is not what we do in our machines. I mean: you have all that parallelism at your disposal and then you throw it away on a serial von Neumann design? This would be crazy, and it is not our main goal. We can take a piece of conventional software, reimplement it as a parallel array of gates (not an easy task, but we can do it), then run that to completion on the FPGA in less time than it takes the conventional code to go through its initialization sequence. Now, I did qualify my answer by saying that we 'mostly' don't do that. The fact is, there has recently been a big change in FPGA architecture that makes it easier to emulate processors, so there are some circumstances where it would make sense to put (for example) an array of Linux processors on an FPGA and make the whole thing behave like a Beowulf cluster. That won't squeeze the maximum juice out of the hardware, but it makes it easier to port old-style software.

3) FPGAs will always lag behind the performance of dedicated processor designs (your Pentiums and Alphas and SPARCs) because they are just general-purpose arrays of gates, without the benefits of the highly optimized gate designs that you find in those processors. There is some confusion about exactly what density of horsepower you can cram into an FPGA, as opposed to a dedicated processor. In fact, FPGAs are already showing a faster rate of increase of logic density than is being achieved on ASICs, and this is because *uniformity* matters a great deal in chip production. When the chip is in the oven and those atoms are wandering around looking for a place to deposit themselves, the atoms have a nasty tendency to be influenced by the smoothness of the silicon plain underneath them. If they see a conventional chip, with its complex, bumpy-looking floating point unit in one place, and a nice smooth area of cache memory somewhere else, they tend not to rain down uniformly on these two different areas. For somebody like Intel, this creates an enormous headache and leads to low chip yields. The FPGA manufacturers, meanwhile, can finesse this problem because their chips have such a uniform surface. Year on year, this gap between the two approaches will only get wider, so the criticisms of the raw power that can be put in these chips are just plain misinformed.

I want to add one important comment before I leave this issue. All I have done so far is attack a few of the myths that seem to be in circulation, without even touching on the positive reasons why our hardware and software can do what they do. That is a different thing altogether.
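To see why the clock-speed comparison in point 1 is "almost meaningless", a deliberately crude model helps. The 300 MHz figure is the FPGA clock Loosemore quotes; the 2 GHz conventional processor and the 100-wide parallel circuit below are assumptions chosen only to show the shape of the argument, not measurements of any real machine.

    # Rough, illustrative throughput model: a slow clock driving many parallel
    # operations versus a fast clock driving one operation at a time.
    # Only the 300 MHz FPGA clock comes from the interview; the rest is assumed.

    FPGA_CLOCK_HZ = 300e6   # typical FPGA clock quoted above
    CPU_CLOCK_HZ = 2e9      # assumed clock of a conventional serial processor
    PARALLEL_OPS = 100      # assumed operations per clock once the program has
                            # been laid out as a parallel circuit on the FPGA

    fpga_throughput = FPGA_CLOCK_HZ * PARALLEL_OPS   # 3.0e10 ops/s
    cpu_throughput = CPU_CLOCK_HZ * 1                # 2.0e9 ops/s

    print(f"FPGA: {fpga_throughput:.1e} ops/s, CPU: {cpu_throughput:.1e} ops/s")
    print(f"Speedup despite the slower clock: {fpga_throughput / cpu_throughput:.0f}x")

The point of the model is only that the width of the circuit, not the clock rate, dominates once a program has been turned into parallel logic.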
Supercomputing: Please tell us about Starbridge Systems’ hardware products and their performance capabilities. The HAL 300? 12.8 TeraOps, is that possible? How about the HAL 15 and the PENSA processors?

LOOSEMORE: Okay, the quoted figure for the HAL 300 is certainly valid, but it refers to 'Ops' as in 'operations', and those operations might be appropriate at the gate level but are not easy to translate into something like a floating point multiply, which is the usual crude meaning of 'op' in the context of 'Gigaflop'. What I want to do here is give you a quick sketch of a theoretical performance calculation that lets us compare a HAL 300 with other machines. It is by no means a benchmark figure, just a ball-park calculation for what is achievable - and I'll say a bit more in a moment about why this estimate is not as specious as it seems.

The goal here is to find out roughly how many 32-bit multiply operations we could get out of a HAL 300 in one second. On a 10-million-gate Virtex II FPGA from Xilinx, there are 192 dedicated 18-bit multipliers and 15,360 configurable logic blocks (CLBs). The dedicated multipliers can be connected to give about 108 32-bit multipliers, and since they can produce one result every two clocks, at a clock speed of 300 MHz we can get 16 billion floating point operations per second. Allowing for the fact that we need some of the CLBs for other logic, we can also build about 290 serial 32-bit multipliers, each of which will deliver one result every 20 clocks (on average). That comes out to roughly 4 billion floating point operations per second. So the total performance of one chip would be 16 + 4 = 20 GFlops. In a single HAL 300 there are 200 chips, so the total machine performance would be 4 TeraFlops.
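For readers who want to follow the arithmetic, the same ball-park estimate can be written out step by step. Every figure below (multiplier counts, latencies, clock speed, chips per machine) is the one quoted in the answer above; nothing here is a measured benchmark.

    # Back-of-envelope HAL 300 estimate, using only the figures quoted above.
    CLOCK_HZ = 300e6

    # ~108 32-bit multipliers built from the dedicated 18-bit multipliers,
    # each producing one result every 2 clocks.
    dedicated = 108 * CLOCK_HZ / 2    # ~16.2e9 results/s

    # ~290 serial 32-bit multipliers built from CLBs, each producing one
    # result every ~20 clocks on average.
    serial = 290 * CLOCK_HZ / 20      # ~4.35e9 results/s

    per_chip = dedicated + serial     # ~20.5e9, roughly 20 GFlops per chip
    machine = per_chip * 200          # 200 chips -> ~4.1e12, roughly 4 TFlops

    print(f"Per chip: {per_chip / 1e9:.1f} GFlops")
    print(f"Per HAL 300: {machine / 1e12:.1f} TFlops")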
Now, before you throw up your hands and shout "benchmark fraud", let me give you a few reasons to believe these numbers. Did you notice that there were 290 *serial* multipliers in there? These guys work slowly, but they are compact AND they give the rest of the circuitry time to keep up - so if some part of the program needs to do a bunch of work before it can be ready to process the result of the multiply operation, it can get on with that work during the time that the multiplication is cooking inside the serial multiplier. If you need the result real quick, use a dedicated multiplier, but if not then you do have another option. This kind of flexibility is immensely important because it allows a real-world program running on an FPGA computer to get much nearer to these theoretical performance figures than would otherwise be possible.

In fact, we can take that flexibility a great deal further. If there is some process in your software that needs to multiply numbers coming from a very slow source, we can build a "superserial" multiplier that is so compact we could get many thousands of them on a chip and still have bags of room to spare. This general principle of building circuitry that closely matches the data rates that need to be handled in the various parts of a program is something we call "superspecificity", and it lies at the root of our claims about performance. In practice, superspecificity allows us to do two things. One is the obvious achievement of saving space on the chip. The other is more subtle: because we can be so economical with space, we can often ensure that calculations stay right there on the chip and do not have to go off into the slow-slow world of external storage. This is locality of reference, and it is something to be prized above all else if what you want is speed.

The HAL 15 is a smaller cousin of the HAL 300, with only 10 chips in it. The "Pensa" processor is a generic term for the combination of any FPGA chip and the Viva software.

Supercomputing: Please share a bit of information with the readers about your Viva 1.0 software.

LOOSEMORE: Viva is a graphical programming language that allows you to build software for the HAL machines. It is designed in such a way that you can either plunge down to the lowest levels of gates, flip-flops and registers, or you can abstract all the way out to processors and memories and objects. It has one immense benefit that is not available to the old-fashioned gate designer, which is that it does not have to pump information around in single-bit lines: you are not obliged to write one piece of code to handle an 8-bit operation, then another version for the 16-bit case, and so on. Instead you write a recursive definition of your object that can handle any bit width in terms of a smaller bit width, then a second version that explains how to handle the one-bit case. This recursion does not apply only to bit width: you can define your own data types as combinations of other types, then write your code in that same recursive way. If you have ever written anything in Lisp, you'd be at home straight away.

Viva is also the root of the superspecificity I mentioned earlier. When it comes to choosing an implementation for any given chunk of your program, it will look at the data rate expected to flow through that chunk, and if it finds, for example, that the rate need not be large, it will try to find a slow-but-compact way to implement it. Conversely, if your chunk is on the critical path it will try to use the fastest possible implementation.
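Two of the ideas in that answer lend themselves to small illustrations. First, the recursive bit-width definition: the sketch below, in ordinary Python rather than Viva, builds an N-bit adder from one rule that reduces any width to a smaller one plus a separate one-bit base case. It is an analogy for the style of definition Loosemore describes, not Viva code.

    # A width-N adder written in the style described above: one recursive case
    # for any width, plus a one-bit base case. Illustrative Python only.

    def full_adder(a, b, carry_in):
        """One-bit base case: returns (sum_bit, carry_out)."""
        total = a + b + carry_in
        return total & 1, total >> 1

    def ripple_adder(a_bits, b_bits, carry_in=0):
        """N-bit case, defined in terms of the one-bit case and an
        (N-1)-bit adder. Bits are lists, least significant bit first."""
        if len(a_bits) == 1:
            s, c = full_adder(a_bits[0], b_bits[0], carry_in)
            return [s], c
        s0, c0 = full_adder(a_bits[0], b_bits[0], carry_in)
        rest, carry_out = ripple_adder(a_bits[1:], b_bits[1:], c0)
        return [s0] + rest, carry_out

    # The same definition handles 8, 16 or any other width without rewriting it.
    def to_bits(x, width):
        return [(x >> i) & 1 for i in range(width)]

    def from_bits(bits):
        return sum(b << i for i, b in enumerate(bits))

    bits, carry = ripple_adder(to_bits(200, 16), to_bits(99, 16))
    assert from_bits(bits) == 299

Second, the data-rate-driven choice of implementation ("superspecificity") can be caricatured as picking the most compact implementation that still meets the required rate. The candidate table here is invented for illustration; only the dedicated (2-clock) and serial (~20-clock) latencies come from the earlier calculation.

    # Pick the most compact multiplier that still meets the required data rate.
    CANDIDATES = [
        # (name, results per second at 300 MHz, relative area - invented figures)
        ("superserial", 300e6 / 200, 1),
        ("serial",      300e6 / 20,  10),
        ("dedicated",   300e6 / 2,   100),
    ]

    def choose_multiplier(required_rate):
        feasible = [c for c in CANDIDATES if c[1] >= required_rate]
        return min(feasible, key=lambda c: c[2])[0]

    print(choose_multiplier(1e6))   # -> 'superserial' (slow source, tiny circuit)
    print(choose_multiplier(1e8))   # -> 'dedicated'  (critical path, fast circuit)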
Supercomputing: Are all these products currently available? What are we talking as far as price? A ballpark estimate is fine.

LOOSEMORE: The particular HAL 300 that I analyzed above is not yet available because Xilinx is still in the process of building the 10-million-gate chips. More generally, we are going through a transition from old technology to new because of the recent introduction of Xilinx's Virtex II family of chips, so although we have a small number of the current HAL 15 systems available, there are no more of the old-generation HAL 300s. I believe these HAL 15s are available for $150K, but we are offering discount arrangements with respect to the new systems when they become available next year.

Supercomputing: There's been a lot of news about clustering lately. Could Starbridge products be clustered? It seems like a cluster of HAL 300s would have a startling amount of power if clustering is relevant and doable.

LOOSEMORE: The possibility of running these machines in parallel is at the core of their design. There is a great deal of bandwidth coming out of the cards in each machine, and their connection architecture is such that as you add more cards or more machines to the system, the global architecture looks the same but all the modules are composed of larger sub-modules. Remember that we are dealing with an inherently parallel system here. Viva is all about building a large number of parallel processes that can be flexibly distributed across the underlying hardware. If that underlying hardware is expanded, it is a relatively straightforward matter to take advantage of it.
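The "same architecture, larger sub-modules" idea can be pictured as a self-similar tree: a card is a group of chips, a machine a group of cards, a cluster a group of machines, and the aggregate capability is the sum over the parts. The sketch below is hypothetical; it uses the ball-park 20 GFlops-per-chip figure from the earlier calculation and the 10-chip and 200-chip counts quoted for the HAL 15 and HAL 300, while the 250-machine cluster is simply the size that would reach a petaflop at that rate.

    # Hypothetical, self-similar model of the scaling described above.
    from dataclasses import dataclass
    from typing import List, Union

    @dataclass
    class Fpga:
        gflops: float = 20.0                    # ball-park per-chip estimate

    @dataclass
    class Module:
        parts: List[Union["Module", Fpga]]      # a module is made of sub-modules

    def total_gflops(node):
        if isinstance(node, Fpga):
            return node.gflops
        return sum(total_gflops(p) for p in node.parts)

    card = Module([Fpga() for _ in range(10)])  # 10 chips, HAL 15-sized
    machine = Module([card] * 20)               # 200 chips, HAL 300-sized
    cluster = Module([machine] * 250)           # assumed cluster size

    print(total_gflops(machine))                # 4,000 GFlops - the 4 TFlops estimate
    print(total_gflops(cluster))                # 1,000,000 GFlops - a petaflop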
Supercomputing: SRC does some work with reconfigurable computing and FPGAs. How do Starbridge Systems' approach and products differ?

LOOSEMORE: The main thing I understand about SRC is that each of their conventional processors has one or more FPGAs hanging off it. In theory this could be good news, but it is not clear to me that they have fully confronted the issue of how to utilize that extra power. (I seem to remember that when I was a toddler I decided one day to wake my mom up with a cup of tea, so I went to the kitchen and put half a cup of milk, a few tablespoons of sugar, some tea leaves and some cold water into a cup, stirred it around with a spoon, then delivered it to her with great pride. Had all the right ingredients, you see, but ...)

Supercomputing: Is there anything else you'd like to add?

LOOSEMORE: Too often in the past, supercomputer vendors have built wonderfully powerful systems that cost a great deal of money, and *then* required their customers to do some intricate hacking or porting in order to get anything running on the new systems. It is hardly surprising that the customers and their programmers found this effort painful at times. With a few notable exceptions, these machines were all destined for the scrap heap in a few years anyway. The saddest part of it was that these machines were never destined to migrate down to lower levels of the computer ecology. Nobody seriously believed that there would be Connection Machines for sale in CompUSA in a few years, running fabulously parallel versions of common software titles. They were physically too big (and not shrinking fast enough), and there was no serious way to write massively parallel software except for the kinds of tasks (like CFD) where it was a natural fit.

The HAL 300 I described above fits in a case about the size of a conventional desktop PC and gets all its electricity from a single domestic wall outlet. The Viva software with which you program that machine is certainly a radical departure from old programming approaches, but it is designed for longevity. There are some new developments now in the pipeline that will make it possible for highly parallelized software to be written for all types of applications, so in about five years' time you will be able to buy this same HAL 300 as a personal computer and use it to do anything that you can now do with a Pentium or a Macintosh - but a thousand times faster or (more significantly, I think you'll find) a thousand times more complex. So, in spite of the fact that we will be putting together a HAL 300 cluster next year to make the world's first Petaflop machine, what we are actually selling are not supercomputers destined for obsolescence but machines that will become PCs as they grow older.

----- Supercomputing Online wishes to thank Richard Loosemore for his time and insights. -----