The Wrong Chip

When designing a system for high-speed computation, the selection of the processor is critical. In this paper we will suggest that the obvious choice may not always be the best foundation for a high-performance, cost-effective system.

Serial Processing

The vast majority of software used today involves serial processing. Instructions are carried out one at a time, sometimes in loops, sometimes as subroutines, and sometimes suspended as interrupts of various sorts are allowed to halt the current sequence of events.

With modern microprocessors such as the Pentium 4, these instructions are often carried out so quickly that they give the appearance of simultaneous execution. Indeed, techniques such as pipelining provide overlapping instruction execution to some degree. But when you analyze the circuitry or the information flow, it becomes clear that the instructions are being executed one by one in the vast majority of operations. When you want to perform two or more operations in a truly simultaneous manner, you increase the number of processors.
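
To make the distinction concrete, here is a rough back-of-the-envelope sketch in Python (not drawn from any particular processor; the five-stage pipeline, single-cycle stages, and absence of stalls are assumptions chosen purely for illustration):

    # Idealized cycle counts: serial vs. pipelined execution (illustration only).
    # Assumed: every instruction passes through the same five stages, one stage
    # per clock cycle, with no stalls or hazards.

    def serial_cycles(instructions: int, stages: int) -> int:
        # Each instruction finishes all of its stages before the next one starts.
        return instructions * stages

    def pipelined_cycles(instructions: int, stages: int) -> int:
        # Once the pipeline is full, one instruction completes per cycle.
        return stages + (instructions - 1)

    print(serial_cycles(1000, 5))     # 5000 cycles
    print(pipelined_cycles(1000, 5))  # 1004 cycles

Even in this best case the instructions still retire one at a time; the pipeline overlaps their stages rather than executing them truly simultaneously.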

Sometimes an additional processor is placed on the same silicon as the CPU, such as the floating-point units that complement various models in the Intel x86 series of processors. You may recall that the 80286 processor had a matching 80287 math coprocessor on a separate chip. More recent x86 chips have the floating-point circuitry integrated on the same silicon with the primary microprocessor, so that yes, you can have two simultaneous processes being executed on the same chip. But this is really only an asterisk on the claim that, in general, modern microprocessors are serial processors.

High-Performance Computing

Despite some brilliant engineering by Seymour Cray and others to speed up the operation of serial processor-based supercomputers through vector processing and pipelining, by the 1980s it became clear that further improvements in system throughput were needed to meet the demand for higher performance. The logical route was to revive the old idea of using multiple processors, and the growing capabilities and low prices of microprocessors made them the obvious candidates. Supercomputer producers realized that there must come a time when no matter how fast they could make one rabbit run, a small army of turtles could haul more freight to the destination.

So they took several inexpensive microprocessors, put them into the same box, and divided the workload between them. In practice, the theoretical boost in computational throughput proved difficult to obtain once the work of several processors had to be coordinated. Synchronization was difficult, as was access to memory. Should you have all of the processors share the same memory, or allocate separate memory to each processor and synchronize the data when needed? Or do you share memory within clusters, using distributed shared memory? Each technique has some advantages, but each also creates its own unique problems.
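
As a rough sketch of what those choices look like in software terms, here is a small Python example using the standard multiprocessing module (purely for illustration; it does not reflect how any particular supercomputer is programmed). The first half divides the data among workers that keep separate memories and return results as messages; the second half has every worker update a single shared value and therefore needs a lock:

    from multiprocessing import Pool, Process, Value, Lock

    def square(x):
        return x * x

    def add_partial(total, lock, chunk):
        # Each worker computes a partial sum, then takes the lock so that
        # only one process at a time updates the shared total.
        partial = sum(chunk)
        with lock:
            total.value += partial

    if __name__ == "__main__":
        data = list(range(1000))

        # Separate memory: each worker gets its own copy of its share of the
        # data and sends results back as messages.
        with Pool(processes=4) as pool:
            squares = pool.map(square, data)

        # Shared memory: all workers update one shared value and must
        # synchronize through a lock.
        total, lock = Value("d", 0.0), Lock()
        chunks = [data[i::4] for i in range(4)]
        workers = [Process(target=add_partial, args=(total, lock, c)) for c in chunks]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        print(sum(squares), total.value)

Either way, coordination machinery appears that a single serial processor never needed.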

Despite more than a decade of intensive efforts to develop software able to exploit the power inherent in a massive array of microprocessors, the results have often been disappointing. The IDC high-performance computing conference in Dearborn, Michigan, this past April was filled with complaints from center managers about some very expensive systems. While supercomputers of 1,000 or more processors have been built, with theoretical performance in the teraflops range, you might be surprised to learn how often the users of these systems complain about hitting a brick wall of performance at 64 or even 32 processors. It is a real challenge to coordinate the work of a large number of processors.
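
One common way to see why such a wall appears, offered here purely as an illustration rather than as a model of any particular machine, is Amdahl's law: if only a fraction of the work can run in parallel, the serial remainder caps the speedup no matter how many processors you add. A short Python sketch:

    # Amdahl's law: an idealized model of how a fixed serial fraction limits
    # parallel speedup (illustrative numbers only, not measurements).
    def speedup(parallel_fraction: float, processors: int) -> float:
        serial_fraction = 1.0 - parallel_fraction
        return 1.0 / (serial_fraction + parallel_fraction / processors)

    # Suppose 95 percent of a job can be parallelized (an assumption).
    for n in (32, 64, 1000):
        print(n, round(speedup(0.95, n), 1))
    # 32 -> 12.5x, 64 -> 15.4x, 1000 -> 19.6x; the ceiling is 20x no matter
    # how many processors are added.

And this model does not even count the communication and synchronization costs described above.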

Some of you may have experience with a Beowulf cluster of PCs, the poor person's supercomputer. Communications bottlenecks and data delays arising simply from the physical distance between PC boxes limit the effectiveness of this low-cost approach to parallel processing. But the popularity of the concept demonstrates that there is a demand for almost any kind of parallel system.

Microprocessor Characteristics

If parallel processing is failing to deliver on its promises, let's follow the lead of the TV quiz show and see if we can find the Weakest Link. Our vote would be the type of chips chosen to power today's supercomputers.

Modern microprocessors from AMD, Intel, IBM, Motorola, and other manufacturers are incredible devices. They can perform a bewildering array of functions at blazing speeds; their clock rates are now about three orders of magnitude faster than they were just a couple of decades ago. Look at the specs on any of them and you simply have to be amazed.

But there is a catch. At any given instant there is likely to be just a tiny portion of the chip doing serious work. Millions of transistors are sitting at the red stoplight, revving their engines in hot standby mode, waiting for the green light to free them to perform their specific functions. They eat power and waste real estate perhaps 99 percent of the time. The modern chip can do almost anything you would want a digital circuit to do. But at any instant, there isn't very much on this chip that is doing anything productive at all.

Parallel Processing on One Chip

If we wanted a single chip to perform several truly simultaneous processes, what parts of the circuitry on a microprocessor would we duplicate, triplicate, quadruplicate, etc.? That's not an easy question to answer, because until we know the nature of the problem we will be solving, we don't know which functions will be needed. And any type of problem could be thrown at the chip, because it does so many things well. By the same token, if we decided to perform several simultaneous processes on a single Application-Specific Integrated Circuit (ASIC), which parts of its circuitry would we replicate multiple times? This might be a little easier to answer, since an ASIC is optimized for a smaller number of functions, and perhaps several of these could be replicated in an oversized ASIC. But there is a third type of chip that deserves our consideration.

FPGA

Field programmable gate arrays are chips that leave the factory without predefined circuitry. The buyer, often a circuit board manufacturer, determines the circuitry required and configures the FPGA to provide it, in effect using the FPGA as an application-specific device.

In the vast majority of cases, the configuration loaded onto the FPGA is never changed throughout its useful life. But modern FPGAs from companies such as Altera and Xilinx are more flexible than that. They can be reconfigured over and over, just as data stored in RAM can be changed on a continuing basis. Indeed, you can configure part of an FPGA to perform Function A, then reconfigure it to perform Function B, then reconfigure it again to perform Function C, and so on. And if the function you put on the chip leaves some real estate unused, you can use that excess capacity for additional circuitry serving other tasks. So you can have multiple processes being executed on the same chip.

Let's assume that there is a specific function you want to use in a parallel manner, and it only requires five percent of the real estate available on the FPGA. This will allow you to replicate that functionality 20 times on the same chip. And if you put 10 chips on the same circuit board, you can replicate that functionality 200 times on the circuit board. And if you put 10 circuit boards in a box, you can replicate that functionality 2,000 times in a single box.
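
Spelled out in code (a trivial Python calculation, with the five percent figure and the chip and board counts taken as the illustrative assumptions above):

    # Replication counts for a function that needs five percent of an FPGA's
    # real estate; all figures are the illustrative assumptions from the text.
    percent_of_chip_per_copy = 5
    chips_per_board = 10
    boards_per_box = 10

    copies_per_chip = 100 // percent_of_chip_per_copy     # 20
    copies_per_board = copies_per_chip * chips_per_board  # 200
    copies_per_box = copies_per_board * boards_per_box    # 2,000
    print(copies_per_chip, copies_per_board, copies_per_box)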

By now you should understand why we question the choice of the full-function microprocessor as the basic unit of computing power in a parallel system.

A Parallel Dream Machine

For a moment, let's leave the real world of practical problems for high-performance systems and let our imaginations run wild. Let's envision the ideal high-speed computing system, your personal dream machine. It would have to be parallel to do many tasks at once. It would have to be scalable, so you could increase its power by adding more chips or upgrading to more powerful chips without changing your code. It would be small to minimize inter-process communication delays and allow it to be portable. It would have modest power consumption, so you could run it off a battery backup system during a California blackout. It would be easy to program, so that you could access its power on your own. And it would not cost much.

But that's just a dream. Unfortunately, you cannot buy such a system today. However, you can see one in operation. The NASA Langley Research Center is experimenting with the prototype for such a system: a HAL-15 from Star Bridge Systems.

This paper is based upon the presentation “Moving to the Parallel Universe,” given by Robert D. Bliss and Dr. Lloyd G. Allred to the Software Technology Conference, May 2, 2001.

Star Bridge Systems
7651 South Main Street
Midvale, UT 84047
Phone 801-984-4444
Fax 801-984-4445
www.starbridgesystems.com