When designing a system for
high-speed computation, the selection of the processor is critical.
In this paper we will suggest that the
obvious choice may not always be the best foundation for a high-performance, cost-effective
system.
The vast majority of software used
today involves serial processing.
Instructions are carried out one at a time, sometimes in loops,
sometimes as subroutines, and sometimes suspended as interrupts of various
sorts are allowed to halt the current sequence of events.
With modern microprocessors such as
the Pentium 4, these instructions are often carried out so quickly that they
give the appearance of simultaneous execution.
Indeed, techniques such as pipelining provide overlapping instruction
execution to some degree. But when you
analyze the circuitry or the information flow, it becomes clear that the
instructions are being executed one by one in the vast majority of operations.
When you want to perform two or more
operations in a truly simultaneous manner, you increase the number of
processors.
Sometimes an additional processor is
placed on the same silicon as the CPU, such as the floating-point units that
complement various models in the Intel X86 series of processors.
You may recall that the 80286 processor had
a matching 80287 math coprocessor on a separate chip.
More recent X86 chips have the floating-point circuitry integrated on the
same silicon with the primary microprocessor, so that yes, you will have two
simultaneous processes being executed on the same chip.
But this is really only an asterisk on the
claim that in general, modern microprocessors are serial processors.
Despite some brilliant engineering
by Seymour Cray and others to speed up the operation of serial processor-based
supercomputers through vector processing and pipelining, by the 1980s it became
clear that further improvements in system throughput were needed to meet the
demand for higher performance. The
logical route was to revive the old idea of using multiple processors, and the
growing capabilities and low prices of microprocessors made them the obvious
candidates. Supercomputer producers
realized that there must come a time when no matter how fast they could make
one rabbit run, a small army of turtles could haul more freight to the
destination.
So they took several inexpensive
microprocessors, put them into the same box, and divided the workload between
them. In practice, the theoretical
boost in computational throughput was difficult to obtain because of the
overhead of coordinating the work of several processors.
Synchronization was difficult, as was access to memory.
Should you have all of the processors share
the same memory, or allocate separate memory to each processor and synchronize
the data when needed? Or do you share
memory in clusters with distributed shared memory (DSM)?
Each technique has some advantages but also creates its own
unique problems.
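To make the trade-off concrete, here is a minimal Python sketch (not from the paper) contrasting the two coordination styles in software: threads that update one shared total under a lock, versus workers that own their data and pass partial results back over a queue.

```python
import threading
import queue

# Shared-memory model: all workers update one counter, and a lock
# serializes the shared update to avoid lost writes.
def shared_memory_sum(values, n_workers=4):
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        s = sum(chunk)      # local work needs no coordination
        with lock:          # only the shared update is serialized
            total += s

    chunks = [values[i::n_workers] for i in range(n_workers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

# Message-passing model: no shared mutable state; each worker
# reports its partial result over a queue.
def message_passing_sum(values, n_workers=4):
    results = queue.Queue()

    def worker(chunk):
        results.put(sum(chunk))

    chunks = [values[i::n_workers] for i in range(n_workers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results.get() for _ in range(n_workers))
```

Either style gives the same answer; the difference is where the coordination cost lands, which is exactly the design choice the supercomputer builders faced.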
Despite more than a decade of
intensive efforts to develop software able to exploit the power inherent in a
massive array of microprocessors, the results have often been
disappointing. The IDC high-performance
computing conference in Dearborn, Michigan, this past April was filled with
complaints by center managers about their continuing dissatisfaction with some
very expensive systems. While
supercomputers of 1,000 or more processors have been built, with theoretical
performance in the teraflops range, you might be surprised to learn how often
the users of these systems complain about hitting a brick wall of performance
at 64 or even 32 processors. It is a
real challenge to coordinate the work of a large number of processors.
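The paper does not name it, but this brick wall is what Amdahl's law predicts: any serial fraction of a workload caps the achievable speedup no matter how many processors are added. A short Python illustration:

```python
# Amdahl's law: if fraction p of a job can run in parallel,
# the speedup on n processors is 1 / ((1 - p) + p / n).
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Even a 95%-parallel workload flattens out quickly.
for n in (1, 32, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 1))
```

With a workload that is 95 percent parallel, 64 processors deliver only about a 15x speedup, and no number of processors can ever push it past 20x.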
Some of you may have experience with
a Beowulf cluster of PCs, the poor person's supercomputer.
Communications bottlenecks and data delays
from simply the distance between PC boxes limit the effectiveness of this
low-cost approach to parallel processing.
But the popularity of this concept demonstrates that there is a demand
for almost any kind of parallel system.
Microprocessor Characteristics
If parallel processing is failing to
deliver on its promises, let's follow the lead of the TV quiz show and see if
we can find the Weakest Link. Our vote
would be the type of chips chosen to power today's supercomputers.
Modern microprocessors from AMD,
Intel, IBM, Motorola, and other manufacturers are incredible devices.
They can perform a bewildering array of
functions at blazing speeds; their clock rates are now about three orders of
magnitude faster than they were just a couple of decades ago.
Look at the specs on any of them and you
simply have to be amazed.
But there is a catch.
At any given instant there is likely to be
just a tiny portion of the chip doing serious work.
Millions of transistors are sitting at the red stoplight, revving
their engines in hot standby mode, waiting for the green light to free them to
perform their specific functions. They
eat power and waste real estate perhaps 99 percent of the time.
The modern chip can do almost anything you
would want a digital circuit to do. But
at any instant, there isn't very much on this chip that is doing anything
productive at all.
If we wanted a single chip to
perform several truly simultaneous processes, what parts of the circuitry on a
microprocessor would we duplicate, triplicate, quadruplicate, etc.?
That's not an easy question to answer,
because until we know the nature of the problem we will be solving, we don't
know which functions will be needed.
And any type of problem could be thrown at the chip, because it does so
many things well. By the same token, if
we decided to perform several simultaneous processes on a single Application
Specific Integrated Circuit (ASIC), which parts of its circuitry would we
replicate multiple times? This might be a little easier to
answer, as the ASIC is optimized for a smaller number of functions, and perhaps
several of these could be replicated in an oversized ASIC.
But there is a third type of chip that
deserves our consideration.
Field programmable gate arrays are
chips that leave the factory without predefined circuitry.
The buyer, often a circuit board
manufacturer, determines the circuitry required and loads a configuration
that, in effect, turns the FPGA into an ASIC.
In the vast majority of cases, the
program instantiated on the FPGA is never changed throughout its useful
life. But modern FPGAs from companies
such as Altera and Xilinx are more flexible than that.
They can be reconfigured over and over, just
as data stored in RAM can be changed on a continuing basis. Indeed, you can configure part of an FPGA to
perform Function A, then reconfigure it to perform Function B, then reconfigure
it again to perform Function C, etc.
And if the function you put on the chip leaves some real estate left
over, you can use that excess capacity to handle additional circuitry for other
tasks. So you can have multiple
processes being executed on the same chip.
Let's assume that there is a
specific function you want to use in a parallel manner, and it only requires five
percent of the real estate available on the FPGA.
This will allow you to replicate that functionality 20 times on
the same chip. And if you put 10 chips
on the same circuit board, you can replicate that functionality 200 times on
the circuit board. And if you put 10
circuit boards in a box, you can replicate that functionality 2,000 times in a
single box.
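The back-of-the-envelope arithmetic above can be checked in a few lines of Python (the 5 percent figure and the 10-chip, 10-board counts are the text's illustrative numbers, not a specific product):

```python
# Replicating a small function across FPGA real estate,
# using the illustrative figures from the text.
fraction_per_copy = 0.05                        # 5% of one chip
copies_per_chip = int(1 / fraction_per_copy)    # 20 copies per chip
chips_per_board = 10
boards_per_box = 10
copies_per_board = copies_per_chip * chips_per_board  # 200 per board
copies_per_box = copies_per_board * boards_per_box    # 2,000 per box
```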
By now you should understand why we
question the choice of the full-function microprocessor as the basic unit of
computing power in a parallel system.
For a moment, let's leave the real
world of practical problems for high-performance systems and let our
imaginations run wild. Let's envision
the ideal high-speed computing system, your personal dream machine.
It would have to be parallel to do many
tasks at once. It would have to be
scalable, so you could increase its power by adding more chips or upgrading to
more powerful chips without changing your code.
It would be small to minimize inter-process communication delays and
allow it to be portable. It would have
modest power consumption, so you could run it off a battery back-up system
during a California blackout. It would
be easy to program, so that you could access its power on your own.
And it would not cost much.
But that's just a dream.
Unfortunately, you cannot buy such a system
today. However, you can see one in
operation. The NASA Langley Research
Center is experimenting with the prototype for such a system:
a HAL-15 from Star Bridge Systems.
This paper is based upon the presentation “Moving to the
Parallel Universe,” given by Robert D. Bliss and Dr. Lloyd G. Allred to the Software Technology Conference, May 2, 2001.
Star Bridge Systems
7651 South Main Street
Midvale, UT 84047
Phone 801-984-4444
Fax 801-984-4445
www.starbridgesystems.com