[Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster.
James.P.Lux at jpl.nasa.gov
Thu Mar 10 11:48:19 PST 2005
At 09:19 AM 3/10/2005, Robert G. Brown wrote:
>On Thu, 10 Mar 2005, Joe Landman wrote:
> > Part of what motivates this question are things like the Cray XD1 FPGA
> > board, or PathScale's processors (unless I misunderstood their
> > functions). Other folks have CPUs on a card of various sorts, ranging
> > from FPGA to DSPs. I am basically wondering aloud what sort of demand
> > for such technology might exist. I assume the answer starts with "if
> > the price is right" ... the question is what is that price, what are
> > the features/functionality, and how hard do people want to work on such
> > bits.
>Problems with coprocessing solutions include:
> a) Cost -- sometimes they are expensive, although they >>can<< yield
>commensurate benefits for some code as you point out.
> b) Availability -- I don't just mean whether or not vendors can get
>them; I mean COTS vs non-COTS. They are frequently one-of-a-kind beasts
>with a single manufacturer.
Definitely an issue.
> c) Usability. They typically require "special tools" to use them at
>all. Cross-compilers, special libraries, code instrumentation. All of
>these things require fairly major programming effort to implement in
>your code to realize the speedup, and tend to decrease the
>general-purpose portability of the result, tying you even more tightly
>(after investing all this effort) with the (probably one) manufacturer
>of the add-on.
To a certain extent, though, this is being mitigated by things like Signal
Processing Workbench or Matlab, which have "plug ins" to convert generic
algorithm descriptions (i.e. simulink models, etc.) into runnable code on
the coprocessor or FPGA.
As far as product lock-in goes, "in theory" one could just recompile for a
new target processor, although I don't know if anyone's ever done this.
The tool-based approach does greatly reduce the "time and cost to
demonstrate capability."
> d) Continued Availability -- They also not infrequently disappear
>without a trace (as "general purpose" coprocessors, not necessarily as
>ASICs) within a year or so of being released and marketed. This is
>because Moore's Law is brutal, and even if a co-processor DOES manage to
>speed up your actual application (and not just a core loop that
>comprises 70% of your actual application) by a factor of ten, that's at
>most four or five years of ML advances. If your code has a base of 30%
>or so that isn't sped up at all (fairly likely) then your application
>runs maybe 2-3 times as fast at best and ML eats it in 1-3 years.
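As a rough check on the arithmetic quoted above, here is a minimal sketch using Amdahl's law. The 70%/30% split and the factor of ten come from the quoted text; the 18-month doubling period is an assumption (a common reading of Moore's Law), and the function names are made up for illustration.

```python
import math


def amdahl_speedup(accelerated_fraction, accelerator_speedup):
    """Overall speedup when only part of the run time is accelerated."""
    serial_fraction = 1.0 - accelerated_fraction
    return 1.0 / (serial_fraction + accelerated_fraction / accelerator_speedup)


def years_until_matched(speedup, doubling_period_years=1.5):
    """Years for commodity CPUs to catch up, ASSUMING performance doubles
    every doubling_period_years (18 months here; an assumption)."""
    return math.log2(speedup) * doubling_period_years


# 70% of the application is sped up 10x; the other 30% is untouched.
overall = amdahl_speedup(0.70, 10.0)

print(f"overall speedup: {overall:.1f}x")                                   # 2.7x
print(f"years to match that: {years_until_matched(overall):.1f}")           # 2.2
print(f"years to match a full 10x: {years_until_matched(10.0):.1f}")        # 5.0
```

Under those assumptions the numbers land inside the quoted ranges: roughly 2-3 times faster overall, overtaken in a couple of years, versus about five years for a true factor of ten.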
There are specialized applications, lending themselves to clusters, for
which this might not hold. If we look at Xilinx FPGAs, for instance, while
not quite doubling every 18 months, they ARE dramatically increasing in
speed and size fairly quickly. And, it's not hugely difficult to take a
design that ran at speed X on a size Y Xilinx FPGA and port it to speed A on
a size B Xilinx FPGA.
Consider a classic big crunching ASIC/FPGA application, that of running
many correlators in parallel to demodulate very faint signals buried in
noise (specifically, raw data coming back from deep space probes), or some
applications in radio astronomy. In the latter case, particularly, there's
a lot of interest in taking an array of radio telescopes and simultaneously
forming many beams, so you can look in lots of directions at once, to look for
transient events that are "interesting" (like supernovae). The radio
astronomy community is relatively poor (Paul Allen's interest
notwithstanding), so they've got an incentive to use cheap commodity
processing for their needs, but off the shelf PCs might not hack
it. They're looking at a lot of architectures that strongly resemble the
usual cluster... data from all antennas streams into a raft of processors
via ethernet, and each processor forms some subset of beams either in space
or frequency. They might have a coprocessor card in the machine that does
some of the early really intensive beamforming computation.
Take a look at the Allen Telescope Array or at the Square Kilometer Array
or at LOFAR.
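To make the "each processor forms some subset of beams" idea concrete, here is a toy sketch of narrowband phase-shift beamforming on a uniform linear array, with the beam list divided among hypothetical nodes. Everything in it is an assumption for illustration (16 antennas, half-wavelength spacing, one noise-free source at 20 degrees, two nodes, all function names); it is not how any of the arrays named above actually do it.

```python
import cmath
import math


def phase_step(spacing_wavelengths, angle_deg):
    """Phase difference between adjacent antennas for a plane wave."""
    return 2.0 * math.pi * spacing_wavelengths * math.sin(math.radians(angle_deg))


def simulate_snapshot(n_antennas, spacing_wavelengths, source_angle_deg):
    """One complex sample per antenna from a single unit-amplitude source."""
    step = phase_step(spacing_wavelengths, source_angle_deg)
    return [cmath.exp(1j * step * n) for n in range(n_antennas)]


def beam_power(snapshot, spacing_wavelengths, beam_angle_deg):
    """Undo the expected per-antenna phase, sum, and return normalized power."""
    step = phase_step(spacing_wavelengths, beam_angle_deg)
    total = sum(x * cmath.exp(-1j * step * n) for n, x in enumerate(snapshot))
    return abs(total / len(snapshot)) ** 2


def split_beams(beam_angles_deg, n_nodes):
    """Round-robin assignment of beams to (hypothetical) cluster nodes."""
    return [beam_angles_deg[i::n_nodes] for i in range(n_nodes)]


N_ANTENNAS = 16
SPACING = 0.5  # antenna spacing in wavelengths (assumed)
snapshot = simulate_snapshot(N_ANTENNAS, SPACING, 20.0)

all_beams = [-40.0, -20.0, 0.0, 20.0, 40.0]
node_results = {}
for node_beams in split_beams(all_beams, 2):
    # In a real cluster each node would do this part independently,
    # after receiving the same antenna data over the network.
    for angle in node_beams:
        node_results[angle] = beam_power(snapshot, SPACING, angle)

strongest = max(node_results, key=node_results.get)
print("strongest beam points at", strongest, "degrees")
```

Every beam needs the data from all antennas but none of the results from other beams, which is why the work divides so naturally across nodes.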
>Anecdotally I'm reminded of e.g. the 8087, Micro Way's old transputer
>sets (advertised in PC mag for decades), the i860 (IIRC), the CM-5, and
>many other systems built over the years that tried to provide e.g. a
>vector co-processor in parallel with a regular general purpose CPU,
>sometimes on the same motherboard and bus, sometimes on daughterboards
>or even on little mini-network connections hung off the bus somehow.
>None of these really caught on (except for the 8087, and it is an
>exercise for the studio audience as to why an add-on processor that
>really should have been a part of the original processor itself, made by
>the mfr of the actual crippled CPU from the beginning, succeeded),
That's pretty easy. In the good old days, you had an integer CPU and an
add on FPU in almost all architectures. The FPU didn't have instruction
decoding, sequencing, or anything like that.. more like an extra ALU that
tied to the internal bus. Just like having memory management in a separate
chip. Intel and Motorola both used this approach. Intel did start to
integrate the MMU into the chip with "segment registers" on the 8086,
except that it provided zip, zero, none, nada memory protection. This was
part of a strategy to keep the codebase compatible with the 8080. After
all, who in their right mind would write a program bigger than 64K.. the
user application code would never look at the segment registers, which
would be managed by a multitasking OS. Think of it as integrated "bank
switching", which was quite popular in the 8-bit processor world (and
itself an outgrowth of how PDP-11 memory management worked).
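For anyone who hasn't had the pleasure, a small sketch of the 8086 real-mode address calculation shows why the segment registers amount to bank switching with no protection. The function and constant names are made up for illustration; the shift-by-four-and-add rule and the 20-bit address bus are the 8086's.

```python
ADDRESS_MASK_8086 = 0xFFFFF  # the 8086 has 20 address lines (1 MB)


def physical_address(segment, offset):
    """8086 real mode: 16-bit segment shifted left 4 bits, plus 16-bit offset."""
    return ((segment << 4) + offset) & ADDRESS_MASK_8086


# A segment register just slides a 64K window around in 16-byte steps.
print(hex(physical_address(0x1234, 0x0010)))  # 0x12350

# Different segment:offset pairs can name the same byte, and nothing
# stops any program from loading any segment value it likes.
print(hex(physical_address(0x1235, 0x0000)))  # 0x12350

# The sum can overflow 20 bits; on the 8086 it simply wraps to the bottom.
print(hex(physical_address(0xFFFF, 0x0010)))  # 0x0
```

A program that never touches the segment registers sees a flat 64K, just like an 8080 would.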
It wasn't until the 80286 that there started to be some more sophistication,
and really, it was the 386 that made decent memory management possible.
Moto started with a virtual memory scheme and paging, and so became the
darling of software folks who had come to expect such things from the
PDP-11, DEC-10, DG, and even mainframe world.
In any case, NONE of them could have fit the FPU on the die and had decent
yields. Besides, you're talking about processors that cost $200-400 (in the 1980s)
and processors with integrated FPUs would have cost upwards of $1K-$1.5K
(because of the lower yield). As fab technology advanced, you could either
build bigger faster processors (in the separate CPU/FPU model) or you could
build integrated processors at the same slow speed.
Even today, I'd venture to guess that the vast majority of CPU cycles spent
on PCs are integer mode computations (bitblts and the like to make windows
work). It's not like you need FP to do Word or PowerPoint, or even
Excel. It's rendered 3D graphics that really drives FP performance in the
This drives an interesting battle between the graphics ASIC makers (so that
an add on card can do the rendering) and the CPU makers (who want to put it
onboard, so that the total system cost is less), and, as well, the support
provided by MS Windows to use either one effectively. The game market
clearly doesn't want to have to try and support ALL the possible graphics
cards out there (it was a nightmare trying to write high performance
graphics applications back in the late '80s and early '90s. The few skilled
folks who were good at it earned their shekels).
>although nearly all of them were used by at least a few intrepid
>individuals to great benefit. Allowing that Nature is efficient in its
>process of natural selection, this seems like a genetic/memetic
>variation that generally lacks the CBA advantages required to make it a
James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109