[Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster.

Robert G. Brown rgb at phy.duke.edu
Thu Mar 10 09:19:11 PST 2005


On Thu, 10 Mar 2005, Joe Landman wrote:

> Inverting the question, if you pay 4000$US per dual CPU compute node 
> (+/- a bit depending upon technology, config, supplier), what price (if 
> any) would you be willing to pay for an accelerator that offered you an 
> order of magnitude more performance per node, on your code, and sat in 
> the PCI-e/X or HTX slots?  And also as important: how hard would you be 
> willing to work/how much effort committed to program these things?  This 
> makes lots of assumptions, such as such a beast existing, your code 
> being mapped or mappable to it, and you being interested in this.
> 
> Part of what motivates this question are things like the Cray XD1 FPGA 
> board, or PathScale's processors (unless I misunderstood their 
> functions).  Other folks have CPUs on a card of various sorts, ranging 
> from FPGA to DSPs.   I am basically wondering aloud what sort of demand 
> for such technology might exist.  I assume the answer starts with "if 
> the price is right" ...  the question is what is that price, what are 
> the features/functionality, and how hard do people want to work on such 
> bits.
> 
> Note:  As Jeff Layton pointed out many times, the GPUs in a number of 
> machines are being used by at least one group for CFD, so you can think 
> of these as a sort of dedicated attached processor.  They are not 
> general purpose, but highly specialized computational pipelines.  If you 
> could have a more general one, what would it look like, what would it 
> do/emphasize, and how much would it cost?  I know there is no one 
> answer, but I thought it would be fun to extend Omri's question.

Problems with coprocessing solutions include:

  a) Cost -- sometimes they are expensive, although they >>can<< yield
commensurate benefits for some code as you point out.

  b) Availability -- I don't just mean whether or not vendors can get
them; I mean COTS vs non-COTS.  They are frequently one-of-a-kind beasts
with a single manufacturer.

  c) Usability -- They typically require "special tools" to use them at
all: cross-compilers, special libraries, code instrumentation.  All of
these things require fairly major programming effort to implement in
your code to realize the speedup, and tend to decrease the
general-purpose portability of the result, tying you even more tightly
(after investing all this effort) to the (probably one) manufacturer
of the add-on.

  d) Continued Availability -- They also not infrequently disappear
without a trace (as "general purpose" coprocessors, not necessarily as
ASICs) within a year or so of being released and marketed.  This is
because Moore's Law is brutal, and even if a co-processor DOES manage to
speed up your actual application (and not just a core loop that
comprises 70% of your actual application) by a factor of ten, that's at
most four or five years of ML advances.  If your code has a base of 30%
or so that isn't sped up at all (fairly likely), then your application
runs maybe 2-3 times as fast at best and ML eats it in 1-3 years (see
the sketch after this list).

  e) Support -- Using the tools and processors effectively requires a
fair bit of knowledge, but there is usually a pitifully small set of
other implementers of the non-mainstream technology and no good
communications channels between them (with some exceptions, of course).
You're likely to be mostly on your own while trying to get the tools
installed, code written and debugged, and eventually made efficient.  If
the tool or processor turns out to be "broken" for your purpose, you
aren't likely to get much help with this, either, as you're a fringe
market (again, with possible exceptions).
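
To put numbers on the Moore's Law point in item d), here is a minimal
back-of-the-envelope sketch (Python, purely illustrative -- the 70%/30%
split, the 10x factor, and the ~1.5-year doubling time are just the
example numbers used above, not measurements):

  import math

  # Amdahl-style estimate: overall speedup when only part of the
  # runtime is accelerated by the coprocessor.
  def effective_speedup(accel_fraction, accel_factor):
      return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_factor)

  s = effective_speedup(0.70, 10.0)                # 10x on 70% of the runtime
  print("effective speedup: %.2fx" % s)            # ~2.70x, not 10x

  # Years until commodity CPUs alone deliver the same factor, assuming
  # performance doubles roughly every 1.5 years:
  years = math.log(s, 2) * 1.5
  print("Moore's Law catches up in ~%.1f years" % years)   # ~2.1 years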

Each of these alters the naive cost-benefit estimate of "Gee, it is 10x
faster in my core loop and only makes my system cost 2x as much".

Maybe it is 10x faster in the core loop that is 70% of your code, so
that now the application runs in 0.37x the original time (good, but now
has to be compared to perhaps 0.5x the time available from getting 2x as
many ordinary systems).  Maybe it takes you four months to get the
cross-compiler installed and all your code ported and to then TWEAK the
code so it really DOES give you the touted 10x speedup for your core
loops, which may have to be reblocked and written using special
instructions, which then also necessitates revalidating the results (in
case bugs have crept in during the port).  Maybe the company that made
the core DSP releases a new one in the meantime (they've got ML to
contend with as well) and it has a different instruction set, so that a
year from now when you want to expand the cluster you either
re-instrument all the code again or rely on warehoused chips of the old
variety.  Maybe in a year dual-core, 64-bit CPUs are released that
effectively double, then double again, what you can get out of COTS
systems at near-constant cost, and your 32-bit CPU plus coprocessor is
suddenly slower, less portable, AND more expensive.
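
The naive comparison in that paragraph can be written down in a few
lines as well.  Again just a sketch: the 2x node price, the 10x/70%
acceleration, and the assumption that the job scales perfectly across
twice as many plain nodes are all illustrative numbers, and the porting
effort, support, and lifetime risks are deliberately left out:

  accel_fraction, accel_factor = 0.70, 10.0
  accel_runtime = (1.0 - accel_fraction) + accel_fraction / accel_factor
  print("accelerated node: %.2fx of original runtime" % accel_runtime)    # 0.37x

  two_node_runtime = 0.5   # job scales perfectly across 2 plain nodes
  print("two plain nodes:  %.2fx of original runtime" % two_node_runtime)

  # Both options cost roughly 2x one plain node, so work per dollar is
  # proportional to 1/runtime:
  ratio = (1.0 / accel_runtime) / (1.0 / two_node_runtime)
  print("work per dollar, accelerator vs. 2 plain nodes: %.2f" % ratio)   # ~1.35

On that naive estimate the accelerator can still come out a bit ahead
per dollar; all the "maybes" above are ways that paper margin gets
eaten.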

Or not.  Maybe it speeds things up 10x, costs only 2x, will be available
for at least 3 more years, has a user base with hundreds of users and a
dedicated mailing list, has commercial or open source compiler support
that requires only minor tweaks or the use of standard library calls to
get most of the benefit, and is built to a standard so that four
companies make the actual chips, not just one.

I'm just reviewing the questions one would like to ask.

Anecdotally I'm reminded of e.g. the 8087, Microway's old transputer
sets (advertised in PC mag for decades), the i860 (IIRC), the CM-5, and
many other systems built over the years that tried to provide e.g. a
vector co-processor in parallel with a regular general purpose CPU,
sometimes on the same motherboard and bus, sometimes on daughterboards
or even on little mini-network connections hung off the bus somehow.

None of these really caught on (except for the 8087, and it is an
exercise for the studio audience as to why an add-on processor that
really should have been a part of the original processor itself, made by
the mfr of the actual crippled CPU from the beginning, succeeded),
although nearly all of them were used by at least a few intrepid
individuals to great benefit.  Allowing that Nature is efficient in its
process of natural selection, this seems like a genetic/memetic
variation that generally lacks the CBA advantages required to make it a
real success.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu




