[Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster.
landman at scalableinformatics.com
Thu Mar 10 09:57:29 PST 2005
Robert G. Brown wrote:
> On Thu, 10 Mar 2005, Joe Landman wrote:
> Problems with coprocessing solutions include:
> a) Cost -- sometimes they are expensive, although they >>can<< yield
> commensurate benefits for some code as you point out.
I am imagining a co-processor in the 1/10x -> 4x range compared to node
cost: a graphics card is the prototype I have in mind.
> b) Availability -- I don't just mean whether or not vendors can get
> them; I mean COTS vs non-COTS. They are frequently one-of-a-kind beasts
> with a single manufacturer.
This is an issue with almost anything except CPUs, where we have 2 (or 3
if you include PPC) manufacturers.
> c) Usability. They typically require "special tools" to use them at
> all. Cross-compilers, special libraries, code instrumentation.
TANSTAAFL. The idea is that the cost to do the "port" has to be low
(relative to the benefit).
> All of
> these things require fairly major programming effort to implement in
> your code to realize the speedup, and tend to decrease the
> general-purpose portability of the result, tying you even more tightly
> (after investing all this effort) with the (probably one) manufacturer
> of the add-on.
> d) Continued Availability -- They also not infrequently disappear
> without a trace (as "general purpose" coprocessors, not necessarily as
> ASICs) within a year or so of being released and marketed. This is
> because Moore's Law is brutal, and even if a co-processor DOES manage to
> speed up your actual application (and not just a core loop that
> comprises 70% of your actual application) by a factor of ten, that's at
> most four or five years of ML advances. If your code has a base of 30%
> or so that isn't sped up at all (fairly likely) then your application
> runs maybe 2-3 times as fast at best and ML eats it in 1-3 years.
And this is why the pricing issue is important. At what point does it
make economic sense to buy a coprocessor? In the case of graphics
cards, the coprocessor has amazing economies of scale (and it needs it).
You need similar economies of scale for a coprocessor system, which is
why I think the cost should be similar to the node cost (like existing
graphics cards at 1/10 to 4x node cost).
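A quick back-of-the-envelope sketch of that break-even point (all the numbers below are illustrative assumptions, not vendor data; the function names are mine):

```python
# Compare throughput-per-dollar of a plain node against a node plus coprocessor.

def amdahl_speedup(accel_fraction, accel_factor):
    """Overall speedup when only part of the runtime is accelerated."""
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_factor)

def throughput_per_dollar(node_cost, coproc_cost, accel_fraction, accel_factor):
    """Relative work per dollar for a node + coprocessor."""
    return amdahl_speedup(accel_fraction, accel_factor) / (node_cost + coproc_cost)

# A coprocessor at 1/10 the node cost that makes 70% of the runtime 10x faster:
bare = 1.0                                            # bare node: 1 unit of work/dollar
cheap = throughput_per_dollar(1.0, 0.1, 0.7, 10.0)    # ~2.46x: clearly worth it
pricey = throughput_per_dollar(1.0, 4.0, 0.7, 10.0)   # ~0.54x: worse than more nodes
```

At the cheap end of the range the coprocessor wins handily; at 4x node cost, the same 10x core-loop speedup loses to simply buying more ordinary systems.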
> e) Support. Using the tools and processors effectively requires a
> fair bit of knowledge, but there is usually a pitifully small set of
> other implementers of the non-mainstream technology and no good
> communications channels between them (with some exceptions, of course).
Hmmm. OpenGL uses C/C++/Fortran bindings to get at the power (at least
I think there is a way to call GL from Fortran). What I was thinking of
was a high-level (C/Fortran/C++) interface to them, a la OpenGL. Jeff
Layton, if you are around, what is the name of that compiler set for the
GPUs? Brook? Something like that.
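For flavor, the stream-kernel model those GPU language efforts expose looks roughly like this, sketched here in plain Python for illustration (Brook itself is a C dialect; `map_kernel` and `saxpy` are invented names, not anyone's real API):

```python
# The programmer writes only a per-element kernel; the runtime maps it
# over whole streams, which is what the GPU hardware does in parallel.

def map_kernel(kernel, *streams):
    """Apply a per-element kernel across input streams."""
    return [kernel(*elems) for elems in zip(*streams)]

def saxpy(a):
    # The per-element body: result = a*x + y.
    return lambda x, y: a * x + y

result = map_kernel(saxpy(2.0), [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
# result == [12.0, 24.0, 36.0]
```

The point of the high-level interface is that nothing here mentions the hardware at all.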
> You're likely to be mostly on your own while trying to get the tools
> installed, code written and debugged, and eventually made efficient. If
> the tool or processor turns out to be "broken" for your purpose, you
> aren't likely to get much help with this, either, as you're a fringe
> market (again, with possible exceptions).
> Each of these alter the naive cost-benefit estimate of "Gee it is 10x
> faster in my core loop and only makes my system cost 2x as much".
> Maybe it is 10x faster in the core loop that is 70% of your code, so
> that now the application runs in 0.37x the original time (good, but now
> has to be compared to perhaps 0.5x the time available from getting 2x as
> many ordinary systems). Maybe it takes you four months to get the
> cross-compiler installed and all your code ported and to then TWEAK the
> code so it really DOES give you the touted 10x speedup for your core
> loops, which may have to be reblocked and written using special
> instructions, which then also necessitates revalidating the results (in
> case bugs have crept in during the port). Maybe the company that made
> the core DSP releases a new one in the meantime (they've got ML to
> contend with as well) and it has a different instruction set, so that a
> year from now when you want to expand the cluster you either
> re-instrument all the code again or rely on warehoused chips of the old
> design.
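The arithmetic quoted above is just Amdahl's law; a quick check of the numbers:

```python
# 10x speedup on the 70% core loop vs. buying 2x the ordinary systems.

def remaining_time(accel_fraction, accel_factor):
    """Fraction of the original runtime left after accelerating part of the code."""
    return (1.0 - accel_fraction) + accel_fraction / accel_factor

coproc = remaining_time(0.7, 10.0)   # 0.3 + 0.07 = 0.37 of the original time
more_nodes = 1.0 / 2.0               # 0.50 with twice the systems (assuming ideal scaling)
```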
Again, I point to OpenGL as a prototypical interface for this. The
underlying driver may change, but the interface is effectively constant
to the programmer, regardless of how many pixel shaders exist in the
underlying hardware.
> Maybe in 1 year dual core, 64 bit CPUs are released that
> effectively double, then double again, what you can get out of COTS
> systems at near constant cost and your 32 bit CPU plus coprocessor
> suddenly is slower, less portable, AND more expensive.
Well, this has happened in the GPU market, and the GPUs have tracked
with ML. This is an issue for anyone committing to any computer of any
sort: ML is ML, and it is going to drop the value of whatever you
purchase.
> Or not. Maybe it speeds things up 10x, costs only 2x, will be available
> for at least 3 more years, has a user base with hundreds of users and a
> dedicated mailing list, has commercial or open source compiler support
> that requires only minor tweaks or the use of standard library calls to
> get most of the benefit, and is built to a standard so that four
> companies make the actual chips, not just one.
And an interface layer that masks the chip differences, so that when
chips are changed out, the programs need not change (like OpenGL),
though they can be updated to take advantage of great new feature X (an
additional MAC layer in the pipeline).
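That "stable interface, swappable hardware" idea can be sketched in a few lines (the class and method names below are invented for illustration, not an existing library):

```python
# Fixed programmer-facing interface; backends change underneath
# without any change to user source, in the spirit of OpenGL.

class Accelerator:
    def dot(self, x, y):
        raise NotImplementedError

class SoftwareFallback(Accelerator):
    # Reference path: runs anywhere, like a software renderer.
    def dot(self, x, y):
        return sum(a * b for a, b in zip(x, y))

class NewChip(SoftwareFallback):
    # A new chip revision could override hot paths (great new feature X),
    # but code written against Accelerator never needs to change.
    pass

def energy(backend, x, y):
    return backend.dot(x, y)   # user code sees only the stable interface
```

Swapping `SoftwareFallback` for `NewChip` changes nothing for the caller, which is exactly the property that protects the porting investment.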
> I'm just reviewing the questions one would like to ask.
> Anecdotally I'm reminded of e.g. the 8087, Micro Way's old transputer
> sets (advertised in PC mag for decades), the i860 (IIRC), the CM-5, and
> many other systems built over the years that tried to provide e.g. a
> vector co-processor in parallel with a regular general purpose CPU,
> sometimes on the same motherboard and bus, sometimes on daughterboards
> or even on little mini-network connections hung off the bus somehow.
> None of these really caught on (except for the 8087, and it is an
> exercise for the studio audience as to why an add-on processor that
> really should have been a part of the original processor itself, made by
> the mfr of the actual crippled CPU from the beginning, succeeded),
> although nearly all of them were used by at least a few intrepid
> individuals to great benefit. Allowing that Nature is efficient in its
> process of natural selection, this seems like a genetic/memetic
> variation that generally lacks the CBA advantages required to make it a
> real success.
So there is an expression that I like to attribute to myself, though I
may have "borrowed" it from elsewhere:
Something designed to fail often will.
The "general purpose" accelerator cards (transputer, NS32032, ...) all
suffered from a lack of application focus among other things. There was
the prevalent attitude of "if you build it, then they will buy". These
units largely failed to take hold apart from tiny niches.
OTOH, "specialized" accelerator cards (graphics cards, RAID cards, sound
cards) have been a smashing success: the CBA makes sense, they deliver a
specific value, and they are easy to use. The take-home message is that
any accelerator card needs to do the same. What these accelerator cards
do is offload work from the CPU. Not all of them will work as
businesses, and this isn't a magical formula for success.
Moreover, the "specialized" GPUs seem to have applicability in CFD and
other areas. This is interesting, as it opens a possibility for
significant acceleration of some computations. The fundamental question
is whether or not there will be wide adoption. I am not seeing wide
adoption of the GPU as a CFD engine right now, but what if you had a
"CFD engine" chip that cost about the same as a GPU, stuck it on a
card, and gave it a high-level language interface, so that you could
hand it your expensive routines to crank on?
The physics chip bit got me thinking along molecular dynamics lines
last night, specifically the non-bonded calculations. I am sure others
could regale us with their computational burdens (and I would like to
hear them myself at some point; it is quite instructive to hear what
people are worrying about).
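The non-bonded term I have in mind is the classic all-pairs sum; a naive sketch (reduced units, not a real force field -- the function name is mine):

```python
# O(N^2) pairwise Lennard-Jones energy. This data-parallel, all-pairs
# shape is exactly what makes it a candidate for offload to a coprocessor.
import math

def lj_energy(positions, epsilon=1.0, sigma=1.0):
    """Total Lennard-Jones energy over all unique pairs."""
    total = 0.0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(positions[i], positions[j])
            sr6 = (sigma / r) ** 6
            total += 4.0 * epsilon * (sr6 * sr6 - sr6)
    return total

# Two particles at the potential minimum r = 2^(1/6) * sigma give E = -epsilon:
e = lj_energy([(0.0, 0.0, 0.0), (2.0 ** (1.0 / 6.0), 0.0, 0.0)])
```

This inner loop is exactly the sort of expensive routine you would hand to a "physics engine" card and let it crank on.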
I think the physics chip in hardware is a neat idea, though I think you
need a high-level interface to it, open standards, and lots of support
to make it work. Moreover, it needs to be programmable: not because
physics changes so often, but because the implied models may differ from
what you want.
As I said, I am curious, and I think it is an interesting idea. If done
right, with the wind at the right angles and good user/community
support, I think it could work :)