[Beowulf] Cluster OpenMP

Tue May 16 21:54:45 PDT 2006

> > I'm not so sure.  DSM has been done for decades now, though I haven't
> > heard of OpenMP (ish) implementations based on it.  fundamentally,
> 
> SGI's Altix and O3k (and O2k) were hardware versions of DSM upon which 
> OpenMP was based.

I understand why you call SGI machines DSM, but I usually expect
the 'distributed' to mean independent, not hardware-coherent, memory systems.
the distinction matters mainly because the choice of programming model
is a hard decision with long-lasting consequences.

> It is possible to do it and to do it well, though if 
> you want it fast it is going to cost you money (low latency).  This is 
> where coupling this with something like Infinipath and similar 
> technologies *might* be interesting.

SGI is still substantially faster than Infinipath - at least SGI 
talks about sub-1-us latency, and bandwidth well past 2 GB/s...

> The O2k/O3k did this pretty well, as does the Altix.  That Stanford Dash 
> architecture is pretty good.

directory-based coherence is fundamental (and not unique to DASH follow-ons,
and hopefully destined to wind up in commodity parts.)  but I think people
have a rosy memory of a lot of older SMP machines - their fabric seemed
better mainly because the CPUs were so slow.  modern fabrics are much 
improved, but CPUs are much^2 better.  (I'd guess that memory speed improves
on a curve similar to fabrics, but memory _capacity_ is areal like CPUs
and disk capacity.)

> > quite modest speedups.  for instance, in the Intel whitepaper, they're 
> > showing speedups of 3-8 for a 16x machine.  that's really not very good.
> 
> Hmmm.  A fair number of high quality commercial codes struggle to get 8x 
> on 16 CPUs on SMPs.  3-8x isn't bad; or rather, 8x isn't bad.  3x isn't great.

a speedup of 3 on 16 CPUs is not worth running - or rather, is a sign that the
program is broken and needs to be redone.  sure, it may be common in industry to 
damn the torpedoes because time-to-market costs more.  but 3 out of 16 is pretty
pathetic, and certainly not appropriate for, e.g., shared academic clusters.
some other researcher is waiting in line for those 13 "wasted" CPUs...
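
to put rough numbers on that, here is a minimal sketch (mine, not from the
Intel whitepaper) that turns a measured speedup into parallel efficiency,
"idle-equivalent" CPUs, and the serial fraction implied by the Karp-Flatt
metric.  the function names are made up for illustration, and the only
inputs are the 3x and 8x on 16 CPUs quoted above.

```python
def parallel_efficiency(speedup: float, ncpus: int) -> float:
    """Fraction of the allocated CPUs doing useful work: S / p."""
    return speedup / ncpus


def idle_equivalent_cpus(speedup: float, ncpus: int) -> float:
    """CPUs' worth of capacity that is allocated but not buying speedup."""
    return ncpus - speedup


def karp_flatt_serial_fraction(speedup: float, ncpus: int) -> float:
    """Experimentally determined serial fraction:
    e = (1/S - 1/p) / (1 - 1/p).
    It lumps true serial code together with parallel overhead."""
    return (1.0 / speedup - 1.0 / ncpus) / (1.0 - 1.0 / ncpus)


if __name__ == "__main__":
    ncpus = 16
    for speedup in (3.0, 8.0):  # the range quoted from the whitepaper
        print(
            f"speedup {speedup:.0f} on {ncpus} CPUs: "
            f"efficiency {parallel_efficiency(speedup, ncpus):.1%}, "
            f"idle-equivalent CPUs {idle_equivalent_cpus(speedup, ncpus):.0f}, "
            f"implied serial fraction "
            f"{karp_flatt_serial_fraction(speedup, ncpus):.1%}"
        )
```

under this model a 3x run on 16 CPUs behaves as if more than a quarter of
the work were serial (or lost to overhead), which is the sense in which I'd
call the program broken rather than merely slow.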

> > towards smarter system fabrics by AMD and Intel), affordable SMP is 
> > likely to grow to ~32x within a year or two.
> 
> Yup, though this might stretch the meaning of the word "affordable".

why do you say that?  8-16-core machines are affordable today,
the latter perhaps depending on your application.  AMD and Intel 
both promise to ship 4-core chips next year, so 32 seems pretty expectable.
the main question is whether the SMP fabric will be improved.

> I admit freely that I like the semantics of programming with OpenMP.  It 
> is easy to do, and it isn't all that hard to figure out how to get 
> performance out of it, and where your performance has gone.   DSM layers 

really?  my limited experience with it is that trivial use is easy,
but after that, the curve gets steep.  so it's great for someone who's 
just starting out, and has some huge set of independent (often data-parallel)
SIMD-like operations to perform.  add in some private variables, locking,
reductions, try to control scheduling, and suddenly it's real work.  once you
get to the point of explicitly managed work queues (adaptive mesh refinement,
for instance), you might as well be using MPI.

> do complicate it, but with good metrics (page faults, numa transactions, 
> etc), it shouldn't be hard to build a reasonable DSM atop a fast message 
> passing architecture in software.

fundamentally, these are at odds: the shared-memory programming model
leads the programmer to think of all memory as uniform.  that's simply
untrue on anything but very small multiprocessors today, and is extremely 
far from the truth for anything large.  and fabrics are improving on a
lower-order curve than CPUs, so this "non-idealism" can only increase.
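
a toy model of why the uniform-memory illusion hurts, as a sketch only: mean
access time as a weighted average of local and remote latency.  the 100 ns
local figure and the 10% remote fraction are pure assumptions for
illustration; the 1000 ns remote figure is just the "sub-1-us" fabric latency
mentioned above taken as an upper bound, not a measurement.

```python
def mean_access_ns(remote_fraction: float, local_ns: float, remote_ns: float) -> float:
    """Average memory access time when a fraction of accesses go off-node."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


def slowdown_vs_uniform(remote_fraction: float, local_ns: float, remote_ns: float) -> float:
    """How much slower memory looks than the 'all memory is local' ideal."""
    return mean_access_ns(remote_fraction, local_ns, remote_ns) / local_ns


if __name__ == "__main__":
    LOCAL_NS = 100.0    # hypothetical local DRAM latency (assumption)
    REMOTE_NS = 1000.0  # hypothetical remote latency, ~1 us over the fabric
    for remote_fraction in (0.0, 0.01, 0.1, 0.5):  # assumed access mixes
        factor = slowdown_vs_uniform(remote_fraction, LOCAL_NS, REMOTE_NS)
        print(f"{remote_fraction:.0%} remote accesses -> {factor:.2f}x the local latency")
```

the point is only the shape: as the remote/local ratio grows (fabrics
improving more slowly than CPUs), even a small remote fraction dominates, and
the programming model gives you no hint of which accesses those are.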