[Beowulf] Cluster OpenMP

Tue May 16 08:17:08 PDT 2006

> http://softwareforums.intel.com/ids/board/message?board.id=11&message.id=3793
> 
> Maybe it will be interesting for many people here.

I'm not so sure.  DSM (distributed shared memory) has been done for
decades now, though I haven't heard of OpenMP-ish implementations based
on it.  fundamentally, there's no real conceptual challenge to
implementing SHM across nets, just practical matters.  regardless of
whether you present a SHM interface to the programmer, you eventually
have to adopt the same basic programming models that reflect the
topology of your interconnect.
for instance, scalable OpenMP codes seem to migrate towards looking
a bit like message-passing codes.  after all, if you care about latency,
you want to batch together relevant data.  and scalability really does 
mean caring about latency (and often bandwidth).
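
a rough sketch of the batching argument, as a toy cost model where
time = messages * latency + bytes / bandwidth.  the latency and
bandwidth figures are illustrative assumptions, not measurements of
any real interconnect, and transfer_time is a made-up helper:

```python
# toy cost model: time = messages * latency + bytes / bandwidth.
# LATENCY_S and BANDWIDTH_BPS are illustrative assumptions, not measurements.
LATENCY_S = 50e-6        # assumed per-message latency (50 microseconds)
BANDWIDTH_BPS = 100e6    # assumed link bandwidth (100 MB/s)

def transfer_time(n_messages, total_bytes,
                  latency_s=LATENCY_S, bandwidth_bps=BANDWIDTH_BPS):
    """Estimated seconds to move total_bytes split across n_messages."""
    return n_messages * latency_s + total_bytes / bandwidth_bps

n_values = 1000
payload = n_values * 8   # 1000 doubles, 8 bytes each

fine_grained = transfer_time(n_values, payload)  # one message per value
batched = transfer_time(1, payload)              # one message for everything

print(f"fine-grained: {fine_grained * 1e3:.2f} ms")
print(f"batched:      {batched * 1e3:.2f} ms")
```

under these assumed numbers the fine-grained version is almost entirely
latency, which is why scalable shared-memory codes tend to end up
grouping their remote accesses much like a message-passing code would.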

doing DSM based on pages is convenient, since it means you can put 
your smarts into a library with a fairly straightforward kernel/net 
interface.  innumerable master's-thesis projects have done this, as well
as Mosix and others.  the downside is that to fetch a single byte,
you take a page fault, and do some kind of RPC.  but if your shared 
data is read-mostly, or naturally very granular, you're golden.
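
a minimal sketch of why access pattern matters so much for page-based
DSM.  it only counts whole-page fetches for a read-only trace, assuming
a 4096-byte page, every page initially remote, and no eviction or
invalidation (the read-mostly best case).  count_page_fetches is a
hypothetical helper, not part of any real DSM library:

```python
PAGE_SIZE = 4096  # assumed page size in bytes

def count_page_fetches(addresses, page_size=PAGE_SIZE):
    """Count remote page fetches for a read-only access trace.

    Assumes every page starts remote, is fetched whole on first touch,
    and is never evicted or invalidated (the read-mostly best case).
    """
    resident = set()
    fetches = 0
    for addr in addresses:
        page = addr // page_size
        if page not in resident:
            resident.add(page)
            fetches += 1
    return fetches

n_reads = 4096
sequential = range(n_reads)                          # 4096 adjacent bytes
strided = range(0, n_reads * PAGE_SIZE, PAGE_SIZE)   # one byte per page

seq_fetches = count_page_fetches(sequential)
strided_fetches = count_page_fetches(strided)
print(seq_fetches, seq_fetches * PAGE_SIZE)          # fetches, bytes moved
print(strided_fetches, strided_fetches * PAGE_SIZE)
```

both traces read the same number of bytes, but the scattered one takes
a fault and moves a full page for every single byte it actually wants.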

hooking into the language is a popular way to break up the chunks - 
the language can simply emit get/put at language-appropriate places.
for those apps hurt by page-based sharing, this is certainly better.
but writing a compiler, even a preprocessor, is a pretty big deal.
there have been multiple implementations of this approach, but somehow
they never gain much traction.

doing an implementation that is really smart about handling sequential
vs random patterns (prefetching, etc), doing all the right locking,
load-balancing, accepting programmer hints/assertions, etc, that's 
a pretty big undertaking, and I don't know of a system which has done
it well.  also, there are sort of intermediate interfaces, such as
Global Arrays or Charm++.

and fundamentally, you have to notice that SHM approaches tend to yield
quite modest speedups.  for instance, in the Intel whitepaper, they're 
showing speedups of 3-8 for a 16x machine.  that's really not very good.
if you insist on a SHM interface, IMO you're best off sticking to fairly
small SMP machines as the mass-market inches up.  right now, that's 
probably 4x2 opterons, but with quad-cores coming (and some movement 
towards smarter system fabrics by AMD and Intel), affordable SMP is 
likely to grow to ~32x within a year or two.
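
to put the whitepaper numbers in perspective, here is the arithmetic
as parallel efficiency (speedup divided by processor count), reading
"16x machine" as 16 processors, which is an assumption about the
whitepaper's setup:

```python
def parallel_efficiency(speedup, n_procs):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / n_procs

# speedups of 3 and 8 are the range quoted above; 16 processors is assumed
for speedup in (3, 8):
    eff = parallel_efficiency(speedup, 16)
    print(f"speedup {speedup} on 16 processors: {eff:.0%} efficiency")
```

so the quoted range works out to roughly a fifth to a half of the
machine doing useful work.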

IMO, anyone who needs "real" scaling (>64x real speedup, say) has 
already bitten the MPI bullet.  but I'm willing to be told I'm a 
message-passing bigot ;)

regards, mark hahn.