[Beowulf] Cluster OpenMP

Tue May 16 18:53:59 PDT 2006

Mark Hahn wrote:
>> http://softwareforums.intel.com/ids/board/message?board.id=11&message.id=3793
>>
>> Maybe it will be interesting for many people here.
> 
> I'm not so sure.  DSM has been done for decades now, though I haven't
> heard of OpenMP (ish) implementations based on it.  fundamentally,

SGI's Altix and O3k (and O2k) were hardware implementations of DSM, and 
OpenMP ran on top of them.  It is possible to do it and to do it well, 
though if you want it fast you need low latency, and that is going to 
cost you money.  This is where coupling this with something like 
InfiniPath and similar technologies *might* be interesting.

> there's no real conceptual challenge to implementing SHM across nets,
> just practical matters.  regardless of whether you present a SHM
> interface to the programmer, you eventually have to adopt the same 
> basic programming models that reflect the topology of your interconnect.
> for instance, scalable OpenMP codes seem to migrate towards looking
> a bit like message-passing codes.  after all, if you care about latency,
> you want to batch together relevant data.  and scalability really does 
> mean caring about latency (and often bandwidth).

True to an extent.  Scalability usually means doing as little 
interprocess communication as possible in critical sections, and doing 
everything you can to hide latency.   In order to do this, you need to 
take topology into account, as well as make sure the underlying 
algorithm is well mapped onto the topology.

> doing DSM based on pages is convenient, since it means you can put 
> your smarts into a library with a fairly straightforward kernel/net 
> interface.  innumerable masters-thesis projects have done this, as well
> as Mosix and others.  the downside is that to fetch a single byte,
> you take a page fault, and do some kind of RPC.  

... and this can be hideously expensive, unless you can figure out a 
way to move single cache lines around (and even then you pay a huge 
non-amortizable latency penalty on every transfer ... )
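
To put rough numbers on that trade-off, here is a minimal sketch of a 
cost model for a remote fetch: a fixed latency plus the time on the 
wire.  The latency, bandwidth, page size and cache line size below are 
illustrative assumptions, not measurements of any real interconnect, 
and the function names are made up for this example.

```python
# Rough cost model for one remote fetch in a software DSM.
# Every number here is an illustrative assumption, not a measurement.

def fetch_time(nbytes, latency_s, bandwidth_bps):
    """Time to move nbytes in one transfer: fixed latency plus wire time."""
    return latency_s + nbytes / bandwidth_bps


def cost_per_useful_byte(transfer_bytes, useful_bytes, latency_s, bandwidth_bps):
    """Seconds paid per byte the application actually wanted."""
    return fetch_time(transfer_bytes, latency_s, bandwidth_bps) / useful_bytes


LATENCY = 5e-6    # assumed 5 microsecond interconnect latency
BANDWIDTH = 1e9   # assumed 1 GB/s link
PAGE = 4096       # assumed page size in bytes
LINE = 64         # assumed cache line size in bytes

# Worst case for page-based DSM: one byte wanted, a whole page moved.
page_for_one_byte = cost_per_useful_byte(PAGE, 1, LATENCY, BANDWIDTH)

# A single cache line is cheaper on the wire, but the latency is still paid in full.
line_for_one_byte = cost_per_useful_byte(LINE, 1, LATENCY, BANDWIDTH)

# Best case: every byte of the page gets used, so the latency is amortized.
page_fully_used = cost_per_useful_byte(PAGE, PAGE, LATENCY, BANDWIDTH)

# Fraction of a cache-line fetch that is pure latency.
line_latency_share = LATENCY / fetch_time(LINE, LATENCY, BANDWIDTH)
```

Under these assumptions nearly all of a cache-line fetch is latency, 
which is the non-amortizable part; the page fetch only looks good when 
most of the page is actually used.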

> but if your shared 
> data is read-mostly, or naturally very granular, you're golden.

Shared data is still dangerous.  You want private copies whenever 
possible (not always, especially in huge global data structures).
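
As a minimal sketch of what "private copies" buys you, compare a sum 
where every update goes through one shared, locked accumulator with a 
sum where each worker keeps a private partial result and the results 
are merged once at the end.  This is similar in spirit to an OpenMP 
reduction, but it is plain Python threads, not OpenMP, and the function 
names are made up for this example.  The count of shared updates stands 
in for the traffic a DSM layer would have to carry.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def shared_sum(data, workers):
    """Every element update goes through one shared, locked accumulator."""
    total = 0
    shared_updates = 0
    lock = threading.Lock()

    def work(chunk):
        nonlocal total, shared_updates
        for x in chunk:
            with lock:              # one synchronized shared write per element
                total += x
                shared_updates += 1

    chunks = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, chunks))
    return total, shared_updates


def private_sum(data, workers):
    """Each worker sums into a private variable; results are merged once."""

    def work(chunk):
        partial = 0                 # private copy, no sharing inside the loop
        for x in chunk:
            partial += x
        return partial

    chunks = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(work, chunks))
    # One shared merge per worker instead of one per element.
    return sum(partials), len(partials)
```

Both versions return the same total; the shared version performs one 
synchronized update per element, the private version one merge per 
worker.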

> hooking into the language is a popular way to break up the chunks - 
> the language can simply emit get/put at language-appropriate places.
> for those apps hurt by page-based sharing, this is certainly better.
> but writing a compiler, even a preprocessor, is a pretty big deal.
> there have been multiple implementations of this approach, but somehow
> they never gain much traction.

This stuff needs to be handled down at the OS level (really below it) to 
be transparent.  But that really amplifies the NUMA effects (and the 
surprises in your performance) if your latencies are high.

> doing an implementation that is really smart about handling sequential
> vs random patterns (prefetching, etc), doing all the right locking,
> load-balancing, accepting programmer hints/assertions, etc, that's 
> a pretty big undertaking, and I don't know of a system which has done 
> it well.  also, there are sort of intermediate interfaces, such as 
> Global Arrays or Charm++.

The O2k/O3k did this pretty well, as does the Altix.  The Stanford DASH 
architecture is pretty good too.

Global Arrays and Charm++ are not a panacea (nor is MPI, or ...)

> and fundamentally, you have to notice that SHM approaches tend to yield
> quite modest speedups.  for instance, in the Intel whitepaper, they're 
> showing speedups of 3-8 for a 16x machine.  that's really not very good.

Hmmm.  A fair number of high-quality commercial codes struggle to get 8x 
on 16 CPUs on SMPs.  So 3-8x isn't bad; or rather, 8x isn't bad, and 3x 
isn't great.
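
To make the 3x versus 8x comparison concrete, here is a small sketch 
that turns a measured speedup into parallel efficiency, plus the 
serial fraction implied by Amdahl's law (the Karp-Flatt metric).  The 
Karp-Flatt metric is my addition here, not something the whitepaper 
reports, and the function names are made up for this example.

```python
def efficiency(speedup, cpus):
    """Parallel efficiency: achieved speedup divided by CPU count."""
    return speedup / cpus


def karp_flatt(speedup, cpus):
    """Experimentally determined serial fraction (Karp-Flatt metric).

    Solves Amdahl's law, speedup = 1 / (e + (1 - e) / cpus), for e.
    Requires cpus > 1 and speedup > 0.
    """
    return (1.0 / speedup - 1.0 / cpus) / (1.0 - 1.0 / cpus)


# The two ends of the range quoted above, on a 16-CPU machine.
for s in (3, 8):
    print(f"{s}x on 16 CPUs: efficiency {efficiency(s, 16):.1%}, "
          f"implied serial fraction {karp_flatt(s, 16):.1%}")
```

On 16 CPUs, 8x is 50% efficiency with an implied serial fraction of 
about 6.7%, while 3x is under 19% efficiency with an implied serial 
fraction of about 29%.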

> if you insist on a SHM interface, IMO you're best off sticking to fairly
> small SMP machines as the mass-market inches up.  right now, that's 
> probably 4x2 opterons, but with quad-cores coming (and some movement 
> towards smarter system fabrics by AMD and Intel), affordable SMP is 
> likely to grow to ~32x within a year or two.

Yup, though this might stretch the meaning of the word "affordable".

> 
> IMO, anyone who needs "real" scaling (>64x real speedup, say) has 
> already bitten the MPI bullet.  but I'm willing to be told I'm a 
> message-passing bigot ;)

:)

I admit freely that I like the semantics of programming with OpenMP.  It 
is easy to do, and it isn't all that hard to figure out how to get 
performance out of it, and where your performance has gone.   DSM layers 
do complicate it, but with good metrics (page faults, NUMA transactions, 
etc), it shouldn't be hard to build a reasonable DSM atop a fast message 
passing architecture in software.

Just be prepared to deal with a rather unforgiving NUMA.  Opterons have 
spoiled me with fast local memory that isn't too expensive to access 
node to node on the motherboard.  Now go compute-node to compute-node.  
Watch that massive (2-3 orders of magnitude) change in latency.  That 
wasn't very nice now, was it?
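
A back-of-the-envelope sketch of why that hurts: average access time 
as a weighted mix of local and remote accesses.  The 100 ns local 
latency and the 1% remote fraction are assumptions picked for 
illustration; only the 2-3 orders of magnitude ratio comes from the 
text above.  The function name is made up for this example.

```python
def average_latency_ns(local_ns, remote_ns, remote_fraction):
    """Mean access time when remote_fraction of accesses leave the node."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


LOCAL_NS = 100.0        # assumed local memory latency
REMOTE_FRACTION = 0.01  # assumed: 1 access in 100 goes off-node

# Remote latency 2 and 3 orders of magnitude above local.
for ratio in (100, 1000):
    avg = average_latency_ns(LOCAL_NS, LOCAL_NS * ratio, REMOTE_FRACTION)
    print(f"remote = {ratio}x local: average {avg:.0f} ns, "
          f"{avg / LOCAL_NS:.2f}x slower than all-local")
```

With these assumed numbers, sending just 1% of accesses off-node 
roughly doubles the average access time at a 100x ratio and makes it 
about 11 times worse at 1000x.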

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615