[Beowulf] Cluster OpenMP

Wed May 17 03:11:54 PDT 2006

Mark Hahn wrote:
>>> I'm not so sure.  DSM has been done for decades now, though I haven't
>>> heard of OpenMP (ish) implementations based on it.  fundamentally,
>> SGI's Altix and O3k (and O2k) were hardware versions of DSM upon which 
>> OpenMP was based.
> 
> I understand why you call SGI machines DSM, but I usually expect
> the 'distributed' to mean independent, not hardware-coherent memory systems.
> the distinction is significant mainly because choice of programming model
> is difficult and long-lasting.

Wasn't just me; SGI called it hardware DSM.  That is what it was, as it
was based upon the Stanford Dash project.  We called it that internally, to
customers, and widely in public.

>> It is possible to do it and to do it well, though if 
>> you want it fast it is going to cost you money (low latency).  This is 
>> where coupling this with something like Infinipath and similar 
>> technologies *might* be interesting.
> 
> SGI is still substantially faster than Infinipath - at least SGI 
> talks about sub-1-us latency, and bandwidth well past 2 GB/s...

Hmmm... I always look at real-world code results rather than
synthetic benchmarks, as the latter rarely correlate well with real-world
codes and situations (HPL et al. are examples of this not correlating
well).

> 
>> The O2k/O3k did this pretty well, as does the Altix.  That Stanford Dash 
>> architecture is pretty good.
> 
> directory-based coherence is fundamental (and not unique to dash followons,
> and hopefully destined to wind up in commodity parts.)  but I think people
> have a rosy memory of a lot of older SMP machines - their fabric seemed

???

SGI Altix is the same (10+ year old design) as the O2k/O3k.  The O3k was 
a speed bump to the O2k.  Same with the Altix, with a glue logic change 
to incorporate the Itanium2.

Nothing rosy about this, it was a good design.

> better mainly because the CPUs were so slow.  modern fabrics are much 
> improved, but CPUs are much^2 better.  (I'd guess that memory speed improves
> on a curve similar to fabrics, but memory _capacity_ is areal like CPUs
> and disk capacity.)

They are actually on different curves.  John Mashey gave us a 
presentation where he talked about this.

> 
>>> quite modest speedups.  for instance, in the Intel whitepaper, they're 
>>> showing speedups of 3-8 for a 16x machine.  that's really not very good.
>> Hmmm.  A fair number of high quality commercial codes struggle to get 8x 
>> on 16 CPUs on SMPs.  3-8x isn't bad, or rather, 8x isn't bad.  3x isn't great.
> 
> 3/16 speedup is not worth running - or rather, is a sign that the program
> is broken and needs to be re-done.  sure, it may be common in industry to 
> damn the torpedoes because time-to-market costs more.  but 3/16 is pretty
> pathetic, and certainly not appropriate for, eg, shared academic clusters.
> some other researcher is waiting in line for those 13 "wasted" cpus...

Not arguing that inefficient program designs need to be improved.  I 
agree with this.

> 
>>> towards smarter system fabrics by AMD and Intel), affordable SMP is 
>>> likely to grow to ~32x within a year or two.
>> Yup, though this might stretch the meaning of the word "affordable".
> 
> why do you say that?  8-16-core machines are affordable today,

Define "affordable".  You will find that your definition differs from 
others.  Sometimes substantially.  When I was at SGI, I had in my head 
that $250k for a 16 way machine was affordable.  My thoughts have 
changed somewhat ...

> the latter perhaps depending on your application.  AMD and Intel 
> both promise to ship 4-core chips next year, so 32 seems pretty expectable.
> the main question is whether the SMP fabric will be improved.

If you meant "possible" rather than "affordable", yes, I agree with you.
There are lots of issues to resolve ahead of time.  Design of motherboards
is one of them, as the Opteron motherboards seem not to be using a nice cube
topology, but rather something along the lines of a grid with crosslinks.
That works out great for using up all those nice HT links, but not so nicely
when you try to figure out the cost of a non-local memory access (unless I
misunderstood the design doc in the manuals/spec sheets).
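
To make the topology point concrete, here is a minimal sketch of hop counts,
assuming an idealized hypercube and a plain 2D grid with no crosslinks.
Neither is a model of any actual Opteron board; the node numbering, the
function names, and the 2x4 layout used below are illustrative assumptions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hops between nodes a and b in a hypercube: one hop per differing
 * address bit (the Hamming distance of the node numbers). */
int hypercube_hops(unsigned a, unsigned b)
{
    unsigned diff = a ^ b;
    int hops = 0;
    while (diff != 0) {
        hops += (int)(diff & 1u);
        diff >>= 1;
    }
    return hops;
}

/* Hops between nodes a and b in a plain 2D grid (no crosslinks), with
 * nodes numbered row by row and `width` nodes per row. */
int grid_hops(int a, int b, int width)
{
    assert(width > 0 && a >= 0 && b >= 0);
    return abs(a % width - b % width) + abs(a / width - b / width);
}
```

With 8 nodes, the 3-cube never needs more than 3 hops (hypercube_hops(0, 7)),
while the far corners of a 2x4 grid are 4 hops apart (grid_hops(0, 7, 4)).
Crosslinks shorten some of those paths, which is what makes the cost of a
given remote access harder to work out by inspection.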

>> I admit freely that I like the semantics of programming with OpenMP.  It 
>> is easy to do, and it isn't all that hard to figure out how to get 
>> performance out of it, and where your performance has gone.   DSM layers 
> 
> really?  my limited experience with it is that trivial use is easy,
> but after that, the curve gets steep.  so it's great for someone who's 
> just starting out, and has some huge set of independent (often data-parallel)
> SIMD-like operations to perform.  add in some privacy, locking, reductions,
> try to control scheduling, and suddenly it's real work.  once you get to 
> the point of explicitly managed work queues (adaptive mesh refinement,
> for instance), you might as well be using MPI.

I was able to get 28x on 32 CPUs fairly easily for a number of chemistry
codes a while ago (7 years) using OpenMP and a little code rewriting.
Not much.  In other cases we were able to get about 240x on 256 CPUs
for other problems, without serious rewriting.
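
For comparison, the speedups quoted in this thread can be put on one scale as
parallel efficiency (speedup divided by CPU count).  This is a minimal
sketch; the function name is illustrative and not taken from any library.

```c
#include <assert.h>

/* Parallel efficiency: measured speedup divided by the number of CPUs
 * used.  1.0 means perfect scaling. */
double parallel_efficiency(double speedup, int n_cpus)
{
    assert(n_cpus > 0);
    return speedup / (double)n_cpus;
}
```

By this measure 3x on 16 CPUs is 18.75% and 8x on 16 is 50%, while 28x on 32
is 87.5% and 240x on 256 is 93.75%.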

The issue for OpenMP was memory placement on the NUMA machine.  OpenMP is
not hard to use, or to make scale, for a number of problems.  For some
problems it is just not possible due to the way they work, in which case you
can write them as independent threads if you like (MPI-like).
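
One common way to handle memory placement is sketched below.  It assumes the
operating system uses a first-touch page placement policy (an assumption; the
thread does not say which policy was in use), so that a page ends up near the
CPU that first writes it.  The function names are illustrative.  The idea is
to initialize the data in parallel with the same static schedule the compute
loop uses, so each thread later works mostly on pages local to it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Allocate n doubles and fill them in parallel.  Under a first-touch
 * policy, each page is placed near the thread that writes it here. */
double *alloc_first_touch(size_t n, double value)
{
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = value;
    return a;
}

/* Sum the array with the same static schedule, so each thread reads
 * the same index range it initialized above. */
double sum_static(const double *a, size_t n)
{
    double total = 0.0;
    assert(a != NULL || n == 0);
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (long i = 0; i < (long)n; i++)
        total += a[i];
    return total;
}
```

Built without OpenMP support the pragmas are ignored and the code runs
serially with the same result; with GCC, -fopenmp enables them.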

> 
>> do complicate it, but with good metrics (page faults, numa transactions, 
>> etc), it shouldn't be hard to build a reasonable DSM atop a fast message 
>> passing architecture in software.
> 
> fundamentally, these are at odds: the shared-memory programming model
> leads the programmer to think of all memory as uniform.  that's simply
> untrue on anything but very small multiprocessors today, and is extremely 
> far from the truth for anything large.  and fabrics are on a lower-order 
> curve than CPUs, so this "non-idealism" can only increase.

Hmmm... I spent lots of time working on the O2k/O3k, getting codes to
scale.  Something I noted back then was that MPI aficionados usually
made similar arguments.  They always claimed that you simply *could not*
get the performance that we and our customers were seeing.

My experience with this (from OpenMP days (1994 onwards) and MPI days
(1997 onwards)) is that OpenMP is substantially easier to program.
Due to its nature, MPI is more likely to accelerate on a wider
variety of hardware, as it makes fewer assumptions about the system.
You will get better performance with MPI, as it enforces data locality of
reference.  You have no choice there.  You also get lots of other
baggage that you really have to worry about in addition to your algorithm.

Parallel programming has costs and tradeoffs.  As does anything good. 
You have to decide which ones make the most sense for you.


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615