[Beowulf] Cluster OpenMP

Bill Broadley bill at cse.ucdavis.edu
Wed May 17 02:14:15 PDT 2006


> SGI is still substantially faster than Infinipath - at least SGI 
> talks about sub-1-us latency, and bandwidth well past 2 GB/s...

I didn't look extensively, but:
http://www.sgi.com/products/servers/altix/numalink.html

Claims 1 us (not sure if that is 1 or 2 significant digits), 3.2 GB/sec
per link.

A similar page: http://www.pathscale.com/infinipath-perf.html
Pathscale claims 1.29us latency, 954MB/sec per link.

Of course it's much more complicated than that.  I was somewhat
surprised at how much of a special case the Altix MPI latency
is.  I found:
http://www.csm.ornl.gov/~dunigan/sgi/altixlat.png

Does anyone have similar numbers for a current Altix (current OS,
drivers, and NUMAlink 4)?

So that's 1us latency to one other CPU ;-).  With InfiniPath I often see
sub-1.0us latencies [1].  Until I saw that graph, I thought shared memory +
NUMAlink enabled 1.0us MPI latency.  In reality NUMAlink isn't involved
and that number is only to the local CPU.  Corrections welcome.
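
For reference, numbers like these usually come from a simple ping-pong
loop: bounce a tiny message back and forth many times and report half
the average round trip as the one-way latency.  A minimal sketch in
C + MPI (my own illustration, not the home-grown benchmark in [1]):

/* pingpong.c -- minimal MPI ping-pong latency sketch.  Run with exactly
   two ranks, e.g.: mpicc pingpong.c -o pingpong; mpirun -np 2 ./pingpong */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, peer, i;
    const int iters = 100000;
    char buf = 0;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;                      /* rank 0 pairs with rank 1 */

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(&buf, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(&buf, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&buf, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency = half the average round trip */
        printf("latency: %.3f us\n", (t1 - t0) / iters / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}

Whether the two ranks land on the same socket, the same node, or across
the fabric is exactly what makes the local vs. remote distinction above
show up in the measured number.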

Unfortunately for those hoping for 1.0us latencies on an Altix, real-world
communications often involve talking to non-local processors.

Look at the random ring latency benchmark: Pathscale seems to show
about half the latency of a similar-sized Altix [2].  For non-trivial
communication patterns (i.e., not neatly matched nearest-neighbor pairs)
it looks like Pathscale might have an advantage.
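
For anyone who hasn't run it: the random ring test arranges the ranks in
a randomly permuted ring and times small messages between ring neighbors,
so most pairs cross the fabric (or the NUMA topology) instead of talking
to a conveniently placed partner.  A rough sketch of the idea in C + MPI
(simplified from the concept behind the HPCC RandomRing test, not the
actual HPCC code):

/* randring.c -- sketch of a random-ring latency measurement */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i, j, t, me = 0, left, right;
    const int iters = 10000;
    char sbuf = 0, rbuf = 0;
    int *perm;
    double t0, t1, local, worst;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* rank 0 builds a random permutation of the ranks and broadcasts it,
       so every process knows its neighbors in the randomized ring */
    perm = malloc(size * sizeof(int));
    if (rank == 0) {
        srand(12345);
        for (i = 0; i < size; i++) perm[i] = i;
        for (i = size - 1; i > 0; i--) {
            j = rand() % (i + 1);
            t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
    }
    MPI_Bcast(perm, size, MPI_INT, 0, MPI_COMM_WORLD);

    for (i = 0; i < size; i++)
        if (perm[i] == rank) me = i;
    right = perm[(me + 1) % size];
    left  = perm[(me + size - 1) % size];

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++)
        /* every rank sends one byte to its right neighbor and receives
           from its left; Sendrecv keeps the pattern deadlock-free */
        MPI_Sendrecv(&sbuf, 1, MPI_CHAR, right, 0,
                     &rbuf, 1, MPI_CHAR, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    t1 = MPI_Wtime();

    local = (t1 - t0) / iters * 1e6;      /* us per step on this rank */
    MPI_Reduce(&local, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("random ring latency (max over %d ranks): %.2f us\n",
               size, worst);

    free(perm);
    MPI_Finalize();
    return 0;
}

Because every rank is sending and receiving at once, the number reflects
small-message latency under contention across the whole machine rather
than one carefully matched pair.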

Granted, shared memory is a big differentiator, but then again so is
price/performance.  Seems like reasonable Opteron + InfiniPath clusters
are in the neighborhood of $1,200 per core these days.  I've not seen a
large Altix quote recently, but I was under the impression it was more
like 10 times that (when including storage and a 3-year warranty).

So (no surprise) Opteron + InfiniPath clusters and Altixes have different
markets and applications that justify their purchases.

The best news for the consumer on the CPU side is that AMD has managed
to light a fire under Intel.  Nothing like lower latency, twice the
bandwidth, and less power to scare the hell out of an engineering
department.  Rumors claim reasonable Opteron competition will be out
this summer.

On the interconnect side there seems to be much more attention paid to
latency, message rates, and random ring latency these days.  I'm happy
to say that from what I can tell this is actually improving real world
scaling on real world applications.

So don't mind me while I cheer the leaders and paint a target on them.
For the underdogs that got caught with their pants down: you now know
where you need to be.  I'm happy to say the underdogs seem to be
paying attention and the spirit of competition seems to be very healthy.

As a result it's a much more aggressive battle for the HPC market and
the HPC consumer wins.

[1] As long as I'm on the same node.  Home-grown benchmark:
node001 node001 node001 node001
size=    1, 131072 hops, 4 nodes in  0.128 sec ( 0.973 us/hop)   4013 KB/sec

[2] at least from the data points at 
    http://icl.cs.utk.edu/hpcc/hpcc_results.cgi

> directory-based coherence is fundamental (and not unique to dash followons,
> and hopefully destined to wind up in commodity parts.)  but I think people
> have a rosy memory of a lot of older SMP machines - their fabric seemed
> better mainly because the CPUs were so slow.  modern fabrics are much 
> improved, but CPUs are much^2 better.  (I'd guess that memory speed improves

Seems like just the opposite.  Has CPU performance really changed that
much since the 2.8 GHz Northwood or the 1.4 GHz Opteron?  Intel's IPC
has been dropping ever since.  Interconnect-wise, since then we have
gotten what, a factor of 4 in latency (6-8us for older Myrinet +
older GM on the slower-clocked cards vs. InfiniPath) and 2.5 Gb ->
20 Gb in bandwidth (Myrinet vs. IB DDR).

If it's really the fabric that is holding you back, you could have two;
getting two 16x PCIe slots in a node isn't hard today.  I'm not saying it
would be easy to justify on price/performance, but it plausibly gives
you more performance.

To me it looks like the interconnect vendors are anxiously awaiting
the next doubling in CPU performance, memory bandwidth, and cores per
socket to help justify their existence to a larger part of the market.
Seems like it's the CPU that has been slow to improve these last
few years.

-- 
Bill Broadley
Computational Science and Engineering
UC Davis


