[Beowulf] 512 nodes Myrinet cluster Challenges

Wed May 3 01:54:52 PDT 2006

Vincent,

Vincent Diepeveen wrote:
> Just measure the random ring latency of that 1024 nodes Myri system and 
> compare.
> There is several tables around with the random ring latency.
> 
> http://icl.cs.utk.edu/hpcc/hpcc_results.cgi

I just ran it on an 8-node dual Opteron 2.2 GHz cluster with F cards
(Myrinet-2G) running MX-1.1.1 on Linux 2.6.15 (one process per node):

------------------------------------------------------------------
Latency-Bandwidth-Benchmark R1.5.1 (c) HLRS, University of Stuttgart
Written by Rolf Rabenseifner, Gerrit Schulz, and Michael Speck, Germany

Major Benchmark results:
------------------------

Max Ping Pong Latency:                 0.002876 msecs
Randomly Ordered Ring Latency:         0.003223 msecs
Min Ping Pong Bandwidth:             246.988915 MB/s
Naturally Ordered Ring Bandwidth:    233.235547 MB/s
Randomly  Ordered Ring Bandwidth:    237.126381 MB/s

------------------------------------------------------------------

Detailed benchmark results:
Ping Pong:
Latency   min / avg / max:   0.002742 /   0.002800 /   0.002876 msecs
Bandwidth min / avg / max:    246.989 /    247.016 /    247.036 MByte/s
Ring:
On naturally ordered ring: latency=      0.003219 msec, bandwidth= 233.235547 MB/s
On randomly  ordered ring: latency=      0.003223 msec, bandwidth= 237.126381 MB/s

------------------------------------------------------------------

Benchmark conditions:
  The latency   measurements were done with        8 bytes
  The bandwidth measurements were done with  2000000 bytes
  The ring communication was done in both directions on 8 processes
  The Ping Pong measurements were done on
   -          56 pairs of processes for latency benchmarking, and
   -          56 pairs of processes for bandwidth benchmarking,
  out of 8*(8-1) =         56 possible combinations on 8 processes.
  (1 MB/s = 10**6 byte/sec)

------------------------------------------------------------------
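
To read the numbers above in the units used later in this message, here
is a minimal Python sketch that converts the reported latencies from
msecs to microseconds and checks the 8*(8-1) pair count. The values are
copied from the output above; the helper names (msec_to_usec,
ordered_pairs) are made up for illustration and are not part of the
benchmark.

```python
# Latencies copied from the benchmark output above, in milliseconds.
latencies_msec = {
    "max ping-pong": 0.002876,
    "naturally ordered ring": 0.003219,
    "randomly ordered ring": 0.003223,
}


def msec_to_usec(msec):
    """Convert milliseconds to microseconds."""
    return msec * 1000.0


def ordered_pairs(processes):
    """Count ordered (sender, receiver) pairs among `processes` ranks."""
    return processes * (processes - 1)


for name, msec in latencies_msec.items():
    print(f"{name}: {msec_to_usec(msec):.3f} us")

print("ordered pairs on 8 processes:", ordered_pairs(8))
```

So the ping-pong and ring latencies reported here are all around 3 us.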

This is nowhere near the few reported Myrinet results which, I am 
guessing, are all running GM on either C cards or D cards. There are no 
recent results running MX on recent hardware. You can also notice that
there are no QsNetIII results, which would be very close to the top in
terms of latency.

I find it amusing that you have previously expressed reservations about
the Linpack benchmark used in the Top500 but you blindly trust the HPCC
results. A benchmark is useful only if it's widely used and if it is
properly implemented. HPCC is neither. It has many flaws that I have
reported and that have been ignored so far. A major one is that it
penalizes large clusters, especially on the ring latency test.

Today, MX on Myrinet-2G or 10G has a lower latency than InfiniBand, even
DDR/PCIe, and is about 1-2 us behind PathScale and QsNetIII. This gap
will shrink further with the next MX release. Regarding scaling, the
switch topology and the crossbar overheads are roughly the same.

> To quote an official who sells huge Myri clusters, and i'm 100% sure he 
> wants to keep anonymous: "you get what you pay for,
> and most of them we can sell a decent Myri network, just 1 organisation 
> had a preference for quadrics recently. On average however 10% considers 
> a different network than Myri. Money talks, they can't compete against 
> the number of gflops a
> dollar we deliver"

So, what he is saying is that Myrinet has an attractive
price-performance ratio for large clusters but you can get better if
you spend more? I would dare to say that it's a good thing :-)

Large clusters have different constraints. A major one is cabling, for
example: it is a bad idea to cable 1024 nodes with short, fat copper
cables. Money does not always talk first.

> It's not that organisations look for "what is the best network?"

You would really be surprised by the amount of benchmarking most
customers require when evaluating a machine. A lot of Gigabit Ethernet
clusters are good enough for customer needs, so they don't spend more
money on network. A lot of Myrinet-2G clusters are good enough for the
latency sensitive applications, so people don't need to pay twice as
much for Myrinet-10G.

You are somehow convinced that institutions buying clusters are
brain-dead and always get ripped off. Some are, but most are not. You
don't have all of the information used in their decision process, so you
draw invalid conclusions.

> They either give the job to a friend of them (because of some weird 
> demand that just 1 manufacturer can provide),
> or they have an open bid and if you can bid with a network that's $800 a 
> port, then that bid is gonna get taken over
> a bid that's $1500 a port.

The key is to set the right requirements in your RFP. Naive RFPs would
use broken benchmarks like HPCC. Smart RFPs would require benchmarking
real application cores under reliability and performance constraints.

It's not that "you get what you pay for", it's "you get what you ask for
at the best price".

> This where the network is one of the important choices to make for a 
> supercomputer. I'd argue nowadays, because
> the cpu's get so fast compared to latencies over networks, it's THE most 
> important choice.

In the vast majority of applications in production today, I would argue
that it's not. Why? Because only a subset of codes have enough
communication to justify a 10x increase in network cost compared to
basic Gigabit Ethernet. Your application is very fine-grained, because
it does not compute much, but chess is not representative of HPC
workloads...

> My government bought 600+ node network with infiniband and and and.... 
> dual P4 Xeons.
> Incredible.

Again, you don't know the whole story: you don't know the deal they got
on the chips, you don't know if their applications run fast enough on
Xeons, you don't know if they could have had the same support service on
Opteron (Dell does not sell AMD, for example).

By the way, your government is also buying Myrinet ;-)
http://www.hpcwire.com/hpc/644562.html

> I believe personal in measuring at full system load.

OK, you want to buy a 1024-node cluster. How do you measure at full
system load? You ask to benchmark another 1024-node cluster? You
can't, no vendor has such a cluster ready for evaluation. Even if they
had one, things change so quickly in HPC that it would be obsolete very
quickly from a sales point of view.

The only way is to benchmark something smaller (256 nodes) and define
performance requirements at 1024 nodes. If the winning bid does not
match the acceptance criteria, you refuse the machine or you negotiate a
"punitive package".
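
As a rough illustration of what checking acceptance criteria could look
like, here is a minimal Python sketch. Everything in it is hypothetical:
the Requirement class, the requirement names and the thresholds are
invented for the example and do not come from any real RFP.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    """One acceptance criterion from a (hypothetical) RFP."""
    name: str
    threshold: float
    higher_is_better: bool

    def passes(self, measured: float) -> bool:
        # Bandwidth-like metrics must reach the threshold,
        # latency-like metrics must stay at or below it.
        if self.higher_is_better:
            return measured >= self.threshold
        return measured <= self.threshold


def failed_requirements(requirements, measurements):
    """Return the names of requirements the delivered machine misses.

    A requirement with no measurement counts as failed.
    """
    failed = []
    for req in requirements:
        value = measurements.get(req.name)
        if value is None or not req.passes(value):
            failed.append(req.name)
    return failed


# Hypothetical thresholds and on-site measurements, for illustration only.
rfp = [
    Requirement("ring latency (us)", 5.0, higher_is_better=False),
    Requirement("ring bandwidth (MB/s)", 200.0, higher_is_better=True),
]
on_site = {"ring latency (us)": 6.1, "ring bandwidth (MB/s)": 230.0}

print(failed_requirements(rfp, on_site))
```

An empty list means the machine meets every criterion; anything else is
what you bring to the negotiation.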

> The myri networks i ran on were not so good. When i asked the same big 
> blue guy the answer was:
>   "yes on paper it is good nah? However that's without the overhead that 
> you practical have from network
>    and other users".

Which machine, which NICs, which software ? We have 4 generations of
products with 2 different software interfaces, and it's all called
"Myrinet".

On *all* switched networks, there is a time when you share links with
other communications, unless you are on the same crossbar. Some sites do
care about process mapping (maximize job on same crossbar or same
switch), some don't. From the IBM guy's comment, I guess he doesn't know 
better.
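
To make the process-mapping point concrete, here is a minimal Python
sketch that estimates how much of a job's pairwise traffic can stay on
one crossbar. It assumes, purely for illustration, that nodes are
numbered consecutively and that each crossbar serves a fixed number of
nodes (nodes_per_crossbar); real fabrics and schedulers are more
complicated, and the function names are made up.

```python
from itertools import combinations


def crossbar_of(node, nodes_per_crossbar):
    """Index of the crossbar a node hangs off (assumed consecutive layout)."""
    return node // nodes_per_crossbar


def same_crossbar_fraction(job_nodes, nodes_per_crossbar):
    """Fraction of node pairs in a job that share a crossbar."""
    pairs = list(combinations(job_nodes, 2))
    if not pairs:
        return 1.0  # zero or one node: nothing leaves the crossbar
    local = sum(
        1
        for a, b in pairs
        if crossbar_of(a, nodes_per_crossbar) == crossbar_of(b, nodes_per_crossbar)
    )
    return local / len(pairs)


# A packed placement versus a scattered one, with 8 nodes per crossbar
# (an assumed figure, not a statement about any particular switch).
print(same_crossbar_fraction(range(8), 8))
print(same_crossbar_fraction([0, 8, 16, 24], 8))
```

A packed job never shares inter-switch links with other jobs; a
scattered one does so for every pair.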

> A network is just as good as its weakest link. With many users there is 
> always a user that hits that weak link.

There is no "weak" link in modern network fabrics, but there is
contention. Contention is hard to manage, but there is no real way
around except having a full crossbar like the Earth Simulator. Clos
topologies (Myrinet, IB) have contention, Torus topologies (Red Storm,
Blue Gene) have contention, that's life. If you don't understand it, you
will say the network is no good.

> That said, i'm sure some highend Myri component will work fine too.
> 
> This is the problem with *several* manufacturers basically.
> They usually have 1 superior switch that's nearly unaffordable, or just 
> used for testers,
> and in reality they deliver a different switch/router which sucks ass, 
> to say polite.
> This said without accusing any manufacturer of it.
> 
> But they do it all.

Not often in HPC. The HPC market is so small and so low-volume that you
cannot take the risk of alienating customers like that: they won't come
back. If they don't come back, you go out of business.

Furthermore, the customer accepts delivery of a machine on-site, testing
the real thing. If it does not match the RFP requirements, they can
refuse it and someone will lose a lot of money. It has happened many
times. It's not like the IT business, where you buy something based on
third-party reviews and/or on spec sheets. Some do that in HPC, and they
get what they deserve, but believe me, most don't.

Patrick