[Beowulf] 512 nodes Myrinet cluster Challenges

Vincent Diepeveen diep at xs4all.nl
Mon May 1 15:42:14 PDT 2006


Just measure the random ring latency of that 1024-node Myri system and
compare.
There are several tables around with random ring latency figures:

http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
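For context, the HPCC RandomRing benchmark orders the processes into a ring by a random permutation, so each message has to cross the real switch fabric rather than staying between neighbouring nodes. A rough sketch of just the communication pattern (plain Python standing in for MPI ranks; this is illustrative, not the actual HPCC code):

```python
import random

def random_ring(nprocs: int, seed: int = 0) -> list[int]:
    """A random permutation of ranks; position i sends to position i+1 (wrapping)."""
    order = list(range(nprocs))
    random.Random(seed).shuffle(order)
    return order

def successors(order: list[int]) -> dict[int, int]:
    """Map each rank to the rank it sends to in the random ring."""
    n = len(order)
    return {order[i]: order[(i + 1) % n] for i in range(n)}

ring = random_ring(8)
succ = successors(ring)

# Every rank has exactly one successor and one predecessor:
assert sorted(succ) == sorted(succ.values()) == list(range(8))

# Following the successors visits every rank once before closing the ring:
r, seen = ring[0], set()
while r not in seen:
    seen.add(r)
    r = succ[r]
assert len(seen) == 8
```

The real benchmark then times a send/receive around this ring (e.g. with MPI_Sendrecv) and reports the average per-hop latency; a random ordering is what makes it sensitive to the whole fabric instead of just nearest-neighbour links.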

To quote an official who sells huge Myri clusters, and I'm 100% sure he
wants to remain anonymous: "you get what you pay for,
and to most of them we can sell a decent Myri network; just one organisation
had a preference for Quadrics recently. On average, however, 10% consider a
different network than Myri. Money talks; they can't compete against the
number of gflops per dollar we deliver."

Selling mass-market products myself, I can do the math on products very well.
If you can put in a network that's, say, $800 a port
versus one that's $1500 a port at 512 nodes, then you can do the math
just as well as I can on which is more expensive.

Or better said, you can figure out for yourself which one earns you more
when bidding on a contract at a fixed job price.
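The per-port arithmetic above is simple enough to spell out (prices are the illustrative figures from the example, not real quotes):

```python
def network_cost(nodes: int, price_per_port: float) -> float:
    """Total interconnect cost, assuming one switch port per node."""
    return nodes * price_per_port

nodes = 512
cheap = network_cost(nodes, 800)       # the $800-a-port network
premium = network_cost(nodes, 1500)    # the $1500-a-port network

print(f"cheap:      ${cheap:,.0f}")               # $409,600
print(f"premium:    ${premium:,.0f}")             # $768,000
print(f"difference: ${premium - cheap:,.0f}")     # $358,400
```

At 512 nodes the gap is over a third of a million dollars, which is exactly the margin a bidder sees when pricing a contract.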

It's not that organisations look for "the best network."

They either give the job to a friend of theirs (via some peculiar requirement
that only one manufacturer can meet),
or they hold an open bid, and if you can bid with a network that's $800 a
port, that bid is going to get taken over
one that's $1500 a port.

This is where the network becomes one of the important choices to make for a
supercomputer. I'd argue that nowadays, because
CPUs have become so fast compared to network latencies, it's THE most
important choice.

Usually that means you get offered a dual-Xeon P4 cluster. The kind of
Prescott Xeon CPUs with 4-cycle L1 latency...

My government bought a 600+ node cluster with InfiniBand and... dual
P4 Xeons.
Incredible.

That's simply how the market works with salesmen.
If it makes the node price cheaper, it gets used and sold.

For most clusters you have to make an effort to figure out *what* network it
is, and *how* it is configured.

I personally believe in measuring at full system load.

I should port my test from multiprocessor Linux primitives to MPI one
day.

But even then, will you run it on your 1024-node cluster?

The Myri networks I ran on were not so good. When I asked the same Big Blue
guy, the answer was:
   "yes, on paper it looks good, no? But that's without the overhead you
get in practice from the network
    and the other users."

A network is only as good as its weakest link. With many users there is
always a user who hits that weak link.

That said, I'm sure some high-end Myri components work fine too.

This is basically the problem with *several* manufacturers.
They usually have one superior switch that's nearly unaffordable, or used
only for benchmarkers,
and in reality they deliver a different switch/router that is, to put it
politely, poor.
This is said without accusing any particular manufacturer.

But they all do it.

Sometimes a one-letter difference in the serial number is all that
distinguishes the two switches.

Vincent

p.s. I still wonder what NUMAflex routers SGI delivered to my country's 512p
TERAS system. I could be mistaken, but I was under the impression they were a
lot slower than the routers in the 256p and smaller partitions of
that TERAS system.
Perhaps the problem was simply BAD HARDWARE SCALING from 64 nodes to 128
nodes.

My government, after getting that partition delivered in 2000, didn't run a
single test on it. In November 2003 I was the first one
to run a test on the 512p partition, as its only user.

----- Original Message ----- 
From: "David Kewley" <kewley at gps.caltech.edu>
To: <beowulf at beowulf.org>
Cc: "Vincent Diepeveen" <diep at xs4all.nl>; "Walid" <walid.shaari at gmail.com>
Sent: Monday, May 01, 2006 9:10 PM
Subject: Re: [Beowulf] 512 nodes Myrinet cluster Challenges


> On Monday 01 May 2006 13:37, Vincent Diepeveen wrote:
>> With so many nodes i'd go for either infiniband or quadrics, assuming the
>> largest partition also gets 512 nodes.
>>
>> Scales way better at so many nodes, as your software will need really a
>> lot of communications as you'll probably need quite a lot of RAM for the
>> applications at all nodes.
>
> Vincent,
>
> Are you saying that Infiniband and Quadrics scale way better at many nodes
> than Myrinet does?  Can you describe this in more detail?  I've seen no
> problems with scaling of 1024 nodes of Myrinet, and to the best of my
> knowledge and belief the network hardware & software are designed to scale
> very well.
>
> David
> 