[Beowulf] 512 nodes Myrinet cluster Challenges

Tue May 2 14:37:05 PDT 2006

Vincent,

So, I just got back from vacation today and found this post in my huge
mailbox. Reason tells me not to waste time and to ignore it, but I
can't resist such a treat.

Diepeveen wrote:
> With so many nodes i'd go for either infiniband or quadrics, assuming 
> the largest partition also gets 512 nodes.

On which parameters do you base this opinion: link capacity,
cable type/length, routing mechanism, switch port count, price?

> Scales way better at so many nodes, as your software will need really a 
> lot of
> communications as you'll probably need quite a lot of RAM for the 
> applications at all nodes.

Do you happen to have a lot of experience with CFD codes on large
clusters? If so, which communication characteristics affect scaling in
this context?

> Of course most want to sell you myri as it's simply cheaper; they might 
> earn more onto it a node.

Ahh, these nasty vendors still trying to rip off these stupid university
people...

First, Myricom has always published its price list on the web, which is
not a common practice. Of course, resellers get a discount because they
add value (most of the time) and buy in volume. You may be surprised to
learn that this economic model is used across most of the capitalist
world and is quite efficient at building and maintaining distribution
channels.

Second, Myrinet-2G is indeed cheaper than Myrinet-10G, because it's
an older product. However, the latency of Myrinet-2G is almost the same
as that of Myrinet-10G for very small message sizes. It is therefore an
attractive choice for applications that are constrained not by bandwidth
but by small-message latency, such as most CFD codes.
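
To make the latency-versus-bandwidth point concrete, here is a rough
first-order sketch: transfer time = fixed latency + size / link rate.
The 2 and 10 Gbit/s link rates come from the product names; the 3 us
latency is a placeholder I picked for illustration, not a measurement,
and it is assumed identical on both networks. The function name is
hypothetical too.

```python
def transfer_time_us(size_bytes, latency_us, bandwidth_gbit_s):
    """First-order model: fixed latency plus serialization time."""
    # Convert Gbit/s to bytes per microsecond.
    bytes_per_us = bandwidth_gbit_s * 1e9 / 8 / 1e6
    return latency_us + size_bytes / bytes_per_us


# Placeholder latency (NOT a measured value), assumed equal on both links.
LATENCY_US = 3.0

if __name__ == "__main__":
    for size in (8, 64, 1024, 65536, 1048576):
        slow = transfer_time_us(size, LATENCY_US, 2.0)   # 2 Gbit/s link
        fast = transfer_time_us(size, LATENCY_US, 10.0)  # 10 Gbit/s link
        print(f"{size:>8} B  2G: {slow:10.2f} us  "
              f"10G: {fast:10.2f} us  ratio: {slow / fast:.2f}")
```

Under these assumptions the two links are within a few percent of each
other for messages of a few dozen bytes, and the faster link only pulls
ahead once serialization time dominates the fixed latency.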

However, Myrinet-10G is comparable in price with all other similar
interconnects. HPC is a low-volume, high-margin market: there is room
for price adjustments, and vendors position their products against their
competitors. Newcomers may sell at or near cost in order to buy market
share, but that is only sustainable as long as they can burn VC money.

> You could consider putting 2 network cards in each node, assuming each 
> node is quite big, in order to give
> the highend network completely to the RAM communication.

Keeping a reasonable processes/NIC ratio is common sense. However, if
the link bandwidth is not the bottleneck, you just double the network
cost for nothing.
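
A back-of-the-envelope check for this, as a sketch only: estimate how
much of the link the processes on a node actually use. The function
name and all the traffic figures below are hypothetical placeholders;
you would plug in numbers profiled from your own application.

```python
def link_utilization(procs_per_node, mbytes_per_s_per_proc, nics, link_gbit_s):
    """Fraction of aggregate NIC bandwidth consumed by a node's processes."""
    link_mbytes_per_s = link_gbit_s * 1000 / 8  # Gbit/s -> MB/s
    demand = procs_per_node * mbytes_per_s_per_proc
    return demand / (nics * link_mbytes_per_s)


if __name__ == "__main__":
    # Hypothetical node: 4 processes, each sustaining 20 MB/s of traffic.
    for nics in (1, 2):
        u = link_utilization(4, 20.0, nics, 2.0)
        print(f"{nics} NIC(s): {u:.0%} of link bandwidth used")
```

If the figure for one NIC is already well below 100%, the second NIC
only lowers a utilization that was not limiting anything.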

> Just avoid using all that commercial software for putting nodes to work 
> that most manufacturers try to sell you.
> My experience is that PDSH works pretty good to start work.

You really don't understand economics.

To sell you commercial software, vendors must add value compared
to free software offerings. This added value can be support (you may
prefer to have IBM backing GPFS instead of going solo on one of the
free parallel file systems), or performance, or functionality, or pretty
GUIs, whatever.

If you pay for a product that has no added value compared to the free
(as in beer) alternative, you are stupid and you deserve to be robbed.
That does not mean that all commercial software offered by vendors is
worthless.

Just pdsh, hmm? Right...

> If a node dies, then with several small partitions, your other 
> partitions run further without problems. Just the partition with
> the dying node has a problem.

What type of problem?!?

Patrick