[Beowulf] 512 nodes Myrinet cluster Challanges
diep at xs4all.nl
Mon May 1 13:37:49 PDT 2006
With that many nodes I'd go for either InfiniBand or Quadrics, assuming the
largest partition also gets all 512 nodes.
Both scale way better at that node count, and your software will need a lot
of communication, as you'll probably need quite a lot of RAM for the
applications on all nodes.
Of course most vendors want to sell you Myrinet as it's simply cheaper; they
might earn more on it per node.
For this type of code, the network you use and the total amount of RAM are
the 2 most important choices.
You could consider putting 2 network cards in each node, assuming each node
is quite big, in order to dedicate the high-end network completely to the
RAM communication.
As I/O already has quite a high latency, for I/O you could make do with a
network that has huge bandwidth but poor latency, and use a real
state-of-the-art high-end network for the memory communication.
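To make the latency versus bandwidth split concrete, here is a minimal
sketch using the usual simple cost model (transfer time = latency + size /
bandwidth). The two network profiles and all numbers in them are made-up
illustrations, not measurements of Myrinet, InfiniBand, Quadrics or any
other real interconnect.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Linear cost model: fixed per-message latency plus size over bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


# Hypothetical network profiles (assumed figures, for illustration only).
LOW_LATENCY_NET = {"latency_s": 5e-6, "bandwidth_bytes_per_s": 1e9}
HIGH_BANDWIDTH_NET = {"latency_s": 50e-6, "bandwidth_bytes_per_s": 2e9}


def faster_network(size_bytes):
    """Return which hypothetical network moves one message of this size sooner."""
    low = transfer_time(size_bytes, **LOW_LATENCY_NET)
    high = transfer_time(size_bytes, **HIGH_BANDWIDTH_NET)
    return "low_latency" if low <= high else "high_bandwidth"


if __name__ == "__main__":
    # Small messages (typical of memory traffic) versus large I/O blocks.
    for size in (1024, 64 * 1024, 100 * 1024 * 1024):
        print(size, faster_network(size))
```

With these assumed figures, small messages finish sooner on the low-latency
network and large I/O blocks finish sooner on the high-bandwidth one, which
is the reasoning behind giving each kind of traffic its own card.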
The problems you can expect depend largely on the number of users that are
going to use your cluster simultaneously.
More users = more problems.
Just avoid using all that commercial software for putting nodes to work that
most manufacturers try to sell you.
My experience is that PDSH works pretty well for starting work.
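As a rough sketch of what starting work with pdsh can look like, the helper
below only builds the command line; it does not run anything. The node
naming scheme (node001 to node512), the helper name and the example remote
command are assumptions for illustration. The -w (target hosts) and -f
(fanout) options are standard pdsh options.

```python
import shlex


def pdsh_command(prefix, first, last, remote_cmd, width=3, fanout=None):
    """Build a pdsh argv that runs remote_cmd on prefix[first-last].

    The host naming scheme is hypothetical; adjust it to your cluster.
    """
    hosts = f"{prefix}[{first:0{width}d}-{last:0{width}d}]"
    argv = ["pdsh", "-w", hosts]
    if fanout is not None:
        # -f limits how many remote commands run concurrently.
        argv += ["-f", str(fanout)]
    argv.append(remote_cmd)
    return argv


if __name__ == "__main__":
    # Print the command instead of executing it, so this is safe to try.
    print(shlex.join(pdsh_command("node", 1, 512, "uptime", fanout=64)))
```

To actually run it, the list could be handed to subprocess.run() on a
machine where pdsh is installed and the nodes are reachable.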
Does your software handle dying nodes and can the network hotswap them?
If not, just consider the odds that sometimes a node needs maintenance.
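To put a number on those odds, here is a small sketch. It assumes every
node is down independently with the same probability on a given day; the
0.1% figure is a made-up example, not a measured failure rate.

```python
def prob_any_node_down(nodes, p_node_down):
    """Chance that at least one of `nodes` is down, assuming independence."""
    return 1.0 - (1.0 - p_node_down) ** nodes


if __name__ == "__main__":
    # Hypothetical per-node, per-day probability of needing maintenance.
    p = 0.001
    for n in (1, 64, 512):
        print(n, round(prob_any_node_down(n, p), 3))
```

Under that assumption, with 512 nodes there is roughly a 40% chance on any
given day that at least one node is out, so a job that needs all nodes and
cannot survive a failure will be interrupted quite often.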
How do you want to divide the cluster: into 1 partition of 512 nodes, or do
you plan all kinds of small partitions?
A network is of course more expensive when you have 1 huge cluster than when
you divide it into small partitions.
If a node dies, then with several small partitions your other partitions
keep running without problems. Only the partition with
the dying node has a problem.
Most likely that dying node just has some dust inside its psu :)
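A quick sketch of that trade-off, assuming equal-sized partitions and that
a job spans its whole partition and stops when any node in it dies (both
are simplifying assumptions, and the function name is made up):

```python
def fraction_still_running(total_nodes, partition_size):
    """Fraction of the cluster unaffected when exactly one node dies."""
    if partition_size <= 0 or total_nodes % partition_size != 0:
        raise ValueError("partition_size must evenly divide total_nodes")
    # Only the partition containing the dead node is stalled.
    return (total_nodes - partition_size) / total_nodes


if __name__ == "__main__":
    for size in (512, 128, 64, 32):
        print(size, fraction_still_running(512, size))
```

With 1 partition of 512 nodes a single dead node stalls everything, while
with partitions of 64 nodes 7 out of 8 partitions keep running.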
----- Original Message -----
From: "Walid" <walid.shaari at gmail.com>
To: <beowulf at beowulf.org>
Sent: Wednesday, April 26, 2006 11:34 AM
Subject: [Beowulf] 512 nodes Myrinet cluster Challanges
> Hi all,
> Does anyone know what types of problems/challenges to expect with big
> clusters? We are considering having a 512 node cluster that will be using
> Myrinet as its main interconnect, and would like to do our homework.
> The cluster is meant to run an in-house fluid simulation application
> that is I/O intensive and requires large memory models.
> Any hints or pointers will be appreciated.