[Beowulf] Cluster Diagram of 500 PC

Julien.Leduc at lri.fr Julien.Leduc at lri.fr
Wed Jul 11 03:10:15 PDT 2007


>>  How do we start and stop all nodes using a remote computer.
>
> IPMI is an excellent, portable, well-scriptable interface for control
> and monitoring.  there are some vendor-specific alternatives, as well as
> cruder mechanisms (controllable PDU's).
IPMI is sometimes OK and sometimes not that good: be careful about your
exact needs.
IPMI is just a standard, and it can be implemented quite well or so poorly
that it does not work most of the time (and at a 500-node scale, that is a
nightmare!).
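
Because implementation quality varies this much, a script that drives IPMI
across hundreds of nodes probably wants retries and a list of the nodes it
could not reach. The following is only a minimal sketch, assuming the
ipmitool command-line client is installed and the BMCs accept the lanplus
interface; the host names, credentials, retry count and worker count are
hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ipmi_command(host, user, password, action):
    """Build an ipmitool command line for one chassis power action."""
    return ["ipmitool", "-I", "lanplus", "-H", host,
            "-U", user, "-P", password, "chassis", "power", action]


def run(cmd, timeout=15):
    """Run a command; return True only on a clean exit within the timeout."""
    try:
        done = subprocess.run(cmd, capture_output=True, timeout=timeout)
        return done.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False


def power_cycle(host, user, password, runner=run, retries=3):
    """Try to power-cycle one node, retrying because some BMCs are flaky."""
    cmd = ipmi_command(host, user, password, "cycle")
    # any() stops at the first successful attempt.
    return any(runner(cmd) for _ in range(retries))


def power_cycle_all(hosts, user, password, runner=run, workers=32):
    """Power-cycle a list of nodes in parallel; return the hosts that failed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(
            lambda h: power_cycle(h, user, password, runner), hosts))
    return [h for h, ok in zip(hosts, results) if not ok]
```

Something like power_cycle_all(["node001-bmc", "node002-bmc"], "admin",
"secret") would return the BMC addresses that never answered, which are the
nodes that need another mechanism.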

I take care of a cluster that is similar in size to the one you want to
build, and that requires a lot of reboots (>460,000 node reboots over a
9-month period => an average of 5 reboots per node per day).
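
As a rough check of the scale these figures imply, the arithmetic can be
worked through as below. This is only a sketch: it assumes that 9 months is
about 270 days, which the message does not state, and the variable names
are made up for illustration.

```python
# Figures quoted in the message.
total_reboots = 460_000
reboots_per_node_per_day = 5

# Assumption: 9 months is taken as roughly 270 days.
days = 9 * 30

# Cluster-wide reboots per day, then the node count at the quoted average.
daily_reboots = total_reboots / days
implied_nodes = daily_reboots / reboots_per_node_per_day

print(f"{daily_reboots:.0f} reboots per day across the cluster")
print(f"about {implied_nodes:.0f} nodes rebooting at the quoted average rate")
```

Under that assumption the cluster handles on the order of 1,700 reboots per
day, so even a small per-reboot failure rate means failures every day.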

In my experience, some IPMI hardware implementations are not sufficient to
ensure reliable reboots. For example, we had trouble rebooting nodes that
were in the PXE boot stage, stuck in GRUB with a missing kernel, or, worse,
running a FreeBSD system.

Controllable PDUs are not a good idea because they will burn out your
hard drives and your nodes' components pretty quickly, and with so many
nodes you will lose many even if your reboot rate is low.

Many other solutions are OK: they tend to be scriptable through a telnet +
expect script, so they are fine as long as they can reboot all your nodes
in any situation.
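
The telnet + expect pattern mentioned above can be sketched with nothing
but a socket. This is an illustration under stated assumptions, not any
vendor's interface: the "Password:" and "> " prompts, the "reboot <node>"
command and the "OK" reply are all hypothetical, and the device is assumed
to speak plain text without Telnet option negotiation.

```python
import socket


def expect(sock, token, timeout=10.0):
    """Read from sock until token appears; return everything read."""
    sock.settimeout(timeout)
    buf = b""
    while token not in buf:
        chunk = sock.recv(1024)
        if not chunk:
            raise ConnectionError("connection closed before %r" % token)
        buf += chunk
    return buf


def reboot_via_console(sock, password, node):
    """Log in and request a reboot; prompts and command are hypothetical."""
    expect(sock, b"Password:")
    sock.sendall(password.encode() + b"\r\n")
    expect(sock, b"> ")
    sock.sendall(b"reboot " + node.encode() + b"\r\n")
    # Treat the request as successful only if the reply contains "OK".
    return b"OK" in expect(sock, b"> ")
```

A caller would open the connection with something like
socket.create_connection(("mgmt-device", 23)) and pass the socket in; a
timeout in expect() raises an exception, which is the cue to flag the node
for manual attention.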

Regards,

Julien Leduc







