[Beowulf] Remote console management

Douglas Eadline deadline at clustermonkey.net
Sat Sep 24 10:21:29 PDT 2005


> We're getting ready to put together our next large Linux compute cluster.
> This time around, we'd like to be able to interact with the machines
> remotely.  By this I mean that if a machine is locked up, we'd like to be
> able to see what's on the console, power cycle it, mess with BIOS
> settings, and so on, WITHOUT having to drive to work, go into the cluster
> room, etc.
>
This brings up an interesting point. I realize this comes down to
design philosophy, but cluster economics sometimes lead to nonstandard
solutions. So here is another way to look at "out of band monitoring".
Instead of adding layers of monitoring and control, why not take that
money and buy extra nodes? (But make sure you have a remote hard power
cycle capability.) If a node dies and cannot be rebooted, turn it off and
fix it later. Of course, monitoring fans and temperatures is a good thing
(tm), but if a node will not boot and you have to play with the BIOS, then
I would consider it broken.

Because you have "over capacity" in your cluster (you bought extra nodes),
this does not impact the amount of work that gets done. Indeed, prior
to any failure the extra nodes are working for you. You accept up front
that at various times one or two nodes will be offline. They
are taken out of the scheduler and there is no need to fix them right
away.
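
As a rough illustration of that trade-off, here is a minimal Python
sketch. The prices and node counts are made-up figures, not numbers from
any real quote, and the function names are hypothetical. It only shows
the arithmetic: how many spare nodes the management budget would buy, and
how many nodes remain in the scheduler with a couple switched off.

```python
def spare_nodes(node_count, mgmt_cost_per_node, node_cost):
    """Whole extra nodes the out-of-band management budget would buy instead."""
    return (node_count * mgmt_cost_per_node) // node_cost


def usable_nodes(node_count, spares, nodes_down):
    """Nodes still in the scheduler after the dead ones are switched off."""
    return node_count + spares - nodes_down


# Hypothetical figures: 128 planned nodes, 150 per node for management
# hardware, 2400 per compute node.
planned = 128
spares = spare_nodes(planned, mgmt_cost_per_node=150, node_cost=2400)
print(spares)                                         # 8
print(usable_nodes(planned, spares, nodes_down=2))    # 134
```

With these assumed numbers, two dead nodes still leave more working nodes
than the original plan called for. Different prices will shift the result
either way.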

This approach also depends on what you are doing with your
cluster, the cost of nodes, etc. In some cases out-of-band access
is a good thing. In other cases, the "STONIH-AFIT" (shoot the other node
in the head and fix it tomorrow) approach is also reasonable.


-- 
Doug

check out http://www.clustermonkey.net


