[Beowulf] Remote console management
mwill at penguincomputing.com
Sun Sep 25 13:35:44 PDT 2005
This will of course impact the performance of
the compute jobs whenever I/O happens.
An alternative approach could be to reshuffle the money from the
distributed local storage that you sketched out, and have cheaper
diskless (and therefore almost maintenance-free) compute nodes (or
nodes with a single non-raided drive for scratch space), plus a gang
of storage nodes that are dedicated access points to a bunch of iSCSI
or fibre attached drive enclosures.
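
For illustration, here is a minimal sketch of how a storage node could
attach such enclosures using open-iscsi's iscsiadm; the portal address
and target names are made-up placeholders, not from any real setup:

#!/usr/bin/env python
# Attach iSCSI-exported drive enclosures on a storage node.
import subprocess

PORTAL = "192.168.10.5:3260"  # hypothetical enclosure address

def discover_targets(portal):
    # Ask the portal which iSCSI targets it exports; output lines look
    # like "192.168.10.5:3260,1 iqn.2005-09.com.example:disk0"
    out = subprocess.check_output(
        ["iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", portal])
    return [line.split()[1] for line in out.decode().splitlines() if line]

def login(target, portal):
    # Log in to one target so the kernel exposes it as /dev/sdX.
    subprocess.check_call(
        ["iscsiadm", "-m", "node", "-T", target, "-p", portal, "--login"])

if __name__ == "__main__":
    for target in discover_targets(PORTAL):
        login(target, PORTAL)

Once logged in, the targets appear as ordinary block devices that the
storage node could RAID together and export (e.g. over NFS) to the
diskless compute nodes.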
Bruce Allen wrote:
> Good to "see you" in this discussion -- I think this thread would be
> the basis for a nice article.
> Spending the $$$ to buy some extra nodes won't work in our case. We
> don't just use the cluster for computing, we also use it for data
> storage. Each of the 400+ nodes will have four 250GB disks and a
> hardware RAID controller (3ware 9500 or Areca 1110). If a node is
> acting odd, we'd like to be able to diagnose/fix/reboot/restore it
> quickly if possible. To replicate the data from a distant tape-backed
> repository will take many hours. So having some 'extra' machines
> doesn't help us so much, since we wouldn't know what data to keep on
> them, and moving the data onto them when needed would normally take
> much longer than bringing back to life the node that's gone down.
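
A quick back-of-envelope calculation makes Bruce's point concrete; the
transfer rates below are illustrative assumptions, not measured numbers:

# Time to re-stage one node's worth of data from a remote repository.
node_data_gb = 4 * 250   # four 250GB disks per node
rates_mb_s = {"gigabit Ethernet": 100,  # assumed practical throughput
              "tape read": 30}          # assumed sustained tape rate
for label, rate in rates_mb_s.items():
    hours = node_data_gb * 1000.0 / rate / 3600
    print("%s-limited: %.1f hours for %d GB" % (label, hours, node_data_gb))
# ~2.8 hours link-limited, ~9.3 hours tape-limited -- versus minutes
# to power-cycle and fsck a node that is otherwise healthy.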
> On Sat, 24 Sep 2005, Douglas Eadline wrote:
>>> We're getting ready to put together our next large Linux compute cluster.
>>> This time around, we'd like to be able to interact with the machines
>>> remotely. By this I mean that if a machine is locked up, we'd like
>>> to be
>>> able to see what's on the console, power cycle it, mess with BIOS
>>> settings, and so on, WITHOUT having to drive to work, go into the
>>> room, etc.
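
This is exactly the niche that IPMI-over-LAN fills. As a sketch (the
node name and credentials are placeholders, and it assumes BMCs on a
management LAN with Serial-over-LAN and BIOS console redirection
enabled):

#!/usr/bin/env python
# Remote power control and text console via ipmitool.
import subprocess

def ipmi(bmc_host, *args):
    base = ["ipmitool", "-I", "lanplus", "-H", bmc_host,
            "-U", "admin", "-P", "secret"]  # hypothetical credentials
    return subprocess.call(base + list(args))

# Hard power-cycle a wedged node:
ipmi("node042-bmc", "chassis", "power", "cycle")

# Watch the console (including BIOS screens) over Serial-over-LAN:
ipmi("node042-bmc", "sol", "activate")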
>> This brings up an interesting point and I realize this does come down to
>> a design philosophy, but cluster economics sometimes create non-standard
>> solutions. So here is another way to look at "out of band monitoring".
>> Instead of adding layers of monitoring and control, why not take that
>> cost and buy extra nodes. (but make sure you have a remote hard power
>> cycle capability). If a node dies and cannot be rebooted, turn it
>> off, and
>> fix it later. Of course monitoring fans and temperatures is a good thing
>> (tm), but if a node will not boot and you have to play with the BIOS,
>> I would consider it broken.
>> Because you have "over capacity" in your cluster (you bought extra
>> nodes), this does not impact the amount of work that needs to get
>> done. Indeed, prior to the failure you can have the extra nodes
>> working for you. You fully understand that at various times one or
>> two nodes will be off line. They are taken out of the scheduler and
>> there is no need to fix them right away.
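
As a sketch of how little machinery the "turn it off and fix it later"
policy needs, here is one way to pull dead nodes out of a PBS/Torque
scheduler automatically; the node list and the ping-based liveness
test are illustrative assumptions:

#!/usr/bin/env python
# Mark unreachable nodes offline so the scheduler routes around them.
import subprocess

NODES = ["node%03d" % i for i in range(1, 11)]  # hypothetical node list

def alive(node):
    # One ping with a two-second reply timeout (Linux iputils flags).
    return subprocess.call(
        ["ping", "-c", "1", "-W", "2", node],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

for node in NODES:
    if not alive(node):
        # pbsnodes -o marks the node offline; no new jobs land on it.
        subprocess.call(["pbsnodes", "-o", node])
        print("offlined %s; fix it tomorrow" % node)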
>> This approach also depends on what you are doing with your
>> cluster and the cost of nodes etc. In some cases out-of-band access
>> is a good thing. In other cases, the "STONITH-AFIT" (shoot the other
>> node in the head and fix it tomorrow) approach is also reasonable.
>> check out http://www.clustermonkey.net
Penguin Computing Corp.
mwill at penguincomputing.com