[Beowulf] Remote console management
ballen at gravity.phys.uwm.edu
Sun Sep 25 18:24:48 PDT 2005
> An alternative approach could be to reshuffle the money from the
> distributed local storage that you sketched out, and have cheaper
> diskless (and therefore almost stateless) compute nodes (or nodes with a
> non-RAID single drive as scratch space for intermediate results) plus a
> gang of storage nodes that are dedicated access points to a bunch of
> iSCSI- or fibre-attached drive enclosures.
Nope -- not enough bandwidth to the data. With our current plan our
bandwidth to the data will be 400 x 100 MB/sec = 40 GB/sec. This is
enough to read ALL 400 TB of data on the cluster in 10000 sec, or about
three hours. You can't even come close with centralized (non-distributed)
storage.
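
As a quick sanity check of the arithmetic above, here is a minimal Python sketch. It assumes decimal units (1 GB = 1000 MB, 1 TB = 1000 GB), a sustained 100 MB/sec read rate per node, and raw capacity from four 250 GB disks per node (usable capacity behind RAID would be somewhat lower). The variable names are purely illustrative.

```python
# Back-of-the-envelope check of the aggregate bandwidth figures.
# Assumptions: decimal units, 100 MB/sec sustained per node, raw disk capacity.

NODES = 400
NODE_MB_PER_SEC = 100      # per-node read rate quoted in the post
DISKS_PER_NODE = 4
DISK_GB = 250

# Aggregate bandwidth when every node reads its own local disks in parallel.
aggregate_gb_per_sec = NODES * NODE_MB_PER_SEC / 1000

# Total raw data held across the cluster.
total_tb = NODES * DISKS_PER_NODE * DISK_GB / 1000

# Time to read everything once at the aggregate rate.
seconds = total_tb * 1000 / aggregate_gb_per_sec
hours = seconds / 3600

print(f"aggregate bandwidth: {aggregate_gb_per_sec:.0f} GB/sec")
print(f"total raw data:      {total_tb:.0f} TB")
print(f"full read time:      {seconds:.0f} sec (~{hours:.1f} hours)")
```

The point of the sketch is that the read time scales with per-node bandwidth times node count, so a central storage pool would need roughly 40 GB/sec of delivered bandwidth to match it.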
> Bruce Allen wrote:
>> Good to "see you" in this discussion -- I think this thread would be the
>> basis for a nice article.
>> Spending the $$$ to buy some extra nodes won't work in our case. We don't
>> just use the cluster for computing, we also use it for data storage. Each
>> of the 400+ nodes will have four 250GB disks and a hardware RAID controller
>> (3ware 9500 or Areca 1110). If a node is acting odd, we'd like to be able
>> to diagnose/fix/reboot/restore it quickly if possible. To replicate the
>> data from a distant tape-backed repository will take many hours. So having
>> some 'extra' machines doesn't help us so much, since we wouldn't know what
>> data to keep on them, and moving the data onto them when needed would
>> normally take much longer than bringing back to life the node that's gone
>> down.
>> On Sat, 24 Sep 2005, Douglas Eadline wrote:
>>>> We're getting ready to put together our next large Linux compute cluster.
>>>> This time around, we'd like to be able to interact with the machines
>>>> remotely. By this I mean that if a machine is locked up, we'd like to be
>>>> able to see what's on the console, power cycle it, mess with BIOS
>>>> settings, and so on, WITHOUT having to drive to work, go into the cluster
>>>> room, etc.
>>> This brings up an interesting point and I realize this does come down to
>>> a design philosophy, but cluster economics sometimes create nonstandard
>>> solutions. So here is another way to look at "out of band monitoring".
>>> Instead of adding layers of monitoring and control, why not take that
>>> cost and buy extra nodes? (But make sure you have a remote hard power
>>> cycle capability.) If a node dies and cannot be rebooted, turn it off and
>>> fix it later. Of course monitoring fans and temperatures is a good thing
>>> (tm), but if a node will not boot, and you have to play with the BIOS, then
>>> I would consider it broken.
>>> Because you have "over capacity" in your cluster (you bought extra nodes)
>>> this does not impact the amount of work that needs to get done. Indeed, prior
>>> to the failure you can have the extra nodes working for you. You fully
>>> understand that at various times one or two nodes will be off line. They
>>> are taken out of the scheduler and there is no need to fix them right
>>> away.
>>> This approach also depends on what you are doing with your
>>> cluster and the cost of nodes etc. In some cases out-of-band access
>>> is a good thing. In other cases, the "STONIH-AFIT" ("shoot the other node
>>> in the head and fix it tomorrow") approach is also reasonable.
>>> check out http://www.clustermonkey.net
> Michael Will
> Penguin Computing Corp.
> Sales Engineer
> 415-954-2899 fx
> mwill at penguincomputing.com