[Beowulf] Remote console management

Bruce Allen ballen at gravity.phys.uwm.edu
Sun Sep 25 18:24:48 PDT 2005


> An alternative approach could be to reallocate the money from the
> distributed local storage that you sketched out, and have cheaper
> diskless (and therefore almost stateless) compute nodes (or nodes with a
> non-RAID single drive as scratch space for intermediate results) plus a
> gang of storage nodes that are dedicated access points to a bunch of
> iSCSI or Fibre Channel attached drive enclosures.

Nope -- not enough bandwidth to the data.  With our current plan our 
bandwidth to the data will be 400 x 100 MB/sec = 40 GB/sec.  This is 
enough to read ALL 400 TB of data on the cluster in 10000 sec, or about 
three hours.  You can't even come close with centralized (non-distributed) 
storage systems.
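The figures above can be checked with a few lines of arithmetic; the node count, per-node disk bandwidth, and total capacity are the ones quoted in this message (decimal units, 1 TB = 10^6 MB):

```python
# Back-of-the-envelope check of the aggregate-bandwidth claim.
nodes = 400            # compute/storage nodes in the cluster
per_node_mb_s = 100    # local RAID read bandwidth per node, MB/sec
total_tb = 400         # total data stored across the cluster, TB

aggregate_gb_s = nodes * per_node_mb_s / 1000.0            # 400 x 100 MB/s
scan_seconds = total_tb * 1e6 / (nodes * per_node_mb_s)    # time to read it all

print(aggregate_gb_s)         # 40.0 GB/sec
print(scan_seconds)           # 10000.0 sec
print(scan_seconds / 3600)    # ~2.8 hours
```

A centralized storage system would have to sustain the full 40 GB/sec through its head nodes and network fabric to match this, which is the point of keeping the data distributed.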

Cheers,
 	Bruce

> Bruce Allen wrote:
>
>> Doug,
>> 
>> Good to "see you" in this discussion -- I think this thread would be the 
>> basis for a nice article.
>> 
>> Spending the $$$ to buy some extra nodes won't work in our case.  We don't 
>> just use the cluster for computing, we also use it for data storage. Each 
>> of the 400+ nodes will have four 250GB disks and a hardware RAID controller 
>> (3ware 9500 or Areca 1110).  If a node is acting odd, we'd like to be able 
>> to diagnose/fix/reboot/restore it quickly if possible.  To replicate the 
>> data from a distant tape-backed repository will take many hours. So having 
>> some 'extra' machines doesn't help us so much, since we wouldn't know what 
>> data to keep on them, and moving the data onto them when needed would 
>> normally take much longer than bringing back to life the node that's gone 
>> down.
>> 
>> Cheers,
>>     Bruce
>> 
>> 
>> On Sat, 24 Sep 2005, Douglas Eadline wrote:
>> 
>>> 
>>>> We're getting ready to put together our next large Linux compute cluster.
>>>> This time around, we'd like to be able to interact with the machines
>>>> remotely.  By this I mean that if a machine is locked up, we'd like to be
>>>> able to see what's on the console, power cycle it, mess with BIOS
>>>> settings, and so on, WITHOUT having to drive to work, go into the cluster
>>>> room, etc.
>>>> 
>>> This brings up an interesting point, and I realize it comes down to
>>> design philosophy, but cluster economics sometimes create non-standard
>>> solutions. So here is another way to look at "out of band monitoring".
>>> Instead of adding layers of monitoring and control, why not take that
>>> cost and buy extra nodes (but make sure you have a remote hard power
>>> cycle capability)? If a node dies and cannot be rebooted, turn it off and
>>> fix it later. Of course monitoring fans and temperatures is a good thing
>>> (tm), but if a node will not boot and you have to play with the BIOS, then
>>> I would consider it broken.
>>> 
>>> Because you have "over capacity" in your cluster (you bought extra nodes),
>>> this does not impact the amount of work that needs to get done. Indeed,
>>> prior to the failure you can have the extra nodes working for you. You
>>> fully understand that at various times one or two nodes will be off line.
>>> They are taken out of the scheduler and there is no need to fix them right
>>> away.
>>> 
>>> This approach also depends on what you are doing with your
>>> cluster, the cost of nodes, etc. In some cases out-of-band access
>>> is a good thing. In other cases, the "STONITH-AFIT" (shoot the other node
>>> in the head and fix it tomorrow) approach is also reasonable.
>>> 
>>> 
>>> -- 
>>> Doug
>>> 
>>> check out http://www.clustermonkey.net
>>> 
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit 
>> http://www.beowulf.org/mailman/listinfo/beowulf
>
>
>
> -- 
> Michael Will
> Penguin Computing Corp.
> Sales Engineer
> 415-954-2822
> 415-954-2899 fx
> mwill at penguincomputing.com
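The remote-console capability asked about at the top of the quoted thread (seeing a hung node's console, power cycling it, reaching BIOS setup without driving to the machine room) is typically provided by a baseboard management controller speaking IPMI, driven with ipmitool. As a sketch only -- the BMC hostname, user, and password below are hypothetical placeholders, not anything from this thread -- the relevant invocations can be assembled like this:

```python
# Sketch: build ipmitool command lines for out-of-band node management.
# Hostname and credentials are hypothetical placeholders.

def ipmi_cmd(bmc_host, *args, user="admin", password="secret"):
    """Return an ipmitool argv for the given node's BMC (lanplus = IPMI 2.0)."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host,
            "-U", user, "-P", password, *args]

# Hard power cycle a wedged node (the "remote hard power cycle" Doug insists on):
power_cycle = ipmi_cmd("node042-bmc", "chassis", "power", "cycle")

# Attach to the serial console via Serial-over-LAN, e.g. to watch boot
# messages or enter BIOS setup remotely:
console = ipmi_cmd("node042-bmc", "sol", "activate")

print(" ".join(power_cycle))
print(" ".join(console))
```

Serial-over-LAN requires the node's BIOS to redirect its console to the serial port the BMC captures; with that in place, both the "fix it remotely" and the "shoot it and fix it tomorrow" philosophies are served by the same hardware.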



More information about the Beowulf mailing list