[Beowulf] Remote console management

Sat Sep 24 20:00:02 PDT 2005

Doug,

Good to "see you" in this discussion -- I think this thread would be the 
basis for a nice article.

Spending the $$$ to buy some extra nodes won't work in our case.  We don't 
just use the cluster for computing; we also use it for data storage. 
Each of the 400+ nodes will have four 250GB disks and a hardware RAID 
controller (3ware 9500 or Areca 1110).  If a node is acting odd, we'd like 
to be able to diagnose/fix/reboot/restore it quickly if possible.  
Replicating the data from a distant tape-backed repository takes many 
hours. So having some 'extra' machines doesn't help us much: we 
wouldn't know what data to keep on them, and moving the data onto them 
when needed would normally take much longer than reviving the node 
that's gone down.
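
To put rough numbers on "many hours", here is a back-of-the-envelope 
sketch. The per-node capacity comes from the disk counts above; the 
transfer rates are purely hypothetical placeholders, not measurements 
of our link, and usable space under RAID will be lower than the raw 
figure.

```python
def restore_hours(capacity_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to copy capacity_gb gigabytes at a sustained rate in MB/s."""
    seconds = capacity_gb * 1000.0 / rate_mb_per_s  # 1 GB taken as 1000 MB
    return seconds / 3600.0


# Four 250GB disks per node (raw capacity; RAID overhead reduces usable space).
RAW_GB_PER_NODE = 4 * 250

# Hypothetical sustained rates from the remote repository, in MB/s.
for rate in (10, 30, 100):
    hours = restore_hours(RAW_GB_PER_NODE, rate)
    print(f"{rate:>4} MB/s -> {hours:.1f} hours to refill one node")
```

Even at the most optimistic of these assumed rates, refilling a single 
node takes hours, which is why a quick remote fix looks more attractive 
to us than a spare.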

Cheers,
 	Bruce

On Sat, 24 Sep 2005, Douglas Eadline wrote:

>
>> We're getting ready to put together our next large Linux compute cluster.
>> This time around, we'd like to be able to interact with the machines
>> remotely.  By this I mean that if a machine is locked up, we'd like to be
>> able to see what's on the console, power cycle it, mess with BIOS
>> settings, and so on, WITHOUT having to drive to work, go into the cluster
>> room, etc.
>>
> This brings up an interesting point and I realize this does come down to
> a design philosophy, but cluster economics sometimes create nonstandard
> solutions. So here is another way to look at "out of band monitoring".
> Instead of adding layers of monitoring and control, why not take that
> cost and buy extra nodes? (But make sure you have a remote hard power
> cycle capability.) If a node dies and cannot be rebooted, turn it off and
> fix it later. Of course monitoring fans and temperatures is a good thing
> (tm), but if a node will not boot, and you have to play with the BIOS, then
> I would consider it broken.
>
> Because you have "over capacity" in your cluster (you bought extra nodes),
> this does not impact the amount of work that needs to get done. Indeed, prior
> to the failure you can have the extra nodes working for you. You fully
> understand that at various times one or two nodes will be off line. They
> are taken out of the scheduler and there is no need to fix them right
> away.
>
> This approach also depends on what you are doing with your
> cluster and the cost of nodes etc. In some cases out-of-band access
> is a good thing. In other cases, the "STONIH-AFIT" (shoot the other node
> in the head and fix it tomorrow) approach is also reasonable.
>
>
> -- 
> Doug
>
> check out http://www.clustermonkey.net
>