[Beowulf] Compute Node OS on Local Disk vs. Ram Disk

Tue Sep 30 12:04:51 PDT 2008

On Tue, 30 Sep 2008, Jon Forrest wrote:

> The trouble with rebooting nodes is that this takes human energy.

When using a queueing system, rebooting nodes can be automated easily: 
- the node to be rebooted is switched to "offline" state so that the 
scheduler doesn't attempt to start new jobs on it
- wait until the currently running job finishes
- reboot
- put the node back "online" so that the scheduler can again start 
jobs on it

All the steps except the reboot itself are interactions with the 
queueing system and can happen on the frontend/master node only. The 
reboot step requires some interaction with the node, either remote 
shell access to run /sbin/reboot or some other way to restart it 
(IPMI, remote power management, etc.).
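
As a rough illustration of the sequence above, here is a minimal Python 
sketch. It is not tied to any particular queueing system: the five 
hooks (set_offline, running_jobs, reboot, is_up, set_online) are 
hypothetical placeholders that a site would implement with its own 
scheduler commands and with remote shell or IPMI for the reboot itself.

```python
import time
from typing import Callable


def drain_and_reboot(
    node: str,
    set_offline: Callable[[str], None],   # hypothetical hook: mark node offline in the scheduler
    running_jobs: Callable[[str], int],   # hypothetical hook: number of jobs still on the node
    reboot: Callable[[str], None],        # hypothetical hook: remote shell, IPMI, power switch...
    is_up: Callable[[str], bool],         # hypothetical hook: node has come back after the reboot
    set_online: Callable[[str], None],    # hypothetical hook: mark node online again
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Drain one node, reboot it, and return it to service."""
    # 1. Stop the scheduler from starting new jobs on the node.
    set_offline(node)

    # 2. Wait until the jobs already running there have finished.
    while running_jobs(node) > 0:
        sleep(poll_seconds)

    # 3. Reboot. This is the only step that has to reach the node itself.
    reboot(node)

    # Wait for the node to answer again before handing it back.
    while not is_up(node):
        sleep(poll_seconds)

    # 4. Let the scheduler start jobs on the node again.
    set_online(node)
```

Only the reboot and is_up hooks need to talk to the node; the others 
can run entirely on the frontend. A real is_up check should also make 
sure the node actually went down first, for example by comparing boot 
times, so that a slow shutdown is not mistaken for a finished reboot.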

> It's easier to keep nodes up as long as possible

With the increasing number of nodes in clusters these days, the 
overall failure rate also increases. It's much easier to deal with 
failures when they are treated not as catastrophes of the "cross my 
fingers and hope that the node comes back up properly and everything 
still works" kind, but simply as nodes going up and down.
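
To put a rough number on how failures scale with cluster size, here is 
a small Python sketch. It assumes that nodes fail independently and 
with the same probability, which is a simplification, and the per-node 
figure of 0.1% per day is purely hypothetical.

```python
def p_any_failure(p_node: float, n_nodes: int) -> float:
    """Probability that at least one of n_nodes fails, assuming each
    fails independently with probability p_node (a simplification)."""
    return 1.0 - (1.0 - p_node) ** n_nodes


# Hypothetical per-node failure probability of 0.1% per day.
for n in (10, 100, 1000):
    print(n, round(p_any_failure(0.001, n), 3))
```

Under these assumptions, a daily failure somewhere in the cluster goes 
from about 1% with 10 nodes to about 63% with 1000 nodes, so a 
failing node becomes a routine event rather than an exception.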

> This is a good idea. Can you write more about this?

Brian Oborn's e-mail described the principle in a few words, 
probably better than I could have done myself. If you want more 
details, ask more precise questions and I expect that any of us could 
answer.

--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8869/8240, Fax: +49 6221 54 8868/8850
E-mail: bogdan.costescu at iwr.uni-heidelberg.de