[Beowulf] Compute Node OS on Local Disk vs. Ram Disk
Bogdan.Costescu at iwr.uni-heidelberg.de
Tue Sep 30 12:04:51 PDT 2008
On Tue, 30 Sep 2008, Jon Forrest wrote:
> The trouble with rebooting nodes is that this takes human energy.
When using a queueing system, rebooting nodes can be automated easily:
- the node to be rebooted is switched to "offline" state so that the
scheduler doesn't attempt to start new jobs on it
- wait until the currently running job finishes
- reboot the node
- put the node back "online" so that the scheduler can again start
jobs on it
All the steps except the reboot itself are interactions with the
queueing system and can happen on the frontend/master node only. The
reboot step requires some interaction with the node, either remote
shell access to run /sbin/reboot or some other way to restart it
(IPMI, remote power management, etc.)
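The steps above can be sketched roughly as follows. This is only an
illustration under assumptions that are not in the original message: a
Torque/PBS-style queueing system where "pbsnodes -o" marks a node
offline and "pbsnodes -c" clears that state, passwordless ssh to the
node, and two hypothetical caller-supplied checks (node_is_idle and
node_is_up). The extra wait for the node to come back before putting
it online is also a choice made for this sketch. Adapt the commands to
your own scheduler.

```python
import subprocess
import time


def reboot_node(node, node_is_idle, node_is_up,
                run=subprocess.run, sleep=time.sleep, poll_seconds=30):
    """Drain, reboot, and re-enable one compute node.

    node_is_idle and node_is_up are hypothetical hooks: callables that
    take the node name and return True or False. Return codes of the
    external commands are not checked in this sketch.
    """
    # Mark the node offline so the scheduler starts no new jobs on it
    # (assumed Torque/PBS-style command).
    run(["pbsnodes", "-o", node])

    # Wait until the currently running job has finished.
    while not node_is_idle(node):
        sleep(poll_seconds)

    # The only step that needs to touch the node itself; IPMI or remote
    # power management could be used here instead of a remote shell.
    run(["ssh", node, "/sbin/reboot"])

    # Give the node time to go down, then wait until it is back up.
    sleep(poll_seconds)
    while not node_is_up(node):
        sleep(poll_seconds)

    # Put the node back online so the scheduler can use it again.
    run(["pbsnodes", "-c", node])
```

Everything except the ssh call runs on the frontend/master node. The
run and sleep parameters exist only so that the function can be tried
out without a real cluster.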
> It's easier to keep nodes up as long as possible
As the number of nodes in clusters grows, so does the overall failure
rate. It's much easier to deal with failures when they are seen not
as a catastrophe of the "cross my fingers and hope that the node comes
up properly and everything still works" kind, but simply as nodes
going up and down.
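To put a rough number on the first point, here is a small sketch. It
assumes that nodes fail independently and that each has the same
probability of failing in a given period; both the independence and
the 0.001 figure are assumptions chosen for illustration, not
measurements.

```python
def p_any_failure(nodes, p_node):
    """Probability that at least one of `nodes` independent nodes fails,
    given a per-node failure probability p_node for the same period."""
    return 1.0 - (1.0 - p_node) ** nodes


# Hypothetical per-node failure probability, for illustration only.
for n in (16, 128, 1024):
    print(n, round(p_any_failure(n, 0.001), 3))
```

The chance of seeing at least one failure grows quickly with the node
count even when each individual node is very reliable.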
> This is a good idea. Can you write more about this?
Brian Oborn's e-mail described the principle in a few words, probably
better than I could have done myself. If you want more details, ask
more precise questions and I expect that any of us could answer.
IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8869/8240, Fax: +49 6221 54 8868/8850
E-mail: bogdan.costescu at iwr.uni-heidelberg.de