[Beowulf] using watchdog timers to reboot a hung systemautomagically: Good idea or bad?

Fri Oct 23 01:57:16 PDT 2009

Of course, one might say, a well configured HPC compute-node
shouldn't be getting to a hung point anyways; but in-practice I see a
few nodes every month that can be resurrected by a simple reboot.
Admittedly these nodes are quite senile.

I think that this is an interesting concept - and don't want to dismiss
it.
You could imagine jobs which checkpoint often, and automatically restart
themselves from
a checkpoint if a machine fails like this.

My philosophy though would be to leave a machine down till the cause of
the crash is established.
Now that you have IPMI and serial consoles you should be looking at
the IPMI logs and your /var/log/mcelog  to see if there are
uncorrectable ECC errors,
and enabling crash dumps and the Magic Sysrq keys.

Any cluster should be designed with a few extra nodes, which will
normally be idle
but will be used when one or two nodes are off on the Pat and Mick.
However, this doesn't help
when a large parallel run is brought down when a single node fails -
advice here is checkpoint
the jobs often.

The contents of this email are confidential and for the exclusive use of the intended recipient.  If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy.