[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?
rpnabar at gmail.com
Fri Oct 23 11:01:05 PDT 2009
On Fri, Oct 23, 2009 at 12:35 PM, Mark Hahn <hahn at mcmaster.ca> wrote:
>> My philosophy though would be to leave a machine down till the cause of
>> the crash is established.
> absolutely. this is not an obvious principle to some people, though:
> it depends on whether your model of failures involves luck or causation ;)
> and having decent tools (IPMI SEL for finding UC ECCs/overheating/etc,
> console logging for panics) is what lets you rule out bad juju...
Other factors that sometimes make me violate this principle of "always
establish a crash cause":
1. Manpower to debug. Let's say the error has a cause but is
relatively infrequent. I might achieve higher uptime with a simple
reboot until I get the time to fight this particular fire. People are
happier to have a crashed node humming away again as soon as possible
than to wait until I find the time to look at it and come to a
definite diagnosis. Forensics takes time.
2. Some errors are hardware precipitated. Aging, out-of-warranty
hardware can sometimes need such a reboot compromise for
one-off random errors.
Maybe all the "nice" clusters out there never have this issue but for
me it is fairly common. Just confessing.
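
To put rough numbers on the uptime argument in point 1, here is a
back-of-the-envelope sketch. Every figure in it (crash rate, reboot time,
time until I can do forensics) is a made-up assumption for illustration,
not a measurement from my cluster, and it assumes a reboot neither fixes
nor worsens the underlying fault.

```python
def availability(mtbf_hours, downtime_hours):
    """Fraction of time a node is up, given the mean time between
    failures and the mean downtime incurred per failure."""
    return mtbf_hours / (mtbf_hours + downtime_hours)


# Hypothetical numbers, for illustration only.
MTBF_HOURS = 30 * 24            # assume roughly one crash a month per node
REBOOT_DOWNTIME_HOURS = 0.25    # assume a watchdog reboot takes ~15 minutes
DIAGNOSIS_DOWNTIME_HOURS = 48   # assume two days before I can investigate

auto_reboot = availability(MTBF_HOURS, REBOOT_DOWNTIME_HOURS)
wait_for_diagnosis = availability(MTBF_HOURS, DIAGNOSIS_DOWNTIME_HOURS)

print(f"auto-reboot:        {auto_reboot:.4f}")
print(f"wait for diagnosis: {wait_for_diagnosis:.4f}")
```

Of course this only counts node-hours. It says nothing about jobs that die
repeatedly on a flaky node, which is the cost of not finding the cause.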