[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

Rahul Nabar rpnabar at gmail.com
Fri Oct 23 11:01:05 PDT 2009


On Fri, Oct 23, 2009 at 12:35 PM, Mark Hahn <hahn at mcmaster.ca> wrote:
>
>> My philosophy though would be to leave a machine down till the cause of
>> the crash is established.
>
> absolutely.  this is not an obvious principle to some people, though:
> it depends on whether your model of failures involves luck or causation ;)
> and having decent tools (IPMI SEL for finding UC ECCs/overheating/etc,
> console logging for panics) is what lets you rule out bad juju...

Other factors that sometimes make me violate this principle of "always
establish a crash cause":

1. Manpower to debug. Let's say the error has a cause but is
relatively infrequent. I might achieve higher uptime by simply
rebooting until I get the time to fight this particular fire. People
would rather have a crashed node humming away again as soon as possible
than wait for me to find the time to look at it and reach a definite
diagnosis. Forensics takes time.

2. Some errors are hardware precipitated. Aging, out-of-warranty
hardware can sometimes need such a reboot compromise for one-off
random errors.
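
One way to picture the compromise in the two points above is a reboot
budget: reboot automatically while crashes stay rare, and leave the node
down for diagnosis once they repeat. This is only a rough sketch, not
something I run as-is. The class name RebootPolicy, the budget of 2 and
the one-week window are all made-up values for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RebootPolicy:
    """Hypothetical per-node policy: auto-reboot rare crashes, hold repeat offenders."""

    max_reboots: int = 2                     # assumed budget of automatic reboots
    window_seconds: float = 7 * 24 * 3600    # assumed look-back window (one week)
    crashes: List[float] = field(default_factory=list)

    def on_crash(self, now: float) -> str:
        """Record a crash at time `now` (seconds); return 'reboot' or 'hold'."""
        # Forget crashes that have aged out of the look-back window.
        self.crashes = [t for t in self.crashes if now - t < self.window_seconds]
        self.crashes.append(now)
        # Within budget: treat it as a one-off and bring the node back up.
        if len(self.crashes) <= self.max_reboots:
            return "reboot"
        # Over budget: leave the node down until somebody finds the cause.
        return "hold"


if __name__ == "__main__":
    policy = RebootPolicy(max_reboots=2, window_seconds=100)
    for t in (0, 10, 20):
        print(t, policy.on_crash(t))
```

A node that crashes once in a blue moon keeps getting rebooted, while one
that crashes three times inside the window stays down and waits for the
forensics.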

Maybe all the "nice" clusters out there never have this issue but for
me it is fairly common. Just confessing.

-- 
Rahul



