[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?
Rahul Nabar rpnabar at gmail.com
Fri Oct 23 11:01:05 PDT 2009
On Fri, Oct 23, 2009 at 12:35 PM, Mark Hahn <hahn at mcmaster.ca> wrote:
>
>> My philosophy though would be to leave a machine down till the cause of
>> the crash is established.
>
> absolutely. this is not an obvious principle to some people, though:
> it depends on whether your model of failures involves luck or causation ;)
> and having decent tools (IPMI SEL for finding UC ECCs/overheating/etc,
> console logging for panics) is what lets you rule out bad juju...

Other factors that sometimes make me violate this principle of "always
establish a crash cause":

1. Manpower to debug. Let's say the error has a cause but is relatively
infrequent. I might achieve a higher uptime with a simple reboot until I get
the time to fight this particular fire. People would rather have a crashed
node humming away again as soon as possible than wait for me to find the
time to look at it and reach a definite diagnosis. Forensics takes time.

2. Some errors are hardware-precipitated. Aging, out-of-warranty hardware
can sometimes need such a reboot compromise for one-off random errors.

Maybe all the "nice" clusters out there never have this issue, but for me it
is fairly common. Just confessing.

--
Rahul
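
The uptime argument in point 1 can be made concrete with a back-of-the-envelope availability estimate. The sketch below is illustrative only: the crash interval, reboot time, and diagnosis delay are made-up assumptions, not measurements from any cluster, and the function and variable names are hypothetical.

```python
def availability(mtbf_hours, downtime_hours):
    """Steady-state fraction of time a node is up.

    mtbf_hours: mean time between failures (uptime before each crash).
    downtime_hours: mean time the node stays down after each crash.
    """
    return mtbf_hours / (mtbf_hours + downtime_hours)


# Assumed numbers, chosen only for illustration.
MTBF_HOURS = 30 * 24             # one crash per month
REBOOT_DOWNTIME_HOURS = 0.25     # watchdog fires and the node reboots
DIAGNOSIS_DOWNTIME_HOURS = 48.0  # node waits for an admin to investigate

auto_reboot = availability(MTBF_HOURS, REBOOT_DOWNTIME_HOURS)
wait_for_diagnosis = availability(MTBF_HOURS, DIAGNOSIS_DOWNTIME_HOURS)

print(f"auto reboot:        {auto_reboot:.4%}")
print(f"wait for diagnosis: {wait_for_diagnosis:.4%}")
```

This simple model assumes the crash rate stays the same whether or not the cause is ever found. If an undiagnosed fault gets worse over time, the comparison shifts back toward leaving the node down.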
