[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?
gerry.creager at tamu.edu
Fri Oct 23 13:42:38 PDT 2009
Greg Lindahl wrote:
> On Fri, Oct 23, 2009 at 01:01:05PM -0500, Rahul Nabar wrote:
>> 2. Some errors are hardware precipitated. Aging, out-of-warranty
>> aging, hardware can sometimes need such a reboot compromise for
>> one-off random errors.
>> Maybe all the "nice" clusters out there never have this issue but for
>> me it is fairly common. Just confessing.
> Why, exactly, are you assuming that your freezes are one-off random
> errors due to aging hardware? Sounds like you're either guessing, or
> you _are_ doing forensics, but aren't calling it forensics.
*MY* aging hardware usually just falls over dead when it's done with its
useful life. Too many intermittent errors/failures cause me to run
enough diagnostics to repair the node (if it's cheap and easy
enough) or drop it in the latest surplus run.
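
For what it's worth, the rule of thumb above could be sketched roughly
like this. It is only an illustration: the threshold, the function names
and the cost/budget inputs are all hypothetical, not something anyone in
this thread actually runs.

```python
from collections import Counter

# Hypothetical threshold: how many intermittent failures we tolerate
# before a node is pulled for diagnostics. Not a value from this thread.
FAILURE_THRESHOLD = 3


def nodes_needing_diagnostics(failure_log, threshold=FAILURE_THRESHOLD):
    """Return the nodes whose failure count has reached the threshold.

    failure_log is an iterable of node names, one entry per observed
    hang, crash, or watchdog-triggered reboot.
    """
    counts = Counter(failure_log)
    return sorted(node for node, n in counts.items() if n >= threshold)


def triage(node, repair_cost, repair_budget):
    """Decide what to do with a node after diagnostics: repair it if the
    fix is cheap enough, otherwise send it to surplus."""
    return "repair" if repair_cost <= repair_budget else "surplus"
```

The point is just that reboots get counted per node rather than
forgotten, so a node that keeps falling over ends up on the bench
instead of being rebooted indefinitely.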
AATLT, Texas A&M University Tel: 979.862.3982
1700 Research Pkwy, Ste 160 Fax: 979.862.3983
College Station, TX Cell 979.229.5301