[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?
Gerry Creager gerry.creager at tamu.edu
Fri Oct 23 13:42:38 PDT 2009
Greg Lindahl wrote:
> On Fri, Oct 23, 2009 at 01:01:05PM -0500, Rahul Nabar wrote:
>
>> 2. Some errors are hardware precipitated. Aging, out-of-warranty
>> aging, hardware can sometimes need such a reboot compromise for
>> one-off random errors.
>>
>> Maybe all the "nice" clusters out there never have this issue but for
>> me it is fairly common. Just confessing.
>
> Why, exactly, are you assuming that your freezes are one-off random
> errors due to aging hardware? Sounds like you're either guessing, or
> you _are_ doing forensics, but aren't calling it forensics.
*MY* aging hardware usually just falls over dead when it's done with its
useful life. Too many intermittent errors/failures cause me to do
sufficient diagnostics to repair the node (if it's cheap and easy
enough) or drop it in the latest surplus run.
--
Gerry Creager
AATLT, Texas A&M University Tel: 979.862.3982
1700 Research Pkwy, Ste 160 Fax: 979.862.3983
College Station, TX Cell 979.229.5301
77843-3139 http://mesonet.tamu.edu
