[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

Rahul Nabar rpnabar at gmail.com
Fri Oct 23 12:44:27 PDT 2009


On Fri, Oct 23, 2009 at 1:23 PM, Greg Lindahl <lindahl at pbm.com> wrote:
> On Fri, Oct 23, 2009 at 01:01:05PM -0500, Rahul Nabar wrote:
>
>> 2. Some errors are hardware-precipitated. Aging, out-of-warranty
>> hardware can sometimes need such a reboot compromise for
>> one-off random errors.
>>
>> Maybe all the "nice" clusters out there never have this issue but for
>> me it is fairly common. Just confessing.
>
> Why, exactly, are you assuming that your freezes are one-off random
> errors due to aging hardware? Sounds like you're either guessing, or
> you _are_ doing forensics, but aren't calling it forensics.

Greg. You are right. My bad. In hindsight, that doesn't make much sense. Sorry.

-- 
Rahul


