[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?
Rahul Nabar rpnabar at gmail.com
Thu Oct 22 17:56:16 PDT 2009
I wanted to get some opinions on whether watchdog timers are a good idea or not. I came across watchdogs again when reading through my IPMI manual. In principle it sounds neat: if the system hangs, get it to reboot automatically after, say, 5 minutes. But in practice, maybe it is a terrible idea.

Of course, one might say that a well-configured HPC compute node shouldn't be getting to a hung point anyway; but in practice I see a few nodes every month that can be resurrected by a simple reboot. Admittedly these nodes are quite senile.

The danger, it seems to me, is this: what if a node kept crashing (due to, say, a bad HDD or something)? Then a watchdog would merely keep rebooting this node a hundred times. Not such a good thing.

Have you guys used watchdog timers? Maybe there is a way to build a circuit-breaker around the principle, so that if a node reboots automatically more than 3 times the watchdog gives up?

If one had to do the watchdogging, should one do the resets locally using the IPMI local interface (which hogs CPU cycles), or from a central Nagios-like system that could issue such a command? Many scenarios seem possible. The prospect of an automated system doing a reboot at 3am seems more tempting than me having to do this manually.

--
Rahul
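To make the circuit-breaker idea concrete, here is a minimal sketch in Python of the "give up after more than 3 automatic reboots" rule. It is only an illustration under stated assumptions: the class name `RebootCircuitBreaker`, the `allow_reboot` method, and the 24-hour window are all hypothetical and not part of IPMI or any existing tool. The sketch only decides whether another automatic reset should be permitted; actually issuing the reset is left to whatever mechanism is in use.

```python
import time
from collections import deque


class RebootCircuitBreaker:
    """Hypothetical helper: permit automatic reboots of a node until
    too many have happened within a sliding time window."""

    def __init__(self, max_reboots=3, window_seconds=24 * 3600):
        # Assumed defaults: at most 3 automatic reboots per 24 hours.
        self.max_reboots = max_reboots
        self.window_seconds = window_seconds
        self.history = {}  # node name -> deque of reboot timestamps

    def allow_reboot(self, node, now=None):
        """Return True and record the reboot if it is still permitted,
        or False if the breaker has tripped for this node."""
        now = time.time() if now is None else now
        stamps = self.history.setdefault(node, deque())
        # Forget reboots that have aged out of the sliding window.
        while stamps and now - stamps[0] > self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.max_reboots:
            return False  # breaker open: leave the node for a human
        stamps.append(now)
        return True


if __name__ == "__main__":
    breaker = RebootCircuitBreaker()
    for attempt in range(1, 5):
        print(attempt, breaker.allow_reboot("node17"))
```

With the defaults, the first three requests for a node are allowed and the fourth is refused until older reboots age out of the window. Note that this keeps its history in memory, so it would have to live somewhere that survives the node's reboot, for example on the central monitoring host, or be persisted to disk if run locally.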
