[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?


Rahul Nabar rpnabar at gmail.com
Thu Oct 22 17:56:16 PDT 2009


I wanted to get some opinions on whether watchdog timers are a good idea
or not. I came across watchdogs again while reading through my IPMI
manual. In principle it sounds neat: if the system hangs, have it
reboot automatically after, say, 5 minutes. But in practice, maybe
it is a terrible idea.

Of course, one might say a well-configured HPC compute node
shouldn't be reaching a hung state anyway, but in practice I see a
few nodes every month that can be resurrected by a simple reboot.
Admittedly these nodes are quite senile.

The danger, it seems to me: what if a node kept crashing (due to, say, a
bad HDD or something)? Then a watchdog would merely keep rebooting
this node a hundred times. Not such a good thing.

Have you guys used watchdog timers? Maybe there is a way to build a
circuit-breaker around the principle so that if a node reboots
automatically more than 3 times then the watchdog gives up?
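
To make the circuit-breaker idea concrete, here is a minimal sketch in
Python. It is not taken from any IPMI tool: the class name RebootBreaker,
the 24-hour window, and the allow() method are all made up for
illustration. It only decides whether another automatic reboot should be
permitted; actually issuing the reset is left out.

```python
from collections import deque


class RebootBreaker:
    """Hypothetical circuit breaker: permit at most `max_reboots`
    automatic reboots within any `window_s`-second window."""

    def __init__(self, max_reboots=3, window_s=24 * 3600):
        self.max_reboots = max_reboots
        self.window_s = window_s
        self._stamps = deque()  # times of reboots that were permitted

    def allow(self, now):
        """Return True and record the reboot if still under the limit;
        return False (breaker open, leave the node down) otherwise."""
        # Forget reboots that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_reboots:
            return False
        self._stamps.append(now)
        return True


breaker = RebootBreaker(max_reboots=3, window_s=100)
decisions = [breaker.allow(t) for t in (0, 10, 20, 30)]
print(decisions)  # the fourth attempt inside the window is refused
```

One caveat with this sketch: the reboot history lives in memory, so if it
ran on the node itself it would be lost at every reboot. The history would
have to be kept on disk or on another machine for the breaker to mean
anything.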

If one had to do the watchdogging, should one do the resets locally
using the IPMI local interface (hogs CPU cycles) or from a central
Nagios-like system that could issue such a command? Many scenarios
seem possible. The prospect of an automated system doing a reboot at
3am seems more tempting than me having to do this manually.
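
For the central variant, the decision logic might look roughly like the
sketch below. It assumes each node reports a heartbeat timestamp to the
monitoring host somehow. The function name hung_nodes, the node names, and
the heartbeat mechanism are hypothetical; the 300-second timeout just
mirrors the 5 minutes mentioned above. The sketch only picks reset
candidates and does not send any IPMI command.

```python
def hung_nodes(last_seen, now, timeout_s=300):
    """Return the sorted names of nodes whose most recent heartbeat
    is more than `timeout_s` seconds older than `now`.

    `last_seen` maps node name -> time of last heartbeat, in seconds.
    """
    return sorted(
        name for name, stamp in last_seen.items() if now - stamp > timeout_s
    )


# Hypothetical heartbeat table as seen by the monitoring host at t=1000.
heartbeats = {"node01": 1000, "node02": 700, "node03": 650}
candidates = hung_nodes(heartbeats, now=1000)
print(candidates)
```

A central loop like this would be a natural place to apply a reboot limit
per node before anything is actually reset.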

-- 
Rahul
