[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

Thu Oct 22 19:03:58 PDT 2009

I tried this on a Supermicro board and a Sun box.  On both, the system 
rebooted randomly, so I turned it off.  False positives like this are a 
serious problem.  In a cluster, you may also need to notify the 
scheduler in some way when a node reboots.  Can someone elaborate on 
this?  Specifically for Torque, PBS and Sun GE.

Regarding this:

Have you guys used watchdog timers? Maybe there is a way to build a
circuit-breaker around the principle so that if a node reboots
automatically more than 3 times then watchdog gives up?

It would be far simpler to ask the vendor to program their firmware to 
log each reboot and enforce a limit there, as well as to send event 
notifications via email, SNMP or other means.  The more configurable the 
better.
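
Failing that, the circuit-breaker idea quoted above could be roughed out
in software on the node itself.  What follows is only a minimal sketch
under stated assumptions, not a tested tool: it counts every boot (it
cannot tell a watchdog reset from a manual one), the 24-hour window and
the names record_boot_and_decide and watchdog_should_stay_armed are made
up for illustration, and how the watchdog is actually disarmed (through
IPMI or otherwise) is deliberately left out.

```python
import time

MAX_REBOOTS = 3              # from the "more than 3 times" suggestion
WINDOW_SECONDS = 24 * 3600   # assumption: count boots over the last 24 hours


def recent_reboots(timestamps, now, window=WINDOW_SECONDS):
    """Return the boot timestamps that fall within the last `window` seconds."""
    return [t for t in timestamps if 0 <= now - t <= window]


def watchdog_should_stay_armed(timestamps, now,
                               max_reboots=MAX_REBOOTS,
                               window=WINDOW_SECONDS):
    """True while the node has booted at most `max_reboots` times in the window."""
    return len(recent_reboots(timestamps, now, window)) <= max_reboots


def record_boot_and_decide(path, now=None):
    """Append this boot to a small log file and decide whether to keep the
    watchdog armed.  Meant to be called once per boot, e.g. from an init
    script; `path` is a hypothetical state file on local disk."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            stamps = [float(line) for line in f if line.strip()]
    except FileNotFoundError:
        stamps = []          # first boot since the log was created
    # Drop entries older than the window, then add the current boot.
    stamps = recent_reboots(stamps, now) + [now]
    with open(path, "w") as f:
        f.writelines(f"{t}\n" for t in stamps)
    return watchdog_should_stay_armed(stamps, now)
```

If record_boot_and_decide returns False, the boot script would leave the
watchdog unarmed and raise an alert instead, so a node with a bad disk
stops cycling after the fourth boot in the window.  Note that the state
file lives on the node, so this does not help if the disk holding it is
the part that failed; a central monitor keeping the same count would
avoid that.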

Kevin

Rahul Nabar wrote:
> I wanted to get some opinions about if watchdog timers are a good idea
> or not. I came across watchdogs again when reading through my IPMI
> manual. In principle it sounds neat: If the system hangs then get it
> to reboot after, say, 5 minutes automatically. But, in practice, maybe
> it is a terrible idea.
>
> Of course, one might say, a well-configured HPC compute node
> shouldn't be getting to a hung point anyway; but in practice I see a
> few nodes every month that can be resurrected by a simple reboot.
> Admittedly these nodes are quite senile.
>
> The danger, seems to me: What if a node kept crashing (due to say,  a
> bad HDD or something). Then a watchdog would merely keep rebooting
> this node a hundred times. Not such a good thing.
>
> Have you guys used watchdog timers? Maybe there is a way to build a
> circuit-breaker around the principle so that if a node reboots
> automatically more than 3 times then watchdog gives up?
>
> If one had to do the watchdogging should one do the resets locally
> using the IPMI local interface (hogs cpu cycles) or a central
> Nagios-like system that could issue such a command. Many scenarios
> seem possible. The prospect of an automated system doing a reboot at
> 3am seems more tempting than me having to do this manually.
>
>   

-- 
Kevin C. Abbey
System Administrator
Rutgers University - BioMaPS Institute

Email: kabbey at biomaps.rutgers.edu

Hill Center - Room 259
110 Frelinghuysen Road
Piscataway, NJ  08854
Phone and Voice mail: 732-445-3288  

Wright-Rieman Laboratories Room 201
610 Taylor Rd.
Piscataway, NJ  08854-8087
Phone: 732-445-2069
Fax: 732-445-5958