[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

Kevin Abbey kabbey at biomaps.rutgers.edu
Thu Oct 22 19:03:58 PDT 2009


I tried this on a Supermicro board and a Sun box.  Both systems
rebooted at random, so I turned it off.  False positives are a serious
problem.  In a cluster, you may also need to notify the scheduler in
some way when a node reboots.  Can someone elaborate on this,
specifically for Torque, PBS and Sun GE?


Regarding this:

Have you guys used watchdog timers? Maybe there is a way to build a
circuit-breaker around the principle so that if a node reboots
automatically more than 3 times then watchdog gives up?


It would be far simpler to ask the vendor to program their firmware to
log each reboot and enforce a limit there, along with event
notifications via email, SNMP or other means.  The more configurable the
better.
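
Until firmware offers something like that, the circuit-breaker idea
quoted above could be approximated in software.  What follows is only
a rough sketch under stated assumptions, not something I have run on a
cluster: it assumes each automatic reboot is recorded somewhere as a
timestamp (how that log gets written is left open), and the 24-hour
window and all function names are hypothetical.  Only the "more than 3
times" limit comes from the question.

```python
from datetime import datetime, timedelta

MAX_REBOOTS = 3               # limit taken from the question: "more than 3 times"
WINDOW = timedelta(hours=24)  # assumed look-back window, not from the thread


def recent_reboots(reboot_times, now, window=WINDOW):
    """Return the reboot timestamps that fall inside the look-back window."""
    return [t for t in reboot_times if timedelta(0) <= now - t <= window]


def watchdog_should_stay_armed(reboot_times, now,
                               max_reboots=MAX_REBOOTS, window=WINDOW):
    """Return False once more than max_reboots reboots are in the window."""
    return len(recent_reboots(reboot_times, now, window)) <= max_reboots


if __name__ == "__main__":
    now = datetime(2009, 10, 22, 19, 0)
    # Hypothetical history: four automatic reboots in the last four hours.
    history = [now - timedelta(hours=h) for h in (1, 2, 3, 4)]
    if watchdog_should_stay_armed(history, now):
        print("leave the watchdog armed")
    else:
        print("breaker tripped: disarm the watchdog and alert the admin")
```

The decision is kept separate from the action on purpose: the same
check could run locally at boot or on a central monitoring host, and
what "giving up" means (disarming the timer, marking the node offline
in the scheduler, sending mail) stays a site decision.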

Kevin


Rahul Nabar wrote:
> I wanted to get some opinions about if watchdog timers are a good idea
> or not. I came across watchdogs again when reading through my IPMI
> manual. In principle it sounds neat: If the system hangs then get it
> to reboot after, say, 5 minutes automatically. But, in practice, maybe
> it is a terrible idea.
>
> Of course, one might say, a well configured HPC compute-node
> shouldn't be getting to a hung point anyway; but in practice I see a
> few nodes every month that can be resurrected by a simple reboot.
> Admittedly these nodes are quite senile.
>
> The danger, it seems to me: What if a node kept crashing (due to,
> say, a bad HDD or something)? Then a watchdog would merely keep
> rebooting this node a hundred times. Not such a good thing.
>
> Have you guys used watchdog timers? Maybe there is a way to build a
> circuit-breaker around the principle so that if a node reboots
> automatically more than 3 times then watchdog gives up?
>
> If one had to do the watchdogging, should one do the resets locally
> using the IPMI local interface (hogs cpu cycles) or from a central
> Nagios-like system that could issue such a command? Many scenarios
> seem possible. The prospect of an automated system doing a reboot at
> 3am seems more tempting than me having to do this manually.
>
>   

-- 
Kevin C. Abbey
System Administrator
Rutgers University - BioMaPS Institute

Email: kabbey at biomaps.rutgers.edu


Hill Center - Room 259
110 Frelinghuysen Road
Piscataway, NJ  08854
Phone and Voice mail: 732-445-3288  

Wright-Rieman Laboratories Room 201
610 Taylor Rd.
Piscataway, NJ  08854-8087
Phone: 732-445-2069
Fax: 732-445-5958



