[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?
Kevin Abbey kabbey at biomaps.rutgers.edu
Thu Oct 22 19:03:58 PDT 2009
I tried this on a Supermicro board and a Sun box. Both systems
rebooted at random with the watchdog enabled, so I turned it off.
False positives are a serious problem.

In a cluster, you may also need to notify the scheduler in some way
when a node reboots. Can someone elaborate on this, specifically for
Torque, PBS and Sun GE?

Regarding this:

> Have you guys used watchdog timers? Maybe there is a way to build a
> circuit-breaker around the principle so that if a node reboots
> automatically more than 3 times then watchdog gives up?

It would be far simpler to ask the vendor to program their firmware to
log each reboot and enforce a limit there, along with event
notifications via email, SNMP or other means. The more configurable
the better.

Kevin

Rahul Nabar wrote:
> I wanted to get some opinions about whether watchdog timers are a
> good idea or not. I came across watchdogs again when reading through
> my IPMI manual. In principle it sounds neat: if the system hangs,
> then get it to reboot after, say, 5 minutes automatically. But, in
> practice, maybe it is a terrible idea.
>
> Of course, one might say, a well configured HPC compute-node
> shouldn't be getting to a hung point anyways; but in practice I see a
> few nodes every month that can be resurrected by a simple reboot.
> Admittedly these nodes are quite senile.
>
> The danger, it seems to me: what if a node kept crashing (due to,
> say, a bad HDD or something)? Then a watchdog would merely keep
> rebooting this node a hundred times. Not such a good thing.
>
> Have you guys used watchdog timers? Maybe there is a way to build a
> circuit-breaker around the principle so that if a node reboots
> automatically more than 3 times then watchdog gives up?
>
> If one had to do the watchdogging, should one do the resets locally
> using the IPMI local interface (hogs CPU cycles) or a central
> Nagios-like system that could issue such a command? Many scenarios
> seem possible. The prospect of an automated system doing a reboot at
> 3am seems more tempting than me having to do this manually.
>
> --

Kevin C. Abbey
System Administrator
Rutgers University - BioMaPS Institute
Email: kabbey at biomaps.rutgers.edu

Hill Center - Room 259
110 Frelinghuysen Road
Piscataway, NJ 08854
Phone and Voice mail: 732-445-3288

Wright-Rieman Laboratories
Room 201
610 Taylor Rd.
Piscataway, NJ 08854-8087
Phone: 732-445-2069
Fax: 732-445-5958
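
For anyone who wants to experiment with the circuit-breaker idea from this thread, here is a minimal sketch in Python. Nothing here comes from a vendor tool or from the posters: the threshold of 3, the 24-hour window, the state-file approach, and all function names are assumptions made for illustration. It counts every boot as a reboot (it cannot tell a watchdog reset from a manual one), and it only answers "should the watchdog still be armed?". Actually arming or disarming the timer is left to whatever mechanism your BMC or OS provides.

```python
import json
import time
from pathlib import Path

# Hypothetical values: "more than 3 reboots" comes from the thread,
# the 24-hour window is an assumption.
MAX_REBOOTS = 3
WINDOW_SECONDS = 24 * 3600


def recent_reboots(timestamps, now, window=WINDOW_SECONDS):
    """Keep only boot timestamps that fall inside the sliding window."""
    return [t for t in timestamps if now - t <= window]


def watchdog_allowed(timestamps, now, max_reboots=MAX_REBOOTS,
                     window=WINDOW_SECONDS):
    """Return False once the node has booted more than max_reboots
    times within the window, i.e. the breaker has tripped."""
    return len(recent_reboots(timestamps, now, window)) <= max_reboots


def record_boot_and_check(state_file, now=None):
    """Call once at boot: log this boot in a JSON state file and report
    whether the watchdog should still be armed."""
    now = time.time() if now is None else now
    path = Path(state_file)
    timestamps = json.loads(path.read_text()) if path.exists() else []
    timestamps = recent_reboots(timestamps + [now], now)
    path.write_text(json.dumps(timestamps))
    return watchdog_allowed(timestamps, now)
```

A boot-time script could call record_boot_and_check() with a path on persistent storage and skip arming the watchdog (and perhaps notify an admin or the scheduler) when it returns False. Note that a node with a failing disk may not be able to write the state file at all, which is one argument for keeping the counter in firmware or on a central monitoring host instead.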
