[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?
akshar bhosale akshar.bhosale at gmail.com
Thu Oct 22 18:50:25 PDT 2009
- Previous message: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?
- Next message: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi Rahul,

The same thing happens on our side: a node gets rebooted by ASR even though it has not crashed. Can you suggest any remedy?

On Fri, Oct 23, 2009 at 6:26 AM, Rahul Nabar <rpnabar at gmail.com> wrote:
> I wanted to get some opinions about if watchdog timers are a good idea
> or not. I came across watchdogs again when reading through my IPMI
> manual. In principle it sounds neat: If the system hangs then get it
> to reboot after, say, 5 minutes automatically. But, in practice, maybe
> it is a terrible idea.
>
> Of course, one might say, a well configured HPC compute-node
> shouldn't be getting to a hung point anyways; but in practice I see a
> few nodes every month that can be resurrected by a simple reboot.
> Admittedly these nodes are quite senile.
>
> The danger, seems to me: What if a node kept crashing (due to say, a
> bad HDD or something). Then a watchdog would merely keep rebooting
> this node a hundred times. Not such a good thing.
>
> Have you guys used watchdog timers? Maybe there is a way to build a
> circuit-breaker around the principle so that if a node reboots
> automatically more than 3 times then watchdog gives up?
>
> If one had to do the watchdogging should one do the resets locally
> using the IPMI local interface (hogs cpu cycles) or a central
> Nagios-like system that could issue such a command. Many scenarios
> seem possible. The prospect of an automated system doing a reboot at
> 3am seems more tempting than me having to do this manually.
>
> --
> Rahul
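
For what it's worth, the circuit-breaker idea in the quoted message could be sketched roughly as below. This is only an illustration in Python, not something taken from the thread or from any IPMI tool: the function name `record_boot_and_check`, the state-file approach, the one-hour window, and the choice to count every boot (not only watchdog-triggered ones) are all assumptions. The script would run once at boot and tell the caller whether the watchdog should still be armed.

```python
import json
import time
from pathlib import Path

MAX_BOOTS = 3          # loosely based on the "3 times" in the quoted post
WINDOW_SECONDS = 3600  # assumed window; the thread does not specify one


def record_boot_and_check(state_file, now=None,
                          max_boots=MAX_BOOTS, window=WINDOW_SECONDS):
    """Record this boot; return True if the watchdog should still be armed.

    Hypothetical helper: it counts all boots inside the window, since it
    cannot tell a watchdog reset apart from a manual reboot.
    """
    now = time.time() if now is None else now
    path = Path(state_file)
    try:
        boots = json.loads(path.read_text())
    except (FileNotFoundError, ValueError):
        # No history yet, or an unreadable file: start from scratch.
        boots = []
    # Drop boots that fell out of the window, then add the current one.
    boots = [t for t in boots if now - t < window]
    boots.append(now)
    path.write_text(json.dumps(boots))
    # Too many boots in the window means the breaker trips.
    return len(boots) <= max_boots
```

If the function returns False, the boot script would simply not start the watchdog and would flag the node for a human to look at, so a node with a bad disk stops cycling after a few attempts. The state file has to live on storage that survives a reboot, which is itself a weak point if the failing component is the local disk.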
