[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?


ed in 92626 ed92626 at gmail.com
Fri Oct 23 09:01:28 PDT 2009


On Thu, Oct 22, 2009 at 5:56 PM, Rahul Nabar <rpnabar at gmail.com> wrote:

> I wanted to get some opinions about whether watchdog timers are a good idea
> or not. I came across watchdogs again when reading through my IPMI
> manual. In principle it sounds neat: If the system hangs then get it
> to reboot after, say, 5 minutes automatically. But, in practice, maybe
> it is a terrible idea.
>


> Of course, one might say, a well-configured HPC compute node
> shouldn't be getting to a hung point anyway; but in practice I see a
> few nodes every month that can be resurrected by a simple reboot.
> Admittedly these nodes are quite senile.
>
Some BIOSes have a setting for this: the number of times to reboot before giving up.


> The danger, it seems to me: What if a node kept crashing (due to, say, a
> bad HDD or something)? Then a watchdog would merely keep rebooting
> this node a hundred times. Not such a good thing.
>
> Have you guys used watchdog timers? Maybe there is a way to build a
> circuit-breaker around the principle so that if a node reboots
> automatically more than 3 times then watchdog gives up?
>

You could also do something at the system level to prevent it. If the system
boots and the previous_uptime is less than one hour, shut down the system.
The watchdog timer will not wake it up.
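
The uptime check described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not something from a particular watchdog package: the state file path, the function names, and the idea of recording uptime periodically (for example from cron) are all hypothetical. The one-hour threshold is the figure suggested above.

```python
import os
import subprocess
from pathlib import Path

# Hypothetical location for the recorded uptime; any persistent path works.
STATE_FILE = Path("/var/lib/wd-breaker/last_uptime")
MIN_UPTIME_S = 3600  # one hour


def read_uptime(proc_uptime="/proc/uptime"):
    """Return seconds since boot (first field of /proc/uptime on Linux)."""
    return float(Path(proc_uptime).read_text().split()[0])


def record_uptime(state_file=STATE_FILE, uptime_s=None):
    """Save the current uptime; run periodically so the value survives a hang."""
    if uptime_s is None:
        uptime_s = read_uptime()
    state_file.parent.mkdir(parents=True, exist_ok=True)
    tmp = state_file.with_suffix(".tmp")
    tmp.write_text(f"{uptime_s:.0f}\n")
    os.replace(tmp, state_file)  # atomic rename, avoids a half-written file


def should_halt(state_file=STATE_FILE, min_uptime_s=MIN_UPTIME_S):
    """True if the previous boot recorded an uptime below the threshold."""
    try:
        previous = float(state_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # no usable history: let the node run
    return previous < min_uptime_s


def boot_check(state_file=STATE_FILE):
    """Call once early in boot, before the periodic recorder starts."""
    if should_halt(state_file):
        # Clear the record so a deliberate manual power-on is not halted again.
        state_file.unlink()
        subprocess.run(["shutdown", "-h", "now"], check=False)
    else:
        # Start this boot at zero so a crash before the first record counts.
        record_uptime(state_file, 0)
```

Here boot_check() would run from an init script and record_uptime() every few minutes. One caveat of this sketch: a deliberate reboot within the first hour also trips the breaker, so an admin-initiated reboot would need to remove the state file first.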

>
> If one had to do the watchdogging, should one do the resets locally
> using the IPMI local interface (hogs CPU cycles) or via a central
> Nagios-like system that could issue such a command? Many scenarios
> seem possible. The prospect of an automated system doing a reboot at
> 3am seems more tempting than me having to do this manually.
>

Also, almost all systems that can do this will send out a page and an email
on the event, so someone will know about it.

Ed



>  --
> Rahul