[Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?
Mark Hahn hahn at mcmaster.ca
Fri Oct 23 10:35:55 PDT 2009
> You could imagine jobs which checkpoint often, and automatically restart
> themselves from a checkpoint if a machine fails like this.

I find that apps (custom or commercial) normally need some help to restart. (some need to be pointed at the checkpoint to start with, others need to be told it's a restart rather than from-scratch, etc.) and I expect that anything but a dedicated, single-person cluster will also be running a scheduler, which means that the app would, upon starting, need to queue a restart of itself as a dependency.

we actually have one group that does this, but their main script already contains multiple iterations of gromacs (apparently to force re-load-balance). their code contains some intelligence about catching crashed processes, finding where to pick up, etc. (this group also tends to be one that asks interesting questions about, for instance, when attributes propagate across NFS vs Lustre filesystems. I had never looked very closely, but various NFS clients have quite different behavior, including some oldish versions that will cache stale attrs *indefinitely*.)

anyway, we strongly encourage checkpointing, and usually say that you should checkpoint as frequently as you can without inducing a significant IO overhead. our main clusters have Lustre filesystems that can sustain several GB/s, so I usually rule-of-thumb it as "a couple times a day, and more often with higher node-count". fortunately our node failure rate is fairly low, so we don't push very hard. it's easy to imagine a large-scale job needing to checkpoint ~hourly, though: if your spontaneous node failure rate is 1/node-year, then a 365-node job sees about 1 failure/day, and that's not a very big job...

> My philosophy though would be to leave a machine down till the cause of
> the crash is established.

absolutely. this is not an obvious principle to some people, though: it depends on whether your model of failures involves luck or causation ;) and having decent tools (IPMI SEL for finding UC ECCs/overheating/etc, console logging for panics) is what lets you rule out bad juju...
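
the failure-rate arithmetic above (1 failure per node-year, 365 nodes, so about one failure per day) can be sketched in a few lines of Python. this is only an illustration, assuming node failures are independent and occur at a constant rate; the function names are made up for the example. the last function is Young's first-order approximation for a checkpoint interval, sqrt(2 * checkpoint cost * MTBF), which is not from the post itself and is included only to show how a "~hourly" figure can fall out of such numbers. the 1-minute checkpoint cost in the usage lines is a hypothetical value.

```python
import math

HOURS_PER_YEAR = 365 * 24


def job_failures_per_day(nodes, failures_per_node_year=1.0):
    """Expected failures per day hitting a job spread over `nodes` nodes.

    Assumes independent node failures at a constant rate.
    """
    return nodes * failures_per_node_year / 365.0


def job_mtbf_hours(nodes, failures_per_node_year=1.0):
    """Mean time between failures (hours) as seen by the whole job."""
    return HOURS_PER_YEAR / (nodes * failures_per_node_year)


def young_interval_hours(checkpoint_cost_hours, mtbf_hours):
    """Young's first-order approximation of a checkpoint interval.

    Not part of the original post; shown only as one common rule of thumb.
    """
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


if __name__ == "__main__":
    nodes = 365
    print("failures/day:", job_failures_per_day(nodes))
    mtbf = job_mtbf_hours(nodes)
    print("job MTBF (hours):", mtbf)
    # hypothetical: a checkpoint takes 1 minute to write
    interval = young_interval_hours(1.0 / 60.0, mtbf)
    print("suggested interval (minutes):", interval * 60.0)
```

with these inputs the job-level MTBF is 24 hours and the suggested interval comes out a little under an hour, which is in line with the ~hourly estimate; a slower filesystem (larger checkpoint cost) or a lower failure rate pushes the interval out.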
