[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?

Reuti reuti at staff.uni-marburg.de
Wed Nov 3 09:36:17 PST 2004


Jeff Moyer wrote:
> ==> Regarding [Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?; Brian Dobbins <brian.dobbins at yale.edu> adds:
> 
> [snip]
> 
> brian.dobbins> Background: The reason we're looking for a checkpoint/restart
> brian.dobbins> option has more to do with preempting a running job (of a lower
> brian.dobbins> priority) by checkpointing it than it does with saving the
> brian.dobbins> state in case of a crash.  While functionally these may be
> brian.dobbins> pretty close or the same, if that gives rise to another
> brian.dobbins> solution, I'd like to hear it.  In essence, we have some
> brian.dobbins> Monte Carlo sims which are highly parallel, and could run
> brian.dobbins> 24-7 for many months, but we want to be able to submit a
> brian.dobbins> high priority CFD code that will take over, run for a few
> brian.dobbins> days or so, and then have the system automagically restart
> brian.dobbins> the MC sim.
> 
> How about sending the process a SIGSTOP followed by a SIGCONT when you are
> ready to resume execution?  So long as your memory footprints of the two
> apps won't exhaust physical ram + swap, this should be okay.  This assumes
> a great deal about the robustness of your long running job, though.
> 

For parallel jobs this will lead to timing problems, depending on the
parallel libs used: at the very least you have to adjust any timeout for
missing communication that may arise in the libs. - Reuti
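
To make the SIGSTOP/SIGCONT suggestion concrete, here is a minimal
sketch in Python for a single local process on a POSIX system. The
helper names (suspend_job, resume_job, demo) and the sleeping child
that stands in for the long-running job are assumptions made for
illustration; nothing here comes from a real scheduler or from the
posters' setups.

```python
import os
import signal
import subprocess
import sys


def suspend_job(pid):
    """Suspend a process. SIGSTOP cannot be caught or ignored."""
    os.kill(pid, signal.SIGSTOP)


def resume_job(pid):
    """Let a previously stopped process continue where it left off."""
    os.kill(pid, signal.SIGCONT)


def demo():
    """Stop and resume a stand-in job, returning (stopped, continued)."""
    # Hypothetical stand-in for the low-priority job: a child that sleeps.
    job = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        suspend_job(job.pid)
        # WUNTRACED makes waitpid report a stopped (not exited) child.
        _, status = os.waitpid(job.pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)

        # The high-priority job would run here while the child is frozen.

        resume_job(job.pid)
        # WCONTINUED makes waitpid report a child resumed by SIGCONT.
        _, status = os.waitpid(job.pid, os.WCONTINUED)
        continued = os.WIFCONTINUED(status)
        return stopped, continued
    finally:
        # Clean up the stand-in job so the demo leaves nothing behind.
        job.kill()
        job.wait()


if __name__ == "__main__":
    print(demo())
```

This only covers one process on one node. For a parallel job every
rank on every node would have to be stopped and resumed, the stopped
job still occupies RAM or swap, and the library timeouts mentioned
above still apply.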



