[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?
jmoyer at redhat.com
Wed Nov 3 06:46:35 PST 2004
==> Regarding [Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?; Brian Dobbins <brian.dobbins at yale.edu> adds:
brian.dobbins> Background: The reason we're looking for a checkpoint/restart
brian.dobbins> option has more to do with preempting a running job (of a lower
brian.dobbins> priority) by checkpointing it than it does with saving the
brian.dobbins> state in case of a crash. While functionally these may be
brian.dobbins> pretty close or the same, if that gives rise to another
brian.dobbins> solution, I'd like to hear it. In essence, we have some
brian.dobbins> Monte Carlo sims which are highly parallel, and could run
brian.dobbins> 24-7 for many months, but we want to be able to submit a
brian.dobbins> high priority CFD code that will take over, run for a few
brian.dobbins> days or so, and then have the system automagically restart
brian.dobbins> the MC sim.
How about sending the process a SIGSTOP, followed by a SIGCONT when you are
ready to resume execution? As long as the combined memory footprint of the
two apps won't exhaust physical RAM + swap, this should be okay. This assumes
a great deal about the robustness of your long-running job, though.
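
For illustration, here is a minimal sketch of the stop/continue idea in
Python, for a single process on one node. The names suspend(), resume() and
demo() are made up for this example, and the sleeping child is only a
stand-in for the real long-running job. On a cluster, something would have
to deliver the same signals to every process of the parallel job on every
node; that part is not shown here.

```python
import os
import signal
import subprocess
import sys


def suspend(pid):
    """Freeze a process. SIGSTOP cannot be caught or ignored."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid):
    """Let a previously stopped process continue where it left off."""
    os.kill(pid, signal.SIGCONT)


def demo():
    # Hypothetical stand-in for the low-priority, long-running job.
    job = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        suspend(job.pid)
        # WUNTRACED makes waitpid report a child that stopped (not exited).
        _, status = os.waitpid(job.pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)

        # ... the high-priority job would run here ...

        resume(job.pid)
        # WCONTINUED makes waitpid report a child resumed by SIGCONT.
        _, status = os.waitpid(job.pid, os.WCONTINUED)
        continued = os.WIFCONTINUED(status)
    finally:
        # Clean up the stand-in job so the demo leaves nothing behind.
        job.kill()
        job.wait()
    return stopped, continued


if __name__ == "__main__":
    print(demo())
```

Note that a stopped process still holds all of its memory; the kernel can
only page it out to swap, which is why the RAM + swap caveat above matters.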