[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?

Glen Gardner Glen.Gardner at verizon.net
Wed Nov 3 15:31:48 PST 2004


You will probably end up having to use blocking message passing to make 
the processes wait at each checkpoint. The end result is that you lose a 
significant amount of performance to waiting for all the programs to get 
to an appropriate checkpoint and wait for some kind of validation.
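
As a rough illustration of that cost, here is a minimal sketch in Python 
that uses threads and threading.Barrier as a stand-in for blocking message 
passing between cluster processes. The function name run_workers, its 
parameters, and the sleep-based "work" are all hypothetical; a real job 
would use its message-passing library's own blocking calls and write real 
state to disk.

```python
import threading
import time


def run_workers(n_workers=4, n_steps=6, checkpoint_every=3):
    """Simulate workers that block at a barrier before each checkpoint."""
    checkpoints = []               # one consistent snapshot per checkpoint
    progress = [0] * n_workers     # last completed step of each worker
    waited = [0.0] * n_workers     # seconds each worker spent blocked

    def snapshot():
        # Called by exactly one thread once every worker has arrived,
        # before any of them is released from the barrier.
        checkpoints.append(list(progress))

    barrier = threading.Barrier(n_workers, action=snapshot)

    def worker(rank):
        for step in range(1, n_steps + 1):
            time.sleep(0.001 * (rank + 1))   # uneven work per rank (assumed)
            progress[rank] = step
            if step % checkpoint_every == 0:
                start = time.perf_counter()
                barrier.wait()               # blocking synchronization point
                waited[rank] += time.perf_counter() - start

    threads = [threading.Thread(target=worker, args=(rank,))
               for rank in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return checkpoints, waited


if __name__ == "__main__":
    snapshots, idle = run_workers()
    print("snapshots:", snapshots)
    print("seconds blocked per worker:", [round(t, 4) for t in idle])
```

Every snapshot is consistent because nobody moves on until all workers 
have arrived, and the time spent blocked at the barrier is the performance 
that gets lost.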


Glen Gardner


Reuti wrote:

> Jeff Moyer wrote:
>
>> ==> Regarding [Beowulf] Checkpoint / Restart on 2.6 series kernels 
>> for clusters?; Brian Dobbins <brian.dobbins at yale.edu> adds:
>>
>> [snip]
>>
>> brian.dobbins> Background: The reason we're looking for a
>> brian.dobbins> checkpoint/restart option has more to do with
>> brian.dobbins> preempting a running job (of a lower priority) by
>> brian.dobbins> checkpointing it than it does with saving the state
>> brian.dobbins> in case of a crash. While functionally these may be
>> brian.dobbins> pretty close or the same, if that gives rise to another
>> brian.dobbins> solution, I'd like to hear it. In essence, we have some
>> brian.dobbins> Monte Carlo sims which are highly parallel, and could run
>> brian.dobbins> 24-7 for many months, but we want to be able to submit a
>> brian.dobbins> high priority CFD code that will take over, run for a few
>> brian.dobbins> days or so, and then have the system automagically
>> brian.dobbins> restart the MC sim.
>>
>> How about sending the process a SIGSTOP followed by a SIGCONT when you
>> are ready to resume execution? So long as the memory footprints of the
>> two apps won't exhaust physical RAM + swap, this should be okay. This
>> assumes a great deal about the robustness of your long running job,
>> though.
>>
>
> For parallel jobs this will lead to timing problems (depending on the 
> parallel libs used - you have to adjust at least any timeout for 
> missing communication, which may arise in the libs). - Reuti
>
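
For reference, the SIGSTOP / SIGCONT suggestion quoted above can be 
sketched for a single process roughly as follows. This is a POSIX-only 
illustration in Python; the function name suspend_and_resume and the 
short-lived child command are hypothetical stand-ins for the long-running 
job. A stopped process keeps its memory, which is why the RAM + swap 
caveat applies, and this does nothing about the communication timeouts 
Reuti mentions for parallel jobs.

```python
import os
import signal
import subprocess
import sys


def suspend_and_resume(child_code="import time; time.sleep(0.2)"):
    """Stop a child with SIGSTOP, confirm it stopped, then SIGCONT it."""
    # Hypothetical stand-in for the low-priority, long-running job.
    proc = subprocess.Popen([sys.executable, "-c", child_code])

    os.kill(proc.pid, signal.SIGSTOP)       # suspend the job
    # WUNTRACED makes waitpid report a stopped (not just exited) child.
    _, status = os.waitpid(proc.pid, os.WUNTRACED)
    was_stopped = os.WIFSTOPPED(status)

    # ... the high-priority job would run here ...

    os.kill(proc.pid, signal.SIGCONT)       # resume the job
    return was_stopped, proc.wait()         # exit code of the resumed job


if __name__ == "__main__":
    stopped, exit_code = suspend_and_resume()
    print("child was stopped:", stopped, "exit code:", exit_code)
```

On a cluster the same pair of signals would have to be delivered to every 
process of the job on every node, usually through the batch system.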

-- 
Glen E. Gardner, Jr.
AA8C
AMSAT MEMBER 10593
Glen.Gardner at verizon.net


http://members.bellatlantic.net/~vze24qhw/index.html





