[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?
andrewxwang at yahoo.com.tw
Wed Nov 3 15:38:18 PST 2004
Did you look at SGE+Berkeley Lab checkpoint? This is
And also LAM + Berkeley Lab checkpoint?
"The LAM/MPI Checkpoint/Restart Framework:
--- Brian Dobbins <brian.dobbins at yale.edu> wrote:
> Hi guys,
> I have just begun looking into a checkpoint /
> restart capability for
> clusters, but looking into the archives here and
> doing a search has
> shown few viable solutions. Some, like CHPOX (1),
> appear to be only
> written for the 2.4 series kernels, and I recall
> seeing one product that
> seemed to indicate it had full support for these
> operations, but it was
> a commercial product.
> What solutions have people on this list used for
> this functionality?
> Am I restricted to going back to the 2.4 series?
> (I'd prefer to run
> 2.6 on the AMD64 hardware I've got.)
> Additionally, though this is a much wider question
> (and one tackled
> before!), what are people's pros and cons of the
> various queuing
> systems? I've played with OpenPBS before, and
> 'seen' SGE, but once
> again, I thought it'd be nice to hear what some of
> the heavy hitters on
> this list prefer.
> Background: The reason we're looking for a
> checkpoint/restart option
> has more to do with preempting a running job (of a
> lower priority) by
> checkpointing it than it does with saving the state
> in case of a crash.
> While functionally these may be pretty close or the
> same, if that gives
> rise to another solution, I'd like to hear it. In
> essence, we have some
> Monte Carlo sims which are highly parallel, and
> could run 24-7 for many
> months, but we want to be able to submit a high
> priority CFD code that
> will take over, run for a few days or so, and then
> have the system
> automagically restart the MC sim.
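The preempt-by-checkpoint workflow described above can be sketched as a toy, application-level example. This is only an illustration under stated assumptions: it saves state from inside the program with Python's pickle module, which is not how a kernel-level checkpointer such as Berkeley Lab's works, and every name here (MonteCarloJob, run, preempt_at) is hypothetical rather than part of any queuing system.

```python
import pickle
import random


class MonteCarloJob:
    """Hypothetical low-priority job: estimates pi in resumable steps."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.samples = 0
        self.hits = 0

    def step(self, n=1000):
        # Draw n points in the unit square; count those inside the quarter circle.
        for _ in range(n):
            x, y = self.rng.random(), self.rng.random()
            if x * x + y * y <= 1.0:
                self.hits += 1
            self.samples += 1

    def estimate(self):
        return 4.0 * self.hits / self.samples

    def checkpoint(self):
        # Save counters plus the RNG state so the sample stream resumes exactly.
        state = {
            "samples": self.samples,
            "hits": self.hits,
            "rng": self.rng.getstate(),
        }
        return pickle.dumps(state)

    @classmethod
    def restart(cls, blob):
        state = pickle.loads(blob)
        job = cls()
        job.samples = state["samples"]
        job.hits = state["hits"]
        job.rng.setstate(state["rng"])
        return job


def run(total_steps, preempt_at=None):
    """Run the job; optionally checkpoint, drop, and restart it at one step."""
    job = MonteCarloJob(seed=42)
    for i in range(total_steps):
        if i == preempt_at:
            blob = job.checkpoint()
            del job  # the high-priority job would own the nodes here
            job = MonteCarloJob.restart(blob)
        job.step()
    return job


if __name__ == "__main__":
    plain = run(total_steps=10)
    preempted = run(total_steps=10, preempt_at=4)
    print(plain.hits == preempted.hits, preempted.estimate())
```

The point of the sketch is that a preempted run ends in the same state as an uninterrupted one, which is the property a scheduler-driven checkpoint/restart would need to provide for the Monte Carlo job.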
> Any advice would be great!
> Thanks very much for your time,
> - Brian
> Brian Dobbins
> Yale Mechanical Engineering
> Brian Dobbins <brian.dobbins at yale.edu>