[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?

Brian Dobbins brian.dobbins at yale.edu
Tue Nov 2 13:22:15 PST 2004

Hi guys,

  I have just begun looking into a checkpoint / restart capability for
clusters, but going through the archives here and doing a search have
turned up few viable solutions.  Some, like CKPOX (1), appear to be
written only for the 2.4 series kernels, and I recall seeing one product
that seemed to offer full support for these operations, but it was
commercial.

  What solutions have people on this list used for this functionality?
Am I restricted to going back to the 2.4 series?  (I'd prefer to run
2.6 on the AMD64 hardware I've got.)

  Additionally, though this is a much wider question (and one tackled
before!), what do people see as the pros and cons of the various queuing
systems?  I've played with OpenPBS before, and 'seen' SGE, but once
again, I thought it'd be nice to hear what some of the heavy hitters on
this list prefer.

  Background: The reason we're looking for a checkpoint/restart option
has more to do with preempting a running job (of a lower priority) by
checkpointing it than it does with saving the state in case of a crash.
While functionally these may be pretty close or the same, if that gives
rise to another solution, I'd like to hear it.  In essence, we have some
Monte Carlo sims which are highly parallel, and could run 24-7 for many
months, but we want to be able to submit a high priority CFD code that
will take over, run for a few days or so, and then have the system
automagically restart the MC sim.

  Any advice would be great!

  Thanks very much for your time,
  - Brian

Brian Dobbins
Yale Mechanical Engineering

