[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?
Brian Dobbins brian.dobbins at yale.edu
Tue Nov 2 13:22:15 PST 2004
Hi guys,

I have just begun looking into a checkpoint / restart capability for clusters, but searching the archives here has turned up few viable solutions. Some, like CKPOX (1), appear to be written only for the 2.4 series kernels, and I recall seeing one product that seemed to indicate it had full support for these operations, but it was a commercial product.

What solutions have people on this list used for this functionality? Am I restricted to going back to the 2.4 series? (I'd prefer to run 2.6 on the AMD64 hardware I've got.)

Additionally, though this is a much wider question (and one tackled before!), what do people see as the pros and cons of the various queuing systems? I've played with OpenPBS before, and 'seen' SGE, but once again, I thought it'd be nice to hear what some of the heavy hitters on this list prefer.

Background: The reason we're looking for a checkpoint/restart option has more to do with preempting a running job (of a lower priority) by checkpointing it than it does with saving the state in case of a crash. While functionally these may be pretty close or the same, if that gives rise to another solution, I'd like to hear it. In essence, we have some Monte Carlo sims which are highly parallel and could run 24/7 for many months, but we want to be able to submit a high-priority CFD code that will take over, run for a few days or so, and then have the system automagically restart the MC sim.

Any advice would be great! Thanks very much for your time,

- Brian

Brian Dobbins
Yale Mechanical Engineering

--
Brian Dobbins <brian.dobbins at yale.edu>
