[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?

Andrew Wang andrewxwang at yahoo.com.tw
Wed Nov 3 15:38:18 PST 2004


Have you looked at SGE with Berkeley Lab Checkpoint/Restart (BLCR)?
Here is the HOWTO:
http://gridengine.sunsource.net/project/gridengine/howto/APSTC-TB-2004-005.pdf

And also LAM/MPI with Berkeley Lab checkpoint? See
"The LAM/MPI Checkpoint/Restart Framework:
System-Initiated Checkpointing":
http://www.lam-mpi.org/papers/lacsi2003/
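
For the preemption case described in the quoted message below, a rough
sketch of the save-and-resume idea may help. Note that this is
application-level checkpointing in plain Python, not the kernel-level
mechanism that Berkeley Lab checkpoint provides, and it is not taken
from either link above. The names run_mc and save_checkpoint, the
stop_after parameter (which stands in for a preemption request from the
queuing system), and the pi estimate used as the Monte Carlo workload
are all assumptions made for illustration.

```python
import os
import pickle
import random


def save_checkpoint(path, rng, done, hits):
    """Write the run state to a temp file, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"rng": rng.getstate(), "done": done, "hits": hits}, f)
    os.replace(tmp, path)  # atomic rename, so a crash never leaves a half-written file


def run_mc(total_samples, ckpt_path, stop_after=None, seed=12345):
    """Estimate pi by Monte Carlo, resuming from ckpt_path if it exists.

    stop_after simulates preemption (hypothetical stand-in for a signal
    from the scheduler): after that many samples in this invocation the
    state is saved and None is returned.
    """
    if os.path.exists(ckpt_path):
        # Restart: restore the RNG state and the counters.
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
        rng = random.Random()
        rng.setstate(state["rng"])
        done, hits = state["done"], state["hits"]
    else:
        # Fresh start.
        rng = random.Random(seed)
        done, hits = 0, 0

    ran = 0
    while done < total_samples:
        if stop_after is not None and ran >= stop_after:
            save_checkpoint(ckpt_path, rng, done, hits)
            return None
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
        done += 1
        ran += 1
    return 4.0 * hits / total_samples
```

Because the random number generator state is saved along with the
counters, a run that is interrupted and then resumed should produce the
same estimate as an uninterrupted run with the same seed. A real
parallel job would need each rank to save its own state, which is the
part the system-level tools above are meant to take off your hands.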

Andrew.



 --- Brian Dobbins <brian.dobbins at yale.edu> wrote:
> Hi guys,
> 
>   I have just begun looking into a checkpoint / restart capability for
> clusters, but looking into the archives here and doing a search has
> shown few viable solutions.  Some, like CKPOX (1), appear to be only
> written for the 2.4 series kernels, and I recall seeing one product
> that seemed to indicate it had full support for these operations, but
> it was a commercial product.
> 
>   What solutions have people on this list used for this functionality?
> Am I restricted to going back to the 2.4 series?  (I'd prefer to run
> 2.6 on the AMD64 hardware I've got.)
> 
>   Additionally, though this is a much wider question (and one tackled
> before!), what are people's pros and cons of the various queuing
> systems?  I've played with OpenPBS before, and 'seen' SGE, but once
> again, I thought it'd be nice to hear what some of the heavy hitters
> on this list prefer.
> 
>   Background: The reason we're looking for a checkpoint/restart option
> has more to do with preempting a running job (of a lower priority) by
> checkpointing it than it does with saving the state in case of a crash.
> While functionally these may be pretty close or the same, if that gives
> rise to another solution, I'd like to hear it.  In essence, we have
> some Monte Carlo sims which are highly parallel, and could run 24-7 for
> many months, but we want to be able to submit a high priority CFD code
> that will take over, run for a few days or so, and then have the system
> automagically restart the MC sim.
> 
>   Any advice would be great!
> 
>   Thanks very much for your time,
>   - Brian
> 
> Brian Dobbins
> Yale Mechanical Engineering
> 
> -- 
> Brian Dobbins <brian.dobbins at yale.edu>
> 



