[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?


Andrew Wang andrewxwang at yahoo.com.tw
Wed Nov 3 15:38:18 PST 2004


Have you looked at SGE with Berkeley Lab Checkpoint/Restart
(BLCR)? Here is the HOWTO:
http://gridengine.sunsource.net/project/gridengine/howto/APSTC-TB-2004-005.pdf

And also LAM/MPI with Berkeley Lab Checkpoint/Restart?
"The LAM/MPI Checkpoint/Restart Framework:
System-Initiated Checkpointing":
http://www.lam-mpi.org/papers/lacsi2003/

Andrew.



--- Brian Dobbins <brian.dobbins at yale.edu> wrote:
> Hi guys,
> 
>   I have just begun looking into a checkpoint /
> restart capability for
> clusters, but looking into the archives here and
> doing a search has
> shown few viable solutions.  Some, like CKPOX (1),
> appear to be only
> written for the 2.4 series kernels, and I recall
> seeing one product that
> seemed to indicate it had full support for these
> operations, but it was
> a commercial product.
> 
>   What solutions have people on this list used for
> this functionality?
> Am I restricted to going back to the 2.4 series? 
> (I'd prefer to run
> 2.6 on the AMD64 hardware I've got.)
> 
>   Additionally, though this is a much wider question
> (and one tackled
> before!), what are people's pros and cons of the
> various queuing
> systems?  I've played with OpenPBS before, and
> 'seen' SGE, but once
> again, I thought it'd be nice to hear what some of
> the heavy hitters on
> this list prefer.
> 
>   Background: The reason we're looking for a
> checkpoint/restart option
> has more to do with preempting a running job (of a
> lower priority) by
> checkpointing it than it does with saving the state
> in case of a crash.
> While functionally these may be pretty close or the
> same, if that gives
> rise to another solution, I'd like to hear it.  In
> essence, we have some
> Monte Carlo sims which are highly parallel, and
> could run 24-7 for many
> months, but we want to be able to submit a high
> priority CFD code that
> will take over, run for a few days or so, and then
> have the system
> automagically restart the MC sim.
> 
>   Any advice would be great!
> 
>   Thanks very much for your time,
>   - Brian
> 
> Brian Dobbins
> Yale Mechanical Engineering
>  
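
For a Monte Carlo run like the one described above, one possible
alternative to a kernel-level checkpointer is application-level
checkpointing: the job periodically writes its own state to disk and can
be killed and resumed later. The following is only a minimal sketch of
that idea, not something proposed in this thread and not how BLCR works.
The pi-estimation workload, the function names, and the checkpoint file
name are all hypothetical, chosen just for illustration.

```python
import os
import pickle
import random
import tempfile


def new_state(seed):
    """Fresh simulation state: counters plus the RNG state (illustrative layout)."""
    rng = random.Random(seed)
    return {"samples": 0, "hits": 0, "rng": rng.getstate()}


def advance(state, n):
    """Run n more samples starting from `state` and return the updated state."""
    rng = random.Random()
    rng.setstate(state["rng"])  # resume the random stream exactly where it stopped
    hits = state["hits"]
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return {"samples": state["samples"] + n, "hits": hits, "rng": rng.getstate()}


def save_checkpoint(state, path):
    """Write to a temporary file and rename it, so a kill mid-write
    cannot leave a truncated checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def estimate(state):
    """Current estimate of pi from the accumulated counters."""
    return 4.0 * state["hits"] / state["samples"]


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "mc.ckpt")  # hypothetical file name

    # The low-priority job runs for a while, then is preempted.
    state = advance(new_state(seed=7), 400_000)
    save_checkpoint(state, ckpt)

    # ... the high-priority job would own the nodes here ...

    # Later the simulation is restarted from the checkpoint.
    state = advance(load_checkpoint(ckpt), 600_000)
    print(state["samples"], estimate(state))
```

Because the RNG state is saved along with the counters, a resumed run
produces the same sample stream as an uninterrupted one. A batch system
could then preempt the job by killing it after its most recent checkpoint
and requeueing it, at the cost of changing the simulation code itself.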



