[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?
Chris Samuel csamuel at vpac.org
Wed Nov 3 17:53:42 PST 2004
On Thu, 4 Nov 2004 04:36 am, Reuti wrote:

> For parallel jobs this will lead to timing problems (depending on the
> parallel libs used - you have to adjust at least any timeout for missing
> communication, which may arise in the libs). - Reuti

My understanding is that the latest version of LAM/MPI supports checkpointing of parallel jobs. Their page http://www.lam-mpi.org/about/overview/ says:

----8< quote 8<----
Checkpoint/Restart

MPI applications running under LAM/MPI can be checkpointed to disk and restarted at a later time. LAM requires a 3rd party single-process checkpoint/restart toolkit for actually checkpointing and restarting a single MPI process - LAM takes care of the parallel coordination. Currently, the Berkeley Labs Checkpoint/Restart package (Linux only) is supported. The infrastructure allows for easy addition of new checkpoint/restart packages.
----8< quote 8<----

The Berkeley Labs package they mention (http://ftg.lbl.gov/checkpoint) is a kernel module (not a kernel patch) for the 2.4 series on IA32. There is an open bug report about porting it to the 2.6 series in their Bugzilla (https://mantis.lbl.gov/bugzilla/show_bug.cgi?id=748). Opteron support is bug 749 and depends on the 2.6 support.

Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
