[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?

Chris Samuel csamuel at vpac.org
Wed Nov 3 17:53:42 PST 2004


On Thu, 4 Nov 2004 04:36 am, Reuti wrote:

> For parallel jobs this will lead to timing problems (depending on the
> parallel libs used - you have to adjust at least any timeout for missing
> communication, which may arise in the libs). - Reuti

My understanding is that the latest version of LAM/MPI supports checkpointing
of parallel jobs.

Their page http://www.lam-mpi.org/about/overview/ says:

 ----8< quote 8<----

Checkpoint/Restart
 MPI applications running under LAM/MPI can be checkpointed to disk and 
restarted at a later time. LAM requires a 3rd party single-process 
checkpoint/restart toolkit for actually checkpointing and restarting a single 
MPI process - LAM takes care of the parallel coordination. Currently, the 
Berkeley Labs Checkpoint/Restart package (Linux only) is supported. The 
infrastructure allows for easy addition of new checkpoint/restart packages. 

 ----8< quote 8<----

The Berkeley Lab package they mention (http://ftg.lbl.gov/checkpoint) is a
kernel module, not a kernel patch, for the 2.4 series on IA32. They have an
open bug report about porting it to the 2.6 series in their Bugzilla at
https://mantis.lbl.gov/bugzilla/show_bug.cgi?id=748. Opteron support is
bug 749 and depends on the 2.6 support.

Chris
-- 
 Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
