[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?
reuti at staff.uni-marburg.de
Wed Nov 3 09:44:39 PST 2004
> I must say though that from what I know checkpointing/restarting
> serial codes is OK.
> Checkpointing parallel jobs is problematic, and from what I've read
> not recommended (the various processes are passing
> messages, and how do you checkpoint in a consistent state?).
I would send a signal from SGE only to the head node of, let's say, an
MPI job. The rank 0 process then has to set a checkpoint flag and
broadcast it to the slave processes. The slaves must check for this flag
from time to time, send their state to the head node, and shut down in a
proper way. The head node stores the collected state in a checkpoint
location on a shared file system (we may get different nodes the next
time). I think it's possible to program this when it's included in the
design of the program, but adding it later to an already existing
program is not so easy. - Reuti
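
To make the scheme above a bit more concrete, here is a minimal sketch
in Python. It is not MPI code: threads stand in for the slave ranks, a
threading.Event stands in for the flag that rank 0 broadcasts, and a
queue stands in for the messages sent back to the head node. All names
(worker, run_with_checkpoint, the JSON layout) are made up for
illustration, and the "work" is just a running sum.

```python
import json
import os
import queue
import tempfile
import threading


def worker(rank, checkpoint_requested, to_head, steps=1_000_000):
    """Slave: do work, poll the flag between steps, report state, stop."""
    total = 0
    step = 0
    while step < steps:
        total += step  # stand-in for one unit of real work
        step += 1
        # Poll only at a step boundary, so the reported state is consistent.
        if checkpoint_requested.is_set():
            break
    to_head.put({"rank": rank, "step": step, "total": total})


def run_with_checkpoint(num_workers, checkpoint_path):
    """Head: request a checkpoint, collect every slave's state, store it."""
    checkpoint_requested = threading.Event()  # stands in for the broadcast
    to_head = queue.Queue()                   # stands in for sends to rank 0
    threads = [
        threading.Thread(target=worker,
                         args=(rank, checkpoint_requested, to_head))
        for rank in range(1, num_workers + 1)
    ]
    for t in threads:
        t.start()

    # Assumption: in a real job this would happen in reaction to the
    # signal delivered by the batch system, not immediately.
    checkpoint_requested.set()

    states = [to_head.get() for _ in threads]
    for t in threads:
        t.join()
    states.sort(key=lambda s: s["rank"])

    # Write to a temporary file and rename it, so an interrupted write
    # never leaves a half-written checkpoint behind.
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(states, f)
    os.replace(tmp_path, checkpoint_path)
    return states


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    for state in run_with_checkpoint(3, path):
        print(state)
```

The point of the sketch is that each slave only looks at the flag
between two steps of work, so what it reports is never a half-finished
step. A restart would read the file back and hand each rank its saved
step and total, which is why the file should live on the shared file
system rather than on a node-local disk.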