[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?

Wed Nov 3 09:44:39 PST 2004

Hi,

<snip>
> 
> I must say, though, that from what I know, checkpointing/restarting
> serial codes is OK.
> Checkpointing parallel jobs is problematic and, from what I've read,
> not recommended (the various processes are passing messages, so how
> do you checkpoint in a consistent state?).
> 

I would have SGE send a signal only to the head node of, let's say, an
MPI job. This rank 0 process then has to set a flag and broadcast it to
the slave processes. The slaves must check for it from time to time,
send their state to the head node, and shut down cleanly. The head node
stores the information at a checkpoint location on a shared file system
(we may get different nodes the next time). I think this is possible to
program when it is included in the design of the program, but adding it
later to an already existing program is not so easy. - Reuti
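
To make the scheme above a bit more concrete, here is a rough sketch in
Python. It is not real MPI code: threads stand in for the ranks, a
threading.Event stands in for the broadcast flag, and a queue stands in
for the messages back to rank 0. All function names, the JSON file
format, and the choice of SIGUSR1 are assumptions made for illustration
only; a real job would use MPI calls and whatever signal the scheduler
is configured to send.

```python
import json
import os
import queue
import signal
import threading

# All names here are hypothetical; threads stand in for MPI ranks.
checkpoint_requested = threading.Event()  # the flag rank 0 "broadcasts"


def on_checkpoint_signal(signum, frame):
    """Handler meant for rank 0 only: note that a checkpoint was requested."""
    checkpoint_requested.set()


def install_handler():
    # Assumption: the scheduler is set up to send SIGUSR1 to rank 0.
    signal.signal(signal.SIGUSR1, on_checkpoint_signal)


def worker(rank, state, total_steps, to_head):
    """Work in small steps and poll the flag between steps."""
    while state["step"] < total_steps and not checkpoint_requested.is_set():
        state["acc"] += rank * state["step"]  # stand-in for real computation
        state["step"] += 1
    # Reached on completion or on a checkpoint request: report state and exit.
    to_head.put((rank, state))


def run(n_workers, total_steps, ckpt_path):
    """Head node: start or resume the workers, collect state, store it."""
    if os.path.exists(ckpt_path):
        # Restart: JSON turns the integer ranks into strings, so convert back.
        with open(ckpt_path) as f:
            states = {int(r): s for r, s in json.load(f).items()}
    else:
        states = {r: {"step": 0, "acc": 0} for r in range(1, n_workers + 1)}

    to_head = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(r, states[r], total_steps, to_head))
        for r in states
    ]
    for t in threads:
        t.start()
    collected = dict(to_head.get() for _ in threads)
    for t in threads:
        t.join()

    # Write to a temporary file first, then rename, so an interrupted
    # write never replaces a good checkpoint with a partial one.
    tmp = ckpt_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(collected, f)
    os.replace(tmp, ckpt_path)
    return collected
```

Because the checkpoint file holds the state keyed by rank and not by
host name, a restarted job can pick it up from the shared file system
no matter which nodes it is given. Note that the workers only stop
between steps, which is what keeps the saved state consistent; it also
means the step size bounds how long a checkpoint request can take.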