[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?
Reuti reuti at staff.uni-marburg.de
Wed Nov 3 09:44:39 PST 2004
Hi,

<snip>
>
> I must say though that from what I know checkpointing/restarting
> serial codes is OK.
> Checkpointing parallel jobs is problematic, and from what I've read
> not recommended (the various processes are passing
> messages, and how do you checkpoint in a consistent state?).
>

I would send a signal from SGE only to the head node of, let's say, an MPI job. This rank 0 process has to set some special fields and broadcast them to the slave processes. The slaves must check these from time to time, send their state to the head node, and shut down in a proper way. The head node then stores the information in a checkpointing location on a shared file system (we may get different nodes the next time).

I think it's possible to program this when it's included in the design of the program, but adding it later to an already existing program is not so easy.

- Reuti
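
The scheme described above could be sketched roughly as follows. This is only a single-machine simulation, not MPI code: threads stand in for the ranks, a `threading.Event` stands in for the broadcast flag, a queue stands in for the messages sent back to the head node, and a timer stands in for the signal from SGE. All names (`run`, `worker`, `save_states`, `load_states`, `signal_after`) and the JSON checkpoint format are assumptions made for illustration.

```python
import json
import os
import queue
import threading
import time


def load_states(path, n_ranks):
    """Return saved per-rank states, or fresh ones if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return [{"rank": r, "step": 0, "acc": 0} for r in range(n_ranks)]


def save_states(path, states):
    """Write to a temporary file, then rename, so no half-written file is left."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(states, fh)
    os.replace(tmp, path)


def worker(state, total_steps, stop, outbox):
    """Stand-in for one rank: work, poll the flag, then report state and exit."""
    rank, step, acc = state["rank"], state["step"], state["acc"]
    while step < total_steps and not stop.is_set():
        acc += rank * step  # placeholder for one unit of real work
        step += 1
    outbox.put({"rank": rank, "step": step, "acc": acc})


def run(n_ranks, total_steps, ckpt_path, signal_after=None):
    """Head-node logic: start (or resume) the ranks, gather states, store them."""
    states = load_states(ckpt_path, n_ranks)
    stop = threading.Event()   # the "special field" the slaves poll
    outbox = queue.Queue()     # channel from the slaves to the head node
    threads = [
        threading.Thread(target=worker, args=(s, total_steps, stop, outbox))
        for s in states
    ]
    for t in threads:
        t.start()
    if signal_after is not None:
        time.sleep(signal_after)  # stand-in for the scheduler's signal arriving
        stop.set()                # "broadcast" the checkpoint request
    gathered = sorted((outbox.get() for _ in threads), key=lambda s: s["rank"])
    for t in threads:
        t.join()
    save_states(ckpt_path, gathered)
    return gathered
```

Calling `run(3, 200000, path, signal_after=0.01)` and then `run(3, 200000, path)` interrupts the job and later resumes it from the stored file. The simulated ranks never exchange messages with each other, so the consistency question from the quoted text does not arise here; in a real MPI program each rank would have to reach a point with no messages in flight before reporting its state.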
