Two heads are better than one! :)
Joseph Landman landman at scalableinformatics.com
Fri Nov 1 04:29:24 PST 2002
- Previous message: ethernet bonding problems
- Next message: Two heads are better than one! :)
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Fri, 2002-11-01 at 00:01, Donald Becker wrote:

[...]

> > It is more complex than that, in that you would need to preserve state
> > changes over the length of the program, and PVM/MPI/et al do not
> > preserve this state information.
>
> One rule of thumb: people that think application-independent
> checkpointing is possible haven't actually considered the
> implementation and implications. In real life the most practical way
> to handle the issue is
>   - having the system handle checkpoint signal support
>   - making it easy to write, gather and restore checkpoint files, and
>   - providing examples of application-supported checkpointing

Agreed. This is how the SGI checkpointing worked, which was (IIRC)
modeled on the Cray checkpointing. Not everything could be checkpointed
though, and the system code walked through its checkoff list of items to
see if the program was indeed checkpointable.

It is not just program state that needs to be maintained, but dynamic
system resources (pipes, open files, sockets) that need to be examined,
torn down and rebuilt. Only for certain subsets of these can you
successfully rebuild after a tear down, which is why the checkpointing
only worked in some cases.

> > The folks at LANL had a fault tolerant MPI at one point, but I haven't
> > heard much of it recently.
>
> I would like to see a paper on the real-life result. I'm guessing that
> the overhead overwhelms any possible saving even with frequent node
> failure. That's exactly the sort of result that makes for a useful
> paper -- "You must have a much better idea than this, or it won't work."

They had something published on the web site about 1 year ago. If I
find it again, I'll post the URL.

--
Joseph Landman, Ph.D
Scalable Informatics LLC
email: landman at scalableinformatics.com
web: http://scalableinformatics.com
phone: +1 734 612 4615
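
For readers who want something concrete, here is a minimal sketch of the application-supported checkpointing described in the quoted list: a signal asks for a checkpoint, the program writes its own state to a file, and it restores from that file on restart. Everything in it is an assumption made for illustration, not code from SGI, Cray, or any MPI library: the function names, the JSON file format, the use of SIGUSR1, and the toy workload (summing integers). It saves only state the application knows about; it makes no attempt to rebuild pipes, sockets, or other open descriptors, which is the hard part discussed above.

```python
import json
import os
import signal
import tempfile

# Set by the signal handler; the main loop checks it at a safe point.
_checkpoint_requested = False


def request_checkpoint(signum=None, frame=None):
    """Signal handler: only record the request, do no I/O here."""
    global _checkpoint_requested
    _checkpoint_requested = True


def install_checkpoint_signal():
    """Request a checkpoint on SIGUSR1 (Unix only; the signal is an assumption)."""
    signal.signal(signal.SIGUSR1, request_checkpoint)


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def run(path, n_steps, every=100, stop_at=None):
    """Sum 0..n_steps-1, checkpointing every `every` steps and on request.

    stop_at simulates a node failure by returning early at that step.
    """
    global _checkpoint_requested
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if stop_at is not None and state["step"] == stop_at:
            return None  # simulated crash: work since the last checkpoint is lost
        state["total"] += state["step"]
        state["step"] += 1
        if _checkpoint_requested or state["step"] % every == 0:
            save_checkpoint(path, state)
            _checkpoint_requested = False
    save_checkpoint(path, state)
    return state["total"]
```

A real program would call `install_checkpoint_signal()` once at startup and then `run(...)`; after a failure, calling `run(...)` again with the same path resumes from the last saved step and repeats only the work done since then.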
More information about the Beowulf mailing list
