MPI dies
Tony Skjellum tony at MPI-Softtech.Com
Fri Sep 15 17:11:32 PDT 2000
Victor, while the MPI standard doesn't support this model, at least MPI/Pro and MPICH tend to keep working in master/slave mode when a slave node dies.

A potential model for this is to build a configuration like this:

Split MPI_COMM_WORLD into an array of sub-communicators, with the master as node zero and the Kth slave as node 1 of the Kth communicator. Use these communicators for point-to-point messaging between host and slaves. DO NOT DO COLLECTIVE COMMUNICATION between slaves.

FYI, you can do the same using just MPI_COMM_WORLD, but this seems safer.

Given these caveats, one believes that the application will survive the death of one of the slaves. However, the state of MPI_Finalize() is ambiguous. MPI 1.x clarified Finalize as a barrier, so a hang should/could occur depending on how that barrier is implemented, but you should be able to run.

Note, all of this describes the behavior of two specific implementations, but these implementations have bent to the demands of users (maybe it started this way by accident, but people like the behavior). We don't know what LAM does, or whether it can work that way.

We have customers who operate under circumstances analogous to those described above, and continue to compute for days/weeks even though slaves die. If the master dies, of course, it is all over.

Tony

Anthony Skjellum, PhD, President (tony at mpi-softtech.com)
MPI Software Technology, Inc., Ste. 33, 101 S. Lafayette, Starkville, MS 39759
+1-(662)320-4300 x15; FAX: +1-(662)320-4301; http://www.mpi-softtech.com
"Best-of-breed Software for Beowulf and Easy-to-Own Commercial Clusters."

On Fri, 15 Sep 2000, Victor Karyo wrote:

> Is there a technique to handle node failure? Shortly, I'll be working on an
> algorithm that is naturally parallel and divided into coarse-grain "blocks".
> I want to use a master/worker scheme. The master is to be set to reissue
> blocks if the block doesn't return from the worker fast enough on the
> assumption the node has failed.
> I know I can't rejoin a node after it
> fails, but if the node fails will the whole app die?
>
> Also, is there a way to detect the number of nodes other than at
> initialization, so I can tell if a node has died?
>
> (I plan on using MPI-Pro on a RH6.2 8-way single-proc Intel cluster with
> 100mbps switched ethernet.)
>
> Thanks
> Victor Karyo.
>
> There are some efforts to build fault tolerating MPI's, but standard
> MPI-1.x is supposed to kill the parallel application if a node dies,
> or else the underlying system must transparently solve the fault.
>
> Anthony Skjellum, PhD, President (tony at mpi-softtech.com)
> MPI Software Technology, Inc., Ste. 33, 101 S. Lafayette, Starkville, MS
> 39759
> +1-(662)320-4300 x15; FAX: +1-(662)320-4301; http://www.mpi-softtech.com
> "Best-of-breed Software for Beowulf and Easy-to-Own Commercial Clusters."
>
> On Thu, 14 Sep 2000, Horatio B. Bogbindero wrote:
>
> > what happens if a node in MPI dies? is the entire computation lost?
> >
> > ---------------------
> > william.s.yu at ieee.org
> >
> > I bought some used paint. It was in the shape of a house.
> > -- Steven Wright
> >
> > _______________________________________________
> > Beowulf mailing list
> > Beowulf at beowulf.org
> > http://www.beowulf.org/mailman/listinfo/beowulf
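
The master-side bookkeeping for the reissue scheme Victor describes might be sketched roughly as follows. This is only a sketch in plain C with no MPI calls in it: `struct block`, `blocks_init`, `blocks_issue`, `blocks_complete`, `blocks_all_done`, and the `timeout` parameter are hypothetical names invented for illustration, not part of MPI/Pro, MPICH, or any code posted in this thread. It assumes the master would pair each `blocks_issue` with a point-to-point send to that slave and call `blocks_complete` when a result arrives.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical master-side table of work blocks (illustration only). */
enum block_state { BLOCK_PENDING, BLOCK_ISSUED, BLOCK_DONE };

struct block {
    enum block_state state;
    int worker;        /* worker the block was last issued to, -1 if none */
    double issued_at;  /* time of the last issue, in seconds */
};

/* Mark every block as not yet issued. */
void blocks_init(struct block *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        b[i].state = BLOCK_PENDING;
        b[i].worker = -1;
        b[i].issued_at = 0.0;
    }
}

/*
 * Choose a block for an idle worker at time `now`.
 * A block that was never issued is preferred; failing that, the first
 * issued block older than `timeout` is reissued, on the assumption that
 * its worker has died. Returns the block index, or -1 if there is
 * nothing to hand out right now.
 */
int blocks_issue(struct block *b, size_t n, int worker,
                 double now, double timeout)
{
    int pick = -1;
    for (size_t i = 0; i < n; i++) {
        if (b[i].state == BLOCK_PENDING) {
            pick = (int)i;          /* fresh work wins over a reissue */
            break;
        }
        if (pick < 0 && b[i].state == BLOCK_ISSUED &&
            now - b[i].issued_at > timeout)
            pick = (int)i;          /* remember the first overdue block */
    }
    if (pick >= 0) {
        b[pick].state = BLOCK_ISSUED;
        b[pick].worker = worker;
        b[pick].issued_at = now;
    }
    return pick;
}

/*
 * Record a result for block `index`.
 * Returns 1 the first time the block completes and 0 for a duplicate,
 * e.g. a late answer from a slow worker whose block was reissued.
 */
int blocks_complete(struct block *b, size_t n, int index)
{
    assert(index >= 0 && (size_t)index < n);
    if (b[index].state == BLOCK_DONE)
        return 0;
    b[index].state = BLOCK_DONE;
    return 1;
}

/* Returns 1 when every block has completed, 0 otherwise. */
int blocks_all_done(const struct block *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (b[i].state != BLOCK_DONE)
            return 0;
    return 1;
}
```

Under these assumptions the master never waits on any one slave: it loops until `blocks_all_done` returns 1, calling `blocks_issue` whenever a slave is idle and discarding results for which `blocks_complete` returns 0. Picking the timeout is left open here; a slow but living slave only costs a duplicated block.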
