Fault tolerance and MPI
tony at MPI-Softtech.Com
Mon Feb 5 06:49:12 PST 2001
You can see our initial paper on this subject at
It contains references to other known works in this area.
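
For the master/slave model described in the quoted message below, one common
way to think about fault tolerance is to have the master requeue the work that
was assigned to a slave that stops responding. The following is only a minimal
sketch of that bookkeeping, not the method from our paper and not what LAM
provides. It is written in Python and runs serially for brevity. The names
run_master, run_on_worker and WorkerDied are hypothetical; run_on_worker
stands in for an MPI_Send / MPI_Recv round trip that reports an error to the
caller instead of aborting the whole job.

```python
from collections import deque


class WorkerDied(Exception):
    """Hypothetical signal that a send or receive to a worker failed."""


def run_master(tasks, workers, run_on_worker):
    """Hand tasks to workers one at a time; requeue a task if its worker fails.

    run_on_worker(worker, task) is an assumed stand-in for one
    send/receive round trip. It must raise WorkerDied on failure.
    """
    pending = deque(tasks)
    alive = list(workers)
    results = {}
    while pending:
        if not alive:
            raise RuntimeError("all workers have failed")
        task = pending.popleft()
        worker = alive[0]
        try:
            results[task] = run_on_worker(worker, task)
            alive.append(alive.pop(0))   # rotate for simple round-robin
        except WorkerDied:
            alive.pop(0)                 # stop using the failed worker
            pending.appendleft(task)     # retry the same task elsewhere
    return results


def flaky(worker, task):
    """Example worker function: worker 2 always fails, others square the task."""
    if worker == 2:
        raise WorkerDied(worker)
    return task * task


if __name__ == "__main__":
    print(run_master([1, 2, 3, 4], [1, 2, 3], flaky))
```

In a real MPI program the hard part is getting the failure reported to the
master at all, which depends on the implementation and on the error handler in
use. The sketch only shows what the master might do once it has that
information.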
Anthony Skjellum, PhD, President (tony at mpi-softtech.com)
MPI Software Technology, Inc., Ste. 33, 101 S. Lafayette, Starkville, MS 39759
+1-(662)320-4300 x15; FAX: +1-(662)320-4301; http://www.mpi-softtech.com
"Best-of-breed Software for Beowulf and Easy-to-Own Commercial Clusters."
On Mon, 5 Feb 2001 Carl_Notfors at vdgc.com.sg wrote:
> Our computational model is quite simple. We have a master node and a
> number of slave nodes. All communication is between the master and the
> slaves, i.e. no internode communication, so all communication is done with
> MPI_Send and MPI_Recv (we are using LAM/MPI).
> The problem with MPI is that there is no fault tolerance: if a slave node
> "dies" the whole process goes down. According to the LAM documentation it
> should be possible to achieve some fault tolerance, but we have not yet
> tried this.
> Has anyone got this working? Is there fault tolerance in any
> other MPI implementations? Would it be better to use PVM if you want fault
> tolerance?