updating the Linux kernel
David Lombard
david.lombard at mscsoftware.com
Mon Jun 12 08:11:08 PDT 2000
Crutcher Dunnavant wrote:
>
> Now, I might completely miss something here, but shouldn't all *distributed*
> parallel programs assume that a node may not return? After all, what do you
> assume about hardware failures? ...
Um, no.
It all depends upon the software. PVM does provide the ability to
recover from a node failure, while an MPI program will just tank.
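
To make "recover from a node failure" concrete, here is a minimal sketch of the bookkeeping a master process has to do once the message layer tells it a worker is gone (in PVM that notification comes from pvm_notify). This is a plain Python simulation, not PVM code: the function name, the `failing` set, and the round-robin assignment are all assumptions made up for illustration.

```python
from collections import deque


def run_with_reassignment(tasks, workers, failing, work_fn):
    """Simulate a master that requeues a task when its worker dies.

    `failing` is a hypothetical set of worker names that die the first
    time they are handed a task; it stands in for a real failure
    notification from the message-passing layer.
    """
    pending = deque(enumerate(tasks))   # (task_id, payload) still to do
    alive = deque(workers)              # workers believed to be up
    results = {}

    while pending:
        if not alive:
            raise RuntimeError("all workers lost; job cannot complete")

        task_id, payload = pending.popleft()
        worker = alive[0]
        alive.rotate(-1)                # simple round-robin assignment

        if worker in failing:
            # The node died: drop it and put the lost task back in the queue.
            alive.remove(worker)
            pending.append((task_id, payload))
            continue

        results[task_id] = work_fn(payload)

    # Return results in the original task order.
    return [results[i] for i in range(len(tasks))]
```

The point of the sketch is only that the application has to keep its own record of outstanding work; the library reports the failure but does not redo the task for you.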
> ... So, while it may not be a *good* way to do it,
> in a properly parallelized application, shouldn't you be able to take down any
> random node other than the job allocation node, AT ANY TIME, and have that job
> reallocated? (Yeah, you lose the local work, but those tasks should be
> checkpointed frequently)...
As for checkpointing, that too is an "it depends" answer.
Application-level checkpointing may be available to varying degrees --
it can be a non-trivial task. System-level checkpointing generally
can't handle sockets, and that rules out both PVM and MPI.
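
For what application-level checkpointing looks like in the simplest case, here is a hedged sketch in Python: a loop that periodically writes its own state to disk and picks up from the last saved state on restart. The file layout, the function names, and the `stop_after` parameter (which just simulates a crash) are assumptions for illustration. A real parallel code also has to checkpoint a consistent state across all of its tasks, which is where the non-trivial part comes in.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint,
    so a crash mid-write never leaves a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return None


def sum_of_squares(n, path, every=100, stop_after=None):
    """Sum i*i for i in range(n), checkpointing every `every` iterations.

    `stop_after` is a hypothetical knob that aborts the run after that
    many iterations, standing in for a node going down.
    """
    state = load_checkpoint(path) or {"i": 0, "total": 0}
    i, total = state["i"], state["total"]
    steps = 0
    while i < n:
        if stop_after is not None and steps >= stop_after:
            return None                 # simulated crash: work since the last checkpoint is lost
        total += i * i
        i += 1
        steps += 1
        if i % every == 0:
            save_checkpoint(path, {"i": i, "total": total})
    return total
```

Running it once with `stop_after` set and then again without it resumes from the last checkpoint rather than from zero; only the iterations after that checkpoint are repeated.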
--
David N. Lombard
MSC.Software