What happens with a failed node? (Scyld)

Wed Feb 6 21:56:30 PST 2002

On Wed, 06 Feb 2002, Tony Stocker wrote:

> 
> Hi All,
> 
> Quick question.  What happens if a compute node fails or loses its network
> connectivity while processing something (non-parallelized)?  How long does
> it take the host node to realize something is wrong?  What does the host 
> node do then?  Does it send mail reporting which node went down and what was 
> running on it at the time?

The master node continually pings all the slave nodes, and the slave
nodes continually ping the master node.  If the master node doesn't get
a ping response in 30 seconds, it automatically sets the node to down (it
doesn't tell the node this, but changes its internal representation of
the node's state).

If the slave node doesn't get a ping response in 30 seconds, the node
will reboot.  On boot it will then try to connect to the master again,
and if there are problems it will keep rebooting until it can connect.
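
To make the master-side bookkeeping concrete, here is a minimal sketch in
Python.  It is not Scyld's code: the class name NodeTable, its methods, and
the rule that a fresh reply marks a node up again are all assumptions made
for illustration.  Only the 30-second timeout comes from the description
above.

```python
import time

PING_TIMEOUT = 30.0  # seconds; the only value taken from the text above


class NodeTable:
    """Hypothetical master-side record of node state (not Scyld's code)."""

    def __init__(self, nodes, now=None):
        now = time.monotonic() if now is None else now
        self.last_reply = {n: now for n in nodes}  # time of last ping reply
        self.state = {n: "up" for n in nodes}      # master's internal view

    def record_reply(self, node, now=None):
        """Note a ping reply; assumed here to mark the node up again."""
        now = time.monotonic() if now is None else now
        self.last_reply[node] = now
        self.state[node] = "up"

    def sweep(self, now=None):
        """Mark nodes silent for more than PING_TIMEOUT as down.

        Only the master's own table changes; the node is not told.
        Returns the nodes that were newly marked down.
        """
        now = time.monotonic() if now is None else now
        newly_down = []
        for node, seen in self.last_reply.items():
            if self.state[node] == "up" and now - seen > PING_TIMEOUT:
                self.state[node] = "down"
                newly_down.append(node)
        return newly_down
```

A monitoring loop would call record_reply() whenever a ping response
arrives and sweep() every few seconds.  The slave side would be the
mirror image, with a reboot in place of the state change.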

No mail is sent; the cluster simply tries to repair itself automatically.
> 
> What about if the node was running a parallelized program that is also
> being run by other elements of the cluster?  What are the node-fault
> procedures/setup in that case?

The status of your parallelized program depends on what you're using to
parallelize it.  The implementation of MPI that we ship (mpich) will end
up falling over if one of its nodes disappears under its feet, and as
far as I know, so will all other implementations.  It is for this reason
that we recommend users with long-running programs have their programs
regularly checkpoint, so that in the unlikely event that there is a
problem, minimal work will be lost.
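
As a rough illustration of regular checkpointing, here is a sketch in
Python.  Everything in it is an assumption for the example: the function
names, the JSON file format, the toy "work" loop, and the interval of 100
steps.  A real MPI job would typically have each rank save its own state,
which is not shown here.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint,
    so a crash mid-write never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic when both paths are on one filesystem


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, every=100):
    """Toy long-running job: sums 0..total_steps-1, resuming if possible."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        state["acc"] += state["step"]      # stand-in for real work
        state["step"] += 1
        if state["step"] % every == 0:     # checkpoint at regular intervals
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["acc"]
```

If the job dies and is restarted with the same checkpoint path, it picks
up from the last saved step, so at most `every` steps of work are redone.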

Sean