What happens with a failed node? (Scyld)
Tony Stocker akostocker at hotmail.com
Thu Feb 7 08:37:58 PST 2002
- Previous message: Falt Tolerant MPICH
- Next message: What happens with a failed node? (Scyld)
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Sean,

Okay, if there's no mail sent and the slave node keeps rebooting itself (for instance, if its network connection is down), or if the slave node never comes back up (it died), what happens to the process that was running on it? Does the host node reassign it to another slave node after some period of time? What becomes of this "lost" process?

If no information is provided to users that their process was lost when the node went down, and the host node never reassigns it to be completed, then it's conceivable that an entire string of processing could be brought to a halt because of this silent failure. Since the host node maintains the master process list, it should be aware that a process was running on a node that it now lists as down. What happens to the representation of this process in the table?

Thanks for the help,

-Tony

>From: Sean Dilda <agrajag at scyld.com>
>To: tonystocker at mail.com
>CC: beowulf at beowulf.org
>Subject: Re: What happens with a failed node? (Scyld)
>Date: Thu, 7 Feb 2002 00:56:30 -0500
>
>On Wed, 06 Feb 2002, Tony Stocker wrote:
>
> > Hi All,
> >
> > Quick question. What happens if a compute node fails or loses its
> > network connectivity while processing something (non-parallelized)?
> > How long does it take the host node to realize something is wrong?
> > What does the host node do then? Does it send mail reporting which
> > node went down and what was running on it at the time?
>
>The master node continually pings all the slave nodes, and the slave
>nodes continually ping the master node. If the master node doesn't get
>a ping response in 30 seconds, it automatically sets the node to down
>(it doesn't tell the node this, but changes its internal representation
>of the node's state).
>
>If the slave node doesn't get a ping response in 30 seconds, the node
>will reboot. On boot it will then try to connect to the master again,
>and if there are problems it will keep rebooting until it can connect.
>
>There is no mail sent, just the cluster trying to auto-fix itself.
>
> > What about if the node was running a parallelized program that is
> > also being run by other elements of the cluster? What are the
> > node-fault procedures/setup in that case?
>
>The status of your parallelized program depends on what you're using to
>parallelize it. The implementation of MPI that we ship (mpich) will end
>up falling over if one of its nodes disappears under its feet, and as
>far as I know, so will all other implementations. It is for this reason
>that we recommend users with long-running programs have their programs
>regularly checkpoint, so that in the unlikely event that there is a
>problem, minimal work will be lost.
>
>
>Sean
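As an illustration of the checkpointing recommended in the quoted reply, here is a minimal sketch of a long-running loop that saves its progress periodically and resumes from the last save after a crash. It is not anything Scyld or mpich provides: the file name `job.ckpt`, the functions `load_checkpoint`, `save_checkpoint`, and `run`, and the `fail_at` parameter (which only simulates a node dying) are all hypothetical. It also assumes the checkpoint file lives on storage that survives the node failure, such as a filesystem exported by the master.

```python
import json
import os
import tempfile

CHECKPOINT = "job.ckpt"  # hypothetical file name; assumed to be on storage that outlives the node


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # a crash never leaves a half-written checkpoint


def run(path, steps, every=10, fail_at=None):
    """Do `steps` units of work, checkpointing after every `every` completed steps."""
    state = load_checkpoint(path)
    for step in range(state["next_step"], steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += step  # stand-in for the real computation
        state["next_step"] = step + 1
        if state["next_step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the job dies partway through, rerunning `run(CHECKPOINT, steps)` picks up at the last saved step, so at most `every` steps of work are repeated. The rename-based write is a design choice to keep a failure during the save itself from corrupting the previous checkpoint.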
More information about the Beowulf mailing list
