[Beowulf] Kill zombies after a parallel run
David Mathog mathog at caltech.edu
Tue May 2 14:23:31 PDT 2006
- Previous message: [Beowulf] 512 nodes Myrinet cluster Challanges
- Next message: [Beowulf] Kill zombies after a parallel run
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
mg <mg.mailing-list at laposte.net> wrote:

> I use MPICH-1.2.5.2 to generate and run an FEM parallel application.
>
> During a parallel run, one process can crash, leaving the other
> processes run and OS commands have to be used for kill these zombies.
> So, does someone have a solution to avoid zombies after a failed
> parallel run: can the crashed process kill the other processes?

I think what you're saying is one compute node dies and this causes the master and processes on the other nodes to run forever, or at least not exit even if they have stopped using CPU. Or are you really asking about processes that show up in the Unix "Zombie" state?

Assuming the former, I've only ever dealt with this on PVM but I'm assuming MPICH has similar functions. In general you want the slaves to send an "I'm DONE" message before they exit, at which point the parallel system generally delivers a "Slave Exited" message. If the master sees the EXITED message without having first seen the DONE message, the master node can then send all the other nodes a "Time to Die" message so that they can clean up as best they can and then exit. For this to work, of course, the slaves have to check for this message once in a while.

The master could also send a direct "kill" message to those processes, causing them to go away, but probably leaving scratch files and other detritus around.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
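As a rough illustration of the DONE / EXITED / "Time to Die" bookkeeping described above, here is a minimal Python sketch. It models only the decision logic, not real message passing: the names `master_decide`, `worker_loop`, `poll_for_die` and `cleanup`, and the message constants, are hypothetical and are not PVM or MPICH calls. In a real program the events would come from whatever exit notification and receive functions the parallel library provides.

```python
# Illustrative message kinds only; these are not PVM or MPICH identifiers.
DONE, EXITED, DIE = "DONE", "EXITED", "TIME_TO_DIE"


def master_decide(worker_ids, events):
    """Consume (kind, worker_id) events in arrival order.

    Returns the list of (DIE, worker_id) messages the master should send.
    The list is empty if every worker reported DONE before it exited.
    """
    alive = set(worker_ids)  # workers whose EXITED has not been seen yet
    done = set()             # workers that have reported DONE
    for kind, wid in events:
        if kind == DONE:
            done.add(wid)
        elif kind == EXITED:
            alive.discard(wid)
            if wid not in done:
                # Abnormal exit: ask every surviving worker to clean up and quit.
                return [(DIE, w) for w in sorted(alive)]
    return []


def worker_loop(steps, poll_for_die, cleanup):
    """Worker side: check for the DIE message between units of work."""
    for _ in range(steps):
        if poll_for_die():
            cleanup()  # e.g. remove scratch files before exiting
            return "aborted"
        # ... one unit of real work would go here ...
    return "done"
```

For example, `master_decide([1, 2, 3], [(DONE, 1), (EXITED, 1), (EXITED, 2)])` reports that worker 3 should be told to die, because worker 2 exited without first saying it was done. How often the worker polls is a trade-off: more frequent checks mean faster shutdown after a crash, at the cost of some overhead in the work loop.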
