[Beowulf] Kill zombies after a parallel run
Toon Knapen toon.knapen at fft.be
Mon May 8 02:07:30 PDT 2006
- Previous message: [Beowulf] Kill zombies after a parallel run
- Next message: [Beowulf] 512 nodes Myrinet cluster Challanges
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
David Mathog wrote:
> mg <mg.mailing-list at laposte.net> wrote:
>> I use MPICH-1.2.5.2 to generate and run an FEM parallel application.
>>
>> During a parallel run, one process can crash, leaving the other
>> processes run and OS commands have to be used for kill these zombies.
>> So, does someone have a solution to avoid zombies after a failed
>> parallel run: can the crashed process kill the other processes?
>
> I think what you're saying is one compute node dies and this causes the
> master and processes on the other nodes to run forever, or at least
> not exit even if they have stopped using CPU. Or are you really
> asking about processes that show up in the unix "Zombie" state?
>
> Assuming the former,
<snip>

I think what the OP is asking is how to kill (automagically) all
processes in a parallel run once one process has crashed (due to a
segmentation fault or something similar).

Generally, if one process (out of the whole bunch of processes) crashes,
all other processes will wait forever from the moment they try to
communicate with the crashed process, or at MPI_Finalize.

So how can one kill all remaining processes?

toon
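One way to picture the behaviour being asked for is a small supervisor that starts all the processes, watches them, and kills the survivors as soon as any one exits with a non-zero status. The sketch below is only an illustration under stated assumptions: it uses plain local child processes as stand-ins for MPI ranks, the helper name run_and_reap and its poll_interval parameter are hypothetical, and it is not how MPICH's launcher works. Across several nodes the kill would also have to reach the remote hosts. Inside an MPI program itself, the standard call for tearing down the whole job is MPI_Abort(MPI_COMM_WORLD, errorcode), provided the failing process still gets the chance to call it.

```python
import subprocess
import sys
import time


def run_and_reap(commands, poll_interval=0.05):
    """Start every command; if any exits non-zero, kill the rest.

    Returns the exit codes in the same order as `commands`.
    (Hypothetical helper: local processes only, no remote nodes.)
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    try:
        while True:
            codes = [p.poll() for p in procs]
            if any(c is not None and c != 0 for c in codes):
                break  # one process failed: stop waiting for the others
            if all(c is not None for c in codes):
                break  # every process finished cleanly
            time.sleep(poll_interval)
    finally:
        # Kill whatever is still running, then collect every exit status
        # so that no child is left behind in the unix "Zombie" state.
        for p in procs:
            if p.poll() is None:
                p.kill()
        for p in procs:
            p.wait()
    return [p.returncode for p in procs]


if __name__ == "__main__":
    # Stand-ins for two ranks: one crashes, the other would block for an hour.
    crasher = [sys.executable, "-c", "import sys; sys.exit(3)"]
    hanger = [sys.executable, "-c", "import time; time.sleep(3600)"]
    print(run_and_reap([crasher, hanger]))
```

Polling keeps the sketch short and portable. The final wait() on every child is what prevents real zombie entries in the process table, which is the other reading of the question raised in the quoted reply.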
