[Beowulf] Kill zombies after a parallel run

David Kewley kewley at gps.caltech.edu
Tue May 2 13:02:46 PDT 2006


I don't have a solution for your case, but here's an idea: MPICH-GM (MPICH 
for the Myrinet GM protocol) has an option to mpirun.ch_gm that would do 
what you want, if you were running Myrinet/GM:

     --gm-kill <n>   Kill all processes <n> seconds after the first exits.

Other than that, a resource manager may do what you want -- our resource 
manager, LSF, does this for us.  It even mostly works. :)
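
For what it's worth, the behaviour that --gm-kill describes (kill everything
some seconds after the first process exits) can be sketched as a small
wrapper. This is only an illustration in Python, not MPICH code: the name
run_and_reap and its parameters are made up for this example, and it only
handles processes it started itself on the local node. Processes started on
remote nodes by mpirun would need a resource manager or a remote kill
mechanism instead.

```python
import subprocess
import sys
import time


def run_and_reap(commands, grace_seconds=0.0, poll_interval=0.05):
    """Start every command, wait for the first exit, then kill the rest.

    Hypothetical helper: `commands` is a list of argv lists. Returns the
    exit codes in the same order as `commands`.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    try:
        # Poll until any one process has exited (cleanly or by crashing).
        while all(p.poll() is None for p in procs):
            time.sleep(poll_interval)

        # Give the survivors a grace period to finish on their own.
        deadline = time.monotonic() + grace_seconds
        while time.monotonic() < deadline and any(p.poll() is None for p in procs):
            time.sleep(poll_interval)
    finally:
        # Kill whatever is still running, then collect every exit status
        # so that no child is left unreaped.
        for p in procs:
            if p.poll() is None:
                p.kill()
        for p in procs:
            p.wait()
    return [p.returncode for p in procs]


if __name__ == "__main__":
    # Stand-ins for two ranks: one "crashes" at once, one would run for a minute.
    crasher = [sys.executable, "-c", "import sys; sys.exit(1)"]
    survivor = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_and_reap([crasher, survivor], grace_seconds=0.5))
```

The final wait() on every child is what keeps true zombie (defunct) entries
out of the process table; the kill() is what stops the orphaned ranks that
would otherwise keep running.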

David

On Tuesday 02 May 2006 00:49, mg wrote:
> Hi all,
>
> I use MPICH-1.2.5.2 to generate and run an FEM parallel application.
>
> During a parallel run, one process can crash, leaving the other
> processes running, and OS commands have to be used to kill these
> zombies. So, does anyone have a solution to avoid zombies after a
> failed parallel run? Can the crashed process kill the other processes?
>
> Thanks,
> Mathieu


