[Beowulf] mpirun issue

Reuti reuti at staff.uni-marburg.de
Tue Oct 21 05:53:26 PDT 2008


Hi,

On 21.10.2008 at 01:18, Luis Alejandro Del Castillo Riley wrote:

> hi fellows i have a cluster with 1 master 10 nodes with intel Xeon  
> Quad core.
> Fedora core 6
> PGI 7.0-7
> mpich 1.2.5.2

The last version of MPICH, from 2005, is 1.2.7p1. For newer
installations I would suggest looking into Open MPI.

> machines.x86_64 with a 10 node names

Does this mean it lists only the 10 nodes, without the master?

> when i try to run:
>  mpirun -v -arch x86_64  -keep_pg -nolocal -np 9 mm5.mpp
>
> i had no error but when a run with
>  mpirun -v -arch x86_64  -keep_pg -nolocal -np 10 mm5.mpp
>
> they take around 40 min to send me and error :
> bm_list_4667: (1526.781250) wakeup_slave: unable to interrupt slave  
> 0 pid 4666

Since it takes that long, I would suggest logging in to all nodes and checking with:

$ ps -e f

(f without a leading -) the distribution and startup of the processes. Is it
doing nothing for 40 minutes, or running fine until it crashes?
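
To avoid logging in to each node by hand, the check could be looped over the machines file. This is only a rough sketch: it assumes passwordless ssh to every node and a machines file with one host per line (optionally written as "host:ncpus"). The function names and the REMOTE_SH variable are made up for illustration and are not part of MPICH.

```shell
#!/bin/sh
# Sketch: show the process tree on every host listed in a machines file.
# Assumptions (not from the original mail): passwordless ssh to each node,
# one host per line, optional ":ncpus" suffix, "#" starts a comment.

# Print the host names from a machines file, skipping blanks and comments.
list_hosts() {
    sed -e 's/#.*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^[[:space:]]*$/d' \
        -e 's/:.*//' "$1"
}

# Run "ps -e f" on each host. REMOTE_SH is a hypothetical override,
# e.g. REMOTE_SH=rsh if the cluster uses rsh instead of ssh.
check_nodes() {
    for host in $(list_hosts "$1"); do
        echo "=== $host ==="
        ${REMOTE_SH:-ssh} "$host" ps -e f
    done
}

# Example call (adjust the path to your installation):
#   check_nodes /path/to/machines.x86_64
```

Running it once right after mpirun starts and again a few minutes later should show whether the mm5.mpp processes were started on every node and whether they are using any CPU time.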

-- Reuti


