[Beowulf] mpirun issue
reuti at staff.uni-marburg.de
Tue Oct 21 05:53:26 PDT 2008
On 21.10.2008 at 01:18, Luis Alejandro Del Castillo Riley wrote:
> Hi fellows, I have a cluster with 1 master and 10 nodes with Intel Xeon
> quad-core processors.
> Fedora Core 6
> PGI 7.0-7
> mpich 126.96.36.199
The last version of MPICH, from 2005, is 1.2.7p1. For new
installations I would suggest looking into Open MPI.
> machines.x86_64 with 10 node names
Does that mean it lists only the 10 nodes?
> When I try to run:
> mpirun -v -arch x86_64 -keep_pg -nolocal -np 9 mm5.mpp
> I get no error, but when I run with
> mpirun -v -arch x86_64 -keep_pg -nolocal -np 10 mm5.mpp
> it takes around 40 minutes to send me an error:
> bm_list_4667: (1526.781250) wakeup_slave: unable to interrupt slave
> 0 pid 4666
Since it takes that long, I would suggest logging in to all nodes and
checking the distribution and startup of the processes with:
$ ps -e f
(f without a leading -). Is it doing nothing for 40 minutes, or running
fine until it crashes?
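
If logging in to each node by hand is tedious, the check could be scripted roughly as follows. This is only a sketch under assumptions: it takes the machines.x86_64 file and the mm5.mpp binary name from the original post, assumes passwordless ssh to every node, and assumes the machines file has one host per line with an optional ":ncpus" suffix. The function names list_hosts and check_nodes are made up for illustration.

```shell
#!/bin/sh
# Sketch: run "ps -e f" on every node in the machines file and show the
# lines that mention the MPI program. All defaults below are assumptions.
MACHINES=${MACHINES:-machines.x86_64}   # machines file named in the post
PROG=${PROG:-mm5.mpp}                   # program name to look for
RSH=${RSH:-ssh}                         # remote shell, assumed passwordless

# Print one hostname per line: drop comments and blank lines, and strip
# an optional ":ncpus" suffix.
list_hosts() {
    sed -e 's/#.*//' -e 's/[[:space:]]*$//' -e '/^[[:space:]]*$/d' \
        -e 's/:.*//' "$1"
}

# Show the matching process-tree lines from each node. grep runs locally,
# so it never shows up in the remote ps output.
check_nodes() {
    for host in $(list_hosts "$1"); do
        echo "== $host =="
        $RSH "$host" ps -e f < /dev/null | grep -F -- "$PROG" \
            || echo "(no $PROG process found)"
    done
}

# Only run the check when the machines file is actually present.
if [ -r "$MACHINES" ]; then
    check_nodes "$MACHINES"
fi
```

Run it from the master while the -np 10 job appears to hang; a node that reports no mm5.mpp process, or fewer than expected, would be the first place to look.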