[Beowulf] LAM trouble

Jeffrey B. Layton laytonjb at charter.net
Tue Apr 11 15:14:26 PDT 2006


Howdy!

   I apologize for posting this problem here, but I tried the LAM
list and didn't hear anything, so I thought I would cast my net
a bit wider in search of help.
   I'm having trouble starting an MPI code (NPB bt) that was
built with PGI 6.1 and LAM-7.1.2. lamboot itself finishes, but I
get the following RPI mismatch error when mpirun starts the code:
 

n-1<24201> ssi:boot:base:linear: booting n0 (n2004)
n-1<24201> ssi:boot:base:linear: booting n1 (n2005)
n-1<24201> ssi:boot:base:linear: booting n2 (n2006)
n-1<24201> ssi:boot:base:linear: booting n3 (n2007)
n-1<24201> ssi:boot:base:linear: booting n4 (n2008)
n-1<24201> ssi:boot:base:linear: booting n5 (n2009)
n-1<24201> ssi:boot:base:linear: booting n6 (n2010)
n-1<24201> ssi:boot:base:linear: booting n7 (n2011)
n-1<24201> ssi:boot:base:linear: finished
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun chose a different RPI than its peers.  For example, at least
the following two processes mismatched in their RPI selections:

    MPI_COMM_WORLD rank 0: tcp (v7.1.0)
    MPI_COMM_WORLD rank 3: usysv (v7.1.0)

All MPI processes must choose the same RPI module and version when
they start.  Check your SSI settings and/or the local environment
variables on each node.



   I'm using PBS to start the job and here are the relevant parts
of the script:

NET=tcp
lamboot -b -v -ssh rpi $NET $PBS_NODEFILE
mpirun -O -v C ./${EXE} >>  ${OUTFILE}
lamhalt


where $EXE and $OUTFILE are defined appropriately in the
script.
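   One variant that may be worth trying is to name the RPI on the
mpirun line rather than on the lamboot line, since the error is
reported by the processes that mpirun started. The sketch below
assumes that mpirun in LAM 7.x accepts "-ssi rpi <module>" as
described in the LAM documentation; it is untested on this cluster.
The helper rpi_mpirun_cmd and the fallback executable name "bt" are
hypothetical and exist only for illustration. The helper prints the
command line instead of running it, so it can be checked before it
goes into the PBS script.

```shell
#!/bin/sh
# Sketch only: build an mpirun command line that requests one RPI
# module for every rank. Nothing is launched here.
NET=tcp

# Hypothetical helper (not part of LAM).
#   $1 = RPI module name, e.g. tcp
#   $2 = executable name
rpi_mpirun_cmd() {
    # Assumption: LAM 7.x mpirun takes "-ssi rpi <module>".
    printf 'mpirun -ssi rpi %s -O -v C ./%s\n' "$1" "$2"
}

# "bt" is only a placeholder used when $EXE is not set.
rpi_mpirun_cmd "$NET" "${EXE:-bt}"
```

   If the printed line looks right, it would take the place of the
mpirun line in the script, with the output redirection added back.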
   Does anyone have any ideas?

TIA!

Jeff


