[Beowulf] LAM trouble

Jeffrey B. Layton laytonjb at charter.net
Tue Apr 11 15:14:26 PDT 2006


Howdy!

   I apologize for posting this problem here, but I tried the LAM
list and didn't hear anything, so I thought I would cast my net
a bit wider in search of help.
   I'm having trouble starting an MPI code (NPB bt) that was
built with PGI 6.1 and LAM-7.1.2. When I start the job I get the
following messages (the verbose lamboot output, followed by the
error from mpirun):

n-1<24201> ssi:boot:base:linear: booting n0 (n2004)
n-1<24201> ssi:boot:base:linear: booting n1 (n2005)
n-1<24201> ssi:boot:base:linear: booting n2 (n2006)
n-1<24201> ssi:boot:base:linear: booting n3 (n2007)
n-1<24201> ssi:boot:base:linear: booting n4 (n2008)
n-1<24201> ssi:boot:base:linear: booting n5 (n2009)
n-1<24201> ssi:boot:base:linear: booting n6 (n2010)
n-1<24201> ssi:boot:base:linear: booting n7 (n2011)
n-1<24201> ssi:boot:base:linear: finished
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun chose a different RPI than its peers.  For example, at least
the following two processes mismatched in their RPI selections:

    MPI_COMM_WORLD rank 0: tcp (v7.1.0)
    MPI_COMM_WORLD rank 3: usysv (v7.1.0)

All MPI processes must choose the same RPI module and version when
they start.  Check your SSI settings and/or the local environment
variables on each node.



   I'm using PBS to start the job and here are the relevant parts
of the script:

NET=tcp
lamboot -b -v -ssh rpi $NET $PBS_NODEFILE
mpirun -O -v C ./${EXE} >>  ${OUTFILE}
lamhalt


where $EXE and $OUTFILE are defined appropriately in the
script.
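
The error message above asks for a check of the SSI settings and environment variables on each node. Below is a minimal sketch of that check. It assumes passwordless ssh to every node and that LAM reads SSI parameters from environment variables named LAM_MPI_SSI_<name> (for example LAM_MPI_SSI_rpi). The function names unique_hosts and check_ssi_env are hypothetical helpers, not part of LAM or PBS.

```shell
#!/bin/sh
# Hypothetical helper: print each distinct host in a PBS node file once.
# PBS writes one line per allocated CPU, so host names repeat.
unique_hosts() {
    sort -u "$1"
}

# Hypothetical helper: show every LAM_MPI_SSI_* variable as seen by a
# non-interactive ssh login on each node. A node whose output differs
# from its peers (e.g. a different LAM_MPI_SSI_rpi) is a suspect.
check_ssi_env() {
    for host in $(unique_hosts "$1"); do
        echo "== $host"
        ssh "$host" 'env | grep "^LAM_MPI_SSI_" | sort'
    done
}

# Only contact the nodes when running inside a PBS job.
if [ -n "${PBS_NODEFILE:-}" ]; then
    check_ssi_env "$PBS_NODEFILE"
fi
```

Run from inside the PBS job so that $PBS_NODEFILE is set. If one node reports an RPI setting that differs from the others (say usysv where the rest show tcp or nothing), its shell startup files would be the first place to look.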
   Does anyone have any ideas?

TIA!

Jeff


