[Beowulf] LAM_MPI problem on PBS

Reuti reuti at staff.uni-marburg.de
Tue Aug 23 09:39:46 PDT 2005


Hi,

You don't need a hostfile of your own at all, as PBS will select the 
nodes for your job. So the question is still: did you compile LAM/MPI to 
honor the TM interface of PBS? Please also have a look here:

http://www.lam-mpi.org/faq/category12.php3#question3
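
A rough way to check this yourself: laminfo lists the SSI modules that were 
compiled in, and a "tm" entry among the boot modules would indicate TM 
support. The sketch below is only an illustration. The helper name 
has_tm_boot is made up, and the exact layout of the laminfo output is an 
assumption, so the grep pattern may need adjusting for your installation.

```shell
#!/bin/sh
# Hypothetical helper: reads laminfo-style text on stdin and succeeds
# if a "boot" SSI module named "tm" is listed.
# Assumption: lines look roughly like "SSI boot: tm (API ..., Module ...)".
has_tm_boot() {
    grep -Eq 'boot:[[:space:]]*tm([[:space:]]|$)'
}

# Only call laminfo if it is actually installed on this machine.
if command -v laminfo >/dev/null 2>&1; then
    if laminfo | has_tm_boot; then
        echo "tm boot module found"
    else
        echo "no tm boot module; LAM was probably built without TM support"
    fi
fi
```

If the tm module is missing, LAM would have to be rebuilt against the PBS 
libraries, as described in the FAQ entry above.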

Onur Destanoğlu wrote:
> Hi,
> 
> this is my PBS script;
> #PBS -N firstscp
> #PBS -l nodes=1:ppn=2
> #PBS -l mem=4mb
> #PBS -l walltime=1:00:00
> #PBS -V
> #PBS -m bea
> PATH=/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin
> export PATH
> lamboot -v
> mpirun -v C first
> lamhalt -v
> 
> my system's /home directory is NFS-shared between all nodes, so there
> is only one hosts file in user niyazi's home directory. this is the
> hosts file;
> 
> node00
> node01
> node02
> node03
> node04
> node05
> 
> node00 is not an execution node; it only runs pbs_server and pbs_sched.
> 
> when I run the script I encounter some problems like these;
> 
> one error file;
> 
> n-1<2289> ssi:boot:base:linear: booting n0 (localhost)
> n-1<2289> ssi:boot:base:linear: finished
> 
> one output file:
> 
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
> 
> 2294 first running on n0 (o)
> Hello, I am 0 of the nodes : 1 

What are you printing with 0 and 1 here? Rank and total number of 
ranks?

> 
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
> 
> Shutting down LAM
> hreq: received HALT_ACK from n0 (bee01.bee-hive)
> LAM halted
> 
> so what is going wrong?

It doesn't look so wrong, but it is executed outside of PBS control. It 
seems the boot schema started just one daemon (n0 on localhost), and with 
the option C mpirun uses only that one.
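
If LAM has no TM support, a commonly used fallback is to hand the node file 
that PBS writes for each job to lamboot as the boot schema. A minimal 
sketch, assuming your PBS sets PBS_NODEFILE inside the job (count_hosts is a 
made-up helper, only there for the log message). Note that the daemons are 
then still started by LAM's own remote shell mechanism and not through TM, 
so they remain outside of PBS control.

```shell
#!/bin/sh
#PBS -N firstscp
#PBS -l nodes=1:ppn=2

# Hypothetical helper: number of distinct hosts in a PBS-style node file
# (PBS lists a host once per allocated processor slot).
count_hosts() {
    sort -u "$1" | wc -l | tr -d ' '
}

# PBS_NODEFILE is only set inside a running job; outside a job, do nothing.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "PBS allocated $(count_hosts "$PBS_NODEFILE") host(s)"
    lamboot -v "$PBS_NODEFILE"   # boot schema = the nodes chosen by PBS
    mpirun -v C first            # C: one process per CPU in the LAM universe
    lamhalt -v
fi
```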

-- Reuti



