mpich question

Jeffery A. White j.a.white at larc.nasa.gov
Thu Sep 13 10:46:56 PDT 2001


Dear group,

  I am trying to figure out how to use the -p4pg option in mpirun and I
am experiencing some difficulties.

  My cluster configuration is as follows:

node0 :
machine : Dual processor Supermicro Super 370DLE
cpu     : 1 GHz Pentium 3 
O.S.    : Redhat Linux 7.1
kernel  : 2.4.2-2smp
mpich   : 1.2.1

nodes 1->18 :
machine : Compaq xp1000
cpu     : 667 MHz DEC alpha 21264
O.S.    : Redhat Linux 7.0
kernel  : 2.4.2
mpich   : 1.2.1

nodes 19->34 :
machine : Microway Screamer
cpu     : 667 MHz DEC alpha 21164
O.S.    : Redhat Linux 7.0
kernel  : 2.4.2
mpich   : 1.2.1

The heterogeneous nature of the cluster has led me to migrate from the
-machinefile option to the -p4pg option. I have been trying to get a
2-processor job to run by submitting the mpirun command from node0
(-nolocal is specified) and using either nodes 1 and 2 or nodes 2 and
3. With the -machinefile approach I am able to run on any homogeneous
combination of nodes. However, with the -p4pg approach I have not been
able to run unless node1 is the mpi master node. As long as node1 is
the master, I can use any one of nodes 2 through 18 as the 2nd
processor. The following four runs illustrate what works as well as
what doesn't (and the resulting error message): runs 1, 2 and 3
succeeded and run 4 failed.
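
For context: the p4 procgroup format, as I understand it from the mpich
documentation, has one line per host of the form
"hostname nprocs full_path_to_executable", with the first line naming
the master node and a count of 0 there meaning no processes beyond the
master on that node. This is what makes -p4pg attractive for a
heterogeneous cluster, since each host can point at its own build. A
sketch of the kind of file I ultimately want, using a hypothetical
DEC_21164 build directory alongside my existing DEC_21264 tree, would
be:

node1  0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
node2  1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
node19 1 /home0/jawhite/Vulcan/DEC_21164/Ver_4.3/Executable/VULCAN_solver

For now, though, I am just trying to get a simple 2-processor p4pg run
working, as described below.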

1) When submitting from node0 using the -machinefile option to run on
   nodes 1 and 2 with mpirun configured as:

mpirun -v -keep_pg -nolocal -np 2 -machinefile vulcan.hosts
/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

the machine file vulcan.hosts contains:

node1
node2

the PIXXXX file created contains:

node1 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
node2 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver

and the -v option reports

running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver on 2 LINUX ch_p4 processors
Created /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/PI10802

and the program executes successfully

2) When submitting from node0 using the -p4pg option to run on
   nodes 1 and 2 with mpirun configured as:

mpirun -v -nolocal -p4pg vulcan.hosts
/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

the p4pg file vulcan.hosts contains:

node1 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
node2 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

and the -v option reports

running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver on 1 LINUX ch_p4 processors

and the program executes successfully

3) When submitting from node0 using the -machinefile option to run on
   nodes 2 and 3 with mpirun configured as:

mpirun -v -keep_pg -nolocal -np 2 -machinefile vulcan.hosts
/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

the machine file vulcan.hosts contains:

node2
node3

the PIXXXX file created contains:

node2 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
node3 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver

and the -v option reports

running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver on 2 LINUX ch_p4 processors
Created /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/PI11592

and the program executes successfully

4) When submitting from node0 using the -p4pg option to run on
   nodes 2 and 3 with mpirun configured as:

mpirun -v -nolocal -p4pg vulcan.hosts
/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

the p4pg file vulcan.hosts contains:

node2 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
node3 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver

and the -v option reports

running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver on 1 LINUX ch_p4 processors

and the following error message is generated

rm_10957:  p4_error: rm_start: net_conn_to_listener failed: 34133
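
In case it is relevant to diagnosing this: a sketch of the kind of
connectivity check I can run by hand is below (the node names are from
my cluster, and I am assuming rsh is the remote shell this mpich build
uses):

# from node0, where mpirun is invoked
rsh node2 hostname
rsh node3 hostname

# from node2, the intended master, out to node3 and back to node0
rsh node2 'rsh node3 hostname'
rsh node2 'rsh node0 hostname'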

Thanks for your help,
 
Jeffery A. White
email : j.a.white at larc.nasa.gov
Phone : (757) 864-6882 ; Fax : (757) 864-6243
URL   : http://hapb-www.larc.nasa.gov/~jawhite/




