[Fwd: Problem using -p4pg and procgroup file]
Jeffery A. White j.a.white at larc.nasa.gov
Mon Sep 17 07:15:37 PDT 2001
Dear Group,

I recently sent an email message to the mpich help address and the response I got back was rather unsatisfactory. Basically they told me to RTFM. Considering that I had attached several pages of MPICH debug information that (I think) showed a problem, I was not satisfied with their response. I would greatly appreciate any help or suggestions anyone on the list can provide. To view a complete description of the problem and the mpich debug information, please see the attached file.

Thanks,
Jeff White

Jeffery A. White
email : j.a.white at larc.nasa.gov
Phone : (757) 864-6882 ; Fax : (757) 864-6243
URL : http://hapb-www.larc.nasa.gov/~jawhite/

-------------- next part --------------
To whom it may concern,

I am trying to figure out how to use the -p4pg option in mpirun and I am experiencing some difficulties. My cluster configuration is as follows:

node0 :
  machine : Dual processor Supermicro Super 370DLE
  cpu     : 1 GHz Pentium 3
  O.S.    : Redhat Linux 7.1
  kernel  : 2.4.2-2smp
  mpich   : 1.2.1

nodes 1->18 :
  machine : Compaq xp1000
  cpu     : 667 MHz DEC alpha 21264
  O.S.    : Redhat Linux 7.0
  kernel  : 2.4.2
  mpich   : 1.2.1

nodes 19->34 :
  machine : Microway Screamer
  cpu     : 667 MHz DEC alpha 21164
  O.S.    : Redhat Linux 7.0
  kernel  : 2.4.2
  mpich   : 1.2.1

The heterogeneous nature of the machine has made me migrate from using the -machinefile option to the -p4pg option. I have been trying to get a 2 processor job to run while submitting the mpirun command from node0 (-nolocal is specified) and using either nodes 1 and 2 or nodes 2 and 3. If I use the -machinefile approach I am able to run on any homogeneous combination of nodes. However, if I use the -p4pg approach I have not been able to run unless my mpi master node is node1. As long as node1 is the mpi master node, I can use any one of nodes 2 through 18 as the 2nd processor.

The following 4 runs illustrate what I have gotten to work as well as what doesn't work (and the subsequent error message). Runs 1, 2 and 3 worked and run 4 failed.

1) When submitting from node0 using the -machinefile option to run on nodes 1 and 2 using mpirun configured as:

mpirun -v -keep_pg -nolocal -np 2 -machinefile vulcan.hosts /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver

the machinefile file vulcan.hosts contains:

node1
node2

the PIXXXX file created contains:

node1 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
node2 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver

and the -v option reports

running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver on 2 LINUX ch_p4 processors
Created /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/PI10802

the /var/log/messages file on node0 contains :

no events during this time frame

the /var/log/messages file on node1 contains :

Sep 13 15:49:32 hyprwulf1 xinetd[21912]: START: shell pid=23013 from=192.168.47.31
Sep 13 15:49:32 hyprwulf1 pam_rhosts_auth[23013]: allowed to jawhite at hyprwulf-boot0.hapb as jawhite
Sep 13 15:49:32 hyprwulf1 PAM_unix[23013]: (rsh) session opened for user jawhite by (uid=0)
Sep 13 15:49:32 hyprwulf1 in.rshd[23014]: jawhite at hyprwulf-boot0.hapb as jawhite: cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver -p4pg /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/PI15564 -p4wd /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases'

the /var/log/messages file on node2 contains :

Sep 13 15:49:32 hyprwulf2 xinetd[13163]: START: shell pid=13490 from=192.168.47.32
Sep 13 15:49:32 hyprwulf2 pam_rhosts_auth[13490]: allowed to jawhite at hyprwulf-boot1.hapb as jawhite
Sep 13 15:49:32 hyprwulf2 PAM_unix[13490]: (rsh) session opened for user jawhite by (uid=0)
Sep 13 15:49:32 hyprwulf2 in.rshd[13491]: jawhite at hyprwulf-boot1.hapb as jawhite: cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver node1 34248 \-p4amslave'

and the program executes successfully

2) When submitting from node0 using the -p4pg option to run on nodes 1 and 2 using mpirun configured as:

mpirun -v -nolocal -p4pg vulcan.hosts /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

the procgroup file vulcan.hosts contains:

node1 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
node2 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

and the -v option reports

running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver on 1 LINUX ch_p4 processors

the /var/log/messages file on node0 contains :

no events during this time frame

the /var/log/messages file on node1 contains :

Sep 13 15:41:46 hyprwulf1 xinetd[21912]: START: shell pid=22978 from=192.168.47.31
Sep 13 15:41:46 hyprwulf1 pam_rhosts_auth[22978]: allowed to jawhite at hyprwulf-boot0.hapb as jawhite
Sep 13 15:41:46 hyprwulf1 PAM_unix[22978]: (rsh) session opened for user jawhite by (uid=0)
Sep 13 15:41:46 hyprwulf1 in.rshd[22979]: jawhite at hyprwulf-boot0.hapb as jawhite: cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver -p4pg /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts -p4wd /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases'

the /var/log/messages file on node2 contains :

Sep 13 15:41:46 hyprwulf2 xinetd[13163]: START: shell pid=13472 from=192.168.47.32
Sep 13 15:41:46 hyprwulf2 pam_rhosts_auth[13472]: allowed to jawhite at hyprwulf-boot1.hapb as jawhite
Sep 13 15:41:46 hyprwulf2 PAM_unix[13472]: (rsh) session opened for user jawhite by (uid=0)
Sep 13 15:41:46 hyprwulf2 in.rshd[13473]: jawhite at hyprwulf-boot1.hapb as jawhite: cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver node1 34240 \-p4amslave'

and the program executes successfully

3) When submitting from node0 using the -machinefile option to run on nodes 2 and 3 using mpirun configured as:

mpirun -v -keep_pg -nolocal -np 2 -machinefile vulcan.hosts /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

the machinefile file vulcan.hosts contains:

node2
node3

the PIXXXX file created contains:

node2 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver
node3 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver

and the -v option reports

running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver on 2 LINUX ch_p4 processors
Created /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/PI11592

the /var/log/messages file on node0 contains :

no events during this time frame

the /var/log/messages file on node1 contains :

no events during this time frame

the /var/log/messages file on node2 contains :

Sep 13 15:35:29 hyprwulf2 xinetd[13163]: START: shell pid=13451 from=192.168.47.31
Sep 13 15:35:29 hyprwulf2 pam_rhosts_auth[13451]: allowed to jawhite at hyprwulf-boot0.hapb as jawhite
Sep 13 15:35:29 hyprwulf2 PAM_unix[13451]: (rsh) session opened for user jawhite by (uid=0)
Sep 13 15:35:29 hyprwulf2 in.rshd[13452]: jawhite at hyprwulf-boot0.hapb as jawhite: cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver -p4pg /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/PI15167 -p4wd /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases'

the /var/log/messages file on node3 contains :

Sep 13 15:35:29 hyprwulf3 xinetd[11167]: START: shell pid=11435 from=192.168.47.33
Sep 13 15:35:29 hyprwulf3 pam_rhosts_auth[11435]: allowed to jawhite at hyprwulf-boot2.hapb as jawhite
Sep 13 15:35:29 hyprwulf3 PAM_unix[11435]: (rsh) session opened for user jawhite by (uid=0)
Sep 13 15:35:29 hyprwulf3 in.rshd[11436]: jawhite at hyprwulf-boot2.hapb as jawhite: cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/VULCAN_solver node2 33713 \-p4amslave'

and the program executes successfully

4) When submitting from node0 using the -p4pg option to run on nodes 2 and 3 using mpirun configured as:

mpirun -v -nolocal -p4pg vulcan.hosts /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

the procgroup file vulcan.hosts contains:

node2 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
node3 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

and the -v option reports

running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver on 1 LINUX ch_p4 processors

the /var/log/messages file on node0 contains :

no events during this time frame

the /var/log/messages file on node1 contains :

Sep 13 14:54:48 hyprwulf1 xinetd[21912]: START: shell pid=22917 from=192.168.47.31
Sep 13 14:54:48 hyprwulf1 pam_rhosts_auth[22917]: allowed to jawhite at hyprwulf-boot0.hapb as jawhite
Sep 13 14:54:48 hyprwulf1 PAM_unix[22917]: (rsh) session opened for user jawhite by (uid=0)
Sep 13 14:54:48 hyprwulf1 in.rshd[22918]: jawhite at hyprwulf-boot0.hapb as jawhite: cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver -p4pg /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts -p4wd /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases'

the /var/log/messages file on node2 contains :

no events during this time frame

the /var/log/messages file on node3 contains :

Sep 13 14:54:48 hyprwulf3 xinetd[11167]: START: shell pid=11395 from=192.168.47.32
Sep 13 14:54:48 hyprwulf3 pam_rhosts_auth[11395]: allowed to jawhite at hyprwulf-boot1.hapb as jawhite
Sep 13 14:54:48 hyprwulf3 PAM_unix[11395]: (rsh) session opened for user jawhite by (uid=0)
Sep 13 14:54:48 hyprwulf3 in.rshd[11396]: jawhite at hyprwulf-boot1.hapb as jawhite: cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver node2 34232 \-p4amslave'

and the following error message is generated

rm_10957: p4_error: rm_start: net_conn_to_listener failed: 34133

It appears that in case 4, even though I have requested that node2 and node3 be used, a process is being rsh'd to node1 instead. The log message from node3 indicates it expects to connect to node2 (partial proof that I really did request node2), but since there is no process on node2 an error occurs.

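The comparison made by hand in the four runs (which node received the -p4pg command, and which node each slave was told to connect back to) can be sketched as a small script. This is only an illustration: the function names are hypothetical, the pattern is fitted to the in.rshd lines quoted in this message, and matching hyprwulfN to nodeN by trailing number is an assumption based on the host names that appear in the logs.

```python
import re

# Pattern for the in.rshd lines quoted in the runs:
#   "<Mon> <day> <time> <host> in.rshd[<pid>]: ... cmd='<command line>'"
RSHD_LINE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+in\.rshd\[\d+\]:.*?cmd='(?P<cmd>.*)'$"
)


def classify(line):
    """Return (logging_host, role, detail) for an in.rshd line, else None.

    role is "master" (detail = procgroup path given after -p4pg),
    "slave" (detail = host the slave was told to connect back to),
    or "other" (detail = the raw command).
    """
    match = RSHD_LINE.match(line.strip())
    if match is None:
        return None
    host, cmd = match.group("host"), match.group("cmd")
    args = cmd.split()
    if "-p4pg" in args[:-1]:
        return host, "master", args[args.index("-p4pg") + 1]
    if len(args) >= 4 and args[-1].endswith("-p4amslave"):
        # Slave form seen in the logs: <program> <master host> <port> \-p4amslave
        return host, "slave", args[-3]
    return host, "other", cmd


def node_number(name):
    """Trailing digits of a host name (assumed convention: hyprwulf2 == node2)."""
    match = re.search(r"(\d+)$", name)
    return int(match.group(1)) if match else None


def master_mismatches(lines):
    """List (slave_host, expected_master, actual_masters) where no master matches."""
    entries = [c for c in map(classify, lines) if c is not None]
    masters = [host for host, role, _ in entries if role == "master"]
    master_ids = {node_number(h) for h in masters}
    return [
        (host, wanted, masters)
        for host, role, wanted in entries
        if role == "slave" and node_number(wanted) not in master_ids
    ]


# The two in.rshd lines from run 4, copied from the logs in this message.
RUN4_LOG = [
    "Sep 13 14:54:48 hyprwulf1 in.rshd[22918]: jawhite at hyprwulf-boot0.hapb as jawhite: "
    "cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver "
    "-p4pg /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts "
    "-p4wd /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases'",
    "Sep 13 14:54:48 hyprwulf3 in.rshd[11396]: jawhite at hyprwulf-boot1.hapb as jawhite: "
    r"cmd='/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver node2 34232 \-p4amslave'",
]

if __name__ == "__main__":
    for slave, wanted, masters in master_mismatches(RUN4_LOG):
        print(f"{slave} was told to connect to {wanted}, but the master ran on {masters}")
```

Fed the run 4 lines, the sketch flags the slave on hyprwulf3 as expecting node2 while the only master command was logged on hyprwulf1, which is the same mismatch described above.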
The information below is the output stream from case 4 after invoking the -echo and -mpiversion options:
++ echo 'default_arch = LINUX' ++ echo 'default_device = ch_p4' ++ echo 'machine = ch_p4' ++ '[' 1 -le 5 ']' ++ arg=-mpiversion ++ shift ++ '[' -x /usr/local/pkgs/mpich_1.2.1/bin/mpirun.ch_p4.args ']' ++ device_knows_arg=0 ++ . /usr/local/pkgs/mpich_1.2.1/bin/mpirun.ch_p4.args ++ '[' 0 '!=' 0 ']' +++ echo -mpiversion +++ sed s/%a//g ++ proginstance=-mpiversion ++ '[' '' = '' -a '' = '' -a '!' -x -mpiversion ']' ++ fake_progname=-mpiversion ++ '[' 1 -le 4 ']' ++ arg=-nolocal ++ shift ++ nolocal=1 ++ '[' 1 -le 3 ']' ++ arg=-p4pg ++ shift ++ '[' -x /usr/local/pkgs/mpich_1.2.1/bin/mpirun.ch_p4.args ']' ++ device_knows_arg=0 ++ . /usr/local/pkgs/mpich_1.2.1/bin/mpirun.ch_p4.args +++ '[' 1 -gt 1 ']' +++ p4pgfile=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts +++ shift +++ leavePGFile=1 +++ device_knows_arg=1 ++ '[' 1 '!=' 0 ']' ++ continue ++ '[' 1 -le 1 ']' ++ arg=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver ++ shift ++ '[' -x /usr/local/pkgs/mpich_1.2.1/bin/mpirun.ch_p4.args ']' ++ device_knows_arg=0 ++ . /usr/local/pkgs/mpich_1.2.1/bin/mpirun.ch_p4.args ++ '[' 0 '!=' 0 ']' +++ echo /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver +++ sed s/%a//g ++ proginstance=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver ++ '[' '' = '' -a -mpiversion = '' -a '!' -x /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver ']' ++ '[' '' = '' -a -x /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver ']' ++ progname=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver ++ '[' 1 -le 0 ']' ++ '[' 1 -le 0 ']' ++ '[' '' = '' -a /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver = '' ']' ++ '[' -n -mpiversion -a -n /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver ']' ++ echo 'Unrecognized argument -mpiversion ignored.' 
++ larch= ++ '[' -z '' ']' ++ larch=LINUX ++ '[' -n 'sed -e s@/tmp_mnt/@/@g' ']' +++ pwd +++ sed -e s@/tmp_mnt/@/@g ++ PWDtest=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases ++ '[' '!' -d /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases ']' ++ '[' -n /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases ']' +++ echo /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases +++ sed -e s@/tmp_mnt/@/@g ++ PWDtest2=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases ++ /bin/rm -f /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/.mpirtmp16410 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/.mpirtmp16410 +++ eval 'echo test > /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/.mpirtmp16410' ++ '[' '!' -s /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/.mpirtmp16410 ']' ++ PWD=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases ++ /bin/rm -f /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/.mpirtmp16410 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/.mpirtmp16410 ++ '[' -n /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases ']' ++ PWD_TRIAL=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases +++ echo /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver +++ sed 's/\/.*//' ++ tail= ++ '[' '' = '' ']' ++ true ++ '[' '' = '' -a -x /usr/local/pkgs/mpich_1.2.1/bin/tarch ']' +++ /usr/local/pkgs/mpich_1.2.1/bin/tarch ++ arch=LINUX ++ '[' LINUX = IRIX64 -a '(' LINUX = IRIX -o LINUX = IRIXN32 ')' ']' ++ archlist=LINUX ++ '[' ch_p4 = '' ']' ++ '[' ch_p4 = p4 -o ch_p4 = execer -o ch_p4 = sgi_mp -o ch_p4 = ch_p4 -o ch_p4 = ch_p4-2 -o ch_p4 = globus -o ch_p4 = globus ']' ++ '[' '' = '' ']' ++ MPI_HOST= ++ '[' LINUX = ipsc860 ']' +++ hostname ++ MPI_HOST=hyprwulf00 ++ '[' hyprwulf00 = '' ']' ++ '[' /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases '!=' '' ']' +++ pwd +++ sed -e s%/tmp_mnt/%/%g ++ PWD_TRIAL=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases ++ '[' '!' 
-d /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases ']' ++ '[' 1 = 1 ']' ++ cnt=1 ++ '[' 0 -gt 1 ']' ++ echo 'running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver on 1 LINUX ch_p4 processors' + argsset=1 + mpirun_version= + mpirun_version=/usr/local/pkgs/mpich_1.2.1/bin/mpirun.ch_p4 + exitstat=1 + '[' -n /usr/local/pkgs/mpich_1.2.1/bin/mpirun.ch_p4 ']' + '[' -x /usr/local/pkgs/mpich_1.2.1/bin/mpirun.ch_p4 ']' + . /usr/local/pkgs/mpich_1.2.1/bin/mpirun.ch_p4 ++ exitstatus=1 ++ '[' -z 1 ']' ++ '[' -n '' ']' ++ '[' -n '' ']' ++ '[' '' = shared ']' ++ MPI_MAX_CLUSTER_SIZE=1 ++ . /usr/local/pkgs/mpich_1.2.1/bin/mpirun.pg +++ '[' 1 = '' ']' +++ '[' 0 = 0 ']' +++ narch=1 +++ arch1=LINUX +++ archlist1=LINUX +++ archlocal=LINUX +++ np1=1 +++ '[' 1 = 1 ']' +++ procFound=0 +++ machinelist= +++ archuselist= +++ nprocuselist= +++ curarch=1 +++ nolocalsave=1 +++ archlocal=LINUX +++ '[' 1 -le 1 ']' +++ eval 'arch=$arch1' ++++ arch=LINUX +++ eval 'archlist=$archlist1' ++++ archlist=LINUX +++ '[' -z LINUX ']' +++ eval 'np=$np1' ++++ np=1 +++ '[' -z 1 ']' +++ eval 'mFile=$machineFile1' ++++ mFile= +++ '[' -n '' -a -r '' ']' +++ '[' -z '' ']' +++ '[' ch_p4 = ibmspx -a -x /usr/local/bin/getjid ']' +++ machineDir=/usr/local/pkgs/mpich_1.2.1/share +++ machineFile=/usr/local/pkgs/mpich_1.2.1/share/machines.LINUX +++ '[' -r /usr/local/pkgs/mpich_1.2.1/share/machines.LINUX ']' +++ break +++ '[' -z /usr/local/pkgs/mpich_1.2.1/share/machines.LINUX -o '!' -s /usr/local/pkgs/mpich_1.2.1/share/machines.LINUX -o '!' 
-r /usr/local/pkgs/mpich_1.2.1/share/machines.LINUX ']' ++++ expr hyprwulf00 : '\([^\.]*\).*' +++ MPI_HOSTLeader=hyprwulf00 +++ '[' '' = yes ']' +++ '[' 1 = 0 -o 1 -gt 1 ']' +++ '[' 1 -gt 1 -o 1 = 1 ']' ++++ cat /usr/local/pkgs/mpich_1.2.1/share/machines.LINUX ++++ sed -e '/^#/d' -e 's/#.*^//g' ++++ grep -v '^hyprwulf00\([ -\.:]\)' ++++ head -1 ++++ tr '\012' ' ' +++ machineavail=mpi1 +++ KeepHost=0 +++ loopcnt=0 +++ '[' -z 1 ']' +++ '[' 1 = 0 -a 1 -gt 1 ']' ++++ expr 1 - 0 +++ nleft=1 +++ '[' 1 -lt 0 ']' +++ '[' 0 -lt 1 ']' +++ nfound=0 +++ nprocmachine=1 ++++ expr mpi1 : '.*:\([0-9]*\)' +++ ntest= +++ '[' -n '' -a '' '!=' 0 ']' ++++ expr mpi1 : '\([^\.]*\).*' +++ machineNameLeader=mpi1 +++ '[' 1 = 1 -o 0 = 1 -o '(' mpi1 '!=' hyprwulf00 -a mpi1 '!=' hyprwulf00 ')' ']' +++ '[' 1 -gt 1 ']' +++ machinelist= mpi1 +++ archuselist= LINUX +++ nprocuselist= 1 ++++ expr 0 + 1 +++ procFound=1 ++++ expr 0 + 1 +++ nfound=1 ++++ expr 1 - 1 +++ nleft=0 +++ '[' 1 = 1 ']' +++ break ++++ expr 0 + 1 +++ loopcnt=1 +++ '[' 1 = 0 -a 1 -gt 1 ']' +++ '[' 1 -lt 1 ']' ++++ expr 1 + 1 +++ curarch=2 +++ procFound=0 +++ nolocal=1 +++ machineFile= +++ '[' 2 -le 1 ']' +++ nolocal=1 +++ '[' 1 '!=' 1 ']' +++ break ++ prognamemain=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver ++ '[' -z /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts ']' ++ /bin/sync ++ '[' '' = '' ']' ++ p4workdir=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases ++ startpgm=/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver -p4pg /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts -p4wd /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases ++ '[' '' '!=' '' ']' ++ MPIRUN_DEVICE=ch_p4 ++ export MPIRUN_DEVICE ++ '[' 0 = 1 ']' ++ doitall=eval ++ '[' 1 = 1 ']' ++ '[' '' = /dev/null ']' ++ doitall=eval /usr/bin/rsh -n mpi1 ++ eval /usr/bin/rsh -n mpi1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver -p4pg 
/home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts -p4wd /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases +++ /usr/bin/rsh -n mpi1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver -p4pg /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts -p4wd /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases
default_arch = LINUX
default_device = ch_p4
machine = ch_p4
Unrecognized argument -mpiversion ignored.
running /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver on 1 LINUX ch_p4 processors
rm_11548: p4_error: rm_start: net_conn_to_listener failed: 34288
bm_list_23231: p4_error: interrupt SIGINT: 2
p0_23230: p4_error: interrupt SIGINT: 2
Broken pipe
P4 procgroup file is /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Sample_cases/vulcan.hosts.

The result from cat /usr/local/pkgs/mpich_1.2.1/share/machines.LINUX is

mpi1
mpi2
mpi3
mpi4
mpi5
mpi6
mpi7
mpi8
mpi9
mpi10
mpi11
mpi12
mpi13
mpi14
mpi15
mpi16
mpi17
mpi18

However, our /etc/hosts file contains entries for

mpi1 node1
mpi2 node2

so using the p4pg file vulcan.hosts containing:

node2 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
node3 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

or

mpi2 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver
mpi3 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver

both produce the same result/error message. Removing mpi1 from the machines.LINUX file seems to fix the problem by shifting the master process to mpi2/node2. But I suspect that if I requested nodes 3 and 4 the error would happen again. I had hoped that using the p4pg file would have allowed me to pick any node as my master node.
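
As far as the trace shows, with -nolocal the host that receives the master process (machineavail=mpi1) comes from the first usable entry of machines.LINUX, and the procgroup file is not read in that step. The following is a minimal Python sketch of that selection, not the mpirun script itself: the function names are hypothetical, the local-host filter is a simplification of the grep pattern in the trace, and the alias table holds only the two /etc/hosts pairs quoted above.

```python
def short(name):
    """Host name up to the first dot, similar to the expr calls in the trace."""
    return name.split(".", 1)[0]


def pick_master(machines_lines, local_host):
    """First machines-file entry that is not the submitting host.

    Simplified model of the cat | sed | grep -v | head -1 pipeline in the
    trace. Note that the procgroup file is not an input here.
    """
    for raw in machines_lines:
        entry = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not entry:
            continue
        host = entry.split(":", 1)[0]  # "host:ncpus" entries keep only the host
        if short(host) == short(local_host):  # -nolocal skips the local machine
            continue
        return host
    return None


def first_procgroup_host(procgroup_text):
    """Host named on the first non-empty line of a procgroup file."""
    for line in procgroup_text.splitlines():
        fields = line.split()
        if fields:
            return fields[0]
    return None


# Values taken from this message; ALIASES lists only the quoted /etc/hosts pairs.
MACHINES = ["mpi%d" % i for i in range(1, 19)]
ALIASES = {"mpi1": "node1", "mpi2": "node2"}
PROCGROUP = (
    "node2 0 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver\n"
    "node3 1 /home0/jawhite/Vulcan/DEC_21264/Ver_4.3/Executable/VULCAN_solver\n"
)

if __name__ == "__main__":
    wanted = first_procgroup_host(PROCGROUP)
    for machines in (MACHINES, MACHINES[1:]):  # with and without mpi1
        chosen = pick_master(machines, "hyprwulf00")
        actual = ALIASES.get(chosen, chosen)
        status = "matches" if actual == wanted else "does NOT match"
        print(f"master launched on {chosen} ({actual}); {status} procgroup master {wanted}")
```

Under these assumptions the sketch selects mpi1 for the full machines list and mpi2 once mpi1 is removed, which is consistent with both the failure in case 4 and the workaround described above.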
More information about the Beowulf mailing list
