[Beowulf] job runs with mpirun on a node but not if submitted via Torque.


Rahul Nabar rpnabar at gmail.com
Tue Mar 31 15:54:55 PDT 2009


I've a strange OpenMPI/Torque problem while trying to run a job on our
Opteron-SC-1435 based cluster:

Each node has 8 CPUs.

If I log in to a node and run the job directly like so, it works:

mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}

If I submit the same job through PBS/Torque, it starts running but the
individual processes keep crashing:

mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}

I know that the --hostfile directive is not needed in the latest
torque-OpenMPI jobs.

I also tried including:

mpirun -np 6 --hosts node17,node17,node17,node17,node17,node17 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}

Still does not work.

What could be going wrong? Are there other things I need to worry
about when PBS steps in? Any tips?

${DACAPOEXE_PAR} refers to a Fortran executable for the
computational chemistry code DACAPO.

What's the difference between running a job on a node via mpirun
directly and submitting it via Torque? Shouldn't both be transparent to the
Fortran calls? I am assuming I don't have to dig into the Fortran code.
Any debug tips?
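
One way to narrow this down might be to record the runtime environment in
both cases and diff the results. The following is only a rough sketch, not a
known fix: the function name dump_runtime_env and the output file location
are made up for illustration, and it assumes that Torque sets PBS_JOBID
inside the batch job. It simply captures the host name, working directory,
resource limits, and environment variables, which are the usual places where
an interactive shell and a batch job differ.

```shell
#!/bin/sh
# Hypothetical helper (name and output path are illustrative assumptions):
# record the runtime environment so that an interactive shell on the node
# and a Torque batch job can be compared side by side.
dump_runtime_env() {
    out="$1"
    {
        echo "== host =="
        uname -n
        echo "== working directory =="
        pwd
        echo "== resource limits (ulimit -a) =="
        ulimit -a
        echo "== environment (sorted) =="
        env | sort
    } > "$out" 2>&1
}

# Inside a Torque job PBS_JOBID is expected to be set; in an interactive
# shell it is not, so the file name falls back to "interactive".
dump_runtime_env "${TMPDIR:-/tmp}/runtime_env.${PBS_JOBID:-interactive}.txt"
```

Running this once by hand on the node and once at the top of the PBS script,
then running diff on the two output files, should show whether limits such as
stack size, or variables such as PATH and LD_LIBRARY_PATH, differ between the
two cases.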

Thanks!

-- 
Rahul


