[Beowulf] job runs with mpirun on a node but not if submitted via Torque.

Rahul Nabar rpnabar at gmail.com
Tue Mar 31 16:58:45 PDT 2009


On Tue, Mar 31, 2009 at 6:43 PM, Don Holmgren <djholm at fnal.gov> wrote:
>
> How are your individual MPI processes crashing when run under Torque?  Are
> there any error messages?

Thanks Don! There aren't any useful error messages.

My job hierarchy is actually like so:

{shell_script submitted to Torque} --> calls Python --> Loop until
convergence {calls a Fortran executable}

The Fortran executable is the one that makes the MPI calls to
parallelize over processors.

The crash is *not* so bad that Torque kills the job. What happens is
that the Fortran executable crashes and Python keeps calling it over
and over again. The crash only happens when I submit via Torque.
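
One way to keep the Python layer from looping over a crashed
executable is to check its exit status on every iteration. This is
only a minimal sketch, not the actual job script: the names
run_until_converged, cmd, is_converged and max_iters are hypothetical,
and it assumes the launcher passes a non-zero exit status through when
the Fortran executable dies.

```python
import subprocess
import sys


def run_until_converged(cmd, is_converged, max_iters=100):
    """Run cmd repeatedly until is_converged() is true; stop on any failure.

    cmd, is_converged and max_iters are hypothetical placeholders,
    not names taken from the real job script.
    """
    for iteration in range(1, max_iters + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Surface the failure instead of silently calling the
            # executable again on the next pass of the loop.
            sys.stderr.write(result.stderr)
            raise RuntimeError(
                "iteration %d: %r exited with status %d"
                % (iteration, cmd, result.returncode)
            )
        if is_converged():
            return iteration
    raise RuntimeError("no convergence after %d iterations" % max_iters)
```

On POSIX systems subprocess reports a negative return code when the
child was killed by a signal, so a value such as -11 would point at a
segmentation fault in the directly launched process. The captured
stderr is written out before the exception so that whatever the
executable printed ends up in the job's error file.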

If I do this instead

mpirun from node --> shell wrapper --> calls Python --> Loop until
convergence {calls a Fortran executable}

Then everything works fine. Note that the Python and shell parts are
not truly parallelized. The Fortran executable is the only place where
actual parallelization happens.

> The environment for a Torque job on a worker node under openMPI is inherited
> from the pbs_mom process.  Sometimes differences between this environment
> and the standard login environment can cause troubles.

Exactly. Can I somehow obtain a dump of this environment to compare
the direct mpirun run against the Torque run? What would be the best
way? Just a dump from set? Are there any crucial variables to look
for? Maybe a ulimit?
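
Since Python is already in the job chain, one option is to record the
environment and resource limits from inside it and compare the two
runs afterwards. A rough sketch, assuming a Unix node (the resource
module is Unix only). The function names, the file names in the usage
note, and the choice of limits in LIMIT_NAMES are illustrative guesses,
not a list of known culprits.

```python
import json
import os
import resource  # Unix only

# Limits to record; which of these matter for this job is an assumption.
LIMIT_NAMES = [
    "RLIMIT_STACK",
    "RLIMIT_CORE",
    "RLIMIT_NOFILE",
    "RLIMIT_MEMLOCK",
    "RLIMIT_AS",
]


def snapshot():
    """Return environment variables plus soft/hard limits as one dict."""
    snap = dict(os.environ)
    for name in LIMIT_NAMES:
        if hasattr(resource, name):  # not every platform defines every limit
            soft, hard = resource.getrlimit(getattr(resource, name))
            snap[name] = "soft=%s hard=%s" % (soft, hard)
    return snap


def save_snapshot(path):
    """Write the current snapshot to path as sorted JSON."""
    with open(path, "w") as f:
        json.dump(snapshot(), f, indent=0, sort_keys=True)


def diff_snapshots(path_a, path_b):
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    with open(path_a) as f:
        a = json.load(f)
    with open(path_b) as f:
        b = json.load(f)
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

Calling save_snapshot("env_torque.json") from the Torque-submitted run
and save_snapshot("env_mpirun.json") from the direct run, then
diff_snapshots on the two files, would list only the variables and
limits that differ. A missing key shows up as None on that side.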

>
> Instead of logging into the node directly, you might want to try an
> interactive job (use "qsub -I") and then try your mpirun.

I'm trying that now.

-- 
Rahul



