[Beowulf] job runs with mpirun on a node but not if submitted via Torque.

Rahul Nabar rpnabar at gmail.com
Tue Mar 31 16:58:45 PDT 2009


On Tue, Mar 31, 2009 at 6:43 PM, Don Holmgren <djholm at fnal.gov> wrote:
>
> How are your individual MPI processes crashing when run under Torque?  Are
> there any error messages?

Thanks Don! There aren't any useful error messages.

My job hierarchy is actually like so:

{shell script submitted to Torque} --> calls Python --> loop until
convergence {calls a Fortran executable}

The Fortran executable is the one that contains the MPI calls to
parallelize over processors.
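
In sketch form, the Python layer is basically this (the executable name
and the convergence test below are just placeholders, not the real
script):

import subprocess

def check_converged():
    # Placeholder: the real script parses the Fortran output files here.
    return False

while not check_converged():
    # Only this Fortran step contains MPI calls; the shell and Python
    # layers themselves run serially.
    rc = subprocess.call(["./fort_exec"])
    if rc != 0:
        # Nothing stops the loop on a crash; the executable just gets
        # re-run on the next iteration.
        print("fort_exec exited with status %d" % rc)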

The crash is *not* so bad that Torque kills the job. What happens is
that the Fortran executable crashes and Python just keeps looping over
it, again and again. The crash happens only when I submit via Torque.

If I do this instead:

mpirun from the node --> shell wrapper --> calls Python --> loop until
convergence {calls a Fortran executable}

then everything works fine. Note that the Python and shell layers are
not truly parallelized; the Fortran code is the only place where actual
parallelization happens.

> The environment for a Torque job on a worker node under openMPI is inherited
> from the pbs_mom process.  Sometimes differences between this environment and
> the standard login environment can cause troubles.

Exactly. Can I somehow obtain a dump of this environment to compare
the direct mpirun run against the Torque run? What would be the best
way? Just a dump from set? Any crucial variables to look for? Maybe
the ulimits?
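
For instance, would something like this at the top of the Python
wrapper do the job? (Just a sketch; the file naming and the choice of
limits are arbitrary.) Since the same Python script runs in both cases,
dumping from there should make the two environments directly
comparable:

import os
import resource
import socket

# Tag the dump by context: Torque sets PBS_JOBID inside a job.
tag = "torque" if "PBS_JOBID" in os.environ else "direct"
outfile = "/tmp/env-%s-%s.txt" % (tag, socket.gethostname())

with open(outfile, "w") as f:
    # Full environment, sorted so the two files diff cleanly.
    for key in sorted(os.environ):
        f.write("%s=%s\n" % (key, os.environ[key]))
    # The ulimit values (stack, locked memory, open files, data segment)
    # as seen by this process.
    for name in ("RLIMIT_STACK", "RLIMIT_MEMLOCK", "RLIMIT_NOFILE",
                 "RLIMIT_DATA"):
        soft, hard = resource.getrlimit(getattr(resource, name))
        f.write("%s soft=%s hard=%s\n" % (name, soft, hard))

Then a plain diff of the two output files should show what changed; I'd
guess PATH, LD_LIBRARY_PATH, and the stack/memlock limits are the first
things to look at.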

>
> Instead of logging into the node directly, you might want to try an
> interactive job (use "qsub -I") and then try your mpirun.

I'm trying that now.

-- 
Rahul
