[Beowulf] job runs with mpirun on a node but not if submitted via Torque.
Rahul Nabar rpnabar at gmail.com
Tue Mar 31 16:58:45 PDT 2009
On Tue, Mar 31, 2009 at 6:43 PM, Don Holmgren <djholm at fnal.gov> wrote:
>
> How are your individual MPI processes crashing when run under Torque? Are
> there any error messages?

Thanks Don! There aren't any useful error messages. My job hierarchy is actually like so:

{shell script submitted to Torque} --> calls Python --> loop until convergence {calls a Fortran executable}

The Fortran executable is the one that has the MPI calls to parallelize over processors. The crash is *not* so bad that Torque kills the job. What happens is that the Fortran executable crashes and Python keeps calling it over and over again.

The crash happens only when I submit via Torque. If I do this instead:

mpirun from node --> shell wrapper --> calls Python --> loop until convergence {calls a Fortran executable}

then everything works fine. Note that the Python and shell layers are not truly parallelized. The Fortran is the only place where actual parallelization happens.

> The environment for a Torque job on a worker node under openMPI is inherited
> from the pbs_mom process. Sometimes differences between this environment and
> the standard login environment can cause troubles.

Exactly. Can I somehow obtain a dump of this environment to compare the direct mpirun vs. the Torque run? What would be the best way? Just a dump from set? Any crucial variables to look for? Maybe a ulimit?

> Instead of logging into the node directly, you might want to try an
> interactive job (use "qsub -I") and then try your mpirun.

I'm trying that now.

--
Rahul
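One possible way to get the environment dump asked about above, sketched in Python since the job already goes through a Python driver. This is only an illustration under assumptions: the function names and the output format are made up here, it uses Python 3 with the standard `os` and `resource` modules (Unix only), and it captures just environment variables and the process resource limits (the same numbers `ulimit` reports), not everything that might differ under pbs_mom.

```python
import os
import resource


def snapshot():
    """Collect environment variables and resource limits into one flat dict."""
    snap = {f"env:{key}": value for key, value in os.environ.items()}
    for name in dir(resource):
        if not name.startswith("RLIMIT_"):
            continue
        try:
            soft, hard = resource.getrlimit(getattr(resource, name))
        except (ValueError, OSError):
            continue  # limit not supported on this platform
        snap[f"ulimit:{name}"] = f"soft={soft} hard={hard}"
    return snap


def write_snapshot(path):
    """Write the snapshot as sorted 'key=value' lines, one per line."""
    with open(path, "w", encoding="utf-8", errors="backslashreplace") as fh:
        for key, value in sorted(snapshot().items()):
            # ascii() keeps each value on one line, even if it contains newlines.
            fh.write(f"{key}={ascii(value)}\n")


def read_snapshot(path):
    """Read a file produced by write_snapshot back into a dict."""
    with open(path, encoding="utf-8") as fh:
        return dict(line.rstrip("\n").split("=", 1) for line in fh if "=" in line)


def diff_snapshots(first, second):
    """Return {key: (value_in_first, value_in_second)} for keys that differ."""
    keys = sorted(set(first) | set(second))
    return {k: (first.get(k), second.get(k)) for k in keys if first.get(k) != second.get(k)}
```

Calling `write_snapshot("env_torque.txt")` at the top of the Python driver in the Torque run and `write_snapshot("env_direct.txt")` in the direct mpirun run (both file names are placeholders) gives two files that `diff_snapshots(read_snapshot(...), read_snapshot(...))` can compare. Entries such as `PATH`, `LD_LIBRARY_PATH`, and the `RLIMIT_STACK` and `RLIMIT_MEMLOCK` limits are plausible first things to look at, though that is a guess rather than a diagnosis.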
