[Beowulf] the solution for qdel fail.....
William Scullin wscullin at cct.lsu.edu
Thu Jan 6 15:56:37 PST 2005
Howdy,

The --gm-kill option is specific to clusters using Myrinet. It is mostly
there to ensure that slave processes using Myrinet's MPI hang up when the
master process is done running. The number after --gm-kill is the timeout
in seconds.

I am not sure which version, type, or member of the PBS family you are
using. If you are using PBS Pro (this is probably also true for TORQUE and
OpenPBS), you should be able to place two scripts called prologue and
epilogue in /var/spool/PBS/mom_priv/ on every compute node. They must be
owned by root and be executable, readable, and writable only by root. The
prologue script will run before every job and the epilogue script will run
after every job.

In the epilogue and prologue scripts we use, we clean the nodes of all
lingering user processes and do some basic checking of node health. Even if
an epilogue script misses a process, or a user launches a process outside
of the queuing system, the prologue will still catch it before the next job
starts to run.

Best,
William

On Thu, 2005-01-06 at 14:33, Jerry Xu wrote:
> Hey, Huang:
>
> I found one solution that works for me; maybe you can try it and see
> whether it works for you.
>
> In your PBS script, try adding "--gm-kill 5" between the processor
> count and your program, like this:
>
> mpirun -machinefile $PBS_NODEFILE -np $NPROCS --gm-kill 5 myprogram
>
> It works for me.
>
> Jerry.
>
> /**********************************************************
> Hi,
>
> We have a new system set up. The vendor set up PBS for us. For
> administration reasons, we created a new queue "dque" (set to default)
> using the "qmgr" command:
>
> create queue dque queue_type=e
> s q dqueue enabled=true, started=true
>
> I was able to submit jobs to queue "dque" using the "qsub" command.
> However, when I use "qdel" to kill a job, the job disappears from the
> job list shown by "qstat -a", but the executable is still running on
> the compute nodes. Every time, I have to log in to the corresponding
> compute node and kill the running job.
>
> I am wondering if I missed something in setting up the queue so that I
> am unable to kill the job completely using "qdel".
>
> Thanks.
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
William Scullin
System Administrator
Center for Computation and Technology
342 Johnston Hall
Louisiana State University
Baton Rouge, Louisiana 70803
voice: 225 578 6888
fax: 225 578 5362
aim: WilliamAtLSU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
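
For anyone who wants a starting point, the epilogue cleanup described in the reply above might look roughly like the sketch below. This is not the script used at LSU. It assumes the OpenPBS/TORQUE convention that the MOM passes the job id as $1 and the job owner's user name as $2, and the MIN_UID cutoff and both function names are made up for illustration. It also assumes each node runs one job at a time, since it kills every process the job owner still has on the node.

```shell
#!/bin/sh
# Hypothetical epilogue sketch; the PBS MOM runs it as root after each job.
# Assumption: $1 = job id, $2 = job owner's user name.

MIN_UID=500   # assumption: accounts below this UID are system accounts

# Succeed only if the named user exists and is an ordinary (non-system) user.
is_cleanable_user() {
    user=$1
    [ -n "$user" ] || return 1
    uid=$(id -u "$user" 2>/dev/null) || return 1
    [ "$uid" -ge "$MIN_UID" ]
}

# Kill whatever the user still has running: SIGTERM first, then SIGKILL.
clean_user() {
    pkill -TERM -u "$1" 2>/dev/null
    sleep 2
    pkill -KILL -u "$1" 2>/dev/null
    return 0
}

# Do nothing unless PBS supplied a job owner who passes the check above.
if [ -n "${2:-}" ] && is_cleanable_user "$2"; then
    clean_user "$2"
fi
```

To try it, save it as /var/spool/PBS/mom_priv/epilogue on each compute node, owned by root with mode 700 (chown root, chmod 700), to match the ownership and permission requirements above. A prologue could reuse the same functions to catch anything left behind before the next job starts.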
