[Beowulf] the solution for qdel fail.....


Fred L Youhanaie fly at anydata.co.uk
Thu Jan 6 16:36:40 PST 2005


Hi,

Since you are using PBS, you may want to consider mpiexec,
http://www.osc.edu/~pw/mpiexec/. It is basically a replacement for
mpirun with tight PBS integration, so once you issue qdel for
a job, it kills all the subtasks on the remote nodes. It also does
better resource accounting, e.g. total CPU used by all nodes, and
eliminates the need for ssh/rsh :)

Even if you are not using MPI, you can spawn multiple instances of a
program with the '--comm=none' option.
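
For illustration, a job script that uses mpiexec in place of mpirun might
look like the sketch below. It assumes the OSC mpiexec is installed as
"mpiexec" on the compute nodes. The resource request, the LAUNCHER override,
the run_job function, and the program names are all made up for the example.

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=2
#PBS -q dque

# Assumption: the OSC mpiexec is on PATH as "mpiexec".
# LAUNCHER can be overridden (e.g. LAUNCHER=echo) for a dry run off the cluster.
LAUNCHER=${LAUNCHER:-mpiexec}

run_job() {
    # MPI program: mpiexec learns the allocated nodes from PBS itself,
    # so no -machinefile or -np is passed here.
    "$LAUNCHER" ./myprogram

    # Non-MPI program: plain instances started without any MPI setup.
    "$LAUNCHER" --comm=none ./myserialprogram
}

# Launch only when running inside a PBS job (PBS sets PBS_JOBID).
if [ -n "${PBS_JOBID:-}" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    run_job
fi
```

Because the processes are started through PBS rather than over ssh/rsh, a
qdel on the job should also reach the tasks on the remote nodes.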

Cheers
f.


William Scullin wrote:
> Howdy,
> 
> 	The --gm-kill option is specific to clusters using Myrinet and is mostly
> there to ensure that slave processes using Myrinet's MPI shut down when the
> master process is done running. The number after --gm-kill is the
> timeout in seconds.
> 
> 	I am not sure which version, type, or member of the PBS family you are
> using. If you are using PBS Pro (this is probably also true for TORQUE and
> OpenPBS), you should be able to place two scripts in
> /var/spool/PBS/mom_priv/ called prologue and epilogue on every compute
> node. They must be owned by root and be executable / readable / writable
> only by root. The prologue script will run before every job and the
> epilogue script will run after every job. In the epilogue and prologue
> scripts we use, we clean the nodes of all lingering user processes and
> do some basic checking of node health.
> 
> 	Even if an epilogue script misses a process – or a user launches
> a process outside of the queuing system – the prologue will still catch
> it before the next job starts to run.
> 
> 	Best,
> 	William
>  
> On Thu, 2005-01-06 at 14:33, Jerry Xu wrote:
> 
>>Hey, Huang:
>>
>>  I found one solution that works for me, maybe you can try it and see
>>whether it works for you.
>>
>>In your PBS script, try adding the "--gm-kill 5" option between the
>>processor count and your program,
>>
>>like this 
>>
>>mpirun -machinefile $PBS_NODEFILE -np $NPROCS --gm-kill 5 myprogram
>>
>>it works for me.
>>
>>Jerry.
>>
>>/**********************************************************
>>Hi,
>>
>>We have a new system set up. The vendor set up the PBS for us. For
>>administration reasons, we created a new queue "dque" (set to default)
>>using the "qmgr" command:
>>
>>create queue dque queue_type=e
>>s q dque enabled=true, started=true
>>
>>I was able to submit jobs using the "qsub" command to queue "dque".
>>However, when I use "qdel" to kill a job, the job disappears from the
>>job list shown by "qstat -a", but the executable is still running on
>>the compute nodes. Every time, I have to log in to the corresponding
>>compute node and kill the running job.
>>
>>I am wondering if I missed something in setting up the queue so that I
>>am unable to kill the job completely using "qdel".
>>
>>Thanks.
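
The epilogue cleanup that William describes could be sketched roughly as
follows. This is not his actual script. It assumes pbs_mom passes the job id
as $1 and the job owner's user name as $2, and that compute nodes are not
shared between users, since it kills every process the owner has on the node.
MIN_UID and both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical epilogue sketch; pbs_mom runs it as root after each job.
# Assumption: $1 is the job id and $2 is the job owner's user name.

MIN_UID=${MIN_UID:-500}   # hypothetical cutoff: never touch system accounts

# Succeeds only for existing users whose uid is at or above MIN_UID.
is_cleanable_user() {
    uid=$(id -u "$1" 2>/dev/null) || return 1
    [ "$uid" -ge "$MIN_UID" ]
}

# Ask the user's leftover processes to exit, then force the stragglers.
# Assumes the node is not shared: this hits every process the user owns here.
cleanup_user() {
    pkill -TERM -u "$1"
    sleep 5
    pkill -KILL -u "$1"
    return 0
}

# Act only when called with the arguments pbs_mom passes.
if [ "$#" -ge 2 ] && is_cleanable_user "$2"; then
    cleanup_user "$2"
fi
```

As noted above, the file would go in mom_priv on every compute node as
"epilogue", owned by root and with permissions for root only (e.g. mode 700).
The same cleanup could also be run from the prologue.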




