[Beowulf] the solution for qdel fail.....
Fred L Youhanaie
fly at anydata.co.uk
Thu Jan 6 16:36:40 PST 2005
Since you are using PBS you may want to consider mpiexec,
http://www.osc.edu/~pw/mpiexec/. It is basically a replacement for
mpirun, but with tight integration with PBS, so once you issue qdel for
a job it kills all the subtasks on the remote nodes. It also does
better resource accounting (e.g. total CPU used across all nodes) and
eliminates the need for ssh/rsh :)
Even if you are not using mpi, you can spawn multiple instances of a
program with the '--comm=none' option.
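A minimal PBS job script using mpiexec might look like the sketch below. The resource request, queue name, and program name are placeholders, and the exact behavior depends on your mpiexec build:

```shell
#!/bin/sh
#PBS -N myjob
#PBS -l nodes=4:ppn=2
#PBS -q dque

cd $PBS_O_WORKDIR

# mpiexec learns the node allocation from PBS directly (via the TM
# interface), so no -machinefile and no rsh/ssh setup is needed, and
# qdel can reach every remote task.
mpiexec ./myprogram

# For a non-MPI program, one copy per allocated processor:
# mpiexec --comm=none ./mytool
```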
William Scullin wrote:
> The --gm-kill option is specific to clusters using Myrinet and is mostly
> there to ensure that slave processes using Myrinet's MPI hang up when the
> master process is done running. The number after --gm-kill is the
> timeout in seconds.
> I am not sure which version, type, or member of the PBS family you are
> using. If you are using PBS Pro (also probably true for torque and Open
> PBS), you should be able to place two scripts in
> /var/spool/PBS/mom_priv/ called prologue and epilogue on every compute
> node. They must be owned by root and be executable / readable / writable
> only by root. The prologue script will run before every job and the
> epilogue script will run after every job. In the epilogue and prologue
> scripts we use, we clean the nodes of all lingering user processes and
> do some basic checking of node health.
> Even if an epilogue script misses a process, or a user launches a
> process outside of the queuing system, the prologue will still catch
> it before the next job starts to run.
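As a sketch, an epilogue along those lines might look like the following. It assumes Linux with pkill available, and the PBS convention that the epilogue receives the job id and the job owner's user name as its first two arguments; a production version would also do the node-health checks mentioned above:

```shell
#!/bin/sh
# /var/spool/PBS/mom_priv/epilogue
# PBS invokes this as: epilogue <jobid> <user> <group> ...
jobid="$1"
user="$2"

# Kill any processes still owned by the job's user on this node.
# (Assumes one job per user per node; adjust if users share nodes.)
if [ -n "$user" ] && [ "$user" != "root" ]; then
    pkill -TERM -u "$user"
    sleep 5
    pkill -KILL -u "$user"
fi

exit 0
```

The script must be owned by root and readable/writable/executable only by root (chown root:root, chmod 700), or pbs_mom will refuse to run it.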
> On Thu, 2005-01-06 at 14:33, Jerry Xu wrote:
>> I found one solution that works for me, maybe you can try it and see
>>whether it works for you.
>>in your PBS script, try adding the "--gm-kill 5" option between the
>>processor count and your program:
>>mpirun -machinefile $PBS_NODEFILE -np $NPROCS --gm-kill 5 myprogram
>>it works for me.
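In context, a full submission script with that option might look like this sketch; the resource request and program name are placeholders:

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=2
#PBS -q dque

cd $PBS_O_WORKDIR
NPROCS=`wc -l < $PBS_NODEFILE`

# --gm-kill 5: on job teardown, give the remote GM (Myrinet) processes
# 5 seconds to shut down before they are killed.
mpirun -machinefile $PBS_NODEFILE -np $NPROCS --gm-kill 5 ./myprogram
```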
>>We have a new system set up. The vendor set up the PBS for us. For
>>administration reasons, we created a new queue "dque" (set to default)
>>using the "qmgr" command:
>>create queue dque queue_type=e
>>s q dque enabled=true, started=true
>>I was able to submit jobs using the "qsub" command to queue "dque".
>>However, when I use "qdel" to kill a job, the job disappears from the
>>job list shown by "qstat -a", but the executable is still running on
>>the compute nodes. Every time, I have to log in to the corresponding
>>compute node and kill the running job by hand.
>>I am wondering if I missed something in setting up the queue so that I
>>am unable to kill the job completely using "qdel".