[Beowulf] Re: scheduler and perl

David Mathog mathog at caltech.edu
Wed Aug 2 09:51:26 PDT 2006


"Xu, Jerry" wrote:
> 
> Hi, I am maintaining a cluster where lots of users use perl to submit
> tons of jobs, which seems to me like abusing the system.

The qsub in SGE, and probably in other queue systems, allows a repeat
count (an array job) to be set with a single qsub.  So if they are
using perl to issue 1000 separate qsubs, one for each of i=1..1000,
have them use just one qsub with the repeat count method instead.
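
In SGE the repeat count is the -t option.  A minimal sketch of such
an array job script, assuming a hypothetical per-task program called
./mytask that takes the task index as its argument:

    #!/bin/sh
    #$ -cwd
    #$ -t 1-1000
    # SGE sets SGE_TASK_ID to 1..1000, one value per task.
    # ./mytask is a stand-in for whatever the per-index program is.
    ./mytask $SGE_TASK_ID

Submitted once with "qsub taskarray.sh", that puts 1000 tasks into
the queue instead of forcing 1000 separate qsub calls.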

Additionally, if they are setting up thousands of jobs, each of
which runs for a very short time (< one second), it is much better
to have them submit scripts that run N of those processes in a
chunk within a single qsub job rather than pushing each one through
the queue system separately.  That is, if you have 100 nodes and
there are 1000 jobs to run, they might run 20 in each of 50 jobs
(or some other similar mix); see the sketch below.  There is some
overhead, and typically >= 1 second wait times, built into most
queue systems, and they work better when the jobs are "long"
compared to these times.

I ran into both of these issues with my parallelblast implementation.
SGE just couldn't start the jobs on the nodes fast enough, so I
ended up using an outer SGE wrapper to start the "mother" job, which
then used PVM to start the individual jobs on each node.  That's sort
of an odd application though, as it had to run in a certain way
on all nodes at more or less the same time.

Other than that, you may want to have a users' meeting where the
various types of jobs run on the system are described by the people
who run them, so that some rational load-sharing policy can be worked
out.  Not so much "who gets the most time" (which is pure politics,
and good luck with that) but rather "how not to run jobs in such a
way that they hog the system for no good reason, keeping others from
getting work done."

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


