[Beowulf] scheduler policy design
tjrc at sanger.ac.uk
Wed Apr 25 06:15:18 PDT 2007
On 25 Apr 2007, at 8:42 am, Toon Knapen wrote:
> Interesting. However this approach requires that the IO profile of
> the application is known.
> Additionally it requires the users of the application (which are
> generally not IT guys) to know and understand this info and pass it
> on to the scheduler when they launch their app.
> In your experience, do you manage to convince real-life users to
> provide this info?
Not easily. :-)
And this is the problem with getting scheduling right, and exactly
what we were saying at the beginning of this discussion. You can't
hope to schedule optimally if the scheduler doesn't know the profile
of the application; the more information it knows the better the job
it will do. But if your users, like mine, can't or won't supply this
information, then you're very limited in what you can achieve, and
your system will be vulnerable to denial of service because of
strange mixes of jobs starting on the machines causing them to run
out of various resources, and there is basically nothing you will be
able to do about it.
The compromise we ended up with is this set of LSF queues on our
system (a cluster with about 1500 job slots):
QUEUE_NAME   PRIO STATUS       MAX  JL/U JL/P JL/H NJOBS  PEND  RUN  SUSP
yesterday     500 Open:Active  200    10    -    -     1     0    1     0
normal         30 Open:Active    -     -    -    -   281   110
hugemem        30 Open:Active    -     -    -    -     3     0    3     0
long            3 Open:Active    -     -    -    -  4022  2987
basement        1 Open:Active  300   200    -    -   127     0
yesterday: a special purpose high priority queue for the "I need it
yesterday" crowd. No run length limits, but very limited in terms of
how many slots the user can use.
normal: queue intended for shortish jobs (around 1 hour). Absolute
wall clock limit of 8 hours, after which jobs are killed.
long: queue for longer jobs with an absolute wall clock limit of 24
hours.
hugemem: special purpose queue for the two large memory SGI Altix
nodes. Users submitting jobs to this queue *must* supply memory
requirements; the submission is rejected if they do not.
basement: queue for long running or low priority jobs. No time limits,
but can't use more than a small fraction of the total cluster.
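The queue descriptions above amount to a simple decision rule, which can be sketched roughly as below. This is only an illustration: `suggest_queue` and its parameters are hypothetical names, not part of LSF, and the only figures used are the 8 hour and 24 hour wall clock limits quoted in this post.

```python
# Wall clock limits in hours, as quoted in the queue descriptions.
NORMAL_LIMIT_HOURS = 8
LONG_LIMIT_HOURS = 24


def suggest_queue(expected_hours: float,
                  urgent: bool = False,
                  large_memory: bool = False) -> str:
    """Suggest a queue name for a job (hypothetical helper, not an LSF API)."""
    if expected_hours <= 0:
        raise ValueError("expected_hours must be positive")
    if large_memory:
        # Jobs for the large memory nodes go to their own queue.
        return "hugemem"
    if urgent:
        # High priority, but the user only gets a handful of slots.
        return "yesterday"
    if expected_hours <= NORMAL_LIMIT_HOURS:
        return "normal"
    if expected_hours <= LONG_LIMIT_HOURS:
        return "long"
    # No time limit, but only a small fraction of the cluster.
    return "basement"


if __name__ == "__main__":
    for hours in (1, 12, 100):
        print(hours, "->", suggest_queue(hours))
```

In practice the user makes this choice by hand with `bsub -q`; the point is only that each queue trades priority against a hard limit of some kind.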
All the queues except hugemem also have a default memory limit of 1.9
GB; any job exceeding this limit is killed. If the user wants to
raise this limit they can, up to 7.9 GB, but they are then forced by
the same mechanism as the hugemem queue to supply proper memory
requirements.
Here's an example of what happens if they don't:
--- EXAMPLE ---
14:07:31 tjrc@bc-9-1-03:~$ bsub -M 6000000 uname -a
Job submission rejected.
You are specifying your own memory limit, so you must also supply
select[mem] and rusage[mem] resource requirement parameters. For
example:
-M2000000 -R'select[mem>2000] rusage[mem=2000]'
Remember that memory limits are set in KB, resource memory in MB.
Sorry about that. Blame Platform.
If you do not understand what this means, read the lsfintro manpage and
the following web page:
If you still don't understand after that, contact ssg-isg(at)
Request aborted by esub. Job not submitted.
--- EXAMPLE ---
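The rule enforced in the rejection above can be sketched as follows. This is not the actual esub from the cluster described here; `check_submission` and `bsub_memory_args` are hypothetical helpers, the rejection text is abbreviated, and the 1000 KB per MB factor is taken from the `-M2000000` / `mem=2000` pairing in the message rather than from LSF documentation. The pattern match is also a simplification that only recognises explicit `select[...]` and `rusage[...]` sections.

```python
import re
from typing import List, Optional

# Factor implied by the example in the rejection message:
# -M2000000 (KB) is paired with mem=2000 (MB).
KB_PER_MB = 1000


def bsub_memory_args(mem_mb: int) -> List[str]:
    """Build matching -M and -R arguments for a memory request in MB."""
    if mem_mb <= 0:
        raise ValueError("mem_mb must be positive")
    return [
        "-M", str(mem_mb * KB_PER_MB),
        "-R", f"select[mem>{mem_mb}] rusage[mem={mem_mb}]",
    ]


def check_submission(mem_limit_kb: Optional[int],
                     res_req: Optional[str]) -> Optional[str]:
    """Return a rejection message, or None if the job may be submitted."""
    if mem_limit_kb is None:
        # User kept the default memory limit; nothing further is required.
        return None
    req = res_req or ""
    # Look for "mem" inside a select[...] section and "mem=" inside rusage[...].
    has_select = re.search(r"select\[[^\]]*\bmem\b", req) is not None
    has_rusage = re.search(r"rusage\[[^\]]*\bmem\s*=", req) is not None
    if has_select and has_rusage:
        return None
    return ("Job submission rejected: a custom memory limit also needs "
            "select[mem] and rusage[mem] resource requirements.")


if __name__ == "__main__":
    print(check_submission(6000000, None))      # rejected, as in the example
    print(" ".join(bsub_memory_args(6000)))     # arguments that would pass
```

A real esub would read the submission parameters LSF hands it and abort the submission when the check fails; the sketch keeps only the decision logic so that it can be run on its own.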
All this is designed so that users who can't or won't supply detailed
parameters to LSF can still submit work, but they either are limited
in terms of how many jobs they can run at once (in the yesterday and
basement queues) or they run the risk of their job being killed if it
goes astray and uses too much time or memory (in the normal and long
queues).
Thus, it gives the users an incentive to understand their code and
use the cluster carefully and responsibly. Until we put the hard run
limits in place, the cluster was being brought to its knees at least
once a week by users just being careless, and that was why we
eventually had to be somewhat more draconian. It's worked though;
the cluster has not had a similar DoS event since putting these rules
in place.