[Beowulf] scheduler policy design
tjrc at sanger.ac.uk
Tue Apr 24 06:47:01 PDT 2007
On 24 Apr 2007, at 1:30 pm, Toon Knapen wrote:
> Tim Cutts wrote:
>>> but what if you have a bi-cpu bi-core machine to which you assign
>>> 4 slots. Now one slot is being used by a process which performs
>>> heavy IO. Suppose another process is launched that performs heavy
>>> IO. In that case the latter process should wait until the first
>>> one is done to avoid slowing down the efficiency of the system.
>>> Generally however, clusters take only time and memory
>>> requirements into account.
>> I think that varies. LSF records the current I/O of a node as one
>> of its load indices, so you can request a node which is doing less
>> than a certain amount of I/O. I imagine the same is true of SGE,
>> but I wouldn't know.
> Indeed, using SGE you could also take this into account. However if
> someone submits 4 jobs, the jobs do not directly start to generate
> heavy I/O. So the scheduler might think that the 4 jobs can easily
> coexist on this same node. However, after a few minutes all 4 jobs
> start eating disk BW and slow the node down horribly. What would
> your suggestion be to solve this ?
With LSF, you use resource reservation, via an rusage statement.
Let's say, for example, that you want to keep I/O on the node below 15
MB/sec (just for argument's sake) and you know that your code
performs I/O at 5 MB/sec. Let's also assume that the node can only
sustain 15 MB/sec total (which is pathetic, I know, but serves to
illustrate the example). This means you know that you only want to
start a job if the current I/O load is less than 10 MB/sec. So, you
tell LSF this when you submit:

bsub -R"select[io <= 10000] rusage[io=5000]" ...
So, to show what LSF does in this case, consider a single machine
with four slots. This machine, given the above conditions, would
become overloaded if LSF started four jobs on it, but it can cope
with three. This is what happens:
Initial state: 0 jobs running, io load is 0. reserved io is 0.
load+reserved is <= 10000, so LSF starts a job.
State: 1 job running, io load is 0, reserved io is 5000
load+reserved still <= 10000, so LSF starts another job
State: 2 jobs running, io load is 0, reserved io is 10000
load+reserved is still <= 10000, so LSF starts another job
State: 3 jobs running, io load is 0, reserved io is 15000
load+reserved is now >10000, so LSF will not start the fourth job,
even though a processor is available, and the three currently running
jobs haven't started performing their massive I/O yet.
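The load+reserved admission check in that walkthrough can be sketched in a few lines of Python. This is purely illustrative (the constants and names are mine, not LSF internals), but it reproduces the arithmetic above:

```python
# Sketch of LSF's load+reserved admission check as described above.
# All names and numbers are illustrative, not actual LSF internals.

IO_SELECT_LIMIT = 10000   # from select[io <= 10000], in KB/sec
IO_RUSAGE = 5000          # from rusage[io=5000], reserved per job


def can_start_job(current_io_load, reserved_io):
    """Start a job only if measured load plus reservations stays within the limit."""
    return current_io_load + reserved_io <= IO_SELECT_LIMIT


reserved = 0
started = 0
for _ in range(4):                   # four slots on the node
    # the jobs haven't begun their heavy I/O yet, so measured load is 0
    if can_start_job(0, reserved):
        started += 1
        reserved += IO_RUSAGE        # LSF reserves the declared rusage immediately
print(started)                       # three jobs start; the fourth is held back
```

The key point the reservation captures: even with zero measured load, the third job pushes reserved I/O to 15000, so the check fails for the fourth slot exactly as in the walkthrough.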
This scheme works quite well, but has some caveats:
1) It is still vulnerable to someone submitting an I/O intensive job
without appropriate resource requirements (but that's back to my
original point; if you don't give the scheduler the right
information, it can't possibly schedule optimally). You can always
implement an esub rule to force people to add the appropriate
resources (I do precisely that for memory intensive jobs, using
exactly this technique).
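To illustrate the shape of such an esub rule: the real esub interface is LSF-specific (it reads the submission parameters from LSF's environment and aborts the job), so the sketch below is just a plain Python stand-in for the check itself, using a made-up function name and treating the -R string as input:

```python
# Illustrative stand-in for an esub-style rule: reject submissions that
# don't declare an io rusage. Not LSF's esub API -- a real esub would
# read the submission parameters from LSF and abort the job instead of
# raising an exception.
import re


def require_io_rusage(res_req):
    """Raise if the resource-requirement string lacks rusage[io=...]."""
    if res_req is None or not re.search(r'rusage\[[^\]]*\bio=\d+', res_req):
        raise ValueError("submission rejected: please declare rusage[io=...]")
    return True


require_io_rusage('select[io <= 10000] rusage[io=5000]')   # accepted
```

The same pattern works for memory: check for rusage[mem=...] and bounce anything that omits it, which is what forces people to give the scheduler the information it needs.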
2) The syntax Platform use only works well for jobs which use a
resource throughout their life, or for a limited period at the
beginning. For cases where a job only uses the resource for a limited
period at the end, you *have* to reserve the resource for the entire
lifetime of the job. This isn't optimal, but without a time machine
it's hard to do it any other way.