[Beowulf] scheduler policy design
Peter St. John
peter.st.john at gmail.com
Tue Apr 24 08:01:40 PDT 2007
> ... So, you tell LSF the following:
> bsub -R"select[io <= 10000] rusage[io=5000]" ...
> 2) The syntax Platform use only works well for jobs which use a
> resource throughout their life, or for a limited period at the
> beginning. For cases where it only does something for a limited
> period at the end, you *have* to reserve the resource for the entire
> lifetime of the job. This isn't optimal, but without a time machine
> it's hard to do it any other way.
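(For the limited-period-at-the-beginning case, the LSF rusage string does take a duration qualifier that bounds how long the reservation is held, e.g. release the io reservation after the first ten minutes; the values here are illustrative:)

```shell
bsub -R"select[io <= 10000] rusage[io=5000:duration=10]" ...
```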
This issue of reserving the IO resource for the lifetime of the job seems
like a real annoyance; so I wonder if you could try something like this:
I imagine splitting the job into three processes over its lifetime; two
will have resources posted to the scheduler, and one will be a detached
child that holds the memory between them.
First, schedule a job with compute CPU/RAM resources but no IO (because that
won't be needed for a long time). This job then forks a detached child,
passing it the address of a pointer variable and the size of working memory
required. The detached process allocates the memory, writes the address of
that memory to the pointer variable, and then sleeps. Meanwhile, the parent
process does its many minutes of compute, and when it needs memory for
intermediate results it writes to the address provided by the pointer (that
is, it uses the memory "owned" by the detached process).
Eventually this parent is done computing and wants to write its final
results to the disk farm, and would need a ton of IO bandwidth. It posts the
3rd process as a job to the scheduler, this time with small CPU and memory
needs but with a big request for bandwidth. Then it dies.
The scheduler now has a new job to schedule, which wants to be on the same
node where the detached child sleeps with the chunk of RAM, and wants IO.
The RAM is idle until this job can be scheduled, but the CPU is freed for
the next compute job in the queue. Eventually the scheduler has the
bandwidth, spawns the write-to-farm process, which uses the sleeping child's
memory, writes to the farm, kills the child, and ends.
This seems to trade a chunk of memory (which the OS can't give back to the
scheduler until it gets consumed by the IO job) for CPU time (which gets
freed up when the computation is done) and avoids locking IO resources for a
longer time than just the life of the write job itself.
Could something along these lines help?