[Beowulf] cluster scheduler for dynamic tree-structured jobs?

Sat May 15 07:33:08 PDT 2010

On 05/15/10 03:24, Andrew Piskorski wrote:
> Folks, I could use some advice on which cluster job scheduler (batch
> queuing system) would be most appropriate for my particular needs.
> I've looked through docs for SGE, Slurm, etc., but without first-hand
> experience with each one it's not at all clear to me which I should
> choose...
>
> I've used Sun Grid Engine for this in the past, but the result was
> very klunky and hard to maintain.  SGE seems to have all the necessary
> features underneath, but no good programming API, and its command-line
> tools often behave in ways that make them a poor substitute.
>
> Here's my current list of needs/wants, starting with the ones that
> probably make my use case more unusual:
>
> 1. I have lots of embarrassingly parallel tree-structured jobs which I
> dynamically generate and submit from top-level user code (which
> happens to be written in R).  E.g., my user code generates 10 or 100
> or 1000 jobs, and each of those jobs might itself generate N jobs.
> Any given job cannot complete until all its children complete.
>
> Also, multiple users may be submitting unrelated jobs at the same
> time, some of their jobs should have higher priority than others, etc.
> (The usual reasons for wanting to use a cluster scheduler in the first
> place, I think.)
>
> Thus, merely assigning the individual jobs to compute nodes is not
> enough, I need the cluster scheduler to also understand the tree
> relationships between the jobs.  Without that, it'd be too easy to get
> into a live-lock situation, where all the nodes are tied up with jobs,
> none of which can complete because they are waiting for child jobs
> which cannot be scheduled.
>    

I'm not quite sure I understand what you're doing, but if you make all
your execution hosts submit hosts as well, you can submit jobs from
within your running jobs. Submit the children with "qsub -now y -sync y"
to ensure that the parent doesn't exit until its children have exited.
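
For what it's worth, here is a rough Python sketch of that pattern. It
is untested against a live cluster, and the helper names and script
names are made up for illustration. As I read the qsub man page,
"-sync y" makes qsub block until the job finishes and return the job's
exit status, and "-now y" means "start immediately or fail", so a parent
can notice that there is no free slot for a child instead of waiting
forever.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def child_qsub_argv(script, args=(), now=True):
    """Build the qsub command line for one child job (nothing is run here)."""
    argv = ["qsub", "-sync", "y"]      # -sync y: qsub blocks until the job ends
    if now:
        argv += ["-now", "y"]          # -now y: start immediately or fail
    return argv + [script] + list(args)


def run_children(scripts, run=subprocess.call):
    """Submit all children at once; return their exit statuses in order.

    `run` is injectable so the logic can be exercised without a cluster.
    """
    if not scripts:
        return []
    # One thread per child, because each qsub -sync y call blocks.
    with ThreadPoolExecutor(max_workers=len(scripts)) as pool:
        return list(pool.map(lambda s: run(child_qsub_argv(s)), scripts))
```

A parent job would call run_children(["leaf1.sh", "leaf2.sh"]) and treat
any non-zero status as "child failed or could not be started".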

> 2. Sometimes I can statically figure out the full tree structure of my
> jobs ahead of time, but other times I can't or won't, so I definitely
> need a scheduler that lets me submit new sub-jobs on the fly, from any
> node in the cluster.
>
> 3. The jobs are ultimately all submitted by a small group of people
> who talk to each other, so I don't really care about any fancy
> security, cost accounting, "grid" support, or other such features
> aimed at large and/or loosely coupled organizations.
>
> 4. I really, really want a good API for programmably interacting with
> the cluster scheduler and ALL of its features.  I don't care too much
> what language the API is in as long as it's reasonably sane and I can
> readily write glue code to interface it to my language of choice.
>    

I haven't looked at it much, but I think DRMAA will work for that in SGE.
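
A minimal sketch of what that might look like through the Python DRMAA
binding (the third-party "drmaa" module, which needs SGE's libdrmaa).
I'm assuming its usual Session/JobTemplate calls here; run_and_wait is
just a name I picked, and I haven't run this against a real qmaster.

```python
def run_and_wait(session, command, args=()):
    """Submit one job through a DRMAA session and block until it finishes."""
    jt = session.createJobTemplate()
    try:
        jt.remoteCommand = command     # executable to run on the exec host
        jt.args = list(args)
        job_id = session.runJob(jt)
    finally:
        session.deleteJobTemplate(jt)  # templates must be freed explicitly
    # -1 is the DRMAA "wait forever" timeout (Session.TIMEOUT_WAIT_FOREVER).
    info = session.wait(job_id, -1)
    return job_id, info.exitStatus

# Usage on a submit host (assumes the drmaa module is installed):
#   import drmaa
#   s = drmaa.Session()
#   s.initialize()
#   print(run_and_wait(s, "/bin/sleep", ["5"]))
#   s.exit()
```

Since a job can open its own session on an execution host that is also
a submit host, the same function should work for child jobs too.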

> 5. Although I don't currently do any MPI programming, I would very
> much like the option to do so in the future, and integrate it smoothly
> with the cluster scheduler.  I assume pretty much all cluster
> schedulers have that, though.  (Erlang integration might also be nice.)
>    

SGE does indeed do MPI integration. I doubt it does Erlang integration
out of the box, but the integration is just a collection of pre- and
post-job scripts, so you should be able to write it yourself if you
have to.

> 6. Each of my individual leaf-node jobs will typically take c. 3 to 30
> minutes to complete, so my use shouldn't stress the scheduler's own
> performance too much.  However, sometimes I screw that up and submit
> tons of jobs that each want to run for only a small amount of time,
> say 2 minutes or less, so it'd be nice if the scheduler is
> sufficiently efficient and low-latency to keep up with that.
>    

SGE's scheduler latency is tunable to a certain degree. As you decrease
the maximum latency, you increase the load on the scheduler, so you
might need beefier hardware to accommodate it.

> 7. When I submit a job, I should be able to easily (and optionally)
> give the scheduler my estimates of how much RAM and cpu time the job
> will need.  The scheduler should track what resources the job ACTUALLY
> uses, and make it easy for me to monitor job status for both running
> and completed jobs, and then use that information to improve my
> resource estimates for future jobs.  (AKA good APIs, yet again.)
>    

SGE can give you this with requestable complexes, although I don't think 
it'll learn from your estimates.
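
As a sketch of how the estimates could be passed along and the actual
usage read back: the code below assumes the stock h_vmem and h_rt
complexes are requestable on your cluster (I believe they are by
default) and that "qacct -j <job_id>" prints plain "key value" lines.
The function names are mine. Note that both are hard limits as far as I
know, so a job exceeding them gets killed; pad the estimates.

```python
def resource_request(mem_gb=None, minutes=None):
    """Turn optional estimates into a qsub '-l' argument list ([] if none)."""
    parts = []
    if mem_gb is not None:
        parts.append("h_vmem=%gG" % mem_gb)
    if minutes is not None:
        secs = int(round(minutes * 60))
        parts.append("h_rt=%d:%02d:%02d"
                     % (secs // 3600, secs % 3600 // 60, secs % 60))
    return ["-l", ",".join(parts)] if parts else []


def parse_qacct(text):
    """Parse 'key value' lines (as from 'qacct -j <job_id>') into a dict."""
    usage = {}
    for line in text.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2:           # skips separator lines like '====='
            usage[fields[0]] = fields[1].strip()
    return usage
```

Comparing fields such as maxvmem and ru_wallclock from parse_qacct with
what you requested would be one way to tune the next round of estimates
yourself.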

> 8. Of course the scheduler must have a good way to track all the basic
> information about my nodes:  CPU sockets and cores, RAM, etc.  Ideally
> it'd also be straightforward for me to extend the database of node
> properties as I see fit.  Bonus points if it uses a good database
> (e.g. SQLite, PostgreSQL) and a reasonable data model for that stuff.
>
> Thanks in advance for your help and advice!
>    

SGE does this and can make it available as XML.
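
For example, "qhost -xml" dumps the per-host values. A small sketch of
reading that into a dictionary follows; the SAMPLE text is hand-written
to resemble the output as I remember it (host elements containing named
hostvalue elements), so check it against what your version actually
emits before relying on it.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, NOT real output; the real schema may differ.
SAMPLE = """<qhost>
 <host name='node01'>
   <hostvalue name='num_proc'>8</hostvalue>
   <hostvalue name='mem_total'>15.7G</hostvalue>
 </host>
</qhost>"""


def hosts_from_qhost_xml(xml_text):
    """Map each host name to a dict of its reported values (as strings)."""
    root = ET.fromstring(xml_text)
    return {
        host.get("name"): {
            v.get("name"): (v.text or "").strip()
            for v in host.iter("hostvalue")
        }
        for host in root.iter("host")
    }
```

From there it would be easy enough to load the result into SQLite or
whatever database you prefer.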

-- 
-- Skylar Thompson (skylar at cs.earlham.edu)
-- http://www.cs.earlham.edu/~skylar/