[Beowulf] cluster scheduler for dynamic tree-structured jobs?

Mon May 17 10:02:19 PDT 2010

Andrew Piskorski wrote:

> Folks, I could use some advice on which cluster job scheduler (batch
> queuing system) would be most appropriate for my particular needs.
> I've looked through docs for SGE, Slurm, etc., but without first-hand
> experience with each one it's not at all clear to me which I should
> choose...

I think that most of the ones out there will do what you need.  I am most familiar with PBS Pro (since I used to work for them) and a little SGE.  Are you considering commercial offerings or are you restricting yourself to the free ones?

> 1. I have lots of embarrassingly parallel tree-structured jobs which I
> dynamically generate and submit from top-level user code (which
> happens to be written in R).  E.g., my user code generates 10 or 100
> or 1000 jobs, and each of those jobs might itself generate N jobs.
> Any given job cannot complete until all its children complete.

So you generate the job list from the top of the tree, but need to process from the bottom?  

Under PBS Pro, you would use job dependencies to do this.  Have a 'meta-job' do a recursive descent of the tree that 1) generates a job for each node (initially in a 'held' state), 2) generates jobs for all the children of that node and makes the job from #1 dependent upon the completion of those children, and 3) repeats #1 and #2 for each of the children.  Once the whole tree has been submitted, release all the jobs from their held state.
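
The recursive descent described above might be sketched roughly as follows.  This is only an illustration in Python: the Node class, the RecordingScheduler stand-in, the "<name>.sh" script names, and the "N.server" job ids are all made up for the example.  The recorder does not run anything; it just logs the kind of qsub -h / qalter -W depend=afterok:... / qrls commands I would expect a PBS-style meta-job to issue, so check the exact options against your scheduler's docs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of the job tree (hypothetical structure for this sketch)."""
    name: str
    children: List["Node"] = field(default_factory=list)


class RecordingScheduler:
    """Stand-in for a real scheduler: logs commands instead of running them."""

    def __init__(self):
        self.log = []
        self._next_id = 0

    def submit_held(self, name):
        # Pretend to submit "<name>.sh" with a user hold; invent a job id.
        self._next_id += 1
        job_id = f"{self._next_id}.server"
        self.log.append(f"qsub -h -N {name} {name}.sh  # -> {job_id}")
        return job_id

    def depend_on(self, job_id, child_ids):
        # Parent may run only after every listed child exits successfully.
        dep = "afterok:" + ":".join(child_ids)
        self.log.append(f"qalter -W depend={dep} {job_id}")

    def release(self, job_id):
        self.log.append(f"qrls {job_id}")


def submit_tree(node, sched, submitted):
    """Submit node held, then its subtree, then tie node to its children."""
    job_id = sched.submit_held(node.name)                    # step 1
    submitted.append(job_id)
    child_ids = [submit_tree(child, sched, submitted)        # steps 2 and 3
                 for child in node.children]
    if child_ids:
        sched.depend_on(job_id, child_ids)
    return job_id


def submit_and_release(root, sched):
    """Submit the whole tree held, then release every job."""
    submitted = []
    submit_tree(root, sched, submitted)
    for job_id in submitted:
        sched.release(job_id)
    return submitted


if __name__ == "__main__":
    tree = Node("root", [Node("a", [Node("a1")]), Node("b")])
    recorder = RecordingScheduler()
    submit_and_release(tree, recorder)
    print("\n".join(recorder.log))
```

Releasing only at the end means no parent can start while its dependency list is still being built; the leaves run first and the tree completes from the bottom up.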

> Also, multiple users may be submitting unrelated jobs at the same
> time, some of their jobs should have higher priority than others, etc.
> (The usual reasons for wanting to use a cluster scheduler in the first
> place, I think.)

Yup.  That is pretty much part and parcel of any of the workload management offerings.

> Thus, merely assigning the individual jobs to compute nodes is not
> enough, I need the cluster scheduler to also understand the tree
> relationships between the jobs.

Right.  Job dependencies do this in PBS Pro.

> 2. Sometimes I can statically figure out the full tree structure of my
> jobs ahead of time, but other times I can't or won't, so I definitely
> need a scheduler that lets me submit new sub-jobs on the fly, from any
> node in the cluster.

Most jobs can have additional dependencies added at a later point in time, as long as the job still exists.  Remember that a job becomes eligible to run as soon as all of its current children complete.  If you know that a job has additional children you want to submit later, you can prevent it from starting early by putting a hold on it.
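
To make the race concrete, here is a toy model in Python.  It is not any scheduler's API; the Job class and its method names are invented for the illustration.  It shows a parent becoming eligible after its first child finishes, before a late child has been submitted, and how a hold closes that window.

```python
class Job:
    """Toy model of a job with a user hold and outstanding children."""

    def __init__(self, name, held=False):
        self.name = name
        self.held = held
        self.pending = set()   # children that have not completed yet

    def add_child(self, child):
        self.pending.add(child)

    def child_finished(self, child):
        self.pending.discard(child)

    def release(self):
        self.held = False

    def eligible(self):
        # Runnable only with no hold and no unfinished children.
        return not self.held and not self.pending


def demo(hold_parent):
    """Return (eligible before late child was added, eligible at the end)."""
    parent = Job("parent", held=hold_parent)
    parent.add_child("c1")
    parent.child_finished("c1")
    too_early = parent.eligible()   # True here means the parent could start now
    parent.add_child("c2")          # the late, dynamically generated child
    parent.release()
    parent.child_finished("c2")
    return too_early, parent.eligible()


if __name__ == "__main__":
    print("without hold:", demo(hold_parent=False))
    print("with hold:   ", demo(hold_parent=True))
```

Without the hold the parent is already eligible before the second child exists; with the hold it stays parked until you release it and the late child has finished.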

> 3. The jobs are ultimately all submitted by a small group of people
> who talk to each other, so I don't really care about any fancy
> security, cost accounting, "grid" support, or other such features
> aimed at large and/or loosely coupled organizations.

If you can get everyone to cooperate and work with each other, that's usually the best solution.  For the times you cannot, quotas, fair-share policies, and job prioritization are the tools you use.

> 4. I really, really want a good API for programmably interacting with
> the cluster scheduler and ALL of its features.  I don't care too much
> what language the API is in as long as it's reasonably sane and I can
> readily write glue code to interface it to my language of choice.

Most of the commercial offerings have APIs and various GUI portals available.

> 5. Although I don't currently do any MPI programming, I would very
> much like the option to do so in the future, and integrate it smoothly
> with the cluster scheduler.  I assume pretty much all cluster
> schedulers have that, though.  (Erlang integration might also be nice.)

Should be no problem for any of the major offerings.

> 6. Each of my individual leaf-node jobs will typically take c. 3 to 30
> minutes to complete, so my use shouldn't stress the scheduler's own
> performance too much.  However, sometimes I screw that up and submit
> tons of jobs that each want to run for only a small amount of time,
> say 2 minutes or less, so it'd be nice if the scheduler is
> sufficiently efficient and low-latency to keep up with that.

That's actually a fairly challenging issue for many job scheduling engines and really depends on the total system/cluster size and configuration.  Most should be able to handle it, but I will say that when you start getting down to that length of job, there are a lot of hidden gotchas that come to the surface, like disk I/O (if you are using shared data) and handling job failures (just to name a couple).

The bottom line is that short jobs are very inefficient and you should try to avoid them if possible.
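
One common way to avoid them is to pack several short tasks into one scheduler job.  The Python sketch below shows the arithmetic; the per-job overhead figure is purely an assumed number for illustration (measure your own), and both function names are invented for the example.

```python
def efficiency(task_seconds, overhead_seconds):
    """Fraction of a job's wall time spent on useful work."""
    return task_seconds / (task_seconds + overhead_seconds)


def batch_tasks(durations, target_seconds):
    """Greedily pack task durations into batches of about target_seconds.

    Each batch would be submitted as a single job that runs its tasks
    one after another, so the per-job overhead is paid once per batch.
    """
    batches, current, total = [], [], 0
    for d in durations:
        # Start a new batch when adding this task would overshoot the target.
        if current and total + d > target_seconds:
            batches.append(current)
            current, total = [], 0
        current.append(d)
        total += d
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    overhead = 10                 # assumed per-job overhead in seconds
    tasks = [60] * 10             # ten one-minute tasks
    print("one task per job:", round(efficiency(60, overhead), 3))
    for batch in batch_tasks(tasks, target_seconds=300):
        print(len(batch), "tasks per job:",
              round(efficiency(sum(batch), overhead), 3))
```

The trade-off is that a batch fails or gets requeued as a unit, so do not make the batches so large that one bad task costs you an hour of work.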

> 7. When I submit a job, I should be able to easily (and optionally)
> give the scheduler my estimates of how much RAM and cpu time the job
> will need.  The scheduler should track what resources the job ACTUALLY
> uses, and make it easy for me to monitor job status for both running
> and completed jobs, and then use that information to improve my
> resource estimates for future jobs.  (AKA good APIs, yet again.)

Most of that is available through the job submission command line.
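
As a rough illustration, glue code could build the submission command line from your estimates and bump the next estimate from what earlier runs actually used.  The Python sketch below assumes PBS Pro style "-l select=...:mem=..." and "-l walltime=HH:MM:SS" resource requests; the function names, the script name, and the 20% safety margin are made up for the example, and other schedulers spell these options differently.

```python
def format_walltime(seconds):
    """Render a duration in seconds as HH:MM:SS."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def build_qsub_args(script, mem_gb, walltime_seconds, ncpus=1):
    """Build an argument list requesting memory and wall time for one job."""
    return [
        "qsub",
        "-l", f"select=1:ncpus={ncpus}:mem={mem_gb}gb",
        "-l", f"walltime={format_walltime(walltime_seconds)}",
        script,
    ]


def next_estimate(observed, margin_percent=20):
    """Peak observed usage plus a safety margin, rounded up to an integer."""
    peak = max(observed)
    return (peak * (100 + margin_percent) + 99) // 100


if __name__ == "__main__":
    # Wall times (seconds) observed for earlier runs of a similar job.
    history = [900, 1200, 1000]
    limit = next_estimate(history)
    print(" ".join(build_qsub_args("leaf.sh", mem_gb=2, walltime_seconds=limit)))
```

The list returned by build_qsub_args can be handed to subprocess.run or an equivalent in whatever language your top-level code is written in.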

> 8. Of course the scheduler must have a good way to track all the basic
> information about my nodes:  CPU sockets and cores, RAM, etc.  Ideally
> it'd also be straightforward for me to extend the database of node
> properties as I see fit.  Bonus points if it uses a good database
> (e.g. SQLite, PostgreSQL) and a reasonable data model for that stuff.

Again, management of node resources is part of pretty much all the offerings.  Using a "true" database for this configuration data is usually not done (in my experience), mainly because it's pretty much overkill and has its own set of scaling limitations (in a 10000+ node cluster, you do not want all the nodes repeatedly accessing a single database for their configuration status).

Good luck,

-bill