[Beowulf] cluster scheduler for dynamic tree-structured jobs?

Sat May 15 08:44:50 PDT 2010

On Sat, May 15, 2010 at 07:33:08AM -0700, Skylar Thompson wrote:

> I'm not quite sure I understand what you're doing, but if you make all
> your execution hosts submit hosts as well you can submit jobs within
> your running jobs. You can use "-now y -sync y" in your jobs to ensure

Yes, that's what I did with SGE, that part works fine.  SGE's other
behaviors often leave much to be desired.
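
For anyone following along, the nested-submission pattern can be sketched roughly as below. This is only an illustration, assuming qsub is on the PATH of every execution host and each of those hosts is also a submit host. The script name "child_job.sh" and both helper names are made up for the example.

```python
import subprocess

def build_qsub_command(script, args=()):
    """Build a qsub command line that blocks until the child job finishes."""
    # -now y: ask SGE to start the job immediately, or reject it.
    # -sync y: make qsub wait for the job to complete before returning.
    return ["qsub", "-now", "y", "-sync", "y", script, *args]

def run_child_job(script, args=()):
    """Submit a child job from inside a running job; return qsub's exit code."""
    return subprocess.run(build_qsub_command(script, args)).returncode
```

A parent job would call something like run_child_job("child_job.sh", ["chunk-7"]) and branch on the returned exit code.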

E.g., "reschedule_unknown".  By default, SGE marks a node as down only
when the node's execd daemon comes back *up*!  So if the node hits a
kernel oops, reboots, and successfully restarts its execd, everything
is fine - SGE notices that the machine crashed, and reschedules
whatever job was running on it at the time.

But if the node just stays down permanently, or worse, if it goes
entirely catatonic, SGE *never* considers the node down, and will
*never* reschedule the job elsewhere!  The job remains in limbo
indefinitely until some human intervenes.

Of course there is a setting to make SGE behave more sanely; it's
called "reschedule_unknown".  It basically defines a timeout: if SGE
can't get a response from a node within that time, SGE restarts that
node's jobs elsewhere.  This was all exceedingly non-obvious.  I only
figured it out by reading Templeton's detailed "FridayTutorial.pdf"
slides discussing many practical aspects of SGE, which unfortunately
have since vanished from the web:

  http://www.globusworld.org/documents/FridayTutorial.pdf 
  http://gridengine.sunsource.net/nonav/source/browse/~checkout~/gridengine/doc/htmlman/htmlman5/sge_conf.html?pathrev=V62u2_TAG 
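
As a rough way to check whether that timeout is actually set, the sketch below scans configuration text of the kind "qconf -sconf" prints and converts the reschedule_unknown value to seconds. As I read the sge_conf man page, the value is in HH:MM:SS form and 00:00:00 means no automatic rescheduling, but treat that as an assumption and check your own version's docs. The sample text is illustrative only, not captured from a real cluster, and the function names are hypothetical.

```python
def parse_hms(value):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def reschedule_timeout(conf_text):
    """Return the reschedule_unknown timeout in seconds, or None if absent."""
    for line in conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "reschedule_unknown":
            return parse_hms(fields[1])
    return None

# Illustrative excerpt only; real "qconf -sconf" output has many more lines.
SAMPLE_CONF = """\
max_unheard                  00:05:00
reschedule_unknown           00:00:00
"""

timeout = reschedule_timeout(SAMPLE_CONF)
if timeout == 0:
    print("reschedule_unknown is disabled; jobs on dead nodes stay in limbo")
```

The setting itself is changed by editing the cluster configuration (for example with "qconf -mconf"), not by this script; the sketch only reads the value.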

Unfortunately, even after the reschedule_unknown fix I still see
occasional job lockups with SGE, where my master process stalls
indefinitely until I notice and manually tell SGE to kill and restart
some hung child job.  I haven't yet sunk the debugging time into
figuring out just what the heck is really going on there.  (And it
could well be something that's not SGE's fault at all, of course.)

That isn't the only snafu I've had with SGE, just one of the more
memorable ones.  I am by no means an SGE expert, nor even a
particularly experienced user, but it has mostly struck me as clunky
and rather programmer-unfriendly.

Basically, I ended up using SGE due to historical accident, and my
hands-on experience with it has encouraged me to take a step back and
evaluate other toolkit options.

> > 4. I really, really want a good API for programmably interacting with
> > the cluster scheduler and ALL of its features.  I don't care too much

> I haven't looked at it much, but I think DRMAA will work for that in SGE.

Not as far as I could tell from reading the SGE docs a while back, no.
It looked as if DRMAA only covers a very limited subset of SGE's
functionality, not enough to cover the features I need.

I did not (yet) check the source to see how SGE's DRMAA support is
implemented, but the docs made it sound as if they were rolling it
from scratch rather than simply building on top of some clear
pre-existing SGE API.

> > 8. Of course the scheduler must have a good way to track all the basic
> > information about my nodes:  CPU sockets and cores, RAM, etc.  Ideally
> > it'd also be straightforward for me to extend the database of node

> SGE does this and can make it available as XML.

Which reminds me, I need to look harder to figure out WHERE exactly
SGE stores its node configuration data, and how I can perhaps extend
it with additional information, like the network topology between my
nodes.  This is probably simple but it wasn't obvious from the
(voluminous) SGE docs.
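
A minimal sketch of consuming the XML mentioned above, assuming it comes from something like "qhost -xml" and has roughly the shape of the sample below: host elements containing named hostvalue children. The sample is hand-written for illustration, not real output, and the element and attribute names should be checked against what your SGE version actually emits.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the real output shape is an assumption to verify.
SAMPLE_XML = """<?xml version='1.0'?>
<qhost>
  <host name='node01'>
    <hostvalue name='num_proc'>8</hostvalue>
    <hostvalue name='mem_total'>15.7G</hostvalue>
  </host>
</qhost>
"""

def parse_qhost_xml(xml_text):
    """Map each host name to a dict of its reported attribute values."""
    root = ET.fromstring(xml_text)
    hosts = {}
    for host in root.iter("host"):
        hosts[host.get("name")] = {
            value.get("name"): (value.text or "").strip()
            for value in host.iter("hostvalue")
        }
    return hosts
```

This only reads what SGE already reports. It does not answer where the data is stored or how to add extra attributes such as network topology.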

-- 
Andrew Piskorski <atp at piskorski.com>
http://www.piskorski.com/