[Beowulf] cluster scheduler for dynamic tree-structured jobs?

Sat May 15 09:34:31 PDT 2010

On Sat, May 15, 2010 at 10:44 AM, Andrew Piskorski <atp at piskorski.com> wrote:
> Yes, that's what I did with SGE, that part works fine.  SGE's other
> behaviors often leave much to be desired.

Just because the default settings of SGE do not fit your workflow
does not mean that "SGE's other behaviors often leave much to be
desired."

There are SGE users who specifically do not want SGE to automatically
re-run jobs when nodes become unreachable. For example, a network
failure can partition a single SGE cluster into 2 sub-clusters, so
every job could be run twice if the default were to re-run whenever
nodes are not reachable.
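
A job that must never be restarted automatically can say so at submit
time with qsub's "-r" option ("-r n" marks the job as not rerunnable,
"-r y" as rerunnable). A minimal sketch of building such a command
line follows; the helper name build_qsub_argv and the script name
job.sh are made up for illustration, and nothing is actually submitted.

```python
def build_qsub_argv(script, rerunnable=False, job_name=None):
    """Build a qsub command line that states the rerun policy explicitly.

    build_qsub_argv is a hypothetical helper, not part of SGE. It only
    assembles the argument list; passing it to subprocess.run() would
    require a working SGE client installation.
    """
    # "-r y" / "-r n" tells SGE whether this job may be restarted.
    argv = ["qsub", "-r", "y" if rerunnable else "n"]
    if job_name:
        # "-N" sets the job name shown by qstat.
        argv += ["-N", job_name]
    # qsub options must come before the script path.
    argv.append(script)
    return argv
```

Calling build_qsub_argv("job.sh", job_name="tree_node") yields an
argument list for a job that SGE should not re-run on its own, which
avoids the double-execution case described above.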

The SGE "users" mailing list is always responsive (thanks to Reuti and
others who contribute), so for anything you don't like or understand
in SGE, you should:

1) Google (very important)
2) Check the SGE manpage, HOWTO, admin guide
3) Ask on the list

http://gridengine.sunsource.net/maillist.html

>> > 4. I really, really want a good API for programmably interacting with
>> > the cluster scheduler and ALL of its features.  I don't care too much
>
>> I haven't looked at it much, but I think DRMAA will work for that in SGE.

DRMAA is for job submission and some job monitoring. If you want to
interact with the scheduler itself, for example to change the
scheduling algorithms, then I don't think that can be done easily
with anything available in the free/open-source world or the
commercial market.
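
To show the scope DRMAA does cover, here is a rough sketch of
submitting one job and waiting for it through the DRMAA 1.0 session
interface, as exposed by the drmaa-python bindings. It assumes those
bindings are installed and that DRMAA_LIBRARY_PATH points at SGE's
libdrmaa; the function names submit_and_wait and run_example and the
/bin/sleep command are illustrative only.

```python
def submit_and_wait(session, command, args=()):
    """Submit one job through a DRMAA session and block until it ends.

    `session` is expected to behave like an initialized drmaa.Session.
    Returns (job_id, has_exited, exit_status).
    """
    jt = session.createJobTemplate()
    try:
        jt.remoteCommand = command   # executable to run on the node
        jt.args = list(args)         # its command-line arguments
        job_id = session.runJob(jt)
        # Block until the job finishes, with no timeout.
        info = session.wait(job_id, session.TIMEOUT_WAIT_FOREVER)
        return job_id, info.hasExited, info.exitStatus
    finally:
        # Always release the template, even if submission fails.
        session.deleteJobTemplate(jt)


def run_example():
    """Hypothetical usage; needs a live SGE cluster and drmaa-python."""
    import drmaa  # assumption: the drmaa-python package is available
    with drmaa.Session() as s:
        return submit_and_wait(s, "/bin/sleep", ["5"])
```

Everything here is submission and status retrieval. Nothing in this
interface reaches the scheduler configuration or its policies.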

Rayson

>
> Not as far as I could tell from reading the SGE docs a while back, no.
> It looked as if DRMAA only covers a very limited subset of SGE's
> functionality, not enough to cover the features I need.
>
> I did not (yet) check the source to see how SGE's DRMAA support is
> implemented, but the docs made it sound as if they were rolling it
> from scratch rather than simply building on top of some clear
> pre-existing SGE API.
>
>> > 8. Of course the scheduler must have a good way to track all the basic
>> > information about my nodes:  CPU sockets and cores, RAM, etc.  Ideally
>> > it'd also be straightforward for me to extend the database of node
>
>> SGE does this and can make it available as XML.
>
> Which reminds me, I need to look harder to figure out WHERE exactly
> SGE stores its node configuration data, and how I can perhaps extend
> it with additional information, like the network topology between my
> nodes.  This is probably simple but it wasn't obvious from the
> (voluminous) SGE docs.
>
> --
> Andrew Piskorski <atp at piskorski.com>
> http://www.piskorski.com/
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>