[Beowulf] Re: cluster scheduler for dynamic tree-structured jobs?

Andrew Piskorski atp at piskorski.com
Sat May 15 05:33:39 PDT 2010


On Sat, May 15, 2010 at 06:24:54AM -0400, Andrew Piskorski wrote:

> 1. I have lots of embarrassingly parallel tree-structured jobs which I
> dynamically generate and submit from top-level user code (which
> happens to be written in R).  E.g., my user code generates 10 or 100
> or 1000 jobs, and each of those jobs might itself generate N jobs.
> Any given job cannot complete until all its children complete.

Condor's "MW" master-worker API and DAGMan both sound potentially
useful for my tree-structured jobs.  However...

Does MW support multiple levels of masters and workers?  (That's what
I need.)  The docs never mention it, not even when discussing the
scalability limitations of a single master process, so I presume it
does not.  MW also requires both Condor and Condor-PVM.

  http://www.cs.wisc.edu/condor/mw/
  http://www.cs.wisc.edu/condor/mw/overview.html
  http://www.cs.wisc.edu/condor/pvm/

Since Condor does not itself understand inter-job dependencies at all,
it seems that two MW master programs running at the same time could
readily deadlock each other.  At least, I don't see anything in either
MW or Condor proper that would prevent or ameliorate that risk.
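
As a toy illustration of that deadlock risk, here is a small Python
model. It is not Condor or MW code: the function name, the slot counts,
and the FIFO policy are all assumptions made up for the sketch. It
assumes each master holds a slot while it waits for its children, and
that the scheduler hands out slots with no idea who is waiting on whom.

```python
from collections import deque

def simulate(total_slots, n_masters, children_per_master):
    """Toy model of a dependency-blind scheduler (hypothetical, not Condor).

    Each master occupies one slot until all of its children have finished.
    Each child needs a slot of its own. Returns "completed" or "deadlock".
    """
    free = total_slots
    queue = deque(("master", m) for m in range(n_masters))  # FIFO job queue
    remaining = {}   # running master id -> children not yet finished
    finished = 0
    while finished < n_masters:
        progressed = False
        running_children = []
        # Scheduler: start queued jobs in FIFO order while slots are free.
        while queue and free > 0:
            kind, m = queue.popleft()
            free -= 1
            progressed = True
            if kind == "master":
                # A master submits its children, then blocks holding its slot.
                remaining[m] = children_per_master
                queue.extend(("child", m) for _ in range(children_per_master))
            else:
                running_children.append(m)
        # Children that got a slot run to completion and release it.
        for m in running_children:
            free += 1
            remaining[m] -= 1
        # Masters whose children are all done finish and release their slot.
        for m in [m for m, left in remaining.items() if left == 0]:
            del remaining[m]
            free += 1
            finished += 1
            progressed = True
        if not progressed:
            return "deadlock"   # every slot is held by a waiting master
    return "completed"

print(simulate(total_slots=2, n_masters=2, children_per_master=3))
print(simulate(total_slots=3, n_masters=2, children_per_master=3))
```

With two slots and two masters, both masters start, fill the pool, and
then wait forever on children that can never be scheduled. With one
spare slot the same workload drains. A scheduler that knew about the
parent/child relationship could avoid starting the second master, or
could free a waiting master's slot.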

From its docs, DAGMan is purely static: it has to know about all the
jobs ahead of time, before any of them start, and cannot dynamically
submit new jobs (no good for me).  It sits as a separate layer above
Condor; Condor itself does not understand inter-job dependencies at all.
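
The static/dynamic distinction can be sketched in a few lines of
Python. This is only an illustration under assumed names (run_static,
run_dynamic, and expand are hypothetical, and nothing here is DAGMan's
actual interface): the first function needs the whole dependency graph
up front, while the second only learns a job's children once that job
is running, which is the tree-structured case described in the quoted
text above.

```python
def run_static(dag):
    """Static case: dag maps job -> list of jobs it depends on,
    and must be fully known before anything runs."""
    done, order = set(), []
    while len(done) < len(dag):
        ready = sorted(j for j, deps in dag.items()
                       if j not in done and all(d in done for d in deps))
        if not ready:
            raise ValueError("cycle or missing dependency in DAG")
        for j in ready:
            order.append(j)
            done.add(j)
    return order

def run_dynamic(root, expand):
    """Dynamic case: expand(job) returns the children a job generates,
    and is only called once that job has started."""
    order = []
    def run(job):
        for child in expand(job):   # children are discovered at run time
            run(child)
        order.append(job)           # a job completes only after its children
    run(root)
    return order

# The static graph is written down in advance.
print(run_static({"a": [], "b": ["a"], "c": ["a"]}))

# The dynamic tree is generated as jobs run: top-level jobs spawn two
# children each, and those children spawn nothing.
print(run_dynamic("root", lambda j: [j + ".0", j + ".1"] if "." not in j else []))
```

The recursion here runs everything serially in one process, so it says
nothing about scheduling; it only shows that the set of jobs in the
dynamic case does not exist until the parents execute.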

DAGMan's docs also say it has no way to recover if even a single one
of its jobs fails; it simply aborts the entire DAG.  That seems strange,
as I'd have thought that Condor itself must support some sort of job
restart when a node goes down (or is otherwise removed from the Condor
pool) - does it really not?

  http://www.cs.wisc.edu/condor/dagman/
  http://www.cs.wisc.edu/condor/manual/v6.1/2_11Inter_job_Dependencies.html

The DAGMan stuff sounds like a research hack that's not really fully
supported by Condor.  AFAICT MW and DAGMan are also entirely unrelated
to each other.

Does anybody actually use either of those tools?

And of course, it's not clear whether Condor in general would really
meet the needs I laid out earlier.

-- 
Andrew Piskorski <atp at piskorski.com>
http://www.piskorski.com/


