[Beowulf] SGE + policy

Robert G. Brown rgb at phy.duke.edu
Thu May 27 08:03:09 PDT 2004

On Thu, 27 May 2004, Gerry Creager N5JXS wrote:

> This is really a first-cut response, with 2 visible possibilities...
> 1.  Use 2 license servers, one with 'i' licenses available for short 
> jobs, and one with 'j' licenses available for longer jobs.  For i < j, 
> starvation of the short jobs shouldn't occur too often, save when 
> there's too many masters' students trying to get their projects done in 
> time to graduate and the deadline's tomorrow.
> 2.  Priority queuing where short jobs have the nod, and longer jobs are 
> put aside and required to temporarily relinquish licenses.  Likely to 
> require programming resources to accomplish this one.
> Good question.

Thanks for the suggestions.

The lack of even coarse grained kernel-style job control for a cluster
continues to be a source of frustration.  Load balancing queueing systems
are getting to be pretty good, but this isn't a problem in load
balanced queueing, and a kernel that used load balanced queueing as a
scheduler algorithm would be terrible.  No, wait!  It would be DOS (for
a single CPU).
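
To put rough numbers on that claim, here is a toy single-CPU simulation.
It is only a sketch: the job names, the run times, and the one-minute
slice are made-up assumptions, and it models neither SGE nor any real
kernel scheduler.

```python
from collections import deque


def fifo_finish_times(jobs):
    """Run each job to completion in arrival order.

    jobs is a list of (name, minutes) pairs; returns {name: finish_minute}.
    """
    clock, finish = 0, {}
    for name, minutes in jobs:
        clock += minutes
        finish[name] = clock
    return finish


def round_robin_finish_times(jobs, slice_minutes=1):
    """Give each job one time slice, then send it to the back of the line."""
    queue = deque(jobs)
    clock, finish = 0, {}
    while queue:
        name, remaining = queue.popleft()
        ran = min(slice_minutes, remaining)
        clock += ran
        if remaining > ran:
            queue.append((name, remaining - ran))
        else:
            finish[name] = clock
    return finish


# Hypothetical workload: a 10 hour job submitted just ahead of a 20 minute one.
jobs = [("long", 600), ("short", 20)]
print(fifo_finish_times(jobs))         # short job waits behind the long one
print(round_robin_finish_times(jobs))  # short job finishes at minute 40
```

Under FIFO the short job finishes at minute 620; with time slicing it
finishes at minute 40, and the long job still finishes at minute 620
either way.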

With xmlsysd I have access to the data required to implement a queueing
system WITH a crude scheduler algorithm with a granularity of (say)
order minutes.  I've actually hacked out a couple of tries at a simple
script-level control system in perl (before perl got threads).  One would
expect that with threads it would be pretty easy to write a script based
scheduler that issues STOP and CONT signals to tasks on some sort of
RR/priority basis every minute.  It wouldn't deal with license
starvation, since I don't know how a running matlab task can
"temporarily relinquish a license" while it is stopped, but it would
manage the problem of being able to use a cluster for a mix of
prioritized long and short running jobs without resource-starving the
short ones.
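
A minimal sketch of what such a STOP/CONT loop might look like, written
in Python rather than perl and assuming all tasks are local processes
the scheduler is allowed to signal.  The Task record, the "larger number
means higher priority" convention, and the slots parameter are
hypothetical; on a real cluster the signals would have to be delivered
on each node, and nothing here touches licenses.

```python
import os
import signal
import time
from dataclasses import dataclass


@dataclass
class Task:
    pid: int
    priority: int  # assumption: larger means more important


def pick_running(tasks, slots, tick):
    """Return the set of pids allowed to run during this tick.

    Higher priority always wins.  Within one priority level the order is
    rotated by `tick`, which gives a crude round robin.
    """
    by_prio = {}
    for t in tasks:
        by_prio.setdefault(t.priority, []).append(t)
    ordered = []
    for prio in sorted(by_prio, reverse=True):
        group = by_prio[prio]
        shift = tick % len(group)
        ordered.extend(group[shift:] + group[:shift])
    return {t.pid for t in ordered[:slots]}


def apply_plan(tasks, running):
    """Send CONT to chosen tasks and STOP to the rest; drop exited tasks."""
    alive = []
    for t in tasks:
        sig = signal.SIGCONT if t.pid in running else signal.SIGSTOP
        try:
            os.kill(t.pid, sig)
            alive.append(t)
        except ProcessLookupError:
            pass  # task has already exited
    return alive


def run(tasks, slots, interval=60):
    """Re-evaluate once per interval (order minutes, as described above)."""
    tick = 0
    while tasks:
        tasks = apply_plan(tasks, pick_running(tasks, slots, tick))
        time.sleep(interval)
        tick += 1
```

With one slot, two priority 0 long jobs, and one priority 1 short job,
pick_running() selects only the short job until it exits, then
alternates between the two long jobs on successive ticks.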

I have a personal interest in this outside of econ because I am, after
all, a bottom feeder in the cluster world.  If I could ever arrange it
so that my jobs just "got out of the way" when competing jobs were
queued on a cluster according to policy, priority, ownership etc. I
might be able to wheedle more cycles out of my friends...;-)


> Gerry
> Robert G. Brown wrote:
> > Dear Perfect Masters of Grid Computing:
> > 
> > Economics is preparing to set up a small pilot cluster at Duke and the
> > following question has come up.
> > 
> > Primary tasks:  matlab and stata jobs, run either interactively/remote
> > or (more likely) in batch mode.  Jobs include both "short" jobs that
> > might take 10-30 minutes run by e.g. 1-2nd year graduate students as
> > part of their coursework and "long" jobs that might take hours to days
> > run by more advanced students, postdocs, faculty.
> > 
> > Constraint:  matlab requires a license managed by a license manager.
> > There are a finite number of licenses (currently less than the number of
> > CPUs) spread out across the pool of CPUs.
> > 
> > Concern:  That long running jobs will get into the queue (probably SGE
> > managed queue) and starve the short running jobs for either licenses or
> > CPUs or both.  Students won't be able to finish their homework in a
> > timely way because long running jobs de facto hog the resource once they
> > are given a license/CPU.
> > 
> > I am NOT an SGE expert, although I've played with it a bit and read a
> > fair bit of the documentation.  SGE appears to run either in FIFO mode,
> > which of course would lead to precisely the sort of resource starvation
> > feared, or in equal share mode.  Equal share mode appears to solve a different
> > resource starvation problem -- that produced by a single user or group
> > saturating the queue with lots of jobs, little or big, so that others
> > submitting after they've loaded the queue have to wait days or weeks to
> > get on.  However, it doesn't seem to have anything to do with
> > job >>control<< according to a policy -- stopping a long running job
> > so that a short running job can pass through.
> > 
> > It seems like this would be a common problem in shared environments with
> > a highly mixed workload and lots of users (and indeed is the problem
> > addressed by e.g. the kernel scheduler in almost precisely the same
> > context on SMP or UP machines).  Recognizing that the license management
> > problem will almost certainly be beyond the scope of any solution
> > without some hacking and human-level policy, are there any well known
> > solutions to this well known problem?  Can SGE actually automagically
> > control jobs (stopping and starting jobs as a sort of coarse-grained
> > scheduler to permit high priority jobs to pass through long running low
> > priority jobs)?  Is there a way to solve this with job classes or
> > wrapper scripts that is in common use?
> > 
> > At your feet, your humble student waits, oh masters of SGE and Grids...
> > 
> >     rgb
> > 

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
