[Beowulf] SGE + policy

Robert G. Brown rgb at phy.duke.edu
Thu May 27 07:19:00 PDT 2004


Dear Perfect Masters of Grid Computing:

Economics is preparing to set up a small pilot cluster at Duke and the
following question has come up.

Primary tasks:  matlab and stata jobs, run either interactively/remotely
or (more likely) in batch mode.  Jobs include both "short" jobs that
might take 10-30 minutes, run by e.g. first- or second-year graduate
students as part of their coursework, and "long" jobs that might take
hours to days, run by more advanced students, postdocs, and faculty.

Constraint:  matlab requires a license managed by a license manager.
There are a finite number of licenses (currently fewer than the number
of CPUs) spread out across the pool of CPUs.

Concern:  That long running jobs will get into the queue (probably an
SGE-managed queue) and starve the short running jobs for either licenses or
CPUs or both.  Students won't be able to finish their homework in a
timely way because long running jobs de facto hog the resource once they
are given a license/CPU.
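
To make the concern concrete, here is a toy Python sketch (not SGE
itself, and not anything taken from its documentation) of strict FIFO
dispatch against a fixed pool of licenses.  The job names, the run times
and the pool of two licenses are all made up for illustration.

```python
import heapq

def fifo_wait_times(jobs, licenses):
    """Non-preemptive FIFO dispatch over a fixed pool of licenses.

    jobs: list of (name, submit_minute, run_minutes).
    Returns {name: minutes spent waiting for a free license}.
    """
    # Min-heap holding the minute at which each license next becomes free.
    free_at = [0] * licenses
    heapq.heapify(free_at)
    waits = {}
    # Strict FIFO: jobs start in submission order and never give up a license.
    for name, submit, run in sorted(jobs, key=lambda j: j[1]):
        earliest = heapq.heappop(free_at)
        start = max(submit, earliest)
        waits[name] = start - submit
        heapq.heappush(free_at, start + run)
    return waits

# Hypothetical workload: two long jobs arrive just before a 20 minute homework job.
jobs = [
    ("long-a", 0, 2880),    # two days
    ("long-b", 5, 1440),    # one day
    ("homework", 10, 20),   # short coursework job
]
waits = fifo_wait_times(jobs, licenses=2)
print(waits)
```

With these made-up numbers the homework job sits in the queue for 1435
minutes (about a day) before a license frees up, even though it needs
only 20 minutes of run time.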

I am NOT an SGE expert, although I've played with it a bit and read a
fair bit of the documentation.  SGE appears to run either in FIFO mode,
which of course would lead to precisely the sort of resource starvation
feared, or in equal share mode.  Equal share mode appears to solve a
different resource starvation problem -- that produced by a single user
or group saturating the queue with lots of jobs, little or big, so that
others submitting after they've loaded the queue have to wait days or
weeks to get on.  However, it doesn't seem to have anything to do with
job >>control<< according to a policy -- stopping a long running job so
that a short running job can pass through.

It seems like this would be a common problem in shared environments with
a highly mixed workload and lots of users (and indeed is the problem
addressed by e.g. the kernel scheduler in almost precisely the same
context on SMP or UP machines).  Recognizing that the license management
problem will almost certainly be beyond the scope of any solution
without some hacking and human-level policy, are there any well known
solutions to this well known problem?  Can SGE actually automagically
control jobs (stopping and starting jobs as a sort of coarse-grained
scheduler to permit high priority jobs to pass through long running low
priority jobs)?  Is there a way to solve this with job classes or
wrapper scripts that is in common use?
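
The kind of coarse-grained job control being asked about can be
sketched as a toy Python simulation.  This is only an illustration of
the policy, not a description of anything SGE is known to do: the
priority numbers, job names and one minute time step are invented, and
it assumes a suspended job hands back its slot/license, which a real
license manager may well not allow.

```python
def preemptive_finish_times(jobs, slots):
    """Coarse-grained priority scheduling with suspend/resume.

    jobs: list of (name, submit_minute, run_minutes, priority);
          a lower priority number wins.
    Each simulated minute, the `slots` best-priority runnable jobs run and
    every other runnable job is treated as suspended.
    Returns {name: minute at which the job finishes}.
    """
    remaining = {name: run for name, _, run, _ in jobs}
    finish = {}
    t = 0
    while len(finish) < len(jobs):
        runnable = [j for j in jobs if j[1] <= t and j[0] not in finish]
        # Order by priority first, then by submission time within a priority.
        runnable.sort(key=lambda j: (j[3], j[1]))
        for name, _, _, _ in runnable[:slots]:
            remaining[name] -= 1
            if remaining[name] <= 0:
                finish[name] = t + 1
        t += 1
    return finish

# Hypothetical workload: priority 0 = short coursework, priority 1 = long research job.
jobs = [
    ("long-a", 0, 2880, 1),
    ("long-b", 5, 1440, 1),
    ("homework", 10, 20, 0),
]
finish = preemptive_finish_times(jobs, slots=2)
print(finish)
```

In this made-up case the homework job starts the minute it is submitted
and finishes at minute 30, while long-b is suspended for 20 minutes and
finishes at minute 1465 instead of 1445.  The short job passes through
at a small cost to one long job, which is the behavior the question is
after.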

At your feet, your humble student waits, oh masters of SGE and Grids...

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





