[Beowulf] SGE + policy
landman at scalableinformatics.com
Thu May 27 08:49:30 PDT 2004
Consumable resources are your friends, as is the grid-engine mailing
list. This is a bit different from the other license management
problems discussed there, so the previous solutions are probably not
appropriate, but they could be used as the basis for something that is.

Create 2 consumables, one for the short runs, one for the longer runs.
Have the sum of the consumable slots equal the number of licenses.
Interactive/short users submit to their consumable, and long running
users submit to theirs.
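
As a rough illustration of the bookkeeping only, here is a minimal Python sketch. It assumes two consumables named matlab_short and matlab_long have already been defined in the SGE complex configuration (those names, and both helper functions, are made up for this example). It splits a license count into two pools that always sum to the total, and builds the qsub command line that requests one unit from the chosen pool via the standard "-l resource=value" request.

```python
def split_licenses(total, short_share):
    """Split `total` licenses into (short, long) pools that sum to `total`.

    `short_share` is the fraction (0..1) intended for short/interactive runs.
    Each pool is guaranteed at least one license.
    """
    if total < 2:
        raise ValueError("need at least 2 licenses to give each pool one")
    short = round(total * short_share)
    short = max(1, min(total - 1, short))  # keep both pools non-empty
    return short, total - short


def qsub_args(pool, script):
    """Build a qsub command that requests one unit of the pool's consumable.

    The consumable names are hypothetical; use whatever was configured.
    """
    resource = {"short": "matlab_short", "long": "matlab_long"}[pool]
    return ["qsub", "-l", f"{resource}=1", script]


if __name__ == "__main__":
    short, long_ = split_licenses(10, 0.3)
    print("short pool:", short, "long pool:", long_)
    print(" ".join(qsub_args("short", "homework.sh")))
```

Retuning on a live cluster then amounts to changing the two consumable totals while keeping their sum equal to the license count.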

You really don't want to invoke checkpointing under SGE at this time.
SGE (like most schedulers under Linux) doesn't checkpoint a job; it
kills it. Condor has application libs which do checkpoint, but I have no
idea how reliably this works. Moreover, checkpointing a job with an
open license request tends to lock that license. If the application
itself can checkpoint, you can hook that into the SGE checkpoint
mechanism to use that checkpoint rather than the OS feature.

This is not a great solution (2 consumables), but it does work, and can
be adjusted/tuned on a live cluster trivially. You set up wrapper
scripts that launch the jobs, detect the return code of the Matlab/other
job, and send the appropriate return code back to SGE (99 as I remember)
to have the job rescheduled if no licenses are available. This of course
requires Matlab/other to be able to intelligently let you know if there
are no licenses available...
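
A minimal sketch of such a wrapper, in Python, might look like the following. It assumes the application signals a failed license checkout with a non-zero exit status plus a recognizable message in its output; the marker string here is a placeholder, not the actual Matlab or license manager text, and the function names are made up for this example. The value 99 is the reschedule code recalled above.

```python
import subprocess
import sys

RESCHEDULE = 99  # exit code recalled above as "put this job back in the queue"

# Hypothetical marker: the real message depends on the application
# and on the license manager in use.
NO_LICENSE_MARKERS = ("license checkout failed",)


def exit_code_for(returncode, output):
    """Map the application's result to the code handed back to SGE."""
    text = output.lower()
    if returncode != 0 and any(marker in text for marker in NO_LICENSE_MARKERS):
        return RESCHEDULE
    return returncode


def main(argv):
    """Run the wrapped command, echo its output, and return the mapped code."""
    if not argv:
        print("usage: wrapper.py command [args...]", file=sys.stderr)
        return 2
    proc = subprocess.run(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    sys.stdout.write(proc.stdout)
    return exit_code_for(proc.returncode, proc.stdout)

# When saved as a job script, finish with:
#     sys.exit(main(sys.argv[1:]))
```

Any other failure is passed through unchanged, so a genuinely broken job is not requeued forever.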

Robert G. Brown wrote:
>Dear Perfect Masters of Grid Computing:
>Economics is preparing to set up a small pilot cluster at Duke and the
>following question has come up.
>Primary tasks: matlab and stata jobs, run either interactively/remote
>or (more likely) in batch mode. Jobs include both "short" jobs that
>might take 10-30 minutes run by e.g. 1-2nd year graduate students as
>part of their coursework and "long" jobs that might take hours to days
>run by more advanced students, postdocs, faculty.
>Constraint: matlab requires a license managed by a license manager.
>There are a finite number of licenses (currently less than the number of
>CPUs) spread out across the pool of CPUs.
>Concern: That long running jobs will get into the queue (probably SGE
>managed queue) and starve the short running jobs for either licenses or
>CPUs or both. Students won't be able to finish their homework in a
>timely way because long running jobs de facto hog the resource once they
>are given a license/CPU.
>I am NOT an SGE expert, although I've played with it a bit and read a
>fair bit of the documentation. SGE appears to run either in FIFO mode,
>which of course would lead to precisely the sort of resource starvation
>feared, or in equal share mode. Equal share mode appears to solve a different
>resource starvation problem -- that produced by a single user or group
>saturating the queue with lots of jobs, little or big, so that others
>submitting after they've loaded the queue have to wait days or weeks to
>get on. However, it doesn't seem to have anything to do with job
>>>control<< according to a policy -- stopping a long running job so that
>a short running job can pass through.
>It seems like this would be a common problem in shared environments with
>a highly mixed workload and lots of users (and indeed is the problem
>addressed by e.g. the kernel scheduler in almost precisely the same
>context on SMP or UP machines). Recognizing that the license management
>problem will almost certainly be beyond the scope of any solution
>without some hacking and human-level policy, are there any well known
>solutions to this well known problem? Can SGE actually automagically
>control jobs (stopping and starting jobs as a sort of coarse-grained
>scheduler to permit high priority jobs to pass through long running low
>priority jobs)? Is there a way to solve this with job classes or
>wrapper scripts that is in common use?
>At your feet, your humble student waits, oh masters of SGE and Grids...