[Beowulf] SGE + policy
Patrice Seyed apseyed at bu.edu
Thu May 27 08:16:05 PDT 2004
Dr. Brown,

The convention I've used, as suggested by the gurus on the SGE mailing list, is the concept of express queues, which combines a resource assignment for "express" with subordinate queuing. SGE usually sets up one queue per host (if it's a dual, this needs to be modified slightly).

Say your first node has one CPU and is called node-1. Set up the usual queue called "node-1.q" with one job slot. Then create a queue called "express-1.q" that also has one job slot, create a resource called "express" for this queue, and set a soft/hard limit of rt to 2:00 (two hours). Finally, make "node-1.q" subordinate to "express-1.q" at the 1 job level.

This basically addresses the scenario where a user wants to submit a job that takes less than 2 hours and all the regular queues are full. They can submit the job with the "-l express=1" option, and it will go into an express queue belonging to one of the hosts, which suspends the long job in the regular queue until the express job is complete. What makes this work is the hard limit of 2 hours on the express queue, which bounds how long a long job can stay suspended.

I hope this helps. Regarding license management, you could do something with consumable resources/tracking; I also think that you can use FlexLM.

http://bioteam.net/dag/sge-flexlm-integration/
http://gridengine.sunsource.net/project/gridengine/howto/howto.html

-Patrice

-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Robert G. Brown
Sent: Thursday, May 27, 2004 10:19 AM
To: Beowulf Mailing List
Subject: [Beowulf] SGE + policy

Dear Perfect Masters of Grid Computing:

Economics is preparing to set up a small pilot cluster at Duke and the following question has come up.

Primary tasks: matlab and stata jobs, run either interactively/remote or (more likely) in batch mode. Jobs include both "short" jobs that might take 10-30 minutes, run by e.g. 1st-2nd year graduate students as part of their coursework, and "long" jobs that might take hours to days, run by more advanced students, postdocs, faculty.

Constraint: matlab requires a license managed by a license manager. There are a finite number of licenses (currently less than the number of CPUs) spread out across the pool of CPUs.

Concern: That long running jobs will get into the queue (probably SGE managed queue) and starve the short running jobs for either licenses or CPUs or both. Students won't be able to finish their homework in a timely way because long running jobs de facto hog the resource once they are given a license/CPU.

I am NOT an SGE expert, although I've played with it a bit and read a fair bit of the documentation. SGE appears to run either in FIFO mode, which of course would lead to precisely the sort of resource starvation feared, or in equal share mode. Equal share mode appears to solve a different resource starvation problem -- that produced by a single user or group saturating the queue with lots of jobs, little or big, so that others submitting after they've loaded the queue have to wait days or weeks to get on. However, it doesn't seem to have anything to do with job >>control<< according to a policy -- stopping a long running job so that a short running job can pass through.

It seems like this would be a common problem in shared environments with a highly mixed workload and lots of users (and indeed is the problem addressed by e.g. the kernel scheduler in almost precisely the same context on SMP or UP machines).

Recognizing that the license management problem will almost certainly be beyond the scope of any solution without some hacking and human-level policy, are there any well known solutions to this well known problem? Can SGE actually automagically control jobs (stopping and starting jobs as a sort of coarse-grained scheduler to permit high priority jobs to pass through long running low priority jobs)? Is there a way to solve this with job classes or wrapper scripts that is in common use?

At your feet, your humble student waits, oh masters of SGE and Grids...

rgb

--
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
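
For readers who want to see the express-queue pairing from the reply written down, here is a minimal Python sketch that only builds and prints the settings for one host; it does not talk to SGE. Several things in it are assumptions rather than anything stated in the thread: the attribute names (slots, s_rt, h_rt, complex_values, subordinate_list) are written as I recall them from the Grid Engine queue configuration format and should be checked against your installed version, "2:00" is read as two hours and written as 2:00:00, and the subordinate relationship is expressed by listing the regular queue on the express queue. The helper names express_queue_pair and render are made up for illustration.

```python
def express_queue_pair(index, slots=1, express_limit="2:00:00"):
    """Return (regular, express) queue settings for the host node-<index>.

    Attribute names are assumed, not taken from the thread; verify them
    against the queue configuration documentation for your SGE version.
    """
    host = f"node-{index}"
    regular_name = f"node-{index}.q"

    # The usual queue: one job slot, no run-time limit of its own.
    regular = {
        "qname": regular_name,
        "hostname": host,
        "slots": slots,
    }

    # The express queue: same host, tagged with the "express" resource,
    # limited to two hours, and set to suspend the regular queue once
    # one express job is running (the "1 job level" from the reply).
    express = {
        "qname": f"express-{index}.q",
        "hostname": host,
        "slots": slots,
        "complex_values": "express=1",
        "s_rt": express_limit,
        "h_rt": express_limit,
        "subordinate_list": f"{regular_name}=1",
    }
    return regular, express


def render(queue):
    """Format one queue's settings as aligned 'name  value' lines."""
    width = max(len(key) for key in queue)
    return "\n".join(f"{key.ljust(width)}  {value}" for key, value in queue.items())


if __name__ == "__main__":
    for queue in express_queue_pair(1):
        print(render(queue))
        print()
```

A short job would then be submitted with the "-l express=1" request mentioned in the reply. The printed text is meant for reading, not as a file to feed directly to the SGE admin tools.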
