[Beowulf] Reliable Job Queueing and Notification
Chris Dagdigian dag at sonsorol.org
Wed Oct 17 11:47:00 PDT 2007
Sean,

For what it's worth, Grid Engine (SGE) has a utility binary called "qevent" that is not part of the official binary distribution but can be built from the source distribution (http://gridengine.sunsource.net). Do a Google search for "sge + qevent" and you'll at least hit a few SGE mailing list messages that cover what it does.

You might also want to check out the DRMAA stuff (http://drmaa.org/wiki/) -- it is supposed to be a DRM-neutral way of submitting jobs to a queuing system. I'm not very familiar with DRMAA so I can't tell you offhand if the current spec includes notification of completed events or not.

Another option that would work with SGE would be the use of queue-level epilog scripts that execute each time a job leaves the system for whatever reason. You can put a heck of a lot of logic and programmable activities/notifications into a custom epilog script.

A third option is the use of job dependency syntax within Grid Engine. For each of your web-service-initiated tasks you would submit 2 jobs -- the first job is your "worker" job. The second job is your "notifier" job and it is submitted to SGE with a flag that says "this job is dependent on the worker job". Once your notifier job is fired up it can do whatever sort of results checking and notification would be required.

Regards,
Chris

On Oct 16, 2007, at 10:08 AM, Sean Ward wrote:

> I've started work on a web service which contains several
> potentially long running processing steps (molecular dynamics),
> which are perfect to farm out to the fairly large (90 node) Beowulf
> I have access to. The primary issue is translating requests from
> the event driven web service, to job queues, and back again upon
> completion. Specifically, the major queuing systems I have
> immediate access to (Sun Grid Engine and Condor) only support
> e-mail based notification of job completion. Starting jobs isn't an
> issue, as my service can simply ssh over and execute shell scripts
> as needed to start things up, the problem is reliably being
> informed when the jobs fail or complete, via any programmatic
> method (such as executing a shell script, calling a web service via
> SOAP/etc, or an asynchronous message library). My other problem,
> ensuring that these web service requests don't starve in house jobs
> on the Beowulf is easily handled via the priority levels built into
> all the various job managers, although being able to checkpoint a
> long running job would be a plus (such as is supported by Condor).
>
> I am currently investigating modifications to either Condor (more
> complex to update, but checkpoint is useful) or Ruby Queue (very
> easy to update for reliable notification) to solve this issue, but
> wanted to be sure I wasn't overlooking any existing solutions to
> programmatic based queuing and receiving notifications on jobs in a
> Beowulf environment...
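
For illustration, the worker/notifier pairing described in the reply might look roughly like the sketch below. It is only a sketch under stated assumptions: it relies on the `-N` (job name) and `-hold_jid` (hold until the named job leaves the system) options of SGE's `qsub`, and the script names, the job name, and the `notify_` prefix are hypothetical placeholders, not anything from the original thread. The notifier is released whether the worker succeeded or failed, so the notifier script itself would still have to check the results.

```python
import subprocess
from typing import Callable, List


def worker_command(script: str, job_name: str) -> List[str]:
    """Build the qsub command line for the worker job, giving it a known name."""
    return ["qsub", "-N", job_name, script]


def notifier_command(script: str, worker_name: str) -> List[str]:
    """Build the qsub command line for a notifier job that SGE holds
    until the job named worker_name has left the system."""
    # "notify_" is a hypothetical naming convention used only in this sketch.
    return ["qsub", "-N", "notify_" + worker_name,
            "-hold_jid", worker_name, script]


def submit_pair(worker_script: str, notifier_script: str, job_name: str,
                run: Callable[[List[str]], object] = subprocess.check_call
                ) -> List[List[str]]:
    """Submit the worker first, then the dependent notifier.

    `run` is injectable so the command lines can be inspected without
    a Grid Engine installation; by default it executes qsub.
    """
    commands = [
        worker_command(worker_script, job_name),
        notifier_command(notifier_script, job_name),
    ]
    for cmd in commands:
        run(cmd)
    return commands
```

A web service could call something like `submit_pair("md_worker.sh", "notify.sh", "md_run_42")` once per request, with `notify.sh` (hypothetical) inspecting the worker's output and then calling back into the service. Job names are used for the dependency here only because they are known before submission; the job ID printed by `qsub` would work with `-hold_jid` as well.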
