[Beowulf] Functionality of schedulers

Mark Hahn hahn at mcmaster.ca
Thu Mar 1 07:44:06 PST 2012


> Preferably the state of the first job should be frozen, and saved to
> disk, so that it can be restarted again when the higher priority job has
> finished.

well, maybe.  that process (checkpoint/restore) really makes sense
only if the preemptor is giant and/or long.  otherwise SIGSTOP is
a much better solution (it implies that you should have swap, but
you should have swap anyway.)
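
For illustration, the SIGSTOP approach might look roughly like the sketch
below on a single node. The names run_preempting and wait_for_state_change
are hypothetical, not part of torque, maui or any other scheduler, and the
sketch assumes the victim is one local process; a real resource manager
would have to signal every process of the job on every node.

```python
import os
import signal


def run_preempting(victim_pid, high_priority_work):
    """Stop the victim, run the preemptor, then let the victim continue.

    Hypothetical helper: high_priority_work is any callable standing in
    for the higher-priority job.
    """
    os.kill(victim_pid, signal.SIGSTOP)      # SIGSTOP cannot be caught or ignored
    try:
        return high_priority_work()
    finally:
        os.kill(victim_pid, signal.SIGCONT)  # resume even if the preemptor failed


def wait_for_state_change(child_pid):
    """Block until a direct child stops, continues or exits; name the change."""
    _, status = os.waitpid(child_pid, os.WUNTRACED | os.WCONTINUED)
    if os.WIFSTOPPED(status):
        return "stopped"
    if os.WIFCONTINUED(status):
        return "continued"
    return "exited"
```

The stopped process keeps all of its memory, which is why swap matters:
the kernel can only make room for the preemptor by paging the victim out.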

> Is this at all possible (we are using torque/maui, and I couldn't find
> this feature there)?

the torque/maui code (and even moab) has all sorts of problems keeping track of
suspension.  the weak spot is usually that when you suspend a parallel
job and the preemptor doesn't use all the cpus, you can't go starting
random other jobs on these pseudo-free cpus.  LSF wasn't all that great 
about this little detail either, at least back in 6.x versions.
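
A minimal sketch of the bookkeeping this calls for, assuming a single node
and per-cpu tracking. The Node class and all of its method names are
hypothetical and do not correspond to any scheduler's real data structures.
The point is only that cpus held by a suspended job are idle but not
schedulable.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    cpus: frozenset
    running: dict = field(default_factory=dict)    # job name -> set of cpu ids
    suspended: dict = field(default_factory=dict)  # job name -> set of cpu ids

    @staticmethod
    def _held(jobs):
        """Union of the cpu sets of all jobs in the given table."""
        return set().union(*jobs.values())

    def idle_cpus(self):
        """Cpus with nothing running on them, including pseudo-free ones."""
        return set(self.cpus) - self._held(self.running)

    def schedulable_cpus(self):
        """Idle cpus that no suspended job is waiting to get back."""
        return self.idle_cpus() - self._held(self.suspended)

    def start(self, job, cpus, preempting=False):
        """Only a preemptor may borrow cpus that belong to a suspended job."""
        cpus = set(cpus)
        allowed = self.idle_cpus() if preempting else self.schedulable_cpus()
        if not cpus <= allowed:
            raise RuntimeError(f"cpus not available for {job}")
        self.running[job] = cpus

    def suspend(self, job):
        self.suspended[job] = self.running.pop(job)

    def finish(self, job):
        del self.running[job]

    def resume(self, job):
        """A suspended job can continue only once all its cpus are idle again."""
        if not self.suspended[job] <= self.idle_cpus():
            raise RuntimeError(f"cpus for {job} are still in use")
        self.running[job] = self.suspended.pop(job)
```

With an 8-cpu node, suspending an 8-cpu job and starting a 4-cpu preemptor
leaves four idle cpus but zero schedulable ones, so a third job asking for
two cpus is refused rather than being placed on cpus the suspended job
will need back.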

it's kind of amazing how poor all the schedulers are, really.
classic example of how projects get sclerotic by adding features...

regards, mark hahn


