[Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?

Chris Samuel chris at csamuel.org
Fri Jun 8 23:54:19 PDT 2018


On Saturday, 9 June 2018 12:16:16 AM AEST Paul Edmon wrote:

> Yeah this one is tricky.  In general we take the wild-west approach here,
> but I've had users use --contiguous and their job takes forever to run.

:-)

> I suppose one method would be to enforce that each job takes a full node
> and that parallel jobs always request contiguous nodes.

For us that would be wasteful though if their job can only scale to a small 
number of cores.

> As I recall Slurm will
> preferentially fill up nodes to try to leave as large of contiguous blocks
> as it can.

That's what I thought, but I'm not sure that's actually true.  In this Slurm 
bug report one of the support folks says:

https://bugs.schedmd.com/show_bug.cgi?id=3505#c16

# After the available nodes and cores are identified, the _eval_nodes()
# function is not preferring idle nodes, but sequentially going through
# the lowest weight nodes and accumulating cores.
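Given that it walks the lowest-weight nodes first, one way to steer packing might be to bias node Weight values in slurm.conf so that small jobs accumulate on a subset of nodes — a sketch only, with hypothetical node names, counts and sizes:

```
# slurm.conf sketch -- node names, counts and sizes are hypothetical.
# Slurm allocates lower-Weight nodes first, so serial/small jobs
# pile up on node[001-032], leaving the higher-weight nodes less
# fragmented for large parallel jobs.
NodeName=node[001-032] Weight=1  CPUs=32 RealMemory=128000
NodeName=node[033-128] Weight=10 CPUs=32 RealMemory=128000
```

Of course that only shifts where the fragmentation lands rather than eliminating it.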

> The other option would be to use requeue to your advantage.  Namely,
> just have a high priority queue only for large contiguous jobs, and have
> it requeue all the jobs it needs to in order to run.  That would depend
> on your single node/core users' tolerance for being requeued.
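For what it's worth, that requeue idea maps roughly onto Slurm's partition-priority preemption — a slurm.conf sketch, with hypothetical partition and node names:

```
# slurm.conf sketch -- partition and node names are hypothetical.
# Jobs in "large" can preempt jobs in "small"; preempted small
# jobs are requeued rather than cancelled.
PreemptType=preempt/partition_prio
PartitionName=large PriorityTier=10 PreemptMode=off     Nodes=node[001-128]
PartitionName=small PriorityTier=1  PreemptMode=requeue Nodes=node[001-128]
```

The small jobs would also need to be requeueable (JobRequeue=1 in slurm.conf, or sbatch --requeue) for this to actually work.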

Yeah, I suspect with the large user base we have that's not much of an option.  
This is one of the times where migration of tasks would be really handy.  It's 
one of the reasons I was really interested in the presentation by the Spanish 
group working on DMTCP checkpoint/restart at the Slurm User Group last year, 
which claims to be able to do a lot of this:

https://slurm.schedmd.com/SLUG17/ciemat-cr.pdf

cheers!
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
