[Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?

Ryan Novosielski novosirj at rutgers.edu
Tue Jun 12 08:11:08 PDT 2018

> On Jun 12, 2018, at 11:08 AM, Prentice Bisbal <pbisbal at pppl.gov> wrote:
> On 06/12/2018 12:33 AM, Chris Samuel wrote:
>> Hi Prentice!
>> On Tuesday, 12 June 2018 4:11:55 AM AEST Prentice Bisbal wrote:
>>> To make this work, I will be using job_submit.lua to apply this logic
>>> and assign a job to a partition. If a user requests a specific partition
>>> not in line with these specifications, job_submit.lua will reassign the
>>> job to the appropriate QOS.
>> Yeah, that's very much like what we do for GPU jobs (redirect them to the
>> partition with access to all cores, and ensure non-GPU jobs go to the
>> partition with fewer cores) via the submit filter at present.
>> I've already coded up something similar in Lua for our submit filter (that only
>> affects my jobs for testing purposes) but I still need to handle memory
>> correctly, in other words only pack jobs when the per-task memory request *
>> tasks per node < node RAM (for now we'll let jobs where that's not the case go
>> through to the keeper for Slurm to handle as now).
>> However, I do think Scott's approach is potentially very useful, by directing
>> jobs < full node to one end of a list of nodes and jobs that want full nodes
>> to the other end of the list (especially if you use the partition idea to
>> ensure that not all nodes are accessible to small jobs).
> This was something that was very easy to do with SGE. It's been a while since I worked with SGE so I forget all the details, but in essence, you could assign nodes a 'serial number' which would specify the preferred order in which nodes would be assigned to jobs, and I believe that order was specific to each queue, so if you had 64 nodes, one queue could assign jobs starting at node 1 and work its way up to node 64, while another queue could start at node 64 and work its way down to node 1. This technique was mentioned in the SGE documentation to allow MPI and shared memory jobs to share the cluster.
> At the time, I used it for exactly that purpose, but I didn't think it was that big a deal. Now that I don't have that capability, I miss it.

SLURM can also assign priority “weights” to nodes, to somewhat the same effect, so far as I know. At our site, though, that does not work, as it apparently conflicts with the topology plugin, which we also use, rather than the two being layered or combined in some more useful way.
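
To make the ordering idea from this thread concrete, here is a rough toy model in Python. It is only a sketch and not how Slurm or SGE actually schedule: the Node class, the weight field, and the functions fits_for_packing and pick_node are all hypothetical names made up for illustration. It assumes each node carries a single ordering key, that jobs smaller than a node fill from the low end of that order, and that whole-node jobs take idle nodes from the high end. The memory check mirrors the condition quoted above (per-task memory request * tasks per node < node RAM).

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    weight: int      # hypothetical ordering key; small jobs prefer low values
    cores: int
    free_cores: int


def fits_for_packing(mem_per_task_mb, tasks_per_node, node_ram_mb):
    """Pack only when the per-node memory request is strictly below node RAM."""
    return mem_per_task_mb * tasks_per_node < node_ram_mb


def pick_node(nodes, cores_wanted, whole_node):
    """Return the name of the chosen node, or None if nothing fits.

    Small jobs scan nodes in ascending weight order; whole-node jobs
    scan fully idle nodes in descending weight order, so the two kinds
    of job grow toward each other from opposite ends of the list.
    """
    if whole_node:
        candidates = [n for n in nodes if n.free_cores == n.cores]
        ordered = sorted(candidates, key=lambda n: n.weight, reverse=True)
    else:
        candidates = [n for n in nodes if n.free_cores >= cores_wanted]
        ordered = sorted(candidates, key=lambda n: n.weight)
    if not ordered:
        return None
    chosen = ordered[0]
    # Record the allocation so later calls see the reduced capacity.
    chosen.free_cores -= chosen.cores if whole_node else cores_wanted
    return chosen.name


# Four identical 16-core nodes with weights 1..4 (made-up values).
nodes = [Node(f"node{i}", weight=i, cores=16, free_cores=16) for i in range(1, 5)]
small_first = pick_node(nodes, 2, whole_node=False)
big_first = pick_node(nodes, 16, whole_node=True)
small_second = pick_node(nodes, 2, whole_node=False)
big_second = pick_node(nodes, 16, whole_node=True)
```

In this model the two small jobs land on the same low-weight node while the whole-node jobs take idle nodes from the other end, which is the fragmentation-avoiding behaviour being discussed. In a real Slurm setup the closest knob I know of is the per-node Weight setting in slurm.conf, which is a single global order rather than a per-partition one.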

|| \\UTGERS,  	 |---------------------------*O*---------------------------
||_// the State	 |         Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ	 | Office of Advanced Research Computing - MSB C630, Newark