[Beowulf] Using Linux cgroups to enforce resources allocation limits?

Thu Apr 17 11:19:16 PDT 2008

Hi all,

With the fresh release of the 2.6.25 kernel, Linux cgroups (fka process 
containers) are getting more attention. For those unfamiliar with the 
concept, Control Groups are "a generic framework where 
several 'resource controllers' can plug in and manage different 
resources of the system such as process scheduling or memory 
allocation. [They] also offer a unified user interface, based on a 
virtual filesystem where administrators can assign arbitrary resource 
constraints to a group of chosen tasks."

2.6.24 introduced a CPU bandwidth allocation controller, and 2.6.25 
now adds a memory resource controller. Patches for network and 
block I/O bandwidth control have also been submitted. So it looks to me 
that everything is available to create real process containers, 
able to hold individual users' jobs and keep them within 
defined limits, all at the kernel level.
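
To make the "virtual filesystem" interface concrete, here is a minimal 
sketch of how a job's process could be placed in its own group. It 
assumes the cgroup filesystem is already mounted somewhere (the 
cgroup_root argument and the job_1234 name are hypothetical), and that 
writing a PID to the group's "tasks" file moves the task into the group, 
which is how I understand the interface to work.

```python
import os


def attach_to_cgroup(cgroup_root, group_name, pid):
    """Create a control group directory and move one task into it.

    cgroup_root is a hypothetical mount point of the cgroup virtual
    filesystem. On a real mount, the kernel populates the new directory
    with control files such as "tasks".
    """
    group_dir = os.path.join(cgroup_root, group_name)
    os.makedirs(group_dir, exist_ok=True)
    # Writing a PID to "tasks" asks the kernel to move that task
    # into the group (one PID per write).
    with open(os.path.join(group_dir, "tasks"), "w") as f:
        f.write(str(pid))
    return group_dir
```

A scheduler prolog could call something like 
attach_to_cgroup("/dev/cgroup", "job_1234", os.getpid()) before 
starting the user's job, so that every child process inherits the group.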

One limitation of current resource schedulers is the enforcement of CPU 
usage limits on multi-core systems. On non-NUMA systems (hello 
Intel! :)), there's no mechanism to prevent a user who submits a job 
asking for, say, one core on an 8-core machine from actually spawning 8 
threads, which will be spread over the 8 cores and take over 
all the machine's CPU resources. This would impact the performance of 
other users' jobs in a sneaky way, and, as a rigid^Wrighteous sysadmin, 
I can't tolerate this.

I have been looking for the longest time for a way to "pin" a group of 
processes to a specific *number* of cores, and not to a specific list 
of cores (i.e. I don't want to limit a process to run on cores 0 and 1, 
but rather say that I'd like this process to use at most 2 cores on the 
system). And it looks like cgroups would be a good candidate to achieve 
this. 
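
As a rough sketch of how this might look with the CPU controller: each 
group has a "cpu.shares" weight (1024 by default, as far as I know), and 
CPU time is divided in proportion to those weights when groups compete. 
Giving each job a weight proportional to the cores it requested would 
then approximate "at most N cores" under contention. Note the 
assumptions: the helper names are mine, and a weight is not a hard cap, 
so a job could still use idle cores when nobody else wants them.

```python
import os

DEFAULT_SHARES = 1024  # weight a new group gets by default


def shares_for_cores(requested_cores):
    """Weight proportional to the number of cores the job asked for."""
    if requested_cores < 1:
        raise ValueError("a job must request at least one core")
    return DEFAULT_SHARES * requested_cores


def expected_cores(job_shares, all_shares, total_cores):
    """Cores a group can roughly expect when every group is busy."""
    return total_cores * job_shares / sum(all_shares)


def set_cpu_shares(group_dir, requested_cores):
    """Write the weight into the group's cpu.shares control file."""
    shares = shares_for_cores(requested_cores)
    with open(os.path.join(group_dir, "cpu.shares"), "w") as f:
        f.write(str(shares))
    return shares
```

For example, on an 8-core machine running jobs that requested 1, 1, 2 
and 4 cores, the weights would be 1024, 1024, 2048 and 4096, and the 
one-core job that spawns 8 threads would still be held to about one 
core's worth of CPU time while the other jobs are busy.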

Another benefit would be memory resource allocation. Our current 
scheduler, and it's probably the case for the others as well, enforces 
memory limits by accounting for the memory used by jobs every x 
minutes. So if a job has peak memory bursts, it can easily go 
unnoticed and continue to run, although it may already have either 
triggered the OOM killer or prevented another process's memory 
allocation. If the enforcement is done at the kernel level, I assume 
that it will happen in real time, and that this kind of problem would 
be avoided.
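
To illustrate why periodic accounting misses bursts, here is a small 
simulation. The usage trace, the 60-second polling interval and the 
1000 MB limit are all made-up numbers, and the "checked at every step" 
function only stands in for what I assume a kernel-level limit would do 
(with the memory controller, I believe the limit itself would be written 
to the group's memory.limit_in_bytes file).

```python
def polled_peak(usage_by_second, poll_interval):
    """Highest usage seen when sampling once every poll_interval seconds."""
    return max(usage_by_second[::poll_interval])


def first_violation(usage_by_second, limit):
    """Second at which usage first exceeds the limit, or None.

    This models a check made on every step instead of every few minutes.
    """
    for second, used in enumerate(usage_by_second):
        if used > limit:
            return second
    return None


# Hypothetical job: 500 MB steady state, with a 20-second burst to 3000 MB.
trace = [500] * 600
for second in range(130, 150):
    trace[second] = 3000

print(polled_peak(trace, 60))       # 500: the burst falls between samples
print(first_violation(trace, 1000))  # 130: caught as soon as it happens
```

The polling scheduler sees a well-behaved 500 MB job for the whole run, 
while a limit checked continuously would trip at the start of the burst.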

I have yet to try implementing cgroups and see if they could be used in 
an HPC environment to enforce reliable resource allocation limits, but 
I was wondering if anybody has tried this already, especially the 
integration with existing schedulers, or if anyone has ideas on the 
subject.

Thanks,
-- 
Kilian