[Beowulf] Memory limit enforcement

Wed Oct 10 12:24:12 PDT 2007

On Wednesday 10 October 2007 12:23:14 am Tim Cutts wrote:

> We then have a default memory limit on the queues which
> is really very low indeed (1.9 GB, typically, because we have 2 GB
> RAM per core on our nodes).  If the user wants more memory, they have
> to set a new higher limit themselves.  

I'm also relying on LSF's LSB_MEMLIMIT_ENFORCE option to take care of 
memory-greedy jobs. 

Before that, I tried to modify the default VM overcommit behavior on 
individual nodes, playing with the vm.overcommit_memory and 
vm.overcommit_ratio sysctl values.

By setting overcommit_memory=2 and an appropriate overcommit_ratio, you 
can basically prevent any swapping. The result is that malloc() calls 
going beyond the limit are denied. This is cool from the sysadmin 
standpoint, since greedy applications are killed before bringing the 
machine to its knees. But it may just as well happen that an 
application trying to use the last few available MBs gets killed, while 
another one has already allocated several GBs, which is not especially 
fair. On top of that, most scientific applications are not very 
careful about checking errors, so our users were beginning to complain 
that their applications were crashing for no apparent reason when they 
reached the overcommit limit. Which made me realize that this 
solution was probably not optimal.
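
To make the overcommit arithmetic concrete, here is a minimal sketch 
of how the commit limit is derived in mode 2, assuming the formula 
given in the kernel documentation (swap plus overcommit_ratio percent 
of RAM, huge pages excluded). The function name and the node sizes are 
made up for illustration, they are not from our cluster.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio, hugetlb_kb=0):
    """Approximate CommitLimit (in kB) when vm.overcommit_memory=2.

    Assumed formula, per the kernel docs:
        CommitLimit = (RAM - huge pages) * overcommit_ratio / 100 + swap
    """
    return swap_kb + (ram_kb - hugetlb_kb) * overcommit_ratio // 100


# Hypothetical node: 4 GB of RAM and 2 GB of swap (values in kB).
ram = 4 * 1024 * 1024
swap = 2 * 1024 * 1024

for ratio in (50, 100):
    print(ratio, commit_limit_kb(ram, swap, ratio))
```

Once the sum of committed allocations on the node reaches that figure, 
the next allocation fails, whichever process happens to ask for it. 
That is where the fairness problem comes from.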

So LSF's per-job memory limit enforcement did the trick for us: an esub 
script checks that users can't request funny limits, and jobs using 
more than requested get killed. That's good for serial jobs.

But parallel (read MPI) jobs are a different can of worms. Say you have 
2 dual-cpu nodes, with 4GB each. A user can submit a job using 4 CPUs 
and 6GB of memory without any problem as long as those 6GB are equally 
balanced between the two nodes. But since LSF's conception of memory 
limits is *per job*, it means that, for this specific job, we need to 
set -M6000000 if we want it to run. And this limit won't prevent a 
process from this job from using more than 4GB on the first node, 
making that node unusable...
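
The gap can be shown with a few lines of arithmetic. This is only a 
sketch: the function and node names are hypothetical, and the figures 
use the same decimal kB units as the -M6000000 example (so 4GB per 
node is written 4000000).

```python
def check_job(per_node_usage_kb, job_limit_kb, node_ram_kb):
    """Compare a per-job limit with what happens on each node.

    Returns (within_job_limit, overloaded_nodes): whether the summed
    usage fits the per-job limit, and which nodes exceed their RAM.
    """
    total = sum(per_node_usage_kb.values())
    within_job_limit = total <= job_limit_kb
    overloaded = sorted(
        node for node, used in per_node_usage_kb.items() if used > node_ram_kb
    )
    return within_job_limit, overloaded


JOB_LIMIT = 6000000   # -M6000000, per job
NODE_RAM = 4000000    # 4GB per node

balanced = {"node1": 3000000, "node2": 3000000}
skewed = {"node1": 4500000, "node2": 1500000}

print(check_job(balanced, JOB_LIMIT, NODE_RAM))
print(check_job(skewed, JOB_LIMIT, NODE_RAM))
```

Both layouts stay within the per-job limit, but the skewed one pushes 
node1 past its physical memory, and a per-job check never sees it.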

So anyway, no solution is perfect. I guess that what the Linux kernel 
really lacks is memory quotas. Per user. Exactly like disk quotas. 
That would be *really* neat and solve a whole range of problems.

> When they do that, we have 
> supplied LSF with an esub script which then checks that the user has
> supplied both the new memory, and a suitable resource selection and
> reservation option.  If they have not, the job is rejected.  So for
> example, if the user asks for a 6 GB memory limit, the esub will
> check that they have requested a machine with at least 6GB of free
> memory, and then reserve that memory with the scheduler.  For
> example:
>
> -M6000000 -R"select[mem>6000] rusage[mem=6000]"

I'm not 100% certain here, but I would have assumed that it would be the 
scheduler's job to select a host with enough resources to run the job. 
So from my understanding, specifying -R"rusage[mem=6000]" would be 
sufficient to select a machine with 6GB available. But I may have 
missed some LSF subtleties. :)
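
For what it's worth, the consistency check described in the quoted 
text could look roughly like the sketch below. It is not Tim's esub: 
the function name is made up, it only looks at the rusage[mem=...] 
part of the resource string, and it assumes the units implied by the 
example (-M in kB, mem in MB, with a factor of 1000 between them). A 
real esub would also have to get these values from the submission 
environment, which is not shown here.

```python
import re


def mem_request_is_consistent(mem_limit_kb, res_req):
    """Return True if res_req reserves at least mem_limit_kb of memory.

    mem_limit_kb: the value passed to -M (assumed to be in kB).
    res_req:      the -R string, e.g. "select[mem>6000] rusage[mem=6000]"
                  (mem assumed to be in MB).
    """
    match = re.search(r"rusage\[[^\]]*\bmem=(\d+)", res_req)
    if match is None:
        # No memory reservation at all: the job would be rejected.
        return False
    reserved_mb = int(match.group(1))
    return reserved_mb * 1000 >= mem_limit_kb


print(mem_request_is_consistent(6000000, "select[mem>6000] rusage[mem=6000]"))
print(mem_request_is_consistent(6000000, "rusage[mem=2000]"))
print(mem_request_is_consistent(6000000, "select[mem>6000]"))
```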

Cheers,
-- 
Kilian