[Beowulf] Memory limit enforcement

Tue Oct 9 21:47:39 PDT 2007

David Kewley wrote:

>Ah, a family of issues near to my heart. :)
>
>I'll ask a broader question:  How do you enforce real memory usage in modern 
>Linux *at all*?
>
>We were interested in this because we were having user jobs regularly cause 
>nodes to go into an Out Of Memory (OOM) state, triggering the kernel's 
>oom_killer.  The oom_killer would sometimes kill system processes, which 
>sometimes caused subsequent jobs to die.  Even if subsequent jobs didn't 
>die, recovery required that we manually close the node, reboot it when 
>running jobs finished, then reopen it.  This gets to be pretty dreary after 
>a while.
>
>Our problem is somewhat different from your interests, but some of the same 
>issues come into play.  See below for the partially satisfying solution 
>that we put in place for our OOM woes.  First a review of the problem 
>landscape as I understand it.
>
>You can try to enforce memory limits with a daemon, but you risk missing 
>important events, including a badly behaved process suddenly using a whole 
>lot of memory all at once.  If that happens, your daemon is nearly useless 
>since swapping and/or oom_killer will be running, and not your daemon.  
>Your node may lock up for a while, which was what the daemon was supposed 
>to prevent.
>
>I think you really want to do it in the kernel, so that badly behaved 
>requests for memory (allocation and/or writing) can be cut off before they 
>affect anyone else.
>
>But the kernel doesn't really enforce anything useful.  It doesn't enforce a 
>resident set size (RSS) limit, even though setrlimit() will let you request 
>such a limit.  As I understand it, modern Linux doesn't even try to track 
>RSS, because semantics of RSS are unclear given modern memory management 
>methods.
>
>RSS probably isn't even what you want -- you probably want to limit the 
>amount of physical memory used, keeping the sum of the limits around the 
>amount of total RAM, to avoid swapping.  There is no way to communicate 
>this limit to the kernel; I suspect it doesn't even track it except 
>globally.
>
>The kernel *is* able to enforce the amount of virtual memory allocated per 
>process (set with setrlimit()), but as you noted, that is of limited value 
>when different applications can have very different overcommit percentages 
>(virtual memory allocated beyond the amount actually used).
>
>But take a step back from considering the limits you can place on a given 
>process.  You probably want a policy that limits memory use at the job 
>level, not at the process level, regardless of whether you have one job or 
>multiple jobs running on a node.  There is no kernel mechanism for that 
>either.
>
>Seems your best bet might be to write a daemon, and hope that actual use 
>patterns don't cause swapping or OOM before the daemon can act.
>
>To end our OOM problems, we took a different route.  The job launch 
>mechanism (via LSF) sets the per-process virtual-memory-allocation limit on 
>each user job process.  We can prevent OOM this way, unless a job both uses 
>non-standard job launch methods and has runaway memory use (which is rare 
>in our experience).
>
>Other weaknesses of our method include:
>
>* It does not prevent heavy swapping (which would be nice to have, but at 
>least the user suffers the consequences most).
>
>* It can prevent a job from using all available RAM if the job has a larger 
>overcommit than our algorithm assumes.
>
>* When the VM allocation limit is reached, the errors are often cryptic.  
>Nothing appears in syslog (unlike segfaults, which are logged at least on 
>x86_64) -- a kernel patch to enable logging would probably be trivial, 
>but stock kernels don't do it.  A failed malloc() returns NULL and sets 
>errno to ENOMEM, which many programs and libraries don't handle properly 
>(or indeed handle at all -- how many programmers omit checking the return 
>value or errno?), so the user doesn't get a useful error message.  A 
>failed stack expansion will cause a segfault (as I recall), which is also 
>cryptic to the user.  At least segfaults get logged...
>
>I'd love to hear other approaches to this family of problems.
>  
>
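
A minimal sketch of the per-process virtual-memory limit described
above, assuming Python 3.7+ on Linux. The helper name run_with_vm_limit
and the 2 GiB figure are illustrative only, not what LSF does
internally. It sets RLIMIT_AS in the child just before exec, so an
over-allocating job fails with an allocation error instead of driving
the node into OOM.

```python
import resource
import subprocess
import sys


def run_with_vm_limit(cmd, limit_bytes):
    """Run cmd with RLIMIT_AS (virtual address space) capped at limit_bytes."""

    def apply_limit():
        # Runs in the child between fork() and exec(). Processes the job
        # forks later inherit the same cap, but each gets its own cap:
        # this is a per-process limit, not a per-job total.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        if hard == resource.RLIM_INFINITY:
            cap = limit_bytes
        else:
            cap = min(limit_bytes, hard)  # cannot exceed an existing hard limit
        resource.setrlimit(resource.RLIMIT_AS, (cap, cap))

    return subprocess.run(cmd, preexec_fn=apply_limit, capture_output=True)


if __name__ == "__main__":
    two_gib = 2 * 1024**3
    # Hypothetical runaway job: tries to allocate 4 GiB under a 2 GiB cap.
    hog = [sys.executable, "-c", "buf = bytearray(4 * 1024**3)"]
    result = run_with_vm_limit(hog, two_gib)
    print("exit status:", result.returncode)
    print(result.stderr.decode().strip())
```

Python happens to turn the failed allocation into a MemoryError
traceback; a C program that ignores a NULL return from malloc() would
more likely just crash, which is the cryptic-error problem noted above.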

We have been dealing with similar problems on one of our clusters. The 
conclusion we're coming to is that we need a non-standard solution. 
With Sun Grid Engine, one could build a memory consumable and then have 
jobs request memory. One could even require jobs to request memory. The 
problem is that many times a user will not know how much memory to request.
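
To make the consumable idea concrete, here is a rough sketch in Python
of the bookkeeping involved. This is not how Grid Engine implements it;
the class and method names are made up for illustration. Each node has
a fixed memory capacity, each job must state a request, and a job only
starts if its request fits in what is left.

```python
class MemoryConsumable:
    """Per-node bookkeeping for a requestable memory resource (illustrative)."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.reserved = {}  # job id -> MB requested

    def free_mb(self):
        return self.capacity_mb - sum(self.reserved.values())

    def try_start(self, job_id, request_mb):
        # Policy choice: a job with no memory request is rejected outright.
        if request_mb is None:
            return False
        if request_mb > self.free_mb():
            return False  # would oversubscribe the node; job stays queued
        self.reserved[job_id] = request_mb
        return True

    def finish(self, job_id):
        # Release the reservation so queued jobs can use it.
        self.reserved.pop(job_id, None)


if __name__ == "__main__":
    node = MemoryConsumable(capacity_mb=4096)
    for job in ("job1", "job2", "job3"):
        print(job, "started:", node.try_start(job, 1536))
    node.finish("job1")
    print("job3 started after job1 finished:", node.try_start("job3", 1536))
```

Note that the accounting is only as good as the requests: it tracks
what jobs asked for, not what they actually use.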

We have been experimenting with SGE 6's suspend feature, using a 
free-RAM threshold to suspend jobs once free memory on the node falls 
below a preset limit. The problem with this approach is that load 
reporting defaults to once every 40 seconds, so there is some lag, and 
that could cause problems with jobs that allocate RAM very quickly.
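
For illustration, the threshold check and the cost of the reporting
lag can be sketched in Python as below. This is not SGE's code. The
function names are hypothetical, and counting buffers and page cache
as "free" is my assumption about what a free-RAM figure should mean.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # memory fields are in kB
    return fields


def free_ram_kb(fields):
    # Assumption: buffers and page cache count as free, since the
    # kernel can reclaim them under memory pressure.
    return fields.get("MemFree", 0) + fields.get("Buffers", 0) + fields.get("Cached", 0)


def should_suspend(fields, threshold_kb):
    """True once free RAM has fallen below the configured threshold."""
    return free_ram_kb(fields) < threshold_kb


def worst_case_overshoot_mb(alloc_rate_mb_per_s, report_interval_s=40):
    """Memory a job can grab between two consecutive load reports."""
    return alloc_rate_mb_per_s * report_interval_s


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    one_gb_kb = 1024 * 1024
    print("free RAM (kB):", free_ram_kb(info))
    print("suspend newest job:", should_suspend(info, one_gb_kb))
```

The last helper shows why the lag matters: a job allocating 50 MB per
second can take 2000 MB between two reports 40 seconds apart, which is
more than a 1GB free-RAM threshold leaves in reserve.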

In testing, we used some graphics jobs that take up RAM quickly. With 
three 1.5GB jobs on a 4GB machine and a limit of 1GB of free RAM, the 
last job submitted was suspended when free system memory dropped to 
1GB. When the first job finished, the third job resumed and ran to 
completion. I won't say that there are no issues with this solution, 
but I believe that it can work.

I still believe that the best solution is to make users aware of the 
memory requirements for their jobs and then have them use memory 
requests and common sense to get their work done. If anyone is 
interested in more info, please let me know and I will put you in 
contact with the programmer that we have working on this.

Mike Davis