[Beowulf] Memory limit enforcement

Wed Oct 10 00:23:14 PDT 2007

On 10 Oct 2007, at 5:47 am, Mike Davis wrote:

> We have been dealing with similar problems on one of our clusters.  
> The solution that we're coming to is that we need a non-standard  
> solution. With Sun Grid Engine, one could build a memory consumable  
> and then have jobs request memory. One could even require jobs to  
> request memory. The problem is that many times a user will not know  
> how much memory to request.

If the memory requirements of the application are not known, then all
bets are off.  There is basically nothing you can do to prevent either
the application being killed by an arbitrarily low memory limit that
you set or, at the other extreme, the node running out of memory.

We do exactly what you suggest, but under LSF, which has resource  
reservation for memory out of the box.  Of course, it's not real  
reservation, but it's reservation as far as the scheduler is  
concerned.  We then have a default memory limit on the queues which  
is really very low indeed (1.9 GB, typically, because we have 2 GB  
RAM per core on our nodes).  If the user wants more memory, they have  
to set a new, higher limit themselves.  When they do that, an esub
script that we have supplied to LSF checks that the user has given
both the new memory limit and a matching resource selection and
reservation option.  If they have not, the job is rejected.  For
instance, if the user asks for a 6 GB memory limit, the esub checks
that they have also selected a machine with at least 6 GB of free
memory and reserved that memory with the scheduler.  For example:

-M6000000 -R"select[mem>6000] rusage[mem=6000]"
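
To make the check concrete, here is a minimal sketch of the kind of
consistency test such an esub could apply.  It is not our actual script.
The function name check_memory_request is made up for illustration, and
it assumes (as in the example above) that the -M limit is given in KB,
that the mem values in the -R string are in MB, and that 1000 KB are
counted as 1 MB.  How the values reach the script inside a real esub is
left out entirely.

```python
import re


def check_memory_request(mem_limit_kb, res_req):
    """Check that a resource string covers the requested memory limit.

    mem_limit_kb: the -M value, assumed to be in KB.
    res_req:      the -R string, with mem values assumed to be in MB.
    Returns (ok, reason). Hypothetical helper, for illustration only.
    """
    # Assumption: 1000 KB per MB, so that -M6000000 pairs with mem=6000.
    needed_mb = mem_limit_kb // 1000

    # Look for a mem threshold inside select[...] and rusage[...].
    select = re.search(r"select\[[^\]]*\bmem\s*>=?\s*(\d+)", res_req)
    rusage = re.search(r"rusage\[[^\]]*\bmem\s*=\s*(\d+)", res_req)

    if select is None:
        return False, "no select[mem>...] given"
    if rusage is None:
        return False, "no rusage[mem=...] given"
    if int(select.group(1)) < needed_mb:
        return False, "select[mem>...] is lower than the memory limit"
    if int(rusage.group(1)) < needed_mb:
        return False, "rusage[mem=...] is lower than the memory limit"
    return True, "ok"


if __name__ == "__main__":
    # The request from the example above passes the check.
    print(check_memory_request(6000000, "select[mem>6000] rusage[mem=6000]"))
    # A raised limit with no reservation would be rejected.
    print(check_memory_request(6000000, "select[mem>6000]"))
```

A job would be rejected whenever the first element of the result is
False, with the reason reported back to the user.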

On our Beowulf cluster, this has been fairly effective in reducing
the frequency with which nodes run out of memory: the jobs are
usually killed first.  It's not 100% effective, though.

> We have been experimenting with using SGE 6's suspend feature with  
> a Free RAM limit to stop (suspend) jobs that are going over the  
> preset limit. The problem with this particular solution is that the  
> reporting feature has a default timing of once every 40 seconds.  
> This means that there will be some lag and that could cause  
> problems with jobs that allocate RAM very quickly.

This is a problem with the LSF solution too.  I don't think there's a  
great deal that can be done about it, as others have said.  The other  
problem is that simply stopping the jobs then results in a node with  
suspended processes on it that are often deadlocked; you can't resume  
the job without running out of memory.  So you might as well have  
simply killed the job in the first place.

>
> I still believe that the best solution is to make users aware of  
> the memory requirements for their jobs and then have them use  
> memory requests and common sense to get their work done.

Absolutely.  If the user doesn't understand their application, all  
bets are off.

Tim
