[Beowulf] Memory limit enforcement
Tim Cutts tjrc at sanger.ac.uk
Wed Oct 10 00:23:14 PDT 2007
On 10 Oct 2007, at 5:47 am, Mike Davis wrote:

> We have been dealing with similar problems on one of our clusters.
> The solution that we're coming to is that we need a non-standard
> solution. With Sun Grid Engine, one could build a memory consumable
> and then have jobs request memory. One could even require jobs to
> request memory. The problem is that many times a user will not know
> how much memory to request.

If the memory requirements of the application are not known, then all bets are off, and there's basically nothing you can do to stop either the application being killed by an arbitrarily low memory limit that you set, or at the other extreme running out of memory.

We do exactly what you suggest, but under LSF, which has resource reservation for memory out of the box. Of course, it's not real reservation, but it's reservation as far as the scheduler is concerned.

We then have a default memory limit on the queues which is really very low indeed (1.9 GB, typically, because we have 2 GB RAM per core on our nodes). If the user wants more memory, they have to set a new higher limit themselves. When they do that, we have supplied LSF with an esub script which then checks that the user has supplied both the new memory limit and a suitable resource selection and reservation option. If they have not, the job is rejected.

So, if the user asks for a 6 GB memory limit, the esub will check that they have requested a machine with at least 6 GB of free memory, and that they have reserved that memory with the scheduler. For example:

    -M6000000 -R"select[mem>6000] rusage[mem=6000]"

On our Beowulf cluster, this has been fairly effective in reducing the frequency with which nodes run out of memory; the jobs are usually killed first. It's not 100% effective though.

> We have been experimenting with using SGE 6's suspend feature with
> a Free RAM limit to stop (suspend) jobs that are going over the
> preset limit. The problem with this particular solution is that the
> reporting feature has a default timing of once every 40 seconds.
> This means that there will be some lag and that could cause
> problems with jobs that allocate RAM very quickly.

This is a problem with the LSF solution too. I don't think there's a great deal that can be done about it, as others have said.

The other problem is that simply stopping the jobs then results in a node with suspended processes on it that are often deadlocked; you can't resume the job without running out of memory. So you might as well have simply killed the job in the first place.

> I still believe that the best solution is to make users aware of
> the memory requirements for their jobs and then have them use
> memory requests and common sense to get their work done.

Absolutely. If the user doesn't understand their application, all bets are off.

Tim

-- 
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
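
The consistency check described in the message (a raised memory limit must come with a matching select and rusage for memory) could be sketched roughly as follows. This is only an illustration in Python, not the actual esub script and not LSF's esub interface: the function name check_memory_request, the regular expressions, and the 1000 KB per MB conversion (inferred from the -M6000000 / mem=6000 example) are all assumptions.

```python
import re

# Assumption: -M is given in KB and mem in MB, with the 1000:1 ratio
# implied by the example "-M6000000 ... mem=6000" in the message.
KB_PER_MB = 1000


def check_memory_request(mem_limit_kb, res_req):
    """Hypothetical helper: return (ok, reason) for a submission.

    mem_limit_kb -- the value the user passed to -M
    res_req      -- the string the user passed to -R (may be empty)
    """
    needed_mb = mem_limit_kb // KB_PER_MB
    res_req = res_req or ""

    # Look for a "mem > N" or "mem >= N" term inside select[...]
    select = re.search(r"select\[[^\]]*\bmem\s*>=?\s*(\d+)", res_req)
    # Look for a "mem = N" term inside rusage[...]
    rusage = re.search(r"rusage\[[^\]]*\bmem\s*=\s*(\d+)", res_req)

    if select is None:
        return False, "missing select[mem>...]"
    if rusage is None:
        return False, "missing rusage[mem=...]"
    if int(select.group(1)) < needed_mb:
        return False, "select[mem>...] is lower than the memory limit"
    if int(rusage.group(1)) < needed_mb:
        return False, "rusage[mem=...] is lower than the memory limit"
    return True, "ok"


if __name__ == "__main__":
    print(check_memory_request(6000000, "select[mem>6000] rusage[mem=6000]"))
    print(check_memory_request(6000000, "select[mem>6000]"))
```

A real esub would read the submission parameters from the environment LSF provides and reject the job through LSF's own mechanism; the sketch covers only the comparison logic.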
