[Beowulf] first cluster
Mark Hahn hahn at mcmaster.ca
Thu Jul 15 18:29:59 PDT 2010
>> Disadvantage is of course, when the system runs out of
>> memory the oom-killer will look for an eligible process
>> to be killed to free up some space.
>
> That assumes that you are permitting your compute nodes
> to overcommit their memory, if you disable overcommit I
> believe that you will instead just get malloc()'s failing
> when there is nothing for them to grab.

yes.  actually, configuring memory and swap is an interesting topic.
the feature Chris is referring to is, I think, the vm.overcommit_memory
sysctl (and the associated vm.overcommit_ratio.)

every distro I've seen leaves these at the default setting:
vm.overcommit_memory=0.  this is basically the traditional setting that
tells the kernel to feel free to allocate way too much memory, and to
resolve memory crunches via OOM killing.  obviously, this isn't great,
since it never tells apps to conserve memory (malloc returning zero),
and often kills processes that you'd rather not be killed (sshd, other
system daemons).  on clusters where a node may be shared across
users/jobs, OOM can result in serious collateral damage...

we've used vm.overcommit_memory=2 fairly often.  in this mode, the
kernel limits its VM allocations to a combination of the size of ram
and swap.  this is reflected in /proc/meminfo:CommitLimit, which will
be computed as

    /proc/meminfo:SwapTotal
        + (vm.overcommit_ratio / 100) * /proc/meminfo:MemTotal

(vm.overcommit_ratio is a percentage.)  /proc/meminfo:Committed_AS is
the kernel's idea of total VM usage.

IMO, it's essential to also run with RLIMIT_AS on all processes.  this
is basically a VM limit per process (not totalled across processes,
though of course threads by definition share a single VM.)  you might
be thinking that RLIMIT_RSS would be better - indeed it would, but the
kernel doesn't implement it.  basically, limiting RSS is a bit tricky
because you have to deal with how to count shared pages, and the
limiting logic is going to slow down some important hot paths.
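
The per-process RLIMIT_AS limit discussed here can be sketched as follows. This is a minimal illustration, assuming Linux and Python's standard resource module; effective_limit and run_with_as_limit are hypothetical helper names, not part of any batch system, and the 4 GiB figure is an arbitrary example value. The request is clamped to the current hard limit, since an unprivileged process cannot raise it.

```python
import resource
import subprocess
import sys


def effective_limit(requested_bytes):
    """Clamp a requested address-space limit to the current hard limit.

    An unprivileged process may lower its hard limit but never raise it.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        return min(requested_bytes, hard)
    return requested_bytes


def run_with_as_limit(argv, requested_bytes):
    """Run argv as a child process whose RLIMIT_AS is set before exec."""
    limit = effective_limit(requested_bytes)

    def apply_limit():
        # Runs in the child between fork() and exec(). Sets both the soft
        # and the hard limit, so the job cannot raise it again.
        resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

    return subprocess.run(argv, preexec_fn=apply_limit,
                          capture_output=True, text=True)


if __name__ == "__main__":
    # The child only reports the soft limit it inherited.
    probe = "import resource; print(resource.getrlimit(resource.RLIMIT_AS)[0])"
    result = run_with_as_limit([sys.executable, "-c", probe], 4 * 1024**3)
    print("child RLIMIT_AS:", result.stdout.strip())
```

Once the limit is in place, an allocation that would push the child's address space past it fails (malloc returns NULL, or MemoryError in Python) rather than being left to the OOM killer. A shell `ulimit -v` in a job prologue sets the same limit; the Python wrapper is only one way to show it.
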
(limiting AS (vsz), by contrast, only needs logic during explicit
brk/mmap/munmap ops.)

of course, to be useful, this requires users to provide reasonable
memory limits at job-submission time.  (our user population is pretty
diverse, and isn't very good at doing wallclock limits, let alone
"wizardly" issues like VM footprint.)

batch systems often also provide their own resource management systems.
I'm not fond of putting much effort in this direction, since it's
usually based on a load-balancing model (which doesn't work if job
memory use fluctuates), and upon on-node daemons which are assumed to
be able to stay alive long enough to kill over-large job processes.
yes, one can harden such system daemons by locking them into ram, but
that's not an unalloyed win: they'll probably be nontrivial in size,
and such memory usage is unswappable, even if some of the pages are
never used...

anyway, back to the topic: it's eminently possible to run nodes without
swap, and reasonably safe to do so if your user community is not
totally random, and if you make smart use of vm.overcommit_memory=2 and
RLIMIT_AS.  5 years ago, running swapless was somewhat risky because
the kernel was dramatically better tested/tuned in a normal swappable
configuration.  my guess is that the huge embedded ecosystem has made
swapless more robust, especially if you take the time to configure some
basic sanity limits on user processes.

regards, mark hahn.
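
For readers who want to check the CommitLimit arithmetic from the message on their own nodes, here is a rough sketch. It assumes the usual "Name:   value kB" layout of /proc/meminfo and ignores huge pages, which the kernel subtracts from MemTotal before applying the ratio. The function names and the sample figures are made up for illustration; 50 is the kernel's default vm.overcommit_ratio.

```python
def parse_meminfo(text):
    """Return {field_name: value_in_kB} for lines like 'MemTotal:  123 kB'."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def commit_limit_kb(meminfo, overcommit_ratio):
    """SwapTotal + overcommit_ratio percent of MemTotal (huge pages ignored)."""
    return meminfo["SwapTotal"] + meminfo["MemTotal"] * overcommit_ratio // 100


def commit_headroom_kb(meminfo, overcommit_ratio):
    """How much more VM can be committed before allocations start failing."""
    return commit_limit_kb(meminfo, overcommit_ratio) - meminfo["Committed_AS"]


# Made-up sample values, not taken from a real node.
SAMPLE = """\
MemTotal:       16000000 kB
SwapTotal:       2000000 kB
Committed_AS:    3500000 kB
"""

info = parse_meminfo(SAMPLE)
print(commit_limit_kb(info, 50))     # 2000000 + 16000000 * 50 // 100 = 10000000
print(commit_headroom_kb(info, 50))  # 10000000 - 3500000 = 6500000
```

On a live node the same functions can be fed the contents of /proc/meminfo and the integer in /proc/sys/vm/overcommit_ratio. With SwapTotal at 0, as on a swapless node, the limit is just the ratio's share of MemTotal, so the ratio needs a deliberate choice rather than the default.
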
