[Beowulf] [OOM killer/scheduler] disabling swap on cluster nodes?

Wed Feb 11 07:48:04 PST 2015

On 02/11/2015 12:25 AM, Mark Hahn wrote:
>>> is net-swap really ever a good idea?  it always seems like asking for
>>> trouble, though in principle there's no reason why net IO should be
>>> "worse" than disk IO...
>>
>> ... except for the need to allocate memory to build packets to send
>> the swap data.
>
> I thought the implication was clear, that doing disk IO may also require
> memory allocations.

Paging to local scratch is less memory intensive than constructing a 
packet in memory to hold a buffer for transfer over the network.  In fact 
local paging is really quite memory efficient.

>
>> There are still a few places that look at you funny if you suggest
>> running w/o swap.  The 6 orders of magnitude performance difference
>> for random page touching performance suggests you should stare them
>> back down.
>
> absolutely: if you have reason to believe all your pages are uniformly hot,
> more power to you!

Bad analysis.  In the old days (ugh), locality of reference was 
something you had to work very hard at to make effective use of your 
memory.  You re-ordered your loops and did all manner of other things.  
Nowadays, you have to worry about objects and their instance data, and 
you often don't know when and where they will be touched.

Feel free to use a modern OO code on a memory-starved system ... it's 
just not pleasant.  That six orders of magnitude performance variance 
between hot and cold pages will bite you.
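
To put a rough number on that, here is a minimal sketch of the usual 
effective-access-time estimate.  It assumes, purely for illustration and 
not from any measurement, about 10 ns to touch a page resident in RAM and 
about 10 ms to fault a random 4k page back from a disk swap device.  The 
function names and both latencies are assumptions.

```python
# Rough effective access time when a fraction of page touches miss RAM.
# Both latencies are illustrative assumptions, not measurements.
RAM_NS = 10.0            # assumed cost of touching a resident (hot) page
SWAP_NS = 10_000_000.0   # assumed cost of faulting a cold page from disk (10 ms)


def effective_access_ns(fault_fraction, ram_ns=RAM_NS, swap_ns=SWAP_NS):
    """Average cost per page touch given the fraction of touches that fault."""
    if not 0.0 <= fault_fraction <= 1.0:
        raise ValueError("fault_fraction must be between 0 and 1")
    return (1.0 - fault_fraction) * ram_ns + fault_fraction * swap_ns


def slowdown(fault_fraction):
    """Slowdown relative to running entirely out of RAM."""
    return effective_access_ns(fault_fraction) / RAM_NS


if __name__ == "__main__":
    for f in (0.0, 1e-6, 1e-4, 1e-2):
        print(f"fault fraction {f:g}: ~{slowdown(f):.1f}x slower")
```

With these assumed numbers, one fault per million page touches already 
roughly doubles the run time, which is why the pages do not need to be 
"uniformly hot" for swap to hurt.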

>> Seriously, if you can avoid under-spec'ing/provisioning ram, you should.
>
> in other words: buy extra ram to hold your cold pages!  after all, dram
> is only O($10/GB), and disk is O($0.05/GB).  oh, wait...

And this is what I was waiting for, someone to pull out a bad analysis 
and then use it as a strawman.

Ok, using your underlying theory here (disk is cheap, ram is expensive), 
let's go to zero ram and save money.

Oh ... Wait ...

Yes, it should be obvious why this is silly.  And by extension, the 
original argument is silly.

But the more subtle point (which is the one I had hoped you would go 
for, as it's the one that makes sense) is that there is a fine balance 
between the size of ram and (if you use it) swap.  This balancing act is 
influenced by the opportunity cost of decisions (less ram -> more swap, 
longer execution time/cost for memory intensive codes; versus more ram 
-> less swap, shorter execution time, though higher cost per node).

In fact this gets to the very definition of opportunity cost: what is 
the amount of value I am giving up by making the alternative choice? 
Another way of thinking of this is asking what the marginal value of 
more or less ram is.

This is why I argue that sizing memory (and almost all other things) is 
very important.  Building a 1TB ram machine for problems that run in 4GB 
is a waste of resources (too much ram).  Building a 16GB ram machine for 
problems that run in 1TB is a waste (too little).
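
The balancing act can be written down as a cost per job.  A minimal 
sketch follows, in which every number (node base price, $/GB, node 
lifetime, job run times) is a hypothetical placeholder to be replaced 
with your own quotes and measured run times.  Power and admin costs are 
left out to keep it short.

```python
# Cost per job for two hypothetical node configurations.
# All prices, lifetimes, and run times below are made-up placeholders.
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass
class NodeConfig:
    name: str
    base_cost: float        # node price without RAM, in dollars (assumed)
    ram_gb: int
    ram_cost_per_gb: float  # dollars per GB (assumed)
    job_hours: float        # wall-clock hours per job on this config

    def capital_cost(self) -> float:
        return self.base_cost + self.ram_gb * self.ram_cost_per_gb

    def cost_per_job(self, lifetime_years: float = 3.0) -> float:
        """Capital cost spread over the jobs the node completes in its lifetime."""
        jobs = lifetime_years * HOURS_PER_YEAR / self.job_hours
        return self.capital_cost() / jobs


if __name__ == "__main__":
    small = NodeConfig("64GB + swap", 4000.0, 64, 10.0, job_hours=20.0)
    large = NodeConfig("256GB, no swap", 4000.0, 256, 10.0, job_hours=2.0)
    for cfg in (small, large):
        print(f"{cfg.name}: ${cfg.capital_cost():.0f} per node, "
              f"${cfg.cost_per_job():.2f} per job")
```

The comparison only means something once `job_hours` comes from real 
runs of your own codes on each configuration.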

>> wish for the wild west of OOM shooting random things in comparison to
>> random 4k page touches.  Yes, I've seen the latter.
>
> thrashing is bad.  it's not the same as *using* swap.  that's why swap
> still makes sense.

Thrashing *is* using swap as a transparent memory extension.  It is one 
of the worst possible cases, and is seen quite frequently when you have 
large OO codes where you can't predict what object is going to do what.  
Or you have large in-memory databases.  Or ...

That is, swap/paging provides a memory extension, and it's a crutch 
relative to in-app memory management.  The latter is generally frowned 
upon in most development circles these days, especially with GC systems 
in OO code.

The world has evolved significantly since I spilled my first matrices to 
local files.

> interesting thought: SSD is about $0.5/GB, so would make a great swap
> dev - has anyone tried tuning the swap cluster size to match the SSD
> flash block?

We've done quite a bit of this, yes.

What it comes down to is: a) swap is a terrible thing to do, avoid it if 
possible; b) if you can't avoid it, do it as quickly as you can; c) the 
incremental cost of increasing RAM size versus paying the (often far) 
longer run time (with all its attendant costs and effects: slower 
throughput, fewer jobs per unit time, more power spent per job, etc.) is 
heavily biased *against* building sizable swap.  This is why we use 
zram, zcache, and very fast tuned swap partitions whenever possible.
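
One way to check whether a node is actually leaning on swap is to read 
the `SwapTotal` and `SwapFree` fields from `/proc/meminfo`.  A minimal 
sketch, assuming a Linux node; the function names and the fallback 
sample text are illustrative, and this only reports usage, it does not 
configure zram or zcache.

```python
# Report how much swap a Linux node is using, from /proc/meminfo-style text.
# Function names are illustrative; values in /proc/meminfo are in kB.

def parse_meminfo(text):
    """Parse 'Key:   value kB' lines into a dict of integers (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def swap_used_fraction(info):
    """Fraction of configured swap in use; 0.0 if the node has no swap."""
    total = info.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - info.get("SwapFree", 0)) / total


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            text = fh.read()
    except OSError:
        # Not on Linux: fall back to a made-up sample so the sketch still runs.
        text = "MemTotal: 16384000 kB\nSwapTotal: 4096000 kB\nSwapFree: 3072000 kB\n"
    print(f"swap in use: {swap_used_fraction(parse_meminfo(text)):.1%}")
```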

Note though, and this has happened to us before:  if a swap device dies 
while you have pages out on it ... let's just say that's a new experience 
in crashing.  It's exactly like pulling a random DIMM out of a running 
machine.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
e: landman at scalableinformatics.com
w: http://scalableinformatics.com
t: @scalableinfo
p: +1 734 786 8423 x121
c: +1 734 612 4615