[Beowulf] NUMA zone_reclaim_mode considered harmful?

Stuart Barkley stuartb at 4gh.net
Mon Sep 22 12:07:45 PDT 2014


Just to add a few more details to Chris' post with some references
which helped us...

We were seeing severe performance issues on our diskless systems with
an application doing mmap reads of large files on GPFS.  The I/O
pattern was sequential reads of a large file.  The file was 5-10 times
the size of RAM on the nodes.

We tracked this down to 'pgscand/s' in the 'sar -B' output going
outrageous (13M pages scanned per second to try to find pages to
free).
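
For anyone without sar handy, here is a minimal Python sketch of the same
measurement.  It assumes that the direct-reclaim scan counters appear in
/proc/vmstat under names starting with "pgscan_direct" (the exact names
vary by kernel version), and that "pgscan_direct_throttle" counts
throttling events rather than pages, so it is skipped.  The function
names are made up for illustration.

```python
import time


def direct_scan_total(vmstat_text):
    """Sum the pgscan_direct* page counters in /proc/vmstat-style text."""
    total = 0
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, value = parts
        # Assumption: pgscan_direct_throttle is an event count, not pages.
        if name.startswith("pgscan_direct") and not name.endswith("_throttle"):
            total += int(value)
    return total


def direct_scan_rate(interval=1.0, path="/proc/vmstat"):
    """Return pages scanned per second by direct reclaim over `interval`."""
    with open(path) as f:
        before = direct_scan_total(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = direct_scan_total(f.read())
    return (after - before) / interval
```

Calling direct_scan_rate(5.0) on an affected node while the application
is running should give a number comparable to what 'sar -B' reports as
pgscand/s.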

Some googling led us to:

    <http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases>

Although it describes a fairly different problem, this was just the
information we needed.

We found that /proc/sys/vm/zone_reclaim_mode was being set to 1 on our
systems despite various documentation indicating that the default
value should be 0.
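
A minimal Python sketch for checking the current value on a node.  The
bit meanings (1 = zone reclaim on, 2 = write out dirty pages, 4 = swap
pages) are taken from the kernel's sysctl/vm documentation; the function
names are made up for illustration.

```python
def parse_zone_reclaim_mode(text):
    """Decode the bitmask read from /proc/sys/vm/zone_reclaim_mode."""
    mode = int(text.strip())
    return {
        "value": mode,
        "reclaim": bool(mode & 1),      # zone reclaim enabled
        "write_dirty": bool(mode & 2),  # reclaim may write out dirty pages
        "swap": bool(mode & 4),         # reclaim may swap pages
    }


def read_zone_reclaim_mode(path="/proc/sys/vm/zone_reclaim_mode"):
    """Read and decode the setting on the local node."""
    with open(path) as f:
        return parse_zone_reclaim_mode(f.read())
```

On our systems read_zone_reclaim_mode()["value"] would have returned 1.
Changing it back needs root, for example with
'sysctl -w vm.zone_reclaim_mode=0'.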

As Chris noted, the Linux kernel has recently accepted a patch claiming
to set zone_reclaim_mode to 0 (although the diff does not appear to do
it very directly).

It looks like setting zone_reclaim_mode to 0 was proposed at least as
early as 2009.  I'm unclear what happened with this patch:

    <http://osdir.com/ml/linux-kernel/2009-05/msg05670.html>

There is something from 2010 called "zone_reclaim_mode is the essence
of all evil":

    <http://www.poempelfox.de/blog/2010/03/19/>

This was very useful in pointing out Nehalem processors as being
particularly susceptible and suggesting 'numactl --hardware' to check
for the node distance.  A distance greater than 20 is the magic
number.
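
The same check can be sketched in Python without numactl.  This assumes
the node distances are exposed in /sys/devices/system/node/nodeN/distance
as one whitespace-separated row per node, with columns in the order of
the online nodes.  The threshold of 20 is the number from the blog post
above, not something verified against any particular kernel source, and
the function names are made up for illustration.

```python
import glob
import os
import re


def far_nodes(rows, threshold=20):
    """Return (from_node, to_node, distance) for distances > threshold.

    `rows` maps a node id to its list of distances.
    """
    ids = sorted(rows)
    found = []
    for node in ids:
        for col, dist in enumerate(rows[node]):
            # Assumption: column order matches the sorted node ids.
            target = ids[col] if col < len(ids) else col
            if dist > threshold:
                found.append((node, target, dist))
    return found


def read_node_distances(base="/sys/devices/system/node"):
    """Read the distance row of every NUMA node listed under `base`."""
    rows = {}
    for path in glob.glob(os.path.join(base, "node[0-9]*", "distance")):
        match = re.search(r"node(\d+)$", os.path.dirname(path))
        with open(path) as f:
            rows[int(match.group(1))] = [int(x) for x in f.read().split()]
    return rows
```

If far_nodes(read_node_distances()) returns anything, the node is a
candidate for having zone_reclaim_mode turned on at boot.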

Stuart
-- 
I've never been lost; I was once bewildered for three days, but never lost!
                                        --  Daniel Boone

