[Beowulf] big read triggers migration and slow memory IO?

mathog mathog at caltech.edu
Wed Jul 8 16:45:44 PDT 2015

On 08-Jul-2015 15:43, Jonathan Barber wrote:
> I think your process is being moved between NUMA nodes and you're
> losing locality to the data. Try confining the process and data to
> the same node with the numactl command.

That's part of it.  I ran a bunch of commands like this:

  taskset -c 20 dd if=KTEMP1 of=KTEMP0 bs=120000 count=34000
  taskset -c 20 testprogram -in KTEMP0

with these results:

  count   size GB  time (s)  size (bytes)
  34000     ~4       ~3       4080000000
  68000     ~8       ~7       8160000000
  70000     ~8       ~3       8400000000
 100000    ~12       ~3      12000000000
 120000    ~14       ~7      14400000000
 130000    ~16       ~9      15600000000
 140000    ~17     >120      16800000000  (2^34 is 17179869184)

(I didn't wait for the 140000 case to complete; it could have gone on 
for another 5 minutes.) The variation between ~3 s and ~9 s isn't 
significant or repeatable; I think it reflects the flush process getting 
in the way of the second command.

If the test was changed so that "-c 1" was used for the first command
and "-c 20" for the second, the 130000-record case took 23 s.  So there 
is definitely an advantage in having the file cache pages somehow 
associated with the CPU where they will be needed next.

Now the mystery is what the problem is for an fread() into a buffer 
close to, but just below, 2^34 bytes.

Here is the output of ulimit -a:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 4134441
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Nothing there screams 2^34 to me. Perhaps something crucial for 
performance needs to be locked into memory and grows beyond 64 KB at 
that buffer size, and that indirectly leads to the performance problem.

As an aside, when the test program is locked to a CPU and a file which 
is "too big" is read, there is no migration/20 process using CPU time.  
Instead, an events/20 thread starts using up a significant amount of 
CPU time (varying wildly around 30%). ksoftirqd/20 also comes and 
goes, so that could also be a factor.

> Assuming your machine is NUMA (hwloc-ls will show you this) in my
> experience some of the E5's have really poor performance for inter-NUMA
> communication.

I don't have anything called hwloc-ls on this system.  What package 
provides it?  This is a CentOS system.


David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
