[Beowulf] big read triggers migration and slow memory IO?

Wed Jul 8 14:26:37 PDT 2015

This big Dell (PowerEdge T620/03GCP, 48 CPUs, >500 GB RAM) keeps throwing 
me curve balls.

On the RAID file system there are a bunch of files of about 
17453170224 bytes each (they hold slightly different numbers of 
fixed-length records).  At one level these bytes move around very 
quickly; this takes only 3 seconds:

dd if=KTEMP1 of=/dev/null bs=8192

(5.8 GB/s), which means the file must already be in cache.  Nothing else 
is running on this system.  However, when a program that uses this code 
(where len_file is again 17453170224)

    buffer = malloc(len_file);
    (void) posix_fadvise(fileno(fin), 0, 0, POSIX_FADV_SEQUENTIAL);
    (void) posix_madvise(buffer, len_file, POSIX_MADV_SEQUENTIAL);
    rlen = fread(buffer, 1, len_file, fin);

is run, the fread() takes at least 30 seconds, sometimes longer, to 
complete.  While it runs, "top" shows this:

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 22501 mathog    20   0 16.3g  13g  520 R 71.3  2.6   0:44.86 binorder
    99 root      RT   0     0    0    0 S 16.6  0.0   0:08.75 migration/24
     3 root      RT   0     0    0    0 S 12.3  0.0   0:24.91 migration/0

What happens is that RES quickly jumps up to about half of VIRT and then 
the two migration processes start up, at which point the read crawls.
The numbers after "migration" vary.  dd doesn't run long enough to
trigger whatever this migration business is.  If my test program
is run a couple of times in a row, it sometimes completes the read
in about 8 seconds.  When that happens the migration processes do not 
appear.
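
For anyone who wants to reproduce the timing, here is a minimal sketch 
of a stand-alone version of the read, assuming Linux and glibc.  The 
function name time_whole_file_read is made up for illustration and is 
not taken from binorder.  It gets the file size from fstat() instead of 
hard-coding len_file, and it leaves out the posix_madvise() call 
because, as far as I know, a malloc'ed buffer is generally not 
page-aligned, which the underlying madvise() requires.

```c
#define _POSIX_C_SOURCE 200809L  /* posix_fadvise(), clock_gettime(), fileno() */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical helper: read the whole file at `path` with one fread()
 * and return the wall-clock seconds spent in that call.  Returns -1.0
 * if the file cannot be opened, is empty, or the buffer cannot be
 * allocated.  On success *bytes_read holds the number of bytes read. */
double time_whole_file_read(const char *path, size_t *bytes_read)
{
    FILE *fin = fopen(path, "rb");
    if (fin == NULL)
        return -1.0;

    struct stat st;
    if (fstat(fileno(fin), &st) != 0 || st.st_size <= 0) {
        fclose(fin);
        return -1.0;
    }
    size_t len_file = (size_t)st.st_size;

    char *buffer = malloc(len_file);
    if (buffer == NULL) {
        fclose(fin);
        return -1.0;
    }

    /* Same sequential-access hint as the fragment above; the result
     * is ignored, as it is there. */
    (void)posix_fadvise(fileno(fin), 0, 0, POSIX_FADV_SEQUENTIAL);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    *bytes_read = fread(buffer, 1, len_file, fin);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(buffer);
    fclose(fin);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling time_whole_file_read("KTEMP1", &n) a few times in a row and 
printing the result should show whether a given run was one of the slow 
or one of the fast ones.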

Through all of this, iostat and iotop do not show any IO at all, 
presumably because all of the data is moving from the file cache into 
process memory, with none of it being read from the RAID itself.

Anyway, using 30s as a nice round number, that works out to about 
582 MB/s to move this data from one section of memory to another, which 
is pretty poor since the STREAM benchmark shows:

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5737.4     0.027951     0.027887     0.028254
Scale:           6273.8     0.025557     0.025503     0.025686
Add:             7632.6     0.031513     0.031444     0.031657
Triad:           8948.2     0.026896     0.026821     0.027126

all of which are roughly 10x to 15x faster.  Note that the dd rate
(about 5.8 GB/s) is consistent with STREAM's Copy result.
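
The rates quoted above are just the file size divided by the elapsed 
time.  A small sketch of that arithmetic, with 1 MB taken as 10^6 bytes 
(the function name mb_per_second is only illustrative):

```c
/* Throughput in MB/s (1 MB = 10^6 bytes) for `bytes` moved in
 * `seconds` of wall-clock time. */
double mb_per_second(unsigned long long bytes, double seconds)
{
    return (double)bytes / seconds / 1.0e6;
}
```

With the numbers from this post, mb_per_second(17453170224ULL, 30.0) 
comes to about 581.8 for the slow fread() and 
mb_per_second(17453170224ULL, 3.0) to about 5817.7 for dd.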

Can anybody shed some light on this behavior?  In particular, why does 
the OS feel the need to "migrate" something when one of these huge reads 
is running?  Mostly I want to know how to make it behave, leaving the 
process/memory attached to one CPU (but not a particular CPU, just 
wherever it happens to put it) and not shuffle the data through what 
seems to be a 1/10X speed memory pathway.  Also, is there really a 1/10X 
memory speed pathway on this big box, or is it just that the migration, 
whatever that is doing, has a lot of overhead?
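
One thing that could be tried for the "one CPU, but not a particular 
CPU" part is to have the program pin itself to whatever CPU it is on 
when it starts.  This is only a sketch under assumptions: it assumes 
Linux with glibc, the name pin_to_current_cpu is made up, and it 
assumes the migration/N threads reflect the scheduler moving the task 
between CPUs, which nothing above establishes.  It may or may not 
change the read time.

```c
#define _GNU_SOURCE  /* sched_getcpu(), sched_setaffinity(), CPU_* macros */
#include <sched.h>

/* Hypothetical helper: restrict the calling thread to the CPU it is
 * running on right now.  Returns that CPU number, or -1 on failure. */
int pin_to_current_cpu(void)
{
    int cpu = sched_getcpu();
    if (cpu < 0)
        return -1;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    /* A pid of 0 means "the calling thread". */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return cpu;
}
```

The call would go before the malloc() and fread(), so that the buffer 
is touched only after the affinity is fixed.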

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech