[Beowulf] Large Dell, odd IO delays

Wed Feb 14 14:26:57 PST 2018

Dell PowerEdge T630, PERC H730P, single 11 TB RAID5 array.  Xeon 
E5-2650 CPUs with 40 total threads.  512 GB RAM.  CentOS 6.9.  Kernel 
2.6.32-696.20.1.el6.x86_64.  (This machine is basically a small beowulf 
in a box.)

Sometimes for no reason that I can discern an IO operation on this 
machine will stall.  Things that should take seconds will run for 
minutes, or at least until I get tired of waiting and kill them.  Here 
is today's example:

   gunzip -c largeFile.gz > largeFile

which produced a 24 GB file.  One job has been running "nice" on 40 
threads (which is all of them) for a few hours, using only 30 GB of RAM.  
If no other CPU-intensive jobs start, "top" shows it at 4700-3800.  That 
job is slowly reading largeFile sequentially.

About two hours after largeFile was created this was run:

   wc -l largeFile

and it just sat there for 10 minutes.  top showed 100% CPU for the "wc" 
process.  There was nothing else using a significant amount of CPU time, 
just the one big job and "wc".  Killed the wc process and instead did:

   dd if=largeFile bs=8192 | wc -l

and it completed in about 20 seconds.  After that

   wc -l largeFile

also completed, and in only 6.5s.

As far as I can tell largeFile should have been in cache the whole time.  
Nothing big enough to force it out ran between when it was created and 
when the wc started.  "iostat 1" shows negligible disk activity, just 
the occasional reads and writes from the long running job, which works 
by sucking in a chunk of the file, calculating for a while, then 
emitting a chunk of results to an output file (which is only 320 MB).  
Using "dd" somehow kicked the system out of this state, forcing 
largeFile back into cache if it wasn't already there.
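
To put numbers on the two read paths next time this happens, something 
like the following could be used.  It is only a sketch: the helper names 
(timed, count_direct, count_via_dd) are made up for illustration, and it 
reports wall-clock seconds only, so it will not by itself say whether 
the time went to user or system CPU.

```shell
# timed LABEL CMD...: run CMD, then report elapsed wall-clock seconds
# on stderr so the command's own output on stdout stays clean.
# (Hypothetical helper, not part of any standard tool.)
timed() {
    label=$1; shift
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "$label: $((end - start))s" >&2
    return $status
}

# Count lines by reading the file directly, like "wc -l largeFile".
count_direct() {
    wc -l < "$1" | tr -d ' '
}

# Count lines by feeding the file through dd first, like
# "dd if=largeFile bs=8192 | wc -l".
count_via_dd() {
    dd if="$1" bs=8192 2>/dev/null | wc -l | tr -d ' '
}

# Example use (run by hand on the file in question):
#   timed direct count_direct largeFile
#   timed via-dd count_via_dd largeFile
```

Running the direct read first and the dd read second, in the same order 
as above, would show whether the stall is reproducible or only happens 
on the first access.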

There are no warnings or errors in dmesg or /var/log/messages.

Checked the console yesterday and there are no error messages on the 
console display.

Smartctl status from the disks (SAS), as of the last time it was checked:
trombone   Mon Feb 12 10:20:22 PST 2018
   SMART status:           P    P    P    P
   Defect list:            0    1    0    2
   Non-medium errors:      1    7   22    3
   Corrected write:        6    1    1    0
   Corrected read:         0    0    0    0
   Uncorrected write:      0    0    0    0
   Uncorrected read:       0    0    0    0
   Age:                    16630 16630 16630 16630

and those values are unchanged after this event. (Another PowerEdge T630 
with SAS disks also has the occasional non-medium error and corrected 
write.)

A script which dumps pretty much all of the information available from 
the RAID using "megacli" is run periodically.  The only differences 
between a run after the "dd" and one from weeks ago are the time stamps, 
disk temperatures and battery charge levels (by a few percent).

We have three systems that are fairly similar to this one, but only this 
one has this odd behavior.  These IO stalls have been seen on it before.  
There was a similar issue a couple of days ago, so the system was 
rebooted then.  Apparently that made no difference.

Examined every value in /proc/sys/vm, and this and another system differ 
in only the max_map_count value.  The problem system has 262144 and the 
other has 65530.  Doesn't seem likely to be the issue.
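
For anyone wanting to repeat that comparison, here is a minimal sketch 
of one way to do it.  The function name dump_vm and the output file 
names are made up for illustration; the directory defaults to 
/proc/sys/vm but can be overridden.

```shell
# dump_vm [DIR]: print "name=value" for every readable regular file in
# DIR (default /proc/sys/vm), sorted so that the output from two hosts
# can be compared with diff.  (Hypothetical helper.)
dump_vm() {
    dir=${1:-/proc/sys/vm}
    for f in "$dir"/*; do
        [ -f "$f" ] && [ -r "$f" ] || continue
        # Some entries are write-only or error on read; skip the noise.
        # Multi-line values are flattened onto one line.
        value=$(cat "$f" 2>/dev/null | tr '\n' ' ' | sed 's/ *$//')
        printf '%s=%s\n' "$(basename "$f")" "$value"
    done | sort
}

# Example use, once on each machine, then compare the two files:
#   dump_vm > vm.$(hostname).txt
#   diff vm.hostA.txt vm.hostB.txt    # hostA/hostB are placeholders
```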

Checked the hugepage settings and found a difference there.  On the two 
systems that don't do this, 
/sys/kernel/mm/redhat_transparent_hugepage/defrag contains:

always madvise [never]

whereas the system with the issue has:

[always] madvise never

I did not see any other jobs using up CPU time when this was going on, 
but perhaps the defrag processes sometimes run in a mode where they 
don't rise much in "top" yet bog down the IO.  In any case, I set the 
problem system to match the other two.
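
A sketch of how the setting can be read and changed, for reference.  The 
path is the RHEL/CentOS 6 one quoted above (mainline kernels use 
/sys/kernel/mm/transparent_hugepage/defrag instead), and thp_active is 
a made-up helper name.  A value written this way does not persist 
across a reboot.

```shell
# RHEL/CentOS 6 location, as quoted above.
THP_DEFRAG=/sys/kernel/mm/redhat_transparent_hugepage/defrag

# thp_active: read a line such as "[always] madvise never" on stdin
# and print the bracketed (currently active) choice.
# (Hypothetical helper.)
thp_active() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Show the current setting, if the file exists on this host.
if [ -r "$THP_DEFRAG" ]; then
    echo "defrag: $(thp_active < "$THP_DEFRAG")"
fi

# To change it until the next reboot (as root):
#   echo never > "$THP_DEFRAG"
```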

Does this sound like a reasonable cause for the slowdown, or might there 
be something else going on?  (And if so, what?)

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech