[Beowulf] Large Dell, odd IO delays
mathog at caltech.edu
Wed Feb 14 14:26:57 PST 2018
Dell PowerEdge T630, PERC H730P, single 11Tb RAID5 array. Xeon
E5-2650 CPUs with 40 total threads. 512Gb RAM. CentOS 6.9. Kernel
2.6.32-696.20.1.el6.x86_64. (This machine is basically a small beowulf
in a box.)
Sometimes for no reason that I can discern an IO operation on this
machine will stall. Things that should take seconds will run for
minutes, or at least until I get tired of waiting and kill them. Here
is today's example:
gunzip -c largeFile.gz > largeFile
producing a 24 Gb file. One job has been running "nice" on 40 threads
(which is all of them) for a few hours, using only 30Gb of RAM. If no
other CPU intensive jobs start, "top" shows it at 4700-3800. That job is
slowly reading largeFile sequentially.
About two hours after largeFile was created this was run:
wc -l largeFile
and it just sat there for 10 minutes. top showed 100% CPU for the "wc"
process. There was nothing else using a significant amount of CPU time,
just the one big job and "wc". Killed the wc process and instead did:
dd if=largeFile bs=8192 | wc -l
and it completed in about 20 seconds. After that
wc -l largeFile
also completed, and in only 6.5s.
As far as I can tell largeFile should have been in cache the whole time.
Nothing big enough to force it out ran between when it was created and
when the wc started. "iostat 1" shows negligible disk activity, just
the occasional reads and writes from the long running job, which works
by sucking in a chunk of the file, calculating for a while, then
emitting a chunk of results to an output file (which is only 320Mb).
Using "dd" somehow kicked the system out of this state, forcing
largeFile back into cache if it wasn't already there.
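To compare the two read paths the next time a stall shows up, something like the following could be used. It is only a sketch: time_read and cache_snapshot are hypothetical helper names, not anything from the commands above, and the snapshot assumes a Linux /proc/meminfo.

```shell
# Hypothetical helper: run a command, discard its output, and print the
# elapsed wall-clock time in whole seconds.
time_read() {
    start=$(date +%s)
    "$@" > /dev/null
    end=$(date +%s)
    echo $(( end - start ))
}

# Hypothetical helper: show the page cache counters before and after a
# read, to see whether the cache grew or writeback was pending.
cache_snapshot() {
    grep -E '^(Cached|Dirty|Writeback):' /proc/meminfo
}
```

For example, "cache_snapshot; time_read wc -l largeFile; cache_snapshot", and then the same with "time_read sh -c 'dd if=largeFile bs=8192 | wc -l'". If Cached barely changes while one path is slow, the file was probably in cache all along and the stall is somewhere else.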
There are no warnings or errors in dmesg or /var/log/messages.
Checked the console yesterday and there are no error messages on the
Smartctl status from the disks (SAS), the last time it was checked, was:
trombone Mon Feb 12 10:20:22 PST 2018
SMART status:            P      P      P      P
Defect list:             0      1      0      2
Non-medium errors:       1      7     22      3
Corrected write:         6      1      1      0
Corrected read:          0      0      0      0
Uncorrected write:       0      0      0      0
Uncorrected read:        0      0      0      0
Age:                 16630  16630  16630  16630
and those values are unchanged after this event. (Another PowerEdge T630
with SAS disks also has the occasional non-medium error and corrected
A script which dumps pretty much all of the information available from
the RAID using "megacli" is run periodically. The only differences
between a run after the "dd" and one from weeks ago are the time stamps,
disk temperatures and battery charge levels (by a few percent).
We have three systems that are fairly similar to this one, but only this
one has this odd behavior. These IO stalls have been seen on it before.
There was a similar issue a couple of days ago, so the system was
rebooted then. Apparently that made no difference.
Examined every value in /proc/sys/vm, and this and another system differ
in only the max_map_count value. The problem system has 262144 and the
other has 65530. Doesn't seem likely to be the issue.
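A comparison like this can be made repeatable by dumping the tunables on each host and diffing the results. The sketch below is one way to do it; dump_tunables is a hypothetical name, and it only looks at regular files directly inside the directory it is given.

```shell
# Hypothetical helper: print "name = value" for every readable file
# directly inside a directory, sorted by name, so that dumps from two
# hosts can be compared with diff. Unreadable entries print an empty value.
dump_tunables() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] && [ -r "$f" ] || continue
        value=$(cat "$f" 2>/dev/null | tr '\n' ' ' | sed 's/ *$//')
        printf '%s = %s\n' "$(basename "$f")" "$value"
    done | sort
}
```

Run "dump_tunables /proc/sys/vm > vm.$(hostname).txt" on each machine, copy the files to one place, and diff them. Any line that shows up in the diff is a setting that differs between the hosts.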
Checked the hugepage settings and found a difference there. The two
systems that don't do this have
always madvise [never]
whereas the system with the issue has:
[always] madvise never
I did not see any other jobs using up CPU time when this was going on,
but perhaps the defrag processes sometimes run in a mode where they
don't rise much in "top" yet bog down the IO. In any case, set the
problem system to match the other two.
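For checking this on several machines, the bracketed (active) choice can be pulled out of those lines with a short script. This is a sketch under assumptions: active_choice and report_thp are hypothetical names, and the sysfs paths are the ones I would expect (the redhat_ prefixed directory on RHEL/CentOS 6 kernels, the unprefixed one on mainline kernels), not something confirmed above.

```shell
# Hypothetical helper: print the active (bracketed) choice from a sysfs
# selector line such as "[always] madvise never".
active_choice() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Hypothetical helper: report the active transparent hugepage settings.
# Both directory names are tried; paths that do not exist are skipped.
report_thp() {
    for d in /sys/kernel/mm/redhat_transparent_hugepage \
             /sys/kernel/mm/transparent_hugepage; do
        for f in enabled defrag; do
            if [ -r "$d/$f" ]; then
                echo "$d/$f: $(active_choice "$(cat "$d/$f")")"
            fi
        done
    done
    return 0
}
```

Calling report_thp on each host gives one line per setting. Changing a value would presumably be done by writing the chosen word to the same file as root (for example "echo never" redirected into the enabled file), and that does not persist across a reboot unless it is also put in a startup script.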
Does this sound like a reasonable cause for the slowdown, or might there
be something else going on? (And if so, what?)
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech