[Beowulf] Slow RAID reads, no errors logged, why?

Skylar Thompson skylar.thompson at gmail.com
Mon Mar 19 17:19:52 PDT 2018


Could it be a patrol read, possibly hitting a marginal disk? We've run into
this on some of our Dell systems, and exporting the RAID HBA logs reveals
what's going on. You can see those with "omconfig storage controller
controller=n action=exportlog" (exports logs in /var/log/lsi_mmdd.log) or
an equivalent MegaCLI command that I can't remember right now. We had a
rash of these problems, along with uncaught media errors (probably a
combination disk/firmware bug), so we ended up sending these logs to
Splunk, but if it's a one-off thing it's pretty easy to spot visually too.
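
For the one-off case, spotting these visually amounts to a keyword search
over the exported log. A minimal sketch in Python, assuming the export is
plain text with one event per line. The keyword list and the names
`find_suspect_lines` and `scan_file` are illustrative guesses, not the
controller's documented message format, so adjust them to what your
firmware actually logs:

```python
import re

# Assumed keywords (hypothetical): tune to your controller's real messages.
SUSPECT = re.compile(
    r"patrol read|media error|medium error|unrecoverable", re.IGNORECASE
)

def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines matching a keyword."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if SUSPECT.search(line)
    ]

def scan_file(path):
    """Scan an exported controller log file and return the matching lines."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(path, errors="replace") as handle:
        return find_suspect_lines(handle)
```

Calling `scan_file()` on the exported file and printing the returned pairs
gives a short list of places to start reading.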

Skylar

On Mon, Mar 19, 2018 at 1:58 PM, David Mathog <mathog at caltech.edu> wrote:

> On one of our CentOS 6.9 systems with a PERC H730 controller I just noticed
> that file system reads are quite slow.  Like 30Mb/s slow.  Anybody care to
> hazard a guess what might be causing this situation?  We have another quite
> similar machine which is fast (A), compared to this (B) which is slow:
>             A             B
> RAM         512           512           GB
> CPUs        48            56            (via /proc/cpuinfo, actually this is threads)
> Adapter     H710P         H730
> RAID Level  *             *             Primary-5, Secondary-0, RAID Level Qualifier-3
> Size        7.275         9.093         TB
> state       *             *             Optimal
> Drives      5             6
> read rate   540           30            Mb/s (dd if=largefile bs=8192 of=/dev/null& ; iotop)
> sata disk   ST2000NM0033
> sas disk                  ST2000NM0023
> patrol      No            No            (megacli shows patrol read not going now)
>
> ulimit -a on both is:
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 2067196
> max locked memory       (kbytes, -l) 64
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 60000
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 10240
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 4096
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> Nothing in the SMART values indicating a read problem, although on "B"
> one disk is slowly accumulating events in the rereads/rewrites column of
> its write row (it has 2346, accumulated at about 10 per week).  The value
> is 0 in that column for the read rows.  For "B" the smartctl output columns
> are:
>
>        Errors Corrected by           Total   Correction     Gigabytes    Total
>            ECC          rereads/    errors   algorithm      processed    uncorrected
>        fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
>
> read: 934353848  0 0 934353848  0 48544.026 0
> read: 2017672022 0 0 2017672022 0 48574.489 0
> read: 2605398517 3 0 2605398520 3 48516.951 0
> read: 3237457411 1 0 3237457412 1 48501.302 0
> read: 2028103953 0 0 2028103953 0 14438.132 0
> read: 197018276  0 0 197018276  0 48640.023 0
>
> write: 0 0 0 0 0 26394.472 0
> write: 0 0 2346 2346 2346 26541.534 0
> write: 0 0 0 0 0 27549.205 0
> write: 0 0 0 0 0 25779.557 0
> write: 0 0 0 0 0 11266.293 0
> write: 0 0 0 0 0 26465.227 0
>
> verify: 341863005  0 0 341863005  0 241374.368 0
> verify: 866033815  0 0 866033815  0 223849.660 0
> verify: 2925377128 0 0 2925377128 0 221697.809 0
> verify: 1911833396 6 0 1911833402 6 228054.383 0
> verify: 192670736  0 0 192670736  0 66322.573 0
> verify: 1181681503 0 0 1181681503 0 222556.693 0
>
> If the process doing the IO is root it doesn't go any faster.
>
> Oddly if on "B" a second dd process is started on another file it ALSO
> reads at 30Mb/s.  So the disk system then does a total of 60Mb/s, but only
> 30Mb/s per process.  Added a 3rd and a 4th process doing the same.  At the
> 4th it seemed to hit some sort of limit, with each process now consistently
> less than 30Mb/s and the total at maybe 80Mb/s.  Hard to say what the
> exact total was as it was jumping around like crazy.  On "A" 2 processes
> each got 270Mb/s, and 3 got 180Mb/s each.  Didn't try 4.
>
> The only oddness of late on "B" is that a few days ago it loaded too many
> memory hungry processes so the OS killed some.  I have had that happen
> before on other systems without them doing anything odd afterwards.
>
> Any ideas what this slowdown might be?
>
> Thanks,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
>
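
The multi-reader test described in the quoted message can be repeated in a
scripted form. This is a rough sketch, not a replacement for dd and iotop:
it starts one sequential reader per file using the same 8192-byte block
size and reports each reader's rate and the total. The names `read_file`,
`run_readers` and `rates_mb_per_s` are made up for this example, and the
rates are in units of 10^6 bytes per second. It assumes the test files are
not already in the page cache (for example, files larger than RAM);
otherwise the numbers reflect memory speed, not the array.

```python
import threading
import time

BLOCK_SIZE = 8192  # same block size as the dd test in the quoted message

def read_file(path, results, index):
    """Read one file sequentially and record (bytes_read, seconds)."""
    total = 0
    start = time.monotonic()
    # buffering=0 so every read() maps to one BLOCK_SIZE request, like dd.
    with open(path, "rb", buffering=0) as handle:
        while True:
            chunk = handle.read(BLOCK_SIZE)
            if not chunk:
                break
            total += len(chunk)
    results[index] = (total, time.monotonic() - start)

def run_readers(paths):
    """Run one reader thread per path; return a list of (bytes, seconds)."""
    results = [None] * len(paths)
    threads = [
        threading.Thread(target=read_file, args=(path, results, i))
        for i, path in enumerate(paths)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results

def rates_mb_per_s(results):
    """Convert (bytes, seconds) pairs into rates in 10^6 bytes per second."""
    return [
        (nbytes / seconds / 1e6) if seconds > 0 else 0.0
        for nbytes, seconds in results
    ]
```

Running `run_readers()` with one, two, three and four file paths and
summing the output of `rates_mb_per_s()` shows whether the total stays
roughly flat as readers are added (as reported for "A") or grows with each
reader (as reported for "B"). Python adds per-call overhead, so at high
rates dd remains the better reference.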