[Beowulf] Memory stress testing tools.

Thu Dec 9 18:14:41 PST 2010

Prentice,

You only asked for memory testing programs, but I'm going to go a bit
further, to make sure some background issues are covered, and to give you
some ideas you might not yet have.  Some of this is based on a lot of
experience with Dell servers in HPC.

Some of my background thoughts on dealing with single-bit errors (SBEs):

1) Complete and detailed historical records are important for correctly and
efficiently resolving these types of errors, especially on larger clusters.
Otherwise it's too easy to get confused about what happened when, and come
to incorrect conclusions about problems and solutions.  Treat it like a lab
experiment -- keep a log book or equivalent, test your hypotheses against
the data, and think broadly about what alternative hypotheses may exist.

2) The resolution process will be iterative, with physical manipulations
(e.g. moving DIMMs among slots) alternating with monitoring for SBEs and
optionally running stress applications to attempt to trigger SBEs (a
"reproducer" of the SBEs).

3) For efficient resolution, you want a quick, reliable reproducer,
something that will trigger the SBEs quickly.

4) I've seen no evidence that SBEs materially affect performance or
correctness on a server, so my practice has often been to leave affected
servers in production as much as possible, taking them out of production
(after draining jobs) only briefly to move DIMMs, replace DIMMs, etc.
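
On point (1), a log book can be as simple as an append-only CSV file.  The
following is a minimal Python sketch under stated assumptions, not a
recommendation of any particular tool: the field names, the node name, and
the DIMM serial numbers are all hypothetical and chosen only for
illustration.

```python
import csv
import io
from datetime import datetime

# Hypothetical column layout for the log book; adjust to taste.
FIELDS = ["timestamp", "node", "event", "slot", "dimm_serial", "note"]


def append_event(stream, node, event, slot="", dimm_serial="", note="", when=None):
    """Append one row to an open text stream and return the row as a dict."""
    row = {
        "timestamp": (when or datetime.now()).isoformat(timespec="seconds"),
        "node": node,
        "event": event,          # e.g. "sbe", "swap", "reseat", "replace"
        "slot": slot,            # slot the DIMM is in when the event happens
        "dimm_serial": dimm_serial,
        "note": note,
    }
    csv.DictWriter(stream, fieldnames=FIELDS).writerow(row)
    return row


def history_for_dimm(log_text, dimm_serial):
    """Return every logged row for one physical DIMM, in the order written."""
    reader = csv.DictReader(io.StringIO(log_text), fieldnames=FIELDS)
    return [r for r in reader if r["dimm_serial"] == dimm_serial]
```

In real use the stream would be a file opened in append mode, e.g.
open("sbe-log.csv", "a", newline="").  Keying the history on the DIMM's
serial number rather than on the slot is what lets you see later whether an
error followed the DIMM or stayed with the slot.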

Regarding (4), if anyone here has measurements or a URL to a study saying in
what circumstances there's a significant material risk to performance or
correctness of calculation with SBE correction, I'd love to see that.  I'm
not saying that SBE correction is completely free performance-wise -- I bet
it takes a little time to do the correction, but I bet for normal SBE
correction rates, that time is (nearly) unmeasurable.

Also, over a few thousand server-years, I've never or almost never seen SBE
corrections morph into uncorrectable multi-bit errors.  When uncorrectable
errors have shown up (which itself has been rare in my experience, mostly in
a single situation where there was a server bug that got corrected), they've
shown up early on a server, not after a long period of only seeing SBEs.

Prentice, I believe you started this thread because you need something for
(3), is that right?  As David Mathog said, you already know what activity
most reliably triggers SBE corrections: Your users' code.  If I were in your
shoes, and I had time and/or were concerned about issue (4) above, I'd a)
identify which user, code, and specific runs trigger SBEs the most, then b)
if possible, work with that user to get a single-node version of a similar
run that you could use on a node outside production, to reproduce and
resolve SBEs.  I'd then monitor for SBEs in production, and when they occur,
drain jobs from those nodes, and take them out of production so I could use
that single-node user job to satisfy (2) and (3) above.

If I were in your shoes and were NOT concerned about (4), I'd simply drain the
running job, do a manipulation (2 above), and put the node back into
production, waiting for the SBE to recur if it is going to.  This is what
I've often done.

Or if you have a dev/test cluster, replace the entire production node with a
tested, known-good node from the dev/test cluster, then test/fix the SBE
server in the context of the dev/test cluster.  I've also often done this.
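
Whichever of these approaches you take, one open question is how long to
stress (or wait) before treating "no SBE seen" as meaningful.  As a rough
back-of-envelope sketch, and ONLY under the assumption that SBEs on a
marginal DIMM arrive at random with a constant average rate (an exponential
waiting-time model, which is my simplification and not something measured),
the time needed to see at least one SBE with a given probability is
-mean * ln(1 - probability).  The function name and the 95% default are
hypothetical.

```python
import math


def stress_hours(mean_hours_to_sbe, confidence=0.95):
    """Hours of stress needed to see at least one SBE with the given probability.

    Assumes exponentially distributed waiting times with the given mean,
    which is a modelling assumption, not a property of any real DIMM.
    """
    if mean_hours_to_sbe <= 0:
        raise ValueError("mean_hours_to_sbe must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -mean_hours_to_sbe * math.log(1.0 - confidence)
```

For example, if a reproducer took about 21 hours on average to trigger an
SBE, stress_hours(21) gives roughly 63 hours.  A single observed trigger
time is a very weak estimate of the mean, so treat the result as an order of
magnitude, not a threshold.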

My experience has been that long runs of single-node HPL were the best SBE
trigger I ever found.  Dell's mpmemory did not do as well.  I believe
memtest86{,+} also didn't find problems that HPL found, though I didn't test
memtest86{,+} as much.  It also was not immediately obvious how to gather
the memory test results from mpmemory and memtest86{,+}, though it can
probably be done, perhaps easily, with a bit of R&D.

But since you've found that HPL does not trigger SBEs as much as your user's
code, I think you have a very good pointer that you should do stress tests
with your user's code if at all possible.

If you can share what the stressful app is, and any of the relevant run
parameters, that would probably be interesting to folks on this list.

In my experience, usually SBEs are resolved by reseating or replacing the
affected DIMM.  However it can also be an issue on the motherboard (sockets
or traces or something else), or possibly the CPU (because Intel and AMD now
both have the memory controllers on-die), or possibly a BIOS issue (if a CPU
or memory related parameter isn't set quite optimally by the BIOS you're
running; BIOSes may set hardware parameters without your being aware of them
or able to tune them yourself).

Best practice may be:

A) swap the DIMM where the SBE occurred with a neighbor that underwent
similar stress but did not show any SBEs.  Keep a permanent record of which
DIMMs you swapped and when, as well as all error messages and their timing.
B) re-stress either in production (if you believe my tentative assertion (in
4 above) that SBE corrections do not materially affect performance or
correctness), or using your reliable reproducer for an amount of time that
you know should usually re-trigger the SBE if it is going to recur.
C) assess the results and respond accordingly:
  1) if the SBE messages do not recur, then either reseating resolved it, or
it's so marginal that you will need to wait longer for it to show up; may as
well leave it in production in this case
  2) if the SBE messages follow the DIMM when you swapped it with its
neighbor, then it's very very likely the DIMM (especially if the SBE
occurred quickly upon stressing it, both before and after the DIMM move).
Present this evidence to Dell Support and ask them to send you a replacement
DIMM.  KEEP IN MIND that although the replacement DIMM will usually resolve
the issue, it has never before been stressed in your setup, and it's
possible for your stress patterns to elicit SBEs even in this replacement
DIMM.  So if the error recurs in that DIMM slot, it's possible that the
replacement DIMM also needs to be replaced.  You again need to do a neighbor
swap to check whether it really is the replacement DIMM.
  3) If the SBE stays with the slot after you did the neighbor swap, take
this evidence to Dell Support, and see what they say.  I would guess they'd
have the motherboard and/or CPU swapped.  Alternatively, you may wish (use
your best judgment) to gather more data by CAREFULLY! swapping CPUs 1 and 2
in that server and see whether the SBEs follow the CPU or stay with the
slot.  Just as with DIMMs, it's not unheard of for replacement motherboards
and CPUs to also have issues, so don't assume they're perfect -- usually the
suitable replacement will resolve the issue fully, but you won't know for
sure until you've stressed the system.
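
The decision in step C can be written down as a small function.  This is
only a sketch of the reasoning above, assuming you know which slot showed
SBEs before the swap, which neighbor slot the suspect DIMM was moved into,
and which slots logged SBEs during the re-stress.  The function name, the
return labels, and the "ambiguous" case (both slots showing SBEs, which the
steps above do not cover) are my own additions for illustration.

```python
def classify_swap(original_slot, neighbor_slot, sbe_slots_after):
    """Interpret a neighbor swap.

    original_slot   -- slot that logged SBEs before the swap
    neighbor_slot   -- slot the suspect DIMM was moved into
    sbe_slots_after -- slots that logged SBEs during the re-stress
    """
    after = set(sbe_slots_after)
    in_original = original_slot in after
    in_neighbor = neighbor_slot in after

    if not in_original and not in_neighbor:
        # C1: reseating fixed it, or it is too marginal to show up yet.
        return "no_recurrence"
    if in_neighbor and not in_original:
        # C2: the errors followed the DIMM to its new slot.
        return "dimm"
    if in_original and not in_neighbor:
        # C3: the errors stayed with the slot (motherboard, CPU, or BIOS).
        return "slot"
    # Both slots show SBEs: not covered by the steps above, needs more data.
    return "ambiguous"
```

For example, if C3 showed SBEs, its DIMM was swapped with the one in C2, and
the re-stress then logged SBEs only in C2, classify_swap("C3", "C2", ["C2"])
returns "dimm".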

What model of PowerEdge are these servers?

PowerEdge systems keep a history of the messages that get printed on the LCD
in the System Event Log (SEL), in older days also called the ESM log
(embedded systems management, I believe).  The SEL is maintained by the BMC
or iDRAC.  I believe the message you report below (SBE logging disabled) will
be in the SEL.  I know the SEL logs messages that indicate that the SBE
correction rate has exceeded two successive thresholds (warning and
critical).

You can read and clear the SEL using a number of different methods.  I'm
sure DSET does it.  You can also do it with ipmitool, omreport (part of the
OpenManage Server Administration (OMSA) tools for the Linux command line),
and during POST by hitting Ctrl-E (I think) to get into the BMC or iDRAC
POST utility.  I'm sure there are other ways; these are the ones I've found
useful.

Normal, non-Dell-specific ipmitool will print the SEL records using
'ipmitool sel list', but it does not have the lookup tables and algorithms
needed to tell you the name of the affected DIMM on Dell servers.  You can
also do 'ipmitool sel list -v', which will dump raw field values for each
SEL record, and you can decode those raw values to figure out the affected
DIMM -- with enough examples (and comparing e.g. to the DIMM names in the
Ctrl-E POST SEL view), you might be able to figure out the decoding
algorithm on your own, or google might give you someone who has already
figured out the decoding for your specific PowerEdge model.

That is the downside of using standard ipmitool.  The upside of ipmitool,
though, is that it's quite lightweight, and can be used both on localhost
and across the network (using IPMI over LAN, if you have it configured
appropriately).
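
To collect SEL records with standard ipmitool from a script, something along
these lines may be a starting point.  It is a sketch under several
assumptions: that 'ipmitool sel list' prints pipe-separated fields in the
order id, date, time, sensor, then details; that the "lanplus" interface is
the right one for your BMC (older setups may need "lan"); and that the
sample text, host name, and credentials are made up for illustration and are
not real output from any server.

```python
import subprocess

# Made-up sample in the pipe-separated layout assumed above (not real output).
SAMPLE = """\
   1 | 12/08/2010 | 21:03:11 | Memory #0x01 | Correctable ECC | Asserted
   2 | 12/09/2010 | 02:47:55 | Power Supply #0x64 | Presence detected | Asserted
"""


def build_cmd(host=None, user=None, password=None):
    """Build the ipmitool command line, local by default, over LAN if host is set."""
    cmd = ["ipmitool"]
    if host is not None:
        # Note: -P exposes the password in the process list.
        cmd += ["-I", "lanplus", "-H", host, "-U", user, "-P", password]
    return cmd + ["sel", "list"]


def read_sel(host=None, user=None, password=None):
    """Run ipmitool and return its raw text output."""
    result = subprocess.run(build_cmd(host, user, password),
                            capture_output=True, text=True, check=True)
    return result.stdout


def parse_sel(text):
    """Split each SEL line into fields; lines with an unexpected shape are skipped."""
    records = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 4:
            continue
        records.append({"id": fields[0], "date": fields[1], "time": fields[2],
                        "sensor": fields[3], "detail": fields[4:]})
    return records


def memory_events(records):
    """Keep records whose sensor or detail text mentions memory or ECC."""
    return [r for r in records
            if "memory" in r["sensor"].lower()
            or any("ecc" in d.lower() for d in r["detail"])]
```

memory_events(parse_sel(read_sel())) would then give the memory-related
records on the local node.  As noted above, this still will not tell you the
DIMM name on a Dell server; it only narrows down which records to decode.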

The good news is that there's a Dell-specific version of ipmitool available,
which adds some Dell-specific capabilities, including decoding DIMM names.
This works at least for current PowerEdge R and M servers, as well as older
PowerEdge models like the 1950, and probably a few generations older than
that.  I think it simply supports all models that the corresponding version
of OpenManage supports; this does not include older SC servers or current C
servers.  If you have a model that OpenManage does not support, it may be
worth trying, in case it does the right thing for you.

You can get the 'delloem' version of ipmitool from the OpenManage Management
Station package.  The current latest URL is

ftp://ftp.dell.com/sysman/OM-MgmtStat-Dell-Web-LX-6.4.0-1401_A01.tar.gz

Then unpack it and look in ./linux/bmc/ipmitool/ for your OS or a compatible
one.

For example, looking in the RHEL5_x86_64 subdirectory, the rpm
OpenIPMI-tools-2.0.16-99.dell.1.99.1.el5.x86_64.rpm has /usr/bin/ipmitool
with 'delloem' appearing as a string internally.  (I'm not able to test it
right now.)

Once you've installed the appropriate package, do 'ipmitool delloem'; this
should tell you what the secondary options are.  I believe 'ipmitool delloem
sel' will decode the SEL including the correct DIMM names.

If you install OpenManage appropriately, you can also get the SEL decoded,
as well as get alerts automatically and immediately sent to syslog.  The
command line to print a decoded SEL is 'omreport system esmlog'.  OpenManage
is pretty heavy-weight, though.  Some people do install it and leave it
running on HPC compute nodes; some people would never do that on a
production node.

Your mention of getting log messages about the SBEs makes me think you do
have OMSA installed and its daemons running -- is that correct?  Try
'omreport system esmlog' if so.

Finally, hitting Ctrl-E at the prompted moment during POST will get you into
the BMC or iDRAC POST menu system, in which you can view and optionally clear
the SEL.  I do not think this is easily scriptable, but if all else fails,
that is one way to view the SEL, with proper decoding.

I know that's long, and I hope that helps you and possibly others.

David

On Thu, Dec 9, 2010 at 1:54 PM, Prentice Bisbal <prentice at ias.edu> wrote:

> Jon Forrest wrote:
> > On 12/9/2010 8:08 AM, Prentice Bisbal wrote:
> >
> >> So far, mprime appears to be working. I was able to trigger an SBE in 21
> >> hours the first time I ran it.  I plan on running it repeatedly for the
> >> next few days to see how well it can repeat finding errors.
> >
> > After it finds an error how do you
> > figure out which memory module to
> > replace?
> >
>
> The LCD display on the front of the server tells me, with a message like
> this:
>
> "SBE logging disabled on DIMM C3. Reseat DIMM"
>
> I can also generate a report with DELL DSET that shows me a similar
> other message. I'm sure there are other tools, but I usually have to
> create a DSET report to send to Dell, anyway.
>
> --
> Prentice
>