[Beowulf] Memory Testing?

David Kewley david.t.kewley at gmail.com
Sat Aug 20 13:16:28 PDT 2011


A few bits from my corner of the experience space:

If you have a BMC, 'ipmitool sel list' will probably show the correctable
and uncorrectable errors, though generally without naming the DIMM involved.
But 'ipmitool sel list -v' shows details from various fields in the SEL
records.  On the ASUS boards I've been playing with lately, the Sensor Number
field together with the Event Data field will (usually) tell you the DIMM
slot, once you know how to decode those fields for the specific motherboard
(and possibly firmware revision?) that you have.

How do you get that motherboard-specific data?  By finding a DIMM that
reliably produces errors, and moving it from slot to slot, taking notes on
those two SEL fields above.  I've seen a similar thing work for Dell
machines too.
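
For what it's worth, here is a rough Python sketch of that note-taking.  It
assumes the verbose SEL output comes as "Field : Value" lines with each
record starting at "SEL Record ID", and that memory events carry a Sensor
Type of "Memory"; check both against what your BMC actually prints.  The
sample text, the field values, and the DIMM_A1 slot name are all invented
for illustration, and SLOT_MAP is the table you would fill in by hand as you
move the bad DIMM around.

```python
from collections import Counter

# Invented sample text, laid out the way 'ipmitool sel list -v' is ASSUMED
# to print records ("Field : Value" lines). All values here are made up.
SAMPLE = """\
SEL Record ID          : 0001
 Sensor Type           : Memory
 Sensor Number         : 08
 Event Data            : a00001
 Description           : Correctable ECC

SEL Record ID          : 0002
 Sensor Type           : Memory
 Sensor Number         : 08
 Event Data            : a00001
 Description           : Correctable ECC

SEL Record ID          : 0003
 Sensor Type           : Memory
 Sensor Number         : 08
 Event Data            : a00102
 Description           : Uncorrectable ECC

SEL Record ID          : 0004
 Sensor Type           : Fan
 Sensor Number         : 30
 Event Data            : 52ffff
 Description           : Lower Critical going low
"""

# Hypothetical lookup table, built by hand from your slot-to-slot notes.
SLOT_MAP = {("08", "a00001"): "DIMM_A1"}


def parse_sel_verbose(text):
    """Split verbose SEL text into one dict of fields per record."""
    records = []
    current = None
    for line in text.splitlines():
        if ":" not in line:
            continue  # blank or unrecognized line
        key, _, value = line.partition(":")  # split on the first colon only
        key, value = key.strip(), value.strip()
        if key == "SEL Record ID":
            current = {}
            records.append(current)
        if current is not None:
            current[key] = value
    return records


def tally_memory_events(records):
    """Count memory events per (Sensor Number, Event Data) pair."""
    counts = Counter()
    for rec in records:
        if rec.get("Sensor Type") == "Memory":
            counts[(rec.get("Sensor Number"), rec.get("Event Data"))] += 1
    return counts


def name_slots(counts, slot_map):
    """Translate field pairs to slot names; unmapped pairs stay visible."""
    named = Counter()
    for pair, n in counts.items():
        named[slot_map.get(pair, "unknown %s/%s" % pair)] += n
    return named


if __name__ == "__main__":
    counts = tally_memory_events(parse_sel_verbose(SAMPLE))
    for slot, n in name_slots(counts, SLOT_MAP).items():
        print(slot, n)
```

Pairs that show up as "unknown" are the ones you still need to pin down by
moving the DIMM to another slot and reading the SEL again.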

If you have Dell PowerEdge R or M boxes (or previous generation
equivalents), there are various nicer ways to get the name of the DIMM
involved, including using a version of ipmitool that has the 'delloem'
subcommand.

I second Tony's suggestion that RAM testers may not be as good as real
systems at finding bad RAM.  My experience on one large system a few years
ago was that new DIMMs failed at a rate of around 1% per year, but
"refurbished" DIMMs from RMAs failed at 10% per year (or was it even higher?
I forget).  I was led to believe that these refurbished DIMMs were often
customer returns that had been run through a RAM tester and passed.  Turns
out sometimes the customers were right and the "refurbishment" process was
wrong.

One more thing about the ASUS boards I've been playing with lately: If you
get a panic on uncorrectable memory error, and power cycle the system (using
the power button, or by remote 'ipmitool ... power cycle'), the following
POST does not report the bad DIMM.  But if you *reset* the system (by
pushing the reset button with a paperclip, or by remote 'ipmitool ... power
reset'), the next POST will pause and tell you which CPU, Channel, and DIMM
were affected by that previous uncorrectable error, which is more info than
'ipmitool sel list' gives you.  It's then up to you to figure out how CPU,
Channel, and DIMM map to the silkscreened names on the motherboard -- I
couldn't find documentation, but it turned out to be the pattern we
suspected. :)
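
If you script your recovery, the point is just to send 'power reset' rather
than 'power cycle' after a memory panic.  A small Python sketch of that
follows.  The function names are made up, and connection_args stands in for
whatever options you normally put where the '...' is above; I have not filled
those in.  It defaults to a dry run that only returns the command line.

```python
import subprocess


def ipmi_power_argv(connection_args, action):
    """Build an 'ipmitool ... power <action>' command line as a list."""
    if action not in ("reset", "cycle"):
        raise ValueError("action must be 'reset' or 'cycle'")
    # connection_args is supplied by the caller (hypothetical parameter).
    return ["ipmitool", *connection_args, "power", action]


def reset_after_memory_panic(connection_args, dry_run=True):
    """Use 'reset', not 'cycle', so the next POST can report the DIMM.

    That behavior is what was observed on the ASUS boards described above;
    other boards may differ.
    """
    argv = ipmi_power_argv(connection_args, "reset")
    if not dry_run:
        subprocess.run(argv, check=True)  # raises if ipmitool fails
    return argv


if __name__ == "__main__":
    # "node01" is a placeholder hostname; -H is ipmitool's remote host option.
    print(" ".join(reset_after_memory_panic(["-H", "node01"])))
```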

David