[Beowulf] Memory stress testing tools.

Fri Dec 10 11:47:20 PST 2010

Prentice,

Thanks for filling in some details.  What you say makes complete sense to
me.

Is it the case that frigga has seen similar stress with no SBE errors?  If
so, I agree it seems like something else is going on besides bad DIMMs. To
test that, if you can schedule simultaneous downtime on the two boxes, you
might swap all DIMMs between odin and frigga.

If you do a few DIMM replacements but continue to have the sense that DIMM
replacements aren't really solving the problem, and you have good evidence
for why you think that, I encourage you to make sure Dell Support hears and
understands that, and to make sure they're looking at the system as a whole
rather than at individual DIMMs.  They may look more broadly on their own, or
you may need to nudge them.

David

On Fri, Dec 10, 2010 at 6:24 AM, Prentice Bisbal <prentice at ias.edu> wrote:

> David,
>
> Thanks for the e-mail. Due to its length, I'm not including it in my reply,
> which I know is normally bad mailing list etiquette.
>
> The server is a Dell PowerEdge R815 with 4 8-Core AMD processors and 128 GB
> of RAM.
>
> I installed two identical servers at the same time, named frigga and odin
> (husband and wife in Norse mythology, if you're curious). These nodes are not
> part of a Beowulf cluster, but this is the best forum I know of to discuss
> problems like this.
>
> Odin is the system with errors, and it started reporting SBE errors almost
> immediately, even when the system was completely idle. They started within
> hours of operating system installation, before users were even able to log in
> to the system.
>
> As you pointed out, I don't think SBE errors are fatal, but I like to
> address all system errors I identify, no matter how trivial. I find that when
> you get used to ignoring "harmless" errors, you eventually end up ignoring
> all errors.
>
> So, you are right that I'm looking for a tool to quickly and reliably
> reproduce SBEs so that I can quickly resolve this problem with Dell. For
> reasons I can't discuss here, working with the user is not an option. Due to
> the nature of my institution, users are only here for a couple of years,
> anyway, and I'm looking for a tool that I can use long after this user (and
> his code) are gone.
>
> I have been keeping detailed logs of exactly when the SBE errors occur, and
> I have been reseating and swapping DIMMs to see if the errors move with the
> DIMM or stay with the slot, to determine whether it's a bad DIMM or a bad
> motherboard. On the first occasion, the error did move with the DIMM, and I
> replaced the DIMM. Since then, the errors have been moving from DIMM to
> DIMM, even across banks of DIMMs. Since each bank corresponds to a socket,
> this would indicate that either it's not a bad on-chip memory controller, or
> they're all bad.
>
> My goal is to find a tool that reproduces SBE errors in a finite time frame,
> so that I can run it repeatedly and collect data on where these SBEs occur.
> I suspect it's a bad motherboard, but unless I have overwhelming data
> showing that, Dell will just keep replacing the DIMMs, and I'm pretty
> confident it's not bad DIMMs in this case.
>
> As stated earlier, HPL wasn't reliable for me in this capacity. I'm now
> using mprime's stress test mode, and will also test stressapptest.
>
>
> --
> Prentice
>
>