Prentice,<br>

<br>

Thanks for filling in some details.  What you say makes complete sense to me.<br>

<br>

Is it the case that frigga has seen similar stress with no SBE errors?  

If so, I agree it seems like something else is going on besides bad 

DIMMs. To test that, if you can schedule simultaneous downtime on the 

two boxes, you might swap all DIMMs between odin and frigga.<br>

<br>

If you do a few DIMM replacements, but continue to have the sense that 

DIMM replacements aren't really solving the problem, and you have good 

evidence why you think that, I encourage you to make sure Dell Support hears and

 understands that, and make sure they're looking more holistically than 

individual DIMMs.  They may look more broadly on their own, or you may 

need to nudge them.<br>

<br>

David<br><br><div class="gmail_quote">On Fri, Dec 10, 2010 at 6:24 AM, Prentice Bisbal <span dir="ltr">&lt;<a href="mailto:prentice@ias.edu" target="_blank">prentice@ias.edu</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">


David,<br>

<br>

Thanks for the e-mail. Due to its length, I'm not including it in my reply, which I know is normally bad mailing list etiquette.<br>

<br>

The server is a Dell PowerEdge R815 with four 8-core AMD processors and 128 GB of RAM.<br>

<br>

I installed two identical servers at the same time, named frigga and odin (husband and wife in Norse mythology, if you're curious). These nodes are not part of a Beowulf cluster, but this is the best forum I know of to discuss problems like this.<br>


<br>

Odin is the system with errors, and it started reporting SBE errors almost immediately, even when the system was completely idle. They started within hours of operating system installation, before users were even able to login to the system.<br>


<br>

As you pointed out, I don't think SBE errors are fatal, but I like to address all system errors I identify, no matter how trivial. I find that when you get used to ignoring "harmless" errors, you eventually end up ignoring all errors.<br>


<br>

So, you are right that I'm looking for a tool to quickly and reliably reproduce SBEs so that I can quickly resolve this problem with Dell. For reasons I can't discuss here, working with the user is not an option. Due to the nature of my institution, users are only here for a couple of years, anyway, and I'm looking for a tool that I can use long after this user (and his code) are gone.<br>


<br>

I have been keeping detailed logs of exactly when the SBE errors occur, and I have been reseating and swapping DIMMs to see if the errors move with the DIMM or stay with the slot, to determine whether it's a bad DIMM or a bad motherboard. On the first occasion, the error did move with the DIMM, and I replaced the DIMM. Since then, the errors have been moving from DIMM to DIMM, even across banks of DIMMs. Since each bank corresponds to a socket, this would indicate that either it's not a bad on-chip memory controller, or they're all bad.<br>


<br>

My goal is to find a tool that reproduces SBE errors in a finite time frame, so that I can run it repeatedly and collect data on where these SBEs occur. I suspect it's a bad motherboard, but unless I have overwhelming data showing that, Dell will just keep replacing the DIMMs, and I'm pretty confident it's not bad DIMMs in this case.<br>


<br>

As stated earlier, HPL wasn't reliable for me in this capacity. I'm now using mprime's stress test mode, and will also test stressapptest.<br>

<br>

<br>

-- <br><font color="#888888">

Prentice<br>

<br>

</font></blockquote></div><br>