[Beowulf] Memory stress testing tools.
prentice at ias.edu
Fri Dec 10 06:24:30 PST 2010
Thanks for the e-mail. Due to its length, I'm not including it in my
reply, which I know is normally bad mailing list etiquette.
The server is a Dell PowerEdge R815 with four 8-core AMD processors and
128 GB of RAM.
I installed two identical servers at the same time, named frigga and
odin (husband and wife in Norse mythology, if you're curious). These nodes
are not part of a beowulf cluster, but this is the best forum I know of
to discuss problems like this.
Odin is the system with errors, and it started reporting SBE errors
almost immediately, even when the system was completely idle. They
started within hours of operating system installation, before users were
even able to login to the system.
As you pointed out, I don't think SBE errors are fatal, but I like to
address all system errors I identify, no matter how trivial. I find that
when you get used to ignoring "harmless" errors, you eventually end up
ignoring all errors.
So, you are right that I'm looking for a tool to quickly and reliably
reproduce SBEs so that I can quickly resolve this problem with Dell. For
reasons I can't discuss here, working with the user is not an option.
Due to the nature of my institution, users are only here for a couple of
years, anyway, and I'm looking for a tool that I can use long after this
user (and his code) are gone.
I have been keeping detailed logs of exactly when the SBE errors occur.
I have also been reseating and swapping DIMMs to see if the errors move
with the DIMM or stay with the slot, to determine whether it's a bad
DIMM or a bad motherboard. On the first occasion, the error did move
with the DIMM, and I replaced the DIMM. Since then, the errors have been
moving from DIMM to DIMM, even across banks of DIMMs. Since each bank
corresponds to a socket, this would indicate that it's not a single bad
on-chip memory controller, unless they're all bad.
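
The DIMM-versus-slot reasoning above can be sketched as a small tally
over the error log. This is only an illustration under assumptions: the
record layout (timestamp, slot label, DIMM serial) and the function names
are hypothetical, and no real data from odin is shown.

```python
from collections import Counter


def tally_sbe(events):
    """Count single-bit errors per slot and per DIMM serial.

    events: iterable of (timestamp, slot, dimm_serial) tuples.
    This record layout is an assumption made for the sketch.
    """
    by_slot = Counter()
    by_dimm = Counter()
    for _timestamp, slot, serial in events:
        by_slot[slot] += 1
        by_dimm[serial] += 1
    return by_slot, by_dimm


def classify(by_slot, by_dimm):
    """Rough heuristic: do the errors follow a DIMM or stay with a slot?"""
    if not by_slot:
        return "no errors logged"
    if len(by_dimm) == 1 and len(by_slot) > 1:
        # Same DIMM, several slots: the DIMM itself is suspect.
        return "errors follow the DIMM"
    if len(by_slot) == 1 and len(by_dimm) > 1:
        # Same slot, several DIMMs: the slot or board is suspect.
        return "errors stay with the slot"
    if len(by_slot) == 1 and len(by_dimm) == 1:
        return "inconclusive: swap the DIMM and retest"
    # Many DIMMs and many slots: points away from any single DIMM.
    return "errors spread across DIMMs and slots"
```

Feeding it every logged SBE after each swap gives a running count that
is easier to hand to a vendor than a list of timestamps.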
My goal is to find a tool that reproduces SBE errors in a finite time
frame, so that I can run it repeatedly and collect data on where these
SBEs occur. I suspect it's a bad motherboard, but
unless I have overwhelming data showing that, Dell will just keep
replacing the DIMMs, and I'm pretty confident it's not bad DIMMs in this
case.
As stated earlier, HPL wasn't reliable for me in this capacity. I'm now
using mprime's stress test mode, and will also test stressapptest.
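
One way to collect per-location data around a stress run is to snapshot
the kernel's corrected-error counters before and after. The sketch below
assumes a Linux system with an EDAC driver loaded, exposing counters as
mc*/csrow*/ce_count under /sys/devices/system/edac/mc; the function names
are hypothetical, and how csrows map to physical DIMM slots depends on
the board.

```python
from pathlib import Path


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {"mcN/csrowM": corrected_error_count} from EDAC sysfs.

    Assumes the csrow-based EDAC layout; returns an empty dict if no
    counters are found (for example, when no EDAC driver is loaded).
    """
    counts = {}
    for ce_file in sorted(Path(edac_root).glob("mc*/csrow*/ce_count")):
        csrow = ce_file.parent
        key = f"{csrow.parent.name}/{csrow.name}"
        counts[key] = int(ce_file.read_text().strip())
    return counts


def ce_delta(before, after):
    """Return only the locations whose corrected-error count changed."""
    delta = {}
    for key, value in after.items():
        change = value - before.get(key, 0)
        if change != 0:
            delta[key] = change
    return delta
```

Take one snapshot with read_ce_counts() before starting the stress test
and another afterwards; ce_delta() of the two shows which memory
controller and csrow accumulated corrected errors during the run.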