[Beowulf] Stress / torture test cluster hardware
agshew at gmail.com
Sat Oct 7 21:26:52 PDT 2006
On 10/7/06, Nico Mittenzwey <nico.mittenzwey at s2001.tu-chemnitz.de> wrote:
> "memtest86" http://www.memtest86.com/
If you are using a large amount of ECC memory, you may find it
necessary to keep track of Single Bit Errors and look for "weak"
DIMMs using something like the EDAC/bluesmoke drivers
(http://bluesmoke.sourceforge.net) and a userspace memory tester.
On a 264 node cluster with 8-16GB RAM, I had to weed out
weak memory over a period of months. A given node
running a memory tester would show no SBEs for a day or
more, then suddenly show a huge burst. Another system
might have a more consistently incrementing SBE counter.
Now, ECC was working, so applications like the memory
tester weren't having problems. However, I couldn't reliably
reboot this cluster because the BIOS would often refuse to
boot unless a node was powered off for, say, five minutes.
I wrote up more of this experience on the Real World Tech
It looks like the latest version of Stresslinux has a 220.127.116.11
kernel, so it should have the EDAC drivers included. Plus,
it has the userspace memtester. Memtest86 is nice, but it
didn't support checking the ECC counters on the cluster
I mention above. It couldn't help me weed out DIMMs at
for some more info about this (links to LWN articles and a
list of supported drivers in 2.6.16).
I wasn't aware of the EDAC wiki until I saw it linked
from the bluesmoke page just now. It will tell you
what chipset support is coming.
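For anyone who wants to watch the counters themselves, here is a rough
sketch of polling for corrected (single bit) errors. It assumes the EDAC
driver for your chipset is loaded and exposes per-controller counts as
/sys/devices/system/edac/mc/mc*/ce_count; check that path on your own
kernel. The function names are made up for illustration and are not part
of any EDAC tool.

```python
import glob
import os

# Assumed sysfs location of the EDAC memory controller counters.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_ce_counts(root=EDAC_ROOT):
    """Return {controller: corrected_error_count} for each mc* directory."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(root, "mc*", "ce_count"))):
        controller = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            counts[controller] = int(f.read().strip())
    return counts


def ce_deltas(previous, current):
    """Return only the controllers whose count grew since the last sample."""
    return {
        mc: current[mc] - previous.get(mc, 0)
        for mc in current
        if current[mc] > previous.get(mc, 0)
    }
```

Run read_ce_counts() from cron or a loop while the memory tester is going,
and log ce_deltas() against the previous sample with a timestamp. Keeping
the history per node is what lets you tell a sudden burst from a counter
that creeps up steadily.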
I would be interested to hear what kind of single bit error
rates other people see on their clusters.