[Beowulf] Stress / torture test cluster hardware
agshew at gmail.com
Sat Oct 7 21:26:52 PDT 2006
On 10/7/06, Nico Mittenzwey <nico.mittenzwey at s2001.tu-chemnitz.de> wrote:
> "memtest86" http://www.memtest86.com/
If you are using a large amount of ECC memory, you may find it
necessary to keep track of Single Bit Errors and look for "weak"
DIMMs using something like the EDAC/bluesmoke drivers
(http://bluesmoke.sourceforge.net) and a userspace memory tester.
On a 264-node cluster with 8-16 GB of RAM per node, I had to weed out
weak memory over a period of months. A given node
running a memory tester would show no SBEs for a day or
more, then suddenly show a huge burst. Another system
might have a more consistently incrementing SBE counter.
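
To make the counter-watching above concrete, here is a rough sketch of how one could snapshot the corrected-error (SBE) counters that the EDAC drivers export and see which rows grew between two snapshots. The sysfs path and the per-csrow `ce_count` layout are assumptions based on the in-kernel EDAC driver; older bluesmoke releases or newer kernels may lay things out differently. The function names `read_ce_counts` and `new_errors` are made up for illustration.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller counters in sysfs.
# Check your own kernel; this layout is not guaranteed everywhere.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_ce_counts(root=EDAC_MC_ROOT):
    """Return {"mc0/csrow0": corrected_error_count, ...} for each csrow found."""
    counts = {}
    for counter in sorted(Path(root).glob("mc*/csrow*/ce_count")):
        try:
            value = int(counter.read_text().strip())
        except (OSError, ValueError):
            continue  # skip counters that are unreadable or malformed
        csrow = counter.parent
        counts[f"{csrow.parent.name}/{csrow.name}"] = value
    return counts


def new_errors(previous, current):
    """Return only the csrows whose corrected-error count grew between snapshots."""
    grown = {}
    for name, count in current.items():
        delta = count - previous.get(name, 0)
        if delta > 0:  # a negative delta (counter reset after reboot) is ignored
            grown[name] = delta
    return grown


if __name__ == "__main__":
    # Print one snapshot; prints nothing if no EDAC driver is loaded.
    for name, count in read_ce_counts().items():
        print(name, count)
```

Taking a snapshot every few minutes and logging the output of `new_errors` would be one way to tell a DIMM that fails in sudden bursts from one whose counter creeps up steadily.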
Now, ECC was working, so applications like the memory
tester weren't having problems. However, I couldn't reliably
reboot this cluster because the BIOS would often refuse to
boot unless a node was powered off for, say, five minutes.
I wrote up more of this experience on the Real World Tech
It looks like the latest version of Stresslinux has a 18.104.22.168
kernel, so it should have the EDAC drivers included. Plus,
it has the userspace memtester. Memtest86 is nice, but it
didn't support checking the ECC counters on the cluster
I mention above. It couldn't help me weed out DIMMs at
for some more info about this (links to LWN articles and a
list of supported drivers in 2.6.16).
I wasn't aware of the EDAC wiki until I saw it linked
from the bluesmoke page just now. It will tell you
what chipset support is coming.
I would be interested to hear what kind of single bit error
rates other people see on their clusters.
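
If people do post numbers, they are easier to compare when normalized by memory size and run time. A minimal sketch, assuming "errors per GB per day" is a reasonable unit (that choice, and the function name `sbe_rate_per_gb_day`, are mine and not any kind of standard):

```python
def sbe_rate_per_gb_day(error_count, ram_gb, hours):
    """Corrected single bit errors per GB of RAM per day of observation."""
    if ram_gb <= 0 or hours <= 0:
        raise ValueError("ram_gb and hours must be positive")
    days = hours / 24.0
    return error_count / ram_gb / days


if __name__ == "__main__":
    # Made-up example: 48 SBEs on a 16 GB node over a 72 hour memory test.
    print(sbe_rate_per_gb_day(48, 16, 72))
```

A single average like this hides the bursty behaviour described earlier, so it is probably worth reporting the observation window along with the rate.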