[Beowulf] Stress / torture test cluster hardware

Andrew Shewmaker agshew at gmail.com
Sat Oct 7 21:26:52 PDT 2006


On 10/7/06, Nico Mittenzwey <nico.mittenzwey at s2001.tu-chemnitz.de> wrote:

> "memtest86" http://www.memtest86.com/

If you are using a large amount of ECC memory, you may find it
necessary to keep track of single-bit errors (SBEs) and look for "weak"
DIMMs using something like the EDAC/bluesmoke drivers
(http://bluesmoke.sourceforge.net) and a userspace memory
tester.
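
As a rough illustration of what "keeping track" could look like, here
is a minimal Python sketch that reads the corrected-error counters the
EDAC drivers export through sysfs.  It assumes the per-controller
ce_count files live under /sys/devices/system/edac/mc, which is where
the kernel's EDAC documentation places them; the function name
read_ce_counts is made up for this example, and the path may differ on
a given kernel or chipset.

```python
import glob
import os


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: corrected_error_count} for each EDAC memory controller.

    edac_root is an assumption about where the driver exposes its counters;
    pass a different directory if your kernel lays things out differently.
    """
    counts = {}
    for path in sorted(glob.glob(os.path.join(edac_root, "mc*", "ce_count"))):
        controller = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as f:
                counts[controller] = int(f.read().strip())
        except (OSError, ValueError):
            # Skip counters that are unreadable or not a plain integer.
            continue
    return counts


if __name__ == "__main__":
    for name, count in read_ce_counts().items():
        print(f"{name}: {count} corrected errors")
```

Run periodically (from cron, say) on each node while the userspace
memory tester is going, and log the output so the counts can be
compared over time.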

On a 264-node cluster with 8-16GB RAM, I had to weed out
weak memory over a period of months.  A given node
running a memory tester would show no SBEs for a day or
more, then suddenly show a huge burst.  Another system
might have a more consistently incrementing SBE counter.
Now, ECC was working, so applications like the memory
tester weren't having problems.  However, I couldn't reliably
reboot this cluster because the BIOS would often refuse to
boot unless a node was powered off for, say, five minutes.
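
The two patterns described above (a long quiet stretch followed by a
sudden burst, versus a counter that creeps up steadily) can be told
apart from a series of periodic counter readings.  The following
Python sketch is one hypothetical way to do that; the function name,
the 0.8 burst threshold, and the minimum of three intervals are all
arbitrary choices for illustration, not anything taken from the
cluster in question.

```python
def classify_sbe_pattern(samples, burst_share=0.8):
    """Classify cumulative corrected-error readings taken at regular intervals.

    Returns "clean", "burst", "steady", or "unknown" (too few readings).
    burst_share is a made-up threshold: the fraction of all new errors that
    must fall in a single interval for the node to count as bursty.
    """
    # Per-interval increases; a counter reset (e.g. after a reboot) shows up
    # as a negative difference and is clamped to zero here.
    deltas = [max(0, b - a) for a, b in zip(samples, samples[1:])]
    total = sum(deltas)
    if total == 0:
        return "clean"
    if len(deltas) < 3:
        return "unknown"
    if max(deltas) >= burst_share * total:
        return "burst"
    return "steady"


# Example readings, one per hour (invented numbers):
print(classify_sbe_pattern([0, 0, 0, 0, 500, 500]))  # burst
print(classify_sbe_pattern([0, 3, 6, 9, 12]))        # steady
```

Either result marks a node worth a closer look; the distinction mostly
matters for how long a node has to be tested before it can be
considered clean.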

I wrote up more of this experience on the Real World Tech
forum:

http://www.realworldtech.com/forums/index.cfm?action=detail&id=69894&threadid=69639&roomid=11

It looks like the latest version of Stresslinux has a 2.6.16.18
kernel, so it should have the EDAC drivers included.  Plus,
it has the userspace memtester.  Memtest86 is nice, but it
didn't support checking the ECC counters on the cluster
I mentioned above.  It couldn't help me weed out DIMMs at
all.

See http://agenda.clustermonkey.net/index.php/Memory
for some more info about this (links to LWN articles and a
list of supported drivers in 2.6.16).

I wasn't aware of the EDAC wiki until I saw it linked
from the bluesmoke page just now.  It will tell you
what chipset support is coming.

http://buttersideup.com/edacwiki/

I would be interested to hear what kind of single-bit
error rates other people see on their clusters.

-- 
Andrew Shewmaker


