[Beowulf] Advanced Clustering's Breakin
prentice at ias.edu
Wed Oct 1 08:44:01 PDT 2008
> We have a tool on our website called "breakin" that is Linux 18.104.22.168
> patched with K8 and K10f Opteron EDAC reporting facilities. It can
> usually find and identify failed RAM in fifteen minutes (two hours at
> most). The EDAC patches to the kernel aren't that great about naming
> the correct memory rank, though.
> Make sure you have multibit (sometimes says 4-bit) ECC enabled in your BIOS.
I've been using breakin for the past week or two on my new cluster. I
get some results that seem to be inconsistent. For example on a node
I'll get this:
Test | Pass | Fail | Last Message
hdhealth | 315 | 0 | No disk devices found
Then in the log section:
00h 57m 40s: Disabling burnin test 'hdhealth'
If I reboot and restart the testing, it will see a hard disk. Why is
breakin not always seeing the disk?
I've tried to dump logs to a USB drive, but breakin refuses to mount the
correct partition on my USB drive (/dev/sdb vs. /dev/sdb1, or vice versa).
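For anyone chasing the same /dev/sdb vs. /dev/sdb1 confusion, here is a
minimal Python sketch that groups the names the kernel lists in
/proc/partitions into whole disks and their partitions. The function name
and the sample data are illustrative assumptions, not part of breakin, and
it assumes a Python interpreter is available on the node, which may not be
the case inside the breakin image itself.

```python
import os


def split_disks_and_partitions(proc_partitions_text):
    """Group /proc/partitions-style names into {disk: [partitions]}."""
    names = []
    for line in proc_partitions_text.splitlines():
        fields = line.split()
        # Data rows have four fields: major, minor, #blocks, name.
        # The header row is skipped because its first field is not a number.
        if len(fields) == 4 and fields[0].isdigit():
            names.append(fields[3])

    disks = {}
    # Shorter names first, so a disk is always seen before its partitions.
    for name in sorted(names, key=lambda n: (len(n), n)):
        parent = None
        for disk in disks:
            if not name.startswith(disk):
                continue
            rest = name[len(disk):]
            # Partitions look like "sdb1"; some drivers use "p1" instead.
            if rest.startswith("p"):
                rest = rest[1:]
            if rest.isdigit():
                parent = disk
                break
        if parent is None:
            disks[name] = []
        else:
            disks[parent].append(name)
    return disks


# Made-up sample: one hard disk and one USB stick, each with one partition.
SAMPLE = """major minor  #blocks  name

   8        0  156290904 sda
   8        1  156288000 sda1
   8       16    3915776 sdb
   8       17    3914752 sdb1
"""

if __name__ == "__main__":
    print(split_disks_and_partitions(SAMPLE))
    if os.path.exists("/proc/partitions"):
        with open("/proc/partitions") as f:
            print(split_disks_and_partitions(f.read()))
```

A disk that maps to an empty list has no partition table the kernel
recognized, so the filesystem (if any) would be on the whole device, e.g.
/dev/sdb rather than /dev/sdb1.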
I sent e-mail to Advanced Clustering regarding these issues, but didn't
get any response, so I'm hoping I have better luck here.