[Beowulf] Advanced Clustering's Breakin

Prentice Bisbal prentice at ias.edu
Wed Oct 1 08:44:01 PDT 2008


> We have a tool on our website called "breakin" that is Linux 2.6.25.9
> patched with K8 and K10f Opteron EDAC reporting facilities. It can
> usually find and identify failed RAM in fifteen minutes (two hours at
> most). The EDAC patches to the kernel aren't that great about naming
> the correct memory rank, though.
> 
> Make sure you have multibit (sometimes says 4-bit) ECC enabled in your BIOS.
> 
> http://www.advancedclustering.com/software/breakin.html

I've been using breakin for the past week or two on my new cluster, and
some of the results seem inconsistent. For example, on one node I'll
get this:

Test     | Pass | Fail |  Last Message
------------------------------------------
hdhealth | 315  |  0   |  No disk devices found

Then in the log section:

00h 57m 40s: Disabling burnin test 'hdhealth'


If I reboot and restart the testing, it will see a hard disk. Why is
breakin not always seeing the disk?

I've tried to dump logs to a USB drive, but breakin refuses to mount the
correct partition on my USB drive (/dev/sdb vs. /dev/sdb1, or vice versa).
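
One way to check which device node the USB stick actually exposes,
independent of breakin, is to look at the kernel's partition list in
/proc/partitions. Below is a minimal Python sketch of that check. It
assumes sd-style device names (sda, sdb1, ...) and that Python is
available on the node, neither of which is confirmed for the breakin
image. The helper name group_partitions and the sample data are
hypothetical, for illustration only.

```python
import re

def group_partitions(proc_partitions_text):
    """Map each whole disk to the list of its partitions.

    Expects text in the format of /proc/partitions:
    a header line, a blank line, then 'major minor #blocks name' rows.
    """
    names = []
    for line in proc_partitions_text.splitlines():
        fields = line.split()
        # Data rows have four fields and start with a numeric major number;
        # this skips the header row and the blank line.
        if len(fields) == 4 and fields[0].isdigit():
            names.append(fields[3])

    disks = {}
    for name in names:
        # Assumption: a partition is a disk name plus trailing digits (sdb -> sdb1).
        parent = re.sub(r"\d+$", "", name)
        if parent != name and parent in names:
            disks.setdefault(parent, []).append(name)
        else:
            disks.setdefault(name, [])
    return disks

# Hypothetical sample; on a live node, read the text from /proc/partitions instead.
SAMPLE = """major minor  #blocks  name

   8        0  244198584 sda
   8        1     104391 sda1
   8       16    1003520 sdb
"""

for disk, parts in sorted(group_partitions(SAMPLE).items()):
    # A stick with no partition table holds its filesystem on the whole device.
    target = parts[0] if parts else disk
    print(f"{disk}: partitions={parts}, candidate mount device /dev/{target}")
```

If the stick shows up with no partitions (as sdb does in the sample), the
filesystem is on the whole device and /dev/sdb is the thing to mount; if
sdb1 is listed, that is the likelier candidate.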

I sent e-mail to Advanced Clustering regarding these issues but didn't
get any response, so I'm hoping I'll have better luck here.


-- 
Prentice


