[Beowulf] Seeing ECC errors since upgraded from Opteron 246 to 275

Paulo Afonso Lopes pal at di.fct.unl.pt
Sun Aug 24 04:48:00 PDT 2008


> On Wed, Aug 06, 2008 at 02:56:51PM -0500, Jason Clinton wrote:
>
>> We have a tool on our website called "breakin" that is Linux 2.6.25.9
>> patched with K8 and K10f Opteron EDAC reporting facilities. It can
>> usually find and identify failed RAM in fifteen minutes (two hours at
>> most). The EDAC patches to the kernel aren't that great about naming
>> the correct memory rank, though.
>>
>> Make sure you have multibit (sometimes says 4-bit) ECC enabled in your
>> BIOS.
>>
>> http://www.advancedclustering.com/software/breakin.html
>
> I just gave this a try, and it seems to be a very nicely packaged
> utility. Thanks for making it available. I've used some similar stuff
> before, but this is really easy.
>
> -- greg
>

After more than a week of testing I can assert :-) that the cause was poor
power: the UPS was operating outside its envelope. Since I redistributed
the load, moving some nodes to other UPSes, the errors have gone away.

Thanks for all the suggestions,

paulo



-- 
Paulo Afonso Lopes                        | Tel: +351- 21 294 8536
Departamento de Informática               | 294 8300 ext.10763
Faculdade de Ciências e Tecnologia        | Fax: +351- 21 294 8541
Universidade Nova de Lisboa               | e-mail: pal at di.fct.unl.pt
2829-516 Caparica, PORTUGAL





