[Beowulf] Seeing ECC errors since upgraded from Opteron 246 to 275


Paulo Afonso Lopes pal at di.fct.unl.pt
Sun Aug 24 04:48:00 PDT 2008


> On Wed, Aug 06, 2008 at 02:56:51PM -0500, Jason Clinton wrote:
>
>> We have a tool on our website called "breakin" that is Linux 2.6.25.9
>> patched with K8 and K10f Opteron EDAC reporting facilities. It can
>> usually find and identify failed RAM in fifteen minutes (two hours at
>> most). The EDAC patches to the kernel aren't that great about naming
>> the correct memory rank, though.
>>
>> Make sure you have multibit (sometimes says 4-bit) ECC enabled in your
>> BIOS.
>>
>> http://www.advancedclustering.com/software/breakin.html
>
> I just gave this a try, and it seems to be a very nicely packaged
> utility. Thanks for making it available. I've used some similar stuff
> before, but this is really easy.
>
> -- greg
>

After more than a week of testing I can assert :-) that the cause was poor
power: the UPS was operating outside its envelope. Since I redistributed
the load, moving some nodes to other UPSes, the errors have gone away.
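
For anyone who wants to confirm a fix like this on their own nodes, one option is to snapshot the kernel's EDAC error counters before and after a change and compare them. The following is only a minimal sketch, not something used in this thread. It assumes the EDAC driver is loaded and exposes per-controller `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc`, which may not match every kernel or patch set. The function names `read_edac_counts` and `new_errors` are hypothetical.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} for each mc* directory.

    Assumes the standard EDAC sysfs layout; returns an empty dict if the
    directory is missing (driver not loaded, or a different layout).
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


def new_errors(before, after):
    """Return controllers whose counters grew between two snapshots."""
    grown = {}
    for name, (ce, ue) in after.items():
        old_ce, old_ue = before.get(name, (0, 0))
        if ce > old_ce or ue > old_ue:
            grown[name] = (ce - old_ce, ue - old_ue)
    return grown


if __name__ == "__main__":
    # Print the current counters for each memory controller on this node.
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Taking one snapshot before moving nodes between UPSes and another some days later, then passing both to `new_errors`, shows whether any controller is still accumulating errors. The counters normally reset on reboot, so snapshots are only comparable within one uptime.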

Thanks for all the suggestions,

paulo



-- 
Paulo Afonso Lopes                        | Tel: +351- 21 294 8536
Departamento de Informática               | 294 8300 ext.10763
Faculdade de Ciências e Tecnologia        | Fax: +351- 21 294 8541
Universidade Nova de Lisboa               | e-mail: pal at di.fct.unl.pt
2829-516 Caparica, PORTUGAL





