[Beowulf] Seeing ECC errors since upgraded from Opteron 246 to 275

Paulo Afonso Lopes pal at di.fct.unl.pt
Sat Aug 2 04:57:37 PDT 2008


Thanks, Mark

>> So I have 2 DL145-G2 nodes with 2 single-core 246 / 4GB each, and 2
>> DL145-G2 nodes with 2 dual-core 275 / 4GB each.
>
> it's worth making sure you have current bios installed.
>
Not the latest, but the one before it; according to the "Fixes" notes, the
latest release adds just a single, unrelated fix. Anyway, I'm upgrading it...
>
>> 07/28/2008 | 17:52:23 | Memory #0x02 | Uncorrectable ECC | Asserted
>
> it may also be useful to run mcelog, which will tell you about
> any ongoing _correctable_ ECC activity.

No output on any of the 4 hosts; I tried with and without --k8, --dmi, etc.
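Since mcelog reports nothing, the BMC's event log (where the "Uncorrectable ECC" line quoted above came from) is still the place to watch. The following is a minimal sketch, not a tested tool, for filtering saved SEL text down to memory ECC events. It assumes pipe-delimited lines shaped like the quoted one, optionally with a leading record-id column as `ipmitool sel list` typically prints. It also assumes that memory sensors are named "Memory ..." as on this DL145-G2. The helper names `parse_sel_line` and `ecc_events` are hypothetical, and other BMCs may use different sensor or event names.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class SelEvent(NamedTuple):
    date: str
    time: str
    sensor: str
    event: str
    state: str


def parse_sel_line(line: str) -> Optional[SelEvent]:
    """Parse one pipe-delimited SEL line; return None if it doesn't fit."""
    fields = [f.strip() for f in line.split("|")]
    # Assumption: a leading record-id column may be present; keep the last five.
    if len(fields) > 5:
        fields = fields[-5:]
    if len(fields) != 5:
        return None
    return SelEvent(*fields)


def ecc_events(lines: Iterable[str], uncorrectable_only: bool = True) -> Iterator[SelEvent]:
    """Yield memory ECC events, optionally only the uncorrectable ones."""
    for line in lines:
        ev = parse_sel_line(line)
        # Assumption: memory sensors are named "Memory ..." as in the quoted log.
        if ev is None or not ev.sensor.startswith("Memory"):
            continue
        if "ECC" not in ev.event:
            continue
        if uncorrectable_only and "Uncorrectable" not in ev.event:
            continue
        yield ev


if __name__ == "__main__":
    sample = [
        "07/28/2008 | 17:52:23 | Memory #0x02 | Uncorrectable ECC | Asserted",
        "07/28/2008 | 17:50:01 | Power Unit #0x01 | Power off/down | Asserted",
    ]
    for ev in ecc_events(sample):
        print(ev.date, ev.time, ev.sensor, ev.event)
```

Passing `uncorrectable_only=False` would also surface correctable ECC events, if the BMC logs them. That covers the kind of ongoing activity mcelog would otherwise be expected to show.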

Just a side note, since this is being discussed in another thread: I have
been quite happy with the DL145-G2's IPMI/BMC board. I was able to control
its power remotely on every occasion, including after crashes.


-- 
Paulo Afonso Lopes                        | Tel: +351- 21 294 8536
Departamento de Informática               | 294 8300 ext.10763
Faculdade de Ciências e Tecnologia        | Fax: +351- 21 294 8541
Universidade Nova de Lisboa               | e-mail: pal at di.fct.unl.pt
2829-516 Caparica, PORTUGAL





