[Beowulf] Seeing ECC errors since upgraded from Opteron 246 to 275

Paulo Afonso Lopes pal at di.fct.unl.pt
Fri Aug 1 08:40:42 PDT 2008


Dear all:

Around 2 April I removed 2 Opteron 246s and their "companion" 4x 512 MB
DIMMs from two HP DL145-G2s, leaving them empty, to populate two other
DL145-G2s (which now have 2 CPUs and 4 GB per node).

Then I installed 2 dual-core Opteron 275s in each of the emptied DL145-G2s,
together with 4 sticks of 1 GB (2 sticks per CPU).

So I have 2 DL145-G2 nodes with 2 single-core 246 / 4GB each, and 2
DL145-G2 nodes with 2 dual-core 275 / 4GB each.

On 18 April, one of the dual-core nodes crashed with an ECC error. The
IPMI event log for that node shows:

 04/18/2008 | 20:26:26 | Memory #0x02 | Uncorrectable ECC | Asserted
 06/18/2008 | 12:00:16 | Memory #0x02 | Uncorrectable ECC | Asserted
 06/23/2008 | 11:58:34 | Memory #0x02 | Uncorrectable ECC | Asserted
 07/19/2008 | 22:41:12 | Memory #0x02 | Uncorrectable ECC | Asserted
 07/22/2008 | 17:18:00 | Memory #0x02 | Uncorrectable ECC | Asserted
 07/23/2008 | 22:08:15 | Memory #0x02 | Uncorrectable ECC | Asserted
 07/28/2008 | 17:52:23 | Memory #0x02 | Uncorrectable ECC | Asserted

On 07/19 the memory attached to CPU0 was replaced; on 07/27 the remaining
memory was replaced. The ECC crashes continue, at a rate varying from one
per day to one per week.
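
For anyone who wants to check the crash spacing, here is a minimal Python
sketch that computes the gaps between the ECC events listed above. It
assumes the log keeps the pipe-separated layout shown here, with dates in
MM/DD/YYYY form; the function names are illustrative only and not part of
any IPMI tool.

```python
from datetime import datetime

# Event lines copied from the log excerpt above.
SEL_LINES = """\
 04/18/2008 | 20:26:26 | Memory #0x02 | Uncorrectable ECC | Asserted
 06/18/2008 | 12:00:16 | Memory #0x02 | Uncorrectable ECC | Asserted
 06/23/2008 | 11:58:34 | Memory #0x02 | Uncorrectable ECC | Asserted
 07/19/2008 | 22:41:12 | Memory #0x02 | Uncorrectable ECC | Asserted
 07/22/2008 | 17:18:00 | Memory #0x02 | Uncorrectable ECC | Asserted
 07/23/2008 | 22:08:15 | Memory #0x02 | Uncorrectable ECC | Asserted
 07/28/2008 | 17:52:23 | Memory #0x02 | Uncorrectable ECC | Asserted
""".splitlines()


def parse_sel_line(line):
    """Return (timestamp, sensor, event, state), or None if the line does not fit."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) != 5:
        return None
    date, time, sensor, event, state = fields
    stamp = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
    return stamp, sensor, event, state


def ecc_gaps_in_days(lines):
    """Days elapsed between consecutive 'Uncorrectable ECC' events."""
    records = (parse_sel_line(line) for line in lines)
    stamps = sorted(r[0] for r in records if r and r[2] == "Uncorrectable ECC")
    return [(b - a).total_seconds() / 86400 for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    for gap in ecc_gaps_in_days(SEL_LINES):
        print(f"{gap:.1f} days")
```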


07/28: first ECC error on the other Opteron-275 populated node.

 07/28/2008 | 18:54:23 | Memory #0x02 | Uncorrectable ECC | Asserted

All nodes have IB boards, and I swapped the boards between the first and
second crashing nodes (a few days after that swap, the second node crashed
for the very first time).

I have observed that within 2 minutes of each ECC error these events are
always logged:

06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S0/G0: working | Asserted
06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S5/G2: soft-off | Deasserted

(but they are also logged at other times)
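
The pairing of ACPI power-state events with ECC errors could be checked
across the whole log with something like the Python sketch below. It is
only an illustration under stated assumptions: the sample text holds just
the three 06/18 events quoted in this message (wrapped lines joined), the
2-minute window comes from my observation above, and the helper names are
hypothetical.

```python
from datetime import datetime, timedelta

# The 06/18 events quoted in this message, one event per line.
LOG = """\
06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S0/G0: working | Asserted
06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S5/G2: soft-off | Deasserted
06/18/2008 | 12:00:16 | Memory #0x02 | Uncorrectable ECC | Asserted
"""


def parse(line):
    """Split one pipe-separated line into (timestamp, sensor, event, state)."""
    date, time, sensor, event, state = (f.strip() for f in line.split("|"))
    stamp = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
    return stamp, sensor, event, state


def power_events_near_ecc(log_text, window=timedelta(minutes=2)):
    """Map each ECC timestamp to the ACPI power-state events within `window`."""
    # Keep only lines with exactly five pipe-separated fields.
    events = [parse(l) for l in log_text.splitlines() if l.count("|") == 4]
    ecc = [e for e in events if e[2] == "Uncorrectable ECC"]
    power = [e for e in events if e[1].startswith("System ACPI Power State")]
    return {e[0]: [p for p in power if abs(p[0] - e[0]) <= window] for e in ecc}


if __name__ == "__main__":
    for stamp, nearby in power_events_near_ecc(LOG).items():
        print(stamp, [p[2] for p in nearby])
```

An ECC timestamp that maps to an empty list would be a counterexample to
the "always" above; since the power-state events also appear at other
times, a match alone does not show that one causes the other.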

I am running Scientific Linux 5; the (LAM) MPI application uses almost
100% CPU and exchanges lots of small packets over IPoIB (I have not
used "native" IB yet). "Everything" is 64-bit (kernel, apps).

Any thoughts?

Best Regards,


paulo lopes


-- 
Paulo Afonso Lopes                        | Tel: +351- 21 294 8536
Departamento de Informática               | 294 8300 ext.10763
Faculdade de Ciências e Tecnologia        | Fax: +351- 21 294 8541
Universidade Nova de Lisboa               | e-mail: pal at di.fct.unl.pt
2829-516 Caparica, PORTUGAL




