[Beowulf] Seeing ECC errors since upgraded from Opteron 246 to 275


Paulo Afonso Lopes pal at di.fct.unl.pt
Fri Aug 1 08:40:42 PDT 2008


Dear all:

Around 2 April I removed 2 Opteron 246s and their "companion" 4x 512 MB
DIMMs from two HP DL145-G2s, leaving them empty, to populate two other HPs
(which now have 2 CPUs and 4 GB per node).

Then I installed 2 dual-core Opteron 275s in each emptied DL145-G2,
together with 4 sticks of 1 GB (2 sticks per CPU).

So I have 2 DL145-G2 nodes with 2 single-core 246 / 4GB each, and 2
DL145-G2 nodes with 2 dual-core 275 / 4GB each.

On 18 April, one of the dual-core nodes crashed with an ECC error. The
IPMI event log for that node shows:

 04/18/2008 | 20:26:26 | Memory #0x02 | Uncorrectable ECC | Asserted
 06/18/2008 | 12:00:16 | Memory #0x02 | Uncorrectable ECC | Asserted
 06/23/2008 | 11:58:34 | Memory #0x02 | Uncorrectable ECC | Asserted
 07/19/2008 | 22:41:12 | Memory #0x02 | Uncorrectable ECC | Asserted
 07/22/2008 | 17:18:00 | Memory #0x02 | Uncorrectable ECC | Asserted
 07/23/2008 | 22:08:15 | Memory #0x02 | Uncorrectable ECC | Asserted
 07/28/2008 | 17:52:23 | Memory #0x02 | Uncorrectable ECC | Asserted

On 07/19 the memory attached to CPU0 was replaced; on 07/27 the remaining
memory was replaced. The ECC crashes continue, at a rate between one per
day and one per week.


07/28: first ECC error on the other Opteron-275 populated node.

 07/28/2008 | 18:54:23 | Memory #0x02 | Uncorrectable ECC | Asserted

All nodes have IB boards, and I swapped the boards between the first and
second crashing nodes (the second node crashed for the very first time a
few days after that swap).

I have observed that, within 2 minutes of each ECC error, these events
are always logged:

 06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S0/G0: working | Asserted
 06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S5/G2: soft-off | Deasserted

(but they are also logged at other times)
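
One way to check this correlation across the whole event log, rather than by eye, is sketched below. It is only an illustration: it assumes the log has been saved as text in the pipe-separated layout shown above, and the names parse_sel_line and events_near_ecc are hypothetical helpers, not part of any IPMI tool. The sample lines are copied from the excerpts in this message.

```python
from datetime import datetime, timedelta

# Sample lines copied from the log excerpts quoted in this message.
SEL_LINES = """\
06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S0/G0: working | Asserted
06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S5/G2: soft-off | Deasserted
06/18/2008 | 12:00:16 | Memory #0x02 | Uncorrectable ECC | Asserted
06/23/2008 | 11:58:34 | Memory #0x02 | Uncorrectable ECC | Asserted
""".splitlines()


def parse_sel_line(line):
    """Split one 'date | time | sensor | event | state' line.

    Returns (timestamp, sensor, event, state), or None if the line
    does not match the assumed five-field layout.
    """
    fields = [f.strip() for f in line.split("|")]
    if len(fields) != 5:
        return None
    date, time, sensor, event, state = fields
    try:
        when = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return when, sensor, event, state


def events_near_ecc(lines, window=timedelta(minutes=2)):
    """Map each ECC event time to the non-ECC events within `window` of it."""
    events = [e for e in map(parse_sel_line, lines) if e is not None]
    ecc = [e for e in events if "Uncorrectable ECC" in e[2]]
    others = [e for e in events if "Uncorrectable ECC" not in e[2]]
    return {
        e[0]: [o for o in others if abs(o[0] - e[0]) <= window]
        for e in ecc
    }


if __name__ == "__main__":
    for when, nearby in events_near_ecc(SEL_LINES).items():
        print(when, "->", len(nearby), "nearby event(s)")
        for _, sensor, event, state in nearby:
            print("   ", sensor, "|", event, "|", state)
```

Running it on the complete log of each node would show whether every ECC event really has the ACPI power-state pair nearby, and how often that pair appears without an ECC event.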

I am running Scientific Linux 5. The MPI application (LAM) uses almost
100% CPU and exchanges lots of small packets over IPoIB (I have not used
"native" IB yet). "Everything" is 64-bit (kernel, apps).

Any thoughts?

Best Regards,


paulo lopes


-- 
Paulo Afonso Lopes                        | Tel: +351- 21 294 8536
Departamento de Informática               | 294 8300 ext.10763
Faculdade de Ciências e Tecnologia        | Fax: +351- 21 294 8541
Universidade Nova de Lisboa               | e-mail: pal at di.fct.unl.pt
2829-516 Caparica, PORTUGAL




