[Beowulf] Memory errors poll


Mark Hahn hahn at mcmaster.ca
Mon Mar 30 21:14:06 PDT 2009


>> we replace dimms which show > 1000 corrected ECCs per day
>> (or any overflows, for which counts are inaccurate, or any uncorrectable
>> errors.)
>
> These systems are a couple of generations old, right?

waaait a minute - I think I gave the wrong impression.  we have about
13 TB of this gen hardware (yes, from 3 years ago).  our observed rate
is that at a given moment, a fraction of 1% of the nodes have any ECCs at
all.  our vendor is happy to replace dimms that have a nontrivial rate,
and there aren't a lot of nodes that have had this done.

one interesting thing is that during a 3-year period, it seems like about 1%
of nodes developed higher ECC rates that disappeared when the dimms were
reseated.  I wonder whether this was the result of thermal cycling...

> I think I have Linux set up to record single-bit errors, and the rate

using edac?  I toyed with mcelog before that, but never really got much
traction until edac came with an updated kernel.
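
for anyone wanting to try the same rule of thumb, here is a minimal sketch in
Python, not our actual tooling.  it assumes the kernel's edac driver is loaded
and exposes per-controller ce_count and ue_count files under
/sys/devices/system/edac/mc/.  the function names, the two-sample rate
calculation and the "overflowed" flag are hypothetical, introduced only for
illustration; how you detect a counter overflow depends on your hardware.

```python
from pathlib import Path

# Threshold quoted earlier in the thread: > 1000 corrected errors per day.
CE_PER_DAY_LIMIT = 1000


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller_name: (ce_count, ue_count)} read from EDAC sysfs."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers without readable counters
        counts[mc.name] = (ce, ue)
    return counts


def needs_replacement(ce_before, ce_after, ue_after, elapsed_days,
                      overflowed=False):
    """Apply the rule of thumb to two samples of the corrected-error counter.

    overflowed is an assumed, caller-supplied flag: when a counter has
    overflowed its value is inaccurate, so the memory is flagged regardless.
    """
    if elapsed_days <= 0:
        raise ValueError("elapsed_days must be positive")
    if ue_after > 0 or overflowed:
        return True
    return (ce_after - ce_before) / elapsed_days > CE_PER_DAY_LIMIT
```

sampling read_edac_counts() once a day from cron and feeding consecutive
samples to needs_replacement() would be one way to use it.  note that the
counters reset on reboot, so a sample taken across a reboot needs separate
handling.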


