[Beowulf] Memory errors poll
Mark Hahn hahn at mcmaster.ca
Mon Mar 30 21:14:06 PDT 2009
>> we replace dimms which show > 1000 corrected ECCs per day
>> (or any overflows, for which counts are inaccurate, or any uncorrectable
>> errors.)
>
> These systems are a couple of generations old, right?

waaait a minute - I think I gave the wrong impression. we have about 13 TB of this gen hardware (yes, from 3 years ago). our observed rate is that at a given moment, a fraction of 1% of the nodes have any EC's at all. our vendor is happy to replace dimms that have a nontrivial rate, and there aren't a lot of nodes that have had this done.

one interesting thing is that during a 3-year period, seems like about 1% of nodes developed higher EC rates that disappeared when the dimms were reseated. I wonder whether this was the result of thermal cycling...

> I think I have Linux set up to record single-bit errors, and the rate

using edac? I toyed with mcelog before that, but never really got much traction until edac came with an updated kernel.
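for anyone who wants to apply the replacement rule quoted above (more than 1000 corrected errors per day, or any overflow, or any uncorrectable error), here is a minimal sketch, not what we actually run. assumptions: the edac driver is loaded and exposes the usual per-controller ce_count and ue_count files under /sys/devices/system/edac/mc; the function names and the overflowed flag are hypothetical, and how an overflow is detected depends on the hardware, so it is just passed in. the counts here are per memory controller, not per dimm.

```python
from pathlib import Path

CE_PER_DAY_LIMIT = 1000  # threshold quoted in the thread


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (ce_count, ue_count)} read from EDAC sysfs.

    Controllers whose counters cannot be read are skipped.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue
        counts[mc.name] = (ce, ue)
    return counts


def should_replace(ce_then, ce_now, hours, ue_count, overflowed=False):
    """Apply the rule from the thread to two corrected-error samples.

    ce_then, ce_now: corrected-error counts taken `hours` apart.
    ue_count:        uncorrectable errors seen so far.
    overflowed:      hypothetical flag; True if the hardware counter
                     overflowed, in which case the counts are inaccurate.
    """
    if ue_count > 0 or overflowed:
        return True
    if hours <= 0:
        return False  # no usable interval, so no rate can be computed
    ce_per_day = (ce_now - ce_then) * 24.0 / hours
    return ce_per_day > CE_PER_DAY_LIMIT
```

sample read_edac_counts() twice some hours apart and feed the two ce values to should_replace(). the counters reset on reboot or driver reload, so a negative difference should be treated as a fresh start rather than a rate.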
