[Beowulf] Memory errors poll
Mark Hahn hahn at mcmaster.ca
Mon Mar 30 21:12:43 PDT 2009
>>> Could those of you running ECC memory give me an updated figure on
>>> the number of errors detected/corrected per day per system?
>>
>> we replace dimms which show > 1000 corrected ECCs per day (or
>> any overflows, for which counts are inaccurate, or any
>> uncorrectable errors.)
>
> That seems a remarkably high rate, for the raw memory errors. Micron quotes
> something like 100 soft errors per 1E9 device hours. (That's a
> FIT: failure in time of 100.) 1000 per day seems high?

it doesn't worry me much, since it's low enough that there will be very few
double errors by coincidence, and almost certainly no measurable overhead.
(overhead of polling and logging CEs _is_ measurable on machines with bad
dimms, btw.)

these dimms have 16 chips. also, these are observed CEs, which includes
problems due to other dimms, sockets, the csrow bus and the (opteron) memory
controller. I'm also not claiming that there are a significant number of
dimms showing > 0 but < 1000 CEs/day.

> If I saw that rate, I'd assume that there's something seriously wrong with
> the part.

perhaps. one problem is that I don't have a good load-generator. when idle,
or loaded with light-footprint jobs, even nodes with a real problem can wind
up reporting few CEs. initially, my attempt at a load-generator was simply a
multithreaded stream-like thing that kept blasting bit-patterns into big
arrays. as far as I know, it's as likely to write bad ECC as read it, so you
have to alternate r/w cycles. but being sequential is probably less than
optimal (indeed, perhaps why memtest86 sometimes gives false negatives).

> I suspect that most "memory errors" reported for PCs (whether in clusters
> or not) are manifestations of bus timing problems, perhaps over temperature,
> rather than actual bit flips in memory. The actual measured rate of single
> event upsets is so low

sure. I'm just talking about observed events reported by ECC hardware.

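To put the two figures in this exchange side by side, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes that "device" in the quoted FIT figure means one DRAM chip, and it uses the 16 chips per dimm and the 1000 CEs/day threshold mentioned above. The function and variable names are made up for illustration.

```python
# Figures taken from the thread; treating "device" as one DRAM chip is an assumption.
FIT = 100                 # soft errors per 1e9 device-hours (quoted Micron figure)
CHIPS_PER_DIMM = 16       # "these dimms have 16 chips"
HOURS_PER_DAY = 24
THRESHOLD_PER_DAY = 1000  # corrected-error threshold used for replacing dimms


def expected_soft_errors_per_day(fit, devices):
    """Expected soft errors per day for `devices` chips at a given FIT rate."""
    return fit * devices * HOURS_PER_DAY / 1e9


per_dimm = expected_soft_errors_per_day(FIT, CHIPS_PER_DIMM)
ratio = THRESHOLD_PER_DAY / per_dimm

print(f"expected soft errors per dimm per day: {per_dimm:.3g}")
print(f"threshold is about {ratio:.3g} times that rate")
```

Under these assumptions the replacement threshold sits many orders of magnitude above the raw soft-error rate, which fits the point made in both directions: a dimm that reaches the threshold is reporting something other than random bit flips, and the counts are observed ECC events from the whole memory path, not per-chip upsets.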
interestingly, it's easy to imagine a scenario where the MC trains its dram parameters at one temperature, but winds up operating at another. and possibly operating poorly - things like skew are set by the bios and afaik never recalibrated.
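
The load-generator idea described earlier in this message (blast bit-patterns into big arrays, alternate write and read passes, and avoid purely sequential access) might be sketched roughly as follows. This is not the program used on the cluster: it is single-threaded rather than multithreaded, the block size, patterns and function names are all illustrative assumptions, and an interpreted loop like this will be far slower than a stream-like native kernel. CPU caches may also absorb much of the traffic unless the buffer is much larger than the last-level cache. Corrected errors would not show up as mismatches here; they would only appear in whatever counters the ECC hardware exposes.

```python
import random

BLOCK = 4096  # bytes per block; page-sized by assumption, purely illustrative


def make_order(n_blocks, seed=0):
    """Return a shuffled list of block indices so accesses are non-sequential."""
    order = list(range(n_blocks))
    random.Random(seed).shuffle(order)
    return order


def write_pass(buf, pattern, order):
    """Fill every block of buf with the given byte pattern, in shuffled order."""
    block = bytes([pattern]) * BLOCK
    for i in order:
        buf[i * BLOCK:(i + 1) * BLOCK] = block


def read_pass(buf, pattern, order):
    """Read every block back; return how many blocks do not match the pattern."""
    block = bytes([pattern]) * BLOCK
    bad = 0
    for i in order:
        if buf[i * BLOCK:(i + 1) * BLOCK] != block:
            bad += 1
    return bad


def stress(n_blocks, patterns=(0x00, 0xFF, 0x55, 0xAA), rounds=1, seed=0):
    """Alternate write and read passes over a buffer; return total mismatches."""
    buf = bytearray(n_blocks * BLOCK)
    mismatches = 0
    for r in range(rounds):
        order = make_order(n_blocks, seed + r)
        for p in patterns:
            write_pass(buf, p, order)
            mismatches += read_pass(buf, p, order)
    return mismatches


if __name__ == "__main__":
    # 16384 blocks of 4 KiB = 64 MiB; pick a size well beyond the cache in practice.
    print("uncorrected mismatches:", stress(16384))
```

A nonzero return value would indicate data that came back wrong despite ECC, i.e. an uncorrected error; the corrected-error counts discussed in this thread have to be read separately from the ECC reporting mechanism.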
