[Beowulf] ECC support on motherboards?
Jim Lux james.p.lux at jpl.nasa.gov
Mon May 12 19:40:47 PDT 2008
- Previous message: [Beowulf] ECC support on motherboards?
- Next message: [Beowulf] Re: ECC support on motherboards? (Perry E. Metzger)
Quoting Joe Landman <landman at scalableinformatics.com>, on Mon 12 May 2008 06:03:56 PM PDT:

> Perry E. Metzger wrote:
>> Joe Landman <landman at scalableinformatics.com> writes:
>>>> I've been reading spec sheets, and they often don't tell you, which is
>>>> rather annoying. Thus my question.
>>> I just randomly selected 2 motherboards from 2 different vendors and
>>> on both spec sheets, they clearly defined which memory they took.
>>
>> Oh, sheets will tell you that they *take* ECC memory, but long
>> experience says that the motherboards that actually properly do ECC
>> scrubbing are a subset -- some boards will accept the extra ECC bits
>> and do nothing with them! Generally speaking, the only reliable way

There's a difference between logic that does EDAC on read (and corrects any single bit errors), and having logic that actually writes back when an EDAC hit occurs. In most cases, the latter is under software control: that is, when you get the unlikely upset, you have some sort of interrupt routine that goes out and rewrites that general section of memory. The EDAC logic remembers what the location of the "error" was, so the ISR knows where to read/write.

It's not clear that the rewrite actually buys you much, IFF the error rate is low. That is, if you get one upset per day, you're probably safe just correcting on the read, and assuming that you'll not get a second upset *on the same word* before it changes to something else (which would rewrite the syndrome bits).

OTOH, if you have an upset rate in the sub-second range (pretty unlikely, I should think), maybe some sort of active scrubbing might be useful. The typical scrubber is tied to some sort of interrupt line or clock, and just reads/writes each location in turn. Obviously, this burns memory bus bandwidth.
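To put rough numbers on the "second upset on the same word" argument, here is a back-of-envelope sketch. It assumes upsets are independent and land uniformly across words, that no ordinary write refreshes a word in the meantime, and that the memory is 1 GiB protected in 64-bit data words; the memory size, the word width, and the function name are illustrative assumptions, not figures from this thread.

```python
from math import exp

def double_hit_probability(upsets_per_day, days, n_words):
    """Approximate chance that at least two upsets land in the same ECC word.

    Birthday-style estimate: k upsets spread uniformly over n_words give
    about k*(k-1)/(2*n_words) colliding pairs; treating that count as
    Poisson, P(at least one collision) = 1 - exp(-pairs).
    """
    k = upsets_per_day * days
    expected_pairs = max(0.0, k * (k - 1) / (2 * n_words))
    return 1.0 - exp(-expected_pairs)

# Assumed configuration: 1 GiB of data protected in 64-bit words.
N_WORDS = (1 << 30) * 8 // 64

# One upset per day, left uncorrected in memory for a whole year.
p_year = double_hit_probability(1.0, 365, N_WORDS)

# One upset per second, left uncorrected for a single day.
p_hot = double_hit_probability(86400.0, 1, N_WORDS)

print(f"1 upset/day for a year:  {p_year:.2e}")
print(f"1 upset/s for one day:   {p_hot:.3f}")
```

Under these assumptions the one-per-day case gives a double hit in some word with probability of roughly 0.05% over a year even with no write-back at all, while the one-per-second case makes a double hit within a day nearly certain, which is the regime where active scrubbing starts to earn its bus bandwidth.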
There are also systems that have autoscrubbing, where the memory contents are expected to be constant over a long time, so it gets continuously rewritten (or, at least, a checksum is done, and if it fails, it rewrites from a known good copy). This is a typical scheme for SRAM based FPGAs (Xilinx) in spaceflight applications.

Just how many bit errors do people see? Even in old, very soft DRAM technology I worked with in the early 80s, we'd get maybe 1 legitimate SBE per week in a megabyte or so of DRAM. That's with HUGE transistor sizes and very, very soft parts. We saw more errors than that during prototyping, but those were manifestations of bus timing, or timing conflicts that trashed bits.

Here's a presentation on SDRAMs in spaceflight applications:
http://klabs.org/richcontent/MAPLDCon02/presentations/session_p/p7_ladbury_s.ppt

There's some data on an array of 12 256 Mbit SDRAMs EDACed with upset rates in geosynchronous orbit of 4E-17 per day, 99th percentile. That's with a raw upset rate of about 1/2 bit/day overall:
http://www.maxwell.com/microelectronics/support/presentations/ESCCON_2002.pdf

>
> Heh... you want motherboards that work? Different question :(
>
>> I've found to determine if the ECC stuff works is to look at the BIOS
>> ECC settings, but often that info seems to be missing from the
>> manuals.
>
> Sadly, the bios ECC settings on a number of MB's appear to be busted in
> some cases ... well, ok, the bios setup of the ECC system appears to
> be busted. At least most will signal an MCE these
>

There are, also, nefarious memory parts that emulate the ECC bits. Which, of course, does absolutely no good.

>>
>> Anyway, I was asking for a reason. :)
>
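The checksum-based autoscrubbing described above (checksum the contents, and on a mismatch rewrite from a known good copy) can be sketched in a few lines. This is only an illustration in ordinary Python over a byte array: the 64-byte frame size, the use of CRC-32, and the function names are assumptions made for the example, and it does not model how any particular FPGA reads back or rewrites its configuration memory.

```python
import zlib

FRAME = 64  # bytes per frame; an arbitrary size chosen for illustration

def frame_checksums(image, frame=FRAME):
    """CRC-32 of each fixed-size frame of a constant memory image."""
    return [zlib.crc32(image[i:i + frame]) for i in range(0, len(image), frame)]

def scrub(live, golden, golden_sums, frame=FRAME):
    """Rewrite every frame of `live` whose CRC differs from the golden CRC.

    Returns the indices of the frames that were repaired.
    """
    repaired = []
    for idx, expected in enumerate(golden_sums):
        start = idx * frame
        if zlib.crc32(live[start:start + frame]) != expected:
            live[start:start + frame] = golden[start:start + frame]
            repaired.append(idx)
    return repaired

golden = bytes(range(256)) * 4        # stand-in for a constant configuration image
golden_sums = frame_checksums(golden)

live = bytearray(golden)              # the copy that is exposed to upsets
live[130] ^= 0x08                     # inject a single-bit upset in frame 2

repaired = scrub(live, golden, golden_sums)
print("repaired frames:", repaired)
```

A real scrubber would run this pass periodically off a timer or interrupt, and the choice of frame size trades checksum storage against how much has to be rewritten per detected upset.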
