[Beowulf] ECC support on motherboards?
james.p.lux at jpl.nasa.gov
Mon May 12 19:40:47 PDT 2008
Quoting Joe Landman <landman at scalableinformatics.com>, on Mon 12 May
2008 06:03:56 PM PDT:
> Perry E. Metzger wrote:
>> Joe Landman <landman at scalableinformatics.com> writes:
>>>> I've been reading spec sheets, and they often don't tell you, which is
>>>> rather annoying. Thus my question.
>>> I just randomly selected 2 motherboards from 2 different vendors and
>>> on both spec sheets, they clearly defined which memory they took.
>> Oh, sheets will tell you that they *take* ECC memory, but long
>> experience says that the motherboards that actually properly do ECC
>> scrubbing are a subset -- some boards will accept the extra ECC bits
>> and do nothing with them! Generally speaking, the only reliable way
There's a difference between logic that does EDAC on read (and
corrects any single bit errors), and having logic that actually writes
back when an EDAC hit occurs. In most cases, the latter is under
software control: that is, when you get the unlikely upset, you have
some sort of interrupt routine that goes out and rewrites that general
section of memory. The EDAC logic remembers what the location of the
"error" was, so the ISR knows where to read/write.
It's not clear that the rewrite actually buys you much, IFF the error
rate is low. That is, if you get one upset per day, you're probably
safe just correcting on the read, and assuming that you'll not get a
second upset *on the same word* before it changes to something else
(which would rewrite the syndrome bits). OTOH, if you have an upset
rate in the sub-second range (pretty unlikely, I should think), maybe
some sort of active scrubbing might be useful.
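To make the correct-on-read versus write-back distinction concrete, here is a minimal toy sketch. It assumes a Hamming(7,4) code on 4-bit words, which is far smaller than the SECDED codes real memory controllers use, and every name in it (EdacMemory, service_errors, and so on) is hypothetical rather than any real controller's interface.

```python
# Toy model only: Hamming(7,4) single-error correction on 4-bit words.
# Real controllers use wider codes; all names here are made up for illustration.

def encode(nibble):
    """Return a 7-bit codeword (list of bits for Hamming positions 1..7)."""
    c = [0] * 8  # index 0 unused so indices match Hamming positions
    c[3], c[5], c[6], c[7] = [(nibble >> i) & 1 for i in range(4)]
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    return c[1:]

def decode(word):
    """Return (corrected 4-bit value, syndrome); the stored word is not modified."""
    c = [0] + list(word)
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    syndrome = s1 + 2 * s2 + 4 * s4  # position of a single flipped bit, or 0
    if syndrome:
        c[syndrome] ^= 1
    value = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return value, syndrome

class EdacMemory:
    def __init__(self, size):
        self.cells = [encode(0) for _ in range(size)]
        self.error_log = []  # addresses where a read needed correction

    def write(self, addr, nibble):
        self.cells[addr] = encode(nibble)

    def read(self, addr):
        # Correct on read: the caller gets good data, the cell stays upset.
        value, syndrome = decode(self.cells[addr])
        if syndrome:
            self.error_log.append(addr)
        return value

    def flip(self, addr, position):
        """Simulate an upset of one bit (Hamming position 1..7)."""
        self.cells[addr][position - 1] ^= 1

    def service_errors(self):
        """Stand-in for the ISR: rewrite every logged location."""
        while self.error_log:
            addr = self.error_log.pop()
            value, _ = decode(self.cells[addr])
            self.write(addr, value)

mem = EdacMemory(4)
mem.write(2, 0b1011)
mem.flip(2, 5)
first_read = mem.read(2)                         # corrected on the fly
still_upset = mem.cells[2] != encode(0b1011)     # cell itself is still wrong
mem.service_errors()                             # write-back repairs the cell
repaired = mem.cells[2] == encode(0b1011)
```

In this toy, a second flip in the same word before the write-back makes the decoder miscorrect, which is the case the rewrite is meant to guard against.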
The typical scrubber is tied to some sort of interrupt line or clock,
and just reads/writes each location in turn. Obviously, this burns
memory bus bandwidth.
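A rough back-of-envelope sketch of that tradeoff follows. It assumes upsets are independent and spread uniformly over all words, and apart from the one-upset-per-day figure every number in it (memory size, scrub period, bus bandwidth) is an illustrative assumption, not a measurement.

```python
import math

def second_hit_probability(upsets_per_day, words, exposure_days):
    """Chance that an already-upset word takes another hit before it is rewritten.

    Assumes independent upsets spread uniformly over all words.
    """
    per_word_rate = upsets_per_day / words
    return -math.expm1(-per_word_rate * exposure_days)

def scrub_bus_fraction(mem_bytes, scrub_period_s, bus_bytes_per_s):
    """Fraction of bus bandwidth spent reading and rewriting all memory once per period."""
    return (2 * mem_bytes / scrub_period_s) / bus_bytes_per_s

# Illustrative numbers only: 1 GiB of 64-bit words, one upset per day overall.
WORDS = 2**27
p_no_scrub = second_hit_probability(1.0, WORDS, exposure_days=1.0)
p_hourly = second_hit_probability(1.0, WORDS, exposure_days=1.0 / 24)
cost_hourly = scrub_bus_fraction(2**30, 3600.0, 6.4e9)  # assumed 6.4 GB/s bus
print(p_no_scrub, p_hourly, cost_hourly)
```

Under these assumptions both the double-hit probability and the scrub cost come out tiny, which is consistent with the point that scrubbing only starts to matter at much higher upset rates.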
There are also systems that have autoscrubbing, where the memory
contents are expected to be constant over a long time, so it gets
continuously rewritten (or, at least, a checksum is done, and if it
fails, it rewrites from a known good copy). This is a typical scheme
for SRAM based FPGAs (Xilinx) in spaceflight applications.
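The checksum-and-restore variant can be sketched as below. This is only an illustration of the scheme as described, using CRC-32 from Python's zlib as a stand-in checksum; it is not how any particular FPGA vendor's configuration scrubber works, and scrub_pass is a hypothetical name.

```python
import zlib

def scrub_pass(live: bytearray, golden: bytes) -> bool:
    """Compare the live image against the known good copy by checksum.

    Returns True if a mismatch was found and the live image was rewritten.
    """
    if zlib.crc32(live) == zlib.crc32(golden):
        return False
    live[:] = golden  # restore in place from the known good copy
    return True

golden = bytes(range(256))        # stand-in for the stored good image
live = bytearray(golden)          # the copy that is exposed to upsets
clean = scrub_pass(live, golden)  # nothing to do yet
live[10] ^= 0x04                  # simulate a single-bit upset
restored = scrub_pass(live, golden)
```

In a real system the golden checksum would normally be computed once and stored, rather than recomputed on every pass.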
Just how many bit errors do people see? Even in old, very soft DRAM
technology I worked with in the early 80s, we'd get maybe 1 legitimate
SBE per week in a megabyte or so of DRAM. That's with HUGE transistor
sizes and very, very soft parts. We saw more errors than that during
prototyping, but those were manifestations of bus timing problems or
timing conflicts that trashed bits.
Here's a presentation on SDRAMs in spaceflight applications
There's some data on an array of twelve 256 Mbit SDRAMs with EDAC,
showing upset rates in geosynchronous orbit of 4E-17 per day, 99th
percentile. That's with a raw upset rate of about 1/2 bit/day overall.
> Heh... you want motherboards that work? Different question :(
>> I've found to determine if the ECC stuff works is to look at the BIOS
>> ECC settings, but often that info seems to be missing from the
> Sadly, the bios ECC settings on a number of MB's appear to be busted in
> some cases ... well, ok, the bios setup of the ECC system appears to
> be busted. At least most will signal an MCE these
There are, also, nefarious memory parts that emulate the ECC bits.
Which, of course, does absolutely no good.
>> Anyway, I was asking for a reason. :)