[Beowulf] ECC support on motherboards?

Mon May 12 19:40:47 PDT 2008

Quoting Joe Landman <landman at scalableinformatics.com>, on Mon 12 May  
2008 06:03:56 PM PDT:

> Perry E. Metzger wrote:
>> Joe Landman <landman at scalableinformatics.com> writes:
>>>> I've been reading spec sheets, and they often don't tell you, which is
>>>> rather annoying. Thus my question.
>>> I just randomly selected 2 motherboards from 2 different vendors and
>>> on both spec sheets, they clearly defined which memory they took.
>>
>> Oh, sheets will tell you that they *take* ECC memory, but long
>> experience says that the motherboards that actually properly do ECC
>> scrubbing are a subset -- some boards will accept the extra ECC bits
>> and do nothing with them! Generally speaking, the only reliable way

There's a difference between logic that does EDAC on read (and
corrects any single-bit errors) and logic that actually writes the
corrected word back when an EDAC hit occurs. In most cases, the latter
is under software control: that is, when you get the unlikely upset,
you have some sort of interrupt routine that goes out and rewrites that
general section of memory. The EDAC logic remembers the location of the
"error", so the ISR knows where to read/write.
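
To make the distinction concrete, here's a rough sketch in Python. It
uses a toy Hamming(7,4) code as a stand-in for the wider codes on real
DIMMs, and the function names (encode, syndrome, extract, read) are made
up for illustration; it is not how any particular memory controller
works.

```python
# Toy Hamming(7,4) code; codeword bit positions are numbered 1..7.

def encode(nibble):
    """Encode 4 data bits into a 7-bit codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 unused
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return sum(code[p] << (p - 1) for p in range(1, 8))

def syndrome(word):
    """Return 0 for a clean word, else the position (1..7) of the flipped bit."""
    s = 0
    for p in range(1, 8):
        if (word >> (p - 1)) & 1:
            s ^= p
    return s

def extract(word):
    """Pull the 4 data bits back out of positions 3, 5, 6 and 7."""
    return (((word >> 2) & 1)
            | (((word >> 4) & 1) << 1)
            | (((word >> 5) & 1) << 2)
            | (((word >> 6) & 1) << 3))

def read(memory, addr, write_back=False):
    """Correct on read; only repair the stored word if write_back is set."""
    word = memory[addr]
    s = syndrome(word)
    if s:
        word ^= 1 << (s - 1)            # fix the copy handed to the reader
        if write_back:
            memory[addr] = word         # the "ISR rewrites memory" step
    return extract(word), s
```

With write_back=False the reader gets good data but the flipped bit
stays in memory, so a second flip in the same word would no longer be
correctable; with write_back=True the stored word is clean again.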

It's not clear that the rewrite actually buys you much if the error
rate is low. That is, if you get one upset per day, you're probably
safe just correcting on the read, and assuming that you'll not get a
second upset *on the same word* before it is overwritten with something
else (which would regenerate the check bits). OTOH, if you have an
upset interval in the sub-second range (pretty unlikely, I should
think), maybe some sort of active scrubbing might be useful.
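
A back-of-the-envelope version of that argument, as a Python sketch.
The assumptions are mine: upsets arrive as a Poisson process, land
uniformly across the words, and a word sits untouched for
residency_days before something rewrites it. The function name and
parameters are hypothetical, and any numbers you feed it are
illustrative, not measurements.

```python
import math

def double_upset_rate(upsets_per_day, n_words, residency_days):
    """Expected uncorrectable (second-hit-in-same-word) events per day.

    Each upset opens a window of residency_days during which another
    upset in the same word defeats single-bit correction.
    """
    per_word_rate = upsets_per_day / n_words
    # P(at least one more hit on that word in the window)
    p_second = -math.expm1(-per_word_rate * residency_days)
    return upsets_per_day * p_second
```

For example, double_upset_rate(1.0, 2**27, 1.0) models one upset per
day spread over 1 GiB of 64-bit words, each left alone for a day. The
result is many orders of magnitude below the raw upset rate, which is
why correct-on-read alone is usually enough at low error rates.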

The typical scrubber is tied to some sort of interrupt line or clock,  
and just reads/writes each location in turn.  Obviously, this burns  
memory bus bandwidth.
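
How much bandwidth is easy to estimate. A minimal Python sketch,
assuming the scrubber does one read and one write per location and
makes a full pass every period_s seconds; the function name and
parameters are hypothetical.

```python
def scrub_bandwidth_fraction(mem_bytes, period_s, bus_bytes_per_s):
    """Fraction of memory bus bandwidth used by a read+write scrub pass."""
    if period_s <= 0 or bus_bytes_per_s <= 0:
        raise ValueError("period and bus bandwidth must be positive")
    scrub_bytes_per_s = 2 * mem_bytes / period_s   # one read plus one write
    return scrub_bytes_per_s / bus_bytes_per_s
```

For instance, scrub_bandwidth_fraction(8 * 2**30, 86400, 10e9) gives
the cost of scrubbing 8 GiB once a day on a 10 GB/s bus (made-up
numbers). The cost scales inversely with the period, so it only starts
to hurt if you scrub very frequently.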

There are also systems that have autoscrubbing, where the memory
contents are expected to be constant over a long time, so they get
continuously rewritten (or, at least, a checksum is done, and if it
fails, the contents are rewritten from a known good copy). This is a
typical scheme for SRAM-based FPGAs (Xilinx) in spaceflight
applications.
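
The checksum-and-reload variant can be sketched in a few lines of
Python. This only illustrates the idea, using CRC-32 from the standard
zlib module and a bytearray standing in for the configuration memory;
it is not how any actual FPGA configuration scrubber is implemented,
and scrub_from_golden is a made-up name.

```python
import zlib

def scrub_from_golden(live, golden, golden_crc):
    """Reload `live` from `golden` if its checksum no longer matches.

    live:       bytearray holding the (possibly upset) contents
    golden:     bytes holding the known good copy
    golden_crc: CRC-32 of the golden copy, computed once up front
    Returns True if a repair was made.
    """
    if zlib.crc32(live) == golden_crc:
        return False            # contents still match, nothing to do
    live[:] = golden            # rewrite everything from the good copy
    return True
```

You'd call this periodically from a timer. The known good copy has to
live somewhere less upset-prone than the memory being scrubbed,
otherwise you've just moved the problem.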

Just how many bit errors do people see? Even in the old, very soft DRAM
technology I worked with in the early 80s, we'd get maybe 1 legitimate
SBE per week in a megabyte or so of DRAM. That's with HUGE transistor
sizes and very, very soft parts. We saw more errors than that during
prototyping, but those were manifestations of bus timing problems or
timing conflicts that trashed bits.

Here's a presentation on SDRAMs in spaceflight applications:
http://klabs.org/richcontent/MAPLDCon02/presentations/session_p/p7_ladbury_s.ppt

There's some data on an array of twelve 256 Mbit SDRAMs with EDAC,
showing upset rates in geosynchronous orbit of 4E-17 per day, 99th
percentile. That's with a raw upset rate of about 1/2 bit/day overall:

http://www.maxwell.com/microelectronics/support/presentations/ESCCON_2002.pdf

>
> Heh... you want motherboards that work?  Different question :(
>
>> I've found to determine if the ECC stuff works is to look at the BIOS
>> ECC settings, but often that info seems to be missing from the
>> manuals.
>
> Sadly, the bios ECC settings on a number of MB's appear to be busted in
> some cases ...  well, ok, the bios setup of the ECC system appears to
> be busted.  At least most will signal an MCE these
>

There are also nefarious memory parts that merely emulate the ECC bits,
which, of course, does absolutely no good.
>>
>> Anyway, I was asking for a reason. :)
>