[Beowulf] ECC Memory and Job Failures

Thu Apr 23 15:35:37 PDT 2009

On Thu, Apr 23, 2009 at 04:45:08PM +0100, Huw Lynes wrote:
> 
> Thought this might be of interest to others:
> 
> http://blog.revolution-computing.com/2009/04/blame-it-on-cosmic-rays.html
> 
> Apparently someone ran a large cluster job with both ECC and non-ECC
> RAM. They consistently got the wrong answer when foregoing ECC.
> 
> I'd love to see the original data.

Not unexpected, and yes... data please.

What if disabling ECC changes the data path timing and
uncovers a hardware race condition that is unrelated
to cosmic ray bit flipping?  Test with STREAM benchmarks,
etc.  If cosmic rays are the cause, bit error rates should
track altitude.

While cosmic ray bit flipping is real, it is only one data integrity
issue to cope with in system design.

Does disabling ECC enable some other form of error detection
like parity, or is the RAM running bare?  Does the ECC hardware
log errors even in disabled mode (it might)?  In some cases
disabling ECC causes the RAM to be accessed faster, causing more
heat, causing timing changes...

Years ago SGI ran into this when the cache line coherency model changed
on one desktop box.   While today's RAM technology is very different, it
is interesting to note that back then a parity error might be expected
about once in 22 days on a 96MB RAM system on those old boxes (as best I
can recall).  The memory design made it very easy to count the errors
and very hard to not count them.  The last part is important, btw.

The vast majority were seen only by the kernel in "bzero(), bcopy()" where
they could be safely dealt with once the issue was understood.  Other recovery
tricks dealt with more, but not all, errors... some applications would be killed
when recovery was impossible.   To my knowledge that was the last system SGI
designed from scratch that had only parity error detection on main memory.

I suspect the same number (one flipped bit in 22 days) could be used as
an initial assumption for any block of 6 or 8 DIMMs, as the cross section of
the "detector" is about the same (i.e. square mm of Si).  I suspect that good
data is also very HARD to come by.
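
To put rough numbers on that starting assumption, here is a minimal
sketch of the arithmetic.  It assumes flips are independent and arrive at
a constant rate (a Poisson model, which is my simplification), and it
takes the recalled "one flip in 22 days per block of DIMMs" at face
value.  The node counts and the one-block-per-node layout are made up
for illustration.

```python
import math

# Assumption: about one flipped bit per 22 days for one block of 6 or 8
# DIMMs, as recalled above.  A rough starting point, not measured data.
DAYS_PER_FLIP_PER_BLOCK = 22.0


def expected_flips(blocks, job_days, days_per_flip=DAYS_PER_FLIP_PER_BLOCK):
    """Expected number of bit flips across `blocks` DIMM blocks during a job."""
    return blocks * job_days / days_per_flip


def prob_at_least_one_flip(blocks, job_days,
                           days_per_flip=DAYS_PER_FLIP_PER_BLOCK):
    """Poisson model: P(at least one flip) = 1 - exp(-expected flips)."""
    return 1.0 - math.exp(-expected_flips(blocks, job_days, days_per_flip))


if __name__ == "__main__":
    # Hypothetical cluster sizes, one DIMM block per node, a one-day job.
    for nodes in (1, 22, 256, 1024):
        print(nodes,
              round(expected_flips(nodes, 1.0), 2),
              round(prob_at_least_one_flip(nodes, 1.0), 3))
```

Under these assumptions the expected count scales linearly with both
node count and run time, so a 22-node, one-day job expects about one
flip somewhere in the machine.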

IMO, running on a large cluster without ECC that detects multiple-bit
errors and corrects at least single-bit errors is silly.
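
For anyone who has not looked at what "correct one bit, detect two"
means in practice, here is a toy sketch using an extended Hamming (8,4)
code.  It is an illustration only: real memory controllers use wider
codes (commonly 64 data bits plus 8 check bits) in hardware, and the
function names here are my own.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword.

    Positions 1, 2 and 4 hold Hamming parity bits, positions 3, 5, 6 and 7
    hold data, and position 0 holds an overall parity bit.
    """
    bits = [0] * 8
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> i) & 1 for i in range(4)]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = sum(bits[1:]) & 1  # makes the parity of all 8 bits even
    return bits


def decode(bits):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    overall = sum(bits) & 1
    if overall:
        # Odd overall parity: assume one flipped bit.  The syndrome names
        # its position; syndrome 0 means the overall parity bit itself.
        bits[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        # Even overall parity but nonzero syndrome: two bits flipped.
        return None, "uncorrectable"
    else:
        status = "ok"
    data = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return data, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1           # one flipped bit: silently repaired
    print(decode(word))
    word[2] ^= 1           # a second flipped bit: detected, not repaired
    print(decode(word))
```

With parity alone the first case would already kill the job, and with
no protection at all both cases would pass unnoticed.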

Further, running without watching the ECC logs is also silly.  Watching the
logs can be hard to do.   ECC codes for wide cache lines today are very
good, and a bad component may go undetected for some time.  Some memory
controllers will correct single bit errors without inserting a delay
or posting a machine check exception.  Of interest: a hardware trainer
at SGI was mystified when he cut the leg on a memory chip and it did not
produce the error that he expected on an Origin.
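
As one way to watch the ECC logs, here is a minimal sketch that reads
the corrected and uncorrected error counters the Linux EDAC subsystem
exposes under sysfs.  It assumes a Linux node with an EDAC driver loaded
for its memory controller; other platforms report these errors through
different mechanisms, and the function name is my own.

```python
import glob
import os

# Linux EDAC sysfs location; only present when an EDAC driver is loaded.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce": n, "ue": n}} for each mc* directory found.

    "ce" is the corrected error count, "ue" the uncorrected error count.
    A value of None means the counter file was missing or unreadable.
    """
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                with open(os.path.join(mc_dir, fname)) as f:
                    entry[key] = int(f.read().strip())
            except (OSError, ValueError):
                entry[key] = None
        counts[os.path.basename(mc_dir)] = entry
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found (driver not loaded?)")
    for mc, c in counts.items():
        print(f"{mc}: corrected={c['ce']} uncorrected={c['ue']}")
```

Run from cron on every node and compared against the previous reading,
a rising corrected count flags a DIMM worth replacing before it starts
producing errors that cannot be corrected.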

DMA data paths, cache, and even paths internal to the processor and IO should be protected.

When I first heard about this, 64k DRAM was the new thing (c. 1984, perhaps
sooner) and IBM, with the IBM PC, was in the middle of it.  Then it was cosmic
rays; today, Google search for neutrons and flipped bits.   There was one
distraction associated with uranium contaminated ceramic packages back then too.

My guess is S(*t happens, bits flip.

Later,
mitch

-- 
	T o m  M i t c h e l l 
	Found me a new hat, now what?