[Beowulf] Curious about ECC vs non-ECC in practice

Tony Travis a.travis at abdn.ac.uk
Fri May 20 08:35:45 PDT 2011


On 20/05/11 05:35, Joe Landman wrote:
> Hi folks
>
>     Does anyone run a large-ish cluster without ECC ram?  Or with ECC
> turned off at the motherboard level?  I am curious if there are numbers
> of these, and what issues people encounter.  I have some of my own data
> from smaller collections of systems, I am wondering about this for
> larger systems.

Hi, Joe.

I ran a small cluster of ~100 32-bit nodes with non-ECC memory and it 
was a nightmare, as Guy described in his email, until I pre-emptively 
tested the memory in user space using Charles Cazabon's "memtester":

   http://pyropus.ca/software/memtester

Prior to this, *all* the RAM had passed Memtest86+.

I had a strict policy that if a system crashed, for any reason, it was 
re-tested with Memtest86+, then 100 passes of "memtester" before being 
allowed to re-join the Beowulf cluster. This made the Beowulf much more 
stable running openMosix. However, I've scrapped all our non-ECC nodes 
now because the real worry is not knowing if an error has occurred...
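
For what it's worth, the re-test step is easy to script. Here is only 
a sketch of the kind of wrapper I mean (the memory size and messages 
are made up; adjust per node): "memtester" takes the amount of RAM to 
lock and a loop count, and exits non-zero if any test fails.

#!/usr/bin/env python
# Sketch: node re-qualification wrapper around memtester.
import subprocess
import sys

MEM = "1024M"   # amount of RAM to lock and test (hypothetical value)
LOOPS = "100"   # 100 passes, as per the policy above

# memtester <memory>[B|K|M|G] [loops]; non-zero exit on any failure
rc = subprocess.call(["memtester", MEM, LOOPS])
if rc != 0:
    sys.stderr.write("memtester FAILED: keep this node out of the cluster\n")
    sys.exit(1)
print("memtester passed %s loops: OK to re-join the cluster" % LOOPS)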

Apparently this is still a big issue for computers in space, which use 
non-ECC RAM as solid-state storage for imaging on grounds of cost. 
They run background SoftECC 'scrubbers' in software, like this:

http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
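
The detection side of a software scrubber is simple enough to sketch: 
keep a checksum per block of data held in RAM, re-read the blocks 
periodically in the background, and flag any block whose checksum no 
longer matches. A toy illustration of the idea, not the paper's 
implementation (a real SoftECC scrubber stores an error-correcting 
code so flips can be repaired, not just detected):

import zlib

def make_checksums(blocks):
    # One CRC32 per block, computed when the data is written.
    return [zlib.crc32(block) for block in blocks]

def scrub_once(blocks, checksums):
    # One scrub pass: re-read every block and report mismatches.
    errors = 0
    for i, block in enumerate(blocks):
        if zlib.crc32(block) != checksums[i]:
            print("bit flip detected in block %d" % i)
            errors += 1
    return errors

if __name__ == "__main__":
    blocks = [b"\x00" * 4096 for _ in range(1024)]  # data held in RAM
    sums = make_checksums(blocks)
    print("%d errors in %d blocks" % (scrub_once(blocks, sums), len(blocks)))

In practice you would run the scrub pass from a low-priority 
background thread or cron job, so it steals little time from real work.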

Bye,

   Tony.
-- 
Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition
and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK
tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk
mailto:a.travis at abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk


