Memory type? (ECC vs non-ECC)

Thomas R Boehme mail at thomas-boehme.de
Fri Aug 17 15:23:00 PDT 2001


I would recommend ECC. Even if you can afford a machine crashing, what if
it doesn't crash and just corrupts some of your data? How do you know that
the simulations you ran actually ran correctly? Can you afford to get the
wrong results?

This question actually has a much bigger scope than just ECC or not. We
recently got a couple of machines that experienced sporadic problems. Some
of the problems crashed the machines; others didn't and just perturbed the
data from our Molecular Dynamics simulations. We think it was heat related,
but the big question is whether we would have found out at all if the
machines hadn't eventually crashed.
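
One cheap way to catch that kind of silent corruption, assuming your code
is bit-for-bit deterministic (many parallel MD codes are not, so treat this
as a sketch only), is to run the same job twice, ideally on different
machines, and compare checksums of the output. Everything below is
hypothetical illustration: toy_simulation() is a stand-in for a real run,
and flip_bit() just imitates a single-bit memory error.

```python
import hashlib
import struct


def digest(values):
    """SHA-256 over the exact IEEE-754 bytes of a sequence of floats."""
    h = hashlib.sha256()
    for v in values:
        h.update(struct.pack("<d", v))
    return h.hexdigest()


def toy_simulation(steps, seed=0.5):
    """Hypothetical stand-in for a deterministic simulation (logistic map)."""
    x = seed
    trajectory = []
    for _ in range(steps):
        x = 3.9 * x * (1.0 - x)
        trajectory.append(x)
    return trajectory


def flip_bit(value, bit):
    """Imitate a single-bit memory error in one double."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


run_a = toy_simulation(1000)
run_b = toy_simulation(1000)
print("clean runs agree:", digest(run_a) == digest(run_b))

# Corrupt one value in a copy of the first run and compare again.
corrupted = list(run_a)
corrupted[500] = flip_bit(corrupted[500], 0)
print("corrupted run agrees:", digest(corrupted) == digest(run_a))
```

This doubles the compute cost, so it is a spot check rather than a
replacement for ECC.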

Bye, Thomas

-----Original Message-----
From: Jared Hodge [mailto:jared_hodge at iat.utexas.edu] 
Sent: Friday, August 17, 2001 2:25 PM
To: Dan Kirkpatrick
Cc: beowulf at beowulf.org
Subject: Re: Memory type? (ECC vs non-ECC)

Dan,
	Ok, first with what ECC is.  Error-Correcting Code: the memory
stores extra check bits so that it can detect and correct bit errors.  How
will this affect performance?  As far as speed, they run about the same
(ECC may even be a little slower).  The issue is reliability.  We had a few
rounds of E-mails on how often errors occur in non-ECC memory chips a
few months ago (and it's affected by climate, altitude, EMI radiation,
solar flares, someone breaking wind near the machines, bla, bla, bla). 
Anyway, I don't think you want us to launch that conversation again, but
the thing is that with a single machine, non-ECC is typically fine
(except for mission critical servers, etc.) since the time between
errors is so great.  The problem is that with a cluster, you have so
many memory chips that the time between failures (of any one of them) is
significantly less.  I guess the question is how big is the cluster and
how much do you lose if you have to restart?  We've got an 8 node
cluster, 4 GB RAM total that seems to work fine without ECC.  We're
getting a larger cluster with 24 GB RAM total and going with ECC.  The
larger the cluster, the more you need ECC.  Also, if you're running
problems that take many days to complete, go with ECC.  If you're
running checkpoints, or individual problems only take a few hours, you
can go with non-ECC.  Hope this helps.
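
To put rough numbers on the scaling argument above: if memory errors are
independent and arrive at a constant rate per GB, the mean time between
errors (MTBE) for the whole cluster is the per-GB figure divided by the
total memory. The sketch below uses the 4 GB and 24 GB totals mentioned
above, but the per-GB error figure is a made-up assumption (real rates vary
widely), and the function names are only illustrative.

```python
import math


def cluster_mtbe_hours(mtbe_per_gb_hours, total_gb):
    """Cluster-wide mean time between errors, assuming independent errors."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    return mtbe_per_gb_hours / total_gb


def prob_error_during_run(mtbe_per_gb_hours, total_gb, run_hours):
    """Chance of at least one error during a run (exponential model)."""
    rate_per_hour = total_gb / mtbe_per_gb_hours
    return 1.0 - math.exp(-rate_per_hour * run_hours)


# ASSUMPTION: hypothetical figure for illustration, not a measured rate.
ASSUMED_MTBE_PER_GB_HOURS = 50_000.0

for total_gb in (4, 24):
    mtbe = cluster_mtbe_hours(ASSUMED_MTBE_PER_GB_HOURS, total_gb)
    for run_hours in (3, 72):
        p = prob_error_during_run(ASSUMED_MTBE_PER_GB_HOURS, total_gb, run_hours)
        print(f"{total_gb:2d} GB, {run_hours:2d} h run: "
              f"MTBE {mtbe:8.0f} h, P(error) = {p:.4f}")
```

Whatever the real per-GB rate is, the ratio is what matters: six times the
memory means six times the error rate, and a multi-day run is exposed for
far longer than a few-hour one. With checkpoints, the work lost to a crash
is bounded by the checkpoint interval rather than the full run time.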

Jared

Dan Kirkpatrick wrote:
> 
> We're finalizing our specs for our next beowulf cluster... and I had a
> question...
> 
> ECC or non-ECC memory?  Motherboard "supports" ECC memory mode... although
> non-ecc memory is cheaper so we can get more...
> How does this realistically affect performance?
> 
> comments?
> Thanks
> 
> =======================================================
> Dan Kirkpatrick                   dkirk at physics.syr.edu
> Computer Systems Manager
> Department of Physics
> Syracuse University, Syracuse, NY
> http://www.physics.syr.edu/help/    Fax:(315) 443-9103
> =======================================================
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf

-- 
Jared Hodge
Institute for Advanced Technology
The University of Texas at Austin
3925 W. Braker Lane, Suite 400
Austin, Texas 78759

Phone: 512-232-4460
Fax: 512-471-9096
Email: Jared_Hodge at iat.utexas.edu




