[Beowulf] Curious about ECC vs non-ECC in practice

Guy Coates gmpc at sanger.ac.uk
Fri May 20 01:58:59 PDT 2011


On 20/05/11 06:45, Greg Lindahl wrote:
> On Fri, May 20, 2011 at 12:35:25AM -0400, Joe Landman wrote:
> 
>>    Does anyone run a large-ish cluster without ECC ram?  Or with ECC 
>> turned off at the motherboard level?  I am curious if there are numbers 
>> of these, and what issues people encounter.  I have some of my own data 
>> from smaller collections of systems, I am wondering about this for 
>> larger systems.

We did, circa 2003. Never again.

When we were lucky, the uncorrected errors happened in memory in use by
the kernel or application code, and we got hard machine crashes or
segfaulting code. Those were easy to spot.

When we were unlucky, the errors happened in page cache, resulting in
data being randomly transmuted. Most of the code we were running at the
time did minimal input sanity checking. It was quite instructive to see
just how much genomic analysis code would quite happily compute on DNA
sequences that contained things other than ATGC.
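
For illustration, here is a minimal sketch of the kind of input check that
was missing. It is not taken from any of our pipelines; the function names
and the strict four-letter alphabet are assumptions made for the example
(real data may legitimately contain N or other IUPAC ambiguity codes).

```python
# Assumption: only the four unambiguous bases are valid. Widen the
# alphabet (e.g. add "N") if your data uses IUPAC ambiguity codes.
VALID_BASES = frozenset("ACGT")


def find_invalid_bases(sequence, valid=VALID_BASES):
    """Return (offset, character) pairs for every character outside `valid`."""
    return [(i, ch) for i, ch in enumerate(sequence.upper()) if ch not in valid]


def check_sequence(name, sequence):
    """Return the sequence unchanged, or raise ValueError if it is not pure ACGT."""
    bad = find_invalid_bases(sequence)
    if bad:
        offset, ch = bad[0]
        raise ValueError(
            f"{name}: {len(bad)} invalid character(s), "
            f"first is {ch!r} at offset {offset}"
        )
    return sequence
```

A check like this only catches corruption that lands outside the alphabet.
A single flipped bit can also turn one valid base into another (ASCII "A"
and "C" differ by one bit), so it complements end-of-pipeline checks or
checksums rather than replacing them.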

The duff runs would eventually get picked up by the various
sanity-checks that happened at the end of our analysis pipelines, but it
involved quite a bit of developer & sysadmin effort to track down and
re-run all of the possibly affected jobs.

Cheers,

Guy


-- 
Dr. Guy Coates, Informatics Systems Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 496802




