[Beowulf] Not quite Walmart, or, living without ECC?

Bruno Coutinho coutinho at dcc.ufmg.br
Mon Nov 26 13:15:06 PST 2007


I have heard that the major source of memory corruption in servers is the
memory bus, and that this becomes worse as you add memory sticks.
With 8 memory sticks that have 8 chips on each side, you have 128 chips.
So, if that is right, the main purpose of ECC is correcting bus errors.



2007/11/26, David Mathog <mathog at caltech.edu>:
>
> I ran a little test over the Thanksgiving holiday to see how common
> random errors in nonECC memory are.  I used the memtest86+ bit fade test
> mode, which writes all 1s, waits 90 minutes, checks the result, then
> does the same thing for all 0s.   Anyway, this was the best test I could
> find for detecting the occasional gamma ray type data loss event.  The
> result: no errors logged in 5 solid days of testing.  So this class of
> error (the type ECC would detect and probably fix) apparently occurs
> on these machines at a rate of less than 1 per 840 Gigabyte-hours.
> If data can only be lost on a 1 -> 0 transition, or vice versa, only
> half of that exposure counts, so the bound would be less than 1 per
> 420 Gigabyte-hours.  This assumes the bit fade test works, which
> cannot be independently verified from these results.
>
> On the web there are references to an IBM study which found 1 bit
> error/256 MB/month, which would have been (.25 * 30 * 24) =
> 1 per 180 Gigabyte-hours.  If IBM's numbers held for my hardware
> I should have seen 4 or 5 errors in total.  Mine are in a basement
> in a concrete building, perhaps that provided some shielding relative to
> what IBM used for their test conditions.
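
The arithmetic in the two quoted paragraphs above can be checked with a short
Python sketch. The 840 and 180 Gigabyte-hour figures and the 5 days of testing
are taken from the post; the amount of memory under test is not stated, so the
7 GB below is only what those figures imply. The last step treats errors as a
Poisson process, which is an assumption added here for illustration and not
something the post claims.

```python
import math

# Figures taken from the quoted post.
hours_tested = 5 * 24                 # "5 solid days of testing"
exposure_gb_hours = 840               # total exposure quoted in the post

# Not stated in the post: memory under test implied by the two figures above.
implied_memory_gb = exposure_gb_hours / hours_tested

# IBM figure as quoted: 1 bit error per 256 MB per month (30 days assumed).
ibm_gb_hours_per_error = 0.25 * 30 * 24

# Errors expected over the whole test if the IBM rate applied.
expected_errors = exposure_gb_hours / ibm_gb_hours_per_error

# If only one transition direction can lose data, half the exposure counts.
expected_errors_one_way = (exposure_gb_hours / 2) / ibm_gb_hours_per_error

# Assumption (not from the post): errors arrive as a Poisson process, so the
# chance of observing zero errors is exp(-expected count).
p_zero = math.exp(-expected_errors)
p_zero_one_way = math.exp(-expected_errors_one_way)

print(f"implied memory under test: {implied_memory_gb:.1f} GB")
print(f"IBM rate: 1 error per {ibm_gb_hours_per_error:.0f} GB-hours")
print(f"expected errors: {expected_errors:.2f} "
      f"(one-way only: {expected_errors_one_way:.2f})")
print(f"P(no errors): {p_zero:.4f} (one-way only: {p_zero_one_way:.4f})")
```

Under that Poisson assumption, seeing no errors at all would be unlikely if the
IBM rate applied to this hardware, which is consistent with the suggestion that
the test conditions differed.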
>
> The memory was Corsair Twinx1024-3200C2.  When first installed all
> of this memory had run for 24 hours with no errors in normal
> memtest86+ testing.
>
> Regards,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech