[Beowulf] Not quite Walmart, or, living without ECC?
David Mathog mathog at caltech.edu
Mon Nov 26 12:27:03 PST 2007
I ran a little test over the Thanksgiving holiday to see how common random errors in non-ECC memory are. I used the memtest86+ bit fade test mode, which writes all 1s, waits 90 minutes, checks the result, then does the same thing for all 0s. This was the best test I could find for detecting the occasional gamma-ray-type data loss event.

The result: no errors logged in 5 solid days of testing. So this class of error (the type ECC would detect and probably fix) apparently occurs on these machines at a rate of less than 1 per 840 Gigabyte-hours. Possibly the limit is only half that (1 per 420 Gigabyte-hours) if data can only be lost on a 1 -> 0 transition, or vice versa. This assumes the bit fade test works, which cannot be independently verified from these results.

On the web there are references to an IBM study which found 1 bit error/256 MB/month, which would have been (.25 * 30 * 24) = 1 per 180 Gigabyte-hours. If IBM's numbers held for my hardware, I should have seen 4 or 5 errors in total. My machines are in a basement in a concrete building; perhaps that provided some shielding relative to what IBM used for their test conditions.

The memory was Corsair Twinx1024-3200C2. When first installed, all of this memory had run for 24 hours with no errors in normal memtest86+ testing.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
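For anyone who wants to check the arithmetic above, here is a rough sketch in Python. The 840 and 180 Gigabyte-hour figures and the 30-day month come from the post. The Poisson model for error arrivals is an added assumption, not something the post claims, and the total memory size is only inferred from 840 Gigabyte-hours over 5 days.

```python
import math

# Figures quoted in the post
tested_gb_hours = 840.0        # total exposure with zero errors logged
test_hours = 5 * 24            # "5 solid days of testing"
ibm_gb = 256 / 1024            # 256 MB expressed in GB (0.25)
hours_per_month = 30 * 24      # the post's 30-day month

# Inferred, not stated in the post: memory under test across all machines
implied_memory_gb = tested_gb_hours / test_hours

# Gigabyte-hours per error implied by the IBM figure
ibm_gb_hours_per_error = ibm_gb * hours_per_month

# Errors expected during this test if the IBM rate applied
expected_errors = tested_gb_hours / ibm_gb_hours_per_error

# Assumption: errors arrive independently (Poisson), so the chance of
# logging none is exp(-expected count)
p_zero = math.exp(-expected_errors)

# If a flip can only show up in one of the two patterns (all 1s or all 0s),
# only half of the exposure counts
expected_half = expected_errors / 2
p_zero_half = math.exp(-expected_half)

print(f"implied memory under test: {implied_memory_gb:.1f} GB")
print(f"IBM rate: 1 error per {ibm_gb_hours_per_error:.0f} GB-hours")
print(f"expected errors at IBM rate: {expected_errors:.2f}")
print(f"P(no errors), full exposure: {p_zero:.4f}")
print(f"P(no errors), half exposure: {p_zero_half:.4f}")
```

Under that Poisson assumption, a clean run would be unlikely if the IBM rate applied here: under 1% with the full exposure, and roughly 10% if only half of it counts.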
