[Beowulf] Not quite Walmart, or, living without ECC?
ajt at rri.sari.ac.uk
Mon Nov 26 16:02:52 PST 2007
David Mathog wrote:
> I ran a little test over the Thanksgiving holiday to see how common
> random errors in nonECC memory are. I used the memtest86+ bit fade test
> mode, which writes all 1s, waits 90 minutes, checks the result, then
> does the same thing for all 0s. Anyway, this was the best test I could
> find for detecting the occasional gamma ray type data loss event. The
Memtest86+ is fine for 'burn-in' tests, but it does not do a realistic
memory stress test under the conditions in which normal applications run.
I test new non-ECC compute nodes by booting memtest86+ and running it
for 24h. If there are no errors, I reboot into Linux and run memtester.
I've found memory that passes a 24h memtest86+ test but fails memtester.
If one of our compute nodes crashes while in use, it is re-tested the same
way before being allowed to rejoin the openMosix cluster. It is possible
that faults detected by memtester are caused by other components, such
as CPUs overheating or PSUs struggling to provide enough power, but the
important point is that these problems affect applications in a similar way.
All the compute nodes in our Beowulf cluster have to pass 24h Memtest86+
clean, followed by 100 memtester runs on 128MB RAM before being trusted
to accept openMosix migrated processes, or to be used as LAM MPI hosts.
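
For anyone who wants to script the memtester stage described above, here is
a minimal sketch rather than the script we actually use. It assumes that
memtester is on the PATH and is invoked as "memtester <size>M <loops>", and
that its exit status is a bitwise OR of 0x01 (allocation or invocation
problem), 0x02 (stuck address test failed) and 0x04 (another test failed),
as I read the man page. The function names are hypothetical.

```python
import subprocess

# Exit-status bits, as described in the memtester man page (assumption:
# check the man page of your installed version before relying on these).
EXIT_ALLOC_OR_USAGE = 0x01  # could not allocate/lock memory, or bad invocation
EXIT_STUCK_ADDRESS = 0x02   # failure in the stuck address test
EXIT_OTHER_TEST = 0x04      # failure in one of the other tests


def build_memtester_cmd(size_mb=128, loops=100):
    """Return the argv for 'memtester <size>M <loops>'."""
    return ["memtester", f"{size_mb}M", str(loops)]


def describe_exit(status):
    """Translate a memtester exit status into a list of problem descriptions."""
    problems = []
    if status & EXIT_ALLOC_OR_USAGE:
        problems.append("allocation/locking or invocation error")
    if status & EXIT_STUCK_ADDRESS:
        problems.append("stuck address test failed")
    if status & EXIT_OTHER_TEST:
        problems.append("one of the other tests failed")
    return problems


def node_passes(size_mb=128, loops=100, run=subprocess.run):
    """Run memtester once and return (passed, problems).

    'run' is injectable so the logic can be exercised without memtester.
    A node passes only if the exit status is exactly zero.
    """
    result = run(build_memtester_cmd(size_mb, loops))
    return result.returncode == 0, describe_exit(result.returncode)


if __name__ == "__main__":
    passed, problems = node_passes()
    print("PASS" if passed else "FAIL: " + ", ".join(problems))
```

memtester is normally run as root so that it can lock the memory under
test. A node that prints FAIL here would go back through the 24h
Memtest86+ run before being re-tested.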
Dr. A.J.Travis, | mailto:ajt at rri.sari.ac.uk
Rowett Research Institute, | http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn, | phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK. | fax:+44 (0)1224 716687