[Beowulf] Not quite Walmart, or, living without ECC?
ajt at rri.sari.ac.uk
Fri Nov 16 08:51:33 PST 2007
David Mathog wrote:
> Any of you running clusters without ECC? Has the lack of error
> correction been a problem?
Yes, I'm running openMosix on 64 Athlon 2400+/2600+ 1p compute nodes.
This is what I posted about it on the openMosix Wiki:
Q. How reliable is openMosix?
A. An openMosix cluster is only as reliable as its least reliable
node. In particular, memory corruption can be propagated throughout a
cluster if processes are migrated to and from an unreliable COTS
(Commodity Off The Shelf) PC without ECC (Error Correction Code) memory.
If the memory corruption is sufficient to make a migrated process crash,
the load on the unreliable node then decreases and more processes are
"attracted" to the node from the rest of the cluster by the openMosix
load balancing algorithm. Migrated processes that do not crash on the
node may also be corrupted if they make use of unreliable memory. When
these processes are migrated away from the unreliable node, the memory
corruption is propagated back to the rest of the openMosix cluster. For
this reason, it is essential to test the memory of COTS PCs thoroughly
BEFORE allowing them to join an openMosix cluster. This can be done
using a stand-alone utility, e.g. "memtest86" (http://www.memtest86.com/),
or under Linux with a user-mode utility, e.g. "memtester".
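The "attraction" effect described above can be sketched with a toy
model in Python. This is only an illustration under stated assumptions:
the balancing rule (move one process per round from the busiest node to
the idlest one), the function name simulate() and all of its parameters
are made up for the example and are not the real openMosix algorithm.
It also models only the crash path, not the silent corruption that
migrates back out.

```python
# Toy model only: NOT the real openMosix load-balancing algorithm.
def simulate(num_nodes=4, procs_per_node=8, bad_node=0, steps=100):
    """Return (crashed, surviving) process counts after `steps` rounds."""
    loads = [procs_per_node] * num_nodes  # processes currently on each node
    crashed = 0
    for _ in range(steps):
        if bad_node is not None:
            # Assumption: every process on the bad node hits corrupt
            # memory and crashes, so the node's load drops to zero.
            crashed += loads[bad_node]
            loads[bad_node] = 0
        # Hypothetical balancer: move one process from the busiest node
        # to the idlest one, but only if that reduces the imbalance.
        src = max(range(num_nodes), key=lambda i: loads[i])
        dst = min(range(num_nodes), key=lambda i: loads[i])
        if loads[src] - loads[dst] >= 2:
            loads[src] -= 1
            loads[dst] += 1
    return crashed, sum(loads)


if __name__ == "__main__":
    print("healthy cluster:", simulate(bad_node=None))
    print("one bad node:   ", simulate(bad_node=0))
```

With these defaults the healthy cluster loses nothing, (0, 32), while
the cluster with one bad node ends at (29, 3): the idle bad node keeps
pulling in work until each good node is down to a single process.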
Dr. A.J.Travis, | mailto:ajt at rri.sari.ac.uk
Rowett Research Institute, | http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn, | phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK. | fax:+44 (0)1224 716687