[Beowulf] cheap PCs this christmas
Tony Travis ajt at rri.sari.ac.uk
Thu Nov 10 08:04:37 PST 2005
Jim Lux wrote:
> There's a rumor that HP is going to have a $300 PC this Christmas
> shopping season (as in starting the day after Thanksgiving) to be sold
> through massmarket outlets (e.g. Wal-Mart). Presumably this is a real
> $300, not a $1000 PC with $700 in "rebates".

Hello, Jim.

That sounds great BUT what about the reliability of COTS memory?

I built a 64-node Athlon XP 2400+/2600+ cluster here to run openMosix, and had *terrible* problems with memory reliability on 32 of the nodes with slimline cases that I bought for £287 each including 1GB RAM, 40GB IDE disk and 3C2000 PCI Gigabit NIC. Although it sounds like a bargain, it has taken me a long time to weed out all the 'bad' memory (using memtest86 and memtester).

A particular problem when using openMosix process migration is that bad RAM on one node can spread process memory corruption throughout the cluster - I wrote about this on the oM Wiki:

http://howto.x-tend.be/openMosixWiki/index.php/Additions%20to%20the%20FAQ

Now, I don't allow openMosix compute nodes to join the cluster unless they can run 100 passes of memtester on 50% of their available RAM without a single error. This might seem a bit OTT but it is, in fact, a realistic simulation of the way real jobs run on the cluster. We adopted this strategy because some jobs were producing odd results despite the fact that ALL the nodes passed memtest86 before being allowed to join the cluster.

There has been some discussion about the reliability of COTS memory in space:

http://www.crhc.uiuc.edu/FTCS-29/pdfs/rennelsd.pdf

And an infrastructure for handling memory errors in the Linux kernel:

http://kerneltrap.org/node/5293

I've suggested doing CRC checks on memory transfers during oM process migration, but this was received with little enthusiasm by the openMosix community. Myself, I think it's a similar problem to doing CRC checks on disk transfers, and the performance overhead would be acceptable given 100Base-T/Gigabit NIC latency.
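
To illustrate the CRC idea, here is a minimal Python sketch of checksumming a block of process memory before it is sent and verifying it on arrival. It is not openMosix code: the framing format, the 4096-byte PAGE_SIZE and the function names frame_page/unframe_page are all assumptions made up for this example.

```python
import struct
import zlib

PAGE_SIZE = 4096  # assumed page size, for illustration only


def frame_page(page: bytes) -> bytes:
    """Prefix a page with its length and CRC32 (hypothetical wire format)."""
    crc = zlib.crc32(page) & 0xFFFFFFFF
    return struct.pack("!II", len(page), crc) + page


def unframe_page(frame: bytes) -> bytes:
    """Return the page payload, or raise ValueError if it fails the CRC check."""
    length, expected = struct.unpack("!II", frame[:8])
    page = frame[8:8 + length]
    if len(page) != length:
        raise ValueError("truncated page")
    if zlib.crc32(page) & 0xFFFFFFFF != expected:
        raise ValueError("CRC mismatch: page corrupted after checksum was taken")
    return page
```

A checksum taken at send time can only catch corruption introduced after that point, for example in the sender's buffers, on the wire, or in the receiver's RAM before verification.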
I thought it might be possible to adapt Rick Rein's work, but he told me he was doubtful about this:

http://www.linuxjournal.com/article/4489

I think memory reliability represents an Achilles heel for openMosix on COTS clusters. The economics of DIY Beowulf seem a lot less attractive if you have to use PCs with ECC memory.

My present strategy is to subject nodes to random memory stress tests, and replace memory if any errors are reported. If a node crashes during normal use it is not allowed to re-join the cluster until it has run 100 passes of memtester without error.

FYI memtester is at:

http://pyropus.ca/software/memtester/

I'm interested to know about other people's views and experiences of the reliability of COTS (i.e. non-ECC) memory.

Best wishes,

Tony.
-- 
Dr. A.J.Travis,                     |  mailto:ajt at rri.sari.ac.uk
Rowett Research Institute,          |    http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn,          |   phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK.    |     fax:+44 (0)1224 716687
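
The admission rule described in this message (100 passes of memtester on 50% of available RAM, no errors) could be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, it assumes /proc/meminfo is readable, and it falls back to MemFree on kernels that do not report MemAvailable. It relies on memtester's documented command line (memory size, then loop count) and on its exit status being 0 only when every test passes.

```python
import re


def available_kb(meminfo_text: str) -> int:
    """Pick MemAvailable (or MemFree as a fallback) from /proc/meminfo text."""
    fields = dict(re.findall(r"^(\w+):\s+(\d+) kB", meminfo_text, flags=re.M))
    key = "MemAvailable" if "MemAvailable" in fields else "MemFree"
    return int(fields[key])


def memtester_command(meminfo_text: str, fraction: float = 0.5, passes: int = 100):
    """Build the memtester argv for the given share of available RAM."""
    megabytes = int(available_kb(meminfo_text) * fraction) // 1024
    return ["memtester", f"{megabytes}M", str(passes)]


def node_may_join(returncode: int) -> bool:
    """Admit the node only if memtester reported no errors at all."""
    return returncode == 0

# Possible use on a node (needs root so memtester can lock memory):
#   import subprocess
#   cmd = memtester_command(open("/proc/meminfo").read())
#   ok = node_may_join(subprocess.run(cmd).returncode)
```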
