[Beowulf] Re: cheap PCs this christmas
Tony Travis ajt at rri.sari.ac.uk
Mon Nov 14 12:07:14 PST 2005
David Mathog wrote:
>> It's not quite as bad as it sounds because, on the basis of simulations
>> running the "memtester" stress test periodically on nodes in our cluster
>> we have machines that have been up for over 60 days that are capable of
>> running 100 passes on 50% of their memory (typically 512MB) without
>> reporting an error. I'm working on the basis that if the stress test
>> doesn't give errors then a 'normal' application is unlikely to either.
>
> There's a slight problem with that argument. Memtest writes and then
> reads back memory fairly quickly. It will detect memory errors that
> [...]

Hello, David.

Good point, but I'm not using memtest86, I'm using "memtester":

    http://pyropus.ca/software/memtester/

This is Charles Cazabon's user-mode VM stress test, using mlock() to lock
memory into 'core' while Linux is running. It's not a stand-alone
boot-time/burn-in memory test like "memtest86". I also test the swap disk
separately, but "memtester" doesn't allow the tested memory to be swapped
unless it runs in 'degraded' mode without mlock() which is NOT recommended.

The test takes about 50h to run on an Athlon XP 2400+ with 1GB RAM (512MB
of which is actually tested). All our nodes have already passed memtest86+
which I use to check for memory faults before they are connected to the
cluster. The nodes then have to run 100 passes of "memtester" without error
on 50% of their memory (the maximum that can be locked by a user process
under Linux) before being allowed to accept openMosix migrated processes
from the other nodes in the cluster.

I also periodically run "memtester" along with 'normal' jobs, as a
confidence test, to ensure the cluster is working reliably. Having 'weeded'
out all the suspect memory, it is now running quite reliably. The last time
I had to reboot the entire cluster was caused by a mains power failure to
the whole building.

Best wishes,

    Tony.
-- 
Dr. A.J.Travis,                     |  mailto:ajt at rri.sari.ac.uk
Rowett Research Institute,          |    http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn,          |   phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK.    |     fax:+44 (0)1224 716687
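
For readers unfamiliar with the approach described in the message, here is a minimal Python sketch of a user-mode memory test that tries to lock its buffer with mlock() and falls back to an unlocked ('degraded') run if the lock is refused. It is not memtester's code: the names try_mlock, run_passes and PATTERNS, the fill bytes, and the buffer size are all hypothetical, and real memtester runs a much wider set of tests.

```python
import ctypes
import ctypes.util

# Illustrative fill bytes only (assumption); not memtester's actual test set.
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


def try_mlock(buf, size_bytes):
    """Try to pin buf in RAM with mlock(); return True on success."""
    try:
        libc_name = ctypes.util.find_library("c")
        if libc_name is None:
            return False
        libc = ctypes.CDLL(libc_name, use_errno=True)
        addr = ctypes.c_void_p(ctypes.addressof(buf))
        return libc.mlock(addr, ctypes.c_size_t(size_bytes)) == 0
    except (OSError, AttributeError, TypeError):
        # No usable libc or no mlock symbol: run unlocked instead.
        return False


def run_passes(size_bytes, passes):
    """Fill and read back a buffer; return (locked, error_count)."""
    buf = (ctypes.c_ubyte * size_bytes)()
    locked = try_mlock(buf, size_bytes)
    errors = 0
    for _ in range(passes):
        for pattern in PATTERNS:
            # Write the pattern across the whole buffer, then verify it.
            ctypes.memset(buf, pattern, size_bytes)
            if bytes(buf) != bytes([pattern]) * size_bytes:
                errors += 1
    return locked, errors


if __name__ == "__main__":
    locked, errors = run_passes(1 << 20, passes=2)
    mode = "locked" if locked else "degraded (not locked)"
    print(f"mode: {mode}, failed pattern checks: {errors}")
```

Whether the lock succeeds depends on the process's locked-memory limit, which is why the sketch reports the mode it actually ran in rather than assuming the buffer stayed resident.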
