Burn-in Utilities


Velocet math at velocet.ca
Wed Apr 24 08:30:08 PDT 2002


On Wed, Apr 24, 2002 at 10:45:19AM -0400, Justin Nemmers's all...
> All:
> 	I am in search of a utility that will allow me to burn-in a 
> new PC.  Ideally, it would peg the procs at 100% as well as exercise 
> the memory (as much as 2Gb/Node).  I know there is a Sun provided 
> utility to do this on Sparc systems, but does anyone have a 
> suggestion for a linux-based (perl would work, too) that will do the 
> same thing?

The cpuburn and memtest packages (in Debian and Red Hat, AFAIK) will do
you nicely.

We run 5 or so copies each of burnMMX, burnK7 and memtest on our Athlon
machines for 2-3 days and see if even one crashes. We've had crashes on
machines tested this way AFTER they had been in service with no problems for
3-4 months, so it's definitely a hardcore exercise.  Oh, we also stick dnetc
on them on top of all that just to make sure it's hurting.  I think the burn
programs are set to generate the most heat possible in the CPU during
operation. They definitely draw the most current - when we were first setting
up our cluster and weren't sure of power draw, 8 dual 1.333GHz Athlon boards
(no drives) would run G98 fine on a 15 amp circuit - as soon as we ran
burnMMX/burnK7 we'd blow breakers.
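
For a rough feel of what that breaker trip implies, here is a back-of-envelope
sketch. It assumes nominal 120 V mains, which the numbers above do not state,
and it ignores breaker time-current behaviour and power factor, so treat the
result as an order-of-magnitude figure only.

```python
# Rough per-board power budget on one circuit.
# ASSUMPTION: 120 V nominal mains; the post only gives the 15 A rating.
CIRCUIT_AMPS = 15.0
MAINS_VOLTS = 120.0
BOARDS = 8


def per_board_budget_watts(amps, volts, boards):
    """Average power each board may draw before the circuit is at its rating."""
    return amps * volts / boards


budget = per_board_budget_watts(CIRCUIT_AMPS, MAINS_VOLTS, BOARDS)
print(f"{budget:.0f} W per board")  # 225 W per board
```

Under that assumption, G98 kept each dual board under roughly 225 W on
average, and the burn programs pushed them past it.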

We run 5-10 copies to get a nice high context-switch rate going and exercise
the OS as well ;) We found (through trial and error) that a machine running
only 1 each of burnMMX/burnK7 at a time will often not crash for days, whereas
the same machine running 5-10 will.
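
A minimal sketch of a harness for running several copies at once, not the
setup described above. The function name `run_burn_in` and its parameters are
made up for illustration, and it assumes a healthy burn process keeps running
until killed, so any early exit is treated as a failure. A hard lock-up of the
whole machine obviously cannot be reported by a script running on that machine;
that still needs an external check.

```python
import subprocess
import time


def run_burn_in(commands, copies, duration_s, poll_s=1.0):
    """Start `copies` instances of each command and watch them.

    Returns a list of (command, returncode) for every process that exited
    before `duration_s` elapsed; an empty list means nothing died early.
    """
    procs = []
    for cmd in commands:
        for _ in range(copies):
            p = subprocess.Popen(
                cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
            )
            procs.append((cmd, p))

    failures = []
    deadline = time.monotonic() + duration_s
    try:
        while time.monotonic() < deadline and not failures:
            for cmd, p in procs:
                rc = p.poll()
                if rc is not None:  # process ended on its own: suspicious
                    failures.append((cmd, rc))
            if not failures:
                time.sleep(poll_s)
    finally:
        # Stop whatever is still running and reap every child.
        for _, p in procs:
            if p.poll() is None:
                p.kill()
            p.wait()
    return failures
```

Something like `run_burn_in([["burnMMX"], ["burnK7"]], copies=5,
duration_s=2 * 24 * 3600)` would approximate the 5-copies-for-2-days run,
assuming both binaries are on the PATH and need no arguments.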

(In fact, we only consider a crash within 12 hours to be a reason to RMA it if
it's slated for a workstation running Windows.  Lasting 12 hours in that test
is roughly equivalent to crashing once every 3-6 months under regular LINUX
desktop use (and with Windows how can you tell? :))

It's actually surprising how well you can measure the quality of boards that
way. Out of 40 246x Tyan boards we found one bad stick of RAM and 0 bad CPUs
or boards using this method. However, with ECS K75As we found that 1/10 of the
boards as shipped to us would die in 1-6 hours under this load, and another
1/10 would die within the 2-3 days. while ! burnMMX; do RMA_via_VAR; done
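
As a rough planning aid, the failure fractions quoted for the ECS boards can be
turned into an expected order size. This is only a sketch: the helper name
`boards_to_order` is made up, and it treats the two 1/10 figures as fixed
rates, which a sample of that size does not really justify.

```python
import math
from fractions import Fraction


def boards_to_order(needed, early_fail=Fraction(1, 10), late_fail=Fraction(1, 10)):
    """Expected number of boards to buy so that `needed` survive burn-in.

    early_fail: fraction dying in the first hours of the test.
    late_fail:  additional fraction dying later in the 2-3 day run.
    """
    pass_rate = 1 - early_fail - late_fail
    return math.ceil(needed / pass_rate)


print(boards_to_order(40))  # 50
```

With about 8 in 10 boards passing, ending up with 40 good ones means expecting
to cycle roughly 50 through the test, RMA replacements included.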

Nonetheless we've never seen every unit of a certain brand always crash within
that time - eventually we get good boards - so using proper sorting after
testing in this manner you can always end up with a set of good boards (at
least as far as these tests are concerned). So far with any board that makes
it past 2-3 days of this we've never seen a problem with Gaussian98, Gromacs
or distributed-net afterwards (at least until we hit long term electron
migration path problems due to regular CPU heat wear and tear...) but none of
our boards/CPUs (the PcChips M817 LMRs are hitting 16 months of continuous
operation) are there yet.

/kc


> 
> Cheers,
> Justin
> -- 
> 
> System Administrator
> National Institutes of Health
> Center for Information Technology
> 9000 Rockville PK
> Building 12B 2N/207
> Bethesda, MD 20892-5680
> 301.496.0396
> http://biowulf.nih.gov
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- 
Ken Chase, math at velocet.ca  *  Velocet Communications Inc.  *  Toronto, CANADA 


