[Beowulf] Re: ECC exerciser/exorciser?

David Mathog mathog at caltech.edu
Tue Jan 27 09:15:47 PST 2009

Joe Landman <landman at scalableinformatics.com> wrote:

> There are two that I know of ... memtest and memtest86, one of which is 
> a fork of the other.  While I like both for coarse testing, we run a 
> bunch of GAMESS runs to burn nodes in.  Some folks like HPL for this.  I 
> like large dense matrix computations that pound on the memory subsystem.

It's an interesting question: why don't the common memory testers catch
memory failures that user code does?  One would think that the folks who
maintain these programs would be trying very hard to emulate the loads
that real code places on a system.  But consider the differences.  

1.  One limitation of memtest86+, at least the last time I looked, was
that it only used a single core in a multiple CPU system.  The tests Joe
describes above are going to be banging away on all cores at once. 
Since memtest86+ is in a sense its own operating system, getting it to
run on multiple cores would require one heck of a lot of code to be
added.  So much so that it probably becomes easier to just boot Linux
and run the memory tester as a standard application, or as Joe says,
just run the actual applications.

2.  Most of the modes in most memory testers (generalizing much?) are in
some sense sequential.  That is, they tend to go through memory in a
fixed order.  This order is not always strictly linear, but it is rarely
(ever?) as random as the end user test codes may be.  Consequently, they tend
not to find failure modes that correspond to multiple memory operations
on memory cells at peculiar geometries and intervals.

3.  The memory testers don't exercise anything but the memory. This puts
a pretty constant but minimal load on the power supply.  Pounding away
at the same time on the disks (and to a lesser extent the NIC) puts a
large and varying load on the power supply, which most likely results in
additional noise on all voltages, which may be enough to trigger memory
failures in marginal devices.  
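
As a rough illustration of points 1 and 2, here is a minimal sketch of a
user-space tester that runs under a normal OS, starts one worker per core,
and touches memory in a shuffled rather than fixed order.  It is not how
memtest86+ or any existing tester works; the names check_buffer and
run_all_cores, the buffer size, and the pattern function are all made up
for illustration.  Pure Python is far too slow, and buffers this small sit
in cache, so a serious version would be written in C with buffers much
larger than the caches.

```python
import multiprocessing as mp
import random
from array import array

MASK64 = (1 << 64) - 1


def check_buffer(words, seed):
    """Fill a buffer in shuffled order, verify it in a second shuffled
    order, and return the number of mismatched words (hypothetical helper)."""
    rng = random.Random(seed)
    buf = array("Q", [0]) * words          # 'words' unsigned 64-bit cells

    def pattern(i):
        # Value expected at index i; recomputable, so no reference copy is kept.
        return ((i * 0x9E3779B97F4A7C15) ^ seed) & MASK64

    order = list(range(words))
    rng.shuffle(order)                     # non-sequential write order
    for i in order:
        buf[i] = pattern(i)

    rng.shuffle(order)                     # different order for the read pass
    return sum(1 for i in order if buf[i] != pattern(i))


def run_all_cores(words_per_worker=1 << 16, workers=None):
    """Run check_buffer in one process per core; returns per-worker error counts."""
    workers = workers or mp.cpu_count()
    jobs = [(words_per_worker, seed) for seed in range(workers)]
    with mp.Pool(workers) as pool:
        return pool.starmap(check_buffer, jobs)


if __name__ == "__main__":
    print(run_all_cores())
```

Running this alongside heavy disk and network traffic would also bring in
the varying power supply load from point 3, which a standalone tester
never generates.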

As an aside, I have often wished I had a "marginal by design" power
supply specifically for more realistic stress tests of the electronic
components in a bench situation.  That is, a power supply that acts with
minimal load as if it were under severe time-varying load, with respect
to noise on the voltages.  This would be useful in finding marginal
electronic components, not only memory, but motherboards, NICs, and so
forth.  No one is going to sell such a PSU, but one could make a sort of
"brutalizer box" to plug into a system to emulate this.  This would be a
small(ish) device into which a power supply connector would be plugged.
Once powered up, it would apply a wildly and rapidly varying load on
all voltage lines.  It need not be a particularly complicated circuit. 
For instance, run a white noise generator off of the 5V line and use
that to drive load transistors on all the supply lines.


David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
