[Beowulf] Re: ECC exerciser/exorciser?
David Mathog mathog at caltech.edu
Tue Jan 27 09:15:47 PST 2009
- Previous message: [Beowulf] Not sure if people have seen it yet, but 2TB disks from Western Digital appear to be in the wild ...
- Next message: [Beowulf] Re: ECC exerciser/exorciser?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Joe Landman <landman at scalableinformatics.com> wrote:

> There are two that I know of ... memtest and memtest86, one of which is
> a fork of the other. While I like both for coarse testing, we run a
> bunch of GAMESS runs to burn nodes in. Some folks like HPL for this. I
> like large dense matrix computations that pound on the memory subsystem.

It's an interesting question: why don't the common memory testers catch memory failures that user code does? One would think that the folks who maintain these programs would be trying very hard to emulate the loads that real code places on a system. But consider the differences.

1. One limitation of memtest86+, at least the last time I looked, was that it only used a single core in a multiple CPU system. The tests Joe describes above are going to be banging away on all cores at once. Since memtest86+ is in a sense its own operating system, getting it to run on multiple cores would require one heck of a lot of code to be added. So much so that it probably becomes easier to just boot Linux and run the memory tester as a standard application, or, as Joe says, just run the actual applications.

2. Most of the modes in most memory testers (generalizing much?) are in some sense sequential. That is, they tend to go through memory in a fixed order. This order is not always strictly linear, but it is rarely (ever?) as random as the access patterns of end-user test codes may be. Consequently, they tend not to find failure modes that correspond to multiple memory operations on memory cells at peculiar geometries and intervals.

3. The memory testers don't exercise anything but the memory. This puts a pretty constant but minimal load on the power supply. Pounding away at the same time on the disks (and to a lesser extent the NIC) puts a large and varying load on the power supply, which most likely results in additional noise on all voltages, which may be enough to trigger memory failures in marginal devices.
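To make points 1 and 2 concrete, here is a minimal sketch of a memory exerciser run as an ordinary application under Linux: one worker process per core, each writing and then verifying a buffer in a shuffled rather than sequential order. It is only an illustration. All names (expected_word, write_pass, verify_pass, worker, run), the bit pattern, and the buffer sizes are made up for this example, and interpreted Python will put far less pressure on the memory subsystem than GAMESS, HPL, or a tester written in C. It also has no control over which physical cells get touched, and with a buffer this small most accesses will be served from cache.

```python
import array
import multiprocessing as mp
import random

MASK64 = (1 << 64) - 1


def expected_word(index, salt):
    """Deterministic 64-bit pattern for one index (arbitrary, hypothetical choice)."""
    return ((index + 1) * 0x9E3779B97F4A7C15 ^ salt) & MASK64


def write_pass(buf, order, salt):
    """Fill buf with the pattern, visiting indices in the given order."""
    for i in order:
        buf[i] = expected_word(i, salt)


def verify_pass(buf, order, salt):
    """Re-read buf in the given order and count words that differ from the pattern."""
    return sum(1 for i in order if buf[i] != expected_word(i, salt))


def worker(task):
    """One process: repeated shuffled write/verify passes over a private buffer."""
    seed, n_words, n_passes = task
    rng = random.Random(seed)
    buf = array.array("Q", [0]) * n_words   # n_words unsigned 64-bit words
    order = list(range(n_words))
    errors = 0
    for _ in range(n_passes):
        salt = rng.getrandbits(64)
        rng.shuffle(order)                  # non-sequential write order
        write_pass(buf, order, salt)
        rng.shuffle(order)                  # different order for the read-back
        errors += verify_pass(buf, order, salt)
    return errors


def run(n_workers=None, n_words=1 << 16, n_passes=2):
    """Run one worker per core (by default) at the same time; return total mismatches."""
    n_workers = n_workers or mp.cpu_count()
    tasks = [(seed, n_words, n_passes) for seed in range(n_workers)]
    with mp.Pool(n_workers) as pool:
        return sum(pool.map(worker, tasks))
```

To try it, call run() from a script under an `if __name__ == "__main__":` guard. A useful test would need buffers sized to cover most of RAM and many more passes, and it would still do nothing about point 3 unless disk and network load were running at the same time.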
As an aside, I have often wished I had a "marginal by design" power supply specifically for more realistic stress tests of the electronic components in a bench situation. That is, a power supply that, under minimal load, behaves as if it were under a severe time-varying load, with respect to noise on the voltages. This would be useful in finding marginal electronic components, not only memory, but motherboards, NICs, and so forth.

No one is going to sell such a PSU, but one could make a sort of "brutalizer box" to plug into a system to emulate this. This would be a small(ish) device into which a power supply connector would be plugged. Once powered up, it would apply a wildly and rapidly varying load on all voltage lines. It need not be a particularly complicated circuit. For instance, run a white noise generator off of the 5V line and use that to drive load transistors on all the supply lines.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
