[Beowulf] ECC exerciser/exorciser?
prentice at ias.edu
Mon Jan 26 08:00:54 PST 2009
Mark Hahn wrote:
> Hi all,
> we're having some trouble with nodes showing high ECC corrected error (CE)
> counts. I'm wondering whether you have any wisdom on the following:
> - first, how would you go about setting a threshold for how high is an
> acceptable CE count? we by default are using the mce module, which by
> default polls at 1Hz. my thinking is that if we get overflow events
> (the multiple error bit is set), then it's too fast.
> - do you have or know of a good exerciser for testing ECC's? yes, I
> know about memtest86, but I'm more curious about a load that could be
> run under linux. my thinking is that ecc's are triggered by bad reads,
> so something which allocates all memory and then continually reads it
> would be best.
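For illustration, the "allocate memory, then keep reading it" idea could be sketched roughly as below. This is only a minimal Python sketch, not a tested exerciser: the names make_buffer, read_sweep and exercise and the 0xA5 fill pattern are made up for this example. Two caveats: corrected errors are fixed by the hardware before the program sees the data, so the mismatch count only catches corruption that was not corrected (the CE counters still have to be read from the kernel), and the buffer has to be much larger than the CPU caches or the reads never reach DRAM.

```python
import time

def make_buffer(size_bytes, pattern=0xA5):
    """Allocate size_bytes and fill them with a pattern so the pages are really backed by RAM."""
    return bytearray([pattern]) * size_bytes

def read_sweep(buf, pattern=0xA5):
    """Read the whole buffer once; return how many bytes no longer match the pattern."""
    return len(buf) - buf.count(pattern)

def exercise(size_bytes, seconds, pattern=0xA5):
    """Sweep the buffer repeatedly for roughly `seconds`; return (sweeps, mismatches)."""
    buf = make_buffer(size_bytes, pattern)
    sweeps = 0
    mismatches = 0
    deadline = time.monotonic() + seconds
    while True:
        mismatches += read_sweep(buf, pattern)
        sweeps += 1
        if time.monotonic() >= deadline:
            break
    return sweeps, mismatches
```

Something like exercise(8 * 2**30, 3600) would sweep 8 GiB for about an hour. The size is a guess to be adjusted per node, and one copy per NUMA node or per core is probably needed to touch all DIMMs.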
I find that just running a large HPL job across the cluster will find errors.
It may take a couple of days, but it will. I've run breakin for days on
end and not found any memory errors, but when I run a full-blown HPL
job, I find memory errors right away (if right away = a couple of days).
Breakin runs xhpl on every core, but I'm not sure whether it's MPI-based or
whether every core is running an independent job. Maybe the breakin
developer(s) can chime in on how it stresses the RAM.
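To see which nodes pick up corrected errors during a run like that, one option is to sample the kernel's CE counters before and after the job. The sketch below assumes an EDAC driver is loaded, so that the counters appear under /sys/devices/system/edac/mc/mc*/ce_count. That is a different interface from the mce polling mentioned in the question, and the helper names read_ce_counts and ce_delta are made up for this example.

```python
import glob
import os

# Assumption: an EDAC driver is loaded; without one this directory does not exist.
EDAC_ROOT = "/sys/devices/system/edac/mc"

def read_ce_counts(root=EDAC_ROOT):
    """Return {controller_name: corrected_error_count} for every mc* directory under root."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(root, "mc*", "ce_count"))):
        controller = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            counts[controller] = int(f.read().strip())
    return counts

def ce_delta(before, after):
    """Per-controller increase in corrected errors between two samples."""
    return {mc: after[mc] - before.get(mc, 0) for mc in after}
```

Taking one sample before the HPL job and one after, ce_delta(before, after) gives the number of corrected errors per memory controller over the run. Dividing by the elapsed time gives a rate that can be compared between nodes.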
Hope that helps.