[Beowulf] ECC exerciser/exorciser?
Prentice Bisbal prentice at ias.edu
Mon Jan 26 08:00:54 PST 2009
Mark Hahn wrote:
> Hi all,
> we're having some trouble with nodes showing high ECC corrected error (CE)
> counts. I'm wondering whether you have any wisdom on the following:
>
> - first, how would you go about setting a threshold for how high is an
>   acceptable CE count? we by default are using the mce module, which by
>   default polls at 1Hz. my thinking is that if we get overflow events
>   (the multiple error bit is set), then it's too fast.
>
> - do you have or know of a good exerciser for testing ECC's? yes, I
>   know about memtest86, but I'm more curious about a load that could be
>   run under linux. my thinking is that ecc's are triggered by bad reads,
>   so something which allocates all memory and then continually reads it
>   would be best.
>

Mark,

I find that just running a large HPL job across the cluster will find
errors. It may take a couple of days, but it will. I've run breakin for
days on end and not found any memory errors, but when I run a full-blown
HPL job, I find memory errors right away (if right away = a couple of days).

Breakin runs xhpl on every core, but I'm not sure if it's MPI-based or if
every core is running an independent job. Maybe the breakin developer(s)
can chime in on how it stresses the RAM.

Hope that helps.

--
Prentice
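
For illustration only, here is a minimal Python sketch of the kind of load Mark describes: allocate a buffer, write a known pattern once, then read it back over and over. It is not how HPL or breakin work, and the names (exercise_memory, CHUNK, the 0xA5 pattern) are made up for this example. Two caveats: a corrected ECC error is fixed by the hardware before software sees the data, so this loop only generates read traffic and catches uncorrected corruption, and the CE counts still have to come from the kernel's error reporting. Also, a buffer that fits in the CPU caches will rarely touch DRAM, so a real run would need a buffer much larger than the last-level cache.

```python
CHUNK = 4096  # bytes compared per step (assumed value, roughly one page)

def exercise_memory(size_bytes, passes, pattern=0xA5):
    """Write a byte pattern once, then re-read and verify it `passes` times.

    Returns the total number of bytes that did not match the pattern.
    """
    buf = bytearray([pattern]) * size_bytes   # allocate and fill once
    expected = bytes([pattern]) * CHUNK       # reference chunk to compare against
    view = memoryview(buf)                    # slice without copying
    mismatches = 0
    for _ in range(passes):
        for off in range(0, size_bytes, CHUNK):
            chunk = view[off:off + CHUNK]
            # The last chunk may be shorter than CHUNK, so trim the reference.
            if chunk != expected[:len(chunk)]:
                mismatches += sum(1 for b in chunk if b != pattern)
    return mismatches

if __name__ == "__main__":
    # Small demo size; a real run would use most of free RAM and many passes.
    bad = exercise_memory(64 * 1024 * 1024, passes=4)
    print("mismatched bytes:", bad)
```

One copy of this per core, each with its own large buffer, would be closer to the "allocate all memory and continually read it" idea, though a pure-Python loop will generate far less memory traffic than a compiled job such as xhpl.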
