[Beowulf] ECC exerciser/exorciser?

Mark Hahn hahn at mcmaster.ca
Mon Jan 26 07:30:50 PST 2009


Hi all,
we're having some trouble with nodes showing high ECC corrected error (CE)
counts.  I'm wondering whether you have any wisdom on the following:

- first, how would you go about setting a threshold for an acceptable CE
count?  we use the mce module, which by default polls at 1Hz.  my thinking
is that if we get overflow events (the multiple-error bit is set, i.e. more
than one error landed within a poll interval), then the error rate is too high.
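
to make the threshold question concrete, here is a minimal sketch of
rate-based flagging.  it assumes the EDAC driver is loaded and exposes
per-controller ce_count files under /sys/devices/system/edac/mc (this is not
the mce polling path mentioned above).  the function names and the
max_per_hour parameter are made up for illustration; what value to use is
exactly the open question.

```python
import glob
import os


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller name: corrected-error count} read from EDAC sysfs.

    Assumes the EDAC layout mc*/ce_count; returns {} if nothing is found.
    """
    counts = {}
    for path in sorted(glob.glob(os.path.join(edac_root, "mc*", "ce_count"))):
        name = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            counts[name] = int(f.read().strip())
    return counts


def over_threshold(before, after, seconds, max_per_hour):
    """Names of controllers whose CE rate between two samples exceeds the limit.

    max_per_hour is a hypothetical site policy value, not a recommendation.
    """
    bad = []
    for name, now in after.items():
        delta = now - before.get(name, 0)
        if delta * 3600.0 / seconds > max_per_hour:
            bad.append(name)
    return bad
```

usage would be: take one read_ce_counts() sample, sleep, take another, and
pass both to over_threshold() along with the elapsed seconds.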

- do you have or know of a good exerciser for ECC?  yes, I know about
memtest86, but I'm more curious about a load that could be run under
linux.  my thinking is that ECC events are triggered by bad reads, so
something which allocates all memory and then continually reads it would be best.
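
the allocate-then-read idea could be sketched roughly as below.  this is an
illustration only, not a tuned tool: the function name, pattern and chunk
size are arbitrary choices.  the buffer has to be much larger than the CPU
caches for the reads to actually reach DRAM, and corrected errors would show
up in the machine-check/EDAC counters rather than in this program, which
only notices uncorrected data mismatches.

```python
import mmap


def exercise(size_bytes, passes, pattern=0xA5, chunk=1 << 20):
    """Fill an anonymous mapping with a byte pattern, then re-read it.

    Returns the number of chunks whose contents did not match the pattern.
    """
    buf = mmap.mmap(-1, size_bytes)      # anonymous memory, not file-backed
    expected = bytes([pattern]) * chunk
    try:
        # write phase: touch every page so it is really allocated
        for off in range(0, size_bytes, chunk):
            n = min(chunk, size_bytes - off)
            buf[off:off + n] = expected[:n]
        # read phase: walk the whole buffer repeatedly and compare
        mismatches = 0
        for _ in range(passes):
            for off in range(0, size_bytes, chunk):
                n = min(chunk, size_bytes - off)
                if buf[off:off + n] != expected[:n]:
                    mismatches += 1
        return mismatches
    finally:
        buf.close()
```

to cover most of a node's memory one would run one such process per NUMA
node with a size close to the free memory on that node.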

- how about layout of memory -> dimms?  take a single page, for example:
I presume that the first cacheline (16B) will be "striped" across both 
channels of one bank (for instance, the first dimm-pair.)  is it normal
for the 17th byte to begin on the next dimm-pair (csrow)?  dmidecode 
seems to indicate that 8 1GB dimms are mapped to contiguous addresses
(which would imply no channel interleaving, which is wrong...)
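
to spell out the two layouts being contrasted, here is a toy model.  it is
purely hypothetical: the real mapping is set by the memory controller
configuration, and the 16-byte stripe and the function names are assumptions
taken from the question, not from any datasheet.

```python
def pair_contiguous(addr, pair_size_bytes):
    """DIMM-pair index if each pair owns one contiguous address range."""
    return addr // pair_size_bytes


def pair_interleaved(addr, n_pairs, stripe_bytes=16):
    """DIMM-pair index if consecutive stripes rotate across the pairs."""
    return (addr // stripe_bytes) % n_pairs


GIB = 1 << 30
# 8 x 1GB dimms grouped as 4 pairs of 2 GiB each (assumed grouping)
for addr in (0, 15, 16, 2 * GIB):
    print(addr, pair_contiguous(addr, 2 * GIB), pair_interleaved(addr, 4))
```

in this model the 17th byte (address 16) stays on pair 0 under the contiguous
layout and moves to pair 1 under the interleaved one.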

- does "numactl --hardware" work correctly for you?  I see something like:
available: 2 nodes (0-1)
node 0 size: 5375 MB
node 0 free: 3550 MB
node 1 size: 4095 MB
node 1 free: 3874 MB

that's 9470 MB total, which is unexpected on a machine with only 8x 1GB
dimms (8192 MB)...
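
the arithmetic can be checked mechanically with a small sketch that sums the
"node N size" lines.  the helper name is made up, and the sample text is
just the output quoted above.

```python
import re


def total_node_mb(numactl_text):
    """Sum the per-node sizes (in MB) from `numactl --hardware` output."""
    sizes = re.findall(r"^node \d+ size: (\d+) MB", numactl_text, re.MULTILINE)
    return sum(int(s) for s in sizes)


sample = """available: 2 nodes (0-1)
node 0 size: 5375 MB
node 0 free: 3550 MB
node 1 size: 4095 MB
node 1 free: 3874 MB
"""

installed_mb = 8 * 1024                 # 8x 1GB dimms
reported_mb = total_node_mb(sample)     # 5375 + 4095
print(reported_mb, reported_mb - installed_mb)
```

so the reported total exceeds the installed memory by 1278 MB.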

thanks for any comments,
mark hahn


