[Beowulf] ECC settings for Opteron 175 + Serverworks HT1000
ballen at gravity.phys.uwm.edu
Fri Jan 27 06:13:26 PST 2006
> Hey bruce, World is small isn't it ;))
Yup. (Actually, if you look back through the archives you'll find that
the first announcement of smartmontools was on this list.)
>> I would appreciate advice about:
>> -- how to configure these settings
>> -- pointers to relevant AMD/Serverworks documentation
>> -- relevant Linux kernel options/modules
>> -- anything else relevant/related
> You can find some documentation on this project:
> http://bluesmoke.sourceforge.net/ or the older
I've been corresponding off-list with Mark Langsdorf. He's an AMD
employee who works on Linux tools and implementation, hangs out on the
LKML, and submits kernel patches from AMD. Mark said that the 'bluesmoke'
functionality is only needed with 2.4 kernels. With 2.6 kernels you just
install 'mcelog' and that's everything that's needed.
Mark also said that the mapping between CPUID and chipid needs to be
correlated with DIMM slot on a case-by-case basis. One way (which Mark
does NOT recommend!) is to heat each DIMM with a heat gun, or mask off a
single bit on the connector, to generate errors from that DIMM. This
makes sense for people on this list who will have dozens or hundreds of
the same box and want to understand this relationship.
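Once that correlation has been worked out for one box, it can be kept as a small lookup table and reused across identical nodes. The following is only a sketch: I am reading "chipid" as the chip-select row, and the slot labels, the `DIMM_SLOT` table, and the helper names are hypothetical placeholders, not data for any real motherboard.

```python
from collections import Counter

# Hypothetical mapping, filled in by hand for one motherboard model:
# (CPU node, chip-select row) -> silkscreen label of the DIMM slot.
# These entries are placeholders, not measurements from a real board.
DIMM_SLOT = {
    (0, 0): "DIMM1A",
    (0, 1): "DIMM1A",
    (0, 2): "DIMM2A",
    (0, 3): "DIMM2A",
}

def slot_for(node, csrow):
    """Return the slot label for a reported error, or a marker if unmapped."""
    return DIMM_SLOT.get((node, csrow),
                         "unmapped(node=%d,csrow=%d)" % (node, csrow))

def tally_by_slot(errors):
    """Count errors per DIMM slot from an iterable of (node, csrow) pairs."""
    return Counter(slot_for(node, csrow) for node, csrow in errors)
```

Feeding the decoded (node, csrow) pairs from a node's error log into `tally_by_slot` would then point at the physical DIMM to pull.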
> EDAC seems to be on its way to being integrated upstream.
> This looks like preliminary work, but you may give it a try. I
> don't know your configuration, but "drivers/edac/amd76x_edac.c" may
> match. I didn't have time to test EDAC, but if you do, I'm interested in
> your results.
I'll report back to the list whether mcelog is enough, or whether we also
needed to install other drivers to get ECC reporting.
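For anyone who does try the EDAC route, a minimal sketch of polling its error counters follows. It assumes an EDAC driver is loaded and exposes per-controller `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/`; whether a driver exists for this chipset, and whether a given kernel uses this layout, is an assumption to check. The function name `read_edac_counts` is my own.

```python
import glob
import os

def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    ce_count = corrected errors, ue_count = uncorrected errors.
    A value of None means the file was missing or unreadable.
    """
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                with open(os.path.join(mc_dir, name)) as f:
                    entry[name] = int(f.read().strip())
            except (OSError, ValueError):
                entry[name] = None
        counts[os.path.basename(mc_dir)] = entry
    return counts

if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("no EDAC memory controllers found (driver not loaded?)")
    for mc, entry in found.items():
        print(mc, "corrected:", entry["ce_count"],
              "uncorrected:", entry["ue_count"])
```

Run from cron on each node, a nonzero corrected count is an early hint that a DIMM is going bad.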
Mark also provided advice about the other ECC settings. I'll copy it
verbatim to the list. Mark wrote:
You'll want to look at chapter 3 (Memory System) of the BKDG (AMD 64 BIOS
AND KERNEL DEVELOPERS GUIDE). Here's the recommended settings:
  MCA DRAM ECC logging
  ECC Chip Kill
      Enable if using x4 DIMMs
  DRAM Scrub Redirect
  DRAM BG Scrub
      set as high as possible (84 ms is maximum)
  L2 Cache BG Scrub
      not DRAM related
  Data Cache BG Scrub
      not DRAM related
[Note from Bruce: can anyone on the list make recommendations about these
last two, non-DRAM-related scrub settings?]
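To get a feel for what a DRAM BG Scrub setting means in practice, here is a rough back-of-the-envelope calculation. It assumes the setting is the interval between scrub events and that each event covers one 64-byte line; that is my reading of the BKDG and should be checked against chapter 3. The function name and the 4 GiB memory size are illustrative only.

```python
def full_scrub_seconds(mem_bytes, interval_seconds, line_bytes=64):
    """Seconds for the background scrubber to walk all of memory once.

    Assumes one scrub event per interval, each covering line_bytes bytes.
    """
    lines = mem_bytes // line_bytes
    return lines * interval_seconds

if __name__ == "__main__":
    mem = 4 * 2**30  # 4 GiB, an illustrative node size
    secs = full_scrub_seconds(mem, 0.084)  # 84 ms between scrub events
    print("full pass: %.0f s (about %.1f days)" % (secs, secs / 86400.0))
```

Under these assumptions an 84 ms interval takes roughly 65 days to cover 4 GiB, so it is worth confirming whether "as high as possible" refers to the scrub rate or to the interval.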
I also asked Mark:
> Am I correct that there is nothing in the Linux kernel which
> modifies the machine registers which determine ECC behavior,
> so I have to depend upon the BIOS to initialize/configure
> these registers as I want?
As far as I know, it's BIOS set-up only. Linux tries to avoid
knowing the details of the DRAM set-up, and there's a limit to
how much the OS can modify anyway. Linux can set bits to
determine what MCEs cause exceptions, but it can't enable the
DRAM scrubber, for example.