[Beowulf] ECC settings for Opteron 175 + Serverworks HT1000 chipset
Bruce Allen ballen at gravity.phys.uwm.edu
Fri Jan 27 06:13:26 PST 2006
- Previous message: RS: [Beowulf] about clusters in high schools
- Next message: [Beowulf] ECC settings for Opteron 175 + Serverworks HT1000 chipset
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Salut Velu!

> Hey bruce, World is small isn't it ;))

Yup. (Actually if you look back through the archives you'll find the first announcement of smartmontools was on this list.)

>> [...]
>> I would appreciate advice about:
>> -- how to configure these settings
>> -- pointers to relevant AMD/Serverworks documentation
>> -- relevant Linux kernel options/modules
>> -- anything else relevant/related

> You can find some documentation on this project:
> http://bluesmoke.sourceforge.net/ or the older
> http://www.anime.net/~goemon/linux-ecc/

I've been corresponding off-list with Mark Langsdorf. He's an AMD employee who works on Linux tools and implementation, hangs out on the LKML, and submits kernel patches from AMD.

Mark said that the 'bluesmoke' functionality is only needed with 2.4 kernels. With 2.6 kernels you just install 'mcelog' and that's everything that's needed.

Mark also said that the mapping between CPUID and chipid needs to be correlated with DIMM slot on a case-by-case basis. One way (which Mark does NOT recommend!) is to heat each DIMM with a heat gun, or mask off a single bit on the connector, to generate errors from that DIMM. This makes sense for people on this list who will have dozens or hundreds of the same box and want to understand this relationship.

> EDAC sounds to be on the way to be integrated upstream
> (http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=806c35f5057a64d3061ee4e2b1023bf6f6d328e2).
> This sounds to be some preliminary work but you may give it a try. I
> don't know your configuration but the "drivers/edac/amd76x_edac.c" may
> match. I didn't have time to test EDAC but if you will, I'm interested in
> your results.

I'll report back to the list whether mcelog is enough, or whether we also need to install other drivers to get ECC reporting.

Mark also provided advice about the other ECC settings. I'll copy it verbatim to the list.
Mark wrote:

You'll want to look at chapter 3 (Memory System) of the BKDG (AMD 64 BIOS AND KERNEL DEVELOPERS GUIDE). Here's the recommended settings:

ECC enable              Enable
MCA DRAM ECC logging    Enable
ECC Chip Kill           Enable if using x4 DIMMs
DRAM Scrub Redirect     Enable
DRAM BG Scrub           set as high as possible (84 ms is maximum)
L2 Cache BG Scrub       not DRAM related
Data Cache BG Scrub     not DRAM related

[Note from Bruce: can anyone on the list make recommendations about these last two, non-DRAM-related SCRUB settings??]

I also asked Mark:

> Am I correct that there is nothing in the Linux kernel which
> modifies the machine registers which determine ECC behavior,
> so I have to depend upon the BIOS to initialize/configure
> these registers as I want?

He replied:

As far as I know, it's BIOS set-up only. Linux tries to avoid knowing the details of the DRAM set-up, and there's a limit to how much the OS can modify anyway. Linux can set bits to determine what MCEs cause exceptions, but it can't enable the DRAM scrubber, for example.

Cheers,
Bruce
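For anyone setting up dozens or hundreds of identical boxes, the recommended settings above could be kept as a small checklist and compared against what each node's BIOS setup screen shows. The following is only a sketch under stated assumptions: the dictionary keys are the labels from Mark's list (a given BIOS may word them differently), the observed values are assumed to be typed in by hand, and `check_settings` is a hypothetical helper name, not part of any existing tool. The DRAM BG Scrub interval is left out because the advice for it is not a single on/off value.

```python
# Recommended on/off values as relayed from Mark in the message above.
# Keys are the labels from that list; a given BIOS may label them
# differently (assumption).
RECOMMENDED = {
    "ECC enable": "Enable",
    "MCA DRAM ECC logging": "Enable",
    "ECC Chip Kill": "Enable",  # applies only when using x4 DIMMs
    "DRAM Scrub Redirect": "Enable",
}


def check_settings(observed, x4_dimms=True):
    """Return a list of (setting, expected, found) for each mismatch.

    `observed` is a dict of values read off the BIOS setup screen by hand.
    A setting missing from `observed` is reported with found=None.
    """
    mismatches = []
    for name, expected in RECOMMENDED.items():
        # Chip Kill is only recommended with x4 DIMMs, so skip it otherwise.
        if name == "ECC Chip Kill" and not x4_dimms:
            continue
        found = observed.get(name)
        if found != expected:
            mismatches.append((name, expected, found))
    return mismatches


if __name__ == "__main__":
    # Hypothetical values for one node, for illustration only.
    node = {
        "ECC enable": "Enable",
        "MCA DRAM ECC logging": "Disable",
        "ECC Chip Kill": "Enable",
        "DRAM Scrub Redirect": "Enable",
    }
    for name, expected, found in check_settings(node):
        print(f"{name}: expected {expected}, found {found}")
```

Since the ECC registers appear to be BIOS set-up only, a check like this can flag a misconfigured node but cannot fix it from Linux.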
