Take any two: motherboard performance, compatibility, value
Don Holmgren djholm at fnal.gov
Wed Jun 28 17:06:33 PDT 2000

On Wed, 28 Jun 2000, Bob Drzyzgula wrote:

...
> > > BTW, I see that ECC corrects about one single bit error per month in
> > > 12GB of RAM. Our total system will have close to 40GB, so errors could
> > > pop up weekly, which is why we need ECC.
> >
> > Are you absolutely certain that ECC RAM on PC hardware actually *corrects*
> > bit errors ?
> >
> > There was a short discussion on this subject on the linux-kernel list some
> > weeks ago, where someone stated that ECC RAM (for PCs) can only *detect* a
> > parity error and offer you an NMI when that occurs. No one seemed to object
> > to this.
>
> The last thing I am is an expert on this, but, quoting
> Intel's 440BX web page at
>
> http://developer.intel.com/design/intarch/techinfo/440BX/BX_arch.htm
>
> ] The Intel® 440BX AGPset also provides DIMM plug-and-play
> ] support via Serial Presence Detect (SPD) mechanism using
> ] the SMBus interface. The 82443BX provides optional
> ] data integrity features including ECC in the memory
> ] array. During reads from DRAM, the 82443BX provides
> ] error checking and correction of the data. The 82443BX
> ] supports multiple-bit error detection and single-bit error
> ] correction when ECC mode is enabled and single/multi-bit
> ] error detection when correction is disabled. During
> ] writes to the DRAM, the 82443BX generates ECC for the
> ] data on a QWord basis. Partial QWord writes require a
> ] read-modify-write cycle when ECC is enabled.
>
> In these PC architectures, I don't think that there is any
> ECC generation on-module like there is in some architectures,
> there is only sufficient bit storage to allow the chipset
> to generate the somewhat-redundant codes and store those.
>
> Whether the motherboard manufacturers, BIOS writers and
> operating systems configure the chipset properly to take
> advantage of this, or do anything interesting with any
> information provided by the chipset is another matter
> entirely.
> I would expect, for example, that the chipset
> would raise some sort of alert if a single-bit ECC error
> was detected and corrected; certainly the OS would want
> to log such an event. Depending on the motherboard, BIOS
> and OS, it would certainly be possible to treat such an
> alert exactly the same as one would treat a double-bit
> error, or a single-bit error when ECC is turned off,
> e.g. NMI. It's also possible, I suppose, that the ECC
> generation and detection in the 443BX doesn't work worth
> a damn and thus most 440BX designs leave it turned off.
> I have no reason to believe this is true, however.
>
> FWIW.
>
> --Bob Drzyzgula

When we ran into some memory problems on 440BX- and 440GX-based systems,
I dug into the Intel PCI chipset manuals and wrote some code to dump the
information from the memory controller registers.

The extra 8 bits available on memory with parity - 72 bits wide, rather
than 64 bits (interesting that this is now marketed as ECC memory; a
couple of years ago it was sold as parity memory) - are indeed used by the
memory controller to do the ECC calculations and corrections. No
additional circuitry is needed on the DIMMs.

Single bit errors are all corrected transparently to the microprocessor.
Multibit errors are not correctable, and if so configured the chipset can
issue an NMI. On Linux this NMI results in the "dazed and confused"
console message:

   "Uhhuh. NMI received. Dazed and confused, but trying to continue"

We have a critical application which can't tolerate data errors, and so
have patched the NMI trap to reboot the system immediately following a
multibit error.

The memory controller has a couple of registers used to indicate whether
bit errors have been detected - a flag for a single bit error, a flag for
a multiple bit error, and the page where the error occurred. This
information is latched at the first error.
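
As an illustration of the latched flags just described, here is a minimal sketch of decoding such an error register. It is not taken from the programs in this post. The bit positions, macro names and struct are assumptions made for the example, and should be checked against the 82443BX or 82443GX host bridge manual before use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMED layout of the latched error register (verify in the Intel manual):
 *   bit 0       single bit error seen (corrected by the controller)
 *   bit 1       multiple bit error seen (not correctable)
 *   bits 31:12  address of the 4 KB page holding the first latched error
 */
#define ERR_SBE_FLAG  0x00000001u
#define ERR_MBE_FLAG  0x00000002u
#define ERR_PAGE_MASK 0xFFFFF000u

/* Hypothetical container for the decoded fields. */
struct bit_error_status {
    bool     single_bit;  /* a single bit error was latched   */
    bool     multi_bit;   /* a multiple bit error was latched */
    uint32_t page_addr;   /* page address of the first error  */
};

/* Decode a raw 32-bit register value that was read from the host bridge. */
void decode_error_register(uint32_t raw, struct bit_error_status *out)
{
    assert(out != NULL);
    out->single_bit = (raw & ERR_SBE_FLAG) != 0;
    out->multi_bit  = (raw & ERR_MBE_FLAG) != 0;
    out->page_addr  = raw & ERR_PAGE_MASK;
}
```

On Linux the raw value would typically be read as root from the host bridge's PCI configuration space (for example through /proc/bus/pci); that part is left out here so the decoding can be tried without the hardware.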

At

   ftp://linux-rep.fnal.gov/pub/motherboards/

I have 3 programs you can use to query the controller:

   chip2.c - dumps lots of information, such as CAS/RAS timings, which
      DIMM slot(s) are populated, how large the DIMMs are, whether each
      DIMM is ECC-capable or not, and whether and where bit errors have
      occurred.

   biterror_check.c - checks and reports whether or not a single or
      multiple bit error has occurred, and the page of the occurrence.
      Remember, this information is latched, so multiple errors may have
      occurred subsequent to the first.

   biterror_reset.c - checks and reports whether or not a single or
      multiple bit error has occurred, and the page of the occurrence.
      Also resets the error flags.

On my motherboards there's always a single bit error after a reboot, so I
suspect the BIOS causes one to happen when sizing memory. So, I usually
do a biterror_reset during system startup.

On the systems we're currently monitoring - 20 L440GX+ motherboards with
512 MB of memory each - single bit errors are extremely rare. Perhaps 1
per month of operation across all of the machines. I've not seen a
multiple bit error since replacing memory last January.

To interpret the output of chip2.c you'll need the 82443BX or 82443GX
host bridge manual from Intel.

Don Holmgren
Fermilab
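
The error rates quoted in this thread (about 1 single bit error per month in 12GB, and about 1 per month across 20 machines with 512 MB each) can be extrapolated to other memory sizes. The sketch below assumes that errors scale linearly with the amount of RAM and that a month is 30 days; both are simplifications, and the function names are made up for the example.

```c
#include <assert.h>

/* Scale an observed error rate to a different memory size.
 * ASSUMPTION: errors are proportional to the amount of RAM installed. */
double errors_per_month(double observed_per_month, double observed_gb,
                        double target_gb)
{
    assert(observed_gb > 0.0);
    return observed_per_month / observed_gb * target_gb;
}

/* Average number of days between errors.
 * ASSUMPTION: a month is taken as 30 days. */
double days_between_errors(double per_month)
{
    assert(per_month > 0.0);
    return 30.0 / per_month;
}
```

With the quoted figures, errors_per_month(1.0, 12.0, 40.0) comes to about 3.3 errors per month for a 40GB system, or roughly one every 9 days, which is in line with the "weekly" estimate quoted above.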
More information about the Beowulf mailing list
