intermittent crashing of programs

Daniel Kidger Daniel.Kidger at quadrics.com
Thu Feb 21 09:46:00 PST 2002


Donald Becker wrote:
>I think of parity errors being connected to NMI as being an obscure
>legacy part of the PC architecture, much like the "A20" line being
>switched by the keyboard controller.  If the backwards compatibility
>broke, no one would notice.



Nope, not legacy. Just look, for example, at any brand-new Dell Pentium 4
system with RAMBUS ECC memory.

Any multibit error generates an NMI.

Single-bit errors in ECC memory get spotted by the BIOS too, but the O/S will
not be told, since they are corrected on the fly by the hardware when the
data is read. Hence 'memtest' will never detect these single-bit errors.

The other thing to get is 'ecc.o'. This is a kernel module that polls the
motherboard chipset every second. It shows the single-bit and multibit
errors in /proc/ram and collates them by memory bank.

e.g.
[dan at fridge8]$ cat /proc/ram
Chipset ECC capability : ECC detection and correction
Current ECC mode : ECC detection and correction
Bank    Size    Type    ECC     SBE     MBE
0       256M    RMBS    Y       202758  0
1       256M    RMBS    Y       0       5
2       256M    RMBS    Y       0       2
3       256M    RMBS    Y       0       0
4       256M    RMBS    Y       0       0
5       256M    RMBS    Y       0       257
6       256M    RMBS    Y       0       0
7       256M    RMBS    Y       0       0
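
If you want to watch for this across many nodes, the table is easy to parse. Below is a rough Python sketch, not part of ecc.o itself. It assumes /proc/ram always has exactly the layout shown above (a "Bank" header line followed by six whitespace-separated columns); other versions of the module may print something different. The names parse_ecc_banks and banks_with_errors are made up for illustration.

```python
# Rough sketch: parse ecc.o-style /proc/ram text and report banks with errors.
# ASSUMPTION: the format matches the sample output in this post exactly.

SAMPLE = """\
Chipset ECC capability : ECC detection and correction
Current ECC mode : ECC detection and correction
Bank    Size    Type    ECC     SBE     MBE
0       256M    RMBS    Y       202758  0
1       256M    RMBS    Y       0       5
2       256M    RMBS    Y       0       2
3       256M    RMBS    Y       0       0
4       256M    RMBS    Y       0       0
5       256M    RMBS    Y       0       257
6       256M    RMBS    Y       0       0
7       256M    RMBS    Y       0       0
"""


def parse_ecc_banks(text):
    """Return one dict per memory bank row found after the 'Bank' header."""
    banks = []
    in_table = False
    for line in text.splitlines():
        fields = line.split()
        if fields[:1] == ["Bank"]:
            # Header row: data rows follow from here on.
            in_table = True
            continue
        if in_table and len(fields) == 6 and fields[0].isdigit():
            banks.append({
                "bank": int(fields[0]),
                "size": fields[1],
                "type": fields[2],
                "ecc": fields[3] == "Y",
                "sbe": int(fields[4]),  # single-bit (corrected) errors
                "mbe": int(fields[5]),  # multibit (uncorrectable) errors
            })
    return banks


def banks_with_errors(banks):
    """Keep only banks that have logged at least one SBE or MBE."""
    return [b for b in banks if b["sbe"] > 0 or b["mbe"] > 0]


if __name__ == "__main__":
    # On a real node you would read the text from /proc/ram instead,
    # provided the ecc.o module is loaded and uses this format.
    for b in banks_with_errors(parse_ecc_banks(SAMPLE)):
        print("bank %d: SBE=%d MBE=%d" % (b["bank"], b["sbe"], b["mbe"]))
```

Run from cron on each node, something like this makes it easy to spot which bank to pull before the multibit errors start crashing jobs.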



Yours,
Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------



More information about the Beowulf mailing list