[Beowulf] mcelog output, interpretation?

David Mathog mathog at caltech.edu
Mon Aug 18 15:50:34 PDT 2008



> > /var/log/messages.  One of them had 29 machine checks logged, all of
> > them variants of this:
> 
> where the variation encompassed nearby addresses?

If by address you are referring to the ADDR line, the values there are:

MISC c0080e8300000000 ADDR 51ad6e780 
MISC c0080e8400000000 ADDR 512b6a7b0 
MISC c0080e8600000000 ADDR 52496e7a0 
MISC c0080e8a00000000 ADDR 4f4f6e780 
MISC c0080e8b00000000 ADDR 4f116a7b0 
MISC c0080e8f00000000 ADDR 4ee66a7a0 
MISC c0080e9000000000 ADDR 4f426a780 
MISC c0080e9400000000 ADDR 4fef6e7a0 
MISC c0080e9600000000 ADDR 50e06a780 
MISC c0080e9c00000000 ADDR 4e856e780 
MISC c0080e9d00000000 ADDR 502a6e7a0 
MISC c0080ea100000000 ADDR 4f0a6a780 
MISC c0080ea300000000 ADDR 522e6e790 
MISC c0080eb800000000 ADDR 4e386e780 
MISC c0080ec000000000 ADDR 4e1d6e790 
MISC c0080ec300000000 ADDR 4b1f6e780 
MISC c0080ec500000000 ADDR 4c4d6e7a0 
MISC c0080ec600000000 ADDR 4c296e7a0 
MISC c0080ec800000000 ADDR 502e6a7a0 
MISC c0080ecb00000000 ADDR 4c4d6e780 
MISC c0080ece00000000 ADDR 4c4d6e780 
MISC c0080ed200000000 ADDR 4f0e6e780 
MISC c0080ed300000000 ADDR 4c296e7a0 
MISC c0080ed500000000 ADDR 502e6a7a0 
MISC c0080edc00000000 ADDR 502e6a780 
MISC c0080ee000000000 ADDR 502e6a790 
MISC c0080ee100000000 ADDR 4c296e7a0 
MISC c0080ee200000000 ADDR 50de6e7a0 
MISC c0080ee300000000 ADDR 50236e7a0 
MISC c0080ee400000000 ADDR 504d6e7a0 

The MISC part seems to be mostly just counting up; maybe it is a pointer
to the storage for the error message.  The ADDR part is pretty tightly
constrained, extending only from 4b1f6e780 to 52496e7a0, in no
particular order.  Two addresses appear twice, one appears three times,
and the rest are unique.
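
For anyone who wants to repeat that tally, here is a minimal Python
sketch.  It only re-derives the range and the repeat counts from the
ADDR values listed above.  The names ADDRS and summarize are made up for
this illustration and are not part of mcelog, and the values are treated
as plain hex numbers, nothing more.

```python
from collections import Counter

# ADDR values copied from the list above, in the order they were logged.
ADDRS = """
51ad6e780 512b6a7b0 52496e7a0 4f4f6e780 4f116a7b0 4ee66a7a0
4f426a780 4fef6e7a0 50e06a780 4e856e780 502a6e7a0 4f0a6a780
522e6e790 4e386e780 4e1d6e790 4b1f6e780 4c4d6e7a0 4c296e7a0
502e6a7a0 4c4d6e780 4c4d6e780 4f0e6e780 4c296e7a0 502e6a7a0
502e6a780 502e6a790 4c296e7a0 50de6e7a0 50236e7a0 504d6e7a0
""".split()


def summarize(addrs):
    """Return (lowest, highest, repeats) for a list of hex address strings.

    repeats maps each address seen more than once to its count.
    """
    values = [int(a, 16) for a in addrs]
    counts = Counter(values)
    repeats = {hex(v): n for v, n in counts.items() if n > 1}
    return min(values), max(values), repeats


low, high, repeats = summarize(ADDRS)
print(f"entries: {len(ADDRS)}")
print(f"range:   {low:x} .. {high:x}")
for addr, n in sorted(repeats.items()):
    print(f"repeat:  {addr} seen {n} times")
```

Feeding it ADDR values pulled from a different log only requires
replacing the contents of ADDRS.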

> 
> > These had built up at about 1 per month over the last couple of years.
> 
> 1/month is not a concern, IMO.  the main reason to run mcelog is to
> avoid the situation of having enough corrected mce's that you run 
> into uncorrectable or undetected ones.  1/month is a very low rate.

These IBM systems automatically disable memory banks that fail an ECC
correction, and they light a lamp on the console when that happens.
On this unit these errors were all corrected, no memory was disabled,
and no lamp was lit.  So I tend to agree with you that this is not too
horrible a situation.

> 
> > There seems to be an issue with the Northbridge, but exactly what that
> 
> NB is used here in a very general sense - it's referring to the onchip
> memory controller, not a literal external chip.  I don't think there's 
> any involvement of video.

Gotcha - error messages that don't mean quite what they seem to.  

Thanks,


David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


