[Beowulf] Re: RAM ECC errors (Henning Fehrmann)

Mark Hahn hahn at mcmaster.ca
Tue Feb 23 12:05:39 PST 2010


> No, but there seems to be a switch in the kernel module that allows triggering
> a kernel panic upon discovering uncorrectable errors.

I suspect you mean /sys/module/edac_mc/panic_on_ue
(ue = uncorrectable error).  I consider this very much the norm:
it would be very strange to run with ECC memory, and ECC enabled,
and not actually halt on UE.  a UE represents a failure of the memory
system: not just a transient event, but something which must be
physically fixed.  even for HA situations, I'd be pretty skeptical
about using a memory channel which had any UEs on it.

CEs (corrected errors), OTOH, are very different.  they're almost just
a heartbeat of your ECC subsystem.  yes, a CE indicates some event
that needed correcting, but at a modest rate, CEs are acceptable.
there are failure modes, though, where enough CEs eventually lead to
a UE: tracking CE rate is important for that reason.  (other UE modes
don't have this warning sign...)
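
as a rough sketch of what "tracking CE rate" could look like: the script
below sums nothing and guesses nothing, it just reads the per-controller
ce_count files that the EDAC driver exposes and compares two samples.
the sysfs root used as the default, and the helper names read_counts and
ce_rate_per_hour, are assumptions for illustration; check the layout on
your own kernel before relying on it.

```python
import glob
import os


def read_counts(edac_root="/sys/devices/system/edac/mc", name="ce_count"):
    """Return {controller: count} for every mc*/<name> file under edac_root.

    The default root is an assumption about where EDAC exposes its counters.
    """
    counts = {}
    for path in sorted(glob.glob(os.path.join(edac_root, "mc*", name))):
        controller = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            counts[controller] = int(f.read().strip())
    return counts


def ce_rate_per_hour(before, after, elapsed_seconds):
    """Per-controller CE rate between two samples, in errors per hour."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    rates = {}
    for controller, count in after.items():
        delta = count - before.get(controller, 0)
        # Counters restart from zero after a reboot or driver reload,
        # so a negative delta is treated as "count since the restart".
        if delta < 0:
            delta = count
        rates[controller] = delta * 3600.0 / elapsed_seconds
    return rates
```

sample read_counts() from cron, keep the previous sample around, and
alert when a controller's rate climbs; what counts as "modest" is
site-specific, so no threshold is suggested here.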

you can set CEs to log through kernel->syslog via edac tunables in /sys.
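
to see how those tunables are currently set, something like the sketch
below would do.  the directory and file names in it are assumptions (the
EDAC core module parameters as commonly documented) and may not match the
path quoted earlier or your kernel version, so missing files are reported
as None rather than treated as errors.

```python
import os

# Assumed location and names; verify against your kernel before relying on them.
PARAM_DIR = "/sys/module/edac_core/parameters"
TUNABLES = ("edac_mc_panic_on_ue", "edac_mc_log_ue", "edac_mc_log_ce")


def read_tunables(param_dir=PARAM_DIR, names=TUNABLES):
    """Return {name: value string, or None if the file is absent or unreadable}."""
    values = {}
    for name in names:
        try:
            with open(os.path.join(param_dir, name)) as f:
                values[name] = f.read().strip()
        except OSError:
            values[name] = None
    return values
```

changing a value is a matter of writing 0 or 1 to the corresponding file
as root, assuming the file exists and is writable on your system.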

> Yes, but the memory of any process might get corrupted, thus this is more to

if UE is set to panic, the machine halts before any process can go on using
the bad data, so nothing gets silently corrupted (that's really the point, eh?)


