[Beowulf] Re: RAM ECC errors (Henning Fehrmann)
Henning Fehrmann henning.fehrmann at aei.mpg.de
Tue Feb 23 23:30:31 PST 2010
- Previous message: [Beowulf] Arima motherboards with SATA2 drives
- Next message: [Beowulf] Re: RAM ECC errors (Henning Fehrmann)
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi Mark,

On Tue, Feb 23, 2010 at 03:05:39PM -0500, Mark Hahn wrote:
> >No, but there seem to be a switch in the kernel module that allows to trigger
> >a kernel panic upon discovering uncorrectable errors.
>
> I suspect you mean /sys/module/edac_mc/panic_on_ue
> (ue = uncorrected error). I consider this very much the norm:
> it would be very strange to run with ECC memory, and ECC enabled,
> and not actually halt on UE. UE represents a failure of the memory
> system, not just a transient event, but something which must be
> physically fixed. even for HA situations, I'd be pretty skeptical
> about using a memory channel which had any UE's on it.

Strangely enough, panic_on_ue is off by default.

>
> CE (corrected errors) OTOH, are very different. they're almost just
> a heartbeat of your ECC subsystem. yes, a CE indicates some event
> that needed correcting, but at a modest rate, CEs are acceptable.
> there are failure modes, though, where enough CEs eventually cause a
> UE: tracking CE rate is important for that reason. (other UE modes
> don't have this warning sign...)

On some apparently broken hardware we see a rate of nearly one event
per second. I assume the probability of an uncorrectable error is a few
orders of magnitude smaller than the rate of correctable errors, since
more events have to occur simultaneously. And hopefully, the rate of
silent corruption is smaller still.

>
> you can set CEs to log through kernel->syslog via edac tunables in /sys.
>
> >Yes, but the memory of any process might get corrupted, thus this is more to
>
> if UE is set to panic, nothing will get corrupted (that's really the point eh?)

Correct, but it helps rule out other reasons for job failures.

Cheers,
Henning
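For anyone who wants to track the CE rate discussed above, here is a minimal sketch. It assumes the usual EDAC sysfs layout, with per-controller ce_count and ue_count files under /sys/devices/system/edac/mc/mc*; that path is not taken from this thread and may differ with kernel version or driver. The function names and the sampling interval are made up for illustration.

```python
import glob
import os
import time

# Assumed EDAC sysfs layout; adjust if your kernel exposes it elsewhere.
EDAC_MC_GLOB = "/sys/devices/system/edac/mc/mc*"


def read_counts(mc_glob=EDAC_MC_GLOB):
    """Return {controller: (ce_count, ue_count)} for each controller found."""
    counts = {}
    for mc_dir in sorted(glob.glob(mc_glob)):
        try:
            with open(os.path.join(mc_dir, "ce_count")) as f:
                ce = int(f.read().strip())
            with open(os.path.join(mc_dir, "ue_count")) as f:
                ue = int(f.read().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[os.path.basename(mc_dir)] = (ce, ue)
    return counts


def ce_rate(before, after, seconds):
    """Corrected errors per second per controller between two samples."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    rates = {}
    for mc, (ce_after, _ue_after) in after.items():
        if mc in before:
            # Clamp at zero in case the counters were reset between samples.
            delta = max(ce_after - before[mc][0], 0)
            rates[mc] = delta / seconds
    return rates


def sample_rate(interval=60.0, mc_glob=EDAC_MC_GLOB):
    """Take two samples `interval` seconds apart and return the CE rates."""
    before = read_counts(mc_glob)
    start = time.monotonic()
    time.sleep(interval)
    after = read_counts(mc_glob)
    return ce_rate(before, after, time.monotonic() - start)
```

Calling sample_rate() on a node returns a dictionary of CE events per second per memory controller. What rate counts as "too high" is a site decision; the roughly one event per second mentioned above would presumably stand out against any sensible threshold.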
