[Beowulf] Re: RAM ECC errors (Henning Fehrmann)

Carsten Aulbert carsten.aulbert at aei.mpg.de
Mon Feb 22 22:33:38 PST 2010


Hi David

replying also on Henning's behalf

On Monday 22 February 2010 21:30:38 David Mathog wrote:
> 
> Are you saying that now that you are monitoring you are seeing kernel
> panics which did not appear before?
> 

No, but there seems to be a switch in the kernel module that lets it trigger
a kernel panic when it discovers an uncorrectable error.
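
To make that concrete, here is a minimal sketch of reading and flipping such
a switch through sysfs. It assumes the module is EDAC's edac_core and that
the switch is its edac_mc_panic_on_ue parameter. That is my assumption and
not something confirmed in this thread, so check the path on your own kernel.
The helper names read_switch and set_switch are made up for illustration.

```python
from pathlib import Path

# ASSUMPTION: the switch is edac_core's "edac_mc_panic_on_ue" parameter.
# The path may differ or be missing depending on kernel and loaded modules.
PANIC_ON_UE = Path("/sys/module/edac_core/parameters/edac_mc_panic_on_ue")


def read_switch(path=PANIC_ON_UE):
    """Return the parameter's current value as a string, or None if absent."""
    try:
        return Path(path).read_text().strip()
    except FileNotFoundError:
        return None


def set_switch(enabled, path=PANIC_ON_UE):
    """Write 1 (panic on uncorrectable error) or 0 to the parameter.

    Writing to the real sysfs file needs root.
    """
    Path(path).write_text("1\n" if enabled else "0\n")
```

Calling read_switch() with no argument reports the current setting, and
set_switch(True) as root would turn the panic behaviour on until the next
reboot or module reload.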

> You can get some information through netconsole, but you know that already.
> 

Yup, already running. The question is whether a kernel panic would also be
fully visible via netconsole. We are glad that we rarely have those ;)

> Well, you could log process start/stops and flush them to disk or syslog
> them, so that at least when the system crashes it would be possible to
> derive a list of everything that was still running.  Doubt this will
> help much though, since the most likely culprit is a bad stick of
> memory, in which case the netconsole or IPMI or MCE messages may be
> enough to figure out which stick is the problem.  That is, whichever
> process triggered it is probably an innocent bystander.

Yes, but the memory of any process might get corrupted, so this is more about
learning which users are currently running jobs. That in turn enables us to
notify these users that the particular machine running their jobs had a
problem, and that a user might need to re-run her jobs to keep "false" data
out of her results.
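
A rough sketch of what such bookkeeping could look like: scan /proc, count
processes per owning user, and append one timestamped line to a log that is
fsync'ed before returning. The function names and the log format are
hypothetical, and this is only one possible approach, not what we actually
run.

```python
import os
import pwd
import time
from collections import Counter


def users_with_processes(proc_root="/proc"):
    """Count running processes per owner by scanning PID dirs in proc_root."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric entries are PID directories
        try:
            uid = os.stat(os.path.join(proc_root, entry)).st_uid
        except OSError:
            continue  # process exited while we were scanning
        try:
            name = pwd.getpwuid(uid).pw_name
        except KeyError:
            name = str(uid)  # no passwd entry; fall back to the numeric uid
        counts[name] += 1
    return counts


def write_snapshot(log_path, proc_root="/proc"):
    """Append one 'timestamp user=count ...' line and force it to disk."""
    counts = users_with_processes(proc_root)
    fields = " ".join("%s=%d" % (user, n) for user, n in sorted(counts.items()))
    line = "%d %s\n" % (time.time(), fields)
    with open(log_path, "a") as log:
        log.write(line)
        log.flush()
        os.fsync(log.fileno())  # ask the kernel to write the data out now
    return line
```

Run from cron or a small loop, the last line in the log gives the users to
notify after a crash. Writing to a remote syslog instead would avoid
depending on the local disk of the node that just went down.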

Cheers

Carsten