[Beowulf] Re: RAM ECC errors

Henning Fehrmann henning.fehrmann at aei.mpg.de
Tue Feb 23 23:20:43 PST 2010


Hi David,

Thank you for the response.

> Carsten Aulbert wrote:
> > > Are you saying that now that you are monitoring you are seeing kernel
> > > panics which did not appear before?
> > > 
> > 
> > No, but there seem to be a switch in the kernel module that allows to
> > trigger a kernel panic upon discovering uncorrectable errors.
> 
> By "switch" do you mean:
> A. There is an option that may be set when that module is loaded which
> will then cause it to panic on an uncorrectable error, where normally it
> would not.
> B. There has been a change in the module code between kernel versions
> that causes it to panic now on events where it formerly did not panic.

It is A. There is a module parameter for edac_core:
edac_mc_panic_on_ue=1. We have not tested it yet since uncorrectable
errors rarely occur. 
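
A minimal sketch of how one might check on a node whether the parameter is active and whether any errors have been counted so far. The sysfs paths below are assumptions based on the usual EDAC layout and may differ between kernel versions; the helper names are purely illustrative.

```python
from pathlib import Path

# Assumed sysfs locations (not verified on our kernels).
PARAM = Path("/sys/module/edac_core/parameters/edac_mc_panic_on_ue")
MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_int(path):
    """Return the integer stored in a sysfs-style file, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def edac_status(param=PARAM, mc_root=MC_ROOT):
    """Collect the panic-on-UE setting and per-controller error counts."""
    mc_root = Path(mc_root)
    controllers = {}
    if mc_root.is_dir():
        # Memory controllers are expected to show up as mc0, mc1, ...
        for mc in sorted(mc_root.glob("mc[0-9]*")):
            controllers[mc.name] = {
                "ce_count": read_int(mc / "ce_count"),  # correctable errors
                "ue_count": read_int(mc / "ue_count"),  # uncorrectable errors
            }
    return {"panic_on_ue": read_int(param), "controllers": controllers}


if __name__ == "__main__":
    status = edac_status()
    print("edac_mc_panic_on_ue:", status["panic_on_ue"])
    for name, counts in status["controllers"].items():
        print(name, counts)
```

A value of None for panic_on_ue would suggest that edac_core is not loaded or that the parameter lives elsewhere on that kernel.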

> 
> > > You can get some information through netconsole, but you know that
> > > already.
> > > 
> > 
> > Yup already running, question is if a kernel panic would also be fully
> > visible via netconsole - we are glad that we rarely have those ;)
> 
> I have seen one kernel panic since turning on netconsole, and it did log
> across the network and showed up in /var/log/messages as it was supposed
> to, with the same information presented as in the tests.  Limited data,
> but it would seem the answer is "at least sometimes".

I got a hint from one of the kernel developers. Calling the show_state()
function in panic.c right before dump_stack() should print process
information via printk, which could then be collected with netconsole.
We are still waiting for a UE event.
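
On the receiving side, the printk output arrives as plain UDP datagrams. A rough sketch of a collector follows, assuming netconsole sends to UDP port 6666 (its usual default target port; the actual port depends on how the module was configured). The function names are illustrative, and in practice syslog or netcat does the same job.

```python
import socket
import sys


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket for incoming netconsole datagrams (port is an assumption)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def collect(sock, out, max_packets=None):
    """Write each received line to `out`, prefixed with the sender's IP address.

    Runs forever unless max_packets is given; returns the number of datagrams read.
    """
    seen = 0
    while max_packets is None or seen < max_packets:
        data, (sender, _port) = sock.recvfrom(65535)
        text = data.decode("utf-8", errors="replace")
        for line in text.splitlines():
            out.write(f"{sender} {line}\n")
        out.flush()  # keep the log current in case the collector itself dies
        seen += 1
    return seen


if __name__ == "__main__":
    collect(open_listener(), sys.stdout)
```

Prefixing each line with the sender address makes it easier to tell afterwards which node produced the panic output.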

> 
> > Yes, but the memory of any process might get corrupted, thus this is
> > more to learn which user is currently running jobs. Which in turn
> > enables us to notify these users that this particular machine running
> > these jobs had a problem and the user might need to re-run her jobs to
> > prevent "false" data entering her job.
> 
> If the node blows up presumably the output of all the jobs currently
> running there will clearly indicate that there was a failure - so you
> should not have to notify those users since they will see the problem in
> their results.  (Unless MPI, or PVM, or whatever is being used to spread
> jobs around, ignores fatal errors, which should never be the case.)  For
> jobs which completed earlier on the same node, this would have been
> before an uncorrectable error took place, so the results should be OK.  

Yes, this is correct. A panic should be enough to avoid corrupted data.
However, jobs often fail for other reasons. A process list might help
us to rule out other causes of a job failure. It makes the work a bit
more convenient.


Cheers,
Henning


