[Beowulf] ECC Memory and Job Failures

Fri Apr 24 09:03:04 PDT 2009

On Fri, Apr 24, 2009 at 12:49 AM, John Hearns <hearnsj at googlemail.com> wrote:

> 2009/4/23 Nifty Tom Mitchell <niftyompi at niftyegg.com>:
> > On Thu, Apr 23, 2009 at 04:45:08PM +0100, Huw Lynes wrote:
> >
> > IMO Running on a large cluster without multiple bit detection and a
> > minimum of one bit correction ECC is silly.
> >
> > Further running without watching the ECC logs is also silly.  Watching
> > the logs can be hard to do.
>
> Yes indeed.
> At the risk of being an SGI fanboy again, obviously SGI Altix systems
> keep excellent logs of hardware errors in /var/log/salinfo - indeed we
> had a DIMM fail the day before yesterday, I sent off the traces, and
>

The EDAC drivers for Linux can do this for all x86_64 platforms up to, but
not including, Nehalem (no driver has been released for it yet). With EDAC,
a whole slew of statistics are made available in /sys, which can be used for
reporting, tracking, and tracing a failing DIMM down to its physical socket.
In fact, just a few weeks ago, AMD released 29 patches for Barcelona and
Shanghai. (Unfortunately, these new patches only build on 2.6.30-rc*.)
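
As a rough illustration of reading those statistics, here is a minimal
Python sketch. It assumes the counters sit under /sys/devices/system/edac/mc
in per-controller mc*/csrow* directories, each with ce_count (corrected
errors) and ue_count (uncorrected errors) files. The exact layout depends on
the kernel version and the EDAC driver in use, so check your own /sys tree
first. The function name read_edac_counts is made up for this example.

```python
import glob
import os


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return a list of (csrow_path, ce_count, ue_count) tuples.

    The default root and the mc*/csrow* layout are assumptions about the
    EDAC sysfs interface; adjust them to match your kernel.
    """
    results = []
    for csrow in sorted(glob.glob(os.path.join(root, "mc*", "csrow*"))):
        counts = {}
        for name in ("ce_count", "ue_count"):
            try:
                with open(os.path.join(csrow, name)) as f:
                    counts[name] = int(f.read().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable on this platform.
                counts[name] = None
        results.append((csrow, counts["ce_count"], counts["ue_count"]))
    return results


if __name__ == "__main__":
    for path, ce, ue in read_edac_counts():
        print(path, "CE:", ce, "UE:", ue)
```

A nonzero ue_count, or a ce_count that keeps climbing on the same csrow, is
the kind of thing worth flagging before blaming the job itself.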

At Advanced Clustering, we use this reporting facility in our Breakin
software--we run BLAS-optimized linpack from a RAM filesystem and watch for
EDAC messages.
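
Watching for EDAC messages can be as simple as filtering kernel log text.
The sketch below is not how Breakin does it; it is a hypothetical helper,
edac_lines, which assumes that the relevant kernel messages contain the
string "EDAC". The exact message format varies by driver and kernel version,
so treat the match as a starting point.

```python
def edac_lines(lines, marker="EDAC"):
    """Return the log lines that mention the EDAC subsystem.

    `lines` is any iterable of strings, for example the output of dmesg
    split into lines. The "EDAC" marker is an assumption about how the
    kernel tags these messages.
    """
    return [line.rstrip("\n") for line in lines if marker in line]
```

You could feed it the lines of dmesg output captured while the linpack run
is in progress and fail the burn-in if the returned list is not empty.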

-- 
Jason D. Clinton, 913-643-0306