[Beowulf] ECC Memory and Job Failures

Jason Clinton jclinton at advancedclustering.com
Fri Apr 24 09:03:04 PDT 2009


On Fri, Apr 24, 2009 at 12:49 AM, John Hearns <hearnsj at googlemail.com> wrote:

> 2009/4/23 Nifty Tom Mitchell <niftyompi at niftyegg.com>:
> > On Thu, Apr 23, 2009 at 04:45:08PM +0100, Huw Lynes wrote:
> >
> > IMO Running on a large cluster without multiple bit detection and a
> > minimum of one bit correction ECC is silly.
> >
> > Further running without watching the ECC logs is also silly.  Watching
> > the logs can be hard to do.
>
> Yes indeed.
> At the risk of being an SGI fanboy again, obviously SGI Altix systems
> keep excellent logs of hardware errors in /var/log/salinfo - indeed we
> had a DIMM fail the day before yesterday, I sent off the traces, and
>

The EDAC drivers for Linux are able to do this for all x86_64 platforms up
to but not including Nehalem (a driver hasn't been released yet). With EDAC,
a whole slew of statistics are made available in /sys, which can be used for
reporting, tracking and tracing the failing DIMM down to its physical socket. In
fact, just a few weeks ago, AMD released 29 patches for Barcelona and
Shanghai. (Unfortunately, these new patches only build on 2.6.30-rc*.)
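To make the /sys reporting a little more concrete, here is a minimal Python sketch that collects correctable (CE) and uncorrectable (UE) error counts per memory controller and per chip-select row. It assumes the EDAC counters live under /sys/devices/system/edac/mc in mcN/csrowM directories with ce_count and ue_count files; that layout is an assumption here and can differ between kernel versions and drivers. The function and constant names are made up for illustration.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller tree; may differ by kernel.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def _read_int(path):
    """Return the integer stored in a sysfs file, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def read_edac_counts(root=EDAC_ROOT):
    """Collect CE/UE counts for each memory controller and each csrow."""
    report = {}
    root = Path(root)
    if not root.is_dir():
        # No EDAC driver loaded, or a different sysfs layout.
        return report
    for mc in sorted(root.glob("mc[0-9]*")):
        rows = {}
        for row in sorted(mc.glob("csrow[0-9]*")):
            rows[row.name] = {
                "ce": _read_int(row / "ce_count"),
                "ue": _read_int(row / "ue_count"),
            }
        report[mc.name] = {
            "ce": _read_int(mc / "ce_count"),
            "ue": _read_int(mc / "ue_count"),
            "csrows": rows,
        }
    return report


if __name__ == "__main__":
    for mc, info in read_edac_counts().items():
        print(mc, "CE:", info["ce"], "UE:", info["ue"])
        for row, counts in info["csrows"].items():
            print("  ", row, "CE:", counts["ce"], "UE:", counts["ue"])
```

A nonzero and growing CE count on a single csrow is the usual hint that one DIMM is going bad; mapping the csrow back to a physical socket depends on the motherboard.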

At Advanced Clustering, we use this reporting facility in our Breakin
software--we run BLAS-optimized linpack from a RAM filesystem and watch for
EDAC messages.
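As a rough illustration of "watching for EDAC messages" during a burn-in run (this is not the Breakin code, just a sketch), the following Python counts CE and UE events per memory controller in a stream of kernel log lines. The regular expression assumes lines of the form "EDAC MC0: CE ..." or "EDAC MC1: UE ..."; the exact message format depends on the kernel and driver, so treat the pattern and the function name as assumptions.

```python
import re

# Assumed message shape: "EDAC MC<n>: CE ..." or "EDAC MC<n>: UE ...".
EDAC_LINE = re.compile(r"EDAC\s+MC(?P<mc>\d+):\s+(?P<kind>CE|UE)\b")


def count_edac_events(lines):
    """Return {(controller_index, 'CE' or 'UE'): count} for matching lines."""
    counts = {}
    for line in lines:
        match = EDAC_LINE.search(line)
        if match:
            key = (int(match.group("mc")), match.group("kind"))
            counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    import sys

    # Example: dmesg | python this_script.py
    for (mc, kind), n in sorted(count_edac_events(sys.stdin).items()):
        print(f"MC{mc} {kind}: {n}")
```

Feeding it the kernel log before and after a stress run and comparing the counts gives a crude pass/fail signal for the memory on a node.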

-- 
Jason D. Clinton, 913-643-0306