[Beowulf] ECC Memory and Job Failures
Jason Clinton jclinton at advancedclustering.com
Fri Apr 24 09:03:04 PDT 2009
On Fri, Apr 24, 2009 at 12:49 AM, John Hearns <hearnsj at googlemail.com> wrote:

> 2009/4/23 Nifty Tom Mitchell <niftyompi at niftyegg.com>:
> > On Thu, Apr 23, 2009 at 04:45:08PM +0100, Huw Lynes wrote:
> >
> > IMO Running on a large cluster without multiple bit detection and a
> > minimum of one bit correction ECC is silly.
> >
> > Further running without watching the ECC logs is also silly. Watching
> > the logs can be hard to do.
>
> Yes indeed.
> At the risk of being an SGI fanboy again, obviously SGI Altix systems
> keep excellent logs of hardware errors in /var/log/salinfo - indeed we
> had a DIMM fail the day before yesterday, I sent off the traces, and
>

The EDAC drivers for Linux are able to do this for all x86_64 platforms up
to but not including Nehalem (a driver hasn't been released yet). With EDAC,
a whole slew of statistics are made available in /sys which can be used for
reporting, tracking and tracing the failing DIMM down to physical socket.

In fact, just a few weeks ago, AMD released 29 patches for Barcelona and
Shanghai. (Unfortunately, these new patches only build on 2.6.30-rc*.)

At Advanced Clustering, we use this reporting facility in our Breakin
software--we run BLAS-optimized linpack from a RAM filesystem and watch for
EDAC messages.

-- 
Jason D. Clinton, 913-643-0306
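
To illustrate the kind of /sys statistics the message refers to, here is a minimal sketch that collects corrected (CE) and uncorrected (UE) error counts per memory controller and chip-select row. It assumes the legacy EDAC sysfs layout under /sys/devices/system/edac/mc (mcN/ce_count, mcN/ue_count, mcN/csrowM/ce_count, ...); the exact files vary with kernel version and driver, and the helper names read_int and edac_counts are hypothetical, not part of EDAC or of Breakin.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller tree; adjust for your kernel.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_int(path):
    """Return the integer stored in a sysfs file, or None if it is missing or unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def edac_counts(root=EDAC_ROOT):
    """Collect CE/UE counters for each memory controller and each of its csrows."""
    report = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        rows = {}
        for csrow in sorted(mc.glob("csrow[0-9]*")):
            rows[csrow.name] = {
                "ce": read_int(csrow / "ce_count"),
                "ue": read_int(csrow / "ue_count"),
            }
        report[mc.name] = {
            "ce": read_int(mc / "ce_count"),
            "ue": read_int(mc / "ue_count"),
            "csrows": rows,
        }
    return report


if __name__ == "__main__":
    # Print one line per controller and per csrow; prints nothing if EDAC is not loaded.
    for mc_name, info in edac_counts().items():
        print(f"{mc_name}: ce={info['ce']} ue={info['ue']}")
        for row_name, counts in info["csrows"].items():
            print(f"  {row_name}: ce={counts['ce']} ue={counts['ue']}")
```

Run periodically (for example from cron), a nonzero or rising ce count on one csrow is a hint as to which DIMM to look at; mapping a csrow to a physical socket still depends on the board and driver.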
