[Beowulf] ECC Memory and Job Failures
John Hearns hearnsj at googlemail.com
Thu Apr 23 22:49:23 PDT 2009
2009/4/23 Nifty Tom Mitchell <niftyompi at niftyegg.com>:
> On Thu, Apr 23, 2009 at 04:45:08PM +0100, Huw Lynes wrote:
> >
> IMO Running on a large cluster without multiple bit detection and a minimum of one bit
> correction ECC is silly.
>
> Further running without watching the ECC logs is also silly. Watching the
> logs can be hard to do.

Yes indeed. At the risk of being an SGI fanboy again, obviously SGI Altix systems keep excellent logs of hardware errors in /var/log/salinfo - indeed we had a DIMM fail the day before yesterday, I sent off the traces, and an engineer was on site yesterday to change it. If ESP email was able to squeak its way out of our network I probably would have met the engineer on the way into work before I called them.

More relevantly there is excellent memory error detection and logging on the ICE cluster. SGI provide a utility for switching on memory error logging, using the 'worm' module and logging all errors to syslog. As the blades all do central syslogging to their rack leaders you can track the errors readily. You don't even have to run your own script to parse through logs - the 'memcheck' utility will check through your entire system and report memory logs.

This facility has recently been very, very useful to me, and I've been very grateful for SGI support. Having experienced many other clusters, I think I can say that the SGI attention to error logging like this is second to none. Plus couple that with command-line utilities to flash BMC, CMC and BIOSes and you've got a winner.
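On a cluster without a vendor tool of this kind, the central-syslog approach described above can be approximated with a short script. What follows is only a minimal sketch and is not how SGI's 'memcheck' works. It assumes traditional BSD-style syslog lines ("Mon DD HH:MM:SS host message"). The keyword pattern, the host names and the sample messages are all made up for illustration, so adjust them to whatever your hardware actually logs.

```python
import re
from collections import Counter

# Traditional BSD-style syslog line: "Apr 23 22:49:23 hostname message..."
SYSLOG_LINE = re.compile(
    r"^(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$"
)

# Hypothetical keywords (an assumption, not any vendor's real log format).
MEMORY_ERROR = re.compile(r"\b(ECC|EDAC|memory error)\b", re.IGNORECASE)


def count_memory_errors(lines):
    """Return a Counter mapping host name -> number of matching log lines."""
    counts = Counter()
    for line in lines:
        match = SYSLOG_LINE.match(line)
        if match and MEMORY_ERROR.search(match.group(3)):
            counts[match.group(2)] += 1
    return counts


# Made-up sample lines, standing in for a rack leader's collected syslog.
SAMPLE = [
    "Apr 23 10:01:02 node03 kernel: corrected ECC error on DIMM 2",
    "Apr 23 10:05:40 node07 sshd[1234]: session opened for user alice",
    "Apr 23 10:09:11 node03 kernel: corrected ECC error on DIMM 2",
    "Apr 23 10:12:00 node11 kernel: uncorrectable memory error detected",
]

if __name__ == "__main__":
    # Print the noisiest hosts first.
    for host, n in count_memory_errors(SAMPLE).most_common():
        print(f"{host}: {n} memory error line(s)")
```

In practice you would pass an open log file to count_memory_errors() instead of SAMPLE. A node whose count keeps climbing is the one to pull for a DIMM swap.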
