[Beowulf] Re: ECC Memory and Job Failures (Huw Lynes)
Joe Landman landman at scalableinformatics.com
Thu Apr 23 14:44:52 PDT 2009
Gerry Creager wrote:
> David Mathog wrote:
>> Huw Lynes <lynesh at cardiff.ac.uk> wrote:
>>
>>> http://blog.revolution-computing.com/2009/04/blame-it-on-cosmic-rays.html
>>>
>>> Apparently someone ran a large cluster job with both ECC and none-ECC
>>> RAM. They consistently got the wrong answer when foregoing ECC.
>>
>> There were not very many details given. I would not rule out the
>> possibility that the nonECC memory was slightly faulty, and that the
>> observed errors had nothing to do with gamma rays at all. A better test
>> would have been to use the same ECC memory for both tests, and to turn
>> ECC memory correction on and off in the BIOS.
>
> Where's Jim Lux. I'm sure he's an opinion on this, too...
>
> Cosmic ray hits are, if I recall correctly, an improbable event at the
> earth's surface on the order of 1/1e13 sec (but I'm doing this from

Hmmm... one of the experiments done way back in the dusty days of my
undergrad was cosmic ray generated Muon lifetime measurement, using 3
large scintillators, some PMDs, and a little luck. No computers were
harmed (or used!) in these measurements. Labview wasn't even a glint in
National Instrument's eyes then.

I am pretty sure we did this experiment on the surface (inside a large
concrete building in fact, which may have altered the signal somewhat).
The atmosphere definitely attenuates the cosmic radiation background
(and I seem to remember reading things about notch and other weird
filter properties of the EM spectrum traversing the atmosphere ... all
that absorption...)

> memory and IT may have taken a hit). In spaceborne applications,
> however, the potential for random high energy particle hits is
> significantly higher. And it's not just memory, although that tends to
> be more susceptible. CPUs are also at risk. CMOS parts tend to
> tolerate these events better than a lot of others than NMOS. There are
> a lot of old CPUs and memory designs for spaceflight even today.
>
> I tend to buy the theory that there's something wrong with the non-ECC
> components, rather than thinking there's a cosmic ray with your name on
> it.
>

Allow me to second this. If I see a memory showing off a huge number of
ECC errors, I start looking at if the DIMMs were seated right.

Reseating memory (on one server) is usually a fast thing. More than one
... not so much fast.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
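
The "huge number of ECC errors" check mentioned in the reply can be roughly automated. The following is only a sketch, not anything from the thread: it assumes a Linux node with an EDAC driver loaded, which exposes per-memory-controller `ce_count` (corrected) and `ue_count` (uncorrected) counters under `/sys/devices/system/edac/mc`. The function names and the threshold of 100 corrected errors are hypothetical choices for illustration; what counts as "huge" depends on the hardware and uptime.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # counters missing or unreadable: skip this controller
        counts[mc.name] = (ce, ue)
    return counts


def suspicious(counts, ce_threshold=100):
    """List controllers with any uncorrected error or many corrected ones.

    ce_threshold is an arbitrary illustrative value, not a recommendation.
    """
    return [name for name, (ce, ue) in counts.items()
            if ue > 0 or ce >= ce_threshold]


if __name__ == "__main__":
    for name in suspicious(read_edac_counts()):
        print(f"{name}: elevated ECC error count, check DIMM seating first")
```

Run on each node (for example from a cluster-wide shell loop), this would only narrow down which machines deserve a look; it cannot tell a badly seated DIMM from a failing one.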
