ecc-memory.
Jim Lux James.P.Lux at jpl.nasa.gov
Tue Jan 16 15:50:55 PST 2001
- Previous message: memory speed ?
- Next message: Scyld, RedHat 7.0, 6.2 and Athlon's
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Many years ago, I worked on a system with ECC memory (Multibus-II). Bad hardware bus drivers manifested themselves as corrected bit error interrupts. We discovered the problem because we were getting double bit errors (DBEs) as well, and even at a single bit error rate of several a day, we shouldn't have been seeing DBEs. So, Serguei's comment that it might be due to differences in parts quality and system design is quite relevant.

Just offhand, I wouldn't expect 1 km of air to provide enough shielding to reduce the upset rate by 50 times. The estimates of how large the increase is are all over the map (some articles about radiation and airline safety talk about extremely high ratios, but they assume polar flights, during a solar flare, etc.).

One source, http://www.prioritiesforhealth.com/1102/rad.html, which at least appears superficially authoritative (they use the right units, etc.), gives the following data for cosmic rays:

Sea level background       -  30 mRem/yr
Denver (ca. 1600 m)        -  50 mRem/yr
Mexico City (2260 m)       -  70 mRem/yr
La Paz, Bolivia (3660 m)   - 180 mRem/yr

Of some interest might be that the background radiation in Calgary from uranium and thorium in the rocks might increase the upset rate, but again, not 50 times....

If you are interested, there are a number of sites which give solar weather statistics, which directly affect the number of solar-originating particles that might cause upsets. You could correlate solar particle flux against your observed bit error rates (or intervals) and determine whether the errors are radiation induced or something else. (If nothing else, you should see a 24 hour periodicity if it is solar related.)

-----Original Message-----
From: Serguei Patchkovskii <patchkov at ucalgary.ca>
To: Greg Lindahl <glindahl at hpti.com>
Cc: josip at icase.edu <josip at icase.edu>; beowulf at beowulf.org <beowulf at beowulf.org>
Date: Tuesday, January 16, 2001 2:59 PM
Subject: RE: D-Link switch and ecc-memory.
>On Tue, 16 Jan 2001, Greg Lindahl wrote:
>> > My best estimate is that our system corrects one single bit error (SBE)
>> > per week in 37.5 GB of ECC memory. This translates into SBE event
>> > intervals of about 9 months per GB of RAM. Your mileage may vary...
>>
>> Josip neglected to mention that he is at sea level. If you are at a higher
>> altitude, you will see more errors.
>
>Indeed. Here in Calgary (1 kilometer above the sea level), I count an average
>of 50 corrected memory errors _per_day_ for 220 Gbytes of memory over the
>last three months - or about fifty times the Josip's rate. This average
>excludes three systems with failing memory - which we hadn't got around to
>replace yet. (These three have the error rate of about 30 times the median).
>
>How much of the difference is due to an increase in cosmic radiation, and
>how much is due to the differences in parts quality and system design,
>I am not qualified to assess.
>
>Regards,
>
>/Serge.P
>
>---
>Home page: http://www.cobalt.chem.ucalgary.ca/ps/
>
>
>_______________________________________________
>Beowulf mailing list
>Beowulf at beowulf.org
>http://www.beowulf.org/mailman/listinfo/beowulf
>
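
The rates quoted in this thread are easier to compare once they are normalized to errors per GB per day. The Python sketch below does that arithmetic using only the figures given in the messages (1 SBE per week in 37.5 GB at sea level; 50 corrected errors per day in 220 GB in Calgary; 30 and 50 mRem/yr for sea level and Denver). The function and variable names are made up for illustration, and the month length is an assumed average of 365.25/12 days.

```python
# Normalize the error rates quoted in the thread to errors per GB per day.
# All names here are illustrative; the inputs are the figures from the posts.

DAYS_PER_WEEK = 7.0
DAYS_PER_MONTH = 365.25 / 12.0  # assumed average month length


def errors_per_gb_day(errors, days, memory_gb):
    """Corrected-error rate normalized by observation time and memory size."""
    return errors / (days * memory_gb)


def months_between_errors_per_gb(rate_per_gb_day):
    """Mean interval between errors for a single GB, in months."""
    return 1.0 / rate_per_gb_day / DAYS_PER_MONTH


# Josip (sea level): one SBE per week in 37.5 GB.
sea_level_rate = errors_per_gb_day(1, DAYS_PER_WEEK, 37.5)

# Serguei (Calgary, about 1 km): 50 corrected errors per day in 220 GB.
calgary_rate = errors_per_gb_day(50, 1.0, 220.0)

interval_months = months_between_errors_per_gb(sea_level_rate)
rate_ratio = calgary_rate / sea_level_rate

# Cosmic-ray dose figures from the table in the reply (mRem/yr).
dose_sea_level = 30.0
dose_denver = 50.0  # ca. 1600 m, i.e. higher than Calgary
dose_ratio = dose_denver / dose_sea_level

print(f"sea level: one error per GB every {interval_months:.1f} months")
print(f"Calgary / sea level error-rate ratio: {rate_ratio:.0f}")
print(f"Denver / sea level cosmic-ray dose ratio: {dose_ratio:.2f}")
```

On these numbers the per-GB error rate in Calgary is several tens of times the sea-level rate, while the quoted cosmic-ray dose does not even double between sea level and Denver, which is the mismatch the reply points out.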
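
One way to look for the 24 hour periodicity suggested in the reply is to fold the error timestamps onto a 24 hour clock and test whether they cluster at some time of day. The sketch below uses a Rayleigh test, which is a choice made here for illustration and is not something named in the thread. The function name and the event times are hypothetical (synthetic data, not measurements), and the p-value uses the simple first-order approximation exp(-Z).

```python
# Rayleigh test for a daily cycle in error timestamps (illustrative sketch).
import cmath
import math


def rayleigh_daily(event_hours, period_hours=24.0):
    """Return (r, z, p) for event times given in hours.

    r is the mean resultant length (0 = evenly spread, 1 = same time every day),
    z = n * r**2, and p is approximated as exp(-z).
    """
    n = len(event_hours)
    if n == 0:
        raise ValueError("need at least one event")
    # Map each event to a point on the unit circle by its phase within the day.
    resultant = sum(
        cmath.exp(2j * math.pi * (t % period_hours) / period_hours)
        for t in event_hours
    )
    r = abs(resultant) / n
    z = n * r * r
    return r, z, math.exp(-z)


# Synthetic case 1: one error every hour for 10 days (no daily preference).
uniform_events = list(range(240))
r_uniform, z_uniform, p_uniform = rayleigh_daily(uniform_events)

# Synthetic case 2: one error per day for 30 days, always close to 12:00.
clustered_events = [24 * d + 12 + (d % 3 - 1) * 0.5 for d in range(30)]
r_clustered, z_clustered, p_clustered = rayleigh_daily(clustered_events)

print(f"uniform:   r={r_uniform:.3f}  p={p_uniform:.3f}")
print(f"clustered: r={r_clustered:.3f}  p={p_clustered:.2e}")
```

A small p-value only indicates that errors favor some time of day. It would not by itself separate a solar effect from other daily cycles such as machine load, so it is a first check rather than a diagnosis.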
