[Beowulf] Not quite Walmart, or, living without ECC?
Jim Lux James.P.Lux at jpl.nasa.gov
Fri Nov 16 15:16:37 PST 2007
At 01:56 PM 11/16/2007, Mark Hahn wrote:
>>I just asked the local NT goon, "do you use ECC for the servers?" and
>>he answered, "you have to". What he considers a server-class mobo
>>requires ECC
>
>whether you need ECC depends on many things. first, how much memory
>your machine has - my experience is that most generic servers (web, file,
>mail, etc) don't have much - maybe a few GB. the chance of needing ECC
>also depends on how "hard" you use the ram (again, mundane servers
>are pretty lightly utilized.) as well as factors like altitude, ram quality,
>and the ever popular "how important is your data".
>
>for clusters, I would say that ECC is basically a necessity, unless all
>the jobs can be run in a "checking" mode (ie, perform a search or
>optimization, then verify the results in case the hit was due to a bit flip.)
>
>that said, ECC events are not all that common. I have a 768-node cluster
>here, each node dual-socket opteron with 8GB PC3200 ddr. I just
>checked all nodes with mcelog, and 35 have reported corrected events
>over roughly the last 20 days. one may have hit an uncorrectable event (but in
>our clusters, corrected ECC rate is not a good predictor for uncorrectable
>ones...)

So the detected upset rate is 35/(768*20) = 0.0023 detected errors per day
per computer, or 3.3E-14 errors/bit/day.

Wikipedia claims 1 error/month/GB (3E-11 errors/byte/day, which is about
4E-12 errors/bit/day), but their references are all pretty ancient (a JPL
paper from 2001 is probably reporting on devices that would have been used
in consumer electronics in the early 90s). They may also have been talking
about "upset rates", whereas what you observe is "detected bit error rate".
That is, you don't see all the upsets that have occurred, because you don't
read all memory, all the time: your accesses may be concentrated in, say,
1GB of your overall 8GB DRAM space.

http://parts.jpl.nasa.gov/docs/CassDRAM-00.pdf discusses some possible
reasons why multibit error rates and single bit error rates don't scale
like you'd expect (a heavy ion can zap multiple bits at one time, so the
bit errors are not uncorrelated).

In spacecraft systems, they often implement a scrubbing algorithm that
systematically reads and checks each location in turn, as opposed to
waiting for the processor to happen to fetch that location. That's so that
you have a chance to scrub an error in a word before it takes a second hit.
On Cassini, the scrubbing in the 2.5 Gbit solid state recorders is such
that every word gets scrubbed about every 9 minutes. They get about 200-300
single bit errors/day. But this is, truly, ancient technology...
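
For anyone who wants to redo the arithmetic, here is a rough Python sketch of the conversion from observed counts to per-bit rates, applied to both the cluster and the Cassini recorder numbers quoted above. It rests on a few assumptions that are not stated in the thread: 1 GB is taken as 2^30 bytes, 2.5 Gbit as 2.5E9 bits, and each of the 35 reporting nodes is counted as a single event (as the estimate in this message does). The function and variable names are made up for illustration.

```python
# Back-of-envelope conversion of observed error counts to per-bit rates.
# Assumptions (not from the thread): GB = 2**30 bytes, Gbit = 1e9 bits.

GIB_BITS = 8 * 2**30  # bits in one binary gigabyte


def per_bit_per_day(events, days, bits):
    """Observed events divided by the exposure in bit-days."""
    return events / (days * bits)


# Cluster figures quoted from Mark's message: 768 nodes, 8 GB each,
# 35 nodes reporting corrected events over roughly 20 days.
# Each reporting node is counted as one event (a simplification).
nodes, days, corrected = 768, 20, 35
per_node_per_day = corrected / (nodes * days)
cluster_rate = per_bit_per_day(corrected, days, nodes * 8 * GIB_BITS)

# Cassini recorder figures quoted above: 2.5 Gbit, 200-300 single-bit
# errors per day, every word scrubbed about every 9 minutes.
ssr_bits = 2.5e9
scrub_minutes = 9
cassini = []
for errors_per_day in (200, 300):
    rate = per_bit_per_day(errors_per_day, 1, ssr_bits)
    # Average number of new single-bit errors that accumulate
    # somewhere in the recorder during one scrub pass.
    per_pass = errors_per_day * scrub_minutes / (24 * 60)
    cassini.append((errors_per_day, rate, per_pass))

print(f"cluster: {per_node_per_day:.4f} errors/node/day, "
      f"{cluster_rate:.1e} errors/bit/day")
for errors_per_day, rate, per_pass in cassini:
    print(f"Cassini at {errors_per_day}/day: {rate:.1e} errors/bit/day, "
          f"{per_pass:.2f} new errors per scrub pass")
```

Under these assumptions the recorder only picks up one or two new single-bit errors across the whole 2.5 Gbit array per scrub pass, which gives a feel for why a second hit landing in the same word before it gets scrubbed is unlikely.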
