Frequency of upsets was Re: [Beowulf] ECC support on motherboards?
James.P.Lux at jpl.nasa.gov
Tue May 13 15:27:11 PDT 2008
At 02:16 PM 5/13/2008, Håkon Bugge wrote:
>At 19:17 13.05.2008, Perry E. Metzger wrote:
>>So another question is, how can you reliably test any of this stuff?
>>It isn't like you can reliably induce single bit errors and see if the
>>hardware catches them. (A special memory module that let you test
>>would be a wonderful thing, but I've never even heard of such a thing.)
More on upsets..
Here's an interesting paper from Boeing in the
late 90s that asserts that a leading cause of
these upsets is atmospheric neutrons. Gives
rates too.. (see also the link below to the
presentation which uses some of this data)
It looks like, for 4M DRAMs, 1E-12 upsets per bit-hour is a nice round number (Table 4).
Some data from Fermilab with 160 Gbit of DRAM
showed 2.5 upsets/day. Extrapolating (always
dangerous with these kinds of radiation effects
data, but I'll plunge in regardless), that means
a workstation with 4-8 Gbyte (32-64 Gbit) of DRAM might see
somewhere between half an upset and one upset per day.
Any sort of ECC would catch a single-bit upset like this and correct it, of course.
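
For anyone who wants to redo the arithmetic, here is a rough sketch in
Python. It only uses the two figures quoted above (2.5 upsets/day over
160 Gbit, and 1E-12 upsets per bit-hour). The function and constant
names are mine, decimal gigabytes are assumed, and the linear scaling
with memory size is exactly the risky extrapolation already mentioned,
not a measured result.

```python
# Back-of-envelope upset rates, scaled linearly with memory size.
# Assumption: 1 Gbit = 1e9 bits and 1 Gbyte = 8e9 bits (decimal units).
BITS_PER_GBIT = 1e9
BITS_PER_GBYTE = 8e9
HOURS_PER_DAY = 24.0


def upsets_per_day(bits, rate_per_bit_hour):
    """Expected upsets per day for `bits` of DRAM at a given per-bit rate."""
    return bits * rate_per_bit_hour * HOURS_PER_DAY


# Per-bit rate implied by the Fermilab figure: 2.5 upsets/day in 160 Gbit.
fermilab_rate = 2.5 / (160 * BITS_PER_GBIT * HOURS_PER_DAY)

# Round number read off the Boeing paper's Table 4.
boeing_rate = 1e-12

for gbytes in (4, 8):
    bits = gbytes * BITS_PER_GBYTE
    print(
        f"{gbytes} Gbyte: "
        f"{upsets_per_day(bits, fermilab_rate):.2f}/day (Fermilab rate), "
        f"{upsets_per_day(bits, boeing_rate):.2f}/day (1E-12 rate)"
    )
```

The Fermilab number works out to about 6.5E-13 upsets per bit-hour, so
the two sources agree to within a factor of two.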
There is a paper from Gary Swift, here at JPL,
that discusses how some radiation induced upsets
will be multiple bit errors by their nature (i.e.
imagine a bullet tearing through a bunch of
memory cells.. more than one gets hit). But this
is for Cassini era Solid State Recorders (i.e.
early 90s, late 80s components) and it's in
space, where the radiation environment is quite
different from that on the ground. Swift & Guertin,
"In-Flight Observations of Multiple-Bit Upset in
DRAMs", IEEE Trans on Nuc Sci, V47, #6, Dec 2000, pp2386-2391.
The Ladbury presentation from MAPLD2002 I posted
the link to yesterday talks about the mechanics of the upset.
A fascinating presentation about upsets in
avionics (for planes, not spacecraft) from Boeing is here:
Look at slide 11, and you see that the upset rate
is 30 times higher at 30,000 ft than at sea
level. Those of you building clusters for
observatories in Atacama might want to pay more
attention to upsets than those of us close to sea level.
Likewise, the upset rate is higher at high
latitudes. (Why yes, it's essential that we build
that cluster on a tropical island. Otherwise it will cost more for ECC RAM.)
An interesting post on a mailing list:
Ladkin discusses some of the potential issues with the Boeing (and other) data.
So there's more about SEUs in memory than anyone
on this list ever wanted to know. There's lots
more stuff available, although you pretty quickly
get into export controlled territory if you are
poking at the limits of the technology.
James Lux, P.E.
Task Manager, SOMD Software Defined Radios
Flight Communications Systems Section
Jet Propulsion Laboratory
4800 Oak Grove Drive, M/S 161-213
Pasadena CA 91109