[Beowulf] looking for a reference on failure rates

Jim Lux James.P.Lux at jpl.nasa.gov
Mon Mar 7 15:26:17 PST 2005

At 12:03 PM 3/7/2005, Joe Landman wrote:
>Hi Jim:
>   Something I can refer to for primary literature for a paper.  If it is 
> anecdotal, that may be fine as well, though I will have to treat it 
> differently.
>   This is largely for microprocessors, disks, networks, etc.  General 
> digital equipment, with a focus on computers in clusters.
>   Thanks!

One of our reliability guys recommended this:

E.A. Amerasekera & F.N. Najm, "Failure Mechanisms in Semiconductor 
Devices", 2nd Ed., Wiley, NY, 1997

You might also take a look at MIL-HDBK-217F, which provides reliability 
math models for just about everything electronic.  There might be some 
argument about the applicability of this in some instances, but it's 
certainly a commonly used document.  Chapter 5 covers microcircuits.  
Section 5.8 has the temperature factors (activation energy Ea, in eV) for 
various logic families: CMOS looks like 0.35, BiCMOS and LSTTL are 0.5, 
and linears are 0.65.

Converting to life effects, a 20C rise in temperature corresponds to 
roughly twice the failure rate for CMOS and a factor of 4.9 for linears 
(a 10C rise gives 2.3).  The actual assembly failure rate will depend on 
how many of each kind of part there are, what temperature they're at, etc.

Something like a doubling per 10C is probably not too far from the overall 
effect.  It's not going to be a factor of 10, and it's not going to be a 
10% increase either.
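
One way to sanity-check that rule of thumb is to ask what activation 
energy it would imply.  The sketch below inverts the same Arrhenius 
relation; the 25C baseline and the function name are assumptions for 
illustration only.  A doubling per 10C lands between the 0.5 and 0.65 eV 
entries quoted above, while a factor of 10 or a 10% increase per 10C would 
need an Ea well outside the 0.35 to 0.65 eV range.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K


def effective_ea(factor, t_base_c, rise_c):
    """Activation energy (eV) that would produce the given failure-rate
    multiplier for a rise of rise_c (deg C) starting at t_base_c (deg C)."""
    t1 = t_base_c + 273.15      # baseline temperature in kelvin
    t2 = t1 + rise_c            # elevated temperature in kelvin
    return BOLTZMANN_EV_PER_K * math.log(factor) * t1 * t2 / rise_c


# Assumed 25C baseline, 10C rise; compare against the 0.35-0.65 eV range.
for label, factor in [("doubling", 2.0), ("10 times", 10.0), ("10% increase", 1.1)]:
    print(f"{label:13s} per 10C -> Ea = {effective_ea(factor, 25.0, 10.0):.2f} eV")
```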

There's also a Bellcore/Telcordia model which apparently takes into account 
burn-in and testing.  It might be more relevant, depending on the environment.

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875
