[Beowulf] Re: failure trends in a large disk drive population

Jim Lux James.P.Lux at jpl.nasa.gov
Thu Feb 22 10:40:58 PST 2007


At 08:22 AM 2/22/2007, David Mathog wrote:
>Justin Moore wrote:
> > As mentioned in an earlier e-mail (I think) there were 4 SMART variables
> > whose values were strongly correlated with failure, and another 4-6 that
> > were weakly correlated with failure.  However, of all the disks that
> > failed, less than half (around 45%) had ANY of the "strong" signals and
> > another 25% had some of the "weak" signals.  This means that over a
> > third of disks that failed gave no appreciable warning.  Therefore even
> > combining the variables would give no better than a 70% chance of
> > predicting failure.
>
>Now we need to know exactly how you defined "failed".

The paper defined "failed" as "requiring the computer to be pulled",
whether or not the disk was actually dead.
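
For what it's worth, the arithmetic behind Justin's "no better than 70%"
figure above works out as below.  A minimal back-of-envelope sketch in
Python; reading the percentages as fractions of all *failed* drives is my
assumption, not something the post spells out:

strong = 0.45   # failed drives that showed any "strong" SMART signal
weak   = 0.25   # failed drives that showed only "weak" signals
no_warning = 1.0 - strong - weak

print(f"best-case detection, acting on any signal: {strong + weak:.0%}")
print(f"failed drives giving no warning at all:    {no_warning:.0%}")
# -> roughly 70% and 30%: even a predictor that fires on every signal
#    still misses the drives that never showed one.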

>Were there postmortem analyses of the power supplies in the failed
>systems?  It wouldn't surprise me if low or noisy power lines led
>to an increased rate of disk failure.  SMART wouldn't give this
>information (at least, not on any of the disks I have), but
>lm_sensors would.
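
To illustrate the kind of logging lm_sensors makes possible, here is a
minimal sketch that samples the supply rails so that sags or noise could
later be lined up against disk failure dates.  It reads the hwmon sysfs
files (a newer-kernel convenience; parsing `sensors` output would do the
same job), and the paths, interval, and the whole approach are my
assumptions, not anything from the post or the paper:

#!/usr/bin/env python3
import glob, time

def read_voltages():
    """Return {sysfs path: volts} for every hwmon voltage input."""
    readings = {}
    # hwmon exposes voltage inputs as in<N>_input, in millivolts
    for path in glob.glob("/sys/class/hwmon/hwmon*/in*_input"):
        try:
            with open(path) as f:
                readings[path] = int(f.read().strip()) / 1000.0
        except OSError:
            pass
    return readings

while True:
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    for rail, volts in sorted(read_voltages().items()):
        print(f"{stamp} {rail} {volts:.3f} V")
    time.sleep(60)   # one sample a minute is plenty for trend data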


I would make the case that it's not worth it to even glance at the 
outside of the case of a dead unit, much less do failure analysis on 
the power supply.  FA is expensive, new computers are not.  Pitch the 
dead (or "not quite dead yet, but suspect") computer, slap in a new 
one and go on.

There is some non-zero value in understanding the failure mechanisms,
but probably only if the failure rate is high enough to make a
difference.  That is, if you have a 50% failure rate, it would be
worth understanding.  If you have a 3% failure rate, it might be
better to just replace and move on.
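
A toy break-even check makes the point.  Every number below is a made-up
placeholder (fleet size, node cost, the cost of a real root-cause FA, how
many future failures a fix would actually prevent), so treat it as a
sketch of the reasoning rather than real economics:

fleet_size         = 1000
node_cost          = 800.0      # cost to swap in a replacement node (assumed)
fa_cost            = 20000.0    # one-time engineering cost of a root-cause FA (assumed)
prevented_fraction = 0.5        # future failures the fix would avoid (assumed)

for failure_rate in (0.03, 0.50):
    expected_failures = fleet_size * failure_rate
    savings = prevented_fraction * expected_failures * node_cost
    verdict = "worth understanding" if savings > fa_cost else "just replace and move on"
    print(f"{failure_rate:.0%} failure rate: saves ${savings:,.0f} "
          f"vs ${fa_cost:,.0f} of FA -> {verdict}")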

There is also some value in predicting failures, IF there's an
economic benefit from knowing early.  Maybe you can replace computers
in batches less expensively than waiting for them to fail, or maybe
you're in a situation where a failure is expensive (highly tuned,
brittle software with no checkpoints that has to run on 1000 processors
in lockstep for days on end).  I can see Google being in the former
case but probably not in the latter.  Predictive statistics might
also be useful if there is some "common factor" that kills many disks
at once (Gosh, when Bob is the duty SA after midnight and it's the
full moon, the air filters clog with some strange fur and the drives
overheat, but only in machine rooms with a window to the outside...)
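
The batch-replacement case is easy to put numbers on.  Again, every
figure below is an assumption of mine, chosen only to show the shape of
the comparison:

nodes_predicted_to_fail = 50
batch_swap_per_node     = 200.0   # planned swap during scheduled downtime (assumed)
reactive_swap_per_node  = 900.0   # unplanned: dispatch, interrupted jobs (assumed)

print(f"batch:    ${nodes_predicted_to_fail * batch_swap_per_node:,.0f}")
print(f"reactive: ${nodes_predicted_to_fail * reactive_swap_per_node:,.0f}")
# Prediction pays off only when this gap exceeds what the false alarms cost you.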


James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875 




