[Beowulf] Are disk MTBF ratings at all useful?
Peter St. John
peter.st.john at gmail.com
Mon Apr 22 18:19:39 PDT 2013
Human mortality has, broadly, a Poisson component and a non-Poisson
component. The chance of getting hit by a meteor is Poisson: it has nothing
to do with your age. But the chance of a 99 year old living to 100 is lower
than the chance of a 20 year old living to 21, because we wear out; that's
not Poisson. (Dogs are a clearer example: the chance of getting hit by a car
is Poisson, but dying of old age after a dozen years or so is not.)
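To make the distinction concrete, here is a rough Python sketch. It compares the chance of surviving one more year under a constant hazard (the Poisson case) with the same chance under a Weibull wear-out model. The function names and all the parameter values (rate, scale, shape) are made up for illustration; they are not fitted to any human or disk data.

```python
import math

def p_next_year_constant(rate, age):
    """Constant hazard (Poisson arrivals): survival S(t) = exp(-rate * t).

    P(survive to age+1 | alive at age) = exp(-rate), whatever the age.
    """
    return math.exp(-rate * (age + 1)) / math.exp(-rate * age)

def p_next_year_weibull(scale, shape, age):
    """Weibull survival S(t) = exp(-(t/scale)**shape).

    shape > 1 gives a hazard that rises with age, i.e. wear-out.
    """
    def survival(t):
        return math.exp(-((t / scale) ** shape))
    return survival(age + 1) / survival(age)

# Illustrative, assumed parameters only.
RATE = 0.001                # meteor-style accidents per year
SCALE, SHAPE = 80.0, 8.0    # wear-out model

for age in (20, 99):
    print(age,
          p_next_year_constant(RATE, age),
          p_next_year_weibull(SCALE, SHAPE, age))
```

The constant-hazard column comes out the same at 20 and at 99, while the Weibull column drops sharply at 99. That is the "memoryless" property in the first case and its absence in the second.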
We usually think of incandescent light bulbs as Poisson; the chance of, I
don't know, Brownian motion clipping a very narrow filament is bigger than
the degradation of mere use. The exception is switching the bulb off and on
frequently, when the chance of failure depends more on fatigue as the
filament expands and contracts.
Hard Disks are somewhat Poisson, and somewhat not. More so, I think, than
On Mon, Apr 22, 2013 at 12:07 PM, mathog <mathog at caltech.edu> wrote:
> In partial answer to the subject question, let us apply the mode of
> analysis used by the drive manufacturers
> to human life expectancy, as if Humans were one of their products.
> That is, what is the Human AFR and
> MTBF? Unlike for disk drives, we can easily obtain a table of USA
> mortality rates, this one
> is for the year 2007:
> Looking at the first row of the table, which is the data for the whole
> country, we see that it has a bathtub
> shaped curve, with a relatively high "early failure rate", which
> decreases to a minimum for the
> ages 5-14, and then an increasing "failure rate" with advancing years.
> Now assume the "manufacturer" calculates the AFR assuming a "working
> life" for the "product"
> of 20 years. The total "failures"/100,000 over that period measured in
> 2007 were:
> 685.4 + 4*28.6 + 10*15.3 + 5*79.9 = 1352.3
> Giving a 20 year failure rate of
> 1352.3 / 100000 = .013523
> and an AFR of .013523/20 = .000676,
> or .0676%.
> So the MTBF for humans (in years, not hours) is 1/.000676 = 1479.
> This number is just as nonsensical for people as 150 years is for
> In the human case, since we have all the data, we can see exactly why
> the result is so far off.
> In rough terms the human mortality rate doubles for every decade of
> age. Consequently any AFR
> calculated up to an age below the actual MTBF (average lifespan) will
> be an underestimate, and
> the earlier the cut off, the further off the value will be. This is
> on top of the other issue
> which affects the calculations for disks - the definition of a "failed
> unit" used by the manufacturers
> is much less stringent than that employed by the end users/vendors.
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
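The arithmetic in the quoted message can be checked with a few lines of Python. The first part uses only the figures that appear in the quoted sum. The second part is a toy model of the "doubles every decade" remark: the starting hazard `H0` is an assumed number chosen for illustration, and the model ignores the early-failure end of the bathtub curve. Treat it as a sketch of the effect, not as a mortality model.

```python
import math

# (years counted, deaths per 100,000 per year), exactly as the terms
# appear in the quoted sum: 685.4 + 4*28.6 + 10*15.3 + 5*79.9
BANDS = [(1, 685.4), (4, 28.6), (10, 15.3), (5, 79.9)]
WORKING_LIFE_YEARS = 20

total_per_100k = sum(years * rate for years, rate in BANDS)
twenty_year_rate = total_per_100k / 100000
afr = twenty_year_rate / WORKING_LIFE_YEARS
mtbf_years = 1 / afr

print(total_per_100k, twenty_year_rate, afr, mtbf_years)

# Toy model: hazard h(t) = H0 * 2**(t / 10), doubling every decade.
H0 = 0.0002  # assumed starting hazard per year, illustration only

def fraction_failed(cutoff_years):
    """Fraction failed by the cutoff, from the integrated hazard."""
    cumulative = H0 * (10 / math.log(2)) * (2 ** (cutoff_years / 10) - 1)
    return 1 - math.exp(-cumulative)

def naive_mtbf(cutoff_years):
    """1 / (average annual failure rate measured up to the cutoff)."""
    return cutoff_years / fraction_failed(cutoff_years)

for cutoff in (20, 40, 80):
    print(cutoff, naive_mtbf(cutoff))
```

In the toy model the naive MTBF shrinks as the cutoff moves later, which is the point made in the quoted message: the earlier the cut off, the more the failure rate is underestimated and the more inflated the resulting MTBF.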