[Beowulf] Re: Cooling vs HW replacement
James.P.Lux at jpl.nasa.gov
Fri Jan 21 14:58:47 PST 2005
At 11:09 AM 1/21/2005, David Mathog wrote:
> > > or "Server" grade disks still cost a lot more than that. For
> > this is a very traditional, glass-house outlook. it's the same one
> > that justifies a "server" at $50K being qualitatively different
> > from a commodity 1U dual at $5K. there's no question that there
> > are differences - the only question is whether the price justifies
> > those differences.
>The MTBF rates quoted by the manufacturers are one indicator
>of disk reliability, but from a practical point of view the number
>of years of warranty coverage on the disk is a more useful metric.
>The manufacturer has an incentive to be sure that those disks
>with a 5 year warranty really will last 5 years. Unclear
>to me what their incentive is to support the MTBF rates since only
>a sustained and careful testing regimen over many, many disks could
>challenge the manufacturer's figures. And who would run such
>an analysis??? Buy the 5 year disk and you'll have a working
>disk, or a replacement for it, for 5 years.
While disk MTBF figures may seem unrealistic (as was pointed out, nobody is
likely to run a single disk for 100+ years), they are a "common
currency" in the reliability calculation world, as are "Failures in Time"
(FIT), the number of failures in a billion (1E9) hours of operation.
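The MTBF/FIT relationship is just a reciprocal scaled to 1E9 hours; a minimal sketch:

```python
# FIT = failures per 1e9 device-hours, so FIT = 1e9 / MTBF_hours
# and the conversion runs the same way in both directions.

def mtbf_to_fit(mtbf_hours):
    """Convert an MTBF in hours to a FIT rate (failures per 1e9 hours)."""
    return 1e9 / mtbf_hours

def fit_to_mtbf(fit):
    """Convert a FIT rate back to an MTBF in hours."""
    return 1e9 / fit

# A disk quoted at 1,000,000 hours MTBF is equivalently 1000 FIT:
print(mtbf_to_fit(1e6))   # 1000.0
```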
What would be very useful (and is something that does get analysis for some
customers, who care greatly about this stuff) is to compare the MTBF of a
widget determined by calculation and analysis (look up the component
reliabilities, calculate the probability of failure for the ensemble) with
the MTBF of the same widget determined by test (run 1000 disk drives for
months), especially if you run what are called "accelerated life tests" at
elevated temperatures or higher duty factors.
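One common way to model elevated-temperature accelerated life tests is an Arrhenius acceleration factor; the post doesn't name a specific model, so this is an assumed sketch, and the 0.7 eV activation energy is an illustrative value, not real part data:

```python
import math

# Arrhenius acceleration factor between a "use" temperature and a hotter
# "stress" temperature. Larger factor = the stress test ages parts faster.
K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

def acceleration_factor(t_use_c, t_stress_c, ea_ev=0.7):
    """Arrhenius acceleration factor (ea_ev is an assumed activation energy)."""
    t_use = t_use_c + 273.15        # Celsius -> Kelvin
    t_stress = t_stress_c + 273.15
    return math.exp((ea_ev / K_BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_stress))

# Stressing 25 degrees hotter than a 40 C design point gives a factor
# of several, so months of testing stand in for years of field use:
af = acceleration_factor(40, 65)
```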
MTBFs are also used because they're easier to understand and handle than
things like "reliability", which winds up being .999999999, or failure
rates per unit time, which wind up being very tiny numbers (unless "unit
time" is a billion hours).
And, if I were asked to estimate the reliability of a PC, I'd want to get
the MTBF numbers for all the assemblies, and then I could calculate a
composite MTBF, which might be surprisingly short. If I then had to
calculate the time between PC failures in a cluster of 1000 computers, it
would be appallingly short.
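The composite-MTBF arithmetic is simple: for assemblies in series with constant failure rates, the rates add, so the composite MTBF is the reciprocal of the summed reciprocals. The assembly MTBFs below are invented, illustrative numbers, not real component data:

```python
# Series-system composite MTBF: any assembly failing fails the PC, so
# failure rates add and MTBF_total = 1 / sum(1/MTBF_i).
# These part MTBFs are made-up illustrative figures.
assembly_mtbf_hours = {
    "disk": 1e6,
    "power_supply": 3e5,
    "fan": 2e5,
    "motherboard": 5e5,
}

composite = 1.0 / sum(1.0 / m for m in assembly_mtbf_hours.values())
# Even with every part at 2e5 hours or better, the whole PC comes out
# under 1e5 hours -- notably shorter than any single assembly.
```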
To a first order, an ensemble of 1000 units, each with an MTBF of 1E6 hours,
will have an MTBF of only 1000 hours, which isn't all that long... and if
the MTBF of those units is only 1E5 hours, because you're running them 25
degrees hotter than expected, only a few days will go by before you get
your first failure.
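That first-order ensemble arithmetic can be sketched directly (identical units with constant failure rates, so the ensemble MTBF is just the unit MTBF divided by the count):

```python
# For n identical units with constant failure rates, the ensemble fails
# n times as often, so its MTBF is the unit MTBF divided by n.

def ensemble_mtbf(unit_mtbf_hours, n_units):
    """Mean time between failures across the whole ensemble."""
    return unit_mtbf_hours / n_units

# 1000 nodes at 1e6 hours each -> a failure roughly every 1000 hours:
print(ensemble_mtbf(1e6, 1000))   # 1000.0

# Run them 25 degrees hotter so each unit drops to 1e5 hours:
hot = ensemble_mtbf(1e5, 1000)    # 100 hours
print(hot / 24)                   # ~4.2 days, on average, to the first failure
```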
James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109