[Beowulf] Re: Cooling vs HW replacement
josip at lanl.gov
Fri Jan 21 14:06:07 PST 2005
Robert G. Brown wrote:
>>> 2. higher reliability - typically 1.2-1.4M hours, and usually
>>> specified under higher load. this is a very fuzzy area, since
>>> commodity disks often quote 1Mhr under "lower" load.
> Has anyone observed that a megahour is 114 years? Has anyone observed
> that this is so ludicrous a figure as to be totally meaningless? Show
> me a single disk on the planet that will run, under load, for a mere two
> decades and I'll bow down before it and start sacrificing chickens.
> Humans don't live a megahour MTBF. Disks damn sure don't.
All of the above is true on a "per sample" basis. Moreover, with
product cycles measured in months rather than years, none of the MTBF
figures could possibly be based on actual MTBF measurements. Instead,
manufacturers use composite statistics, computed from mid-life component
failure rates, then quote MTBF as the reciprocal of the resulting
failure rate.
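
To put numbers on the reciprocal relationship, here is a rough Python
sketch. It assumes a constant failure rate (an exponential model), which
is my simplification for illustration and not any vendor's actual
procedure; the function names are made up for this example.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def mtbf_from_failure_rate(failures_per_hour):
    """Quoted MTBF is just the reciprocal of the (mid-life) failure rate."""
    return 1.0 / failures_per_hour


def mtbf_years(mtbf_hours):
    """Express an MTBF given in hours as years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def annual_failure_probability(mtbf_hours):
    """Chance of failing within one year of 24x7 use.

    Assumption: constant hazard rate, i.e. no infant mortality and no
    wear-out, which is exactly what the quoted MTBF implicitly assumes.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


if __name__ == "__main__":
    for mtbf in (1.0e6, 1.4e6):
        print(f"MTBF {mtbf:.1e} h = {mtbf_years(mtbf):.0f} years, "
              f"annual failure probability {annual_failure_probability(mtbf):.2%}")
```

Read this way, a 1.4M hour MTBF is better understood as "well under 1%
of mid-life drives fail per year" than as a lifetime of any one drive.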
This practice results in good MTBF numbers, but it amounts to stating
that the life expectancy of a 10-year-old kid is 5000 years based on the
99.98% probability that the kid will survive the next year (these
numbers are quoted from IEEE Spectrum, Sept. 2004).
Both humans and machines fall apart at higher rates in infancy, as well
as with age, when built-in redundancy wears thin due to accumulated
damage. The disk drive MTBF number does not apply to drives that fail
fairly quickly, nor to failure rates of old/heavily used drives. If,
somewhat questionably, human life expectancy is taken as a guide, disk
manufacturers' MTBF numbers ought to be de-rated by about a factor of
50-70 to make practical sense (e.g. a 1.4M-hour MTBF drive might last
some 25,000 hours) -- but even this applies only under nominal
conditions, where the above-mentioned statistical MTBF estimate is not
In other words, a drive may last several years at 20 deg. C ambient
temperature. Still, this says nothing about its durability at 40+ deg.
C. Given that in many systems failure rates increase exponentially with
temperature, e.g. doubling for every 10 degree increase, I would avoid
baking a drive unless it was specifically designed for high temperature
operation (if such drives even exist).
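
The "doubling for every 10 degrees" rule of thumb can be sketched as
follows. This is only an illustration: the 20 deg. C reference, the
10 degree doubling step, and the idea that expected life scales
inversely with failure rate are assumptions, not measured properties of
any particular drive, and the function names are hypothetical.

```python
def failure_rate_multiplier(temp_c, ref_temp_c=20.0, doubling_step_c=10.0):
    """Relative failure rate at temp_c versus the reference temperature.

    Assumption: the rate doubles for every doubling_step_c degrees of
    increase (a rule of thumb, not a datasheet figure).
    """
    return 2.0 ** ((temp_c - ref_temp_c) / doubling_step_c)


def scaled_life_hours(life_hours_at_ref, temp_c,
                      ref_temp_c=20.0, doubling_step_c=10.0):
    """Expected life at temp_c, assuming life ~ 1 / failure rate."""
    multiplier = failure_rate_multiplier(temp_c, ref_temp_c, doubling_step_c)
    return life_hours_at_ref / multiplier


if __name__ == "__main__":
    nominal_life = 25_000  # hours, the de-rated figure used above
    for temp in (20, 30, 40, 50):
        print(f"{temp} C: x{failure_rate_multiplier(temp):.1f} failure rate, "
              f"about {scaled_life_hours(nominal_life, temp):,.0f} hours")
```

Under these assumptions, a drive good for 25,000 hours at 20 deg. C
would be down to roughly a quarter of that at 40 deg. C.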