[Beowulf] Re: Cooling vs HW replacement

Josip Loncaric josip at lanl.gov
Fri Jan 21 14:06:07 PST 2005


Robert G. Brown wrote:
> 
>>>	2. higher reliability - typically 1.2-1.4M hours, and usually 
>>>	specified under higher load.  this is a very fuzzy area, since 
>>>	commodity disks often quote 1Mhr under "lower" load.
> 
> Has anyone observed that a megahour is 114 years?  Has anyone observed
> that this is so ludicrous a figure as to be totally meaningless?  Show
> me a single disk on the planet that will run, under load, for a mere two
> decades and I'll bow down before it and start sacrificing chickens.
> 
> Humans don't live a megahour MTBF.  Disks damn sure don't.

All of the above is true on a per-sample basis.  Moreover, with product 
cycles measured in months rather than years, none of the MTBF figures 
could possibly be based on actual measurements of time to failure.  
Instead, manufacturers use composite statistics, computed from mid-life 
component failure rates, then quote MTBF as the reciprocal of the 
resulting failure rate.
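
To make the reciprocal relationship concrete, here is a rough Python 
sketch.  It assumes a constant failure rate (which is exactly the 
simplification being criticized) and a year of 8766 hours; the function 
names are made up for illustration and come from no vendor's method.

```python
import math

HOURS_PER_YEAR = 8766.0  # assumed convention: 365.25 days * 24 hours


def mtbf_from_failure_rate(failures_per_hour: float) -> float:
    """MTBF quoted as the reciprocal of a constant failure rate."""
    return 1.0 / failures_per_hour


def annual_failure_probability(mtbf_hours: float) -> float:
    """Chance that one drive fails within a year, assuming a constant hazard."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


mtbf = mtbf_from_failure_rate(1e-6)      # 1e-6 failures/hour gives 1 Mhour
print(mtbf / HOURS_PER_YEAR)             # roughly 114 years
print(annual_failure_probability(mtbf))  # a bit under 1% per year
```

Read this way, a 1 Mhour MTBF is a statement about the yearly failure 
fraction of a mid-life population, not about how long any one drive runs.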

This practice yields impressive MTBF numbers, but it amounts to stating 
that the life expectancy of a 10-year-old kid is 5000 years, based on 
the 99.98% probability that the kid will survive the next year (these 
numbers are quoted from IEEE Spectrum, Sept. 2004, see 
http://www.spectrum.ieee.org/WEBONLY/publicfeature/sep04/0904age.html).
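
The arithmetic behind the 5000-year figure can be checked with a short 
Python sketch.  The function name is hypothetical, and the calculation 
deliberately makes the same mistake: it treats the lowest death rate of 
a lifetime as if it held forever.

```python
def naive_life_expectancy_years(annual_survival_probability: float) -> float:
    """Reciprocal of the annual death rate, held constant for all ages."""
    annual_death_rate = 1.0 - annual_survival_probability
    return 1.0 / annual_death_rate


# 99.98% yearly survival is a 0.02% yearly death rate: about 5000 years.
print(naive_life_expectancy_years(0.9998))
```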

Both humans and machines fall apart at higher rates in infancy, as well 
as with age, when built-in redundancy wears thin due to accumulated 
damage.  The disk drive MTBF number covers neither drives that fail 
early nor the rising failure rates of old or heavily used drives.  If, 
somewhat questionably, human life expectancy is taken as a guide, disk 
manufacturers' MTBF numbers ought to be de-rated by about a factor of 
50-70 to make practical sense (e.g. a 1.4M hour MTBF drive might last 
some 25,000 hours) -- but even this applies only under nominal 
conditions, where the above-mentioned statistical MTBF estimate is not 
wildly inaccurate.
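
As a back-of-the-envelope check of that de-rating, the Python sketch 
below divides the quoted MTBF by the factors of 50 and 70 suggested 
above.  The helper name and the 8766-hour year are assumptions for 
illustration only.

```python
HOURS_PER_YEAR = 8766.0  # assumed convention: 365.25 days * 24 hours


def derated_hours(mtbf_hours: float, derating_factor: float) -> float:
    """Practical lifetime guess: quoted MTBF divided by a de-rating factor."""
    return mtbf_hours / derating_factor


for factor in (50, 70):
    hours = derated_hours(1.4e6, factor)
    # Prints the factor, the de-rated hours, and the same figure in years.
    print(factor, hours, round(hours / HOURS_PER_YEAR, 1))
```

For a 1.4M hour drive this gives between 20,000 and 28,000 hours, i.e. 
somewhere between two and a bit over three years of continuous running.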

In other words, a drive may last several years at 20 deg. C ambient 
temperature.  Still, this says nothing about its durability at 40+ deg. 
C.  Given that in many systems failure rates increase exponentially with 
temperature, e.g. doubling for every 10 deg. C increase, I would avoid 
baking a drive unless it was specifically designed for high temperature 
operation (if such drives even exist).
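
To put a number on the doubling rule of thumb, here is a minimal Python 
sketch.  The 20 deg. C reference and the 10 degree doubling step are the 
illustrative values from the paragraph above, not measured drive data, 
and the function name is made up.

```python
def failure_rate_multiplier(temp_c: float,
                            reference_temp_c: float = 20.0,
                            doubling_step_c: float = 10.0) -> float:
    """Relative failure rate if it doubles every doubling_step_c degrees."""
    return 2.0 ** ((temp_c - reference_temp_c) / doubling_step_c)


for temp in (20.0, 30.0, 40.0, 50.0):
    print(temp, failure_rate_multiplier(temp))
```

Under that assumption a drive at 40 deg. C fails at four times the 20 
deg. C rate, so a lifetime of several years shrinks to roughly one.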


Sincerely,
Josip


