[Beowulf] Re: Cooling vs HW replacement
Robert G. Brown rgb at phy.duke.edu
Sun Jan 23 22:57:16 PST 2005
On Sun, 23 Jan 2005, Greg Lindahl wrote:

> On Sun, Jan 23, 2005 at 11:30:30AM -0500, Robert G. Brown wrote:
>
> > So I reiterate -- MTBF for hard disks, as reported by the manufacturer,
> > is a nearly useless number.
>
> It is useful if you use it for what it's meant to be used for: the
> failure rate in the bottom of the bathtub. I don't know why you were
> thinking of using it for anything else, like disk lifetime, or infant
> mortality. I have found that my actual failure rates have been 2X-3X
> the manufacturer's number, but you always have to worry about dust,
> power surges, and excess heat incidents in real machine rooms.

I think >>everybody<< finds that actual failure rates are (at least) 2x-3x the mfr number, and finds that it varies wildly in time and with environmental conditions and with plain old luck. That's why (and what I mean by stating that) mfr MTBF quotations are optimistic and cheery. If you've developed Kentucky Windage for their numbers that makes them useful to you, that's great, but you've got a LOT of experience on which to base that correction, and can still get burned by the fact that actual failures are (at best) not terribly uniformly distributed -- the "lemon" phenomenon of manufacturing, also known as "the box of disks that fell from the truck during shipping".

Otherwise, what I was basically doing is describing the bathtub (which might, in fact, be more of a kitchen sink with a quite small flat region, given that the testing cannot, obviously, take long enough to define a proper tub floor). That is, we don't really know much about the bathtub size or shape for any drive except (perhaps) for whatever we can infer from the mfr warranty on the particular drive in question, and even THAT is bent out of ideal shape by the actual conditions (such as the particular case it is mounted in and how good its ventilation is and the temperature of the ambient air and how hard it is being run).

As was pointed out by Karen (and I agree) the mfr warranty period is perhaps a better number for most people to pay attention to than MTBF, as it is the only number that actually costs mfrs money when a disk "prematurely" fails and the only number that does you any good if you buy a hundred disks -- or even just one -- that turn out to be from a "bad batch".

Being a cynic, I cannot keep from thinking of the dozens of ways an overly good MTBF number could be "cooked" by a mfr, the near certainty that nobody will ever do anything like a study that could refute it if they pulled it out of thin air, and the lack of financial incentives to make it pessimistic or even accurate. Maybe they are all perfectly honest and drive failure rates are really just 1%/year or thereabouts (on the bathtub floor) and I just never noticed it, or was unlucky, or beat the disks to death by using them in actual computers that only rarely used the disks at all ;-)

With a warranty, though, while I still care I care less -- I still have to hassle with the replacement but I don't have to buy the disk over again, even if it is just one drive in 100 in a year.

Even the warranty period and marginal cost is a less than perfect predictor. I'll bet that in the consumer marketplace they don't actually have to make good on more than two potential warranty claims out of three for three year drives -- RMA is a PITA and probably daunts many a should-be claimant after a 1 year system warranty expires, or they are sold the systems and never told that the drives have a three year warranty.

Dropping the warranty on most OTC disks to 1 year sends a pretty negative signal to me, at least, as does the explicit marginal cost of adding back the missing two years. The dollar amounts imply that the MANUFACTURERS are expecting a whole lot more than 1% of ANY batch of disks to fail per year, even out there on the bathtub floor. Who should I believe -- the MTBF or the money?
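
For readers who want the arithmetic behind "1%/year on the bathtub floor", here is a rough sketch in Python. It assumes a constant failure rate (the flat part of the bathtub only, so no infant mortality or wear-out), and the quoted MTBF of 876,000 hours is a hypothetical value chosen only because it works out to about 1%/year; it is not a figure from any manufacturer. The `derating` parameter is likewise an illustrative name for the 2x-3x real-world correction discussed in this thread.

```python
import math

HOURS_PER_YEAR = 8760  # 24 * 365, ignoring leap years


def annual_failure_rate(mtbf_hours: float) -> float:
    """Probability that one drive fails within a year of continuous use,
    assuming a constant hazard rate (exponential lifetime model)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures(n_drives: int, mtbf_hours: float, derating: float = 1.0) -> float:
    """Expected failures per year in a fleet of n_drives.
    derating divides the quoted MTBF (2.0 means failures twice as frequent)."""
    return n_drives * annual_failure_rate(mtbf_hours / derating)


if __name__ == "__main__":
    quoted_mtbf = 876_000.0  # hypothetical quoted MTBF, roughly 1%/year
    for factor in (1.0, 2.0, 3.0):
        n = expected_failures(100, quoted_mtbf, derating=factor)
        print(f"derating {factor:.0f}x: about {n:.1f} failures/year per 100 drives")
```

Under these assumptions, a fleet of 100 drives goes from roughly one failure a year at the quoted number to roughly three at a 3x correction, which is the gap between the MTBF and the experience described above.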

> MTBF for just about everything is computed the same way, and most gizmos
> have the same bathtub-shaped failure curve.

I'm reminded of a line in a statistics book I once read (I can't remember which one, alas) in which the author had just done a lengthy analysis of failure rates and probability and arrived at a mathematically proven and statistically sound conclusion based on the observations and premises, and then ended his argument with "but everybody >>knows<< that things go wrong more often than >>that<<" or something similar. His point (I think) was that statistics are lovely but use your gut and your head as well -- reality check time.

I tend to think more in terms of warranties and Murphy than in mfr's MTBF, especially when MTBF is a number with absolutely no financial penalty attached to it, derived from measurements that (necessarily) are not in the actual context of most usage.

   rgb

-- 
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
