[Beowulf] Re: Cooling vs HW replacement
Jim Lux James.P.Lux at jpl.nasa.gov
Fri Jan 21 14:58:47 PST 2005
At 11:09 AM 1/21/2005, David Mathog wrote:
> >
> > > or "Server" grade disks still cost a lot more than that. For
> >
> > this is a very traditional, glass-house outlook. it's the same one
> > that justifies a "server" at $50K being qualitatively different
> > from a commodity 1U dual at $5K. there's no question that there
> > are differences - the only question is whether the price justifies
> > those differences.
>
>The MTBF rates quoted by the manufacturers are one indicator
>of disk reliability, but from a practical point of view the number
>of years of warranty coverage on the disk is a more useful metric.
>
>The manufacturer has an incentive to be sure that those disks
>with a 5 year warranty really will last 5 years. Unclear
>to me what their incentive is to support the MTBF rates since only
>a sustained and careful testing regimen over many, many disks could
>challenge the manufacturer's figures. And who would run such
>an analysis??? Buy the 5 year disk and you'll have a working
>disk, or a replacement for it, for 5 years.

While disk MTBFs may seem unrealistic (as was pointed out, nobody is likely to run a single disk for 100+ years), they are a "common currency" in the reliability calculation world, as are "Failures in Time" (FIT), which is the number of failures in a billion (1E9) hours of operation.

What would be very useful (and is something that does get analyzed for some customers, who care greatly about this stuff) is to compare the MTBF of a widget determined by calculation and analysis (look up the component reliabilities, calculate the probability of failure for the ensemble) with the MTBF of the same widget determined by test (run 1000 disk drives for months), especially if you run what are called "accelerated life tests" at elevated temperatures or higher duty factors.
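As a rough illustration of the MTBF and FIT "currency" described above, here is a minimal Python sketch. It assumes a constant failure rate (the exponential model), which is an assumption added here and not something the post states; the function names are hypothetical.

```python
import math

FIT_HOURS = 1e9  # FIT counts failures per billion (1E9) device-hours


def mtbf_to_fit(mtbf_hours):
    """Convert an MTBF in hours to a FIT rate (failures per 1E9 hours)."""
    return FIT_HOURS / mtbf_hours


def fit_to_mtbf(fit):
    """Convert a FIT rate back to an MTBF in hours."""
    return FIT_HOURS / fit


def survival_probability(mtbf_hours, hours):
    """Probability that one unit runs `hours` without failing.

    Assumes a constant failure rate, so R(t) = exp(-t / MTBF).
    """
    return math.exp(-hours / mtbf_hours)


# A disk quoted at 1E6 hours MTBF corresponds to 1000 FIT.
print(mtbf_to_fit(1e6))

# Under the same assumption, the chance that a single such disk
# survives a 5-year warranty period (5 * 8760 hours) is about 0.957.
print(survival_probability(1e6, 5 * 8760))
```

The point of the last line is that a 100+ year MTBF is not a lifetime claim for one disk; it is a rate that only becomes meaningful over a population or a stated time window.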
MTBFs are also used because they're easier to understand and handle than things like "reliability", which winds up being 0.999999999, or failure rates per unit time, which wind up being very tiny numbers (unless "unit time" is a billion hours).

And, if I were asked to estimate the reliability of a PC, I'd want to get the MTBF numbers for all the assemblies, and then I could calculate a composite MTBF, which might be surprisingly short. If I then had to calculate the time between PC failures in a cluster of 1000 computers, it would be appallingly short.

To a first order, an ensemble of 1000 units, each with an MTBF of 1E6 hours, will have an MTBF of only 1000 hours, which isn't all that long... and if the MTBF of those units is only 1E5 hours, because you're running them 25 degrees hotter than expected, only a few days will go by before you get your first failure.

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875
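The first-order ensemble arithmetic in the message above can be sketched in a few lines of Python. This assumes independent units with constant failure rates, so that failure rates simply add; the function names are hypothetical, and the per-assembly MTBF values are invented purely for illustration and are not taken from any datasheet.

```python
def composite_mtbf(mtbfs_hours):
    """MTBF of a system that fails when any one part fails.

    Assumes independent parts with constant failure rates, so the
    system failure rate is the sum of the part failure rates.
    """
    return 1.0 / sum(1.0 / m for m in mtbfs_hours)


def ensemble_mtbf(unit_mtbf_hours, n_units):
    """Mean time between failures across n identical units."""
    return unit_mtbf_hours / n_units


# Figures from the message: 1000 units at 1E6 hours each.
print(ensemble_mtbf(1e6, 1000))        # 1000.0 hours

# Same cluster if running hot cuts the unit MTBF to 1E5 hours.
print(ensemble_mtbf(1e5, 1000) / 24)   # roughly 4.2 days

# Composite MTBF of one PC from its assemblies.
# These values are made up for illustration only.
HYPOTHETICAL_ASSEMBLY_MTBF = {
    "disk": 1e6,
    "power_supply": 1e5,
    "fan": 5e4,
    "motherboard": 3e5,
}
print(composite_mtbf(HYPOTHETICAL_ASSEMBLY_MTBF.values()))
```

The composite figure is always shorter than the MTBF of the weakest assembly, which is why the per-PC number can be "surprisingly short" even when every part looks good on paper.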
