[Beowulf] Re: Cooling vs HW replacement
josip at lanl.gov
Mon Jan 24 08:58:47 PST 2005
Jim Lux wrote:
> Actually, I'd trust the MTBF and other reliability data more than the
> warranty, and here's why:
I agree -- but I wish I had two more numbers: the percentage lost to
infant mortality, and possibly the overall life expectancy. These would
describe the "bathtub" failure rate graph in a way that I can apply in
practice, while MTBF alone is only a partial description.
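To make the "bathtub" idea concrete, here is a rough Python sketch that builds such a curve as the sum of three failure-rate terms: a falling Weibull term for infant mortality, a constant 1/MTBF term for midlife, and a rising Weibull term for wear-out. This is a common textbook construction, not anything a drive vendor publishes, and every parameter value below (the 500,000 hour MTBF, the shapes, the scales) is an invented placeholder chosen only to produce the right shape.

```python
def weibull_hazard(t_hours, shape, scale_hours):
    """Instantaneous failure rate (failures per drive-hour) of a Weibull
    distribution. shape < 1 gives a falling rate, shape == 1 a constant
    rate, shape > 1 a rising rate. Requires t_hours > 0."""
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1)


def bathtub_hazard(t_hours,
                   mtbf_hours=500_000,        # assumed midlife MTBF
                   infant_shape=0.5,          # assumed: falling early rate
                   infant_scale_hours=1e8,    # assumed
                   wear_shape=6.0,            # assumed: rising late rate
                   wear_scale_hours=80_000):  # assumed
    """Sum of infant-mortality, midlife, and wear-out failure rates."""
    infant = weibull_hazard(t_hours, infant_shape, infant_scale_hours)
    midlife = 1.0 / mtbf_hours
    wearout = weibull_hazard(t_hours, wear_shape, wear_scale_hours)
    return infant + midlife + wearout


if __name__ == "__main__":
    # Early life, midlife, and late life (about 4 days, 2.3 years, 6.8 years).
    for hours in (100, 20_000, 60_000):
        rate = bathtub_hazard(hours)
        print(f"{hours:>6} h: {rate:.2e} failures per drive-hour")
```

With these placeholder values the rate is highest at the two ends and lowest in the middle, and the midlife rate stays close to 1/MTBF. The two extra numbers asked for above are what would pin down the infant and wear-out terms.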
Life expectancy for today's drives is probably longer than the useful
life of a computer cluster (3-4 years, but see below). Therefore,
midlife MTBF numbers should be a good guide to how many disk
replacements the cluster may need annually.
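As a back-of-the-envelope illustration of that estimate: with a constant midlife failure rate of 1/MTBF, and failed drives being replaced, the expected number of replacements per year is roughly the number of drives times the power-on hours per year divided by the MTBF. The function name, the 1000-drive cluster, and the 500,000 hour MTBF below are made-up examples, not figures from any data sheet.

```python
HOURS_PER_YEAR = 8760  # 24 hours x 365 days


def expected_annual_replacements(n_drives, mtbf_hours,
                                 power_on_hours_per_year=HOURS_PER_YEAR):
    """Expected drive replacements per year, assuming a constant (midlife)
    failure rate of 1/MTBF and that failed drives are swapped out.
    Ignores infant mortality and wear-out entirely."""
    return n_drives * power_on_hours_per_year / mtbf_hours


if __name__ == "__main__":
    # Hypothetical cluster: 1000 drives running 24/7, quoted MTBF 500,000 h.
    print(expected_annual_replacements(1000, 500_000))  # prints 17.52
```

So a 1000-drive cluster under these assumptions would see on the order of 17 or 18 midlife replacements a year. A bad batch with heavy infant mortality would push the real number far above this.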
However, infant mortality can be a *serious* problem. Once you install
a bad batch of drives and 40% of them start to go bad within months,
you've got an expensive problem to fix (in terms of the manpower
required), regardless of what the warranty says.
Manufacturers are starting to address this concern, but in ways that are
very difficult to compare. For example, Maxtor advertises "annualized
return rate <1%" which presumably relates to the number of drives
returned for warranty service, but comparing Maxtor's numbers to anyone
else's is mere guesswork.
Even if manufacturers were to truthfully report their overall warranty
return experience, this would not prevent them from releasing a bad
batch of drives every now and then. Only those manufacturers that
routinely fail to meet the industry's typical reliability get reputations
bad enough to erode their financial position -- so I suspect that
average warranty return percentages (for surviving manufacturers) would
turn out to be virtually identical -- and thus not very significant for
cluster design decisions.
Until a better solution is found, we can only make educated guesses --
and share anecdotal stories about bad batches to avoid...
P.S. Drives are designed for particular markets: expensive server
drives (->SCSI) are designed to be worked hard 24/7 and rarely spun
down; cheap desktop drives (->ATA) are designed for light workloads
10-12 hr/day and more start/stop cycles. Their respective MTBF figures
assume these different workloads. Moreover, target component lifespan
for cheap drives is 5 years minimum, so this should describe their life
expectancy -- assuming that a particular batch does not have a design
defect creating high infant mortality.
If a cluster is good for 3-4 years and its drives for 5, there will be
some rise in the number of drive replacements needed towards the end,
but probably still within reason. This is as it should be: it makes no
economic sense to overdesign components which will be replaced after 3-4
years anyway. Mature consumer products usually reach this balance of
component reliabilities. We all know what happens with cars: they work
for years with modest maintenance, but then all seems to go wrong at
once, and it's time to get a new one.