[Beowulf] Re: Cooling vs HW replacement

Mon Jan 24 10:09:50 PST 2005

At 08:58 AM 1/24/2005, Josip Loncaric wrote:
>Jim Lux wrote:
>>Actually, I'd trust the MTBF and other reliability data more than the
>>warranty, and here's why:
>
>I agree -- but I wished I had two more numbers: percentage lost to infant 
>mortality, and possibly the overall life expectancy.  This would describe 
>the "bathtub" failure rate graph in a way that I can apply in practice, 
>while MTBF alone is only a partial description.
>
>Life expectancy for today's drives is probably longer than the useful life 
>of a computer cluster (3-4 years, but see below).  Therefore, midlife MTBF 
>numbers should be a good guide of how many disk replacements the cluster 
>may need annually.
>
>However, infant mortality can be a *serious* problem.  Once you install a 
>bad batch of drives and 40% of them start to go bad within months, you've 
>got an expensive problem to fix (in terms of the manpower required), 
>regardless of what the warranty says.

The Seagate documentation actually had some charts in there with expected 
failure rates, by month, for the first few months.

>Manufacturers are starting to address this concern, but in ways that are 
>very difficult to compare.  For example, Maxtor advertises "annualized 
>return rate <1%" which presumably relates to the number of drives returned 
>for warranty service, but comparing Maxtor's numbers to anyone else's is 
>mere guesswork.

Indeed. Annualized return rate is an economic planning number, good if 
you're a retailer or consumer manufacturer trying to estimate how much to 
allow for returns, but hardly a testable specification.  (I would imagine, 
though, that they can, if necessary, back up their <1% return rate with 
documentation.)  If I were an HP making millions of consumer computers, 
that's the spec I'd really want to see.

If I were a cluster builder or server farm operator with concerns about 
failure rates over a 3 year design life, then I'd want to see real 
reliability data.

>Even if manufacturers were to truthfully report their overall warranty 
>return experience, this would not prevent them from releasing a bad batch 
>of drives every now and then.  Only those manufacturers that routinely 
>fail to meet industry's typical reliability get reputations bad enough to 
>erode their financial position -- so I suspect that average warranty 
>return percentages (for surviving manufacturers) would turn out to be 
>virtually identical -- and thus not very significant for cluster design 
>decisions.

Precisely true...

>Until a better solution is found, we can only make educated guesses -- and 
>share anecdotal stories about bad batches to avoid...
>

Or, spend some time with the full reliability data and make a "calculated" 
guess.
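
As a rough sketch of what such a calculated guess might look like: the 
Python below turns a datasheet MTBF into an annualized failure rate and an 
expected number of replacements per year.  It assumes a constant failure 
rate (the flat, midlife part of the bathtub curve), so it says nothing 
about infant mortality or wear-out.  The drive count and MTBF are 
hypothetical numbers picked for illustration, not vendor data.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def annualized_failure_rate(mtbf_hours):
    """Probability that one drive fails within a year of continuous use,
    assuming a constant (exponential) failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_annual_replacements(n_drives, mtbf_hours):
    """Expected failures per year when each failed drive is replaced,
    so the population stays at n_drives."""
    return n_drives * HOURS_PER_YEAR / mtbf_hours


if __name__ == "__main__":
    n_drives = 1000        # hypothetical cluster size
    mtbf_hours = 500_000   # hypothetical datasheet MTBF
    afr = annualized_failure_rate(mtbf_hours)
    per_year = expected_annual_replacements(n_drives, mtbf_hours)
    print(f"AFR per drive: {afr:.2%}")
    print(f"Expected replacements per year: {per_year:.1f}")
```

With those made-up inputs the estimate comes to roughly 17.5 replacements 
a year.  A bad batch would show up as a count well above that in the 
first few months.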

This is kind of what separates the big companies from the small ones.  The 
big ones have the resources to do meaningful tests (e.g., pull 1000 units 
off the line and life test them); the small ones don't.

It's interesting. To a certain extent this discussion reflects the change 
in Beowulfery: from making use of commodity consumer equipment (because 
it's cheap) and living within its limitations (interconnect bandwidth, 
etc.), to far more specialized cluster computing, where you're looking at 
the details of node reliability, infrastructure issues, etc.

Part of it is that the scale of clusters has increased.  It used to be 4 
or 8 computers in the typical cluster, and even with kind of crummy 
reliability, it worked OK.  The failures weren't so common that you 
couldn't run a big job, and the impact of having to replace a machine 
wasn't so huge.

Now, though, with 1000 nodes, the reliability becomes much more important, 
because the cluster-wide failure rate is roughly the per-node rate 
multiplied by 1000, instead of by 8.
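
To put rough numbers on that scaling, here is a minimal Python sketch.  It 
assumes nodes fail independently at a constant rate and that any single 
node failure kills the job.  The node MTBF and job length are hypothetical 
values chosen only to show the trend.

```python
import math


def cluster_mtbf_hours(n_nodes, node_mtbf_hours):
    """Mean time between failures anywhere in the cluster, assuming
    independent nodes with the same constant failure rate."""
    return node_mtbf_hours / n_nodes


def job_survival_probability(n_nodes, node_mtbf_hours, job_hours):
    """Probability that no node fails during a job of job_hours,
    under the same constant-rate assumption."""
    return math.exp(-n_nodes * job_hours / node_mtbf_hours)


if __name__ == "__main__":
    node_mtbf_hours = 50_000  # hypothetical whole-node MTBF
    job_hours = 24            # hypothetical job length
    for n_nodes in (8, 1000):
        mtbf = cluster_mtbf_hours(n_nodes, node_mtbf_hours)
        p_ok = job_survival_probability(n_nodes, node_mtbf_hours, job_hours)
        print(f"{n_nodes:5d} nodes: a failure every {mtbf:.0f} h on average, "
              f"P(job finishes) = {p_ok:.1%}")
```

Under these assumptions the 8-node cluster sees a failure about every 
6250 hours and almost always finishes a day-long job, while the 1000-node 
cluster sees one about every 50 hours and finishes only around 62% of 
such jobs without checkpointing.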

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875