[Beowulf] Re: Cooling vs HW replacement
josip at lanl.gov
Thu Jan 27 09:26:36 PST 2005
Karen Shaeffer wrote:
> If DDMs were interested in helping customers discriminate based on the
> actual expected lifetime of drives, they would all publish running infant
> mortality rates, updated weekly, during the production run of their disk
> drives. After all, this is the one metric the entire organization is focused
> on during production. But, what they hand out is this MTBF number to
> prospective customers. A number they pay no attention to internally.
Karen's excellent introduction to the logic of disk drive manufacturers
(DDMs) is well worth reading -- particularly since the same factors drive
other computer manufacturers: rapid product cycles, insane time
pressures, thin profit margins, limited opportunity to prevent
financially ruinous mistakes, etc.
I'd just like to offer my personal guesses of what manufacturers of
commodity disk drives want to achieve: product lifespan of about 5
years, infant mortality under 1%, and competitive MTBF numbers. While
MTBF claims are indeed soft, they are the only published data that
relates to the mid-life failure rate, i.e. the period of peak interest
to cluster users.
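
To make the link between an MTBF claim and a mid-life failure rate concrete, here is a rough sketch in Python. It assumes a constant failure rate during mid-life (an exponential lifetime model), which is the usual reading of an MTBF figure but is not something the manufacturers guarantee. The 1,000,000-hour MTBF and the 1000-drive cluster are hypothetical numbers chosen only for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def annualized_failure_rate(mtbf_hours):
    """Probability that one drive fails within a year of continuous use.

    Assumption: constant hazard rate (exponential lifetime), so this
    applies only to the flat mid-life part of the failure curve, not to
    infant mortality or end-of-life wear-out.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(n_drives, mtbf_hours):
    """Expected number of drives failing per year in a cluster."""
    return n_drives * annualized_failure_rate(mtbf_hours)


# Hypothetical inputs, not taken from any vendor data sheet.
claimed_mtbf_hours = 1_000_000
cluster_drives = 1000

afr = annualized_failure_rate(claimed_mtbf_hours)
failures = expected_failures_per_year(cluster_drives, claimed_mtbf_hours)
print(f"AFR: {afr:.4%}, expected failures per year: {failures:.1f}")
```

Under these assumptions even a soft MTBF claim translates into a number a cluster builder can budget for, such as how many spare drives to keep on the shelf each year.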
Infant mortality <1% is probably acceptable to cluster builders, but as
Karen pointed out, things can go wrong. Although DDMs will try to fix
problems before too many bad drives are sold, the basic fact is that a
bad batch of drives is something that neither DDMs nor their customers
could have predicted (otherwise, DDMs would not put themselves at
financial risk -- and they have more information than we do).
Therefore, deciding which drive model (or other component) to use fits
under the topic of optimal decision making under uncertainty -- which is
a standard part of decision theory and game theory, often used in
operations research, etc.
Making rational choices, which can withstand scrutiny even when things
unexpectedly go wrong, is not just an art. There is theory to build on.
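
As a small illustration of the kind of theory meant here, the sketch below compares two textbook decision criteria: minimum expected cost and minimax regret. Everything in it is hypothetical: the two drive models, the cost figures (purchase plus replacement and downtime, in arbitrary units), and the 10% chance of a bad batch are made up for the example and are not data from any manufacturer.

```python
# Hypothetical total cost of each choice under each scenario.
costs = {
    "model_A": {"good_batch": 100.0, "bad_batch": 400.0},
    "model_B": {"good_batch": 130.0, "bad_batch": 200.0},
}

# Hypothetical beliefs about how likely each scenario is.
probs = {"good_batch": 0.9, "bad_batch": 0.1}


def expected_cost(option, costs, probs):
    """Probability-weighted average cost of one option."""
    return sum(probs[s] * costs[option][s] for s in probs)


def min_expected_cost_choice(costs, probs):
    """Pick the option with the lowest expected cost."""
    return min(costs, key=lambda o: expected_cost(o, costs, probs))


def minimax_regret_choice(costs):
    """Pick the option whose worst-case regret is smallest.

    Regret in a scenario is the extra cost paid compared with the best
    option for that scenario. This criterion ignores the probabilities.
    """
    scenarios = next(iter(costs.values())).keys()
    best = {s: min(costs[o][s] for o in costs) for s in scenarios}
    worst_regret = {
        o: max(costs[o][s] - best[s] for s in scenarios) for o in costs
    }
    return min(worst_regret, key=worst_regret.get)


by_expectation = min_expected_cost_choice(costs, probs)
by_regret = minimax_regret_choice(costs)
print("lowest expected cost:", by_expectation)
print("minimax regret:", by_regret)
```

With these made-up numbers the two criteria disagree: the cheaper drive wins on expected cost, while the more expensive one limits the damage if a bad batch turns up. Writing the criterion down beforehand is what lets a choice withstand scrutiny when things go wrong.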