[Beowulf] Re: Cooling vs HW replacement

Josip Loncaric josip at lanl.gov
Thu Jan 27 09:26:36 PST 2005

Karen Shaeffer wrote:
> [...]
> If DDMs were interested in helping customers discriminate based on the
> actual expected lifetime of drives, they would all publish running infant
> mortality rates, updated weekly, during the production run of their disk
> drives. After all, this is the one metric the entire organization is focused
> on during production. But, what they hand out is this MTBF number to
> prospective customers. A number they pay no attention to internally.

Karen's excellent introduction to the logic of disk drive manufacturers 
(DDMs) is well worth reading -- particularly since the same factors drive 
other computer manufacturers: rapid product cycles, insane time 
pressures, thin profit margins, and limited opportunity to prevent 
financially ruinous mistakes.

I'd just like to offer my personal guesses about what manufacturers of 
commodity disk drives want to achieve: a product lifespan of about 5 
years, infant mortality under 1%, and competitive MTBF numbers.  While 
MTBF claims are indeed soft, they are the only published data that 
relate to the mid-life failure rate, i.e. the period of peak interest 
to cluster users.
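
To make the link between an MTBF claim and the mid-life failure rate 
concrete, here is a minimal sketch.  It assumes a constant failure rate 
(exponential lifetimes) during mid-life, which is a modeling assumption 
and not something DDMs publish.  The 500,000 hour MTBF and the 1000 
drive cluster are made-up numbers for illustration only.

```python
import math

HOURS_PER_YEAR = 8760  # 24 * 365, drives assumed powered on continuously


def annualized_failure_rate(mtbf_hours):
    """Probability that one drive fails within a year, assuming a
    constant failure rate of 1/MTBF (exponential lifetime model)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(n_drives, mtbf_hours):
    """Expected number of drives failing within a year in a cluster
    of n_drives (no replacement of failed drives is modeled)."""
    return n_drives * annualized_failure_rate(mtbf_hours)


if __name__ == "__main__":
    mtbf = 500_000   # hypothetical vendor MTBF claim, in hours
    drives = 1000    # hypothetical cluster size
    afr = annualized_failure_rate(mtbf)
    print(f"AFR: {afr:.2%}")
    print(f"Expected failures/year: {expected_failures_per_year(drives, mtbf):.1f}")
```

Under these assumptions, even a generous MTBF claim still implies a 
steady trickle of replacements in a large cluster, and the sketch says 
nothing about infant mortality or wear-out.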

Infant mortality <1% is probably acceptable to cluster builders, but as 
Karen pointed out, things can go wrong.  Although DDMs will try to fix 
problems before too many bad drives are sold, the basic fact is that a 
bad batch of drives is something that neither DDMs nor their customers 
could have predicted (otherwise, DDMs would not put themselves at 
financial risk -- and they have more information than we do).

Therefore, deciding which drive model (or other component) to use fits 
under the topic of optimal decision making under uncertainty -- which is 
a standard topic in decision theory and game theory, often used in 
operations research, etc.

Making rational choices, which can withstand scrutiny even when things 
unexpectedly go wrong, is not just an art.  There is theory to build on.
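
As a rough illustration of the kind of reasoning involved, the sketch 
below compares two drive models under two common decision criteria: 
lowest expected cost, and lowest worst-case cost.  Everything in it is 
hypothetical: the model names, prices, scenario probabilities, failure 
fractions, and the per-failure replacement cost are invented numbers, 
not data from any DDM.

```python
# Each option: unit price and a list of (probability, failure_fraction)
# scenarios, e.g. "normal batch" vs. "bad batch".  All values are
# hypothetical and chosen only to illustrate the calculation.
OPTIONS = {
    "cheap":   {"price": 100.0, "scenarios": [(0.90, 0.05), (0.10, 0.40)]},
    "premium": {"price": 130.0, "scenarios": [(0.97, 0.03), (0.03, 0.25)]},
}


def expected_cost(option, n_drives, cost_per_failure):
    """Purchase cost plus failure cost, averaged over the scenarios."""
    mean_fail = sum(p * frac for p, frac in option["scenarios"])
    return n_drives * (option["price"] + cost_per_failure * mean_fail)


def worst_case_cost(option, n_drives, cost_per_failure):
    """Purchase cost plus failure cost in the worst scenario."""
    worst_fail = max(frac for _, frac in option["scenarios"])
    return n_drives * (option["price"] + cost_per_failure * worst_fail)


def choose(criterion, n_drives=100, cost_per_failure=300.0):
    """Return the option name minimizing the given cost criterion."""
    return min(OPTIONS,
               key=lambda name: criterion(OPTIONS[name], n_drives,
                                          cost_per_failure))


if __name__ == "__main__":
    print("Lowest expected cost:  ", choose(expected_cost))
    print("Lowest worst-case cost:", choose(worst_case_cost))
```

With these made-up numbers the two criteria disagree, which is the 
point: the choice depends on which criterion you are prepared to 
defend when a bad batch does show up.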


