hard disk reliability
Bob Cat
ClusterHack@snet.net
Fri, 4 Jun 1999 02:22:24 -0400
> I've actually worked with large samples of hard drives in both PCs and
> Suns, and have found the MTBFs to be accurate for prediction of failures
> for large sample sizes. Granted I don't expect a 1,000,000 MTBF drive to
> run for 114 years, but I do expect to replace one in a 1000 drive
> installation every 41 days (give or take a few days). Shawn
Excellent answer, Shawn. And so:
1) MTBF (or MTTF) is calculated from the assumed probability of failure of
the individual parts used in assembling the unit. These calculations *do*
tend to disregard infant mortality. This used to be known as a SWAG.
Actually, it DOES come out pretty close to the true number.
2) You must divide the MTBF of a single unit by the number of units in a
group to get the expected time between failures for the group as a whole,
i.e. how long until *any* unit in the group fails.
3) PLAN FOR FAILURE!!!! It ALWAYS happens. You already knew that, didn't
you?
4) Statistics are fun.
5) I think we'll find that power supply and CPU fans are the most common
failure points. Why? Because they are usually cheaply made. Are there roller
bearings in YOUR fans? I thought not. Real world, we save maybe 10-20 US$
per node by using cheap fans. Only you can determine if this is false
economy for your purposes.
6) There should be significant cost savings in both hardware and electricity
from using properly sized, well-engineered, and well-constructed power
supplies and cooling systems that each serve multiple nodes. Does anyone have
figures on the actual power/cooling requirements of a typical node?
7) This tends to get us away from commodity hardware, but what are we trying
to accomplish, anyway? More bang for the buck, I say. Any other ideas on
increasing bang/buck?
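
For anyone who wants to check the arithmetic in Shawn's example and in point
2, here is a rough Python sketch. It assumes the MTBF rating is in hours, that
all units are identical, and that the failure rate is constant (so infant
mortality and wear-out are ignored, as in point 1). The function names are
made up for illustration only.

```python
import math

HOURS_PER_DAY = 24
HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def fleet_mtbf_hours(unit_mtbf_hours, n_units):
    """Expected time between failures anywhere in a group of identical units."""
    return unit_mtbf_hours / n_units


def prob_unit_fails_within(unit_mtbf_hours, hours):
    """Chance that one unit fails within `hours`.

    Assumption: constant failure rate (exponential lifetime model).
    """
    return 1.0 - math.exp(-hours / unit_mtbf_hours)


unit_mtbf = 1_000_000  # hours, the rating from the quoted message
drives = 1000          # installation size from the quoted message

interval = fleet_mtbf_hours(unit_mtbf, drives)
five_year_risk = prob_unit_fails_within(unit_mtbf, 5 * HOURS_PER_YEAR)

print(f"one drive, on paper: {unit_mtbf / HOURS_PER_YEAR:.0f} years")
print(f"fleet of {drives}: one failure every {interval / HOURS_PER_DAY:.1f} days")
print(f"expected failures per year: {HOURS_PER_YEAR / interval:.2f}")
print(f"chance a given drive fails within 5 years: {five_year_risk:.1%}")
```

Under those assumptions the fleet figure comes out a little under 42 days
between replacements, which matches the 41 days quoted above.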
:ßobÇat.Bat 1.0 >^^< In base(one half) an infinite number approaches unity.
Echo f b800:0000 fff 32 00 e1 09 6f 0f 62 0f 80 04 61 0f 74 0f 32 00 > Bob.Cat
Echo q >> Bob.Cat
DeBug < Bob.Cat > Nul
@Erase Bob.Cat > Nul