hard disk reliability
Shawn Masters
scm@tcdi.com
Thu, 3 Jun 1999 10:49:19 -0400
-----Original Message-----
From: Christoph Wasshuber <wasshub@spdc.ti.com>
To: beowulf <beowulf@beowulf.gsfc.nasa.gov>
Date: Thu, 3 Jun 1999 10:49:19 -0400
Subject: hard disk reliability
>Some days ago someone mentioned that one of
>the big benefits of running a diskless cluster
>is the increased reliability. Hard disks are
>the most unreliable part in PCs. Does anybody
>have manufacturer numbers like MTBF (mean time
>between failure)?
Seagate posted the MTBF for every one of their drives, last I checked
(about 6 months ago). Quantum gave the MTBF for some drives, and for
others a different number from which the MTBF could be derived.
The Seagate drives range from 300,000 hours MTBF on some of the older
drives to 1,000,000 on some of the newer ones, with quite a few in the
800,000 range.
My measured numbers on the Quantums are about 250,000 hours for the
Bigfoot 6.4 gig, and 320,000 for the same-sized Fireball. This is with
sample sizes of 120 and 24 respectively.
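
For anyone who wants to work out a field number like these for their own
drives: the usual point estimate is total drive-hours in service divided
by the number of failures seen. A minimal sketch in Python follows. The
function name is hypothetical and the inputs are made up for illustration;
they are not the measurements behind the figures above. It also assumes
every drive ran for the same length of time and that failed drives were
replaced promptly.

```python
def estimate_mtbf_hours(drive_count, hours_in_service, failures):
    """Point estimate of MTBF: accumulated drive-hours per observed failure.

    Assumes every drive ran for the same number of hours and that failed
    drives were replaced promptly, so the population stayed constant.
    """
    if failures <= 0:
        raise ValueError("need at least one failure to estimate MTBF")
    return drive_count * hours_in_service / failures


# Hypothetical example: 120 drives, one year of service (8760 h), 4 failures.
mtbf = estimate_mtbf_hours(120, 8760, 4)
print(f"estimated MTBF: {mtbf:,.0f} hours")
```

With a small sample (24 drives, say) a single extra failure moves the
estimate a lot, so treat such numbers as rough.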
>I would also be interested in comments from
>people running beowulfs with 100 or more
>nodes, where every node has a hard disk. Do
>you guys exchange a hard disk every month?
>Or even every week?
With low-end drives and that many nodes you will be replacing a drive
every few months. We experienced an MTBF of about 60 days when cooling
wasn't adequate (note there were eight drives in each system), but some
simple fixes brought it up to about 90 before we finished with the array.
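
The "every few months" figure is just arithmetic. As a rough sketch,
assuming each drive fails independently at a constant rate (which ignores
infant mortality, wear-out, and cooling problems like ours), the mean time
between failures somewhere in the pool is the per-drive MTBF divided by
the number of drives. The function name below is made up for illustration.

```python
HOURS_PER_DAY = 24


def days_between_failures(mtbf_hours, drive_count):
    """Mean days between failures anywhere in a pool of identical drives.

    Assumes a constant, independent failure rate per drive, so the pool's
    failure rate is simply drive_count / mtbf_hours.
    """
    return mtbf_hours / drive_count / HOURS_PER_DAY


# Per-drive MTBF figures in hours, for a 100-drive cluster.
for mtbf in (250_000, 600_000, 1_000_000):
    days = days_between_failures(mtbf, 100)
    print(f"{mtbf:>9,} h MTBF, 100 drives: one failure every {days:.0f} days")
```

For 100 drives this gives roughly 104 days at 250,000 hours and 250 days
at 600,000 hours.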
>How serious is the hard disk reliability issue
>in reality?
With a hundred drives you will notice the MTBF, even under perfect
conditions. Choose your drives based on the cost of losing one at the
calculated frequency, and that will tell you what you can afford. If 90
days between node losses is acceptable then you can buy cheap. If runs need
to be over 180 days then you need to look at higher MTBFs (in the 600,000+
hour range). Overall the price difference isn't as large now as it was
when I built these arrays.
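
To turn a target run length into a drive spec, here is a small sketch in
Python. It assumes independent drives with a constant failure rate, and
the function name is hypothetical. It gives the per-drive MTBF at which
the mean interval between failures in the cluster equals the target; a run
can still lose a node before that, so buy with some margin.

```python
def required_mtbf_hours(target_days, drive_count):
    """Per-drive MTBF (hours) at which the mean time between failures
    across drive_count drives equals target_days.

    Assumes independent drives with a constant failure rate.
    """
    return target_days * 24 * drive_count


# 100-drive cluster, for the two run lengths discussed above.
for days in (90, 180):
    need = required_mtbf_hours(days, 100)
    print(f"{days} days between failures needs about {need:,} hours MTBF")
```

For 100 drives that works out to 216,000 hours for 90 days and 432,000
hours for 180 days, which is why the 600,000+ drives leave some headroom.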
73,