[Beowulf] Re: failure trends in a large disk drive population

Justin Moore justin at cs.duke.edu
Fri Feb 16 16:28:39 PST 2007


>> Despite my Duke e-mail address, I've been at Google since July.  While
>> I'm not a co-author, I'm part of the group that did this study and can
>> answer (some) questions people may have about the paper.
>>
>
> Dangling meat in front of the bears, eh?  Well...

I can always hide behind my duck-blind-slash-moat-o'-NDA. :)

> Is there any info for failure rates versus type of main bearing
> in the drive?
>
> Failure rate versus any other implementation technology?

We haven't done this analysis, but you might be interested in this paper 
from CMU:

http://www.usenix.org/events/fast07/tech/schroeder.html

They performed a similar study on drive reliability -- with the help of 
some people/groups here, I believe -- and found no significant 
differences in reliability between different disk technologies (SATA, 
SCSI, IDE, FC, etc.).

> Failure rate vs. drive speed (RPM)?

Again, we may have the data but it hasn't been processed.

> Or to put it another way, is there anything to indicate which
> component designs most often result in the eventual SMART
> events (reallocation, scan errors) and then, ultimately, drive
> failure?

One of the problems noted in the paper is that even if you assume that 
*any* SMART event is indicative in some way of an upcoming failure -- 
and are willing to deal with a metric boatload of false positives -- 
over one-third of failed drives had zero counts on all SMART parameters. 
And one of those parameters -- seek errors -- was observed on nearly 
three-quarters of the drives in our fleet, so you really would be 
dealing with boatloads of false positives.
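
To give a feel for the scale of that, here's a quick back-of-envelope 
sketch in Python.  The fleet size and AFR below are placeholders I made 
up, not numbers from the paper; the three-quarters and one-third 
figures are the ones mentioned above, and the sketch optimistically 
assumes every failed drive that shows *any* SMART signal also shows 
seek errors.

# Back-of-envelope sketch: flag any drive with a nonzero SMART
# seek-error count and see how noisy that policy is.
fleet_size = 100_000          # hypothetical fleet size (made up)
annual_failure_rate = 0.03    # hypothetical AFR (made up)

frac_fleet_with_seek_errors = 0.75   # "nearly three-quarters of the fleet"
frac_failed_with_no_smart = 1 / 3    # failed drives with zero SMART counts

failures = fleet_size * annual_failure_rate
flagged = fleet_size * frac_fleet_with_seek_errors

# Optimistic ceiling: every failed drive that shows any SMART signal
# also shows seek errors, so at most two-thirds of failures are caught.
true_positives = failures * (1 - frac_failed_with_no_smart)
false_positives = flagged - true_positives

print(f"drives flagged:      {flagged:,.0f}")
print(f"failures caught:     {true_positives:,.0f} (ceiling)")
print(f"false positives:     {false_positives:,.0f}")
print(f"precision (at best): {true_positives / flagged:.1%}")

With those placeholder numbers, well over 90% of the flagged drives 
never fail within the year, and a third of the actual failures are 
never flagged at all.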

> Failure rates versus rack position?  I'd guess no effect here,
> since that would mostly affect temperature, and there was
> little temperature effect.

I imagine it wouldn't matter.  Even if it did, I'm not sure we have this 
data in an easy-to-parse-and-include format.

> Failure rates by data center?  (Are some of your data centers
> harder on drives than others?  If so, why?)

The CMU study is broken down by data center.  It is certainly the 
case in their study that some data centers appear to be harder on 
drives than others, but there may be age and vintage issues coming 
into play (an issue they acknowledge in the paper).  My intuition -- 
again, not having analyzed the data -- is that application 
characteristics, and not data center characteristics, are going to 
have the more pronounced effect.  There is a section on how 
utilization affects AFR over time.

> Are there air pressure and humidity measurements from your data 
> centers? Really low air pressure (as at observatory height) is a known 
> killer of disks, it would be interesting if lesser changes in air 
> pressure also had a measurable effect.  Low humidity cranks up static 
> problems, high humidity can result in condensation.

Once we start getting data from our Tibetan Monastery/West Asia data 
center I'll let you know. :)

-jdm

Department of Computer Science, Duke University, Durham, NC 27708-0129
Email:	justin at cs.duke.edu
Web:	http://www.cs.duke.edu/~justin/


