[Beowulf] Re: failure trends in a large disk drive population

Robin Harker robin at workstationsuk.co.uk
Thu Feb 22 00:10:49 PST 2007


So if we now know this (and we have seen similarly spurious behaviour with
SATA RAID arrays), isn't the real solution to lose the node discs?

Regards

Robin


>
>>> How did they look for predictive models on the SMART data?  It sounds
>>> like they did a fairly linear data decomposition, looking for first
>>> order correlations.  Did they try, e.g., to build a neural network on it,
>>> or use fully multivariate methods (ordinary stats can handle up to
>>> 5-10 variables)?
>>>
>>> This is really an extension of David's questions below.  It would be
>>> very interesting to add variables to the problem (if possible) until the
>>> observed correlations resolve (in sufficiently high dimensionality) into
>>> something significantly predictive.  That would be VERY useful.
>>>
>>
>> RGB, good idea, apply clustering/GA/MOGA analysis techniques to all of
>> this data. Now the question is, will we ever get access to this data?
>> ;)
>
> As mentioned in an earlier e-mail (I think) there were 4 SMART variables
> whose values were strongly correlated with failure, and another 4-6 that
> were weakly correlated with failure.  However, of all the disks that
> failed, less than half (around 45%) had ANY of the "strong" signals and
> another 25% had some of the "weak" signals.  This means that over a
> third of disks that failed gave no appreciable warning.  Therefore even
> combining the variables would give no better than a 70% chance of
> predicting failure.
>
> To make things worse, many of the "weak" signals were found on a
> significant number of disks.  For example, among the disks that failed,
> many had a large number of seek errors; however, over 70% of disks in the
> fleet -- failed and working -- had a large number of seek errors.
>
> About all I can say beyond what's in the paper is that we're aware of
> the shortcomings of the existing work and possible paths forward.  In
> response, we are
> <GOOGLE_NDA_BOT>
> Hello, this is the Google NDA bot.  In our massive trawling of the
> Internet and other data sources, I have detected a possible violation of
> the Google NDA.  This has been corrected.  We now return you to your
> regularly scheduled e-mail.
> [ Continue ]  [ I'm Feeling Confidential ]
> </GOOGLE_NDA_BOT>
>
> So that's our master plan.  Just don't tell anyone. :)
> -jdm
>
> P.S. Unfortunately, I doubt that we'll be willing or able to release the
> raw data behind the disk drive study.
>
> Department of Computer Science, Duke University, Durham, NC 27708-0129
> Email:	justin at cs.duke.edu
> Web:	http://www.cs.duke.edu/~justin/
>
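
For anyone who wants to check the arithmetic in the quoted message, here is a
minimal Python sketch. The 45%, 25% and 70% figures are the rounded numbers
from the message itself. The annual failure rate and the share of failed disks
showing seek errors are NOT in the message; they are hypothetical values picked
only to show why a signal present on most of the fleet predicts very little.
The function names are made up for this illustration.

```python
def any_signal_fraction(strong_frac: float, weak_only_frac: float) -> float:
    """Best-case share of failed disks that showed any SMART signal at all."""
    return strong_frac + weak_only_frac


def prob_fail_given_signal(p_signal_given_fail: float,
                           p_fail: float,
                           p_signal: float) -> float:
    """Bayes' rule: P(fail | signal) = P(signal | fail) * P(fail) / P(signal)."""
    return p_signal_given_fail * p_fail / p_signal


# Rounded figures from the quoted message.
strong = 0.45      # failed disks with at least one "strong" signal
weak_only = 0.25   # failed disks with only "weak" signals
coverage = any_signal_fraction(strong, weak_only)
print(f"failed disks with some signal: {coverage:.0%}")
print(f"failed disks with no warning:  {1 - coverage:.0%}")

# Seek errors appear on over 70% of the fleet (from the message).
fleet_seek_error_rate = 0.70
# ASSUMPTIONS, not from the message: hypothetical values for illustration.
assumed_failure_rate = 0.05
assumed_seek_errors_among_failed = 0.90

p = prob_fail_given_signal(assumed_seek_errors_among_failed,
                           assumed_failure_rate,
                           fleet_seek_error_rate)
print(f"P(fail | many seek errors) under these assumptions: {p:.1%}")
```

With the rounded inputs, the no-warning share is the remainder after the strong
and weak groups are added together, so it only approximates the "over a third"
quoted above. Under the assumed rates, a disk with many seek errors is barely
more likely to fail than a disk picked at random, which is the point being made
about the "weak" signals.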


Robin Harker
Workstations UK Ltd
DDI: 01494 787710
Tel: 01494 724498