[Beowulf] Re: failure trends in a large disk drive population
hahn at mcmaster.ca
Wed Feb 21 18:44:26 PST 2007
> weakly correlated with failure. However, of all the disks that failed, less
> than half (around 45%) had ANY of the "strong" signals and another 25% had
> some of the "weak" signals. This means that over a third of disks that
> failed gave no appreciable warning. Therefore even combining the variables
> would give no better than a 70% chance of predicting failure.
well, a factorial analysis might still show useful interactions.
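
A minimal sketch of what such an analysis could look at, assuming two binary warning signals per drive. The names `sig_a` and `sig_b`, the helper functions, and the toy fleet are all hypothetical illustrations; none of the numbers come from the study. The idea is to tabulate the failure rate in every on/off combination of the signals and check whether the effect of one signal depends on the other.

```python
from itertools import product


def failure_rate_by_cell(drives, signals):
    """Group drives by every on/off combination of `signals`.

    Returns {combo: (n_drives, n_failed, failure_rate)}; the rate is None
    for a combination that has no drives in it.
    """
    cells = {}
    for combo in product((False, True), repeat=len(signals)):
        members = [d for d in drives
                   if tuple(bool(d[s]) for s in signals) == combo]
        failed = sum(1 for d in members if d["failed"])
        rate = failed / len(members) if members else None
        cells[combo] = (len(members), failed, rate)
    return cells


def interaction_contrast(cells):
    """2x2 interaction contrast on failure rates.

    rate(both) - rate(only a) - rate(only b) + rate(neither).
    Zero means the two signals add up independently; a nonzero value
    means the effect of one signal depends on the other.
    """
    keys = ((True, True), (True, False), (False, True), (False, False))
    rates = [cells[k][2] for k in keys]
    if any(r is None for r in rates):
        return None  # an empty cell: the contrast is undefined
    both, only_a, only_b, neither = rates
    return both - only_a - only_b + neither


def make_drives(a, b, n, n_failed):
    """Build n toy drive records, the first n_failed of them failed."""
    return [{"sig_a": a, "sig_b": b, "failed": i < n_failed}
            for i in range(n)]


# Made-up fleet, for illustration only (not data from the study).
toy_fleet = (make_drives(False, False, 100, 2)
             + make_drives(True, False, 100, 4)
             + make_drives(False, True, 100, 4)
             + make_drives(True, True, 100, 20))

cells = failure_rate_by_cell(toy_fleet, ["sig_a", "sig_b"])
contrast = interaction_contrast(cells)
```

In the toy fleet each signal alone barely moves the failure rate, while the combination moves it a lot. That is the kind of interaction a one-variable-at-a-time comparison would miss.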
> number of disks. For example, among the disks that failed, many had a large
> number of seek errors; however, over 70% of disks in the fleet -- failed and
> working -- had a large number of seek errors.
was there any trend across time in the seek errors?
> So that's our master plan. Just don't tell anyone. :)
hah. well, if it were me, the M.P. would involve some sort of proactive
treatment: say, a full-disk read once a day. SMART self-tests _ought_
to be more valuable than that, but otoh, the vendors probably munge the
measurements pretty badly.
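
A rough sketch of the full-disk read idea, assuming a Unix-like system where the disk (or a file standing in for it) can be opened read-only. The function name `full_read` and the chunk size are hypothetical choices. It reads sequentially from start to end, records the offset of any chunk that fails, and carries on.

```python
import os


def full_read(path, chunk_size=1 << 20):
    """Read `path` from start to end in chunks.

    Returns (bytes_read, errors), where errors is a list of
    (offset, errno) pairs for chunks that raised an I/O error.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of a regular file and,
        # on Linux, of a block device as well.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        bytes_read = 0
        errors = []
        while offset < size:
            want = min(chunk_size, size - offset)
            try:
                data = os.pread(fd, want, offset)
            except OSError as exc:
                # Note the bad region and skip past it.
                errors.append((offset, exc.errno))
                offset += want
                continue
            if not data:
                break  # unexpected early end of file or device
            bytes_read += len(data)
            offset += len(data)
        return bytes_read, errors
    finally:
        os.close(fd)
```

Run from a daily cron job against a device node, this would need read permission on the device. Reads may also be served from the page cache rather than the platters, so a real scan would probably want to bypass the cache; that part is left out here.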
regards, mark hahn.