[Beowulf] Surviving a double disk failure
landman at scalableinformatics.com
Fri Apr 10 10:42:27 PDT 2009
Michael Will wrote:
> raid6 is also new code with new bugs that can lead to dataloss as well,
> regardless of its nice 'can survive
> two drive failures' feature. I have seen it happen.
All code (anywhere) can have bugs. Arguing that the raid6 module has bugs
is a non-sequitur. It is well tested, and in use at a large and growing
number of sites.
Raid6 is indeed younger than the raid5 code in the kernel. As the Raid6
kernel code was derived from the Raid5 code ...
I do agree that bugs can take down your storage, and that bad adapters,
bad code, or bad drivers can (and do) result in damaged data. That is
why frequent backups are so important. Raid is not a backup (a favorite
expression of mine).
This said, raid6 buys you a bit more time to solve your problem than
raid5 does. The Google paper from 2 years ago notes that a second drive
failure was well correlated with the first drive failure within 1000
seconds, i.e. during the rebuild. If that second failure occurs in a
raid5 system, you are (largely) toast, unless you go to the
Herculean levels that Stuart went through. Even then, you aren't
guaranteed to get anything back.
The point being, raid6 may not be perfect, but it can likely stop a bad
day from going really pear-shaped. For RAID5 with large disk drives, the
statistics are rather strongly against you surviving a failure.
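
One common way to put a rough number on the "large disk drives" point is to
estimate the chance of hitting an unrecoverable read error (URE) while a
RAID5 rebuild reads every surviving disk in full. This is only a sketch and
not necessarily the reasoning behind the statement above: the function name,
the 8-disk array, the disk sizes, and the 1e-14 errors-per-bit rate are all
assumptions for illustration, and errors are treated as independent, which
real drives need not obey.

```python
import math


def p_rebuild_hits_ure(n_disks, disk_bytes, ure_per_bit=1e-14):
    """Probability of at least one URE while reading all surviving disks.

    Assumptions (illustrative, not measured): one disk of an n_disks RAID5
    array has failed, the rebuild reads the other n_disks - 1 disks in full,
    and each bit read fails independently with probability ure_per_bit.
    """
    bits_read = (n_disks - 1) * disk_bytes * 8
    # 1 - (1 - p)**bits_read, computed stably for very small p
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))


if __name__ == "__main__":
    # Hypothetical 8-disk array at a few assumed disk sizes
    for tb in (0.5, 1.0, 2.0):
        p = p_rebuild_hits_ure(8, tb * 1e12)
        print(f"{tb} TB disks: P(URE during RAID5 rebuild) ~ {p:.2f}")
```

Under these assumptions the probability grows quickly with disk size. With
raid6, a single read error during a one-disk rebuild can still be corrected
from the second parity, which is one reason it buys the extra margin.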
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615