[Beowulf] Surviving a double disk failure
Joe Landman landman at scalableinformatics.com
Fri Apr 10 05:18:03 PDT 2009
Stuart Midgley wrote:

Good work Stuart!

> What are the lessons learnt? Well with software raid Linux is both your

1) Use RAID6. It is your friend. RAID5 is unashamedly your enemy.

2) Scrub early, scrub often. We cron this ~1/week on Delta-V's (sounds
similar to your box).

3) pay attention to any/every error. Disk keeps giving you errors, toss it.

> friend and enemy. The behaviour of md got us in this mess. When md gets
> an error on read it recovers the data from the other disks and re-writes
> the blocks to the failed disk hoping the disk will reallocate. You do
> get a warning saying that md encountered a recoverable error. So you
> think it is ok. BUT the disk still failed on read and you haven't
> swapped it out. Some time later when another disk fails hard and you get
> a failed read on your other dodgy disk md sees 2 failed disks. And it's
> all over.

This is why RAID6 is your friend. Aside from this, the scrubbing mode
of MD (would require a later kernel, bug me offline if you want to try
one), is a lifesaver. This and the later versions of the md tools. The
kernel, drivers, and tools with your distro are *ancient* by most standards.

> My advice: don't let Linux collude with the disk vendors and reduce

heh ...

> your reliability. Swap any disk that gets a correctable error on read.
> Reallocation on write is fine not on read. The disk has failed.

add to this:

4) scheduled scrubbing to specifically detect these errors. Turn on
error correction bits for scrub to force it to try to correct errors.

Glad you were able to get your data back.

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
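
For anyone wanting to try the weekly scrub from points 2 and 4, here is a rough sketch in Python, not the script actually run on the Delta-V boxes. It assumes a kernel new enough to expose the md sysfs files `sync_action` and `mismatch_cnt`, and it assumes the "error correction" mode mentioned above corresponds to writing `repair` rather than `check`. The function names, the `sysfs_root` parameter, and the array name `md0` are made up for illustration.

```python
from pathlib import Path


def start_scrub(array: str, action: str = "check", sysfs_root: str = "/sys/block") -> None:
    """Ask md to scrub one array by writing to its sync_action file.

    "check" reads every stripe and counts inconsistencies;
    "repair" additionally rewrites what it finds to be inconsistent.
    Needs root on a real system. sysfs_root is a hypothetical knob
    so the function can be pointed at a scratch directory.
    """
    if action not in ("check", "repair"):
        raise ValueError(f"unsupported scrub action: {action!r}")
    Path(sysfs_root, array, "md", "sync_action").write_text(action + "\n")


def mismatch_count(array: str, sysfs_root: str = "/sys/block") -> int:
    """Return the mismatch counter md reports for the array."""
    text = Path(sysfs_root, array, "md", "mismatch_cnt").read_text()
    return int(text.strip())


if __name__ == "__main__":
    # Assumed array name; adjust to match /proc/mdstat on your box.
    start_scrub("md0", "check")
```

Run weekly from cron as root, then look at `mismatch_count("md0")` once the scrub has finished. Keep watching the kernel log as well, since the corrected read errors Stuart describes are reported there and are the cue to swap the disk.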
