[Beowulf] Surviving a double disk failure
Chris Samuel csamuel at vpac.org
Sun Apr 19 01:40:52 PDT 2009
----- "Joe Landman" <landman at scalableinformatics.com> wrote:

> 2) Scrub early, scrub often.

As long as you don't have IBM gear where what appears to be a firmware issue somewhere (possibly on the disks themselves) can mean that the LSI RAID controller they rebadge thinks that up to 12 drives have just failed in the space of a few minutes.

Of course none of them really have failed, but your RAID60 is still toast and boy does it take a few years off your life, not to mention days and days to recover from tape..

Sigh..

Happens under Debian (with mainline kernel) and CentOS with its stock kernel (we copied over the scrub script that Debian packages), but of course IBM wouldn't take any notice until we could do it under RHEL - you can trigger a scrub manually through (for example):

  echo check > /sys/block/md0/md/sync_action

We now have another vendor's storage unit and won't think about using the IBM unit in anger until we can confirm that the latest round of firmware updates has solved the problem.

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
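For anyone wanting to apply the manual trigger above to more than one array, here is a minimal sketch, not the Debian scrub script mentioned in the post. It assumes Linux software RAID (md) arrays exposed under /sys/block/md*/md/, and it only starts a check on arrays whose sync_action currently reads "idle". The function name scrub_idle_arrays and its optional root-directory argument are hypothetical, added so the loop can be tried against a scratch directory before touching real arrays.

```shell
#!/bin/sh
# Sketch only: start a "check" scrub on every md array that is idle.
# scrub_idle_arrays and its optional argument are illustrative names,
# not part of any packaged script.

scrub_idle_arrays() {
    # Hypothetical override of the sysfs root; defaults to the real /sys.
    root="${1:-/sys}"

    for action in "$root"/block/md*/md/sync_action; do
        # If no md arrays exist the glob stays unexpanded; skip it.
        [ -e "$action" ] || continue

        state=$(cat "$action")
        if [ "$state" = "idle" ]; then
            # Same write as the one-liner in the post, per array.
            echo check > "$action"
            echo "started check via $action"
        else
            # Leave arrays alone while they resync, recover, or check.
            echo "skipped $action (currently: $state)"
        fi
    done
}
```

Calling scrub_idle_arrays with no argument as root would act on the live arrays. Once a check finishes, the mismatch_cnt file in the same md directory reports how many sectors disagreed.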
