[Beowulf] Surviving a double disk failure
orion at cora.nwra.com
Fri Apr 10 10:16:20 PDT 2009
Bill Broadley wrote:
> Guy Coates wrote:
>> Yikes, epic recovery.
>>> What are the lessons learnt?
>> You forgot the obvious one.
> I suggest ditching silly old centos/redhat kernels and run something new
> enough to allow for scrubbing. So that all your disks don't silently start
> collecting errors waiting to cascade into a lost RAID upon the first
> non-silent error.
As a stop-gap solution here I periodically run "smartctl -t long
/dev/<blah>" on all the disks to check their status. I have a daily
cron job that tests one disk a day on my 26-disk servers, so each disk
gets checked about once a month.
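
For anyone wanting to try the same thing, here is a rough sketch of how such a rotation could be scripted. It is not my actual cron script: the device names in DISKS, the pick_disk and run_todays_test helpers, and the "run" argument are all made up for illustration, and it assumes a date that supports +%s (GNU and BSD both do).

```shell
#!/bin/sh
# Sketch only: rotate a long SMART self-test across a list of disks,
# starting one test per day. The device list is hypothetical; replace
# it with the disks actually present on the server.
DISKS="/dev/sda /dev/sdb /dev/sdc"

# pick_disk DAY DISK...
# Print the disk whose turn it is on day number DAY (round-robin).
pick_disk() {
    day=$1
    shift
    # $# is now the number of disks; skip (day mod count) of them
    n=$(( day % $# ))
    shift "$n"
    printf '%s\n' "$1"
}

# Start a long self-test on today's disk.
run_todays_test() {
    # Days since the epoch, so the rotation carries on across months
    today=$(( $(date +%s) / 86400 ))
    # DISKS is left unquoted on purpose so it splits into one word per disk
    disk=$(pick_disk "$today" $DISKS)
    smartctl -t long "$disk"
}

# Only touch the disks when invoked with "run", e.g. from cron.
if [ "${1:-}" = "run" ]; then
    run_todays_test
fi
```

A daily crontab entry that calls the script with "run" would then start one test per day. The self-test runs inside the drive in the background, so the result has to be read back later, for example with "smartctl -l selftest /dev/<blah>".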
Technical Manager 303-415-9701 x222
NWRA/CoRA Division FAX: 303-415-9702
3380 Mitchell Lane orion at cora.nwra.com
Boulder, CO 80301 http://www.cora.nwra.com