[Beowulf] Big storage

Gerry Creager gerry.creager at tamu.edu
Thu Apr 17 06:01:37 PDT 2008


Bruce Allen wrote:
> Hi Gerry,
> 
>> Areca replacement; RAID rebuild (usually successful); backup; Areca 
>> replacement with 3Ware controller or CoRAID (or JetStor) shelf; create 
>> new RAID instance; restore from backup.
>>
>> Let's just say we lost confidence.
> 
> I understand.  Was this with 'current generation' controllers and 
> firmware or was this two or three years ago?  It's my impression that 
> (when used with compatible drives and drive backplanes) the latest 
> generation of Areca hardware is quite stable.

Within the last year with current drivers.  The possibilities include a 
pair of bad drives that failed at the same time, anomalies caused by a 
near melt-down in our campus data center (ambient temps spiked past 
130F), and almost any other random event that could have inflicted 
itself on a large number of continuously rotating drives.

My sample's too small to indict Areca, but when we had two failures in a 
10-day period, coincidence was enough to overwhelm failure-statistics 
analysis...

gerry
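
For a rough sense of the coincidence math, here is a minimal Python sketch 
under an independent-failure (Poisson) model. The 3% annual failure rate and 
16-drive shelf size are assumptions for illustration only; the 10-day window 
is the one mentioned above:

    import math

    drives = 16            # assumed shelf size (illustrative)
    afr = 0.03             # assumed annual failure rate per drive
    window_days = 10       # the 10-day window mentioned above

    # Expected number of independent failures in the window, then the
    # probability of seeing two or more in that window.
    lam = drives * afr * window_days / 365.0
    p_two_or_more = 1.0 - math.exp(-lam) * (1.0 + lam)

    print("expected failures in window: %.4f" % lam)
    print("P(>= 2 failures in window):  %.6f" % p_two_or_more)

Under those assumed numbers the chance of two unrelated failures landing in 
the same 10-day window is on the order of 1e-4, which is why a shared cause 
(heat event, backplane, or controller) looks more plausible than coincidence.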

>> Bruce Allen wrote:
>>> What was needed to fix the systems?  Reboot?  Hardware replacement?
>>>
>>> On Wed, 16 Apr 2008, Gerry Creager wrote:
>>>
>>>> We've had two fail rather randomly.  The failures did cause disk 
>>>> corruption but it wasn't an undetected/undetectable sort.  They 
>>>> started throwing errors to syslog, then fell over and stopped 
>>>> accessing disks.
>>>>
>>>> gerry
>>>>
>>>> Bruce Allen wrote:
>>>>> Hi Gerry,
>>>>>
>>>>> So far the only problem we have had is with one Areca card that had 
>>>>> a bad 2GB memory module.  This generated lots of (correctable) 
>>>>> single bit errors but eventually caused real problems.  Could you 
>>>>> say something about the reliability issues you have seen?
>>>>>
>>>>> Cheers,
>>>>>     Bruce
>>>>>
>>>>>
>>>>> On Wed, 16 Apr 2008, Gerry Creager wrote:
>>>>>
>>>>>> We've used AoE (CoRAID hardware) with pretty good success (modulo 
>>>>>> one RAID shelf fire that was caused by a manufacturing defect and 
>>>>>> dealt with promptly by CoRAID).  We've had some reliability issues 
>>>>>> with Areca cards but no data corruption on the systems we've built 
>>>>>> that way.
>>>>>>
>>>>>> gerry
>>>>>>
>>>>>> Bruce Allen wrote:
>>>>>>> Hi Xavier,
>>>>>>>
>>>>>>>>>>> PPS: We've also been doing some experiments with putting 
>>>>>>>>>>> OpenSolaris+ZFS on some of our generic (Supermicro + Areca) 
>>>>>>>>>>> 16-disk RAID systems, which were originally intended to run 
>>>>>>>>>>> Linux.
>>>>>>>
>>>>>>>>>>  I think that DESY proved some data corruption with such a 
>>>>>>>>>> configuration, so they switched to OpenSolaris+ZFS.
>>>>>>>
>>>>>>>>> I'm confused.  I am also talking about OpenSolaris+ZFS.  What 
>>>>>>>>> did DESY try, and what did they switch to?
>>>>>>>
>>>>>>>> Sorry, I am indeed not clear. As far as I know, DESY found data 
>>>>>>>> corruption using Linux and Areca cards. They moved from Linux to 
>>>>>>>> OpenSolaris and ZFS, which avoided further corruption. This has 
>>>>>>>> been discussed in the HEPiX storage workgroup. However, I cannot 
>>>>>>>> speak on their behalf at all. I'll try to get you in touch with 
>>>>>>>> someone more aware of this issue, as my statements lack figures.
>>>>>>>
>>>>>>> I think that would be very interesting to the entire Beowulf 
>>>>>>> mailing list, so please suggest that they respond to the entire 
>>>>>>> group, not just to me personally.  Here is an LKML thread about 
>>>>>>> silent data corruption:
>>>>>>> http://kerneltrap.org/mailarchive/linux-kernel/2007/9/10/191697
>>>>>>>
>>>>>>> So far we have not seen any signs of data corruption on 
>>>>>>> Linux+Areca systems (and our data files carry both internal and 
>>>>>>> external checksums, so we would be sensitive to this).
>>>>>>>
>>>>>>> Cheers,
>>>>>>>     Bruce
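
As an illustration of the checksum technique described above, here is a 
minimal Python sketch that checks a file's payload against both an embedded 
(internal) checksum and a sidecar (external) copy. The trailing-32-byte MD5 
layout and the ".md5" sidecar file are assumptions for the example, not the 
actual file format in use:

    import hashlib
    from pathlib import Path

    def verify(path):
        """Check a file against its embedded and sidecar checksums."""
        data = Path(path).read_bytes()
        # Assumed layout: last 32 bytes hold the hex MD5 of the payload.
        payload, internal = data[:-32], data[-32:].decode("ascii")
        actual = hashlib.md5(payload).hexdigest()

        # Assumed sidecar: "<path>.md5" holding an external copy of the digest.
        external = Path(path + ".md5").read_text().split()[0]

        ok = (actual == internal) and (actual == external)
        if not ok:
            print("%s: checksum mismatch (silent corruption?)" % path)
        return ok

If the payload digest disagrees with both stored copies, the payload itself 
is suspect; if it disagrees with only one, the stored checksum was likely 
the part that got damaged.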
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>

-- 
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University	
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843


