[Beowulf] Big storage

Gerry Creager gerry.creager at tamu.edu
Wed Apr 16 10:56:46 PDT 2008

Replace the Areca card; rebuild the RAID (usually successful); take a 
backup; replace the Areca with a 3Ware controller or a CoRAID (or 
JetStor) shelf; create a new RAID instance; restore from the backup.

Let's just say we lost confidence.
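
If anyone wants to script a sanity check on that last restore step, here 
is a rough sketch of the kind of post-restore verification one could run. 
Purely illustrative: the manifest format (one "sha256<TAB>relative/path" 
line per file, written at backup time) and the paths are made up, not 
anything we actually run.

#!/usr/bin/env python3
# Rough sketch only: verify a restored tree against a checksum
# manifest written at backup time.  The manifest format assumed here
# (one "sha256<TAB>relative/path" line per file) is an illustration,
# not a real tool.
import hashlib
import sys
from pathlib import Path

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(manifest, restored_root):
    bad = 0
    for line in Path(manifest).read_text().splitlines():
        want, rel = line.split("\t", 1)
        target = Path(restored_root) / rel
        if not target.is_file():
            print("MISSING ", rel)
            bad += 1
        elif sha256_of(target) != want:
            print("CORRUPT ", rel)
            bad += 1
    return bad

if __name__ == "__main__":
    # e.g.  verify-restore.py backup.manifest /export/raid0
    sys.exit(1 if verify(sys.argv[1], sys.argv[2]) else 0)

Run it right after the restore; a clean pass tells you the new array 
actually holds what the backup held.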


Bruce Allen wrote:
> What was needed to fix the systems?  Reboot?  Hardware replacement?
> On Wed, 16 Apr 2008, Gerry Creager wrote:
>> We've had two fail rather randomly.  The failures did cause disk 
>> corruption but it wasn't an undetected/undetectable sort.  They 
>> started throwing errors to syslog, then fell over and stopped 
>> accessing disks.
>> gerry
>> Bruce Allen wrote:
>>> Hi Gerry,
>>> So far the only problem we have had is with one Areca card that had a 
>>> bad 2GB memory module.  This generated lots of (correctable) single 
>>> bit errors but eventually caused real problems.  Could you say 
>>> something about the reliability issues you have seen?
>>> Cheers,
>>>     Bruce
>>> On Wed, 16 Apr 2008, Gerry Creager wrote:
>>>> We've used AoE (CoRAID hardware) with pretty good success (modulo 
>>>> one RAID shelf fire that was caused by a manufacturing defect and 
>>>> dealt with promptly by CoRAID).  We've had some reliability issues 
>>>> with Areca cards but no data corruption on the systems we've built 
>>>> that way.
>>>> gerry
>>>> Bruce Allen wrote:
>>>>> Hi Xavier,
>>>>>>>>> PPS: We've also been doing some experiments with putting 
>>>>>>>>> OpenSolaris+ZFS on some of our generic (Supermicro + Areca) 
>>>>>>>>> 16-disk RAID systems, which were originally intended to run Linux.
>>>>>>>>  I think that DESY proved some data corruption with such a 
>>>>>>>> configuration, so they switched to OpenSolaris+ZFS.
>>>>>>> I'm confused.  I am also talking about OpenSolaris+ZFS.  What did 
>>>>>>> DESY try, and what did they switch to?
>>>>>> Sorry, I was indeed not clear. As far as I know, DESY found data 
>>>>>> corruption using Linux and Areca cards. They moved from Linux to 
>>>>>> OpenSolaris and ZFS, avoiding further corruption. This has been 
>>>>>> discussed in the HEPiX storage workgroup. However, I cannot speak 
>>>>>> on their behalf at all. I'll try to get you in touch with someone 
>>>>>> more aware of this issue, as my statements lack figures.
>>>>> I think that would be very interesting to the entire Beowulf 
>>>>> mailing list, so please suggest that they respond to the entire 
>>>>> group, not just to me personally.  Here is an LKML thread about 
>>>>> silent data corruption:
>>>>> http://kerneltrap.org/mailarchive/linux-kernel/2007/9/10/191697
>>>>> So far we have not seen any signs of data corruption on Linux+Areca 
>>>>> systems (and our data files carry both internal and external 
>>>>> checksums, so we would be sensitive to this).
>>>>> Cheers,
>>>>>     Bruce
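
Re Bruce's point above about data files carrying both internal and 
external checksums: here is a toy sketch of that idea. The header layout, 
md5, and the JSON catalog are my own assumptions for the example, not 
what his group actually uses, but it shows why either check would flag 
the kind of silent corruption the LKML thread describes.

#!/usr/bin/env python3
# Toy illustration of internal + external checksums.  The header
# layout, md5, and the catalog file are assumptions for this example.
import hashlib
import json
from pathlib import Path

def write_with_checksums(path, payload, catalog):
    digest = hashlib.md5(payload).hexdigest()
    # "Internal" checksum: stored inside the file, ahead of the payload.
    Path(path).write_bytes(digest.encode() + b"\n" + payload)
    # "External" checksum: kept in a separate catalog, off the data disks.
    catalog[str(path)] = digest

def verify(path, catalog):
    header, payload = Path(path).read_bytes().split(b"\n", 1)
    digest = hashlib.md5(payload).hexdigest()
    return header.decode() == digest, catalog.get(str(path)) == digest

if __name__ == "__main__":
    catalog = {}
    write_with_checksums("frame.dat", b"detector data ...", catalog)
    Path("catalog.json").write_text(json.dumps(catalog))
    print(verify("frame.dat", catalog))   # (True, True) unless something rotted

A controller quietly flipping bits fails the internal check; a filesystem 
handing back the wrong or stale file fails the external one as well.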

Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843
