[Beowulf] Big storage

Gerry Creager gerry.creager at tamu.edu
Wed Apr 16 06:24:43 PDT 2008


We've had two fail rather randomly.  The failures did cause disk 
corruption, but not of the undetected/undetectable sort.  They started 
throwing errors to syslog, then fell over and stopped accessing disks.

gerry

Bruce Allen wrote:
> Hi Gerry,
> 
> So far the only problem we have had is with one Areca card that had a 
> bad 2GB memory module.  This generated lots of (correctable) single bit 
> errors but eventually caused real problems.  Could you say something 
> about the reliability issues you have seen?
> 
> Cheers,
>     Bruce
> 
> 
> On Wed, 16 Apr 2008, Gerry Creager wrote:
> 
>> We've used AoE (CoRAID hardware) with pretty good success (modulo one 
>> RAID shelf fire that was caused by a manufacturing defect and dealt 
>> with promptly by CoRAID).  We've had some reliability issues with 
>> Areca cards but no data corruption on the systems we've built that way.
>>
>> gerry
>>
>> Bruce Allen wrote:
>>> Hi Xavier,
>>>
>>>>>>> PPS: We've also been doing some experiments with putting 
>>>>>>> OpenSolaris+ZFS on some of our generic (Supermicro + Areca) 
>>>>>>> 16-disk RAID systems, which were originally intended to run Linux.
>>>
>>>>>> I think that DESY demonstrated some data corruption with such a 
>>>>>> configuration, so they switched to OpenSolaris+ZFS.
>>>
>>>>> I'm confused.  I am also talking about OpenSolaris+ZFS.  What did 
>>>>> DESY try, and what did they switch to?
>>>
>>>> Sorry, I was indeed not clear. As far as I know, DESY found data 
>>>> corruption using Linux and Areca cards. They moved from Linux to 
>>>> OpenSolaris and ZFS, avoiding further corruption. This has been 
>>>> discussed in the HEPiX storage workgroup. However, I cannot speak on 
>>>> their behalf at all. I'll try to put you in touch with someone more 
>>>> familiar with this issue, as my statements lack figures.
>>>
>>> I think that would be very interesting to the entire Beowulf mailing 
>>> list, so please suggest that they respond to the entire group, not 
>>> just to me personally.  Here is an LKML thread about silent data 
>>> corruption:
>>> http://kerneltrap.org/mailarchive/linux-kernel/2007/9/10/191697
>>>
>>> So far we have not seen any signs of data corruption on Linux+Areca 
>>> systems (and our data files carry both internal and external 
>>> checksums, so we would be sensitive to this).
>>>
>>> Cheers,
>>>     Bruce
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org
>>> To change your subscription (digest mode or unsubscribe) visit 
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>>

-- 
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University	
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843


