[Beowulf] Big storage

Fri Aug 24 14:17:11 PDT 2007

Bruce Allen wrote:
> Hi Jeff,
>
> OK, I see the point.  You are not worried about multiple unreadable 
> sectors making it impossible to reconstruct lost data.  You are 
> worried about 'whole disk' failure.

Well, no actually. I'm worried about unrecoverable reads on the
remaining disks during reconstruction. :) Is that what you are referring
to?

> I definitely agree that this is a possible problem.  In fact we 
> operate all of our UWM data archives (about 300 TB) as RAID-6 to 
> reduce the probability of this.  The idea of a second disk failing in 
> a RAID-5 array during rebuild does not make for a good night's sleep!

Did you see Garth's comments? Even using a number of 500GB drives
greatly increases the probability of a URE during reconstruction. RAID-6
helps you sleep, but not as much as you think :) Scares the cr** out of me.
I'm looking to build a home server and I think I'm going to do RAID-61
to give myself some extra protection. I just have to figure out how to power
all of the drives, and find a case they fit in and a motherboard with enough
SATA connectors :)

Enjoy!

Jeff

>
> Cheers,
>     Bruce
>
> On Fri, 24 Aug 2007, Jeffrey B. Layton wrote:
>
>> Bruce,
>>
>> I urge you to read Garth's comments. Your description of what
>> RAID controllers do is very good when there are no failed drives.
>> If a drive fails though, you can't scan the disks looking for bad
>> sectors.
>>
>> During a reconstruction, the RAID controller is reconstructing
>> the data based on the remaining drives and the parity.
>> Unfortunately, the controller is likely to be block based so it has
>> to rebuild every block of the failed disk. But if the controller is
>> doing a reconstruction and hits a URE, then the reconstruction
>> process just stops and the controller cries uncle. This means you
>> have to restore the failed array, meaning the entire volume, from a
>> backup.
>>
>> With drives getting larger and larger all the time, the window of
>> vulnerability during reconstruction (where a second drive failure
>> will fail the entire volume) has grown because it takes longer and
>> longer to reconstruct so much data. This is why people are moving
>> to RAID-6. But RAID-6 is expensive in terms of capacity and performance
>> (Note: it has worse write performance than RAID-5). It gives the
>> ability to tolerate a second drive failure, but it may not reduce the
>> window of vulnerability during reconstruction because it takes longer
>> to reconstruct.
>>
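As a rough illustration of the capacity and write cost mentioned above, here is a minimal Python sketch. It assumes an 11-disk group (the size used in the article quoted further down), one disk's worth of parity for RAID-5 and two for RAID-6, and the usual textbook counts of 4 and 6 disk I/Os per small random write. The helper and constant names are made up for the example, and real controllers with write caches will behave differently.

```python
# Hypothetical names; the numbers are textbook values, not measurements.
PARITY_DISKS = {"RAID-5": 1, "RAID-6": 2}
# Small random write: read and write the data block plus each parity block.
SMALL_WRITE_IOS = {"RAID-5": 4, "RAID-6": 6}

def usable_fraction(num_disks, level):
    """Fraction of raw capacity left for data after parity overhead."""
    return (num_disks - PARITY_DISKS[level]) / num_disks

for level in ("RAID-5", "RAID-6"):
    frac = usable_fraction(11, level)
    print(f"{level}: {frac:.1%} usable, "
          f"{SMALL_WRITE_IOS[level]} I/Os per small write")
```

For 11 disks this gives about 91% usable space for RAID-5 against about 82% for RAID-6.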
>> Here's an article where Garth talks about this (it's at the end):
>>
>> http://www.eweek.com/article2/0,1895,2168821,00.asp
>>
>> I wanted to note one quick thing from the article:
>>
>> "The probability of the disk failing to read back data is the same as
>> it was long ago, so today you can expect at least one failed read every
>> 10TB to 100TB. But the reconstruction of a failed 500GB disk in an
>> 11-disk array has to read 5TB, so there can be an unacceptably large
>> chance of failure to rebuild every one of the 1 billion sectors on the
>> failed disk."
>>
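The arithmetic behind that quote can be sketched in a few lines of Python. This is only a back-of-the-envelope model: it assumes read errors are independent and evenly spread (a Poisson process), and it takes the "one failed read every 10TB to 100TB" figures from the quote at face value. The function name is made up for the example.

```python
import math

def rebuild_failure_probability(bytes_to_read, bytes_per_ure):
    """Chance of hitting at least one URE while reading bytes_to_read,
    assuming errors arrive independently at one per bytes_per_ure."""
    expected_errors = bytes_to_read / bytes_per_ure
    return 1.0 - math.exp(-expected_errors)

# 11-disk array of 500GB disks: the 10 survivors must be read in full.
rebuild_read = 10 * 500e9  # 5TB

for tb_per_ure in (10, 100):
    p = rebuild_failure_probability(rebuild_read, tb_per_ure * 1e12)
    print(f"one URE per {tb_per_ure}TB read: P(rebuild fails) = {p:.1%}")
```

Under these assumptions the rebuild fails roughly 39% of the time at one error per 10TB, and roughly 5% of the time at one error per 100TB.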
>> So if a reconstruction fails, you have to copy 5TB of data from the
>> backup to the volume. If you do this from tape - you're going to wait
>> a long time. You can do it from a disk backup but it still may take
>> some time to move 5TB across the wire depending upon how you
>> have everything connected.
>>
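To put a number on "some time", here is a small Python sketch of the best-case transfer time for a 5TB restore. The link speeds and the 70% efficiency figure are assumptions chosen for illustration, not measurements, and the function name is hypothetical. It ignores the speed of the backup disks themselves, which may well be the real bottleneck.

```python
def transfer_hours(num_bytes, link_bits_per_second, efficiency=1.0):
    """Hours to move num_bytes over a link that sustains the given
    fraction of its nominal line rate."""
    seconds = num_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600

restore_bytes = 5e12  # the 5TB volume from the example above

for name, rate in (("1 Gbit/s", 1e9), ("10 Gbit/s", 1e10)):
    best = transfer_hours(restore_bytes, rate)
    likely = transfer_hours(restore_bytes, rate, efficiency=0.7)
    print(f"{name}: {best:.1f} h at line rate, {likely:.1f} h at 70%")
```

Over a single gigabit link that is about 11 hours even at full line rate.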
>> Jeff
>>
>>
>>> Hi Jeff,
>>>
>>> For this reason, in a RAID system with a lot of disks it is 
>>> important to scan the disks looking for unreadable (UNC = 
>>> uncorrectable) data blocks on a regular basis.  If these are found, 
>>> then the missing data at that Logical Block Address (LBA) has to be 
>>> reconstructed from the *other* disks and re-written onto the failed 
>>> disk.
>>>
>>> In a well-designed (hardware or software) RAID implementation, you 
>>> can reconstruct the missing data by only reading a handful of 
>>> logical blocks from the redundant disks.  It is not necessary to 
>>> read the entire disk surface just to get a few 512 byte sectors of 
>>> data.  So a failure for different data somewhere else on a disk 
>>> should not (in principle) prevent reconstruction of the lost/missing 
>>> data.  In a poorly-designed RAID implementation, you have to read 
>>> the ENTIRE disk surface to get data from a few sectors.  In this 
>>> case, another uncorrectable disk sector can be crippling.
>>>
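A minimal Python sketch of the per-block repair described above, assuming simple single-parity XOR (RAID-5 style) with the parity kept on one dedicated "disk" to keep the example short. Real arrays rotate parity across the disks, and the names here are made up for illustration. The point is that repairing one block only needs the blocks at the same address on the other disks, so a bad sector in a different stripe does not get in the way.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct(disks, bad_disk, lba):
    """Rebuild one block from the same LBA on every other disk."""
    peers = [disk[lba] for i, disk in enumerate(disks) if i != bad_disk]
    if any(block is None for block in peers):
        raise OSError(f"second unreadable block in stripe {lba}")
    return xor_blocks(peers)

# Three data disks with two stripes each; the fourth disk holds parity.
data_disks = [[b"a0a0", b"a1a1"], [b"b0b0", b"b1b1"], [b"c0c0", b"c1c1"]]
parity_disk = [xor_blocks([d[lba] for d in data_disks]) for lba in range(2)]
disks = data_disks + [parity_disk]

original = disks[1][0]
disks[1][0] = None  # unreadable sector on disk 1, stripe 0
disks[2][1] = None  # unrelated bad sector on disk 2, stripe 1

repaired = reconstruct(disks, bad_disk=1, lba=0)
print("stripe 0 repaired:", repaired == original)
```

A whole-disk rebuild is the case where this breaks down: every stripe has to be read, so the bad sector in stripe 1 would be hit sooner or later.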
>>> Most good hardware RAID cards have an option for continuous disk
>>> scanning. For example ARECA calls this 'consistency checking'.  It
>>> should be done on a regular basis.
>>>
>>> You can use smartmontools to do this also, by carrying out regular
>>> read scans of the disk surface and then forcing a RAID consistency 
>>> check/rebuild if there is a read failure at some disk block.
>>>
>>> Note that continuous scanning is also needed for ECC memory to
>>> prevent correctable single-bit errors from becoming uncorrectable
>>> double-bit errors.  In this RAM/memory context it is called 'memory
>>> scrubbing'.
>>>
>>> Cheers,
>>>     Bruce
>>>
>>> On Thu, 23 Aug 2007, Jeffrey B. Layton wrote:
>>>
>>>> This isn't really directed at Jeff, but it seemed like a good segue
>>>> for a comment. Everyone - please read some recent articles by
>>>> Garth Gibson about large capacity disks and large numbers of
>>>> disks in a RAID group. Just to cut to the chase, given the
>>>> Unrecoverable Read Error (URE) rate and large disks, during
>>>> a rebuild you are almost guaranteed to hit a URE. When that
>>>> happens, the rebuild stops and you have to restore everything
>>>> from a backup. RAID-6 can help, but given enough disks and
>>>> large enough disks, the same thing can happen (plus RAID-6
>>>> rebuilds take longer since there are more computations involved).
>>>>
>>>> Jeff
>>>>
>>>> P.S. I guess I should disclose that my day job is at Panasas. But
>>>> regardless, I would recommend reading some of Garth's comments.
>>>> Maybe I can also get one of his presentations to pass around.
>>>>
>>>> P.P.S. If you don't know Garth, he's one of the fathers of RAID.
>>>>
>>>>> Hello Jakob,
>>>>> A couple of things...
>>>>> 1. ClusterFS has an easy to understand calculation on why raid 6 is
>>>>> necessary for the number of disks you're considering. You do need to
>>>>> plan for multi-disk failure, especially with the rebuild time of 1TB
>>>>> disks.
>>>>> http://manual.lustre.org/manual/LustreManual16_HTML/DynamicHTML-10-1.html#wp1037512 
>>>>>
>>>>> 2. Avoid tape if you can. At this scale, the administrative time and
>>>>> costs far outweigh the benefits. Of course if you need to move your
>>>>> data to a secure vault that's another thing. If you really want to do
>>>>> tape, some people choose to do disk > disk > tape. This eliminates
>>>>> the read interrupts on the primary storage and provides some added
>>>>> redundancy.
>>>>>
>>>>> 3. We do use Nexsan's satabeasts for storage similar to this. Without
>>>>> commenting on costs, the jackrabbit is technologically superior.
>>>>>
>>>>> Thanks,
>>>>>                 jeff
>>>>>
>>>>> On 8/23/07, Jakob Oestergaard <jakob at unthought.net> wrote:
>>>>>
>>>>>> On Thu, Aug 23, 2007 at 07:56:15AM -0400, Joe Landman wrote:
>>>>>>
>>>>>>> Greetings Jakob:
>>>>>>>
>>>>>>>
>>>>>> Hi Joe,
>>>>>>
>>>>>> Thanks for answering!
>>>>>>
>>>>>> ...
>>>>>>
>>>>>>> up front disclaimer: we design/build/market/support such things.
>>>>>>>
>>>>>> That does not disqualify you  :)
>>>>>>
>>>>>>
>>>>>>>> I'm looking at getting some big storage. Of all the parameters,
>>>>>>>> getting dollars/(month*GB) as low as possible is by far the most
>>>>>>>> important. The price of acquiring and maintaining the storage
>>>>>>>> solution is the number one concern.
>>>>>>>>
>>>>>>> Should I presume density, reliability, and performance also factor
>>>>>>> in somewhere as 2,3,4 (somehow) on the concern list?
>>>>>>>
>>>>>> I expect that the major components of the total cost of running
>>>>>> this beast will be something like
>>>>>>
>>>>>>    acquisition
>>>>>>  + power
>>>>>>  + cooling
>>>>>>  + payroll (disk-replacing admins :)
>>>>>>
>>>>>> Real-estate is a concern as well, of course. The rent isn't free.
>>>>>> It would be nice to pack this in as few racks as possible.
>>>>>> Reliability, well... I expect frequent drive failures, and I would
>>>>>> expect that we'd run some form of RAID to mitigate this. If the rest
>>>>>> of the hardware is just reasonably well designed, the most
>>>>>> frequently failing components should be redundant and hot-swap
>>>>>> replaceable (fans and PSUs).
>>>>>>
>>>>>> It's acceptable that a head-node fails for a short period of time.
>>>>>> The entire system will not depend on all head nodes functioning
>>>>>> simultaneously.
>>>>>>
>>>>>>
>>>>>>>> The setup will probably have a number of "head nodes" which
>>>>>>>> receive a large amount of data over standard gigabit from a large
>>>>>>>> number of remote sources. Data is read infrequently from the head
>>>>>>>> nodes by remote systems. The primary load on the system will be
>>>>>>>> data writes.
>>>>>>>>
>>>>>>> Ok, so you are write dominated.  Could you describe (guesses are
>>>>>>> fine) what the writes will look like?  Large sequential data, small
>>>>>>> random data (seek, write, close)?
>>>>>>>
>>>>>> I would expect something like 100-1000 simultaneous streaming
>>>>>> writes to just as many files (one file per writer). The files will
>>>>>> be everything from a few hundred MiB to many GiB.
>>>>>>
>>>>>> I guess that on most filesystems these streaming sequential writes
>>>>>> will result in something close to "random writes" to the block
>>>>>> layer. However, we can be very generous with write buffering.
>>>>>>
>>>>>>
>>>>>>>> The head nodes need not see the same unified storage; so I am not
>>>>>>>> required to have one big shared filesystem. If beneficial, each of
>>>>>>>> the head nodes could have their own local storage.
>>>>>>>>
>>>>>>> There are some interesting designs with a variety of systems,
>>>>>>> including GFS/Lustre/... on those head nodes, and a big pool of
>>>>>>> drives behind them.  These designs will add to the overall cost,
>>>>>>> and increase complexity.
>>>>>>>
>>>>>> Simple is nice :)
>>>>>>
>>>>>>
>>>>>>>> The storage pool will start out at around 100TiB and will grow to
>>>>>>>> ~1PiB within a year or two (too early to tell). It would be nice
>>>>>>>> to use as few racks as possible, and as little power as
>>>>>>>> possible  :)
>>>>>>>>
>>>>>>> Ok, so density and power are important.  This is good.  Coupled
>>>>>>> with the low management cost and low acquisition cost, we have
>>>>>>> about 3/4 of what we need.  Just need a little more description of
>>>>>>> the writes.
>>>>>>>
>>>>>> I hope the above helped.
>>>>>>
>>>>>>
>>>>>>> Also, do you intend to back this up?
>>>>>>>
>>>>>> That is a *very* good question.
>>>>>>
>>>>>>
>>>>>>> How important is resiliency of the
>>>>>>> system?  Can you tolerate a failed unit (assume the units have hot
>>>>>>> spares, RAID6, etc).
>>>>>>>
>>>>>> Yes. Single head nodes may fail. They must be fairly quick to get
>>>>>> back on line (having a replacement box I would expect no more than
>>>>>> an hour of downtime).
>>>>>>
>>>>>>
>>>>>>> When you look at storage of this size, you have to start planning
>>>>>>> for the eventual (and likely) failure of a chassis (or some number
>>>>>>> of them), and think about a RAIN configuration.
>>>>>>>
>>>>>> Yep. I don't know how likely a "many-disk" failure would be... If I
>>>>>> have a full replacement chassis, I would guess that I could simply
>>>>>> pull out all the disks from a failed system, move them to the
>>>>>> replacement chassis and be up and running again in "short" time.
>>>>>>
>>>>>> If a PSU decides to fry everything connected to it including the
>>>>>> disks, then yes, I can see the point in RAIN or a full backup.
>>>>>>
>>>>>> It's a business decision if a full node loss would be acceptable.
>>>>>> I honestly don't know that, but it is definitely interesting to
>>>>>> consider both "yes" and "no".
>>>>>>
>>>>>>
>>>>>>> Either that, or invest in massive low level redundancy (which
>>>>>>> should be scope limited to the box it is on anyway).
>>>>>>>
>>>>>> Yes; I had something like RAID-5 or so in mind on the nodes.
>>>>>>
>>>>>>
>>>>>>>> It *might* be possible to offload older files to tape; does anyone
>>>>>>>> have experience with HSM on Linux?  Does it work?  Could it be
>>>>>>>> worthwhile to investigate?
>>>>>>>>
>>>>>>> Hmmm...  First I would suggest avoiding tape, you should likely be
>>>>>>> looking at disk to disk for backup, and use slower nearline
>>>>>>> mechanisms.
>>>>>>>
>>>>>> Why would you avoid tape?
>>>>>>
>>>>>> Let's say there was software which allowed me to offload data to
>>>>>> tape in a reasonable manner. Considering the running costs of disk
>>>>>> versus tape, tape would win hands down on power, cooling and
>>>>>> replacements.
>>>>>>
>>>>>> Sure, the random seek time of a tape library sucks golf balls
>>>>>> through a garden hose, but assuming that one could live with that,
>>>>>> are there more important reasons to avoid tape?
>>>>>>
>>>>>>
>>>>>>>> One setup I was looking at, is simply using SunFire X4500 systems
>>>>>>>> (you can put 48 standard 3.5" SATA drives in each 4U system).
>>>>>>>> Assuming I can buy them with 1TB SATA drives shortly, I could
>>>>>>>> start out with 3 systems (12U) and grow the entire setup to 1PB
>>>>>>>> with 22 systems in little over two full racks.
>>>>>>>>
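The sizing above can be checked with a short Python sketch. The 48 drives and 4U per system come from the post; the 42U rack height and decimal terabytes (10**12 bytes) are assumptions, and the names are made up for the example. It counts raw capacity only, so RAID parity and hot spares would reduce the usable figure.

```python
DRIVES_PER_SYSTEM = 48      # from the post
RACK_UNITS_PER_SYSTEM = 4   # from the post
DRIVE_TB = 1                # decimal TB per drive
RACK_HEIGHT_U = 42          # assumption: a common full-height rack

def raw_capacity_tb(systems):
    """Raw capacity in decimal TB, before any RAID overhead."""
    return systems * DRIVES_PER_SYSTEM * DRIVE_TB

def rack_units(systems):
    """Rack units occupied by the storage systems alone."""
    return systems * RACK_UNITS_PER_SYSTEM

def tb_to_pib(tb):
    """Convert decimal terabytes to binary pebibytes."""
    return tb * 10**12 / 2**50

for systems in (3, 22):
    tb = raw_capacity_tb(systems)
    units = rack_units(systems)
    print(f"{systems} systems: {tb} TB raw ({tb_to_pib(tb):.2f} PiB), "
          f"{units}U = {units / RACK_HEIGHT_U:.2f} racks")
```

With these assumptions 22 systems come to 1056 TB raw in 88U, which is a little over two 42U racks but only about 0.94 PiB, so a true 1 PiB target would need a few more systems even before RAID overhead.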
>>>>>>>> Any better ideas?  Is there a way to get this more dense without
>>>>>>>> paying an arm and a leg?  Has anyone tried something like this
>>>>>>>> with HSM?
>>>>>>>>
>>>>>>> Yes, but I don't want to turn this into a commercial, so I will be
>>>>>>> succinct.  Scalable Informatics (my company) has a similar product,
>>>>>>> which does have a good price and price per gigabyte, while
>>>>>>> providing excellent performance.  Details (white paper, benchmarks,
>>>>>>> presentations) at the http://jackrabbit.scalableinformatics.com web
>>>>>>> site.
>>>>>>>
>>>>>> Yep, I was just looking at that actually.
>>>>>>
>>>>>> The hardware looks similar in concept to the SunFire, but as I see
>>>>>> it you guys have thought about a number of services on top of that
>>>>>> (RAIN etc.)
>>>>>>
>>>>>>
>>>>>> Very interesting!
>>>>>>
>>>>>> -- 
>>>>>>
>>>>>>  / jakob
>>>>>>