[Beowulf] Big storage

Thu Aug 23 16:30:21 PDT 2007

This isn't really directed at Jeff, but it seemed like a good segue
for a comment. Everyone - please read a recent article by
Garth Gibson about large-capacity disks and large numbers of
disks in a RAID group. Just to cut to the chase, given the
Unrecoverable Read Error (URE) rate and large disks, during
a rebuild you are almost guaranteed to hit a URE. When that
happens, the rebuild stops and you have to restore everything
from a backup. RAID-6 can help, but given enough disks and
large enough disks, the same thing can happen (plus RAID-6
rebuilds take longer since there are more computations involved).
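
To put a rough number on the rebuild risk, here is a minimal sketch of the
arithmetic. It assumes a URE rate of 1 error per 10^14 bits read (a figure
often quoted on SATA drive data sheets, not one taken from Garth's article),
1 TB disks, independent errors, and a single-parity (RAID-5 style) rebuild
that has to read every surviving disk in full. The function name and the
group sizes are only illustrative.

```python
import math

def p_ure_during_rebuild(surviving_disks, disk_bytes, ure_per_bit):
    """Probability of at least one URE while reading every surviving disk in full.

    Assumes each bit read fails independently with probability ure_per_bit.
    """
    bits_read = surviving_disks * disk_bytes * 8
    # 1 - (1 - p)^n, computed with log1p/expm1 so it stays accurate for tiny p
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))

if __name__ == "__main__":
    ASSUMED_URE_PER_BIT = 1e-14   # assumption: typical data-sheet figure
    DISK_BYTES = 1e12             # 1 TB disks, as discussed in the thread
    for group_size in (4, 8, 16):  # hypothetical RAID group sizes
        p = p_ure_during_rebuild(group_size - 1, DISK_BYTES, ASSUMED_URE_PER_BIT)
        print(f"{group_size}-disk group: P(URE during rebuild) = {p:.2f}")
```

Under these assumed numbers an 8-disk group comes out at roughly 0.43, and
the probability keeps climbing as either the disk size or the number of
disks in the group grows.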

Jeff

P.S. I guess I should disclose that my day job is at Panasas. But
regardless, I would recommend reading some of Garth's comments.
Maybe I can also get one of his presentations to pass around.

P.P.S. If you don't know Garth, he's one of the fathers of RAID.

> Hello Jakob,
> A couple of things...
> 1. ClusterFS has an easy-to-understand calculation on why raid 6 is
> necessary for the number of disks you're considering. You do need to
> plan for multi-disk failure, especially with the rebuild time of 1TB
> disks.
> http://manual.lustre.org/manual/LustreManual16_HTML/DynamicHTML-10-1.html#wp1037512
>
> 2. Avoid tape if you can. At this scale, the administrative time and
> costs far outweigh the benefits. Of course if you need to move your
> data to a secure vault that's another thing. If you really want to do
> tape, some people choose to do disk > disk > tape. This eliminates the
> read interrupts on the primary storage and provides some added
> redundancy.
>
> 3. We do use Nexsan's satabeasts for storage similar to this. Without
> commenting on costs, the jackrabbit is technologically superior.
>
> Thanks,
>                 jeff
>
> On 8/23/07, Jakob Oestergaard <jakob at unthought.net> wrote:
>   
>> On Thu, Aug 23, 2007 at 07:56:15AM -0400, Joe Landman wrote:
>>     
>>> Greetings Jakob:
>>>
>>>       
>> Hi Joe,
>>
>> Thanks for answering!
>>
>> ...
>>     
>>> up front disclaimer: we design/build/market/support such things.
>>>       
>> That does not disqualify you  :)
>>
>>     
>>>> I'm looking at getting some big storage. Of all the parameters, getting as low
>>>> dollars/(month*GB) is by far the most important. The price of acquiring and
>>>> maintaining the storage solution is the number one concern.
>>>>         
>>> Should I presume density, reliability, and performance also factor in
>>> somewhere as 2,3,4 (somehow) on the concern list?
>>>       
>> I expect that the major components of the total cost of running this beast will
>> be something like
>>
>>    acquisition
>>  + power
>>  + cooling
>>  + payroll (disk-replacing admins :)
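
The cost breakdown above can be folded into the dollars/(month*GB) metric
with a small sketch like the following. Spreading the acquisition price
evenly over a service lifetime is an assumption made here, not something
stated in the thread, and the function and parameter names are hypothetical.

```python
def cost_per_gb_month(acquisition, lifetime_months, power, cooling, payroll, usable_gb):
    """Rough dollars per (month * GB).

    acquisition is a one-off price, spread evenly over lifetime_months
    (an assumption); power, cooling and payroll are monthly costs.
    """
    monthly_total = acquisition / lifetime_months + power + cooling + payroll
    return monthly_total / usable_gb

if __name__ == "__main__":
    # All figures below are placeholders, not real quotes.
    print(cost_per_gb_month(acquisition=36000, lifetime_months=36,
                            power=200, cooling=100, payroll=700,
                            usable_gb=100000))
```

Real-estate cost could be added as one more monthly term in the same way.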
>>
>> Real-estate is a concern as well, of course. The rent isn't free. It would be
>> nice to pack this in as few racks as possible.  Reliability, well... I expect
>> frequent drive failures, and I would expect that we'd run some form of RAID to
>> mitigate this. If the rest of the hardware is just reasonably well designed,
>> the most frequently failing components should be redundant and hot-swap
>> replaceable (fans and PSUs).
>>
>> It's acceptable that a head-node fails for a short period of time. The entire
>> system will not depend on all head nodes functioning simultaneously.
>>
>>     
>>>> The setup will probably have a number of "head nodes" which receive a large
>>>> amount of data over standard gigabit from a large number of remote sources.
>>>> Data is read infrequently from the head nodes by remote systems. The primary
>>>> load on the system will be data writes.
>>>>         
>>> Ok, so you are write dominated.  Could you describe (guesses are fine)
>>> what the writes will look like?  Large sequential data, small random
>>> data (seek, write, close)?
>>>       
>> I would expect something like 100-1000 simultaneous streaming writes to just as
>> many files (one file per writer). The files will be everything from a few
>> hundred MiB to many GiB.
>>
>> I guess that on most filesystems these streaming sequential writes will result
>> in something close to "random writes" to the block layer. However, we can be
>> very generous with write buffering.
>>
>>     
>>>> The head nodes need not see the same unified storage; so I am not required to
>>>> have one big shared filesystem. If beneficial, each of the head nodes could
>>>> have their own local storage.
>>>>         
>>> There are some interesting designs with a variety of systems, including
>>> GFS/Lustre/... on those head nodes, and a big pool of drives behind
>>> them.  These designs will add to the overall cost, and increase complexity.
>>>       
>> Simple is nice :)
>>
>>     
>>>> The storage pool will start out at around 100TiB and will grow to ~1PiB within
>>>> a year or two (too early to tell). It would be nice to use as few racks as
>>>> possible, and as little power as possible  :)
>>>>         
>>> Ok, so density and power are important.  This is good.  Coupled with the
>>>  low management cost and low acquisition cost, we have about 3/4 of what
>>> we need.  Just need a little more description of the writes.
>>>       
>> I hope the above helped.
>>
>>     
>>> Also, do you intend to back this up?
>>>       
>> That is a *very* good question.
>>
>>     
>>> How important is resiliency of the
>>> system?  Can you tolerate a failed unit (assume the units have hot
>>> spares, RAID6, etc).
>>>       
>> Yes. Single head nodes may fail. They must be fairly quick to get back on line
>> (having a replacement box I would expect no more than an hour of downtime).
>>
>>     
>>> When you look at storage of this size, you have to
>>> start planning for the eventual (and likely) failure of a chassis (or
>>> some number of them), and think about a RAIN configuration.
>>>       
>> Yep. I don't know how likely a "many-disk" failure would be... If I have a full
>> replacement chassis, I would guess that I could simply pull out all the disks
>> from a failed system, move them to the replacement chassis and be up and
>> running again in "short" time.
>>
>> If a PSU decides to fry everything connected to it including the disks, then
>> yes, I can see the point in RAIN or a full backup.
>>
>> It's a business decision if a full node loss would be acceptable. I honestly
>> don't know that, but it is definitely interesting to consider both "yes" and
>> "no".
>>
>>     
>>> Either
>>> that, or invest in massive low level redundancy (which should be scope
>>> limited to the box it is on anyway).
>>>       
>> Yes; I had something like RAID-5 or so in mind on the nodes.
>>
>>     
>>>> It *might* be possible to offload older files to tape; does anyone have
>>>> experience with HSM on Linux?  Does it work?  Could it be worthwhile to
>>>> investigate?
>>>>         
>>> Hmmm...  First I would suggest avoiding tape, you should likely be
>>> looking at disk to disk for backup, and use slower nearline mechanisms.
>>>       
>> Why would you avoid tape?
>>
>> Let's say there was software which allowed me to offload data to tape in a
>> reasonable manner. Considering the running costs of disk versus tape, tape
>> would win hands down on power, cooling and replacements.
>>
>> Sure, the random seek time of a tape library sucks golf balls through a garden
>> hose, but assuming that one could live with that, are there more important
>> reasons to avoid tape?
>>
>>     
>>>> One setup I was looking at, is simply using SunFire X4500 systems (you can put
>>>> 48 standard 3.5" SATA drives in each 4U system). Assuming I can buy them with
>>>> 1T SATA drives shortly, I could start out with 3 systems (12U) and grow the
>>>> entire setup to 1P with 22 systems in little over two full racks.
>>>>
>>>> Any better ideas?  Is there a way to get this more dense without paying an arm
>>>> and a leg?  Has anyone tried something like this with HSM?
>>>>         
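
The sizing in the quoted paragraph can be checked with a few lines of
arithmetic. The sketch below uses the figures from the post (48 drives per
4U system, 1 TB per drive) and assumes 42U racks, which the post does not
state. It gives raw capacity only, before any RAID parity or hot spares are
subtracted. The function names are illustrative.

```python
DRIVES_PER_SYSTEM = 48     # from the post: 48 SATA drives per system
RACK_UNITS_PER_SYSTEM = 4  # from the post: each system is 4U
DRIVE_TB = 1               # from the post: 1 TB drives
RACK_UNITS_PER_RACK = 42   # assumption: standard full-height rack

def raw_capacity_tb(systems):
    """Raw capacity in TB, before RAID parity and hot spares."""
    return systems * DRIVES_PER_SYSTEM * DRIVE_TB

def rack_units(systems):
    """Total rack units occupied by the given number of systems."""
    return systems * RACK_UNITS_PER_SYSTEM

def racks(systems):
    """Rack count as a fraction, under the assumed rack height."""
    return rack_units(systems) / RACK_UNITS_PER_RACK

if __name__ == "__main__":
    for n in (3, 22):
        print(f"{n} systems: {raw_capacity_tb(n)} TB raw, "
              f"{rack_units(n)}U, {racks(n):.2f} racks")
```

With 42U racks, 22 systems occupy 88U, which matches the "little over two
full racks" estimate, and hold 1056 TB raw.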
>>> Yes, but I don't want to turn this into a commercial, so I will be
>>> succinct.  Scalable Informatics (my company) has a similar product,
>>> which does have a good price and price per gigabyte, while providing
>>> excellent performance.  Details (white paper, benchmarks, presentations)
>>> at the http://jackrabbit.scalableinformatics.com web site.
>>>       
>> Yep, I was just looking at that actually.
>>
>> The hardware looks similar in concept to the SunFire, but as I see it you guys
>> have thought about a number of services on top of that (RAIN etc.)
>>
>>
>> Very interesting!
>>
>> --
>>
>>  / jakob
>>