[Beowulf] Big storage

Bruce Allen ballen at gravity.phys.uwm.edu
Fri Aug 24 01:58:15 PDT 2007

Hi Jeff,

For this reason, in a RAID system with a lot of disks it is important to 
scan the disks looking for unreadable (UNC = uncorrectable) data blocks on 
a regular basis.  If these are found, then the missing data at that 
Logical Block Address (LBA) has to be reconstructed from the *other* disks 
and re-written onto the failed disk.
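
As a rough illustration of that reconstruction step, here is a minimal
sketch assuming simple XOR parity (RAID-5 style) and a four-disk stripe.
The helper names and the disk layout are made up for the example and are
not taken from any particular RAID implementation.

```python
from functools import reduce

SECTOR = 512  # bytes per sector


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(stripe, failed_index):
    """Rebuild the block at failed_index from all the other blocks in the stripe."""
    survivors = [blk for i, blk in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


# Three data sectors plus one parity sector, all at the same LBA on four disks.
data = [bytes([value]) * SECTOR for value in (0x11, 0x22, 0x44)]
parity = xor_blocks(data)
stripe = data + [parity]

# Pretend disk 1 reported UNC at this LBA: rebuild it from the other three.
rebuilt = reconstruct(stripe, failed_index=1)
print(rebuilt == data[1])
```

Only the sectors at the same LBA on the other disks are read here; nothing
else on the disk surface is touched.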

In a well-designed (hardware or software) RAID implementation, you can 
reconstruct the missing data by only reading a handful of logical blocks 
from the redundant disks.  It is not necessary to read the entire disk 
surface just to get a few 512 byte sectors of data.  So a failure for 
different data somewhere else on a disk should not (in principle) prevent 
reconstruction of the lost/missing data.  In a poorly-designed RAID 
implementation, you have to read the ENTIRE disk surface to get data from 
a few sectors.  In this case, another uncorrectable disk sector can be 

Most good hardware RAID cards have an option for continuous disk scanning. 
For example, ARECA calls this 'consistency checking'.  It should be done 
on a regular basis.

You can also do this with smartmontools, by carrying out regular read 
scans of the disk surface and then forcing a RAID consistency 
check/rebuild if there is a read failure at some disk block.
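
A minimal sketch of what such a scan could look like, assuming Linux
software RAID (md) and the smartctl tool from smartmontools. The helper
names are made up for the example. It only builds the command lines so you
can review them before running anything as root; hardware RAID cards need
the vendor's own tool for the consistency check instead of the md sysfs
file.

```python
import shlex


def long_selftest_cmd(device):
    """smartctl command asking the drive to start an extended (full-surface) self-test."""
    return ["smartctl", "-t", "long", device]


def selftest_log_cmd(device):
    """smartctl command that prints the drive's self-test log, including any failing LBA."""
    return ["smartctl", "-l", "selftest", device]


def md_check_path(md_name):
    """sysfs file through which a Linux md array accepts 'check' or 'repair'."""
    return f"/sys/block/{md_name}/md/sync_action"


def plan(devices, md_name):
    """Return the shell commands for one scan pass, in the order they would be run."""
    steps = [shlex.join(long_selftest_cmd(dev)) for dev in devices]
    steps += [shlex.join(selftest_log_cmd(dev)) for dev in devices]
    steps.append(f"echo check > {md_check_path(md_name)}")
    return steps


for step in plan(["/dev/sda", "/dev/sdb"], "md0"):
    print(step)
```

The device names and md0 are placeholders. The self-test runs inside the
drive and can take hours, so the log should be read after it has finished.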

Note that continuous scanning is also needed for ECC memory to prevent 
correctable single-bit errors from becoming uncorrectable double-bit 
errors.  In this RAM/memory context it is called 'memory scrubbing'.
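
To see why the scrub interval matters, here is a toy model. It assumes bit
errors arrive independently at a constant rate per ECC word (a Poisson
process) and that a scrub resets every word to zero errors. The rate below
is an arbitrary illustrative number, not a measured one, and the function
name is made up.

```python
import math


def p_two_or_more_errors(rate_per_word_hour, scrub_interval_hours):
    """Chance that one ECC word collects two or more bit errors between scrubs."""
    mu = rate_per_word_hour * scrub_interval_hours  # expected errors per interval
    # Poisson: P(0 errors) + P(1 error) = exp(-mu) * (1 + mu)
    return 1.0 - math.exp(-mu) * (1.0 + mu)


RATE = 1e-3  # errors per word per hour, illustrative only

daily = p_two_or_more_errors(RATE, 24)
twice_daily = p_two_or_more_errors(RATE, 12)
print(daily / twice_daily)
```

In this model the chance of an uncorrectable word grows roughly with the
square of the interval, so scrubbing twice as often cuts it by about a
factor of four.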


On Thu, 23 Aug 2007, Jeffrey B. Layton wrote:

> This isn't really directed at Jeff, but it seemed like a good segue
> for a comment. Everyone - please read some recent article by
> Garth Gibson about large capacity disks and large number of
> disks in a RAID group. Just to cut to the chase, given the
> Unrecoverable Read Error (URE) rate and large disks, during
> a rebuild you are almost guaranteed to hit a URE. When that
> happens, the rebuild stops and you have to restore everything
> from a backup. RAID-6 can help, but given enough disks and
> large enough disks, the same thing can happen (plus RAID-6
> rebuilds take longer since there are more computations involved).
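
A back-of-the-envelope sketch of the URE argument above. It assumes an
unrecoverable read error rate of 1 per 10^14 bits read (a figure commonly
quoted on SATA data sheets, used here as an assumption), independent
errors, and a hypothetical 8-disk RAID-5 group of 1 TB disks, so a rebuild
has to read the 7 surviving disks in full.

```python
import math


def p_ure_during_read(bytes_read, ure_per_bit=1e-14):
    """Probability of at least one unrecoverable read error while reading bytes_read bytes."""
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, written with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))


TB = 10**12  # decimal terabyte, as used by disk vendors

# Hypothetical RAID-5 rebuild: 8 disks of 1 TB, 7 survivors read end to end.
p = p_ure_during_read(7 * TB)
print(f"chance of hitting a URE during the rebuild: {p:.0%}")
```

The probability climbs quickly as disks get larger or the group gets
wider, since the number of bits read during a rebuild scales with both.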
> Jeff
> P.S. I guess I should disclose that my day job is at Panasas. But
> regardless, I would recommend reading some of Garth's comments.
> Maybe I can also get one of his presentations to pass around.
> P.P.S. If you don't know Garth, he's one of the fathers of RAID.
>> Hello Jakob,
>> A couple of things...
>> 1. ClusterFS has an easy to understand calculation on why raid 6 is
>> necessary for the number of disks you're considering. You do need to
>> plan for multi-disk failure, especially with the rebuild time of 1TB
>> disks.
>> http://manual.lustre.org/manual/LustreManual16_HTML/DynamicHTML-10-1.html#wp1037512
>> 2. Avoid tape if you can. At this scale, the administrative time and
>> costs far outweigh the benefits. Of course if you need to move your
>> data to a secure vault that's another thing. If you really want to do
>> tape, some people choose to do disk > disk > tape. This eliminates the
>> read interrupts on the primary storage and provides some added
>> redundancy.
>> 3. We do use Nexsan's satabeasts for storage similar to this. Without
>> commenting on costs, the jackrabbit is technologically superior.
>> Thanks,
>>                 jeff
>> On 8/23/07, Jakob Oestergaard <jakob at unthought.net> wrote:
>>> On Thu, Aug 23, 2007 at 07:56:15AM -0400, Joe Landman wrote:
>>>> Greetings Jakob:
>>> Hi Joe,
>>> Thanks for answering!
>>> ...
>>>> up front disclaimer: we design/build/market/support such things.
>>> That does not disqualify you  :)
>>>>> I'm looking at getting some big storage. Of all the parameters, getting
>>>>> as low dollars/(month*GB) is by far the most important. The price of
>>>>> acquiring and maintaining the storage solution is the number one concern.
>>>> Should I presume density, reliability, and performance also factor in
>>>> somewhere as 2,3,4 (somehow) on the concern list?
>>> I expect that the major components of the total cost of running this
>>> beast will be something like
>>>    acquisition
>>>  + power
>>>  + cooling
>>>  + payroll (disk-replacing admins :)
>>> Real-estate is a concern as well, of course. The rent isn't free. It would
>>> be nice to pack this in as few racks as possible.  Reliability, well... I
>>> expect frequent drive failures, and I would expect that we'd run some form
>>> of RAID to mitigate this. If the rest of the hardware is just reasonably
>>> well designed, the most frequently failing components should be redundant
>>> and hot-swap replaceable (fans and PSUs).
>>> It's acceptable that a head-node fails for a short period of time. The
>>> entire system will not depend on all head nodes functioning simultaneously.
>>>>> The setup will probably have a number of "head nodes" which receive a
>>>>> large amount of data over standard gigabit from a large number of remote
>>>>> sources. Data is read infrequently from the head nodes by remote systems.
>>>>> The primary load on the system will be data writes.
>>>> Ok, so you are write dominated.  Could you describe (guesses are fine)
>>>> what the writes will look like?  Large sequential data, small random
>>>> data (seek, write, close)?
>>> I would expect something like 100-1000 simultaneous streaming writes to
>>> just as many files (one file per writer). The files will be everything
>>> from a few hundred MiB to many GiB.
>>> I guess that on most filesystems these streaming sequential writes will
>>> result in something close to "random writes" to the block layer. However,
>>> we can be very generous with write buffering.
>>>>> The head nodes need not see the same unified storage; so I am not
>>>>> required to have one big shared filesystem. If beneficial, each of the
>>>>> head nodes could have their own local storage.
>>>> There are some interesting designs with a variety of systems, including
>>>> GFS/Lustre/... on those head nodes, and a big pool of drives behind
>>>> them.  These designs will add to the overall cost, and increase 
>>>> complexity.
>>> Simple is nice :)
>>>>> The storage pool will start out at around 100TiB and will grow to ~1PiB
>>>>> within a year or two (too early to tell). It would be nice to use as few
>>>>> racks as possible, and as little power as possible  :)
>>>> Ok, so density and power are important.  This is good.  Coupled with the
>>>>  low management cost and low acquisition cost, we have about 3/4 of what
>>>> we need.  Just need a little more description of the writes.
>>> I hope the above helped.
>>>> Also, do you intend to back this up?
>>> That is a *very* good question.
>>>> How important is resiliency of the
>>>> system?  Can you tolerate a failed unit (assume the units have hot
>>>> spares, RAID6, etc).
>>> Yes. Single head nodes may fail. They must be fairly quick to get back on
>>> line (having a replacement box I would expect no more than an hour of
>>> downtime).
>>>> When you look at storage of this size, you have to
>>>> start planning for the eventual (and likely) failure of a chassis (or
>>>> some number of them), and think about a RAIN configuration.
>>> Yep. I don't know how likely a "many-disk" failure would be... If I have a
>>> full replacement chassis, I would guess that I could simply pull out all
>>> the disks from a failed system, move them to the replacement chassis and
>>> be up and running again in "short" time.
>>> If a PSU decides to fry everything connected to it including the disks,
>>> then yes, I can see the point in RAIN or a full backup.
>>> It's a business decision if a full node loss would be acceptable. I
>>> honestly don't know that, but it is definitely interesting to consider
>>> both "yes" and "no".
>>>> Either
>>>> that, or invest into massive low level redundancy (which should be scope
>>>> limited to the box it is on anyway).
>>> Yes; I had something like RAID-5 or so in mind on the nodes.
>>>>> It *might* be possible to offload older files to tape; does anyone have
>>>>> experience with HSM on Linux?  Does it work?  Could it be worthwhile to
>>>>> investigate?
>>>> Hmmm...  First I would suggest avoiding tape, you should likely be
>>>> looking at disk to disk for backup, and use slower nearline mechanisms.
>>> Why would you avoid tape?
>>> Let's say there was software which allowed me to offload data to tape in a
>>> reasonable manner. Considering the running costs of disk versus tape, tape
>>> would win hands down on power, cooling and replacements.
>>> Sure, the random seek time of a tape library sucks golf balls through a
>>> garden hose, but assuming that one could live with that, are there more
>>> important reasons to avoid tape?
>>>>> One setup I was looking at, is simply using SunFire X4500 systems (you
>>>>> can put 48 standard 3.5" SATA drives in each 4U system). Assuming I can
>>>>> buy them with 1T SATA drives shortly, I could start out with 3 systems
>>>>> (12U) and grow the entire setup to 1P with 22 systems in little over two
>>>>> full racks.
>>>>> Any better ideas?  Is there a way to get this more dense without paying
>>>>> an arm and a leg?  Has anyone tried something like this with HSM?
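
The capacity and rack figures in the X4500 paragraph can be checked with a
few lines of arithmetic. This sketch uses the numbers quoted there (48
drives per 4U system, 1 TB drives) and assumes 42U racks, which the message
does not state. It counts raw capacity only, before any RAID overhead, and
the function name is made up.

```python
DRIVES_PER_SYSTEM = 48   # from the message
DRIVE_TB = 1             # from the message (1 TB SATA drives)
UNITS_PER_SYSTEM = 4     # from the message (4U chassis)
RACK_UNITS = 42          # assumption: standard 42U racks


def footprint(systems):
    """Return (raw capacity in TB, rack units used, racks used) for a system count."""
    raw_tb = systems * DRIVES_PER_SYSTEM * DRIVE_TB
    units = systems * UNITS_PER_SYSTEM
    return raw_tb, units, units / RACK_UNITS


for n in (3, 22):
    raw_tb, units, racks = footprint(n)
    print(f"{n} systems: {raw_tb} TB raw, {units}U, {racks:.2f} racks")
```
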
>>>> Yes, but I don't want to turn this into a commercial, so I will be
>>>> succinct.  Scalable Informatics (my company) has a similar product,
>>>> which does have a good price and price per gigabyte, while providing
>>>> excellent performance.  Details (white paper, benchmarks, presentations)
>>>> at the http://jackrabbit.scalableinformatics.com web site.
>>> Yep, I was just looking at that actually.
>>> The hardware looks similar in concept to the SunFire, but as I see it you
>>> guys have thought about a number of services atop of that (RAIN etc.)
>>> Very interesting!
>>> --
>>>  / jakob
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org
>>> To change your subscription (digest mode or unsubscribe) visit 
>>> http://www.beowulf.org/mailman/listinfo/beowulf