[Beowulf] Software RAID?

Joe Landman landman at scalableinformatics.com
Mon Nov 26 18:56:24 PST 2007


Ekechi Nwokah wrote:
> Reposting with (hopefully) more readable formatting.

[...]

>> Of course there are a zillion things you didn't mention.  How
>> many drives did you want to use?  What kind? (SAS? SATA?)  If
>> you want 16 drives, you often get a hardware RAID card even
>> if you don't use it.
>> What config did you want?
>> RAID-0? 1? 5? 6? Filesystem?
>>
> 
> So let's say it's 16. But in theory it could be as high as 192. Using
> multiple JBOD cards that present the drives individually (as separate
> LUNs, for lack of a better term), and using software RAID to do all the
> things that a 3ware/Areca, etc. card would do across the total span of
> drives:

Hmmm... Anyone with a large disk-count SW RAID want to run a few
bonnie++-like loads on it and look at the interrupt/csw rates?  Last I
looked on a RAID0 (2 disk) we were seeing very high interrupt/csw rates.
This would quickly swamp any perceived advantages of "infinitely many"
or "infinitely fast" cores.  Sort of like Amdahl's law: make the
expensive parallel computing portion take zero time, and you are still
stuck with the serial time (which you can't do much about).  Worse, it
is size-extensive, so as you increase the number of disks you also
increase the interrupt rate (one interrupt source per drive currently),
and the base SATA drivers seem to have a problem with lots of CSW.
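
If anyone wants to collect numbers, here is a rough sketch (my own
illustration, nothing to do with the md code) that samples the kernel's
cumulative "intr" and "ctxt" counters in /proc/stat once a second; kick
off bonnie++ or IOzone in another shell and watch the per-second deltas.
vmstat 1 shows roughly the same thing in its "in" and "cs" columns.

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Read the cumulative interrupt and context-switch counters
 * from /proc/stat (the "intr" and "ctxt" lines). */
static void read_counters(unsigned long long *intr, unsigned long long *ctxt)
{
    char line[4096];
    FILE *f = fopen("/proc/stat", "r");
    if (!f)
        return;
    while (fgets(line, sizeof(line), f)) {
        if (!strncmp(line, "intr ", 5))
            sscanf(line + 5, "%llu", intr);   /* first field is the total */
        else if (!strncmp(line, "ctxt ", 5))
            sscanf(line + 5, "%llu", ctxt);
    }
    fclose(f);
}

int main(void)
{
    unsigned long long i0 = 0, c0 = 0, i1 = 0, c1 = 0;

    /* Sample once per second; run the IO load in another shell and
     * watch the per-second deltas. */
    read_counters(&i0, &c0);
    for (;;) {
        sleep(1);
        read_counters(&i1, &c1);
        printf("interrupts/s: %llu   csw/s: %llu\n", i1 - i0, c1 - c0);
        i0 = i1;
        c0 = c1;
    }
    return 0;
}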

> 
> RAID 0/1/5/6, etc., hotswap, SAS/SATA capability, etc.
> 
>> Oh, and how do you measure performance?  Bandwidth?  Seeks?
>> Transactions?
>> Transaction size?  Mostly read? write?
>>
> 
> 
> All of the above. We would be targeting max per-drive performance, say
> 70MB/s reads with 100 IOPS on SATA, 120MB/s reads with 300 IOPS on SAS,
> using 4k transaction sizes. Hopefully we eliminate any queueing
> bottlenecks from the hardware RAID card.

This (queuing bottleneck) hasn't really been an issue in most of the 
workloads we have seen.  Has anyone seen this as an issue on their 
workloads?
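
Just to put those per-drive numbers in perspective (simple arithmetic on
the figures you quoted):

   100 IOPS x 4 KiB  ~  0.4 MB/s per SATA drive, random
   300 IOPS x 4 KiB  ~  1.2 MB/s per SAS drive, random
   vs. 70-120 MB/s per drive streaming

so the small-random and streaming cases are a factor of 100-200 apart
per spindle, and they stress very different parts of the stack (seek and
queueing behavior versus raw bandwidth and parity throughput).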

> Assume that we are using RDMA as the network transfer protocol so there
> are no network interrupts on the cpus being used to do the XORs, etc.

Er ... so your plan is to use something like a network with RDMA to
attach the disks.  So you are not using SATA controllers; you are using
network controllers, with some sort of offload capability (RDMA without
it is a little slow).

How does this save money/time again?  You are replacing "expensive" RAID
controllers with "expensive" network controllers (unless you forgo
offload, in which case RDMA doesn't make much sense).

Which network were you planning on using for the disks?  Gigabit?  10 
GbE?  IB?

You sort of have something like this today in Coraid's AoE units.  If
you don't have experience with them, you should ask about what happens
to the user load under intensive IO operations.  Note: there is nothing
wrong with Coraid units; we like them (and in full disclosure, we do
resell them, and happily connect them to our JackRabbit units).

> Right now, all the hardware cards start to precipitously drop in
> performance under concurrent access, particularly read/write mixes.

Hmmm.... Are there particular workloads you are looking at?  Huge reads
with a tiny write?  Most of the RAID systems we have seen suffer on
small-block random I/O.  There your RAID system will get in the way (all
the extra seeks and computations will slow you down relative to single
disks).  For that you want RAID10s.

We have put our units (as well as software RAIDs) through some pretty
hard tests: a single RAID card feeding 4 simultaneous IOzone and
bonnie++ tests (each test 2x the RAM in the server box) through
channel-bonded quad gigabit.  Apart from uncovering some kernel OOPSes
due to the channel bond driver not liking really heavy loads, we
sustained 360-390 MB/s out of the box, with large numbers of concurrent
reads and writes.  We simply did not see degradation.  Could you cite
some materials I can go look at, or help me understand which workloads
you are talking about?
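
For reference, the wire is the ceiling in that test: four bonded gigabit
links give at best roughly

   4 x ~117 MB/s of TCP payload  ~  470 MB/s  (1500-byte MTU, ideal case)

so sustaining 360-390 MB/s means the bond was running at around 80% of
its practical limit; the array itself was not the bottleneck.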

> Areca is the best of the bunch, but it's not saying much compared to
> Tier 1 storage ASICs/FPGAs. 

You get what you pay for.


> The idea here is twofold. Eliminate the cost of the hardware RAID and

I think you are going to wind up paying more than that cost in other
elements, such as networking and JBOD cards (good ones, not the ones
with crappy drivers).

> handle concurrent accesses better. My theory is that 8 cores
> would handle concurrent ARRAY access much better than the chipsets on
> the hardware cards, and that if you did the parity calculations, CRC,
> etc. using the SSE instruction set you could achieve a high level of
> parallelism and performance.

The parity calculations are fairly simple, and last I checked, the MD
driver *DOES* check which method computes parity fastest at array
assembly time.  In fact, you can see SSE2, MMX, and Altivec
implementations of RAID6 in the Linux kernel source.  Specifically, look
at raid6sse2.c:

/*
 * raid6sse2.c
 *
 * SSE-2 implementation of RAID-6 syndrome functions
 *
 */

You can see the standard calc, the unrolled-by-2 calc, etc.
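
Stripped of the SSE2 plumbing, the standard calc boils down to something
like this (a simplified byte-at-a-time sketch in plain C of what the
kernel does a machine word, or an SSE2 register, at a time; the names
and buffer layout here are mine):

#include <stddef.h>
#include <stdint.h>

/* Multiply by x (i.e. by 2) in GF(2^8) with the RAID-6 polynomial 0x11d. */
static uint8_t gf2_mul2(uint8_t v)
{
    return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0x00));
}

/* Compute the P (plain XOR) and Q (Reed-Solomon) syndromes over
 * ndisks data buffers of len bytes each, one byte at a time. */
void gen_syndrome(int ndisks, size_t len, uint8_t **data,
                  uint8_t *p, uint8_t *q)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t wp = data[ndisks - 1][i];
        uint8_t wq = wp;
        for (int z = ndisks - 2; z >= 0; z--) {  /* Horner's rule, g = 2 */
            uint8_t wd = data[z][i];
            wp ^= wd;                  /* P = D0 ^ D1 ^ D2 ^ ...      */
            wq = gf2_mul2(wq) ^ wd;    /* Q = D0 ^ 2*D1 ^ 4*D2 ^ ...  */
        }
        p[i] = wp;
        q[i] = wq;
    }
}

The SSE2 version in raid6sse2.c is essentially this, just 16 bytes per
instruction and with the multiply-by-2 done with masks and shifts
instead of a branch.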

If this is limited by anything (just eyeballing it), it would be a) a
lack of functional units, b) SSE2 issue rate, or c) SSE2 operand width.

Lack of functional units can sort of be handled by more cores.  However,
this code is essentially assembly language written in C, and parallel
assembly programming is not fun.

Moreover, OS jitter and context switching away from these calculations
will be *expensive*, as you have to restore not just the normal register
state but all of the SSE2 registers as well.  You would want to be able
to dedicate entire cores to this, and isolate interrupt handling to
other cores.
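
A rough sketch of what "dedicate a core" could look like from userspace
(the affinity calls are real Linux APIs; the parity-worker framing is
just my illustration, not something md does for you): pin the compute
thread with pthread_setaffinity_np(), and steer the SATA/network IRQs
elsewhere via /proc/irq/<n>/smp_affinity.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Hypothetical parity worker: in a real setup this would loop over
 * stripe buffers calling the SSE2 syndrome routine. */
static void *parity_worker(void *arg)
{
    (void)arg;
    /* ... compute P/Q syndromes here ... */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    cpu_set_t set;

    pthread_create(&tid, NULL, parity_worker, NULL);

    /* Pin the worker to core 7; the other cores are left free to take
     * SATA/network interrupts (steer those away from core 7 via
     * /proc/irq/<n>/smp_affinity). */
    CPU_ZERO(&set);
    CPU_SET(7, &set);
    if (pthread_setaffinity_np(tid, sizeof(set), &set) != 0)
        fprintf(stderr, "pthread_setaffinity_np failed\n");

    pthread_join(tid, NULL);
    return 0;
}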

> I just haven't seen something like that and I was not aware that md
> could achieve anything close to the performance of a hardware RAID card
> across a reasonable number of drives (12+), let alone provide the
> feature set.

Due to SATA driver CSW/interrupt handling, I would be quite surprised if
it were able to do this (achieve similar performance).  I would bet
performance would top out below 8 drives; my own experience suggests 4
drives.  After that, you have to start spending money on those SATA
controllers, and you will still be plagued by interrupts/CSW, which will
limit your performance.  Your costs will start approaching those of the
"expensive" RAID cards.

What we have found, generally, is that performance on SATA is very much
a function of the quality of the driver, the implementation details of
the controller, and how it handles heavy IO (does it swamp the
motherboard with interrupts?).  I have a SuperMicro 8-core deskside unit
with a small RAID0 on 3 drives.  When I try to push the RAID0 hard, I
swamp the motherboard with huge numbers of interrupts/CSW.  Note that
this is not even doing RAID calculations, simply IO.

You are rate-limited by how fast the underlying system can handle IO.
The real value of any offload processor is, not so oddly enough, how
well it offloads stuff (calculations, interrupts, IO, ...) from the main
CPUs.  Some of the RAID cards for these units do a pretty good job of
offloading; some are crap (and even with its issues, SW RAID is faster
than the crappy ones).



> 
> -- Ekechi
> 

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


