[Beowulf] SSD caching for parallel filesystems

Vincent Diepeveen diep at xs4all.nl
Sat Feb 9 13:22:18 PST 2013


On Feb 9, 2013, at 4:16 PM, Ellis H. Wilson III wrote:

> On 02/09/13 13:16, Mark Hahn wrote:
>>> They buy a controller design from one place (some
>>> make this component), SSD packages from someplace else, some channel
>>> controllers, etc, etc, and strap it all together.  Which is totally
>>
>> well, I only pay attention to the SATA SSD market, but the media
>> controller is in the same chip as the flash controller, wear logic,
>> etc.
>
> Of course -- there is no reason to have to translate to PCIe.  You are
> already getting things encoded for SATA from the NAND flash, so
> everything is easy.  It's only when you use a different protocol
> outside that you need to translate, and this is the case I'm referring
> to.  Pardon my snipping the previous aspects of this conversation --
> for SATA SSDs this entire conversation is moot.  The NAND packages
> there are as good as they could be relative to the overhead I'm
> discussing.
>
>> so yes, there is some shopping around of flash components, but having
>> industry-wide flash interface standards is hardly a bad thing.
>
> I like standards too, and this is why these guys made a standard for
> the PCIe protocol:
> http://www.nvmexpress.org/
>
> I'm going to make a bold statement here, and you can take it or leave
> it, but I truly believe SATA is on its way out for NVM devices.

SATA is a very bad protocol for SSDs.

SSDs can service many reads and writes in parallel; SATA can't express
that (NCQ tops out at 32 outstanding commands), so SATA really limits
an SSD's true performance.
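
To make that concrete, here is a minimal sketch (mine, not a benchmark)
of what deep queueing looks like from user space with Linux libaio --
build with -laio; the device path, queue depth, and offsets are purely
illustrative.  SATA NCQ caps a drive at 32 outstanding commands no
matter how many you submit; an NVMe-style device can actually absorb a
queue like this:

#define _GNU_SOURCE             /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define QDEPTH 128              /* well past SATA NCQ's 32-command cap */
#define BLKSZ  4096

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cbs[QDEPTH], *cbp[QDEPTH];
    struct io_event events[QDEPTH];
    void *bufs[QDEPTH];

    /* device path is illustrative -- point it at whatever you test */
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    if (fd < 0 || io_setup(QDEPTH, &ctx) < 0)
        return 1;

    for (int i = 0; i < QDEPTH; i++) {
        posix_memalign(&bufs[i], BLKSZ, BLKSZ);   /* O_DIRECT alignment */
        /* spread the reads out; a real test would randomize offsets */
        io_prep_pread(&cbs[i], fd, bufs[i], BLKSZ,
                      (long long)i * 1024 * BLKSZ);
        cbp[i] = &cbs[i];
    }

    /* hand the device all 128 reads at once, then wait for completions */
    if (io_submit(ctx, QDEPTH, cbp) != QDEPTH)
        return 1;
    int done = io_getevents(ctx, QDEPTH, QDEPTH, events, NULL);
    printf("%d reads completed\n", done);

    io_destroy(ctx);
    close(fd);
    return 0;
}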

> [I
> believe] We have to stop thinking about these things as "disks" and
> start thinking about them as slow, huge memory.  Just another step in
> that hierarchy (reg, L1, L2, DRAM, NVM).  Obviously we have some
> hurdles to overcome before it's totally usable as plain-ol' OS-managed
> memory, but we're getting there.  I think PCIe is a step in the right
> direction, and does a service to these devices in that it allows them
> to really perform (NVM memories are out-pacing SATA by a lot, at least
> right now).  But ultimately, these things need to either be on-board
> or in DIMMs.  Just my opinion of course.
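
(For scale -- ballpark figures, not measurements: a register is a
fraction of a nanosecond away, L1 about a nanosecond, L2 a few
nanoseconds, DRAM on the order of 100 ns, and a NAND flash read tens of
microseconds.  So NVM slots in roughly three orders of magnitude slower
than DRAM, while still a couple of orders of magnitude faster than a
disk seek -- a genuinely new rung on the ladder, not just a faster
disk.)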
>
>>> fine, but the problem arises because the volume for NAND flash
>>> packages is for SATA-based drives.  This results in most of the NAND
>>> packages within exporting a SATA protocol.
>>
>> that confuses me.  flash chips have a generic interface which I can't
>> really see as being at all specific to a particular blockdev  
>> interface.
>
> I'm obviously doing a horrible job explaining this -- my apologies!
> Please see page 4 of this whitepaper and the diagram at the top of
> page 5 for what I think must be clear enough for me to convey this
> overhead:
>
> http://www.marvell.com/storage/system-solutions/native-pcie-ssd-controller/assets/Marvell-Native-PCIe-SSD-Controllers-WP.pdf
>
> Getting more technical than I was willing to earlier to summarize
> this: there's a protocol-encoding conversion going on between the SATA
> controller and the HBA controller in bridge-oriented PCIe-based SSDs
> that can incur as much as a 25% overhead in bandwidth.  Is this a
> game-ender for the end-user?  No.  Is this something to keep in mind?
> Sure.  That's all I was angling at.  It was just a suggestion for
> Prentice to research prior to committing to a bunch of PCIe drives.
> I don't work for Micron, Samsung, Intel, or any other company in this
> space.  No tomfoolery going on here :D.
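
For a rough sense of the numbers (a back-of-envelope sketch only -- the
25% figure is the whitepaper's worst case and the line rates are the
published ones, nothing here is measured):

#include <stdio.h>

int main(void)
{
    /* SATA 3.0 runs at 6 Gbit/s on the wire with 8b/10b encoding,
       so the payload ceiling is 6e9 * 8/10 / 8 bytes/s = ~600 MB/s */
    double sata3_payload = 6.0e9 * 8.0 / 10.0 / 8.0;

    /* worst-case transcode overhead cited for bridge-based PCIe SSDs */
    double bridge_loss = 0.25;

    printf("SATA 3.0 payload ceiling: %.0f MB/s\n",
           sata3_payload / 1e6);
    printf("after ~25%% bridge loss : %.0f MB/s\n",
           sata3_payload * (1.0 - bridge_loss) / 1e6);
    return 0;
}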
>
>> no, it doesn't.  Micron has simply invented their own
>> flash-to-disk interface.  if you're saying "skipping SATA is
>> important", well, maybe.  it looks from Micron's whitepapers that
>> they are focused almost entirely on small random reads (not
>> unreasonable).  but that's a workload that doesn't stress the oddity
>> of flash (managing pre-erased blocks and wear levelling).  maybe I'm
>> picking at semantics, but
>
> Small random reads are actually the toughest things to deal with for
> properly burnt-in SSDs.  Sure, the underlying NAND prefers reads to
> writes, but ultimately, load-leveling across channels, packages, dies,
> and planes is much tougher for reads that are destined for specific
> data that already "lives" somewhere.  This is particularly true given
> the tiny queues available to SATA drives, where request reordering is
> extremely limited.  OTOH, small writes can easily be spread out over
> all of the internal storage in parallel because the device can
> "choose" where they go.  Anyhow, I think we are getting into overly
> detailed semantic issues here, though.  The take-away is that, because
> of the very different encoding between PCIe and SATA, it's inane to
> have SATA-controlled NAND in a PCIe-based flash device and transcode
> all the time.  Just fix the damn bug.  That's my story and I'm
> sticking to it ;).
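
A toy way to see that asymmetry (a sketch only -- the channel count and
the least-busy placement policy are made up for illustration; a real
FTL also juggles packages, dies, planes, and wear):

#include <stdio.h>

#define CHANNELS 8
#define PAGES    64

int main(void)
{
    int wload[CHANNELS] = {0};  /* writes queued per channel */
    int rload[CHANNELS] = {0};  /* reads queued per channel  */
    int home[PAGES];            /* channel each logical page landed on */

    /* writes: the device chooses placement, so load balances almost
       perfectly across all channels */
    for (int p = 0; p < PAGES; p++) {
        int best = 0;
        for (int c = 1; c < CHANNELS; c++)
            if (wload[c] < wload[best])
                best = c;
        wload[best]++;
        home[p] = best;
    }

    /* reads: no placement choice -- a skewed workload that keeps
       re-reading four hot pages can keep at most four channels busy */
    for (int i = 0; i < PAGES; i++)
        rload[home[i % 4]]++;

    for (int c = 0; c < CHANNELS; c++)
        printf("channel %d: %2d writes, %2d reads\n",
               c, wload[c], rload[c]);
    return 0;
}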
>
>>> Managing the raw flash at the filesystem or driver level is an
>>> option, but not what I was talking about (nor what is currently in
>>> vogue right now, to my knowledge).
>>
>> well, it's interesting that some people are talking about adding
>> flash to dram dimms - very unclear to me how that would work.  but
>> maybe they're merely using the dimm form-factor and proposing a new
>> interface.
>
> Who are these folks?  I'd love to read what they are saying -- this is
> where my next research is geared, so this is exciting to hear!
>
>> I guess I have more respect for SATA than you do.  the Micron  
>> thing is
>
> Oh, don't get me wrong, I love SATA...for disks.  I'm just
> increasingly seeing these devices less as disks and more as giant,
> slow memory.  I just worry that if we force SATA to keep up with these
> things as they rapidly improve, we may hurt things for magnetic
> storage, which still has a very real and valuable place in the storage
> hierarchy.  This is a new tier, and we've treated it so far like a
> tier in storage.  Maybe now that it's so dang fast, it's time to
> rethink that.
>
>> still just a disk interface - block shuffling.  it arguably removes
>> one level of protocol reformatting, but I'm not sure how much
>> difference that would make to the consumer.  a raid0 across SATA
>> channels does a pretty good job of piling up IOPs...
>
> This encoding issue takes a bigger hit on bandwidth than on IOPs,
> fwiw.  You're probably correct about SATA working (in the short term,
> at least) for IOPs.
>
>> OTOH, a PCIe card that really did map flash blocks directly into the
>> memory space would be quite interesting.  it just sounds tricky to  
>> get
>
> I agree completely.  Obviously from my previous ramblings, I'm hoping
> we see more in this direction in the near future.  Sure there are
> hurdles, but I do not believe they are any harder than those we've
> surmounted in making it a disk-like device.
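
For what that could look like from user space, a hedged sketch: assume
a hypothetical driver that exposes the card's flash as a mappable
device node (/dev/flashmem below is made up -- today mmap() only gets
you this for files sitting on a filesystem):

#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* hypothetical device node exported by an imagined flash driver */
    int fd = open("/dev/flashmem", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    size_t len = 1UL << 30;                 /* map 1 GiB of flash */
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* plain loads and stores replace read()/write() syscalls */
    p[0] = 42;
    printf("first byte: %d\n", p[0]);

    munmap(p, len);
    close(fd);
    return 0;
}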
>
> Best,
>
> ellis
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin  
> Computing
> To change your subscription (digest mode or unsubscribe) visit  
> http://www.beowulf.org/mailman/listinfo/beowulf



