[Beowulf] SSD caching for parallel filesystems
Ellis H. Wilson III
ellis at cse.psu.edu
Sat Feb 9 07:16:40 PST 2013
On 02/09/13 13:16, Mark Hahn wrote:
>> They buy a controller design from one place (some
>> make this component), SSD packages from someplace else, some channel
>> controllers, etc, etc, and strap it all together. Which is totally
> well, I only pay attention to the SATA SSD market, but the media
> controller is in the same chip as the flash controller, wear logic, etc.
Of course -- there is no reason to have to translate to PCIe. You are
already getting things encoded for SATA from the NAND flash, so
everything is easy. It's only when you use a different protocol outside
that you need to translate, and this is the case I'm referring to.
Pardon my snipping the previous aspects of this conversation -- for SATA
SSDs this entire conversation is moot. The NAND packages there are as
good as they could be relative to the overhead I'm discussing.
> so yes, there is some shopping around of flash components, but having
> industry-wide flash interface standards is hardly a bad thing.
I like standards too, and this is why these guys made a standard for the
I'm going to make a bold statement here, and you can take it or leave
it, but I truly believe SATA is on its way out for NVM devices. We
have to stop thinking about these things as "disks" and
start thinking about them as slow, huge memory. Just another step in
that hierarchy (reg, L1, L2, DRAM, NVM). Obviously we have some hurdles
to overcome before it's totally usable as plain-ol' OS-managed memory,
but we're getting there. I think PCIe is a step in the right direction,
and does a service to these devices in that it allows them to really
perform (NVM memories are out-pacing SATA by a lot, at least right now).
But ultimately, these things need to either be on-board or in DIMMs.
Just my opinion of course.
>> fine, but the problem arises because the volume for NAND flash packages
>> is for SATA-based drives. This results in most of the NAND packages
>> within exporting a SATA protocol.
> that confuses me. flash chips have a generic interface which I can't
> really see as being at all specific to a particular blockdev interface.
I'm obviously doing a horrible job explaining this -- my apologies!
Please see page 4 of this whitepaper and the diagram at the top of page
5 for what I think must be clear enough for me to convey this overhead:
To summarize this, getting more technical than I was willing to earlier:
in bridge-oriented PCIe-based SSDs, the protocol encoding has to be
converted between the SATA controller and the HBA controller, and that
conversion can cost as much as 25% of the bandwidth. Is this a
game-ender for the end-user? No. Is this something to keep in mind?
Sure. That's all I was angling at. It was just a suggestion for
Prentice to read up on prior to committing to a bunch of PCIe
drives. I don't work for Micron, Samsung, Intel, or any other company
in this space. No tomfoolery going on here :D.
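
For a rough sense of where a number of that size can come from, here is
a minimal Python sketch of line-encoding arithmetic. It is not taken from
the whitepaper, and treating line encoding as the source of the 25%
figure is an assumption. It only uses the well-known codings: SATA and
PCIe 1.x/2.x use 8b/10b, PCIe 3.0 uses 128b/130b. The function names are
made up for illustration.

```python
# Illustrative sketch only: payload bandwidth left after serial line coding.
# Whether this matches the overhead discussed above is an assumption.
from fractions import Fraction


def payload_mb_per_s(line_rate_gbit, data_bits, coded_bits):
    """Payload MB/s on a link that sends coded_bits for every data_bits."""
    bits_per_s = Fraction(line_rate_gbit) * 10**9 * Fraction(data_bits, coded_bits)
    return float(bits_per_s / 8 / 10**6)


def overhead_vs_payload(data_bits, coded_bits):
    """Extra bits on the wire, as a fraction of the payload bits."""
    return (coded_bits - data_bits) / data_bits


# (line rate in Gbit/s or GT/s per lane, data bits, coded bits)
LINKS = {
    "SATA 6 Gb/s (8b/10b)": (6, 8, 10),
    "PCIe 2.0 x1 (8b/10b)": (5, 8, 10),
    "PCIe 3.0 x1 (128b/130b)": (8, 128, 130),
}

if __name__ == "__main__":
    for name, (rate, data, coded) in LINKS.items():
        print(f"{name}: {payload_mb_per_s(rate, data, coded):.1f} MB/s payload, "
              f"{overhead_vs_payload(data, coded):.1%} coding overhead")
```

8b/10b puts 10 bits on the wire for every 8 bits of data, which is 25%
extra relative to the payload (20% of the raw line rate). A bridge that
re-encodes between two such links pays that cost on each side.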
> no, it doesn't. Micron has simply invented their own
> flash-to-disk-interface. if you're saying "skipping SATA is important",
> well, maybe. it looks from Micron's whitepapers that they are focused
> almost entirely on small random reads (not unreasonable). but that's a
> workload that doesn't stress the oddity of flash (managing pre-erased
> blocks and wear levelling). maybe I'm just picking at semantics, but
Small random reads are actually the toughest things to deal with for
properly burnt-in SSDs. Sure, the underlying NAND prefers reads to
writes, but ultimately, load-leveling across channels, packages, dies,
and planes is much tougher for reads that are destined for specific data
that already "lives" somewhere. This is particularly true for the tiny
queues available for SATA drives, where request reordering is extremely
limited. OTOH, small writes can easily be spread out over all of the
internal storage in parallel because the device can "choose" where they
go. Anyhow, I think we are running into overly detailed semantic
issues here. The take-away is that, because of very different
encoding between PCIe and SATA, it's inane to have SATA-controllered
NAND in a PCIe-based flash device and transcode all the time. Just fix
the damn bug. That's my story and I'm sticking to it ;).
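
The read-versus-write point above can be sketched with a toy model in
Python. Everything here is a simplifying assumption made for
illustration: a single level of parallelism ("channels") instead of
channels, packages, dies, and planes; uniformly random read targets;
one request per channel at a time; and hypothetical function names. It
is not a model of any real device.

```python
# Toy model: how many channels can a small queue keep busy?
# Reads are pinned to wherever the data already lives; writes are not.
import random


def busy_channels_reads(num_channels, queue_depth, rng):
    """Each queued read targets a fixed channel, so collisions leave others idle."""
    targets = [rng.randrange(num_channels) for _ in range(queue_depth)]
    return len(set(targets))


def busy_channels_writes(num_channels, queue_depth):
    """The device picks the placement, so writes can be striped round-robin."""
    return min(num_channels, queue_depth)


def mean_read_utilization(num_channels, queue_depth, trials=10_000, seed=0):
    """Average fraction of channels kept busy by one queue's worth of random reads."""
    rng = random.Random(seed)
    busy = sum(busy_channels_reads(num_channels, queue_depth, rng)
               for _ in range(trials))
    return busy / (trials * num_channels)


if __name__ == "__main__":
    channels, depth = 16, 32  # 32 is the NCQ limit; 16 channels is hypothetical
    reads = mean_read_utilization(channels, depth)
    writes = busy_channels_writes(channels, depth) / channels
    print(f"random reads keep ~{reads:.0%} of channels busy, writes {writes:.0%}")
```

In this model the writes always fill every channel once the queue is at
least as deep as the channel count, while random reads leave some
channels idle no matter how they are reordered within the queue.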
>> Managing the raw flash at the filesystem or driver level is an option,
>> but not what I was talking about (nor what is currently in vogue right
>> now, to my knowledge).
> well, it's interesting that some people are talking about adding flash
> to dram dimms - very unclear to me how that would work. but maybe they're
> merely using the dimm form-factor and proposing a new interface.
Who are these folks? I'd love to read what they are saying -- this is
where my next research is geared, so this is exciting to hear!
> I guess I have more respect for SATA than you do. the Micron thing is
Oh, don't get me wrong, I love SATA...for disks. I just am increasingly
seeing these devices less as disks than as giant, slow memory. I just
worry that if we force SATA to keep up with these things as they rapidly
improve, we may hurt things for magnetic storage, which still has a very
real and valuable place in the storage hierarchy. This is a new tier,
and we've treated it so far like a tier in storage. Maybe now that it's
so dang fast, it's time to rethink that.
> still just a disk interface - block shuffling. it arguably removes one
> level of protocol reformatting, but I'm not sure how much difference
> that would make to the consumer. a raid0 across SATA channels does a
> pretty good job of piling up IOPs...
This encoding issue takes a bigger hit on bandwidth than IOPs, fwiw.
You're probably correct about SATA working (in the short-term at least)
> OTOH, a PCIe card that really did map flash blocks directly into the
> memory space would be quite interesting. it just sounds tricky to get
I agree completely. Obviously from my previous ramblings, I'm hoping we
see more in this direction in the near future. Sure, there are hurdles,
but I do not believe they are any harder than those we've surmounted
in making it a disk-like device.