[Beowulf] Checkpointing using flash

Ellis H. Wilson III ellis at cse.psu.edu
Fri Sep 21 09:44:39 PDT 2012


On 09/21/12 12:29, Lux, Jim (337C) wrote:
> Flash is slow, though...  SLC NAND flash (pretty fast, 8 Gbit part) is 250
> microseconds to write a 4kbyte (approx) page.  Erasing is about 700
> microseconds  (reading is 25 microseconds)
>
> MLC flash (say 512Gbit parts with 8 kbyte pages) takes 1.3 milliseconds to
> write a page, 3.8 ms to erase (75us to read)... And has a life of 3000
> write/erase cycles.

Modern MLC is guaranteed for at least 10k cycles per page, and the
research I'm doing at PSU suggests, to me at least, that this is a very
low bar.  Actual endurance is often much higher than that.

> That's 53 Mbps streaming to the part.  Yeah, any practical design is going
> to have multiple interleaved devices, etc. so you can probably do it
> faster..
>
> But still, say you are checkpointing 8Gbyte.. That's 1300 seconds (yep,
> about 20 minutes), assuming you've previously erased everything.

As you mention, there are multiple interleaved devices.  Specifically,
modern flash devices (SSDs, which is what they plan to do this
checkpointing with) have several layers of parallelism within them:
channels, packages, and dies.  Think something like 4-8 channels, each
with multiple packages (8-16, I think, in modern devices), and each
package holding multiple dies (2-4 is common).  Only inside each die do
you finally reach the individual flash blocks, pages, and cells.

So you can't calculate an SSD's speed the way you would an HDD's, as one
serial device: the aggregate rate comes from many dies writing in parallel.
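
As a rough sketch of why the per-die arithmetic does not carry over to a
whole SSD, the Python below works out the write bandwidth of one die from
the MLC figures quoted above (8 KB page, 1.3 ms program time), then scales
it by the number of dies that could in principle be programming at once.
The function names and the perfect-scaling assumption are mine, not taken
from any real controller.  Actual drives are capped by channel bus speed,
the controller, and the host interface, so treat the aggregate as an upper
bound.

```python
# Idealized model: every die programs pages back to back, and all dies
# run fully in parallel. Real SSDs will fall short of this.

PAGE_BYTES = 8 * 1024   # 8 KB page, from the quoted MLC figures
PROGRAM_S = 1.3e-3      # 1.3 ms page program time, from the quoted figures


def die_write_bandwidth(page_bytes=PAGE_BYTES, program_s=PROGRAM_S):
    """Bytes per second one die can program, ignoring erases and bus time."""
    return page_bytes / program_s


def ideal_aggregate_bandwidth(channels, packages_per_channel, dies_per_package):
    """Upper-bound bytes per second if every die writes concurrently."""
    dies = channels * packages_per_channel * dies_per_package
    return dies * die_write_bandwidth()


if __name__ == "__main__":
    per_die = die_write_bandwidth()
    print(f"one die: {per_die / 1e6:.1f} MB/s")
    # Low end of the ranges mentioned above: 4 channels, 8 packages, 2 dies.
    total = ideal_aggregate_bandwidth(4, 8, 2)
    print(f"64 dies (ideal): {total / 1e6:.0f} MB/s")
```

One die alone comes out at roughly 6.3 MB/s, which is where the 1300
second figure comes from.  Even the low end of the ranges above (64 dies)
would be around 400 MB/s under this idealized model.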

Basic COTS SSDs can provide upwards of 200MB/s sustained writes until
erases have to be done or you've filled more than ~80% of the drive.
So it's more like 40-60 seconds for 8GB, certainly not 1300 seconds.
Use PCI-E flash devices and you're looking at something much closer to
500MB/s to 1GB/s, depending on what you are willing to spend.

I mean, think about it: modern HDDs can easily hit 100MB/s on streaming
sequential writes.  At 1300s to do 8GB you're suggesting flash is much
slower than that (around 6MB/s), which is definitely not the case.
Maybe for USB thumb drives or some ridiculous single-device medium, but
not real SSDs (especially PCI-E flash devices).
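
For anyone who wants to check the arithmetic, here is a minimal Python
sketch of checkpoint time versus sustained write bandwidth.  The helper
names are just for illustration, and it assumes decimal units (1 GB =
10^9 bytes) and a constant sustained rate with no erase stalls.

```python
# Checkpoint time as a function of sustained write bandwidth.
# Assumes decimal units and a constant rate with no erase stalls.

GB = 10**9
MB = 10**6


def checkpoint_seconds(size_bytes, bandwidth_bytes_per_s):
    """Seconds to stream size_bytes at a constant sustained bandwidth."""
    return size_bytes / bandwidth_bytes_per_s


def implied_bandwidth(size_bytes, seconds):
    """Bandwidth in bytes per second implied by a given checkpoint time."""
    return size_bytes / seconds


if __name__ == "__main__":
    size = 8 * GB
    for label, bw in [("HDD, 100MB/s", 100 * MB),
                      ("COTS SSD, 200MB/s", 200 * MB),
                      ("PCI-E flash, 1GB/s", 1 * GB)]:
        print(f"{label}: {checkpoint_seconds(size, bw):.0f} s")
    # The rate that a 1300 s checkpoint of 8GB would imply:
    print(f"1300 s implies {implied_bandwidth(size, 1300) / MB:.1f} MB/s")
```

At 200MB/s this gives 40 seconds for 8GB, and the 1300 second estimate
works out to a bit over 6MB/s.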

> Fast compared to disk, maybe, but very slow.  Why not just mirror memory
> (other than cost and power:  RAM is much less dense than flash)

The cost and power concerns you mention with RAM mirroring are 
absolutely huge.  Flash is a steal compared to RAM on both counts.

> There's also the write cycle limit.. If you're looking for very high
> densities (USB thumb drive) you're looking at
> A) serial interfaces
> B) MLC NAND with maybe 10k cycle life on each page

Let's say you do one checkpoint that saturates your flash every 4 hours 
and let the flash trickle that out to the underlying HDDs over the next 
4 hours before your next checkpoint.  Even with MLC (10k guarantee) 
that's around 5 years before you hit the guarantee, and I bet you'll be 
able to go a while after that.  Given that major supers don't last 5 
years, this is a non-issue.
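
The lifetime estimate can be sketched in a few lines of Python.  This
assumes each checkpoint costs exactly one write/erase cycle on every page
(that is, the checkpoint fills the device once, wear leveling is ideal,
and write amplification is 1), which is my simplification.  The function
name and parameters are hypothetical.

```python
# Years until the guaranteed write/erase cycle count is reached.
# Assumes ideal wear leveling and one full-device write per checkpoint.

HOURS_PER_YEAR = 365 * 24


def years_to_cycle_limit(cycle_limit, hours_between_checkpoints,
                         cycles_per_checkpoint=1.0):
    """Years of checkpointing before every page hits cycle_limit."""
    checkpoints_per_year = HOURS_PER_YEAR / hours_between_checkpoints
    return cycle_limit / (checkpoints_per_year * cycles_per_checkpoint)


if __name__ == "__main__":
    # 10k-cycle MLC, one device-filling checkpoint every 4 hours.
    print(f"{years_to_cycle_limit(10_000, 4):.1f} years")
```

With a 10k guarantee and a checkpoint every 4 hours this comes out to a
little over 4.5 years, in line with the "around 5 years" above.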

Best,

ellis


