[Beowulf] Checkpointing using flash

Ellis H. Wilson III ellis at cse.psu.edu
Fri Sep 21 10:09:41 PDT 2012


On 09/21/12 12:58, Lux, Jim (337C) wrote:
> Yes.. If that's the frequency of checkpoints.  I was thinking more like 1
> checkpoint per second or 10 seconds.

While I suppose checkpoints that frequent might exist somewhere in the 
wild, I've never heard of checkpointing at that short an interval.  These 
huge cluster checkpoints are close to the size of the entire memory, so 
even today we're talking roughly 64 or 128 GB of RAM per node.  In ten 
years we're talking what, close to if not above a TB of RAM per node?  
Moreover, they 
all tend to write their checkpoint at the same time and the SSDs aren't 
on the compute nodes -- they're on some intermediate I/O storage nodes 
(akin to BlueGene's intermediate layer).  So we're talking about huge 
cluster-wide dumps of data to the flash intermediate layer, which then 
has to drain that data down to the more persistent HDDs.  That drain 
takes at the very least many minutes, and in the normal case hours.  
I would not be surprised if the best they could do at exascale was one 
checkpoint a day.  Again, I don't think these are used as the front-line 
of defense against failures.  That would really suck :D.
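
To put rough numbers on the above, here is a back-of-envelope sketch. 
Only the 128 GB per node comes from the figures I quoted; the node count 
and both aggregate bandwidths are made-up assumptions for illustration, 
not measurements from any real machine, and drain_time_seconds is just a 
hypothetical helper name.

```python
def drain_time_seconds(num_nodes, ram_per_node_gb, aggregate_bw_gb_per_s):
    """Seconds to move one full-memory checkpoint at a given aggregate bandwidth."""
    total_gb = num_nodes * ram_per_node_gb
    return total_gb / aggregate_bw_gb_per_s

# Assumed cluster: every value except RAM per node is hypothetical.
NODES = 10_000          # assumed node count
RAM_GB = 128            # RAM per node, from the figure quoted above
FLASH_BW_GB_S = 1_000   # assumed aggregate bandwidth into the flash layer
HDD_BW_GB_S = 100       # assumed aggregate bandwidth from flash down to HDDs

to_flash = drain_time_seconds(NODES, RAM_GB, FLASH_BW_GB_S)
to_hdd = drain_time_seconds(NODES, RAM_GB, HDD_BW_GB_S)

print(f"dump to flash: {to_flash / 60:.1f} minutes")
print(f"drain to HDD:  {to_hdd / 3600:.1f} hours")
```

With those assumed numbers the dump into flash is on the order of twenty 
minutes and the drain to disk is a few hours, which is why a checkpoint 
every second or every ten seconds doesn't look feasible to me at this 
scale.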

Best,

ellis


