[Beowulf] Checkpointing using flash

Fri Sep 21 04:29:54 PDT 2012

On 09/21/12 10:49, Hearns, John wrote:
> http://www.theregister.co.uk/2012/09/21/emc_abba/
>
> Frequent checkpointing will of course be vital for exascale, given the
> MTBF of individual nodes.
>
> However how accurate is this statement:
>
> HPC jobs involving half a million compute cores ... have a series of
> checkpoints set up in their code with the entire memory state stored at
> each checkpoint in a storage node.

Is your concern about the accuracy of this statement that elReg claims 
they must dump "the entire memory", or is it about flash being used as 
a temporary checkpointing medium?

If the former -- note that with many, many physics and climate codes the 
application data dominates memory.  So while it may not be technically 
true that the "entire memory" is dumped in the checkpoint (the OS 
certainly won't/shouldn't dump its own memory), it is effectively true 
because 90% of the memory does end up getting dumped.

For what it's worth, flash (or some other reasonably dense medium faster 
than disk) being used in exascale machines is an absolute necessity for 
checkpointing according to my research and discussions.  I was lucky 
enough to sit in on a talk by Gary Grider of LANL last fall (the guy 
who basically designs and signs off on the purchase of their largest 
clusters, from what I understand) and John Bent (also of LANL, now at 
EMC).  They explained the nasty costs involved if they went totally disk 
or totally flash.  A hybrid solution was effectively the only 
cost-effective way to do this for them, and I expect we'll see similar 
trends in other labs in the near future.  I don't think he was even 
talking about full exascale, either -- more like 100 petaflops.
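
To make the shape of that cost argument concrete, here is a rough
back-of-the-envelope sketch in Python.  Every number in it (memory
size, dump window, device capacities, bandwidths, prices) is a made-up
placeholder of mine, not a figure from the talk or from any vendor.
The point it tries to show: a tier has to be sized for whichever of
capacity or bandwidth needs more devices, so an all-disk system buys
spindles for burst bandwidth, an all-flash system buys flash for
capacity, and a hybrid can let flash absorb the burst while disk holds
the bulk.

```python
import math

def tier_cost(capacity_tb, bandwidth_gbs, dev_tb, dev_gbs, dev_price):
    """Return (device count, total price) for a tier that must satisfy
    BOTH the capacity and the bandwidth requirement."""
    n = max(math.ceil(capacity_tb / dev_tb),
            math.ceil(bandwidth_gbs / dev_gbs))
    return n, n * dev_price

# --- Illustrative assumptions only; none of these come from the talk ---
MEM_TB = 1000.0          # aggregate memory dumped per checkpoint
DUMP_SECONDS = 300.0     # time allowed for the application to dump
INTERVAL_SECONDS = 3600.0  # time between checkpoints (drain window)
KEEP = 30                # checkpoints' worth of capacity kept on disk

DISK = dict(dev_tb=3.0, dev_gbs=0.1, dev_price=300.0)    # hypothetical drive
FLASH = dict(dev_tb=0.5, dev_gbs=1.0, dev_price=1000.0)  # hypothetical SSD

burst_gbs = MEM_TB * 1000.0 / DUMP_SECONDS      # bandwidth to absorb a dump
drain_gbs = MEM_TB * 1000.0 / INTERVAL_SECONDS  # bandwidth to drain to disk
total_tb = MEM_TB * KEEP

# All disk: must provide full capacity AND the burst bandwidth.
_, disk_cost = tier_cost(total_tb, burst_gbs, **DISK)

# All flash: same two requirements, met with flash only.
_, flash_cost = tier_cost(total_tb, burst_gbs, **FLASH)

# Hybrid: flash holds one checkpoint at burst speed; disk holds the
# bulk and only needs enough bandwidth to drain between checkpoints.
_, hybrid_flash = tier_cost(MEM_TB, burst_gbs, **FLASH)
_, hybrid_disk = tier_cost(total_tb, drain_gbs, **DISK)
hybrid_cost = hybrid_flash + hybrid_disk

for name, cost in [("all disk", disk_cost),
                   ("all flash", flash_cost),
                   ("hybrid", hybrid_cost)]:
    print(f"{name:10s} ${cost:,.0f}")
```

With these particular placeholder numbers the hybrid comes out
cheapest, but that ordering depends entirely on the assumed prices and
device specs, so plug in your own before drawing any conclusions.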

Disclaimer: Possible Bias -- My research is on flash development and 
caching for cluster computing at PSU.

Best,

ellis