[Beowulf] Checkpointing using flash

Fri Sep 21 09:13:27 PDT 2012

I would suggest that some scheme of redundant computation might be more
effective. Storing each node's state on the node and then, if any node
hiccups, restoring that state (perhaps to a spare) and restarting means
stopping the entire cluster while you recover.
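
To put rough numbers on how often that stop-and-recover happens, here is
a back-of-the-envelope sketch. It assumes independent, exponentially
distributed node failures (so system MTBF is node MTBF divided by node
count) and uses Young's first-order approximation for the checkpoint
interval. The node MTBF, node count, and checkpoint cost below are
made-up illustrative values, not measurements from any real machine.

```python
import math

def system_mtbf(node_mtbf_hours, n_nodes):
    """MTBF of the whole machine, assuming independent node failures
    with exponentially distributed lifetimes (an assumption)."""
    return node_mtbf_hours / n_nodes

def young_interval(checkpoint_cost_hours, mtbf_hours):
    """Young's first-order approximation: T = sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)

# Hypothetical inputs: 5-year node MTBF, 100,000 nodes, 10-minute checkpoint.
node_mtbf = 5 * 365 * 24            # hours
checkpoint_cost = 10 / 60           # hours
mtbf = system_mtbf(node_mtbf, 100_000)
interval = young_interval(checkpoint_cost, mtbf)

# Fraction of wall-clock time spent writing checkpoints rather than computing.
overhead = checkpoint_cost / (interval + checkpoint_cost)

print(f"system MTBF:         {mtbf:.3f} h")
print(f"checkpoint interval: {interval:.3f} h")
print(f"checkpoint overhead: {overhead:.0%}")
```

With inputs like these the machine fails more often than once an hour,
and a large share of the run goes to writing checkpoints. Young's
formula is only a first-order estimate and gets less accurate once the
checkpoint cost is no longer small next to the MTBF.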

Instead, if you can factor your computation to make use of extra
processing nodes, you can just keep on moving.  Think of this as a
higher-level analogue of, say, Hamming codes for memory protection:  use
12 bits to store 8, and you're still synchronous.
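
For reference, here is a minimal sketch of the memory-protection case. A
single-error-correcting Hamming code for 8 data bits needs 4 check bits,
so the sketch uses a (12,8) layout with parity bits at positions 1, 2, 4
and 8. The function names are illustrative only, and real ECC memory
normally adds one more bit for double-error detection.

```python
PARITY_POS = (1, 2, 4, 8)
DATA_POS = (3, 5, 6, 7, 9, 10, 11, 12)

def encode(byte):
    """Encode 8 data bits into a 12-bit codeword (list indexed 1..12)."""
    code = [0] * 13                       # index 0 is unused
    for i, pos in enumerate(DATA_POS):
        code[pos] = (byte >> i) & 1
    for p in PARITY_POS:
        # Parity bit p covers every position whose index has bit p set.
        code[p] = sum(code[k] for k in range(1, 13) if k & p and k != p) % 2
    return code

def syndrome(code):
    """Return 0 for a clean codeword, else the position of a single flipped bit."""
    s = 0
    for p in PARITY_POS:
        if sum(code[k] for k in range(1, 13) if k & p) % 2:
            s |= p
    return s

def decode(code):
    """Correct at most one flipped bit; return (byte, syndrome)."""
    code = list(code)
    s = syndrome(code)
    if 1 <= s <= 12:
        code[s] ^= 1                      # flip the bad bit back
    byte = sum(code[pos] << i for i, pos in enumerate(DATA_POS))
    return byte, s

word = encode(0xA7)
word[6] ^= 1                              # simulate a single-bit upset
value, where = decode(word)
print(hex(value), where)
```

The point of the analogy is that the read path never stops: the
correction happens inline, at the cost of extra bits.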

Assuming your algorithm has the ability to self-detect an error, you could
just use 2N nodes, and only take correct outputs from node I and/or node
I+1 to feed to node M (and M+1).
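
A toy sketch of that pairing, under stated assumptions: each stage is a
(compute, check) pair, the check is cheap compared with the compute, and
a "node" is just a loop iteration here. Integer square root is used only
because it has an obvious self-check; run_pair and the faulty parameter
are hypothetical names for illustration.

```python
import math

def isqrt_compute(x):
    return math.isqrt(x)

def isqrt_check(x, y):
    # y is the integer square root of x iff y^2 <= x < (y+1)^2.
    return y * y <= x < (y + 1) * (y + 1)

def run_pair(compute, check, x, faulty=()):
    """Run a stage on two replicas (node I and node I+1) and return
    (output, replica_index) for the first output that passes its check.
    `faulty` lists replica indices whose result is corrupted on purpose
    to simulate a transient error."""
    for replica in (0, 1):
        y = compute(x)
        if replica in faulty:
            y += 1                        # simulated hiccup
        if check(x, y):
            return y, replica
    raise RuntimeError("both replicas failed their self-check")

# Stage I feeds stage M; replica 0 of stage I hiccups, replica 1 covers.
mid, used_i = run_pair(isqrt_compute, isqrt_check, 10_000, faulty=(0,))
out, used_m = run_pair(isqrt_compute, isqrt_check, mid)
print(mid, used_i, out, used_m)
```

Nothing stops and nothing is rolled back; the downstream pair simply
takes whichever upstream output checked out. The scheme fails only when
both replicas of the same stage go wrong at once.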

This has been done for some specialized algorithms at a lower level (e.g.
FFT), where there are some tricks for detecting an arithmetic error.
Or you could go the brute-force triple-and-vote route (triple modular
redundancy), but that has its share of problems (the voter has to be
very reliable).
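
Both ideas can be sketched briefly. Which FFT tricks were used in
practice is not specified here; the example below picks one well-known
invariant, Parseval's theorem (the signal energy is the same in the time
and frequency domains), purely as an illustration. It catches many
arithmetic errors but not all of them: a pure phase error leaves the
energy unchanged. The plain O(N^2) DFT stands in for a real FFT, and the
function names are made up.

```python
import cmath
from collections import Counter

def dft(x):
    """Unnormalized discrete Fourier transform (slow reference version)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def parseval_ok(x, X, rel_tol=1e-9):
    """Check sum |x|^2 == (1/N) * sum |X|^2 within a relative tolerance."""
    e_time = sum(abs(v) ** 2 for v in x)
    e_freq = sum(abs(v) ** 2 for v in X) / len(x)
    return abs(e_time - e_freq) <= rel_tol * max(e_time, 1.0)

def vote(a, b, c):
    """Brute-force alternative: majority of three hashable results."""
    winner, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no two results agree")
    return winner

x = [1.0, 2.0, 3.0, 4.0]
X = dft(x)
good = parseval_ok(x, X)

X_bad = list(X)
X_bad[1] += 0.5                           # simulated arithmetic error
bad = parseval_ok(x, X_bad)

print(good, bad, vote(7, 7, 9))
```

The energy check costs O(N) on top of the transform, whereas voting
triples the work and still depends on the voter itself being correct.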

Yes, it will require clever algorithm design (comparable in cleverness
to the design of the original Hamming codes, but more complex),
particularly to find a way to do it generically rather than in a
problem-specific way.  But when that is figured out, then we'll really be
able to make progress, because transient (or permanent) failures won't
slow down the computation.

Checkpointing is a fairly crude approach to fault tolerance, after all.

On 9/21/12 8:15 AM, "Justin YUAN SHI" <shi at temple.edu> wrote:

>It looks fairly accurate.
>
>This is because reconciling distributed checkpoints is theoretically
>difficult. Therefore, frequent checkpointing is cost-prohibitive for
>exascale apps.
>
>Justin
>
>On Fri, Sep 21, 2012 at 10:49 AM, Hearns, John <john.hearns at mclaren.com>
>wrote:
>> http://www.theregister.co.uk/2012/09/21/emc_abba/
>>
>>
>>
>> Frequent checkpointing will of course be vital for exascale, given the
>> MTBF of individual nodes.