[Beowulf] Checkpointing using flash
shi at temple.edu
Mon Sep 24 05:42:11 PDT 2012
I think you are not too far off. If the global "Gluster-like" mechanism can provide a theoretical upper bound on the protection of its stored information, and can scale as we grow the machine size, it would look like a reasonable exascale machine.
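To put a rough number on that "upper bounded protection": with k data blocks plus m parity blocks per stripe (hypothetical parameters, not something Gluster itself promises), the store survives any m concurrent node losses, and the residual loss probability can be estimated with a toy calculation assuming independent node failures:

/* Toy estimate of data-loss probability for a (k data + m parity) stripe.
 * Assumes independent node failures with probability p during one recovery
 * window -- an illustration of the "bounded protection" idea, not a model
 * of any real Gluster/GFS deployment. */
#include <math.h>
#include <stdio.h>

static double binom(int n, int i)            /* n choose i */
{
    double r = 1.0;
    for (int j = 1; j <= i; j++)
        r *= (double)(n - i + j) / j;
    return r;
}

int main(void)
{
    int k = 8, m = 3;                        /* hypothetical stripe geometry */
    double p = 1e-3;                         /* per-node failure probability */
    int n = k + m;
    double loss = 0.0;

    /* data is lost only when more than m of the n nodes fail at once */
    for (int i = m + 1; i <= n; i++)
        loss += binom(n, i) * pow(p, i) * pow(1.0 - p, n - i);

    printf("P(stripe loss) ~ %.3e\n", loss);
    return 0;
}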
On Sep 22, 2012, at 7:02 AM, Andrew Holway <andrew.holway at gmail.com> wrote:
>> To be exact, the OSI layers 1-4 can defend packet data against loss and
>> corruption caused by transient hardware and network failures. Layers
>> 5-7 provide no protection. MPI sits on top of layer 7, and it assumes
>> that every transmission must be successful (this is why we have to use
>> checkpointing in the first place) -- a reliability assumption that the
>> OSI model has never promised.
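To make that concrete, here is a minimal sketch (a hypothetical two-rank exchange, not anyone's production code). Under MPI's default error handler, MPI_ERRORS_ARE_FATAL, a transfer that fails at layers 5-7 simply aborts the whole job; the application never sees a recoverable error:

/* Minimal sketch: MPI assumes every transmission succeeds.  Under the
 * default error handler (MPI_ERRORS_ARE_FATAL) any fault surfacing here
 * kills all ranks -- there is no retry or recovery path at this level. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    /* If the peer node dies mid-transfer, the runtime aborts the whole job. */

    MPI_Finalize();
    return 0;
}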
> I've been playing around with GFS and Gluster a bit recently and this
> has got me thinking... Given a fast enough, low enough latency network,
> might it be possible to have a Gluster-like or GFS-like memory space?
> GFS-like would involve an external box of memory with connections to,
> let's say for the sake of argument, 5 + 1 nodes. In the event of failure
> the "hot spare" could take over processing of the failed node. Perhaps
> we can stop thinking about nodes and instead think about clusters of
> processors and memory within clusters?
> "Glusterlike" would work quite like Gluster but for memory. I can kind
> of see it in my head but am having problems describing it :)
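One way to picture the "Gluster-like" memory space (a hypothetical sketch of my own, nothing to do with how Gluster is actually implemented): every page an application writes is mirrored to a couple of peer nodes, so a hot spare can rebuild a failed node's working set from the surviving replicas. The "peers" below are just local arrays standing in for RDMA targets:

/* Hypothetical sketch of a "Gluster-like" memory space: each page write is
 * mirrored to REPLICAS peer nodes so a hot spare can rebuild a failed
 * node's pages.  Local arrays stand in for remote memory pools. */
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096
#define N_PEERS   6                 /* e.g. the 5 + 1 arrangement above */
#define N_PAGES   16
#define REPLICAS  2                 /* hypothetical replication factor  */

static char peer_mem[N_PEERS][N_PAGES][PAGE_SIZE];

/* mirror one page write to REPLICAS peers before acknowledging it */
static void put_page(unsigned long page_no, const char *buf)
{
    for (int r = 0; r < REPLICAS; r++) {
        int peer = (int)((page_no + r) % N_PEERS);
        memcpy(peer_mem[peer][page_no % N_PAGES], buf, PAGE_SIZE);
    }
}

/* a hot spare rebuilds the page from any surviving replica */
static int rebuild_page(unsigned long page_no, int failed_peer, char *out)
{
    for (int r = 0; r < REPLICAS; r++) {
        int peer = (int)((page_no + r) % N_PEERS);
        if (peer != failed_peer) {
            memcpy(out, peer_mem[peer][page_no % N_PAGES], PAGE_SIZE);
            return 0;
        }
    }
    return -1;                      /* every replica lost */
}

int main(void)
{
    char page[PAGE_SIZE] = "application state";
    char copy[PAGE_SIZE];

    put_page(7, page);
    if (rebuild_page(7, 1 /* failed node */, copy) == 0)
        printf("recovered: %s\n", copy);
    return 0;
}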
> As far as I can understand (which is very limited) this whole Exascale
> thing is not going to happen with traditional Beowulf (just a bunch of
> nodes, and so on). Machines are going to have to become far more tightly
> coupled with clever hardware tricks to protect against failed memory
> modules etc.
> I am almost sure this is why Intel recently bought chunks of Qlogic
> and Cray tech. Obviously you're quite limited in what you can reasonably
> do with copper over distance but perhaps optical interconnects could
> provide some kind of answer...
>> In other words, any transient fault while executing the code in
>> layers 5-7 (including MPI calls) can halt the entire app.
>> On Fri, Sep 21, 2012 at 12:29 PM, Ellis H. Wilson III <ellis at cse.psu.edu> wrote:
>>> On 09/21/12 12:13, Lux, Jim (337C) wrote:
>>>> I would suggest that some scheme of redundant computation might be more
>>>> effective. Trying to store a single node's state, and then, if any node
>>>> hiccups, restoring that state (perhaps to a spare) and restarting, means
>>>> stopping the entire cluster while you recover.
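A minimal sketch of the redundant-computation idea as I read it (made-up names, not Jim's actual scheme): run each work unit on two or three nodes and accept the majority answer, so a single transient fault costs only the duplicated work rather than a cluster-wide rollback:

/* Sketch of triple-redundant computation with majority voting -- one
 * reading of the idea above, not anyone's actual scheme.  do_work()
 * stands in for the real per-node kernel. */
#include <stdio.h>

static long do_work(long input, int replica)
{
    (void)replica;               /* each replica would run on its own node */
    return input * input;        /* placeholder for the real computation   */
}

static long vote3(long a, long b, long c)
{
    if (a == b || a == c) return a;      /* majority wins */
    if (b == c)           return b;
    return a;     /* no majority: a real system would recompute or flag it */
}

int main(void)
{
    long r0 = do_work(12, 0);
    long r1 = do_work(12, 1);
    long r2 = do_work(12, 2);

    printf("accepted result: %ld\n", vote3(r0, r1, r2));
    return 0;
}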
>>> I am not 100% sure about the nitty-gritty here, but I do believe there are
>>> schemes already in place to deal with single node failures. What I do
>>> know for sure is that checkpoints are used as a last line of defense
>>> against full cluster failure due to overheating, power failure, or
>>> excessive numbers of concurrent failures -- not for just one node going
>>> belly up.
>>> The LANL clusters I was learning about only checkpointed every 4-6 hours
>>> or so, if I remember correctly. With hundred-petaflop-scale clusters and
>>> beyond hitting failure rates measured not in hours but in minutes,
>>> checkpointing is obviously not the go-to first attempt at
>>> failure recovery.
>>> If I find some of the nitty-gritty I'm currently forgetting about how
>>> smaller, isolated failures are handled now I'll report back.
>>> Nevertheless, great ideas Jim!
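For a sense of the arithmetic behind those intervals (Young's classic approximation, with made-up numbers): the near-optimal checkpoint interval is roughly sqrt(2 * C * MTBF), where C is the cost of writing one checkpoint, so as the system MTBF drops from hours toward minutes the interval shrinks until the machine spends a large fraction of its time just writing checkpoints:

/* Young's approximation for the near-optimal checkpoint interval:
 *     T_opt ~= sqrt(2 * C * MTBF)
 * where C is the time to write one checkpoint.  The numbers are made up,
 * just to show how a shrinking MTBF eats into useful compute time. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double C = 300.0;                              /* checkpoint cost: 5 min */
    double mtbf[] = { 6 * 3600.0, 3600.0, 600.0 }; /* 6 h, 1 h, 10 min       */

    for (int i = 0; i < 3; i++) {
        double t_opt    = sqrt(2.0 * C * mtbf[i]);
        double overhead = C / (t_opt + C);         /* rough fraction lost    */
        printf("MTBF %6.0f s -> checkpoint every %6.0f s, ~%4.1f%% overhead\n",
               mtbf[i], t_opt, 100.0 * overhead);
    }
    return 0;
}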