[Beowulf] Checkpointing using flash

Robert G. Brown rgb at phy.duke.edu
Fri Sep 21 11:14:09 PDT 2012


On Fri, 21 Sep 2012, Lux, Jim (337C) wrote:

>
>
> On 9/21/12 9:21 AM, "Hearns, John" <john.hearns at mclaren.com> wrote:
>
>>
>> Or, if you can factor your computation to make use of extra processing
>> nodes, you can just keep on moving.  Think of this as a higher level
>> scheme than, say, Hamming codes for memory protection:  use 11 bits to
>> store 8, and you're still synchronous.
>>
>> Jim, you are smarter than me!
>> I was going to air the idea of pairs of nodes in lock-step, with either
>> node being able to STONITH the other if
>> either there is a machine check event, or the other node does not keep up
>> with reporting results.
>> Then signal to the cluster management that "There's been a failure here -
>> but let's keep trucking to the end of the run,
>> when you can come along and replace my buddy and me"
>>
>> The obvious drawback being you get half an exaflop for your money!
>>
>
> I was assuming that you'd figure out a Hamming-esque way to get 8/11ths of
> an exaflop for an exaflop's worth of horsepower.

Hm, yeah, probably not happening... as the intermediate step of
computing the encoding is likely to be a more difficult problem by far
than what the cluster is actually working on...;-)
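
For anyone who wants the memory-protection analogy spelled out, here is a
minimal sketch of a textbook single-error-correcting Hamming code. It is an
illustration only, not anything proposed in the thread, and the function names
are made up for the example. One caveat on the arithmetic: with this
construction 8 data bits need 4 check bits, so the codeword is 12 bits rather
than 11, and the useful fraction is 8/12.

```python
def hamming_encode(data_bits):
    """Encode a list of 0/1 data bits as a single-error-correcting codeword.

    Positions are 1-indexed; check bits sit at the powers of two.
    """
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:      # smallest r with 2**r >= m + r + 1
        r += 1
    n = m + r
    code = [0] * (n + 1)             # index 0 is unused
    bits = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):          # not a power of two: data position
            code[pos] = next(bits)
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p:              # check bit p covers positions with bit p set
                parity ^= code[pos]
        code[p] = parity
    return code[1:]


def hamming_correct(codeword):
    """Return (data_bits, syndrome); a nonzero syndrome is the flipped position."""
    code = [0] + list(codeword)
    syndrome = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syndrome ^= pos          # XOR of set positions is 0 for a valid word
    if 0 < syndrome < len(code):
        code[syndrome] ^= 1          # repair the single flipped bit
    data = [code[pos] for pos in range(1, len(code)) if pos & (pos - 1)]
    return data, syndrome


data = [1, 0, 1, 1, 0, 0, 1, 0]
word = hamming_encode(data)
damaged = word[:]
damaged[6] ^= 1                      # flip the bit at position 7
recovered, where = hamming_correct(damaged)
print(len(word), recovered == data, where)
```

The encoding here is a handful of XORs per word, which is cheap for memory.
The open question in the thread is whether anything comparably cheap exists
for a whole computation spread across nodes.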

    rgb

>
> It might actually be an ok trade without the future "Hearns Code",
> though.. Can you get computers with double the failure rate for less than
> half the cost (all in, capex and opex)?  Given that we are inevitably
> moving this way, maybe "design for perfect" isn't an appropriate paradigm.
> In the space biz, this is a HUGE issue.. For all we spend trying to make
> perfect, we don't, so is it time to bite the bullet and "design for
> failure"... I think it is, but, there are those with beards grayer than
> mine (and mine has a fair amount of gray in it) who don't.

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


