<div class="gmail_extra">And this is the bit that concerns me the most. At scale you should only be making two assumptions: (1) everything breaks all the time, and (2) you will have network partitions. Checkpoint/restart is a lazy option that has no place in modern software. Yet there doesn't seem to be any priority on going beyond checkpoint/restart and rethinking software architecture. I would argue that's at least as important as figuring out manycore.</div>


<div class="gmail_extra"><br></div><div class="gmail_extra"><div class="gmail_quote">On Fri, Nov 23, 2012 at 6:44 AM, Lux, Jim (337C) <span dir="ltr">&lt;<a href="mailto:james.p.lux@jpl.nasa.gov" target="_blank">james.p.lux@jpl.nasa.gov</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":3wu">a lot of HPC software design<br>

assumes perfect hardware, or, that the hardware failure rate is<br>

sufficiently low that a checkpoint/restart (or "do it all over from the<br>

beginning") is an acceptable strategy.</div></blockquote></div><br></div>