[Beowulf] Supercomputers face growing resilience problems

Fri Nov 23 19:41:17 PST 2012

It's a "hard problem". The natural tendency is to solve the easy problems first, and to take on the hard problems only when backed into a corner.  Or someone comes out of the background with a really novel approach.  I'm sure folks thought about error-correcting codes in an empirical way (e.g. parity bits), but Hamming put it all together in a nice, consistent theoretical framework.  Or Shannon, for that matter.
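
To make the parity-versus-Hamming contrast concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the function names are made up for this example, and the bit layout (p1 p2 d1 p3 d2 d3 d4) is the usual textbook Hamming(7,4) arrangement, not anything specific to this thread. A single parity bit can tell you that one bit flipped, but not which one; the Hamming syndrome also gives you the position, so the error can be corrected.

```python
def parity_bit(bits):
    """Even-parity bit: chosen so the total number of 1s is even."""
    return sum(bits) % 2


def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit codeword laid out p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c):
    """Correct at most one flipped bit.

    Returns (data_bits, error_position), where error_position is the
    1-based position of the corrected bit, or 0 if no error was seen.
    """
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3  # the syndrome reads out the bad position
    if pos:
        c[pos - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], pos


data = [1, 0, 1, 1]

# Parity: a single flip is detected, but nothing says where it happened.
word = data + [parity_bit(data)]
word[2] ^= 1
parity_error_detected = sum(word) % 2 == 1

# Hamming(7,4): the same kind of single flip is located and repaired.
code = hamming74_encode(data)
code[4] ^= 1  # flip position 5
recovered, error_pos = hamming74_decode(code)
```

Note the limit: this code corrects one flipped bit per 7-bit word. Two flips in the same word produce a nonzero syndrome that points at the wrong bit, which is why practical memory ECC adds an overall parity bit on top (SECDED).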

From: Deepak Singh <mndoci at gmail.com>
Date: Friday, November 23, 2012 11:45 AM
To: Jim Lux <james.p.lux at jpl.nasa.gov>
Cc: Luc Vereecken <kineticluc at gmail.com>, "beowulf at beowulf.org" <beowulf at beowulf.org>, "shi at temple.edu" <shi at temple.edu>
Subject: Re: [Beowulf] Supercomputers face growing resilience problems

And this is the bit that concerns me the most.  At scale you should be making only two assumptions: (1) everything breaks all the time, and (2) you will have network partitions.  Checkpoint/restart is a lazy option that has no place in modern software. Yet there doesn't seem to be any priority on going beyond checkpoint/restart and rethinking software architecture. I would argue that's as important as figuring out manycore, if not more so.

On Fri, Nov 23, 2012 at 6:44 AM, Lux, Jim (337C) <james.p.lux at jpl.nasa.gov> wrote:
a lot of HPC software design
assumes perfect hardware, or, that the hardware failure rate is
sufficiently low that a checkpoint/restart (or "do it all over from the
beginning") is an acceptable strategy.
