[Beowulf] MPI, fault handling, etc.

Justin Y. Shi shi at temple.edu
Thu Mar 10 13:13:52 PST 2016


Not so fast, though. 100% reliability is practically achievable, and we
are enjoying the results every day: I mean the wireless and wired
packet-switching networks.

The problem is our tendency to draw hasty conclusions. The one human-made
architecture that defies the "curse" of component failure is the
statistical multiplexing principle (or packet switching).

It has proven to work perfectly using a growing number of not-so-reliable
devices without suffering the scalability dilemma.  We should learn how to
apply that technology to extreme-scale computing. To this day, the full
extent of the protocol logic still cannot be adequately described formally
on paper. But it works well, if done right.
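
As a rough illustration (a generic stop-and-wait acknowledge-and-retransmit
loop in Python; my own sketch, not any particular protocol): every individual
transmission may fail, yet the end-to-end transfer still completes, paying
only in extra attempts.

    import random

    LOSS_PROBABILITY = 0.3   # assume 30% of individual transmissions are lost
    MAX_TRIES = 50           # give up and declare the link down after this many

    def unreliable_send(packet):
        """Model one transmission over a lossy link; True means an ACK came back."""
        return random.random() > LOSS_PROBABILITY

    def reliable_send(packet):
        """Stop-and-wait ARQ: retransmit the same packet until it is ACKed."""
        for attempt in range(1, MAX_TRIES + 1):
            if unreliable_send(packet):
                return attempt                  # number of tries it took
        raise RuntimeError("link presumed dead after %d tries" % MAX_TRIES)

    if __name__ == "__main__":
        random.seed(1)
        tries = [reliable_send(p) for p in range(10000)]
        print("delivered %d/10000 packets, %.2f transmissions per packet"
              % (len(tries), sum(tries) / len(tries)))

The point is not the code but the design choice: reliability is recovered
statistically, end to end, rather than demanded from every component.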


Justin

On Thu, Mar 10, 2016 at 3:44 PM, Douglas Eadline <deadline at eadline.org>
wrote:

>
> > I will support C's "hater" listing effort just to keep a spotlight on the
> > important subject.
> >
> > The question is not whether MPI is efficient. Fundamentally, all
> > electronics will fail in unexpected ways. Bare-metal computing was
> > important decades ago but is detrimental to large-scale computing. It is
> > simply flawed for extreme-scale computing.
> >
> > The impossibility proof by Alan Fekete, Nancy Lynch, and John Spinelli is
> > the fundamental "line in the sand" that cannot be crossed.
> >
> > The corollary of that proof is that it is impossible to detect failure
> > reliably either. Therefore, efforts for runtime
> > detection/repair/reschedule are also flawed for extreme-scale computing.
> >
>
> Well, on that note, I suppose we should just call it a day.
> Although some thought Gödel would put the whole math thing
> out of business as well.
>
> --
> Doug
>
>
>
>
> > Justin
> >
> > On Thu, Mar 10, 2016 at 8:44 AM, Lux, Jim (337C)
> > <james.p.lux at jpl.nasa.gov>
> > wrote:
> >
> >> This is interesting stuff.
> >> Think back a few years when we were talking about checkpoint/restart
> >> issues: as the scale of your problem gets bigger, the time to checkpoint
> >> becomes bigger than the time actually doing useful work.
> >> And, of course, the reason we do checkpoint/restart is because it’s
> >> bare-metal and easy.  Just like simple message passing is “close to the
> >> metal” and “straightforward”.
> >>
> >> Similarly, there’s “fine grained” error detection and correction: ECC
> >> codes in memory; redundant comm links or retries.  Each of them imposes
> >> some speed/performance penalty (it takes some non-zero time to compute
> >> the syndrome bits in an ECC, and some non-zero time to fix the errored
> >> bits… in a lot of systems these days, that might be buried in a
> >> pipeline, but the delay is there, and affects performance).
> >>
> >> I think of ECC as a sort of diffuse fault management: it’s pervasive,
> >> uniform, and the performance penalty is applied evenly through the
> >> system.
> >> Redundant (in the TMR sense) links are the same way.
> >>
> >> Retries are a bit different.  The “detecting” of a fault is diffuse and
> >> pervasive (e.g. CRC checks occur on each message), but the correction of
> >> the fault is discrete and consumes resources at that time.  In a system
> >> with tight time coupling (a pipelined systolic array would be the sort of
> >> worst case), many nodes have to wait to fix the one that failed.
> >>
> >> A lot depends on the application: tighter time coupling is worse than
> >> embarrassingly parallel (which is what a lot of the “big data” stuff is:
> >> fundamentally EP, scatter the requests, run in parallel, gather the
> >> results).
> >>
> >> The challenge is doing stuff in between:  You may have a flock with
> >> excess capacity (just as ECC memory might have 1.5N physical storage
> >> bits to be used to store N bits), but how do you automatically
> >> distribute the resources to be failure-tolerant?  The original post in
> >> the thread points out that MPI is not a particularly facile tool for
> >> doing this.  But I’m not sure that there is a tool, and I’m not sure
> >> that MPI is the root of the lack of tools.  I think it’s that moving
> >> away from close to the metal is a “hard problem” to do in a generic
> >> way.  (The issues about 32-bit counts are valid, though.)
> >>
> >>
> >> James Lux, P.E.
> >>
> >> Task Manager, DHFR Space Testbed
> >>
> >> Jet Propulsion Laboratory
> >>
> >> 4800 Oak Grove Drive, MS 161-213
> >>
> >> Pasadena CA 91109
> >>
> >> +1(818)354-2075
> >>
> >> +1(818)395-2714 (cell)
> >>
> >>
> >>
>