[Beowulf] Redundant Array of Independent Memory - fork(Re: Checkpointing using flash)

Andrew Holway andrew.holway at gmail.com
Tue Sep 25 03:19:47 PDT 2012


2012/9/24 Justin YUAN SHI <shi at temple.edu>:
> I think the Redundant Memory paper was really mis-configured. It uses
> a storage solution, trying to solve a volatile memory problem but
> insisting on eliminating volatility. It looks very much messed up.

http://thebrainhouse.ch/gse/silvio/74.GSE/Silvio's%20Corner%20Doc%20Jukebox/System%20z%20Redundant%20Array%20of%20Independent%20Memory.pdf

Maybe this paper is better. It explains how RAIM is implemented in the
newish IBM System z.

> My early comment on the OSI model still stands, even though MPI
> implementation is far down the stack that may not fit the OSI model
> well. The MPI implementation, even at the transport layer does NOT
> re-transmit messages.

I don't think you can even begin to apply tech like InfiniBand or
Fibre Channel to the OSI model. TCP does not really fit the OSI model
either. OSI was part of a standards framework developed by some weird
ISO subgroup back in the mid 80s for an application stack that was
never widely used. People have since kinda munged together OSI, TCP
and other application stuff into a horrific mess that should be
consigned to a history book.

Ever heard of FTAM, X.400 or CMIP?

> When a machine hangs running the MPI protocol stack, the entire app
> hangs. This is the root cause for all our fault tolerance problems.

I'm pretty sure faulty hardware is the root cause of our fault
tolerance problems :). In any case, the main issue seems to be the loss
of a chunk of your application memory when a node fails, not so much
the retransmission of messages. MPI has some functionality built in to
address fault tolerance anyway.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.7837&rep=rep1&type=pdf


