[Beowulf] Checkpointing using flash

Justin YUAN SHI shi at temple.edu
Mon Oct 1 15:59:24 PDT 2012

On Mon, Oct 1, 2012 at 2:22 PM, Mark Hahn <hahn at mcmaster.ca> wrote:
>> My idea is to use data parallel API. This is nothing new. In theory,
> right, it's not new.  so why would it succeed this time around?

This is because the application architecture is transformed from
static to statistically multiplexed, for both computing and
communication failures.

>> can still be elegant looking. For example, you can have multiple
>> Infiniband interfaces (some machines already have) to help counter the
>> speed disparity between computing and communication.
> you lost me there.  MPI has no problem using multiple interfaces...

That only helps with communication failures and bandwidth. We need to
hedge against computing failures and power as well.

