[Beowulf] Alternative to MPI ABI

Donald Becker becker at scyld.com
Tue Mar 22 09:10:11 PST 2005


On Tue, 22 Mar 2005, Jeff Squyres wrote:

> Instead, I would like to propose an alternative to an MPI ABI.

This is related to one of my pet "thought projects" of several years.
We have largely finished developing an alternative cluster programming
model that is better suited to most applications.

The current MPI design is the wrong model for clusters.
   It represents a static view of the cooperating machines.
     The MPI "cluster size", MPI_Comm_size(MPI_COMM_WORLD, ...), is
     static for all time.  There is no way to take advantage of new
     machines or to reduce the number of machines that the application
     depends on.
   MPI has a model of initialize-compute-terminate (see the sketch
     after this list).
     There is no explicit support for checkpointing, executing as a
     service, or running "forever".
   MPI has no concept of failures.
     There is no mechanism to report that a rank has failed, and no way
     to recover from that failure.  MPI applications may reissue
     incomplete work to other ranks and finish the actual computation,
     but they cannot terminate cleanly with missing nodes.
   MPI's strength is collective, mathematically oriented operations,
     not communication.  I understand that even the name "Message
     Passing..." indicates that stream communication isn't the focus,
     but many applications expect and work well with a sockets-based
     model.
   Cross-architecture jobs are theoretically supported, but very
     difficult to implement.  The capability adds complexity without
     benefit.
   Communicators besides MPI_COMM_WORLD are rarely used.  The capability
     adds complexity with little benefit.
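
To make the static model concrete, here is a minimal sketch of the
classic initialize-compute-terminate MPI skeleton.  The world size is
fixed once MPI_Init() returns; nothing below (or anywhere in MPI-1)
lets the set of ranks grow, shrink, or survive a peer's failure.

/* Minimal sketch: the canonical static MPI program. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* the static world starts here */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* constant for the whole run */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Compute phase: if any of the 'size' ranks dies, this collective
     * simply hangs or aborts; there is no failure report to act on. */
    double local = rank, sum = 0.0;
    MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d static ranks: %g\n", size, sum);

    MPI_Finalize();                       /* terminate: no running "forever" */
    return 0;
}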


Some of the implications of using a dynamic model are:
  - We need information and scheduling interfaces
  - We need true dynamic sizing, with process creation and termination
    primitives
  - We need status signaling external to application messages.

  There need to be new information interfaces which
   - report usable nodes (which nodes are up, ready, and will permit us
         to start processes)
   - report the capability of those nodes (speed, total memory)
   - report the availability of those nodes (current load, available
         memory)
   Each of these information types is different and may be provided
   by a different library and subsystem.  We created 'beostat', a status
   and statistics library, to provide most of this information.
  There needs to be an explicit scheduler or mapper interface.
    We use 'beomap', which can utilize an external scheduler or
    internally create a list of usable compute nodes.
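
  As a rough illustration (these names are hypothetical, not the
  actual beostat/beomap API), the three information types and the
  mapper hook might look like this in C:

/* Hypothetical interface sketch -- illustrative names only. */
struct node_caps { long cpu_mhz; long mem_total_kb; };    /* capability   */
struct node_load { double load_avg; long mem_avail_kb; }; /* availability */

int cluster_up_nodes(int *nodes, int max);  /* usable nodes; returns count */
int cluster_node_caps(int node, struct node_caps *c);
int cluster_node_load(int node, struct node_load *l);

/* Mapper: fill 'nodes' with 'nprocs' targets, either by consulting an
 * external scheduler or by ranking the up-node list by current load. */
int cluster_map(int nprocs, int *nodes);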

  An application should be able to run single-threaded if it decides
     that multiple processes are not useful.
  An application should be able to use only a subset of the provided
    processors if they will not be useful (e.g. an application that uses
    a regular grid might choose to use only 16 of 23 provided nodes).
    The unused nodes should be truly unused: if they crash or are
    otherwise unexpectedly removed, they shouldn't affect correct,
    error-free completion.  Ideally, processes should never be started on
    those nodes.
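
  For the regular-grid example the selection logic is simple; a sketch
  using the hypothetical cluster_map() above:

/* Use only the largest square subset of the provided nodes; the
 * remaining nodes are never asked to start a process. */
#include <math.h>

int pick_grid_nodes(int provided, int *nodes)
{
    int side = (int)sqrt((double)provided);
    return cluster_map(side * side, nodes);  /* 23 provided -> 4*4 = 16 used */
}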

  There need to be new process creation primitives.
    We already have a well-tested model for this: Unix process
    management.  The only additional element is the ability to specify
    remote nodes when creating processes.  Monitoring and handling
    terminating processes does not need extensions.  We use BProc, but
    the concept of remote_fork() existed long before BProc.  A standard
    set of calls should not have the same names, but should use exactly
    the Unix semantics so that the library only needs to "wrap" the
    actual system calls.
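
  A sketch of what such a wrapped primitive looks like ('remote_fork'
  is an illustrative name, not a real call; BProc's primitive is
  similar in spirit).  It has exactly fork()'s semantics, so existing
  monitoring code works unchanged:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

pid_t remote_fork(int node);   /* hypothetical: fork, but onto 'node' */
extern int do_work(void);      /* the application's compute routine */

void run_on(int node)
{
    pid_t pid = remote_fork(node);
    if (pid == 0) {
        /* child, now running on the remote node */
        _exit(do_work());
    } else if (pid > 0) {
        /* parent: ordinary Unix process monitoring, no extensions */
        int status;
        waitpid(pid, &status, 0);
    }
}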

  There need to be asynchronous signaling methods.
    We already have these as well: Unix signals.  Their (still modest)
    complexity represents many years of experience and demonstrates that
    getting the semantics of asynchronous event handling right is
    difficult.
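
  A sketch of the existing machinery at work: a SIGCHLD handler already
  delivers the asynchronous "process died" event for local and remote
  children alike, and the recovery policy can hang off it.

#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

static void child_exited(int signo)
{
    pid_t pid;
    int status;
    (void)signo;
    /* Reap every terminated child; record which ones were lost so the
     * application can reissue their work. */
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0)
        /* note pid/status for the recovery logic */ ;
}

int install_failure_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = child_exited;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}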


> 2. Cancel the MPI Implementor's Ultimate Prize Fighting Cage Match on 
> pay-per-view (read: no need for time-consuming, potentially fruitless 
> attempts to get MPI implementors to agree on anything)

Hmmmm, does this show up on the sports page?

Donald Becker
Scyld Software
Annapolis MD 21403			410-990-9993




