[Beowulf] Alternative to MPI ABI

Tue Mar 22 11:51:12 PST 2005

Hmmm, looks like the list is about to have Doug's much desired
discussion of Community Goals online.  I'll definitely play.

The following is a standard rgb thingie, so humans with actual work to
do might not want to read it all right now... (Jeff L., I do have a
special message coming to the list "just for you";-)

On Tue, 22 Mar 2005, Donald Becker wrote:

> Some of the implications of using a dynamic model are
>   - We need information and scheduling interfaces
>   - We need true dynamic sizing, with process creation and termination
>     primitives
>   - We need status signaling external to application messages.
> 
>   There needs to be new information interfaces which
>    - report usable nodes (which node are up, ready and will permit us
>          to start processes)
>    - report the capability of those nodes (speed, total memory)
>    - report the availability of those nodes  (current load, available 
> memory)
>    Each of these information types is different and may be provided
>    by a different library and subsystem.  We created 'beostat', a status
>    and statistics library, to provide most of this information.
>   There needs to be an explicit scheduler or mapper interface.
>     We use 'beomap', which can utilize an external scheduler or
>     internally create a list of usable compute nodes.

Agreed.  I also have been working on daemons and associated library
tools to provide some of this information as well on the fully
GPL/freely distributable side of things for a long time.  I've just
started a new project (xmlbenchd) that should be able to provide really
detailed capabilities information about nodes via a daemon interface
from "plug in" benchmarks, both micro and macro (supplemented with data
snarfed from /proc using xmlsysd code fragments).

xmlsysd already provides CPU clock, total memory, L2 cache size, total
and available memory, and PID snapshots of running jobs.  Both daemons
use (or will use in the case of xmlbenchd) xml tags to wrap all output,
which should make creating applications that use the data pretty simple,
and there is a provided library that can manage connections to a
specified cluster.  

XMLified tags also FORCE one to organize and present the data
hierarchically and extensibly.  I've tried (unsuccessfully) to convince
Linus to lean on the maintainers of the /proc interface to achieve some
small degree of hierarchical organization and uniformity in
presentation, but it looks like it will remain the disorganized and
inconsistent mess that it is now for the indefinite future (speaking as
one who has now written quite a lot of code parsing it, and having had
to hand-write and customize that code pretty much per /proc-based file).
(And yes, I know there are performance issues, but frankly either screw
ASCII altogether and make /proc a binary interface with an accompanying
library for real efficiency -- parts of it might as well be binary now
for all the documentation or consistency -- or provide a good topdown
hierarchical and consistent organization of the data in ASCII. Or do
both but with different refresh rates (fast binary, slow ascii/xml), but
don't do ugly.  What's there now is ugly.)

There are several advantages to using daemons (compared to a kernel
module or kernel-level insertion and a library of extended systems
calls) for providing system information both locally and across a
network:

  a) The daemons access information read only on the host system and can
be safely run from userspace (an actual user or nobody/xinetd) with very
familiar and standardized tools to control access on the basis of
privacy rather than modification.

  b) They can be trivially packaged as e.g. rpm's per distribution and
don't have to be rebuilt every time you update/change the kernel, which
means that kernels can dynamically update to fix e.g. security problems,
bugs, performance issues.  Even fairly major kernel changes will
generally not break the tools (as long as /proc remains structurally
intact).

  c) It's a bit safer and easier to modify and develop objects that live
in userspace than it is things that will run as root or inside the
kernel.  A bug is always a problem in code, but bugs won't necessarily
immediately compromise root on a system if they live in an anonymous
daemon, nor will they cause a system crash leading to a reboot.

  d) One hopes to be able to extensibly separate the SOURCE of a daemon
(especially a GPL daemon) from the COMMUNICATIONS "language" the daemon
speaks, so that there can be multiple implementations and so that
applications that use the information can be written independent of the
implementation.  A "trivial" ascii-based communications/command set and
XML-encapsulated returns mean that you can talk to the daemon with
telnet, perl, c code, absolutely anything that can manage a connection,
and write applications without worrying about how the information is
being provided.  This is possible with a kernel-based tool as well, of
course, but again, the level of programming expertise required to do
safe kernel programming is a fairly solid barrier against multiple
implementations, leaving control over the interface itself in a single
set of hands (which can be both good and bad, depending on the hands:-).

The weakness in what I'm doing is that I'm a single human being and have
numerous responsibilities outside of the projects, so I haven't yet
written ALL the applications that could use this information.  Its a
sort of "if I build it, they will come" thing; even though this isn't
really horribly likely (sorry Jeff, don't wait for people to pound down
your door -- if you want a nifty piece of code written, the only way to
get there is to carry it yourself).

Still, I totally agree that this is EXACTLY the kind of information that
needs to be available via an open standard, universal, extensible,
interface.

>   An application should be able to run single-threaded if it decides
>      that multiple processes are not useful.

Sure.

>   An application should be able to use only a subset of provided
>     processors if they will not be useful (e.g. an application that uses
>     a regular grid might choose to use only 16 of 23 provided nodes.
>     The unused nodes should be truly unused: if they crash or are
>     otherwise unexpectedly removed they shouldn't affect correct,
>     error-free completion.  Ideally processes should never be started on
>     those nodes.

Absolutely.  And this needs to be done in such a way that the programmer
doesn't have to work too hard to arrange it.  I imagine that this CAN be
done with e.g. PVM or some MPIs (although I'm not sure about the latter)
but is it easy?

>   There needs to be new process creation primitives.
>     We already have a well-tested model for this: Unix process
>     management.  The only additional element is the ability to specify
>     remote nodes when creating processes.  Monitoring and handling
>     terminating processes does not need extensions.  We use BProc, but
>     the concept of remote_fork() existed long before BProc.  A standard
>     set of calls should not have the same names, but should use exactly
>     the Unix semantics so that the library only needs to "wrap" the
>     actual system calls.

Agreed, but while dealing with this one also needs to think about
security.  Grids and other distributed parallel computing paradigms are
increasingly popular as problems that map well into them proliferate,
and with open networks of potential compute resources comes a need to be
able to manage SECURE and AUTHENTICATED new process creation.  On a WAN,
this will likely require both host and user identification,
bidirectional encryption of traffic, and more -- something much closer
to ssh and/or ssl than rsh.

I also personally would much prefer that the actual primitives run
outside the kernel for the reasons noted above -- at this point in time
it should be quite possible to build a system that can run remotely
submitted user applications in a chroot jail on top of an absolutely
standard distribution and kernel with not only no particularly special
privileges, but with FEWER privileges than any real user of the system.

This may be naive or not a viable solution for all kinds of programs,
but either way I think that we need to develop a viable and efficient
and controllable security model of clustering.  Up to now too many tools
assume that a "cluster" is de facto firewalled and that this makes it ok
to be sloppy about security.  This in turn severely restricts their
portability and generality.

>   There needs to be asynchronous signaling methods
>     We already have this as well: Unix signals.  Their (still modest)
>     complexity represents many years of experience and demonstrates that
>     getting the semantics for handling asynchronous events is difficult.

Amen.  Although I'm not sure that we have a network equivalent of
async/out of band signalling.  Wish we did.  This is where life gets
complicated from the point of view of root control and security -- Unix
signals live in a very solidly defined place in rootspace with a very
well defined user interface.  It gives me a bit of a headache to think
of how one might implement something "transparently similar" across a
network, especially a WAN, securely, with or without a kernel insertion.

Another point is that Unix signals tend to be largely predefined with
only a couple provided for "user defined" purposes.  I'd think that
cluster signals would be almost all user defined, with a relatively
small set that provide similar predefined/default functionality similar
to their underlying Unix counterparts.  That is, cluster signals might
need to be more extensible.  However it would be very good to implement
at least some signals that behave "just like" their Unix counterparts
for cluster-distributed apps and to add a few others that are just as
universally defined but cluster specific (such as a signal for
"checkpointing" that -- if implemented -- causes all node tasks to stop
what they are doing and checkpoint, possible exiting immediately
afterwards).

> > 2. Cancel the MPI Implementor's Ultimate Prize Fighting Cage Match on 
> > pay-per-view (read: no need for time-consuming, potentially fruitless 
> > attempts to get MPI implementors to agree on anything)
> 
> Hmmmm, does this show up on the sports page?

It's actually very interesting -- the Head Node article in CWM seemed to
me to be a prediction that MPI was "finished" in the sense that it is
complete and all anyone ever needs.  At the same time, on this list,
very knowledgeable people seem to be suggesting that far from being
"finished" in the sense of being complete and needing any further
tinkering, it may be "finished" in the sense that it needs such a
radical rewrite and set of extensions that the new product might as well
be given a new name altogether, even if it does share call names for the
actual message passing primitives.

This also brings to mind the ways that PVM and MPI are often
differentiated -- by their control interface.  PVM has its lovely
console, where one can go in (in userspace) and feel like one is
"configuring a virtual cluster" with at least some controls an
information gathering tools (even though they fall far short of what
they SHOULD be).  PVM also does allow one to add hosts and configure a
cluster INSIDE a PVM program -- one CAN write a program that starts on
node a, which creates a virtual cluster consisting of a, b and c, spawns
tasks across the cluster, and when the subtask on node c completes,
release it from the virtual cluster while a and b carry on.  PVM also
has a crude interface to signal(), although I'm not sure it is exactly
what you are suggesting for MPI or general cluster support.

To summarize, I think that the basic argument being advanced (correct me
if I'm wrong) is that there should be a whole layer of what amount to
meta-information tools inserted underneath message passing libraries of
any flavor so that (for example) the pvm console command "conf" returns
a number that MEANS SOMETHING for "speed", and in fact so that the pvm
library (or mpi library or an add-on library used both by the pvm
console code and the user's application) can contain primitives that can
retrieve e.g. the actual speeds of the system (where the plural is
deliberate) that might be needed to manage e.g. resource allocation,
task partitioning, dynamic code optimization, and more.  I also think
that it is worth looking at the philosophical differences between PVM
and MPI more closely as to me at least they ARE very different for all
that they are both message passing libraries.

> 
> Donald Becker
> Scyld Software
> Annapolis MD 21403			410-990-9993
> 
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu