[Beowulf] Alternative to MPI ABI
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
Robert G. Brown rgb at phy.duke.eduTue Mar 22 11:51:12 PST 2005
- Previous message: [Beowulf] Alternative to MPI ABI
- Next message: [Beowulf] Alternative to MPI ABI
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hmmm, looks like the list is about to have Doug's much desired discussion of Community Goals online. I'll definitely play. The following is a standard rgb thingie, so humans with actual work to do might not want to read it all right now... (Jeff L., I do have a special message coming to the list "just for you";-) On Tue, 22 Mar 2005, Donald Becker wrote: > Some of the implications of using a dynamic model are > - We need information and scheduling interfaces > - We need true dynamic sizing, with process creation and termination > primitives > - We need status signaling external to application messages. > > There needs to be new information interfaces which > - report usable nodes (which node are up, ready and will permit us > to start processes) > - report the capability of those nodes (speed, total memory) > - report the availability of those nodes (current load, available > memory) > Each of these information types is different and may be provided > by a different library and subsystem. We created 'beostat', a status > and statistics library, to provide most of this information. > There needs to be an explicit scheduler or mapper interface. > We use 'beomap', which can utilize an external scheduler or > internally create a list of usable compute nodes. Agreed. I also have been working on daemons and associated library tools to provide some of this information as well on the fully GPL/freely distributable side of things for a long time. I've just started a new project (xmlbenchd) that should be able to provide really detailed capabilities information about nodes via a daemon interface from "plug in" benchmarks, both micro and macro (supplemented with data snarfed from /proc using xmlsysd code fragments). xmlsysd already provides CPU clock, total memory, L2 cache size, total and available memory, and PID snapshots of running jobs. Both daemons use (or will use in the case of xmlbenchd) xml tags to wrap all output, which should make creating applications that use the data pretty simple, and there is a provided library that can manage connections to a specified cluster. XMLified tags also FORCE one to organize and present the data hierarchically and extensibly. I've tried (unsuccessfully) to convince Linus to lean on the maintainers of the /proc interface to achieve some small degree of hierarchical organization and uniformity in presentation, but it looks like it will remain the disorganized and inconsistent mess that it is now for the indefinite future (speaking as one who has now written quite a lot of code parsing it, and having had to hand-write and customize that code pretty much per /proc-based file). (And yes, I know there are performance issues, but frankly either screw ASCII altogether and make /proc a binary interface with an accompanying library for real efficiency -- parts of it might as well be binary now for all the documentation or consistency -- or provide a good topdown hierarchical and consistent organization of the data in ASCII. Or do both but with different refresh rates (fast binary, slow ascii/xml), but don't do ugly. What's there now is ugly.) There are several advantages to using daemons (compared to a kernel module or kernel-level insertion and a library of extended systems calls) for providing system information both locally and across a network: a) The daemons access information read only on the host system and can be safely run from userspace (an actual user or nobody/xinetd) with very familiar and standardized tools to control access on the basis of privacy rather than modification. b) They can be trivially packaged as e.g. rpm's per distribution and don't have to be rebuilt every time you update/change the kernel, which means that kernels can dynamically update to fix e.g. security problems, bugs, performance issues. Even fairly major kernel changes will generally not break the tools (as long as /proc remains structurally intact). c) It's a bit safer and easier to modify and develop objects that live in userspace than it is things that will run as root or inside the kernel. A bug is always a problem in code, but bugs won't necessarily immediately compromise root on a system if they live in an anonymous daemon, nor will they cause a system crash leading to a reboot. d) One hopes to be able to extensibly separate the SOURCE of a daemon (especially a GPL daemon) from the COMMUNICATIONS "language" the daemon speaks, so that there can be multiple implementations and so that applications that use the information can be written independent of the implementation. A "trivial" ascii-based communications/command set and XML-encapsulated returns mean that you can talk to the daemon with telnet, perl, c code, absolutely anything that can manage a connection, and write applications without worrying about how the information is being provided. This is possible with a kernel-based tool as well, of course, but again, the level of programming expertise required to do safe kernel programming is a fairly solid barrier against multiple implementations, leaving control over the interface itself in a single set of hands (which can be both good and bad, depending on the hands:-). The weakness in what I'm doing is that I'm a single human being and have numerous responsibilities outside of the projects, so I haven't yet written ALL the applications that could use this information. Its a sort of "if I build it, they will come" thing; even though this isn't really horribly likely (sorry Jeff, don't wait for people to pound down your door -- if you want a nifty piece of code written, the only way to get there is to carry it yourself). Still, I totally agree that this is EXACTLY the kind of information that needs to be available via an open standard, universal, extensible, interface. > An application should be able to run single-threaded if it decides > that multiple processes are not useful. Sure. > An application should be able to use only a subset of provided > processors if they will not be useful (e.g. an application that uses > a regular grid might choose to use only 16 of 23 provided nodes. > The unused nodes should be truly unused: if they crash or are > otherwise unexpectedly removed they shouldn't affect correct, > error-free completion. Ideally processes should never be started on > those nodes. Absolutely. And this needs to be done in such a way that the programmer doesn't have to work too hard to arrange it. I imagine that this CAN be done with e.g. PVM or some MPIs (although I'm not sure about the latter) but is it easy? > There needs to be new process creation primitives. > We already have a well-tested model for this: Unix process > management. The only additional element is the ability to specify > remote nodes when creating processes. Monitoring and handling > terminating processes does not need extensions. We use BProc, but > the concept of remote_fork() existed long before BProc. A standard > set of calls should not have the same names, but should use exactly > the Unix semantics so that the library only needs to "wrap" the > actual system calls. Agreed, but while dealing with this one also needs to think about security. Grids and other distributed parallel computing paradigms are increasingly popular as problems that map well into them proliferate, and with open networks of potential compute resources comes a need to be able to manage SECURE and AUTHENTICATED new process creation. On a WAN, this will likely require both host and user identification, bidirectional encryption of traffic, and more -- something much closer to ssh and/or ssl than rsh. I also personally would much prefer that the actual primitives run outside the kernel for the reasons noted above -- at this point in time it should be quite possible to build a system that can run remotely submitted user applications in a chroot jail on top of an absolutely standard distribution and kernel with not only no particularly special privileges, but with FEWER privileges than any real user of the system. This may be naive or not a viable solution for all kinds of programs, but either way I think that we need to develop a viable and efficient and controllable security model of clustering. Up to now too many tools assume that a "cluster" is de facto firewalled and that this makes it ok to be sloppy about security. This in turn severely restricts their portability and generality. > There needs to be asynchronous signaling methods > We already have this as well: Unix signals. Their (still modest) > complexity represents many years of experience and demonstrates that > getting the semantics for handling asynchronous events is difficult. Amen. Although I'm not sure that we have a network equivalent of async/out of band signalling. Wish we did. This is where life gets complicated from the point of view of root control and security -- Unix signals live in a very solidly defined place in rootspace with a very well defined user interface. It gives me a bit of a headache to think of how one might implement something "transparently similar" across a network, especially a WAN, securely, with or without a kernel insertion. Another point is that Unix signals tend to be largely predefined with only a couple provided for "user defined" purposes. I'd think that cluster signals would be almost all user defined, with a relatively small set that provide similar predefined/default functionality similar to their underlying Unix counterparts. That is, cluster signals might need to be more extensible. However it would be very good to implement at least some signals that behave "just like" their Unix counterparts for cluster-distributed apps and to add a few others that are just as universally defined but cluster specific (such as a signal for "checkpointing" that -- if implemented -- causes all node tasks to stop what they are doing and checkpoint, possible exiting immediately afterwards). > > 2. Cancel the MPI Implementor's Ultimate Prize Fighting Cage Match on > > pay-per-view (read: no need for time-consuming, potentially fruitless > > attempts to get MPI implementors to agree on anything) > > Hmmmm, does this show up on the sports page? It's actually very interesting -- the Head Node article in CWM seemed to me to be a prediction that MPI was "finished" in the sense that it is complete and all anyone ever needs. At the same time, on this list, very knowledgeable people seem to be suggesting that far from being "finished" in the sense of being complete and needing any further tinkering, it may be "finished" in the sense that it needs such a radical rewrite and set of extensions that the new product might as well be given a new name altogether, even if it does share call names for the actual message passing primitives. This also brings to mind the ways that PVM and MPI are often differentiated -- by their control interface. PVM has its lovely console, where one can go in (in userspace) and feel like one is "configuring a virtual cluster" with at least some controls an information gathering tools (even though they fall far short of what they SHOULD be). PVM also does allow one to add hosts and configure a cluster INSIDE a PVM program -- one CAN write a program that starts on node a, which creates a virtual cluster consisting of a, b and c, spawns tasks across the cluster, and when the subtask on node c completes, release it from the virtual cluster while a and b carry on. PVM also has a crude interface to signal(), although I'm not sure it is exactly what you are suggesting for MPI or general cluster support. To summarize, I think that the basic argument being advanced (correct me if I'm wrong) is that there should be a whole layer of what amount to meta-information tools inserted underneath message passing libraries of any flavor so that (for example) the pvm console command "conf" returns a number that MEANS SOMETHING for "speed", and in fact so that the pvm library (or mpi library or an add-on library used both by the pvm console code and the user's application) can contain primitives that can retrieve e.g. the actual speeds of the system (where the plural is deliberate) that might be needed to manage e.g. resource allocation, task partitioning, dynamic code optimization, and more. I also think that it is worth looking at the philosophical differences between PVM and MPI more closely as to me at least they ARE very different for all that they are both message passing libraries. > > Donald Becker > Scyld Software > Annapolis MD 21403 410-990-9993 > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
- Previous message: [Beowulf] Alternative to MPI ABI
- Next message: [Beowulf] Alternative to MPI ABI
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
