[Beowulf] Re: Re: Home beowulf - NIC latencies
Robert G. Brown rgb at phy.duke.edu
Thu Feb 17 07:14:48 PST 2005
- Previous message: [Beowulf] Re: Re: Home beowulf - NIC latencies
- Next message: [Beowulf] Re: Re: Home beowulf - NIC latencies
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Wed, 16 Feb 2005, Greg Lindahl wrote:

> On Wed, Feb 16, 2005 at 03:03:16PM -0500, Robert G. Brown wrote:
>
> > Optimizing any particular MPI (or PVM) command for either extreme is
> > then like robbing Peter to pay Paul, when Peter and Paul are a single
> > bicephalic individual that has to pay protection money to the mob for
> > every theft transaction (oh how I just LOVE to fold, spindle and
> > mutilate metaphors).
>
> Um, most MPI implementations have at least 3 algorithms, for short,
> long, and very long messages. So are they all breaking your rule?

No, as I noted later in the (yes, long:-) message. That's what there should be. Although the fact that they do isn't clear to the user, and the user has no control over it (that I can see in the standard). They have to trust the implementation to do the right thing.

> It's *unoptimizing* some of the cases that's at question. Most MPIs
> unoptimize compute/communication overlap with long messages, because
> it's hard work to get that right without hurting all short messages.

Again, I think that we are in agreement. All I was ultimately suggesting is this: message passing libraries contain complex higher level commands that make optimization decisions (including the decision not to optimize), and the resulting command may not be optimal for a significant number of complex cases. Such libraries might benefit from exposing the lower level primitives from which the complex commands are actually built, so users can roll their own within the library without having to resort to raw networking. You might not agree with this suggestion, but it is as you say the point in question.

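To make the short/long message point concrete, here is a minimal sketch of why a library might switch broadcast algorithms by message size. It uses the textbook latency/bandwidth cost model (alpha = per-message latency, beta = per-byte time) for two well-known broadcast schemes, a binomial tree and a scatter followed by a ring allgather. The function names and the values of ALPHA, BETA and P are my own illustrative assumptions, not measurements and not any particular MPI's internals.

```python
import math

def bcast_binomial(p, n, alpha, beta):
    """Binomial tree: ceil(log2 p) rounds, each moving the full n bytes."""
    return math.ceil(math.log2(p)) * (alpha + n * beta)

def bcast_scatter_allgather(p, n, alpha, beta):
    """Scatter n/p-byte pieces, then ring allgather them back together."""
    rounds = math.ceil(math.log2(p)) + (p - 1)
    return rounds * alpha + 2.0 * (p - 1) / p * n * beta

def pick_bcast(p, n, alpha, beta):
    """Return the name of the cheaper scheme under the cost model."""
    costs = {
        "binomial": bcast_binomial(p, n, alpha, beta),
        "scatter_allgather": bcast_scatter_allgather(p, n, alpha, beta),
    }
    return min(costs, key=costs.get)

# Hypothetical numbers, for illustration only.
ALPHA = 50e-6   # seconds of latency per message
BETA = 1e-8     # seconds per byte (about 100 MB/s)
P = 16          # number of processes

if __name__ == "__main__":
    for nbytes in (64, 65536, 1000000):
        print(nbytes, pick_bcast(P, nbytes, ALPHA, BETA))
```

Under these assumed numbers the latency-bound tree wins for small messages and the bandwidth-efficient scheme wins for large ones, which is exactly the kind of decision the user currently cannot see or steer.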
As I also said, I'm not an MPI expert by any means and therefore have to go look up commands beyond the MPI 1 standard (which look not horribly unlike the PVM command set as far as communication is concerned) and am probably shaky there, but looking them up on the mpi-forum.org site, it looks like MPI 2 adds MPI_PUT, MPI_GET, MPI_ACCUMULATE which are just exactly what I was suggesting and what I would have hoped for, especially if they are indeed the primitives from which at least some of the higher order commands are built. If so, users can either choose to use the optimized/unoptimized higher level commands provided or (if they understand their problem and hardware) roll their own.

This is the distinction I was talking about. MPI originally passed messages at a high level of abstraction to wrap a variety of mechanisms in use on big supercomputers (not forgetting that it was a consortium of the vendors of such supercomputers that wrote the standard in response to pressure from the government and other major consumers who were tired of rewriting code every time a new supercomputer was released with its own internals and API for moving data between processors/processes). It (I think deliberately) avoided providing any sort of interface that might be interpreted as a "thin" wrapper to those internals that were responsible for minimal latency, maximum bandwidth movement of data. Whether this was to make the government happy (hiding the detail) or to make themselves happy (leaving a purchaser of a supercomputer with an incentive to write optimizations in their native API and hence become "hooked" on the hardware) is a moot point.

PVM has a different, but related, history. It was built on top of networking from the beginning, more or less, and was deliberately designed to hide the networking primitives (specifically) from the programmer where MPI might have been hiding shared memory primitives and create a "virtual machine" where MPI was running on REAL machines.
It, if anything, went out of its way to avoid RMA-like message passing commands that "look" like a wrapper to shared memory, following instead a fairly simple reliable message transmission model, and in the end (3.x) had almost exactly the same range and general form of commands as MPI 1.x for the bulk of what a user was likely to do, with maybe a bit nicer control interface over the virtual machine and a bit less control over collective operations.

Looking over (for the first time) the MPI 2 additions, I have to say that they look very nice, possibly nice enough to finally consider switching to MPI from PVM. Alternatively, it is something that should be cloned in PVM -- PVM would really benefit from PVM_GET, PVM_PUT, and some synchronization primitives. Provision of what amount to wrappers on raw RMA primitive commands (that can be/should be tuned for the hardware) and the separation of the RMA part from any synchronization components mean that a serious programmer has a lot of ability to control and optimize without leaving the library (assuming only that these commands truly are implemented as primitives, as used to develop the higher level commands). Meanwhile, people can still use the higher level collectives when those are a good match for their task, or when they are beginners and not ready to tackle lower level programming.

The only thing I still don't find (on a fairly rapid lookover) is a discussion of just what e.g. broadcast does or how to make it vary what it does. Part of this of course doesn't belong in a standards document, which isn't intended to describe algorithms or implementations at that level of detail. However, one part does.
I think it matters a great deal to the programmer to know whether broadcast (and other commands) are indeed hardware primitives or are implemented on top of point-to-point communications primitives that may or may not involve diverting intermediary processors from their running tasks (and ditto for scatter/gather type operations). This seems like it might be a programming decision point for people who really want to hand-optimize their code.

Again, this is based on my experiences in PVM, where I've tried using broadcast several times in master/slave contexts expecting to reduce latency and communications times, only to find that the command was de facto serialized and in fact took as long as or longer than just running a loop over point to point communications calls. Perhaps MPI does it better, or differently, but it doesn't LOOK like it is anything but a black box which can swing from being good on one network to terrible on another without warning.

How to implement such a thing in a standard is an open question, but from a programmer interface point of view, having a set of commands that can query and set variables to control the back end behavior of collectives, or determine properties of the hardware in the cluster, would be very useful. Just one creative idea might be for MPI to provide an optional initialization command to run on a cluster that builds a table of quiescent-state and cpu-loaded-state latencies for short, medium, and long messages, both point to point and in collective mode. The same table might hold some entries describing the selected hardware device, such as hw_bcast=TRUE, along with the broadcast latency.

From this one might be able to build portable MPI programs that run optimally on Myrinet while they still run optimally on gig Ethernet, with or without e.g. a hardware RDMA command that significantly affects and redistributes the CPU loading per message. But maybe this is all too complicated, or doesn't belong in the standard per se.

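A rough sketch of what using such a table might look like, purely as an illustration. Only the hw_bcast flag comes from the idea above; the LinkProfile class, the other field names, the "net_a"/"net_b" keys and every number are hypothetical placeholders rather than measurements of any real network. The program compares a serialized loop of point-to-point sends, a tree built from point-to-point sends, and a hardware broadcast when the table says one exists.

```python
import math
from dataclasses import dataclass

@dataclass
class LinkProfile:
    """One hypothetical table entry for a (network, message class, load) case."""
    p2p_latency: float             # seconds per point-to-point message
    hw_bcast: bool                 # does the hardware offer a real broadcast?
    hw_bcast_latency: float = 0.0  # seconds per hardware broadcast, if any

def loop_cost(profile, nprocs):
    """Root sends to each of the other processes one after another."""
    return (nprocs - 1) * profile.p2p_latency

def tree_cost(profile, nprocs):
    """Binomial tree over point-to-point sends: ceil(log2 nprocs) rounds."""
    return math.ceil(math.log2(nprocs)) * profile.p2p_latency

def choose_bcast(profile, nprocs):
    """Pick the cheapest strategy the table says is available."""
    costs = {
        "loop": loop_cost(profile, nprocs),
        "tree": tree_cost(profile, nprocs),
    }
    if profile.hw_bcast:
        costs["hardware"] = profile.hw_bcast_latency
    return min(costs, key=costs.get), costs

# Made-up table, keyed by (network, message class, cpu state).
TABLE = {
    ("net_a", "short", "quiescent"): LinkProfile(60e-6, False),
    ("net_a", "short", "loaded"): LinkProfile(90e-6, False),
    ("net_b", "short", "quiescent"): LinkProfile(8e-6, True, 10e-6),
}

if __name__ == "__main__":
    for key, profile in TABLE.items():
        best, costs = choose_bcast(profile, 8)
        print(key, "->", best, costs)
```

The design choice here is that the program, not the library, reads the table and decides, so the same source could pick a different strategy on each cluster without being rewritten.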
It is indeed like the ATLAS thing, but then, I think that ATLAS is sheer genius although it is also cumbersome and clunky to build...;-) I just dream of the day that ATLAS-like runtime optimization isn't so clunky and is based on tools that create tables of microbenchmark numbers that ARE sufficiently accurate and rich to achieve near-optimization without running a build loop that sweeps and searches a high-dimensional space...:-)

   rgb

-- 
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
