[Beowulf] MPI programming question: Interleaved MPI_Gatherv?
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
Robert G. Brown rgb at phy.duke.eduThu Mar 3 10:35:08 PST 2005
- Previous message: [Beowulf] MPI programming question: Interleaved MPI_Gatherv?
- Next message: [Beowulf] purchasing sources for Newisys 2100 (and 4300)?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Thu, 3 Mar 2005, Rob Ross wrote: > > What about RMA-like commands? MPI_Get in a loop? Since that is > > controlled by the gatherer, one would presume that it preserves call > > order (although it is non-blocking). > > I would hope that one would read the spec instead! MPI_Get()s don't > necessarily *do* anything until the corresponding synchronization call. > This allows the implementation to aggregate messages. Call order (of > the MPI_Get()s in an epoch) is ignored. Ouch! I did read the spec, about ten seconds before replying, and note that I SAID it was non-blocking (from the spec) and was thinking about using it with sync's. However, yes, this is a bit oxymoronic on a reread (preserve call order vs non-blocking? Jeeze:-) and I consider myself whomped upside the head:-) Still, what IS to prevent him from alternating gets and synchronization calls while retrieving . I don't think there is any alternative to doing this in ANY non-blocking, potentially aggregating or parallel communications scenario, although where he puts the barrier might vary. Depending on whether he cares about the (ABCD)(ABCD)... order per se he might try (however he does the "get"ting or receiving, or whatever): get A (sync) get B (sync) get C (sync) (inefficient but absolutely guarantees loop order). If (ABCD)(BADC)(DCAB)... is ok (he doesn't care what order they arrive in but he doesn't want the A communications to be aggregated so that two A's get there before he gets the BCD from the same cycle of computations) then he should be able to do: get A get B get C get D (sync) get A get B get C... If I understand things correctly (where there is a very definite chance that I don't!) these (on top of any library) will in the first case be equivalent to a blocking TCP read from A, B, C... in order (but probably not as efficient as TCP would be in this particular case because MPI is optimized against the near diametrically opposite assumption) while the second would be equivalent to using select to monitor a group of open sockets for I/O, reading from them in the order that data becomes available, but adding a toggle so you don't read from one twice before reading from all of them. Although there is likely more than one way to do it, and where in low-level programming one might well want to implement handshaking of some sort to trigger the next cycle's send (blocking the remote client's execution as necessary) to avoid overrunning buffers or exhausting memory on the master/aggregator if for any reason one host turns out to be "slow" relative to the others. In MPI one hopes all of that is handled for you, and more. > > Or of course there are always raw sockets... where you have complete > > control. Depending on how critical it is that you preserve this strict > > interleaving order. > > > > rgb > > No you don't! You're just letting the kernel buffer things instead of > the MPI implementation. Plus, Michael's original concern was doing this > in an elegant way, not explicitly controlling the ordering. Of course one has maximum control with raw sockets (or more generically, raw devices). Somewhere down inside MPI there ARE raw sockets (or equivalent at some level in the ISO/OSI stack). The MPI library wraps them up and hides all sorts of details while making all the device features uniform across the now "invisible" devices and in the process necessarily excluding one from ACCESSING all of those features a.k.a. the details being hidden. I may have misunderstood the recent discussion about the possible need for an MPI ABI but I thought that was what it was all about -- providing a measure of low level control that is BY DESIGN hidden by the API but is there, should one every try to code for it, in the actual kernel/device interface (e.g. regulated by ioctls). Note that at this low (device driver) level I would expect the kernel to handle at least some asynchronous low-level buffering and the primary interrupt processing for the physical device layer FOR the MPI implementation or any other program that uses the device -- you cannot safely escape this. This does not mean that you cannot control just where you stop using the kernel and rely on your own intermediary layers for handing the device above the level of raw read/write plus ioctls. That is the application layer or higher order kernel networking layers (depending on just where and how you access the device itself) may well manage buffers of their own, reliability, retransmission, RDMA, blocking/non-blocking of the interface, timeouts, and more. Low level networking is not easy, which is WHY people wrote PVM and the MPI network devices. So ultimately, all I was observing is that it is pretty straightforward (not necessarily easy, but straightforward) to write an application that very definitely and without any question goes and gets a chunk of data (say, contents of a struct) from an open socket on system A and puts it in a memory location (with appropriate pointers and sizeof and so forth for the struct), THEN gets a chunk of data from an open socket on system B and puts it in the next memory location, THEN gets a chunk of data from an open socket on system C and puts it ... In fact, since TCP generally does block on a read until there is data on the socket, it is relatively difficult to do it any other way in a simple loop over sockets -- you have to use select as noted above to avoid polling and non-blocking I/O, and in all cases one has to be pretty careful not to drop things and to handle cases where a stream runs slow or does other bad things. As I've learned the hard way. As far as elegance is concerned: a) That's a bit in the eye of the beholder. There are tradeoffs between simplicity of code and ease of development work vs performance and control but it is hard to say which is more "elegant". It's fairer to make the value-neutral statement that you have to work much harder to write a parallel application on top of raw sockets (no question at all;-) but have all the control and optimization potential available to userspace (at the ioctl level above the kernel and device driver itself) if you do so. To cite a metaphorical situation, is coding in assembler, whether one is coding a complete application or an embedded optimized operation, "inelegant"? Perhaps, but that's not the word I would have used. There are times when assembler is very elegant, in the sense that it directly encodes an algorithm with the greatest degree of optimization and control where a compiler might well generate indifferent code or fail to use all the features of the hardware. Once upon a time many many years ago I handed coded e.g. trig functions and arithmetic for the 8087 in assembler because my compiler generated 8088 code that ran about ten times more slowly. Elegant or inelegant? Compare to just this situation -- if for some reason you require e.g. absolute control over the order of utilization of your network links in a parallel computation (perhaps to avoid collisions or device/line contention, to do something exotic with transmission order and pattern on a hypercubical network) you may well find that MPI or PVM simply do not provide that degree of control, period. They try to "do the right thing" for a generic class of problem and simple assumptions as to the kind and number of interfaces and routes between nodes and load patterns along those routes, BUT there is no >>guarantee<< that the right thing they end up with (often chosen for robustness and optimization for the most common classes of problems) will be right for YOUR problem and network and no way to tweak it if it is not. In that case, using raw network devices (whatever they might be) might well be the only way to achieve the requisite level of control and yes, might be worth a factor of 10 in speed. I'll bet money that if you polled the list, you'd find that there exist people who have gone in and hacked MPI at the source level to "break" it (de-optimize it for the most common applications so they run worse) or who have run over time several versions of MPI including "new and improved" ones, who have found empirically that there are applications for which the hacked/older "disimproved" versions perform better. b) Anyway, this explains why I mentioned raw sockets at he end. Note well the "Depending on how..." Maybe I read the original message incorrectly, but I thought that the issue was that (for reasons unknown) he wanted to guarantee collection in the order A then B then C... Why he wanted to do this wasn't clear, nor was it clear whether (in any given cycle) it would be ok to do A then C then B then... (and just not overlap the next A). If the strict interleaving wasn't an issue, than I would have thought just putting a barrier at the end of an ABC...Z cycle would have forced all communications to complete before starting the next cycle. So IF this really IS a critical requirement -- he has to read from A, complete the read blocking no fooling, move on and read from B, etc, no data parallelism or asynchronicity permitted that might violate this strict order (or if he's interleaving communications on four different network devices along different routes to different sub-clusters of nodes), then doing it within MPI might or might not be efficient. Raw TCP sockets (or lower level hardware-dependent I/O) would a PITA to code, but you can pretty much guarantee that the resulting code is as efficient as possible, given the requirement, and it might be the ONLY way to accomplish a complicated interleave of node I/O for some very specific set of reasons. If you do the considerable work required to make it so, of course, copy of the complete works of Stevens in hand...;-) > Joachim had some good options for MPI. I agree. I don't even disagree with what you say above -- I understand what you mean. I just think that we need more data before concluding that those options were enough. He described his design goal but not his motivation. For some design goals there are probably lots of good ways to do it in MPI. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
- Previous message: [Beowulf] MPI programming question: Interleaved MPI_Gatherv?
- Next message: [Beowulf] purchasing sources for Newisys 2100 (and 4300)?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
