[Beowulf] MPI programming question: Interleaved MPI_Gatherv?
Robert G. Brown
rgb at phy.duke.edu
Thu Mar 3 10:35:08 PST 2005
On Thu, 3 Mar 2005, Rob Ross wrote:
> > What about RMA-like commands? MPI_Get in a loop? Since that is
> > controlled by the gatherer, one would presume that it preserves call
> > order (although it is non-blocking).
> I would hope that one would read the spec instead! MPI_Get()s don't
> necessarily *do* anything until the corresponding synchronization call.
> This allows the implementation to aggregate messages. Call order (of
> the MPI_Get()s in an epoch) is ignored.
Ouch! I did read the spec, about ten seconds before replying, and note
that I SAID it was non-blocking (from the spec) and was thinking about
using it with sync's. However, yes, this is a bit oxymoronic on a
reread (preserve call order vs non-blocking? Jeeze:-) and I consider
myself whomped upside the head:-)
Still, what IS to prevent him from alternating gets and synchronization
calls while retrieving . I don't think there is any alternative to
doing this in ANY non-blocking, potentially aggregating or parallel
communications scenario, although where he puts the barrier might vary.
Depending on whether he cares about the (ABCD)(ABCD)... order per se he
might try (however he does the "get"ting or receiving, or whatever):
get A (sync) get B (sync) get C (sync)
(inefficient but absolutely guarantees loop order).
If (ABCD)(BADC)(DCAB)... is ok (he doesn't care what order they arrive
in but he doesn't want the A communications to be aggregated so that two
A's get there before he gets the BCD from the same cycle of
computations) then he should be able to do:
get A get B get C get D (sync) get A get B get C...
If I understand things correctly (where there is a very definite chance
that I don't!) these (on top of any library) will in the first case be
equivalent to a blocking TCP read from A, B, C... in order (but probably
not as efficient as TCP would be in this particular case because MPI is
optimized against the near diametrically opposite assumption) while the
second would be equivalent to using select to monitor a group of open
sockets for I/O, reading from them in the order that data becomes
available, but adding a toggle so you don't read from one twice before
reading from all of them. Although there is likely more than one way to
do it, and where in low-level programming one might well want to
implement handshaking of some sort to trigger the next cycle's send
(blocking the remote client's execution as necessary) to avoid
overrunning buffers or exhausting memory on the master/aggregator if for
any reason one host turns out to be "slow" relative to the others. In
MPI one hopes all of that is handled for you, and more.
> > Or of course there are always raw sockets... where you have complete
> > control. Depending on how critical it is that you preserve this strict
> > interleaving order.
> > rgb
> No you don't! You're just letting the kernel buffer things instead of
> the MPI implementation. Plus, Michael's original concern was doing this
> in an elegant way, not explicitly controlling the ordering.
Of course one has maximum control with raw sockets (or more generically,
raw devices). Somewhere down inside MPI there ARE raw sockets (or
equivalent at some level in the ISO/OSI stack). The MPI library wraps
them up and hides all sorts of details while making all the device
features uniform across the now "invisible" devices and in the process
necessarily excluding one from ACCESSING all of those features a.k.a.
the details being hidden. I may have misunderstood the recent
discussion about the possible need for an MPI ABI but I thought that was
what it was all about -- providing a measure of low level control that
is BY DESIGN hidden by the API but is there, should one every try to
code for it, in the actual kernel/device interface (e.g. regulated by
Note that at this low (device driver) level I would expect the kernel to
handle at least some asynchronous low-level buffering and the primary
interrupt processing for the physical device layer FOR the MPI
implementation or any other program that uses the device -- you cannot
safely escape this. This does not mean that you cannot control just
where you stop using the kernel and rely on your own intermediary layers
for handing the device above the level of raw read/write plus ioctls.
That is the application layer or higher order kernel networking layers
(depending on just where and how you access the device itself) may well
manage buffers of their own, reliability, retransmission, RDMA,
blocking/non-blocking of the interface, timeouts, and more. Low level
networking is not easy, which is WHY people wrote PVM and the MPI
So ultimately, all I was observing is that it is pretty straightforward
(not necessarily easy, but straightforward) to write an application that
very definitely and without any question goes and gets a chunk of data
(say, contents of a struct) from an open socket on system A and puts it
in a memory location (with appropriate pointers and sizeof and so forth
for the struct), THEN gets a chunk of data from an open socket on system
B and puts it in the next memory location, THEN gets a chunk of data
from an open socket on system C and puts it ...
In fact, since TCP generally does block on a read until there is data on
the socket, it is relatively difficult to do it any other way in a
simple loop over sockets -- you have to use select as noted above to
avoid polling and non-blocking I/O, and in all cases one has to be
pretty careful not to drop things and to handle cases where a stream
runs slow or does other bad things. As I've learned the hard way.
As far as elegance is concerned:
a) That's a bit in the eye of the beholder. There are tradeoffs
between simplicity of code and ease of development work vs performance
and control but it is hard to say which is more "elegant". It's fairer
to make the value-neutral statement that you have to work much harder to
write a parallel application on top of raw sockets (no question at
all;-) but have all the control and optimization potential available to
userspace (at the ioctl level above the kernel and device driver itself)
if you do so.
To cite a metaphorical situation, is coding in assembler, whether one is
coding a complete application or an embedded optimized operation,
"inelegant"? Perhaps, but that's not the word I would have used. There
are times when assembler is very elegant, in the sense that it directly
encodes an algorithm with the greatest degree of optimization and
control where a compiler might well generate indifferent code or fail to
use all the features of the hardware. Once upon a time many many years
ago I handed coded e.g. trig functions and arithmetic for the 8087 in
assembler because my compiler generated 8088 code that ran about ten
times more slowly. Elegant or inelegant?
Compare to just this situation -- if for some reason you require e.g.
absolute control over the order of utilization of your network links in
a parallel computation (perhaps to avoid collisions or device/line
contention, to do something exotic with transmission order and pattern
on a hypercubical network) you may well find that MPI or PVM simply do
not provide that degree of control, period. They try to "do the right
thing" for a generic class of problem and simple assumptions as to the
kind and number of interfaces and routes between nodes and load patterns
along those routes, BUT there is no >>guarantee<< that the right thing
they end up with (often chosen for robustness and optimization for the
most common classes of problems) will be right for YOUR problem and
network and no way to tweak it if it is not. In that case, using raw
network devices (whatever they might be) might well be the only way to
achieve the requisite level of control and yes, might be worth a factor
of 10 in speed.
I'll bet money that if you polled the list, you'd find that there exist
people who have gone in and hacked MPI at the source level to "break" it
(de-optimize it for the most common applications so they run worse) or
who have run over time several versions of MPI including "new and
improved" ones, who have found empirically that there are applications
for which the hacked/older "disimproved" versions perform better.
b) Anyway, this explains why I mentioned raw sockets at he end.
Note well the "Depending on how..." Maybe I read the original message
incorrectly, but I thought that the issue was that (for reasons unknown)
he wanted to guarantee collection in the order A then B then C... Why
he wanted to do this wasn't clear, nor was it clear whether (in any
given cycle) it would be ok to do A then C then B then... (and just not
overlap the next A). If the strict interleaving wasn't an issue, than I
would have thought just putting a barrier at the end of an ABC...Z cycle
would have forced all communications to complete before starting the
So IF this really IS a critical requirement -- he has to read from A,
complete the read blocking no fooling, move on and read from B, etc, no
data parallelism or asynchronicity permitted that might violate this
strict order (or if he's interleaving communications on four different
network devices along different routes to different sub-clusters of
nodes), then doing it within MPI might or might not be efficient. Raw
TCP sockets (or lower level hardware-dependent I/O) would a PITA to
code, but you can pretty much guarantee that the resulting code is as
efficient as possible, given the requirement, and it might be the ONLY
way to accomplish a complicated interleave of node I/O for some very
specific set of reasons. If you do the considerable work required to
make it so, of course, copy of the complete works of Stevens in
> Joachim had some good options for MPI.
I agree. I don't even disagree with what you say above -- I understand
what you mean. I just think that we need more data before concluding
that those options were enough. He described his design goal but not
his motivation. For some design goals there are probably lots of good
ways to do it in MPI.
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
More information about the Beowulf