[Beowulf] Re: Re: Home beowulf - NIC latencies

Robert G. Brown rgb at phy.duke.edu
Wed Feb 16 12:03:16 PST 2005


On Wed, 16 Feb 2005, Vincent Diepeveen wrote:

> It is possible for algorithms to have sequential properties, in short
> making it hard to scale well.  Game tree search happens to have a few
> such algorithms, one of which performs best with a number of
> enhancements that share the same property: a faster blocking-get
> latency speeds it up exponentially.
> 
> For the basic idea of why the speedup is exponential, see Knuth's
> analysis of the alpha-beta algorithm.
> 
> So the assumption that an algorithm sucks because it needs low latency
> rather than bandwidth is like sticking a hairpin in an electrical
> socket.
> 
> If users JUST needed a little bit of bandwidth, they could already get
> quite far with $40 cards.
> 
> So optimizing MPI for low latency small messages IMHO is very relevant.

I sort of agree, but I think you miss my point.  Or maybe we totally
agree but I misunderstand.  Some algorithms will, as you note, have
sequential communications times associated with them that scale with the
number of nodes.  BOTH bandwidth AND latency can be important in
minimizing those and other times in the algorithms, and which one
matters more can even change during the course of the computation.  If
I have an algorithm where lots of small messages have to go to lots of
places in some order, latency becomes important.  If I have an
algorithm where lots of BIG messages have to go to lots of places in
some order, bandwidth becomes important.  There is plenty of room in
between these extremes: order might or might not matter, resource
contention might or
might not be an issue, and there is plenty of opportunity for both big
messages and small messages to be sent within a single application.
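
To put crude numbers on that: to first order the time to move a message
is a fixed latency plus the size divided by the bandwidth, so which term
dominates depends entirely on message size.  A toy sketch (the NIC
numbers below are invented purely for illustration):

    #include <stdio.h>

    /* First-order model: transfer time = latency + bytes/bandwidth.
       The NIC numbers used below are made up for illustration only. */
    double msg_time(double latency_s, double bw_bytes_per_s, double nbytes)
    {
        return latency_s + nbytes / bw_bytes_per_s;
    }

    int main(void)
    {
        double lat = 50e-6, bw = 100e6;  /* hypothetical 50 us, 100 MB/s NIC */
        printf("64 B message:  %g s\n", msg_time(lat, bw, 64.0));  /* latency dominated   */
        printf("1 MB message:  %g s\n", msg_time(lat, bw, 1.0e6)); /* bandwidth dominated */
        return 0;
    }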

Optimizing any particular MPI (or PVM) command for either extreme is
then like robbing Peter to pay Paul, when Peter and Paul are a single
bicephalic individual that has to pay protection money to the mob for
every theft transaction (oh how I just LOVE to fold, spindle and
mutilate metaphors).  

To illustrate some of the complexity issues and how interesting Life can
be, consider the notion of a "broadcast".  Node A wants to send the same
message to N other nodes.  Fine fine fine, happens all the time in
certain kinds of parallel code.  Node A therefore uses one of the
collective operations (such as pvm_bcast or pvm_mcast in PVM, which is
where I have more experience).
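
To make that concrete, here is roughly what the sending side looks like
in PVM 3 (a sketch only -- the tids array, tag, and payload are
placeholders, and error checking is omitted):

    #include <pvm3.h>

    #define MSGTAG 42                    /* arbitrary tag for this example */

    /* Node A packs one buffer and hands it to pvm_mcast(), which
       returns as soon as the message is "safely on its way". */
    void broadcast_work(int *tids, int ntids, double *work, int n)
    {
        pvm_initsend(PvmDataDefault);    /* fresh send buffer       */
        pvm_pkint(&n, 1, 1);             /* pack the item count     */
        pvm_pkdouble(work, n, 1);        /* pack the payload        */
        pvm_mcast(tids, ntids, MSGTAG);  /* non-blocking multicast  */
    }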

Now, just what happens when this code is executed?  PVM in this case
promises that it will be a non-blocking send -- execution in the current
thread on A resumes as soon as the "message is safely on its way to the
receiving processors".  It of course cannot promise that those
processors will receive it, so "safely" is a matter of opinion.  It
also completely hides the details of how this >>difficult<< task is
implemented, and whether or not it varies depending on the hardware or
code context.  To find out what it really does you must Use the (pvm)
Source, Luke! and that is Not Easy because the actual thing it does is
hidden beneath all sorts of lower level primitives.

For example, the pvmd might send the message to each node in the
receiving list one at a time, essentially serializing the entire message
transmission (without blocking the caller, sure, but serial time is
serial time).  Or is it?  If RDMA is used to deliver the message at the
hardware device level so not even the pvmd is blocked for the full time
of delivery, maybe not.  Or, if the network device supports it, it might
use an actual network broadcast or multicast.  Then how efficiently it
proceeds depends on whether and how broadcasts are supported by e.g. the
kernel and intermediate network hubs.  It could be anything from truly
parallel -- a single latency hit to put the message on N lines, as an
old ethernet repeater broadcast would do -- to de facto serialized
latency hits (each possibly with a much lower latency) as a
store-and-forward switch stores and then forwards to each line in turn,
if that's even what switches do these days; it might vary from ethernet
switch to switch.  Myrinet, FC, and SCI all might (and probably do) do
>>different<< things when one tries to optimally implement a
"broadcast".

Or PVM might refuse to second-guess the hardware at all.  Instead it
might use some sort of tree: send the message (serially) to only some
of the hosts in the receive list, who both accept the message and
forward it to others.  So (perhaps) you send to four hosts, the first
of these sends to three more while you finish, the second sends to two
more, the third to one, and each of THESE recipients finds still more
to send to, so that you cover a lot more than four hosts in four
latency hits (at the expense of involving all the intermediary systems
heavily in delivery, which may or may not delay the threads running on
those nodes, depending very much on the HARDWARE).
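
One common way such a tree can be organized (and I'm only guessing that
any given implementation does it this way) is recursive doubling: in
each round, every host that already has the message forwards it one
"stride" further, so N hosts are covered in roughly log2(N) latency
hits instead of N-1 serial sends.  A toy sketch that just prints who
would forward to whom:

    #include <stdio.h>

    /* Recursive-doubling (binomial tree) fan-out over ranks 0..n-1,
       with rank 0 as the root.  Covers n ranks in ceil(log2(n)) rounds. */
    void print_tree(int n)
    {
        int have = 1;      /* ranks 0..have-1 already hold the message */
        int round = 0;
        while (have < n) {
            int src;
            for (src = 0; src < have && src + have < n; src++)
                printf("round %d: rank %d -> rank %d\n", round, src, src + have);
            have *= 2;
            round++;
        }
    }

    int main(void) { print_tree(8); return 0; }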

Which of these it SHOULD do very likely depends strongly -- very
strongly -- on some arcane mix of the task itself and its organization
and communications requirements, the PARTICULAR size of the object being
sent THIS time, the networking hardware in use, the number of CPUs on
the node and number of CPUs on the node that are running a task thread,
and (often forgotten) the global communication pattern -- whether this
particular transmission is a one-way master-slave sort of thing or just
one piece of some sort of global message exchange where all the nodes in
the group are going to be adding their own messages to this stream while
avoiding collisions.

The programmer sees none of this, so they think of none of this.  They
think "Gee, a broadcast.  That's an easy way to send a message to many
hosts with one send command.  Great!"  They think "this will cost just
one latency hit to deliver the message to all N hosts" because that's
what "broadcast" >>means<<, as they well know from watching TV and
listening to the radio at the same time as a zillion other humans, in
parallel.  They may not (with PVM or MPI) even know what a "collision"
or a tree >>is<<, as there is no prior assumption of knowledge about
physical networking or network algorithms in the learning or
application of the toolsets.

This is the Dark Side of APIs that hide detail and wrap it up in
powerful black-box commands.  They make the programming relatively easy,
but they also keep you from coming to grips with all those little
details upon which performance and sometimes even robustness or
functionality depend.

This is the "problem" I alluded to in the previous note.  To really do
the right thing for the user (presuming there IS such a thing as a
universally right thing to do or even hoping to do the right thing
nearly all the time) one either needs to write a large set of highly
differentiated and differently optimized commands -- not just pvm_bcast,
but pvm_bcast (presuming low latency hardware broadcast exists and is
efficient), pvm_bcast_tree (presuming that NO efficient hardware
broadcast exists, possibly with additional arguments to spell out some
things about the tree), pvm_bcast_tree_join (presuming that you need a
tree and that each branch will both take something off a message
passing through as a leaf and add back a message to join the
transmission to the next leaf as a root),
pvm_bcast_rr (round robin broadcasts that are optimally synchronized for
no collisions between "simultaneously" broadcasting hosts), and
pvm_bcast_for_other_special_cases for the ones I've forgotten or don't
know about, and maybe double or treble the entire stack or add flags to
be optimal for latency dominated patterns, bandwidth dominated patterns,
or somewhere in between.
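
Just to make the flavor of that hypothetical family concrete, the
prototypes might look something like this (the first is the real PVM
call; the rest do not exist anywhere -- the names and arguments are
mine, for illustration only):

    /* First prototype is the real PVM call; the rest are hypothetical,
       sketching what a differentiated broadcast family might look like
       at the API level. */
    int pvm_bcast(char *group, int msgtag);                    /* existing: presume hw broadcast   */
    int pvm_bcast_tree(char *group, int msgtag, int fanout);   /* hypothetical: software tree      */
    int pvm_bcast_tree_join(char *group, int msgtag,
                            int (*join)(void *in, void *out)); /* hypothetical: tree + per-hop merge */
    int pvm_bcast_rr(char *group, int msgtag, int slot);       /* hypothetical: collision-free RR  */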

Alternatively, one can make just one pvm_bcast command, but put God's
Own AI into it.  Make smart decisions inside the hardware-aware daemons
that automatically switch between all of the above and more, possibly
dynamically during the course of a computation, to minimize delivery
time and the load of all systems participating in the delivery.  Hope
that you do a good enough job of it that the result is still robust and
doesn't constantly hang and crash when the assumptions you built into
it turn out to be incorrect or your "AI" turns out to be rather stupid.

All of this is just the opposite of the problems you encounter if you
program at the raw socket (or other hardware interface) level.  There
you have to work very hard to achieve the simplest of tasks -- open up a
socket to a remote host, establish bidirectional ports, work out some
sort of reliable message transmission sequence (which is nontrivial to
do if you work with the lowest level and hence fastest networking
commands, because communications is fundamentally asynchronous and
unreliable and thus simple read/write commands do not suffice).
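
To give a taste of why simple read/write does not suffice: even on a
plain TCP socket, write(2) is free to accept only part of your buffer,
so the most basic "send this buffer" already needs a loop (the classic
writen() from Stevens):

    #include <errno.h>
    #include <unistd.h>

    /* write() may accept fewer bytes than requested, so even the
       simplest reliable send needs a loop.  Returns 0 on success,
       -1 on error. */
    int write_all(int fd, const void *buf, size_t len)
    {
        const char *p = buf;
        while (len > 0) {
            ssize_t n = write(fd, p, len);
            if (n < 0) {
                if (errno == EINTR)
                    continue;        /* interrupted -- just retry */
                return -1;
            }
            p   += n;
            len -= n;
        }
        return 0;
    }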

However, once you get to where you CAN talk between nodes with sockets,
you are forced to confront the communication and computational topology
questions and the various "special" capabilities of the hardware head
on.  No black boxes.  You want a tree, you get a tree, but YOU have to
program the tree and decide what to do if a leaf dies in mid-delivery,
etc.  You want to synchronize communications, you go right ahead but be
prepared to figure out how to communicate synchronization information
out of band.  You want non-blocking communications, set the appropriate
ioctl or fcntl, if you know how for your particular "file" or hardware.
Learn the select call.
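
For reference, the incantations in question are short but utterly
opaque until you've seen them once (a minimal sketch, with essentially
no error handling):

    #include <fcntl.h>
    #include <sys/select.h>

    /* Flip a descriptor into non-blocking mode... */
    int make_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        if (flags < 0)
            return -1;
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }

    /* ...and use select() to wait (up to usec microseconds) for it to
       become readable instead of blocking inside read().  Returns >0
       if readable, 0 on timeout, -1 on error. */
    int wait_readable(int fd, long usec)
    {
        fd_set rfds;
        struct timeval tv;
        tv.tv_sec  = 0;
        tv.tv_usec = usec;
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        return select(fd + 1, &rfds, NULL, NULL, &tv);
    }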

Now things are totally controlled, but you have to be an experienced
network programmer (a.k.a. a network programming God) to do anything
complex.  Sleep with Stevens underneath your pillow, that sort of thing.
The big set, not just the single book version.  And if you're THAT good,
what are you doing working on a beowulf?  There are big money jobs out
there a-waitin', as there are for the other seventeen humans on the
planet with that kind of ability...;-)

What I think a lot of people (even experienced people) end up doing is
using PVM or MPI to mask out the annoying parts of raw networking --
maintaining a list of hosts in the cluster, dealing with the repetitive
parts of the network code that ensure reliable delivery, adding some
nice management tools to pass out-of-band information around for e.g.
process control and synchronization.  Then they use a relatively small
set of the available message passing commands, because they do not trust
the more advanced black box collectives.  

Usually this lack of trust has a basis -- they tried them in some
application or other and found that instead of speeding up, things got
slower or had unexpected side effects, and they had no way of knowing
what they actually DID, let alone how to fix them.  They have no access
to the real low level mechanism whereby those commands move data.

That's what I meant about making MPI or PVM more "atomic".  PVM has all
sorts of pack and unpack commands for a message that permit (I suppose)
typechecking and conversion to be done, where all I really want for most
communications is a single send command that takes a memory address and
a buffer length and a single receive command that takes a memory address
and a (maximum) length.  If I want to "pack" the buffer in some
particular way, that's what bcopy is for.  I don't want to "have" to
allocate and free it every time I use it, or use a command that very
likely allocates and frees a buffer I cannot see when I call it.  The
buffer might hold an int or int array, a double matrix, a struct, a
vector of structs -- who cares?  Pointers at both ends can unambiguously
manage the typing and alignment -- I'm the programmer, and this is what
I do.

With this much simpler structure one can at least think about
optimization, because the problem itself is now much simpler.  A
message is a block
of anonymous memory, period, with the programmer fully responsible for
aligning the send address or receive address on both ends with whatever
structure(s) or variable(s) are to be used there.  It is very definitely
less portable -- it leaves the user (or more likely a higher level
command set built on top of the primitives) with the hassle of having to
manage big-endian and little-endian issues as well as data size issues
if they use the message passer across/between incompatible systems.
These issues, however, were a lot more important in the past than they
are today, and they add a bunch of easily avoided overhead to the vast
majority of clusters where they aren't needed.
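
The sort of hassle I mean: on a heterogeneous cluster, every multi-byte
field has to pass through something like htonl()/ntohl() on its way to
and from the wire -- overhead that is pure waste when both ends are the
same architecture:

    #include <arpa/inet.h>
    #include <stdint.h>

    /* Byte-order conversion a higher layer would have to supply when
       the two ends disagree -- and could skip entirely when they don't. */
    uint32_t to_wire(uint32_t host_value)   { return htonl(host_value); }
    uint32_t from_wire(uint32_t wire_value) { return ntohl(wire_value); }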

With a simple set of building blocks such as this, one could then
>>implement<< PVM or MPI or any other higher order, more portable
message passing API on top of it.  Indeed, I'd guess that this is very
much what PVM really does (don't know about MPI) -- the pvm_pack
routines very likely wrap up a malloc (of a struct), set metadata in
struct, bcopy into data region of struct, send struct, free struct, with
the inverse process on the other side driven/modified by the metadata
(such as endianness) as needed.  All sorts of tradeoffs of speed and
total memory footprint in that sequence, many of which are not always
necessary.  One could ALSO focus more energy on the higher
order/collective send routines, as one could write them INSIDE the low
level constructs provided so they become USER level software instead of
library black boxes.  With sources, modifying or tuning them would no
longer involve working with either raw sockets or a hidden set of
internals for the library itself.
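
As a guess (nothing more) at what that pack-and-send path amounts to,
written here on top of the hypothetical msg_send() sketched above:

    #include <stdlib.h>
    #include <string.h>

    /* From the hypothetical primitive sketched earlier. */
    int msg_send(int dest_node, const void *buf, size_t len);

    /* A guess at what a pack-and-send sequence might boil down to:
       allocate a struct, set metadata, copy the payload in, send it,
       free it.  Purely illustrative. */
    struct wire_msg {
        int  type;       /* metadata: payload type           */
        int  count;      /* metadata: number of items        */
        char data[];     /* payload follows (C99 flex array) */
    };

    int packed_send(int dest, int type, int count,
                    const void *payload, size_t nbytes)
    {
        struct wire_msg *m = malloc(sizeof(*m) + nbytes);
        int rc;
        if (m == NULL)
            return -1;
        m->type  = type;
        m->count = count;
        memcpy(m->data, payload, nbytes);            /* the bcopy step        */
        rc = msg_send(dest, m, sizeof(*m) + nbytes);
        free(m);                                     /* the hidden alloc/free */
        return rc;
    }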

I'm not sure I'm making myself clear on all of this, and for lots of
programs I'm sure it doesn't matter, but for really bleeding edge
parallel performance and complex code I suspect that raw sockets (or the
equivalent device interface for non-IP-socketed devices) still hold a
substantial edge over the same algorithms managed through the same hardware
with the message passing libraries.  This was where I jumped in -- when
Patrick made much the same statement.  

This (if true) is a shame, and is likely due to the assumptions that
have gone into the implementation of the commands, some of which date
back to big iron supercomputer days where the hardware was VERY
DIFFERENT from today but ABSOLUTELY UNIFORM within a given hardware
platform, so that "universal" tuning was indeed possible.  Maybe it's
time to reassess these assumptions.  I am therefore trying to suggest
that instead of "fixing" the collectives to work better for optimal
latency at the expense of bw or vice versa (without even MENTIONING the
wide array of hardware the same command is supposed to "transparently"
achieve this miracle on) it might be better to work the other way -- add
some very low level primitives that do little more than encapsulate and
manage the annoying aspects of raw interfaces while still permitting
their "direct" use.

THEN implement PVM and MPI both on top of those low level primitives --
why not?  The differences are all higher order interface things --
ultimately what they do is move buffers across buses and wires, although
the process would be made a lot easier if there were a shared data
structure and primitives to describe and perform common tasks on a
"cluster" between them.  A coder could then choose to "use a compiler"
(metaphorically the encapsulated primitives) for some or all of their
code and accept the default optimizations, or "use an assembler" (the
primitives themselves) to hand-tune critical parts of their code,
without having to leave the nice safe portable bounds of their preferred
parallel library.  If done really well, it would accomplish the long
discussed merger of PVM and MPI almost as an afterthought with teeny
tweaks (perhaps) of the commands, since they would be based on the same
primitives and underlying data structures, after all.

Just dreaming, I guess. Possibly hallucinating.  That bump on the head,
y'know.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu




