[Beowulf] Re: Re: Home beowulf - NIC latencies

Ashley Pittman ashley at quadrics.com
Mon Feb 14 09:42:42 PST 2005


On Mon, 2005-02-14 at 11:11 -0600, Rob Ross wrote:
> If you used the non-blocking send to allow for overlapped communication, 
> then you would like the implementation to play nicely.  In this case the 
> user will compute and eventually call MPI_Test or MPI_Wait (or a flavor 
> thereof).
>
> If you used the non-blocking sends to post a bunch of communications that
> you are going to then wait to complete, you probably don't care about the
> CPU -- you just want the messaging done.  In this case the user will call 
> MPI_Wait after posting everything it wants done.
>
> One way the implementation *could* behave is to assume the user is trying
> to overlap comm. and comp. until it sees an MPI_Wait, at which point it
> could go into this theoretical "burn CPU to make things go faster" mode.  
> That mode could, for example, tweak the interrupt coalescing on an 
> ethernet NIC to process packets more quickly (I don't know off the top of 
> my head if that would work or not; it's just an example).
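
(For context, the overlap pattern being described looks roughly like
this; a minimal sketch in MPI C, where more_work_available() and
do_some_work() are made-up application hooks, not anything from a real
implementation:)

  #include <mpi.h>

  extern int  more_work_available(void);   /* hypothetical app hooks */
  extern void do_some_work(void);

  void overlapped_send(void *buf, int count, int peer, MPI_Comm comm)
  {
      MPI_Request req;
      int done = 0;

      MPI_Isend(buf, count, MPI_BYTE, peer, 0, comm, &req);

      /* Compute while the message is (hopefully) progressing. */
      while (!done && more_work_available()) {
          do_some_work();
          MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* poke progress */
      }
      MPI_Wait(&req, MPI_STATUS_IGNORE);  /* no-op if Test completed it */
  }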

Maybe if you were using a channel interface (sockets) and all the
messages were to the same remote process, then it might make sense to
coalesce all the sends into a single transaction and only send it in
the MPI_Wait call.  The latency for one bigger network transaction
*might* be lower than the sum of the issue times for the smaller ones.

I'd hope, though, that a well written application would already bunch
all its sends into a single larger block wherever this optimisation is
possible.
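
(To make concrete what I mean by bunching: here's a sketch using
MPI_Pack to coalesce several small messages to one peer into a single
send.  The buffer size and the msgs/lens arrays are just assumptions
for illustration, and there's no overflow check:)

  #include <mpi.h>

  void send_bunched(char **msgs, int *lens, int nmsgs,
                    int peer, MPI_Comm comm)
  {
      char buf[65536];   /* staging buffer, size picked arbitrarily */
      int pos = 0;

      for (int i = 0; i < nmsgs; i++)
          MPI_Pack(msgs[i], lens[i], MPI_BYTE,
                   buf, sizeof buf, &pos, comm);

      /* One send instead of nmsgs sends: pay base_latency once. */
      MPI_Send(buf, pos, MPI_PACKED, peer, 0, comm);
  }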

Given any reasonably fast network, however, not doing anything until
the MPI_Wait call would destroy your latency.  It strikes me that this
isn't overlapping comms and compute so much as artificially delaying
comms to allow the compute to finish, which seems rather pointless.

If you had a bunch of sends to do to N remote processes then I'd expect
you to post them in order (non-blocking) and wait for them all at the
end.  The time taken to do this should be (base_latency + ((N-1) * M)),
where M is the reciprocal of the "issue rate".  You can clearly see
here that even for a small number of batched sends (even a 2d/3d
nearest-neighbour pattern) the issue rate (that is, how little CPU time
each send call consumes) is at least as important as the raw latency.
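
(Picking some purely illustrative numbers, not measurements: with a
base_latency of 2us and an issue time M of 0.5us, the six face
neighbours of a 3d decomposition cost 2 + 5 * 0.5 = 4.5us, so over
half the total is issue cost.  The pattern itself looks like this,
where neighbours[] and the buffer are assumed application state:)

  #include <assert.h>
  #include <mpi.h>

  double timed_neighbour_sends(void *buf, int count,
                               const int *neighbours, int n,
                               MPI_Comm comm)
  {
      MPI_Request reqs[26];   /* enough for full 3d nearest neighbours */
      assert(n <= 26);

      double t0 = MPI_Wtime();

      for (int i = 0; i < n; i++)
          MPI_Isend(buf, count, MPI_BYTE, neighbours[i], 0,
                    comm, &reqs[i]);

      MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
      return MPI_Wtime() - t0;   /* ~ base_latency + (n-1) * M */
  }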

> All of this is moot of course unless the implementation actually has more
> than one algorithm that it could employ...

In my experience there are often dozens of different algorithms for
every situation, and each has its trade-offs.  Choosing the right one
based on the parameters given is the tricky bit.
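
(As a toy illustration of that kind of parameter-based choice -- not
Quadrics' or any real MPI's actual dispatch, and the thresholds are
invented:)

  #include <stddef.h>

  typedef enum { PROTO_INLINE, PROTO_EAGER, PROTO_RENDEZVOUS } proto_t;

  static proto_t choose_protocol(size_t bytes, int contended)
  {
      if (bytes <= 128)                  /* fits in the descriptor */
          return PROTO_INLINE;
      if (bytes <= 16384 && !contended)  /* copy to pre-posted buffer */
          return PROTO_EAGER;
      return PROTO_RENDEZVOUS;           /* handshake, then zero-copy */
  }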

Ashley,


