[Beowulf] 1.2 us IB latency?

Wed Mar 28 13:04:37 PDT 2007

Peter Kjellstrom wrote:
> On Wednesday 28 March 2007, Mark Hahn wrote:
>>>> start timer
>>>> send(other,small-message)		recv(first,small-message)
>>>> recv(other,small-message)		send(first,small-message)
>>>> stop timer
>>>>
>>>> I'll actually see 2.4 us between the timer calls?  if I understand,
>>>> aggregation would only help on a streaming test.  in fact, this kind
>>>> of isolated RPC-like exchange is what I see most commonly.
>>> Assuming you could time it with any accuracy, yes.
>> that's not an issue - rdtsc is perfectly good into the tens of ns range.
> 
> I'll have to hack together a rdtsc based mpi microbenchmark some day it seems 
> =)

Might I suggest just passing an MPI_INT back and forth and decrementing it
each time to ensure that the message makes it all the way to userspace
before heading back to the other node?  That seems like it would allow for
easier timing (with gettimeofday) and would also take into account various
real-world effects like interrupts and scheduling effects.  I guess it
depends on whether you want marketing numbers or real-world numbers ;-).
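
A minimal sketch of the decrementing ping-pong idea, written in Python rather
than MPI so that it runs on a single machine.  Two threads bounce a 4-byte
integer over a local socketpair, and each side decrements it in user space
before sending it back.  The socketpair, the helper thread, and the names
ping_pong and _peer are assumptions made for illustration, not the benchmark
mentioned in this message, and time.perf_counter stands in for gettimeofday.
The figure it reports is local IPC overhead, not interconnect latency.

```python
import socket
import struct
import threading
import time


def _send_int(sock, value):
    # Fixed-size 4-byte signed integer, standing in for one MPI_INT.
    sock.sendall(struct.pack("!i", value))


def _recv_int(sock):
    buf = b""
    while len(buf) < 4:
        chunk = sock.recv(4 - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the socket")
        buf += chunk
    return struct.unpack("!i", buf)[0]


def _peer(sock):
    # Hypothetical "other node": receive, decrement in user space, send back.
    while True:
        value = _recv_int(sock) - 1
        _send_int(sock, value)
        if value <= 1:  # the initiator will decrement this to 0 and stop
            return


def ping_pong(round_trips):
    """Return (seconds per one-way hop, number of hops timed)."""
    if round_trips < 1:
        raise ValueError("round_trips must be at least 1")
    a, b = socket.socketpair()
    worker = threading.Thread(target=_peer, args=(b,))
    worker.start()
    value = 2 * round_trips  # one decrement per hop
    hops = 0
    start = time.perf_counter()
    while value > 0:
        _send_int(a, value)
        value = _recv_int(a) - 1  # decrement before heading back
        hops += 2
    elapsed = time.perf_counter() - start
    worker.join()
    a.close()
    b.close()
    return elapsed / hops, hops
```

Calling ping_pong(10000) and multiplying the first returned value by 1e6 gives
microseconds per hop.  Because the counter has to be read and rewritten on
each side, every hop is forced through user space before the reply goes out.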

Additionally, you might want to do this in parallel; after all, few clusters
let their communication layer sit idle while a single pair of nodes
communicates.  You might also want all possible pairs to communicate to
see what effect locality has.  This might be especially useful for comparing
interconnect layers with various fractions of backplane bandwidth and
differing methods for handling contention.
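
One way to have every pair communicate while keeping all ranks busy at once
is a round-robin (circle method) schedule: in each round every rank has
exactly one partner, and after n-1 rounds every pair has met once.  This is
only a sketch of one possible schedule, assuming an even number of ranks; the
function name pairing_rounds is made up for illustration, and the message
does not say how the actual benchmark orders its pairs.

```python
def pairing_rounds(n):
    """Return n-1 rounds, each a list of n/2 disjoint (low, high) rank pairs.

    Assumes n is even; this is the classic circle method, not necessarily
    the ordering used by any particular benchmark.
    """
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be an even number >= 2")
    ranks = list(range(n))
    rounds = []
    for _ in range(n - 1):
        # Pair the first rank with the last, the second with the
        # second-to-last, and so on.
        pairs = []
        for i in range(n // 2):
            x, y = ranks[i], ranks[n - 1 - i]
            pairs.append((min(x, y), max(x, y)))
        rounds.append(pairs)
        # Keep ranks[0] fixed and rotate the others by one position.
        ranks = [ranks[0], ranks[-1]] + ranks[1:-1]
    return rounds


if __name__ == "__main__":
    for number, pairs in enumerate(pairing_rounds(4)):
        print(number, pairs)
```

Within a round, all pairs would run their ping-pong concurrently, so the
per-pair numbers include whatever contention the switch introduces.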

I've written a code that does the above; it's still somewhat raw.  I've
yet to add some sanity checking and command-line options to avoid recompiling,
both of which are high on my todo list.

I do have some data from an InfiniPath cluster to post.  This data set is for
4 processors per node on 64 nodes (of a 177 node cluster); each node has a
single port on a 288 port IB switch:
   http://cse.ucdavis.edu/~bill/n64p256/band_results.txt
   http://cse.ucdavis.edu/~bill/n64p256/lat_results.txt

They should be easy to visualize; I personally use gnuplot's splot "filename"
matrix.  To see the 32 node numbers, with a single process per node, just
replace the directory name above with "n32p1".
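
For anyone producing their own numbers, gnuplot's matrix mode reads plain
text with one row of whitespace-separated values per line.  The sketch below
writes per-pair latencies in that shape.  The function name matrix_text, the
dictionary layout, the symmetric fill, and the zero diagonal are all
assumptions for illustration; they are not taken from the result files
linked above.

```python
def matrix_text(latency_us, n):
    """Format per-pair latencies as an n x n whitespace-separated matrix.

    latency_us maps (i, j) with i < j to a latency in microseconds.
    Assumption: the value is mirrored to (j, i) and the diagonal is 0.
    """
    lines = []
    for i in range(n):
        row = []
        for j in range(n):
            if i == j:
                row.append(0.0)
            else:
                row.append(latency_us[(min(i, j), max(i, j))])
        lines.append(" ".join("%.2f" % value for value in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up numbers, only to show the output layout.
    print(matrix_text({(0, 1): 1.5, (0, 2): 2.0, (1, 2): 1.75}, 3), end="")
```

Writing the returned text to a file and running splot "filename" matrix on it
should give a surface with one point per pair of ranks.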

> Sorry for being unclear here. What I wanted to say was that, unrelated to 1.5 
> us ping-pong on mpi I have also observed verbs level latency (ib_write_lat) 
> of around 1 us. And that figure is not affected by any mvapich trickery :-).

Good to hear.  I'll happily share source if you (or anyone else) are willing
to run my benchmark.