[Beowulf] 1.2 us IB latency?

Christian Bell christian.bell at qlogic.com
Wed Apr 25 07:35:08 PDT 2007


On Wed, 25 Apr 2007, Ashley Pittman wrote:

> You'd have thought that to be the case but PIO bandwidth is not a patch
> on DMA bandwidth.  On Alphas you used to get a performance improvement
> by evicting the data from the cache immediately after you had submitted
> the DMA, but this doesn't buy you anything with modern machines.

Not a patch, but the main goal in many of our cases is minimizing the
amount of time spent in MPI where conventional wisdom about offload
doesn't unconditionally apply (and yes, this is contrary to where I
think programming models should be headed).

> > Just to make sure we compare the same thing; the 
> > .15usec is the time from the CPU issuing the 
> > store instruction until the side effect is 
> > visible in the HCA? In other words, assume a CSR 
> > word read takes 0.5usec, a loop writing and 
> > reading the same CSR takes 0.65usec, right? If 
> > that's the case, CSR accesses have improved radically in the last few years.
> 
> Until it's visible is another question, typically we write a number of
> values and then flush them, how soon you can see the data from the NIC
> before the flush is almost entirely chipset dependent.  As you say, reads
> are very bad and we avoid them wherever possible.
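
For readers following along, the pattern Ashley describes is roughly
the following (a hedged sketch, not our driver code -- nic_pio here is
a hypothetical pointer to a write-combining mapping of the NIC's PIO
buffer):

/* Fill a write-combining mapped NIC region with plain stores, then
 * issue a store fence so the writes are pushed toward the device. */
#include <stdint.h>
#include <xmmintrin.h>   /* _mm_sfence() */

static inline void pio_copy(volatile uint64_t *nic_pio,
                            const uint64_t *src, int nwords)
{
    int i;
    for (i = 0; i < nwords; i++)
        nic_pio[i] = src[i];   /* plain stores, no readback */
    _mm_sfence();              /* order/flush the stores toward the NIC */
}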

I recently measured that it takes InfiniPath 0.165usec to do a complete
MPI_Isend -- so in essence this is 0.165usec of software overhead
that also includes the (albeit cheap Opteron) store fence.  I don't
think that queueing a DMA request is much different in terms of
software overhead.  For small messages, I suspect that most of the
differences will be in the amount of time the request (PIO or DMA)
remains queued in the NIC before it can be put on the wire.  If
issuing a DMA request implies more work for the NIC compared to a PIO
that requires no DMA reads, this will be apparent in the resulting
message gap (and made worse as more sends are put in flight).
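
For what it's worth, one way to time just the send-side software
overhead of MPI_Isend is something like the minimal sketch below (not
the actual benchmark I ran; NITER and MSGSIZE are arbitrary values,
and it assumes exactly two ranks doing useful work):

#include <mpi.h>
#include <stdio.h>

#define NITER   10000
#define MSGSIZE 8

int main(int argc, char **argv)
{
    static char sbuf[MSGSIZE], rbuf[NITER][MSGSIZE];
    static MPI_Request req[NITER];
    double t0, t1;
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        /* pre-post all the matching receives */
        for (i = 0; i < NITER; i++)
            MPI_Irecv(rbuf[i], MSGSIZE, MPI_CHAR, 0, 0,
                      MPI_COMM_WORLD, &req[i]);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0) {
        /* time only the injection (Isend) calls, not their completion */
        t0 = MPI_Wtime();
        for (i = 0; i < NITER; i++)
            MPI_Isend(sbuf, MSGSIZE, MPI_CHAR, 1, 0,
                      MPI_COMM_WORLD, &req[i]);
        t1 = MPI_Wtime();
        printf("average MPI_Isend overhead: %.3f usec\n",
               (t1 - t0) * 1e6 / NITER);
    }
    if (rank < 2)
        MPI_Waitall(NITER, req, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}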

In this regard, we have a pretty useful test in GASNet called
testqueue that measures how the message gap changes as the number of
outstanding sends is increased.  Interconnects vary in performance --
QLogic's PIO and Quadrics' STEN have fairly flat profiles, whereas
Mellanox/VAPI is not so flat after 2 messages in flight, and my
Myrinet results are from very old hardware.  Obviously, I'd encourage
everyone to run their own tests, as various HCA revisions will have
their own profiles.

I should put together an MPI form of this test -- GASNet measures
these metrics with the lower-level software that many MPI
implementations are built on, so comparing the MPI numbers to the
GASNet numbers could help identify overheads added by the MPI layer
itself.
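
Until I get around to that, a rough sketch of what an MPI-level
version of testqueue could look like is below -- the constants
(MAX_DEPTH, REPS, MSGSIZE) and the exact methodology are my own
approximation, not what the GASNet test actually does:

#include <mpi.h>
#include <stdio.h>

#define MAX_DEPTH 64
#define REPS      1000
#define MSGSIZE   8

int main(int argc, char **argv)
{
    static char sbuf[MSGSIZE], rbuf[MAX_DEPTH][MSGSIZE];
    MPI_Request req[MAX_DEPTH];
    double t0, t1;
    int rank, depth, r, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (depth = 1; depth <= MAX_DEPTH; depth *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        if (rank < 2) {
            for (r = 0; r < REPS; r++) {
                /* keep 'depth' small messages in flight, then drain them */
                for (i = 0; i < depth; i++) {
                    if (rank == 0)
                        MPI_Isend(sbuf, MSGSIZE, MPI_CHAR, 1, 0,
                                  MPI_COMM_WORLD, &req[i]);
                    else
                        MPI_Irecv(rbuf[i], MSGSIZE, MPI_CHAR, 0, 0,
                                  MPI_COMM_WORLD, &req[i]);
                }
                MPI_Waitall(depth, req, MPI_STATUSES_IGNORE);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("depth %2d: %.3f usec per message\n", depth,
                   (t1 - t0) * 1e6 / ((double)REPS * depth));
    }

    MPI_Finalize();
    return 0;
}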


    . . christian

-- 
christian.bell at qlogic.com
(QLogic SIG, formerly Pathscale)


