[Beowulf] 1.2 us IB latency?

Wed Apr 25 05:07:28 PDT 2007

On Wed, 2007-04-25 at 11:31 +0200, Håkon Bugge wrote:
> At 17:55 24.04.2007, Ashley Pittman wrote:
> >That would explain why qlogic use PIO for up to 64k messages and we
> >switch to DMA at only a few hundred.  For small messages you could best
> >describe what we use as a hybrid of the above descriptions: we write
> >a network packet across the PCI bus and don't DMA at all.
> 
> I assume QsNet has to do something with the 
> packet after it has been written to the HCA. 
> Since the outbound PCI address space is only 
> 32-bits (who needs more than 4GigB of CSR, other 
> than cluster people attempting to map all the 
> accumulated memory of the nodes in the cluster 
> into a single address space?), I assume QsNet 
> uses part of the packet as 64-bit address 
> information and starts a DMA from the HCA local 
> buffer to the remote destination.

No, for small messages we don't use the DMA engine; in effect we write
the network packet directly from the main CPU onto the wire.

As to who needs more than 32 bits of space: quite a few people.  I did
some work a number of years ago to enable tiling of multiple 32-bit NIC
address spaces onto a single 64-bit application space; in the end it
worked fairly well, but it was a tricky thing to get right.
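
As a rough illustration of what such tiling involves, here is a minimal
sketch that maps a 64-bit application address onto one of several 32-bit
windows. It is not the scheme that was actually implemented: the flat
back-to-back layout and the names to_nic_address, to_app_address and
split_transfer are all assumptions made for the example.

```python
# Hypothetical sketch: tile several 32-bit NIC address windows back to back
# inside one 64-bit application address space. Layout and names are assumed.

WINDOW_BITS = 32
WINDOW_SIZE = 1 << WINDOW_BITS
WINDOW_MASK = WINDOW_SIZE - 1


def to_nic_address(app_addr, num_windows):
    """Split a 64-bit application address into (window index, 32-bit offset)."""
    if not 0 <= app_addr < (1 << 64):
        raise ValueError("application address must fit in 64 bits")
    window = app_addr >> WINDOW_BITS
    if window >= num_windows:
        raise ValueError("address falls outside the tiled windows")
    return window, app_addr & WINDOW_MASK


def to_app_address(window, offset):
    """Inverse mapping: rebuild the 64-bit application address."""
    if not 0 <= offset < WINDOW_SIZE:
        raise ValueError("offset must fit in 32 bits")
    return (window << WINDOW_BITS) | offset


def split_transfer(app_addr, length, num_windows):
    """Break a transfer into pieces that never cross a window boundary."""
    pieces = []
    while length > 0:
        window, offset = to_nic_address(app_addr, num_windows)
        chunk = min(length, WINDOW_SIZE - offset)
        pieces.append((window, offset, chunk))
        app_addr += chunk
        length -= chunk
    return pieces
```

The boundary handling in split_transfer is one example of the kind of
detail that makes this tricky: any transfer that straddles two windows has
to be cut into separate 32-bit pieces.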

> >The downside to PIO of course is you need a CPU to drive it so besides
> >the fact it's slow you can't do anything asynchronously.
> 
> This is a classic tradeoff. Most applications 
> _create_ the message before it is sent (contrary 
> to many p2p benchmarks). Hence, it resides in the 
> L1 or L2 cache of the CPU with a (MOESI) Modified 
> state. It is then very efficient to use the CPU to 
> read its local cache and write the message using 
> the WC buffer. In contrast, the HCA has to issue a 
> DMA read to memory, the CPU cache(s) is snooped, 
> data is transferred to the memory _and_ to the 
> HCA. The cache state ends up in Shared state, and 
> a bus transaction is required in order to make it 
> Modified again (when the buffer is written the next time).

You'd have thought that to be the case, but PIO bandwidth is not a patch
on DMA bandwidth.  On Alphas you used to get a performance improvement
by evicting the data from the cache immediately after you had submitted
the DMA, but this doesn't buy you anything on modern machines.
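
The PIO/DMA switch point mentioned earlier in the thread can be pictured
with a simple linear cost model: PIO starts quickly but has low bandwidth,
DMA costs more to start but streams faster. The sketch below is only that
model. The function names are hypothetical, and the numbers in the usage
lines are placeholders chosen for illustration, not measurements from any
adapter.

```python
# Linear cost model: time = startup + size / bandwidth.
# All names and example numbers are illustrative assumptions.

def transfer_time_us(size_bytes, startup_us, bandwidth_bytes_per_us):
    """Time to move size_bytes given a fixed startup cost and a bandwidth."""
    return startup_us + size_bytes / bandwidth_bytes_per_us


def crossover_bytes(pio_startup_us, pio_bw, dma_startup_us, dma_bw):
    """Message size at which DMA becomes as fast as PIO.

    Assumes DMA has the higher startup cost and the higher bandwidth;
    anything else is rejected rather than guessed at.
    """
    if dma_bw <= pio_bw or dma_startup_us < pio_startup_us:
        raise ValueError("model assumes DMA: higher startup, higher bandwidth")
    extra_startup = dma_startup_us - pio_startup_us
    per_byte_saving = 1.0 / pio_bw - 1.0 / dma_bw
    return extra_startup / per_byte_saving


# Placeholder inputs (bandwidth in bytes per microsecond, i.e. MB/s):
size = crossover_bytes(pio_startup_us=0.2, pio_bw=250.0,
                       dma_startup_us=1.0, dma_bw=1000.0)
print("switch to DMA above roughly %.0f bytes" % size)
```

In this model the crossover moves up as PIO bandwidth approaches DMA
bandwidth, which is one way to read the difference between a 64k switch
point and one at a few hundred bytes.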

> >That's an interesting theory, but I suspect your numbers are a little
> >out.  My own measurements put a PIO word write in the region of .15 uSec
> >depending on chipset.  Of course if you are right then the remaining PIO
> >write is happening in 1 uSec which leaves only .2uSec for the network
> >which seems a little fast to me.
> 
> Just to make sure we compare the same thing; the 
> .15usec is the time from the CPU issuing the 
> store instruction until the side effect is 
> visible in the HCA? In other words, assume a CSR 
> word read takes 0.5usec, a loop writing and 
> reading the same CSR takes 0.65usec, right? If 
> that is the case, CSR accesses have improved radically over the last few years.

The time until it's visible is another question.  Typically we write a
number of values and then flush them; how soon you can see the data from
the NIC before the flush is almost entirely chipset dependent.  As you
say, reads are very bad and we avoid them wherever possible.
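
To make the arithmetic concrete, here is a small additive model comparing a
write-then-read loop with a batch of writes followed by one flush. The 0.15
and 0.5 usec figures are the ones quoted above; the flush cost is a
hypothetical parameter, since no figure for it appears in this thread, and
the model ignores any merging of writes done by the chipset.

```python
# Rough additive model; figures are chipset dependent.
PIO_WRITE_US = 0.15   # one PIO word write, as quoted above
CSR_READ_US = 0.5     # one CSR word read, the figure assumed in the question


def write_then_read_us(iterations=1):
    """Each iteration does one PIO write followed by one blocking CSR read."""
    return iterations * (PIO_WRITE_US + CSR_READ_US)


def batched_writes_us(words, flush_us):
    """Several PIO writes followed by a single flush.

    flush_us is an assumed input, not a measured value.
    """
    return words * PIO_WRITE_US + flush_us


# One write plus one read reproduces the 0.65 usec loop from the question.
print("write+read loop: %.2f usec" % write_then_read_us())
```

Whatever the flush actually costs, the model shows why avoiding reads
matters: each read in the loop costs more than three writes.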

Ashley,