[Beowulf] 1.2 us IB latency?

Wed Apr 25 02:31:05 PDT 2007

At 17:55 24.04.2007, Ashley Pittman wrote:
>That would explain why qlogic use PIO for up to 64k messages and we
>switch to DMA at only a few hundred.  For small messages you could best
>describe what we use as a hybrid of the above descriptions: we write a
>network packet across the PCI bus and don't DMA at all.

I assume QsNet has to do something with the 
packet after it has been written to the HCA. 
Since the outbound PCI address space is only 
32 bits (who needs more than 4 GB of CSR space, 
other than cluster people attempting to map all the 
accumulated memory of the nodes in the cluster 
into a single address space?), I assume QsNet 
uses part of the packet as 64-bit address 
information and starts a DMA from the HCA local 
buffer to the remote destination.

>The downside to PIO of course is you need a CPU to drive it so besides
>the fact it's slow you can't do anything asynchronously.

This is a classic tradeoff. Most applications 
_create_ the message before it is sent (unlike 
many p2p benchmarks). Hence, it resides in the 
L1 or L2 cache of the CPU in the (MOESI) Modified 
state. It is then very efficient to use the CPU to 
read its local cache and write the message using 
the WC (write-combining) buffer. With DMA, by 
contrast, the HCA has to issue a read to memory, 
the CPU cache(s) are snooped, and the data is 
transferred to memory _and_ to the HCA. The cache 
line ends up in the Shared state, and a bus 
transaction is required to make it Modified again 
(the next time the buffer is written).
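
To make the bookkeeping concrete, here is a toy model of the two send paths as described above. It is only a sketch: it tracks a single cache line, treats each message as one I/O write, assumes the DMA path needs one doorbell write per message, and follows the state transitions as I described them (a snooped Modified line ends up Shared). The names `send_pio`, `send_dma` and `simulate` are made up for illustration, and real chipsets may behave differently.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    coherence_transactions: int = 0  # fills, upgrades and snoop hits on the CPU bus
    io_bus_writes: int = 0           # CPU-initiated writes towards the HCA
    io_bus_reads: int = 0            # HCA-initiated DMA reads


def create_message(state, c):
    """CPU writes the send buffer; the line must be Modified to do so."""
    if state != "M":
        c.coherence_transactions += 1  # fill or upgrade to Modified
    return "M"


def send_pio(state, c):
    """CPU copies the message from its cache to the HCA through the WC buffer."""
    state = create_message(state, c)
    c.io_bus_writes += 1               # payload pushed by the CPU
    return state                       # line stays Modified


def send_dma(state, c):
    """CPU rings a doorbell (assumed); the HCA then fetches the payload by DMA."""
    state = create_message(state, c)
    c.io_bus_writes += 1               # doorbell write (assumption)
    c.io_bus_reads += 1                # DMA read issued by the HCA
    c.coherence_transactions += 1      # snoop hits the Modified line
    return "S"                         # line ends up Shared, as described above


def simulate(send, n_messages):
    """Send n_messages from the same buffer, starting with a cold cache line."""
    c = Counters()
    state = "I"
    for _ in range(n_messages):
        state = send(state, c)
    return state, c


if __name__ == "__main__":
    for name, send in (("PIO", send_pio), ("DMA", send_dma)):
        final_state, counters = simulate(send, 4)
        print(name, final_state, counters)
```

In this model, reusing the same buffer costs the PIO path one coherence transaction in total (the initial fill), while the DMA path pays two per message: the snoop, and the upgrade back to Modified on the next write. It says nothing about CPU time spent copying, which is the other side of the tradeoff.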

>That's an interesting theory, but I suspect your numbers are a little
>out.  My own measurements put a PIO word write in the region of .15 uSec
>depending on chipset.  Of course if you are right then the remaining PIO
>write is happening in 1 uSec which leaves only .2uSec for the network
>which seems a little fast to me.

Just to make sure we are comparing the same thing: 
is the 0.15 usec the time from the CPU issuing the 
store instruction until the side effect is 
visible in the HCA? In other words, assuming a CSR 
word read takes 0.5 usec, a loop writing and 
reading the same CSR takes 0.65 usec, right? If 
that is the case, CSR accesses have improved 
radically in the last few years.
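
The arithmetic behind that question, as a small sketch. It assumes the write and the read are strictly serialized, so the loop time is simply their sum; the 0.5 usec read time is my assumed figure from above, not a measurement, and the function names are made up for illustration.

```python
def write_read_loop_usec(write_usec, read_usec):
    """Time for one iteration that writes a CSR and then reads it back,
    assuming the read cannot complete before the write has reached the HCA."""
    return write_usec + read_usec


def implied_write_usec(loop_usec, read_usec):
    """Inverse view: the write cost implied by a measured loop time."""
    return loop_usec - read_usec


if __name__ == "__main__":
    loop = write_read_loop_usec(0.15, 0.5)
    print(f"write+read loop: {loop:.2f} usec")
    print(f"implied write:   {implied_write_usec(loop, 0.5):.2f} usec")
```

The point of the inverse form is that timing the store alone may only measure how long it takes to post the write, whereas the write+read loop forces the write to reach the device before the read returns.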

Håkon