[Beowulf] 1.2 us IB latency?

Tue Apr 24 08:55:40 PDT 2007

On Sat, 2007-04-21 at 13:16 +0200, Håkon Bugge wrote:
> PIO is a term with two different 
> interpretations. For a shared address space NIC, 
> such as Dolphin's SCI adapters, PIO implies a 
> sender CPU to write data directly into the user 
> space of a remote process on a remote node. The 
> cluster interconnect emulates a PCI to PCI bridge 
> in this case. On other NICs, PIO implies using 
> the processor to transmit the DMA description and 
> the data to the local NIC. Then the local NIC 
> issues a DMA to transmit the data/message to the 
> remote node from a local buffer on the NIC. The 
> main point is the local NIC doesn't have to issue 
> a DMA read to local memory in order to read the DMA descriptor and data.

That would explain why QLogic use PIO for up to 64k messages and we
switch to DMA at only a few hundred.  For small messages, what we use is
best described as a hybrid of the above descriptions: we write a
network packet across the PCI bus and don't DMA at all.

The downside to PIO, of course, is that you need a CPU to drive it, so
besides the fact that it's slow you can't do anything asynchronously.

> So, when Mellanox reduces the latency from around 
> 4 to around 1 usec, I assume they have modified 
> the hardware-software interface of their HCA to 
> enable PIO mode send operations, where DMA 
> descriptor+data is transmitted on the PCI(e) bus 
> using a single WC bus tenure.  I haven't used a 
> PCI analyzer on their HCAs, but a rule of thumb 
> is that every I/O operation to a NIC takes on the 
> order of 1 usec. So maybe they have managed to go 
> from 3 to one I/O operation in order to kick off 
> a transfer. Pure speculation from my side though.

That's an interesting theory, but I suspect your numbers are a little
out.  My own measurements put a PIO word write in the region of 0.15 usec,
depending on chipset.  Of course, if you are right, then the remaining PIO
write is happening in 1 usec, which leaves only 0.2 usec for the network,
which seems a little fast to me.
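
The arithmetic behind that objection can be sketched as below.  This is
only a back-of-envelope illustration: the function name `network_budget`
and the assumption that exactly one host I/O operation remains on the send
path are illustrative choices, not anything known about the Mellanox
hardware.

```python
# Back-of-envelope latency budget; all figures are in microseconds.
TOTAL_US = 1.2            # quoted end-to-end latency (subject line)
IO_OP_US = 1.0            # quoted rule of thumb: ~1 us per I/O operation
PIO_WORD_WRITE_US = 0.15  # measured PIO word write, chipset dependent


def network_budget(total_us, host_io_us, io_ops=1):
    """Time left for everything other than the host I/O operations.

    io_ops=1 is an assumption: a single remaining I/O operation is
    needed to kick off the transfer.
    """
    return total_us - io_ops * host_io_us


# If one I/O operation really costs ~1 us, very little is left over.
print(round(network_budget(TOTAL_US, IO_OP_US), 2))

# With a 0.15 us PIO write, the remainder looks far more plausible.
print(round(network_budget(TOTAL_US, PIO_WORD_WRITE_US), 2))
```

Under the 1 usec rule of thumb the remainder comes out at about 0.2 usec;
with the 0.15 usec figure it is about 1.05 usec.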

Regardless of how they have done it, 1.2 is impressive; what would make
me even more impressed is if it were quoted as 1.20, which would, as far
as I'm aware, mean that they had the lowest latency of anybody.

Ashley,