[Beowulf] 1.2 us IB latency?
kball at pathscale.com
Tue Apr 24 15:06:56 PDT 2007
On Tue, 2007-04-24 at 08:55, Ashley Pittman wrote:
> On Sat, 2007-04-21 at 13:16 +0200, Håkon Bugge wrote:
> > PIO is a term with two different
> > interpretations. For a shared address space NIC,
> > such as Dolphin's SCI adapters, PIO implies that the
> > sender CPU writes data directly into the user
> > space of a remote process on a remote node. The
> > cluster interconnect emulates a PCI to PCI bridge
> > in this case. On other NICs, PIO implies using
> > the processor to transmit the DMA descriptor and
> > the data to the local NIC. Then the local NIC
> > issues a DMA to transmit the data/message to the
> > remote node from a local buffer on the NIC. The
> > main point is the local NIC doesn't have to issue
> > a DMA read to local memory in order to read the DMA descriptor and data.
> That would explain why QLogic use PIO for up to 64k messages and we
> switch to DMA at only a few hundred. For small messages you could best
> describe what we use as a hybrid of the above descriptions: we write
> a network packet across the PCI bus and don't DMA at all.
> The downside to PIO of course is you need a CPU to drive it, so besides
> the fact it's slow you can't do anything asynchronously.
> > So, when Mellanox reduces the latency from around
> > 4 to around 1 usec, I assume they have modified
> > the hardware-software interface of their HCA to
> > enable PIO mode send operations, where DMA
> > descriptor+data is transmitted on the PCI(e) bus
> > using a single WC bus tenure. I haven't used a
> > PCI analyzer on their HCAs, but a rule of thumb
> > is that every I/O operation to a NIC takes on the
> > order of 1 usec. So maybe they have managed to go
> > from 3 to one I/O operation in order to kick off
> > a transfer. Pure speculation from my side though.
> That's an interesting theory, but I suspect your numbers are a little
> out. My own measurements put a PIO word write in the region of .15 uSec
> depending on chipset. Of course if you are right then the remaining PIO
> write is happening in 1 uSec which leaves only .2uSec for the network
> which seems a little fast to me.
> Regardless of how they have done it, 1.2 is impressive; what would make
> me even more impressed is if it were quoted as 1.20, which would, as far as
> I'm aware, mean that they had the lowest latency of anybody.
This is true if the 1.2 number is quoted through a switch, but as I
understand it Mellanox quotes back-to-back numbers as their latency
numbers. I have measured QLogic HTX adapters within 50ns of 1.0 usec if
going back to back, but no one I'm aware of actually uses IB that way;
everyone wants to run in a cluster with more than 2 nodes using a
switch, so that's how we quote our latency.
Disclosure: in case it's not clear from the above, I do work at QLogic,
but anyone with our HT cards can reproduce the above for themselves.
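
For anyone who wants to follow the arithmetic in the quoted exchange, here
is a rough sketch in Python. It only restates the ballpark figures from the
thread (the 1.2 usec quoted latency, the 1 usec per I/O operation rule of
thumb, and the roughly .15 usec measured PIO word write); the function name
and the idea of a single "remaining budget" are illustrative assumptions,
not anything measured on real hardware.

```python
# All figures are the rough numbers quoted in the thread, in microseconds.
QUOTED_LATENCY_US = 1.2        # end-to-end figure under discussion
RULE_OF_THUMB_IO_US = 1.0      # rule-of-thumb cost of one I/O operation to a NIC
MEASURED_PIO_WRITE_US = 0.15   # measured PIO word write (chipset dependent)


def remaining_budget_us(total_us, io_ops, per_io_us):
    """Time left for everything else (wire, remote side, and so on) after
    subtracting the host-to-NIC I/O operations from the total latency.

    Hypothetical helper: it assumes the I/O operations are strictly
    serial and nothing overlaps, which is a simplification.
    """
    return total_us - io_ops * per_io_us


if __name__ == "__main__":
    cases = (
        ("rule of thumb, 1 I/O op", RULE_OF_THUMB_IO_US),
        ("measured PIO write, 1 I/O op", MEASURED_PIO_WRITE_US),
    )
    for label, per_io in cases:
        left = remaining_budget_us(QUOTED_LATENCY_US, 1, per_io)
        print(f"{label}: {left:.2f} us left for the rest of the path")
```

With the 1 usec rule of thumb only about .2 usec is left for the network,
which is the objection raised in the quoted text; with the measured PIO
figure the budget for the rest of the path is much more plausible.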