[Beowulf] 1.2 us IB latency?
Håkon Bugge Hakon.Bugge at scali.com
Wed Apr 25 02:31:05 PDT 2007
At 17:55 24.04.2007, Ashley Pittman wrote:
>That would explain why qlogic use PIO for up to 64k messages and we
>switch to DMA at only a few hundred. For small messages you could best
>describe what we use as a hybrid of the above descriptions, we write the
>a network packet across the PCI bus and don't DMA at all.

I assume QsNet has to do something with the packet after it has been written to the HCA. Since the outbound PCI address space is only 32 bits (who needs more than 4 GB of CSR space, other than cluster people attempting to map all the accumulated memory of the nodes in the cluster into a single address space?), I assume QsNet uses part of the packet as 64-bit address information and starts a DMA from the HCA local buffer to the remote destination.

>The downside to PIO of course is you need a CPU to drive it so besides
>the fact it's slow you can't make do anything asynchronously.

This is a classic tradeoff. Most applications _create_ the message before it is sent (contrary to many p2p benchmarks). Hence, it resides in the L1 or L2 cache of the CPU in the (MOESI) Modified state. It is then very efficient to use the CPU to read its local cache and write the message using the WC buffer. In contrast, the HCA has to issue a DMA read to memory, the CPU cache(s) are snooped, and the data is transferred to the memory _and_ to the HCA. The cache line ends up in the Shared state, and a bus transaction is required to make it Modified again (when the buffer is written the next time).

>That's an interesting theory, but I suspect your numbers are a little
>out. My own measurements put a PIO word write in the region of .15 uSec
>depending on chipset. Of course if you are right then the remaining PIO
>write is happening in 1 uSec which leaves only .2uSec for the network
>which seems a little fast to me.

Just to make sure we compare the same thing: the .15usec is the time from the CPU issuing the store instruction until the side effect is visible in the HCA?
In other words, assuming a CSR word read takes 0.5usec, a loop writing and reading the same CSR takes 0.65usec, right? If that is the case, CSR accesses have improved radically in recent years.

Håkon
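
For anyone wanting to check the arithmetic in the question above, here is a minimal Python sketch. The constant and function names are made up for illustration, and the figures are only the ones quoted in this thread (0.15 usec write, 0.5 usec read, 1.2 usec total, 1 usec attributed to PIO), not new measurements.

```python
# Figures quoted in the thread, all in microseconds.
CSR_WRITE_US = 0.15      # measured PIO word write (Ashley's number)
CSR_READ_US = 0.5        # assumed CSR word read (Håkon's number)
TOTAL_LATENCY_US = 1.2   # headline latency from the subject line
PIO_SHARE_US = 1.0       # time attributed to the remaining PIO write


def write_read_loop_us(write_us, read_us):
    """Cost of one iteration that writes a CSR and then reads it back."""
    return write_us + read_us


def network_budget_us(total_us, host_us):
    """Time left for the network once the host-side cost is subtracted."""
    return total_us - host_us


loop_us = write_read_loop_us(CSR_WRITE_US, CSR_READ_US)
left_us = network_budget_us(TOTAL_LATENCY_US, PIO_SHARE_US)
print(f"write+read loop: {loop_us:.2f} usec")
print(f"left for network: {left_us:.2f} usec")
```

The sketch only adds and subtracts the quoted numbers; it assumes the write and the read costs are simply additive, which is exactly the assumption the question is probing.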
