[Beowulf] Re: Re: Home beowulf - NIC latencies
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
Patrick Geoffray patrick at myri.comMon Feb 14 15:09:27 PST 2005
- Previous message: [Beowulf] Re: Re: Home beowulf - NIC latencies
- Next message: [Beowulf] Re: Re: Home beowulf - NIC latencies
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi Greg, Greg Lindahl wrote: > On Mon, Feb 14, 2005 at 06:47:15PM +0300, Mikhail Kuzminsky wrote: > > >>Let me ask some stupid's question: which MPI implementations allow >>really >> >>a) to overlap MPI_Isend w/computations >>and/or >>b) to perform a set of subsequent MPI_Isend calls faster than "the >>same" set of MPI_Send calls ? >> >>I say only about sending of large messages. > > > For large messages, everyone does (b) at least partly right. (a) is > pretty rare. It's difficult to get (a) right without hurting short > message performance. One of the commercial MPIs, at first release, had Many believe you just need RDMA support to overlap com and comp, but it's not enough. Zero-copy is needed because the copy is obviously a waste of host CPU (along with cache trashing), but the real problem is matching. Ron did a lot of work in Portals to offload the matching, because it is a big synchronization point: if you send a message and you need the CPU on the receive side to find the appropriate receive buffer, you cannot tell the user that it can have the CPU between the time he posts the MPI_Irecv() and the time he checks on it with MPI_Wait(). What will happen is the matching occurs in the MPI_Wait() and overlap goes to the toilettes. There are several ways to work around it: 1) You can have a thread on the receive side and wake it up with an interrupt. If you do that for all receives, then you add ~10 us in the critical path and the small message latency goes to the same place the overlap went before. This was what I believe the commercial MPI was doing at first. 2) If you can take decisions at the NIC level, you can receive small messages eagerly (with a copy) and fire an interrupt only for large messages (you want to steal some CPU cycles for matching). This is not bad, you steal (~5 us + cost of matching) worth of CPU cycles for large messages, that's not much for most people. 3) You can have the NIC doing the matching. Obviously the NIC is not as fast as the host CPU, so it's more expensive: you don't want to do that for small messages, it will hurt your latency. But you still has to do it for all messages to keep the matching order. One solution is to still receive small messages eagerly but match them in the shadow of the NIC->host DMA just to keep the list of posted receives consistent. For large messages, you match in the NIC in the critical path and you don't need the host CPU (assuming that the matched receive is in the small number that is kept on the NIC). It's still not obvious if 3) is worth it, it's much more complex to implement and 5us per large receive is not that big. And you can reduce that overhead with MSIs (on PCIe, only the Alpha Marvel provided MSI on PCI-X, AFAIK). There are more exotic work-arounds, like using 1) and polling at the same time, and hiding the interrupt overhead with some black magic on another processor. The one with the best potential would be to use HyperThreading on Intel chips to have a polling thread burning cycles continuously; it will run in-cache, won't use the FP unit or waste memory cycles. A perfect use for the otherwise useless HT feature. I wonder why nobody went that way... > right was more important. They've improved their short message > performance since, but I still haven't seen any real application > benchmarks that show benefit from their approach. That's the classical chicken-egg problem: Are people not trying to overlap in MPI because it is not implemented, or MPI implementations don't implement it because applications don't try to overlap ? I think it's the later, too complicated for most. Do you know the story/joke about the Physicist and unexpected messages ? Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com
- Previous message: [Beowulf] Re: Re: Home beowulf - NIC latencies
- Next message: [Beowulf] Re: Re: Home beowulf - NIC latencies
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
