[Beowulf] Re: Re: Home beowulf - NIC latencies
Rossen Dimitrov rossen at VerariSoft.Com
Mon Feb 14 14:32:57 PST 2005
Greg Lindahl wrote:
> On Mon, Feb 14, 2005 at 06:47:15PM +0300, Mikhail Kuzminsky wrote:
>
>> Let me ask some stupid's question: which MPI implementations allow
>> really
>>
>> a) to overlap MPI_Isend w/computations
>> and/or
>> b) to perform a set of subsequent MPI_Isend calls faster than "the
>> same" set of MPI_Send calls ?
>>
>> I say only about sending of large messages.
>
> For large messages, everyone does (b) at least partly right. (a) is
> pretty rare. It's difficult to get (a) right without hurting short
> message performance. One of the commercial MPIs, at first release, had
> very slow short message performance because they thought getting (a)
> right was more important. They've improved their short message
> performance since, but I still haven't seen any real application
> benchmarks that show benefit from their approach.

There is quite a bit of published data showing that, for a number of real application codes, a modest increase in MPI latency for very short messages has no impact on application performance. This can also be seen by doing traffic characterization, weighing the relative impact of the increased latency, and taking into account the computation/communication ratio.

On the other hand, what you give application developers with an interrupt-driven MPI library is a higher potential for effective overlapping, which they can choose to utilize or not. Unless they send only very short messages, they will not see a negative performance impact from using this library.

There is evidence that re-coding the MPI part of an application to take advantage of overlapping and asynchrony, when the MPI library (and network) support these well, actually leads to a real performance benefit.
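The traffic-characterization argument above can be illustrated with a back-of-the-envelope model. This is only a sketch under stated assumptions, not data from the published studies: each message is assumed to cost latency plus size/bandwidth, nothing is overlapped, and the function names and all parameters are hypothetical.

```python
# Sketch: relative runtime impact of extra per-message latency.
# Assumed linear cost model: time(message) = latency + size / bandwidth.
# All names and numbers are illustrative, not measurements.

def message_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to send one message under the assumed linear cost model."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def runtime_increase(compute_s, messages, latency_s, extra_latency_s,
                     bandwidth_bytes_per_s):
    """Fractional increase in total runtime when every message pays
    extra_latency_s on top of latency_s.

    messages is a list of (count, size_bytes) pairs, i.e. a very coarse
    traffic characterization of the application.
    """
    base_comm = sum(
        count * message_time(size, latency_s, bandwidth_bytes_per_s)
        for count, size in messages
    )
    extra = sum(count for count, _ in messages) * extra_latency_s
    return extra / (compute_s + base_comm)


if __name__ == "__main__":
    # Coarse-grained code: few large messages, lots of computation.
    coarse = runtime_increase(10.0, [(1000, 1_000_000)], 5e-6, 5e-6, 1e9)
    # Fine-grained code: many tiny messages, little computation.
    fine = runtime_increase(1.0, [(1_000_000, 64)], 5e-6, 5e-6, 1e9)
    print(f"coarse-grained: {coarse:.4%}  fine-grained: {fine:.4%}")
```

With inputs like these, doubling the latency is lost in the noise for the coarse-grained case and dominates the fine-grained one, which is the distinction the computation/communication ratio is meant to capture.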
There is also evidence that, even without changing anything in the code, just running the same code with an MPI library that plays more nicely with the system leads to better application performance by improving the overall "application progress", a loose term I use to describe all of the complex system activities that need to occur during the life cycle of a parallel application, not only on a single node but on all nodes collectively.

The question of short message latency is connected to system scalability in at least one important scenario: running the same problem size as fast as possible by adding more processors. This leads to smaller messages, which are much more sensitive to overhead, and thus hurts scalability. In other practical scenarios, though, users increase the problem size as the cluster size grows, or they solve multiple instances of the same problem concurrently. Either way, message sizes stay away from the extremely small sizes that result from maximum-scale runs, which limits the impact of shortest-message latency.

I have seen many large clusters whose only job run across all nodes is HPL for the top500 number. After that, the system is either controlled by a job scheduler, which limits the size of jobs to about 30% of all processors (an empirically derived number that supposedly improves the overall job throughput), or it is physically or logically divided into smaller sub-clusters.

All this being said, there is obviously a large group of codes that use small messages no matter what size problem they solve or what the cluster size is. For these, the lowest latency will be the most important (if not the only) optimization parameter. In these cases, users can just run the MPI library in polling mode.
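To make the scaling argument concrete, here is a small sketch of how message sizes move under the two scenarios. It assumes a hypothetical cubic grid split evenly across a cubic process grid, with each process exchanging one face of its local block per neighbor; the decomposition, the 8 bytes per cell, and all names are assumptions for illustration only.

```python
# Sketch: halo message size under fixed-size vs. scaled-size runs.
# Assumed setup: global_n^3 cells split evenly over procs_per_dim^3
# processes; each message carries one face of the local block.

def halo_bytes(global_n, procs_per_dim, bytes_per_cell=8):
    """Bytes in one face exchange for the assumed decomposition."""
    local_n = global_n // procs_per_dim
    return local_n * local_n * bytes_per_cell


def fixed_problem_sizes(global_n, procs_per_dim_list):
    """Same problem on more processors: messages shrink."""
    return [halo_bytes(global_n, p) for p in procs_per_dim_list]


def scaled_problem_sizes(local_n, procs_per_dim_list):
    """Problem grows with the machine: messages stay the same size."""
    return [halo_bytes(local_n * p, p) for p in procs_per_dim_list]


if __name__ == "__main__":
    dims = [1, 2, 4, 8]  # 1, 8, 64, 512 processes
    print("fixed problem :", fixed_problem_sizes(512, dims))
    print("scaled problem:", scaled_problem_sizes(64, dims))
```

In the fixed-size run the face message drops by a factor of four each time the process grid doubles per dimension, while in the scaled run it does not change, which is why the second scenario is far less sensitive to short-message latency.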
With regard to the assessment that every MPI library does (a) partly right, I'd like to mention that I have seen behavior where attempting to overlap computation and communication leads to no performance improvement at all or, even worse, to performance degradation. This is one example of how a particular implementation of a standard API can affect the way users code against it. I use a metric called "degree of overlapping", which for "good" systems approaches 1, for "bad" systems approaches 0, and for terrible systems becomes negative... Here goodness is measured as how well the system facilitates overlapping.

Rossen

>
> -- greg
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
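The message does not give a formula for the "degree of overlapping", so the following is only one plausible way to define a metric with the described behavior, not the author's definition. It assumes three measured times: computation alone, communication alone, and the two attempted together; the function name and arguments are hypothetical.

```python
# Sketch of an ASSUMED definition of "degree of overlapping".
# The original message only states the range of the metric
# (near 1 = good, near 0 = bad, negative = terrible).

def degree_of_overlap(t_compute, t_comm, t_combined):
    """Fraction of the shorter activity hidden behind the longer one.

    t_compute  : time for the computation alone
    t_comm     : time for the communication alone
    t_combined : time when both are attempted concurrently

    Perfect overlap gives t_combined == max(t_compute, t_comm) -> 1.
    No overlap gives t_combined == t_compute + t_comm -> 0.
    If trying to overlap is slower than running serially -> negative.
    """
    shorter = min(t_compute, t_comm)
    if shorter <= 0:
        raise ValueError("both times must be positive")
    return (t_compute + t_comm - t_combined) / shorter


if __name__ == "__main__":
    print(degree_of_overlap(2.0, 1.0, 2.0))  # communication fully hidden
    print(degree_of_overlap(2.0, 1.0, 3.0))  # effectively serialized
    print(degree_of_overlap(2.0, 1.0, 3.5))  # overlap attempt made it worse
```

In practice the three times would come from timing a non-blocking send followed by computation and a wait, compared against the same send and computation run back to back.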
