[Beowulf] Lowered latency with multi-rail IB?
Dow Hurst DPHURST DPHURST at uncg.edu
Thu Mar 26 22:46:46 PDT 2009
To: beowulf at beowulf.org
From: Greg Lindahl <lindahl at pbm.com>
Sent by: beowulf-bounces at beowulf.org
Date: 03/27/2009 12:03AM
Subject: Re: [Beowulf] Lowered latency with multi-rail IB?

On Thu, Mar 26, 2009 at 11:32:23PM -0400, Dow Hurst DPHURST wrote:

> We've got a couple of weeks max to finalize spec'ing a new cluster. Has
> anyone knowledge of lowering latency for NAMD by implementing a
> multi-rail IB solution using MVAPICH or Intel's MPI?

Multi-rail is likely to increase latency.

BTW, Intel MPI usually has higher latency than other MPI implementations.

If you look around for benchmarks you'll find that QLogic InfiniPath
does quite well on NAMD and friends, compared to that other brand of
InfiniBand adaptor. For example, at

http://www.ks.uiuc.edu/Research/namd/performance.html

the lowest line == best performance is InfiniPath. Those results
aren't the most recent, but I'd bet that the current generation of
adaptors has the same situation.

-- Greg
(yeah, I used to work for QLogic.)

I'm very familiar with that benchmark page. ;-)

One motivation for designing an MPI layer to lower latency with multi-rail is the use of accelerator cards or GPUs. There is so much more work being done per node that the interconnect quickly becomes the limiting factor. One Tesla GPU is equal to 12 cores for the current implementation of NAMD/CUDA, so the scaling efficiency really suffers. I'd like to see how someone could scale efficiently beyond 16 IB connections with only two GPUs per IB connection when running NAMD/CUDA. Some codes, such as VMD's cionize utility, are sped up far beyond 12x and reach 100x. I don't think that particular code requires parallelization (not sure).
However, as NAMD/CUDA is tuned, GPU efficiency increases, and new bottlenecks in previously ignored sections of code are found and fixed, the speedup will exceed 12x. So a solution to the interconnect bottleneck needs to be developed, and I wondered if multi-rail would be the answer.

Thanks so much for your thoughts!

Best wishes,
Dow
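
The figures in the message above (12 core-equivalents per Tesla GPU, two GPUs per IB connection, 16 IB connections) can be turned into a rough count of how much compute sits behind each link. The sketch below only does that counting; it does not model latency, bandwidth, or NAMD's communication pattern, and the function names are made up for illustration.

```python
# Back-of-the-envelope count using only the figures quoted in the message.
CORES_PER_GPU = 12   # "One Tesla GPU is equal to 12 cores" for NAMD/CUDA
GPUS_PER_LINK = 2    # two GPUs per IB connection
IB_LINKS = 16        # 16 IB connections


def core_equivalents_per_link(gpus_per_link, speedup_per_gpu):
    """CPU-core equivalents of compute that share one IB connection."""
    return gpus_per_link * speedup_per_gpu


def total_core_equivalents(links, gpus_per_link, speedup_per_gpu):
    """CPU-core equivalents across all IB connections."""
    return links * core_equivalents_per_link(gpus_per_link, speedup_per_gpu)


per_link = core_equivalents_per_link(GPUS_PER_LINK, CORES_PER_GPU)
total = total_core_equivalents(IB_LINKS, GPUS_PER_LINK, CORES_PER_GPU)

print(f"core-equivalents per IB link: {per_link}")
print(f"core-equivalents across {IB_LINKS} links: {total}")
```

Raising `CORES_PER_GPU` toward the larger speedups mentioned in the message shows how the compute behind each link grows while the number of links stays fixed, which is the imbalance the post is worried about.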
