[Beowulf] Lowered latency with multi-rail IB?

Puthanveettil Prabhakaran Prajeev prajeev at tuxcentrix.com
Thu Mar 26 23:35:47 PDT 2009


http://www.penguincomputing.com/cluster_computing

Can the above be of any help to you?

Regards
Prajeev

On Fri, Mar 27, 2009 at 11:16 AM, Dow Hurst DPHURST <DPHURST at uncg.edu> wrote:

> To: beowulf at beowulf.org
> From: Greg Lindahl <lindahl at pbm.com>
> Sent by: beowulf-bounces at beowulf.org
> Date: 03/27/2009 12:03AM
> Subject: Re: [Beowulf] Lowered latency with multi-rail IB?
>
> On Thu, Mar 26, 2009 at 11:32:23PM -0400, Dow Hurst DPHURST wrote:
>
> > We've got a couple of weeks max to finalize spec'ing a new cluster.  Does
> > anyone have experience lowering latency for NAMD by implementing a
> > multi-rail IB solution using MVAPICH or Intel's MPI?
>
> Multi-rail is likely to increase latency.
>
> BTW, Intel MPI usually has higher latency than other MPI
> implementations.
>
> If you look around for benchmarks you'll find that QLogic InfiniPath
> does quite well on NAMD and friends, compared to that other brand of
> InfiniBand adaptor. For example, at
>
> http://www.ks.uiuc.edu/Research/namd/performance.html
>
> the lowest line (i.e. best performance) is InfiniPath. Those results
> aren't the most recent, but I'd bet the current generation of
> adaptors is in the same situation.
>
> -- Greg
> (yeah, I used to work for QLogic.)
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>
> I'm very familiar with that benchmark page.  ;-)
>
> One motivation for designing an MPI layer that lowers latency with
> multi-rail is when making use of accelerator cards or GPUs.  There is so
> much more work being done per node that the interconnect quickly becomes
> the limiting factor.  One Tesla GPU is equal to 12 cores for the current
> implementation of NAMD/CUDA, so the scaling efficiency really suffers.
> I'd like to see how someone could scale efficiently beyond 16 IB
> connections with only two GPUs per IB connection when running NAMD/CUDA.
>
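> Here's a back-of-the-envelope sketch in Python of why the interconnect
> gets exposed.  The 12-cores-per-GPU figure is the one quoted above; the
> node layout (8 cores, 2 GPUs, and 1 IB rail per node) is only an
> assumption for the sake of the arithmetic.
>
>     # Rough ratio of compute work behind each IB rail, before and after
>     # GPU acceleration.  All node-layout numbers are illustrative
>     # assumptions, not measurements.
>     cores_per_node = 8       # assumed dual-socket quad-core node
>     gpus_per_node = 2        # assumed two Tesla GPUs per node
>     rails_per_node = 1       # single-rail baseline
>     gpu_equiv_cores = 12     # 1 Tesla ~ 12 cores for NAMD/CUDA (see above)
>
>     cpu_work = cores_per_node
>     gpu_work = gpus_per_node * gpu_equiv_cores
>
>     print("work per rail, CPU only:", cpu_work / rails_per_node)   # -> 8.0
>     print("work per rail, with GPUs:", gpu_work / rails_per_node)  # -> 24.0
>     print("increase:", gpu_work / cpu_work)                        # -> 3.0
>
>     # Each rail now has ~3x the compute behind it, so communication that
>     # used to hide behind computation becomes the limiting factor.
>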
> Some codes are sped up far beyond 12x, reaching 100x, such as VMD's
> cionize utility.  I don't think that particular code requires
> parallelization (not sure).  However, as NAMD/CUDA is tuned, the
> efficiency on the GPU increases, and new bottlenecks are found and fixed
> in previously ignored sections of code, the speedup will grow well beyond
> 12x.  So a solution to the interconnect bottleneck needs to be developed,
> and I wondered whether multi-rail would be the answer.  Thanks so much
> for your thoughts!
> Best wishes,
> Dow
>