[Beowulf] many cores and ib

Gilad Shainer Shainer at mellanox.com
Mon May 5 15:32:10 PDT 2008


 
> >> Since we have some users that need
> >> shared memory but also we want to build a normal cluster for mpi 
> >> apps, we think that this could be a solution. Let's say about
> >> 8 machines (96 processors) plus InfiniBand. Does it sound correct?
> >> I'm aware of the bottleneck that means having one ib interface for 
> >> the mpi cores, is there any possibility of bonding?
> 
> > Bonding (or multi-rail) does not make sense with "standard IB" in
> > PCIe x8 since the PCIe connection limits the transfer rate of a
> > single IB link already.
> 
> PCIe x8 Gen2 provides additional bandwidth as Gilad said.  On Opteron
> systems that is not available yet (and won't be for some time), so you
> may want to search for AMD-CPU or Intel-CPU based boards that have
> PCIe x16 slots.
> 

One more useful piece of info: there are a couple of installations in
Japan that use 4 "regular IB DDR" adapters in 4 PCIe x8 slots to provide
6 GB/s (1500 MB/s per slot), and they bond them so it behaves as a single
pipe. If you plan to use Intel, you can use PCIe Gen2 with IB QDR and get
3200 MB/s per PCIe Gen2 slot.
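
For a rough feel for those numbers, here is a small back-of-the-envelope
sketch (Python, illustrative arithmetic only, using the figures quoted in
this thread; real throughput also depends on PCIe overhead and the MPI
stack):

# Illustrative arithmetic only -- figures taken from this thread.
ddr_per_x8_slot = 1500      # MB/s, "regular IB DDR" adapter in a PCIe x8 slot
rails = 4                   # the Japanese installations bond 4 such adapters
qdr_per_gen2_slot = 3200    # MB/s, IB QDR in a PCIe Gen2 slot
cores_per_node = 8          # dual-socket quad-core node, as discussed below

bonded = rails * ddr_per_x8_slot
print("4-rail DDR aggregate:  %d MB/s (%d MB/s per core)"
      % (bonded, bonded // cores_per_node))
print("single QDR Gen2 slot:  %d MB/s (%d MB/s per core)"
      % (qdr_per_gen2_slot, qdr_per_gen2_slot // cores_per_node))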


> > My hint would be to go for InfiniPath from QLogic or the new ConnectX
> > from Mellanox since message rate is probably your limiting factor and
> > those technologies have a huge advantage over standard InfiniBand
> > SDR/DDR.
> 
> I agree that message rate may be your limiting factor.
> Results with QLogic (aka InfiniPath) DDR adapters:
> 
> DDR Adapter   Slot       Peak MPI Bandwidth   Peak Message Rate (no message coalescing**)
> QLE7280       PCIe x16   1950 MB/s            20-26* Million/sec (8 ppn)
> QLE7240       PCIe x8    1500 MB/s            19 Million/sec (8 ppn)
> 
> Test details:  All runs were on two nodes, each with 2x Intel Xeon
> 5410 (Harpertown, quad-core, 2.33 GHz) CPUs, 8 cores per node,
> SLES 10, except:
> * 26 Million messages/sec requires faster CPUs, 3 to 3.2 GHz.
> 
> 8 ppn means 8 MPI processes per node.  The non-coalesced 
> message rate performance of these adapters scales pretty 
> linearly from 1 to 8 cores.
> That is not the case with all modern DDR adapters.
> 

As Tom wrote, the message rate depends on the number of CPUs. With the
benchmark Tom indicated below and the same CPU, you can get up to 42M
msg/sec with ConnectX.
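
For anyone curious how such a multi-pair message-rate test works, here is
a minimal sketch in Python/mpi4py, in the spirit of osu_mbw_mr but NOT the
OSU code itself (mpi4py, the window size and the iteration count are my
own assumptions; run it with 8 ranks on each of two nodes):

# Minimal multi-pair message-rate sketch (assumption: mpi4py + numpy
# installed).  First half of the ranks send, second half receive.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
pairs = size // 2              # number of sender/receiver pairs
window = 64                    # outstanding non-blocking messages per iteration
iters = 1000
msg = np.zeros(8, dtype='b')   # small 8-byte message, no coalescing logic here

comm.Barrier()
t0 = MPI.Wtime()
if rank < pairs:               # senders (place these ranks on node 0)
    peer = rank + pairs
    for _ in range(iters):
        reqs = [comm.Isend(msg, dest=peer) for _ in range(window)]
        MPI.Request.Waitall(reqs)
else:                          # receivers (place these ranks on node 1)
    peer = rank - pairs
    bufs = [np.empty(8, dtype='b') for _ in range(window)]
    for _ in range(iters):
        reqs = [comm.Irecv(bufs[i], source=peer) for i in range(window)]
        MPI.Request.Waitall(reqs)
comm.Barrier()                 # everyone done before stopping the clock
elapsed = MPI.Wtime() - t0

if rank == 0:
    total = pairs * iters * window
    print("aggregate message rate: %.1f Million msgs/sec"
          % (total / elapsed / 1e6))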


> Benchmark = OSU Multiple Bandwidth / Message Rate benchmark
> (osu_mbw_mr.c).  The above performance results can be had with either
> MVAPICH 1.0 or QLogic MPI 2.2 (other MPIs are in the same ballpark
> with these adapters).
> 
> Note that MVAPICH 0.9.9 had message-coalescing on by default, and
> MVAPICH 1.0 has it off by default.  There must be a reason.

As far as I know, the reason for that was to let the user make his own
choice. As OSU mentioned, there are some applications where this helps
and some where it does not.
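
Just to make that trade-off concrete: with coalescing, several small MPI
messages ride in one network transfer, so the MPI-level message rate a
benchmark reports can exceed the adapter's raw non-coalesced rate. A toy
calculation in Python (made-up numbers, not measurements):

# Toy numbers only -- not measurements from any adapter.
transfers_per_sec = 20e6      # hypothetical network sends/sec the adapter sustains
msgs_per_transfer = 4         # hypothetical MPI messages packed into each send
apparent_rate = transfers_per_sec * msgs_per_transfer
print("apparent MPI message rate with coalescing: %.0f Million/sec"
      % (apparent_rate / 1e6))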

Gilad.

> 
> Revisiting:
> >
> > Bonding (or multi-rail) does not make sense with "standard IB" in
> > PCIe x8 since the PCIe connection limits the transfer rate of a
> > single IB link already.
> 
> Some 4-socket motherboards have independent PCIe buses to x8 or x16
> slots.  In this case, multi-rail does make sense.  You can run the
> QLogic adapters as dual-rail without bonding.  On MPI applications,
> half of the cores will use one adapter and half will use the other.
> Whether the more expensive dual-rail arrangement is necessary and/or
> cost-effective would be very application-specific.
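
How the ranks get split across the two rails is up to the MPI stack, but
conceptually each process just picks an adapter from its local rank, along
these lines (hypothetical Python sketch, not QLogic's actual selection
code):

# Hypothetical illustration of dual-rail without bonding: each MPI process
# on a node picks one of the node's adapters based on its local rank.
def pick_hca(local_rank, ranks_per_node=8, num_hcas=2):
    """Return the index of the adapter this local rank should use."""
    return local_rank * num_hcas // ranks_per_node

# With 8 ranks per node and 2 adapters, ranks 0-3 use adapter 0 and ranks
# 4-7 use adapter 1 -- "half of the cores use one adapter, half the other".
print([pick_hca(r) for r in range(8)])    # -> [0, 0, 0, 0, 1, 1, 1, 1]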
> 
> Regards,
> -Tom Elken
>  
> > InfiniPath and ConnectX are available as DDR InfiniBand and provide
> > a bandwidth of more than 1800 MB/s.
> 
> Good suggestion.
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
> 



