[Beowulf] Q: IB message rate & large core counts (per node)?

Wed Feb 24 05:00:27 PST 2010

Hi Greg,

On Feb 23, 2010, at 23:32 , Greg Lindahl wrote:

> A traditional MPI implementation uses N QPs x N processes, so the
> global number of QPs is N^2. InfiniPath's pm library for MPI uses a
> much smaller endpoint than a QP. Using a ton of QPs does slow down
> things (hurts scaling), and that's why SRQ 

!!:gs/SRQ/XRC/

SRQ reduces the number of receive queues used, and thereby reduces the footprint of the receive buffers. As such, SRQ does not change the number of QPs used. Actually, by using bucketed receive queues (one receive queue per bucket size), you need more QPs (I believe Open MPI uses 4 QPs per connection when using SRQ).

XRC, on the other hand, reduces the number of QPs per node from NxPPN to N.
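
To put rough numbers on that, here is a minimal back-of-the-envelope sketch. It assumes that N is the number of nodes, that PPN is the number of processes per node, that every process connects to every other process, and that the NxPPN and N figures are counted per process (multiply by PPN for a per-node total). The function name and the qps_per_connection parameter are made up for illustration; self-connections and intra-node shared memory are ignored.

```python
def qp_counts(nodes, ppn, qps_per_connection=1):
    """Rough per-process QP counts under full connectivity.

    Assumptions (illustrative only): one connection per peer process
    without XRC, one QP per peer node with XRC.
    """
    # Without XRC: a connection to each of the nodes * ppn processes,
    # each connection using qps_per_connection QPs.
    without_xrc = nodes * ppn * qps_per_connection
    # With XRC: one QP per node, independent of ppn.
    with_xrc = nodes
    return without_xrc, with_xrc


# 128 processes at 8 per node means 16 nodes.
print(qp_counts(16, 8))
# Same job with 4 QPs per connection (the bucketed SRQ case).
print(qp_counts(16, 8, qps_per_connection=4))
```

The gap between the two counts grows linearly with PPN, which is why the effect matters more as core counts per node go up.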

I have seen the impact of this when running only 8 processes per node. In that particular case, the application ran faster using IPoIB with only 128 processes. I assume XRC would have alleviated this effect, but I had no opportunity to evaluate XRC at the time. Hence, I would advise anyone benchmarking around this issue to also include XRC and/or lazy connection establishment, to see whether the number of QPs in use affects performance.

Thanks, Håkon