[Beowulf] Really efficient MPIs??
apittman at concurrent-thinking.com
Thu Nov 29 04:25:41 PST 2007
On Wed, 2007-11-28 at 10:21 -0600, Nathan Moore wrote:
> I've not tried their respective MPI libraries, but as a general rule,
> the people who manufacture the chips have the best idea of how to
> optimize a given library. (There are obvious counter-examples,
> gotoBLAS and fftw for example).
> That said, have you tried for Intel:
> or for AMD: http://developer.amd.com/devtools.jsp (they link to HP's
I'd disagree with this. The metric normally considered most important
for MPI is network latency, so you are often better off using the
MPI provided with your network. Where the chip vendors have experience
and are able to provide benefit is in a fast memcpy() implementation,
which is what will limit bandwidth for intra-node comms, so you should
compile your MPI library with the chip-vendor-approved compiler.
As for shared-memory comms, the problem with using shared memory within
a node is that two copies of the data are explicitly required: one
copy-in and one copy-out. The MPIs with the best intra-node bandwidth
will be the ones which use an in-kernel userspace-to-userspace memcpy
(which is *very* well optimised) to halve the amount of memory traffic
needed to perform the data move.
Finally, just to really throw you, I've seen occasions where *not* using
intra-node optimisations at all is the right thing to do. Communicating
within a node consumes CPU cycles, and if your code is overlapping comms
and compute to such an extent that latency is not a large factor, handing
the comms off to the NIC to be handled asynchronously without CPU
intervention can improve performance.