I've not tried their respective MPI libraries, but as a general rule, the people who manufacture the chips have the best idea of how to optimize a given library.  (There are obvious counter-examples, GotoBLAS and FFTW for example.)

<br><br>That said, have you tried for Intel: <a href="http://www.intel.com/cd/software/products/asmo-na/eng/308295.htm">http://www.intel.com/cd/software/products/asmo-na/eng/308295.htm</a><br><br>or for AMD:  <a href="http://developer.amd.com/devtools.jsp">

http://developer.amd.com/devtools.jsp</a> (they link to HP's MPI)<br><br>As a side note, IBM uses a slightly modified version of MPICH for Blue Gene.<br><br>Nathan<br><br> <br><div class="gmail_quote">On Nov 28, 2007 9:48 AM, Christian Bell <

<a href="mailto:christian.bell@qlogic.com">christian.bell@qlogic.com</a>> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">But the main point with MPI implementations, more than usual with

<br>shared memory, is to run your application.<br><br>For 2 different MPI shared-memory implementations that show equal<br>performance on point-to-point microbenchmarks, you can measure very<br>different performance in applications (mostly at the bandwidth-bound

<br>level).<br><br>Microbenchmarks assume senders and receivers are always synchronized<br>in time and report memory copy performance for memory copies that go<br>mostly through the cache.  Memory transfers that are mostly out of

<br>cache are rarely tuned for or even measured.<br><br>Microbenchmarks also never have the receivers actually consume the<br>data that's received or have senders re-reference the data sent for<br>computation.  The cost of these application-level memory accesses is

<br>greatly determined by where in the memory hierarchy the MPI<br>implementation left the data to be computed on.  And finally, a given<br>implementation will have very different performance characteristics<br>on Opteron versus Intel, few-core versus many-core and point-to-point

versus collectives. It's safe to assume that most, if not all, MPIs try to do something about shared memory, but I wouldn't be surprised if each of them can top out on some performance curve on some specific system.
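<br><br>To make the in-cache versus out-of-cache point concrete, here is a rough single-process sketch in Python. It is not MPI code and is not taken from any MPI implementation: the function name copy_bandwidth, the buffer sizes and the repeat counts are all made up for illustration, and the interpreter adds overhead of its own. It times a plain memory copy for a given buffer size, optionally followed by a pass that reads the copied data back, which is the step a point-to-point microbenchmark leaves out.<br>
<pre>
```python
import time


def copy_bandwidth(nbytes, repeats=20, consume=False):
    """Return an approximate copy rate in MB/s for an nbytes buffer.

    With consume=True every copy is followed by a full read of the
    destination buffer, a stand-in for an application that actually
    uses the data it received.
    """
    src = bytearray(nbytes)
    dst = bytearray(nbytes)
    touched = 0
    start = time.perf_counter()
    for _ in range(repeats):
        dst[:] = src                  # the "message transfer": a plain memory copy
        if consume:
            touched += dst.count(0)   # the receiver scans every byte it was given
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very coarse timers.
    return (nbytes * repeats) / max(elapsed, 1e-9) / 1e6


if __name__ == "__main__":
    # Hypothetical sizes: one likely to fit in cache, one likely not to.
    for size, reps in ((32 * 1024, 2000), (32 * 1024 * 1024, 5)):
        plain = copy_bandwidth(size, reps)
        used = copy_bandwidth(size, reps, consume=True)
        print(f"{size} bytes: copy only {plain:.0f} MB/s, copy and consume {used:.0f} MB/s")
```
</pre>
The small buffer will typically stay in cache and the large one will not, but the actual numbers depend entirely on the system, so any output is a local measurement only and says nothing about a particular MPI library.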

<br><br><br>    . . christian<br><div><div></div><div class="Wj3C7c"><br>On Wed, 28 Nov 2007, amjad ali wrote:<br><br>> Hello,<br>><br>> Because today the clusters with multicore nodes are quite common and the<br>

> cores within a node share memory.<br>><br>> Which implementations of MPI (no matter commercial or free), make automatic<br>> and efficient use of shared memory for message passing within a node. (means<br>> which MPI libraries automatically communicate over shared memory instead of

<br>> interconnect on the same node).<br>><br>> regards,<br>> Ali.<br><br></div></div><div class="Ih2E3d">> _______________________________________________<br>> Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org">

Beowulf@beowulf.org</a><br>> To change your subscription (digest mode or unsubscribe) visit <a href="http://www.beowulf.org/mailman/listinfo/beowulf" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a><br>

<br><br></div><font color="#888888">--<br><a href="mailto:christian.bell@qlogic.com">christian.bell@qlogic.com</a><br>(QLogic Host Solutions Group, formerly Pathscale)<br></font></blockquote></div><br><br clear="all"><br>-- <br>- - - - - - -   - - - - - - -   - - - - - - - <br>Nathan Moore<br>Assistant Professor, Physics<br>Winona State University

<br>AIM: nmoorewsu <br>- - - - - - -   - - - - - - -   - - - - - - -