[Beowulf] MPICH vs. OpenMPI

Jan Heichler jan.heichler at gmx.net
Fri Apr 25 05:04:38 PDT 2008

Hello Håkon,

On Friday, 25 April 2008, you wrote:

HB> Hi Jan,

HB> At Wed, 23 Apr 2008 20:37:06 +0200, Jan Heichler <jan.heichler at gmx.net> wrote:
>> From what I saw, OpenMPI has several advantages:

>> - better performance on multi-core systems
>>   because of a good shared-memory implementation

HB> A couple of months ago, I conducted a thorough 
HB> study of intra-node performance of different MPIs 
HB> on Intel Woodcrest and Clovertown systems. I 
HB> systematically tested point-to-point performance 
HB> between processes on a) the same die on the same 
HB> socket (sdss), b) different dies on the same socket 
HB> (ddss) (not on Woodcrest, of course) and c) 
HB> different dies on different sockets (ddds). I 
HB> also measured the message rate using all 4 / 8 
HB> cores on the node. The point-to-point benchmarks used 
HB> were ping-ping and ping-pong (Scali's `bandwidth' and osu_latency + osu_bandwidth).
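
[For readers who haven't run these microbenchmarks themselves: the core of a
ping-pong latency test looks roughly like the sketch below. This is a minimal
illustration only, not the Scali or OSU code used above; the message size,
iteration count and lack of warm-up are my own simplifications.]

/* Minimal MPI ping-pong latency sketch: rank 0 sends an 8-byte
 * message to rank 1 and waits for the echo; half the average
 * round-trip time is reported as the one-way latency. */
#include <mpi.h>
#include <stdio.h>

#define NITER   10000
#define MSGSIZE 8

int main(int argc, char **argv)
{
    char buf[MSGSIZE] = {0};
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (i = 0; i < NITER; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSGSIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSGSIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSGSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSGSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("one-way latency: %.2f usec\n",
               (t1 - t0) * 1e6 / (2.0 * NITER));

    MPI_Finalize();
    return 0;
}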

HB> I evaluated Scali MPI Connect 5.5 (SMC), SMC 5.6, 
HB> HP MPI, MVAPICH 0.9.9, MVAPICH2 0.9.8, Open MPI 1.1.1.

HB> Of these, Open MPI was the slowest for all 
HB> benchmarks and all machines, up to 10 times slower than SMC 5.6.

You aren't going to share these benchmark results with us, are you? It would be very interesting to see them!

HB> Now since Open MPI 1.1.1 is quite old, I just 
HB> redid the message rate measurement on an X5355 
HB> (Clovertown, 2.66 GHz). At an 8-byte message size, 
HB> OpenMPI 1.2.2 achieves 5.5 million messages per 
HB> second, whereas SMC 5.6.2 reaches 16.9 million 
HB> messages per second (using all 8 cores on the node, i.e., 8 MPI processes).
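
[Again purely for illustration: a message-rate test typically streams windows
of non-blocking 8-byte sends between paired processes and divides the total
message count by the elapsed time. The sketch below is not the actual
benchmark used above; the window size, pairing scheme and iteration count are
assumptions of mine.]

/* Rough message-rate sketch: the first half of the ranks stream
 * windows of non-blocking 8-byte sends to partners in the second
 * half; the aggregate rate is total messages / elapsed time.
 * Assumes an even number of ranks (e.g. 8, one per core). */
#include <mpi.h>
#include <stdio.h>

#define WINDOW  64
#define NITER   10000
#define MSGSIZE 8

int main(int argc, char **argv)
{
    char sbuf[MSGSIZE] = {0}, rbuf[WINDOW][MSGSIZE], ack = 0;
    MPI_Request req[WINDOW];
    int rank, size, half, peer, i, w;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    half = size / 2;
    peer = (rank < half) ? rank + half : rank - half;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (i = 0; i < NITER; i++) {
        if (rank < half) {                       /* sender */
            for (w = 0; w < WINDOW; w++)
                MPI_Isend(sbuf, MSGSIZE, MPI_CHAR, peer, 0,
                          MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            /* a short ack keeps the sender from running arbitrarily ahead */
            MPI_Recv(&ack, 1, MPI_CHAR, peer, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {                                 /* receiver */
            for (w = 0; w < WINDOW; w++)
                MPI_Irecv(rbuf[w], MSGSIZE, MPI_CHAR, peer, 0,
                          MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, peer, 1, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)      /* approximate aggregate rate across all pairs */
        printf("message rate: %.2f million msgs/s\n",
               (double)NITER * WINDOW * half / (t1 - t0) / 1e6);

    MPI_Finalize();
    return 0;
}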

HB> Comparing OpenMPI 1.2.2 with SMC 5.6.1 on 
HB> ping-ping latency (usec) for an 8-byte payload yields:

HB> mapping OpenMPI   SMC
HB> sdss       0.95  0.18
HB> ddss       1.18  0.12
HB> ddds       1.03  0.12

Impressive. But I never doubted that commercial MPIs are faster. 

HB> So, Jan, I would be very curious to see any documentation of your claim above!

I benchmarked a customer application on an 8-node dual-socket, dual-core Opteron cluster - unfortunately I can't remember the name of the application. 

I used OpenMPI 1.2, MPICH 1.2.7p1, MVAPICH 0.97-something and Intel MPI 3.0, IIRC.

I don't have the detailed data available, but from memory:

Latency was worst for MPICH (TCP/IP only ;-) ), followed by Intel MPI, then OpenMPI, with MVAPICH the fastest. 
On a single machine MPICH was the worst, then MVAPICH, then OpenMPI - Intel MPI was the fastest. 

The difference between MVAPICH and OpenMPI was quite big - Intel MPI had only a small advantage over OpenMPI. 

Since this was not a low-level benchmark I don't know which communication pattern the application used, but it seemed to me that the shared-memory configuration of OpenMPI and Intel MPI was far better than that of the other two. 

