[Beowulf] MPI performance on clusters of SMP

Kozin, I (Igor) I.Kozin at dl.ac.uk
Thu Aug 26 09:22:19 PDT 2004


Nowadays clusters are typically built from SMP boxes.
Dual-CPU nodes are common, but quad and larger nodes are available too.
Nevertheless, I have never seen a parallel program run faster
on N nodes x 2 CPUs than on 2*N nodes x 1 CPU,
even when the local memory bandwidth requirements are very modest.
The impression is that shared-memory communication always
comes at an extra cost rather than being an advantage, even though
both MPICH and LAM-MPI have support for shared memory.
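
To frame the comparison, this is the kind of ping-pong
micro-benchmark I have in mind (a minimal sketch in plain MPI C;
message size and repeat count are arbitrary choices).  Run it once
with both ranks placed on the same node and once with the ranks on
different nodes, and compare the round-trip times:

    /* ping-pong between rank 0 and rank 1; run with 2 processes */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define NREP   1000
    #define NBYTES 1024

    int main(int argc, char **argv)
    {
        char buf[NBYTES];
        int rank, i;
        double t0, t1;
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, 0, NBYTES);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < NREP; i++) {
            if (rank == 0) {
                MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {
                MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        /* rank 0 reports the average round-trip time */
        if (rank == 0)
            printf("average round trip: %g us\n",
                   (t1 - t0) / NREP * 1e6);

        MPI_Finalize();
        return 0;
    }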

Any comments? Is this an MPICH/LAM issue or a Linux issue?

At least in one case I observed a hint that the OS is involved.
I experimented with running several instances of a small program on
a 4-way Itanium 2 Tiger box with a 2.4 kernel. The program is
basically a loop over an array which fits into the L1 cache.
Up to 3 instances finish virtually simultaneously.
If 4 instances are launched, then 3 finish first and the 4th later,
the overall time being about 40% longer.
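
For reference, the test program was essentially of this form (a
rough reconstruction, not the exact code; the array size and
iteration count here are guesses, chosen only so that the working
set stays well inside L1):

    #include <stdio.h>

    #define N     1024        /* 4 KB of ints, comfortably inside L1 */
    #define NITER 1000000

    int main(void)
    {
        static int a[N];
        long sum = 0;
        int i, iter;

        for (i = 0; i < N; i++)
            a[i] = i;

        /* tight loop over an L1-resident array; no main-memory traffic */
        for (iter = 0; iter < NITER; iter++)
            for (i = 0; i < N; i++)
                sum += a[i];

        /* print the result so the compiler cannot drop the loops */
        printf("sum = %ld\n", sum);
        return 0;
    }

I launched several copies in parallel with something like
"for i in 1 2 3 4; do time ./a.out & done" and compared the
wall-clock times of the individual instances.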

Igor
