Performance Variations using MPI/Myrico

Patrick Geoffray patrick at myri.com
Fri Apr 27 00:52:08 PDT 2001


Thomas Davis wrote:
> 
> We are looking for sites that run Intel Linux/SMP(dual)/MPI/Myricom 2k,
> and have experienced performance variations.  IE, you've ran the NAS/FT
> parallel benchmark, densely packed (using all CPU's on the nodes), and
> noted that the runs come up different each time - and the difference is
> not minor (as much as 80%).

Hi Thomas,

I think I know the machine you are thinking about :-)
I have a lot of documentation (trace files, profiles, timings) for the NAS
FT benchmark on this cluster. I would be happy to show some screenshots of
the MPI trace, but I am afraid the file would be too big to send to the
list.

> 1) Did you figure out what caused the variation?

The traces, the PGI profiles and the manual timings confirm that the
variation was localized in the "fftz2" function, which is at the core of
the FFT. This function is pure computation, no communication.
Basically, each iteration of the NAS FT is composed of
computation/Alltoall/computation/Allreduce. The trace file shows very
clearly that the computation phases start at the same time on all the
processors (a synchronization side effect of the Alltoall and the
Allreduce), but do not finish at the same moment: some processes were
faster than others. The amplitude of this variation was about 50%!
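
To make the effect of that synchronization concrete, here is a minimal
sketch in Python. It assumes the per-process compute times of one phase
are already known; the function name and the numbers are hypothetical,
not taken from the traces. Because the next collective cannot complete
until every process arrives, the phase lasts as long as the slowest
process and the faster ones sit idle.

```python
def straggler_report(compute_times):
    """Summarize one compute phase that ends in a collective operation.

    compute_times: seconds spent computing by each process, all starting
    at the same instant (as after an Alltoall or Allreduce).
    """
    slowest = max(compute_times)
    fastest = min(compute_times)
    # Each process waits until the slowest one reaches the collective.
    idle = [slowest - t for t in compute_times]
    # Spread of the phase, relative to the fastest process.
    spread = (slowest - fastest) / fastest
    return slowest, idle, spread


# Hypothetical compute times for 4 processes (illustration only).
times = [2.0, 2.0, 3.0, 2.5]
phase_time, idle, spread = straggler_report(times)
print(f"phase lasts {phase_time:.2f} s, spread {spread:.0%}")
print("idle seconds per process:", idle)
```

With these made-up inputs, a single slow process stretches the phase for
everyone, which is why a variation inside a pure-computation function
shows up in the total run time.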
 
Running with P4/Ethernet showed exactly the same variation in "fftz2", but
the overall variation was much smaller in percentage terms: P4/Ethernet
is 4 times slower, so the variation was smaller compared to the total time.
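
The arithmetic behind this can be sketched as follows. The absolute
numbers are hypothetical and only the factor of 4 comes from the
observation above: the same absolute variation in the compute phase is
divided by a total run time that is 4 times larger.

```python
def relative_variation(absolute_variation, total_time):
    """Variation expressed as a fraction of the total run time."""
    return absolute_variation / total_time


# Hypothetical values in seconds (illustration only).
variation = 30.0              # same compute-phase variation in both cases
fast_total = 120.0            # assumed total time on the fast interconnect
slow_total = 4 * fast_total   # "4 times slower" interconnect

fast_share = relative_variation(variation, fast_total)
slow_share = relative_variation(variation, slow_total)
print(f"fast interconnect: {fast_share:.1%} of total")
print(f"slow interconnect: {slow_share:.1%} of total")
```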

> 2) What did you do to fix this problem?

The Linux kernel from the RedHat Enterprise Edition didn't seem to be
patched for SSE support on the PIII. Applying the right patch
corrected the performance variation. It also greatly improves the STREAM
problem that I submitted to the list earlier today (BTW, thanks guys for
the various replies). I ran 40 iterations of the NAS FT class B on 8 nodes
using 2 processes per node (16 processors total) about 1 hour ago, and
here are the results:

 Time in seconds =                   128.27
 Time in seconds =                   119.84
 Time in seconds =                   120.39
 Time in seconds =                   119.93
 Time in seconds =                   118.78
 Time in seconds =                   120.51
 Time in seconds =                   123.08
 Time in seconds =                   120.39
 Time in seconds =                   122.38
 Time in seconds =                   124.70
 Time in seconds =                   118.64
 Time in seconds =                   119.54
 Time in seconds =                   121.44
 Time in seconds =                   126.51
 Time in seconds =                   121.97
 Time in seconds =                   125.08
 Time in seconds =                   129.34
 Time in seconds =                   119.35
 Time in seconds =                   120.95
 Time in seconds =                   122.59
 Time in seconds =                   126.15
 Time in seconds =                   119.88
 Time in seconds =                   121.73
 Time in seconds =                   121.52
 Time in seconds =                   126.94
 Time in seconds =                   119.06
 Time in seconds =                   123.98
 Time in seconds =                   121.84
 Time in seconds =                   122.37
 Time in seconds =                   122.21
 Time in seconds =                   121.48
 Time in seconds =                   136.54
 Time in seconds =                   120.85
 Time in seconds =                   121.71
 Time in seconds =                   125.87
 Time in seconds =                   130.96
 Time in seconds =                   128.00
 Time in seconds =                   120.68
 Time in seconds =                   128.20
 Time in seconds =                   119.74

I'll let you judge (about 5%, with a median at 122, which does not look so
bad to me :-) ). For the one at 136, I was sshing into one of the nodes to
check the load, sorry.
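
For anyone who wants to repeat this kind of check on their own runs, a
small Python sketch along these lines can summarize the "Time in seconds"
lines. The function name and the choice of statistics are assumptions;
the post does not say how the 5% figure was derived. The sample below is
only the first five timings from the list above.

```python
import statistics


def summarize(report_lines):
    """Extract the 'Time in seconds' values and summarize them."""
    times = [
        float(line.split("=")[1])
        for line in report_lines
        if "Time in seconds" in line
    ]
    median = statistics.median(times)
    return {
        "count": len(times),
        "min": min(times),
        "max": max(times),
        "median": median,
        # Full range relative to the median; one possible spread measure.
        "range_over_median": (max(times) - min(times)) / median,
    }


# First five timings from the runs listed above.
sample = [
    " Time in seconds =                   128.27",
    " Time in seconds =                   119.84",
    " Time in seconds =                   120.39",
    " Time in seconds =                   119.93",
    " Time in seconds =                   118.78",
]
for name, value in summarize(sample).items():
    print(f"{name}: {value}")
```

Feeding in the complete output of all the runs gives the statistics for
the whole set; lines that do not contain a timing are ignored.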
 
> 3) Any ideas on what could cause this much variation?

I have some ideas, but nothing I would bet on. Mainly cache thrashing: the
memory copy operation is improved with SSE by using the prefetching
support, and this prefetch bypasses the L2 cache. Without SSE, the L2 cache
is happily flushed whenever a processor is doing a copy. As the FFT code
includes a copy step, who knows... :-)

Ahh, I am eager to see dual Athlon on the market...

If people on the list have ideas, they are very welcome!

Greg: are your numbers for FT on Alpha or x86?

-- 
Patrick Geoffray

---------------------------------------------------------------
|      Myricom Inc       |  University of Tennessee - CS Dept |
| 325 N Santa Anita Ave. |   Suite 203, 1122 Volunteer Blvd.  |
|   Arcadia, CA 91006    |      Knoxville, TN 37996-3450      |
|     (626) 821-5555     |      Tel/Fax : (865) 974-1950      |
---------------------------------------------------------------





More information about the Beowulf mailing list