Performance Variations using MPI/Myrico
Patrick Geoffray patrick at myri.com
Fri Apr 27 00:52:08 PDT 2001
Thomas Davis wrote:
>
> We are looking for sites that run Intel Linux/SMP(dual)/MPI/Myricom 2k,
> and have experienced performance variations. IE, you've ran the NAS/FT
> parallel benchmark, densely packed (using all CPU's on the nodes), and
> noted that the runs come up different each time - and the difference is
> not minor (as much as 80%).

Hi Thomas,

I think I know the machine you are thinking about :-) I have a lot of documentation (trace files, profiles, timings) for the NAS FT benchmark on this cluster. I would be happy to show some screenshots of the MPI trace, but I am afraid the file would be too big to send to the list.

> 1) Did you figure out what caused the variation?

The traces, the PGI profiles and the manual timings confirm that the variation was localized in the "fftz2" function, which is the core of the FFT. This function is pure computation, no communication. Basically, each iteration of NAS FT is composed of computation/Alltoall/computation/Allreduce. The trace file shows very clearly that the computation phases start at the same time on all the processors (a synchronization side effect of the Alltoall and the Allreduce) but do not finish at the same moment: some processes were faster than others. The amplitude of this variation was about 50%!

Running with P4/Ethernet showed exactly the same variation in "fftz2", but the overall variation was much smaller in percentage terms: P4/Ethernet is 4 times slower, so the variation was smaller compared to the total time.

> 2) What did you do to fix this problem?

The Linux kernel from the RedHat Enterprise Edition didn't seem to be patched for SSE support on the PIII. Applying the right patch corrected the performance variation. It also greatly improves the STREAM problem that I submitted to the list earlier today (BTW, thanks guys for the various replies).

I ran 40 iterations of NAS FT class B on 8 nodes using 2 processes per node (16 processors total) about 1 hour ago, and got:

Time in seconds = 128.27
Time in seconds = 119.84
Time in seconds = 120.39
Time in seconds = 119.93
Time in seconds = 118.78
Time in seconds = 120.51
Time in seconds = 123.08
Time in seconds = 120.39
Time in seconds = 122.38
Time in seconds = 124.70
Time in seconds = 118.64
Time in seconds = 119.54
Time in seconds = 121.44
Time in seconds = 126.51
Time in seconds = 121.97
Time in seconds = 125.08
Time in seconds = 129.34
Time in seconds = 119.35
Time in seconds = 120.95
Time in seconds = 122.59
Time in seconds = 126.15
Time in seconds = 119.88
Time in seconds = 121.73
Time in seconds = 121.52
Time in seconds = 126.94
Time in seconds = 119.06
Time in seconds = 123.98
Time in seconds = 121.84
Time in seconds = 122.37
Time in seconds = 122.21
Time in seconds = 121.48
Time in seconds = 136.54
Time in seconds = 120.85
Time in seconds = 121.71
Time in seconds = 125.87
Time in seconds = 130.96
Time in seconds = 128.00
Time in seconds = 120.68
Time in seconds = 128.20
Time in seconds = 119.74

I let you judge (about 5%, with a median at 122, which looks not so bad to me :-) ). For the one at 136, I was sshing to one of the nodes to check the load, sorry.

> 3) Any ideas on what could cause this much variation?

I have some ideas, but nothing I would bet on. Mainly cache thrashing: the memory copy operation is improved with SSE by using the prefetching support, and this prefetch bypasses the L2 cache. Without SSE, the L2 cache is happily flushed while a processor is doing a copy. As the FFT code includes a copy step, who knows... :-)

Ahh, I am eager to see dual Athlon on the market...

If people on the list have ideas, they are very welcome!

Greg: are your numbers for FT on Alpha or x86?

--
Patrick Geoffray

---------------------------------------------------------------
|   Myricom Inc             |  University of Tennessee - CS Dept  |
|   325 N Santa Anita Ave.  |  Suite 203, 1122 Volunteer Blvd.    |
|   Arcadia, CA 91006       |  Knoxville, TN 37996-3450           |
|   (626) 821-5555          |  Tel/Fax : (865) 974-1950           |
---------------------------------------------------------------
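
For readers who want to check the "about 5%, median at 122" estimate against the 40 timings quoted in the message, here is a minimal Python sketch. It is not part of the original post. The `summarize` helper and the choice of "largest deviation from the median" as the spread measure are assumptions made for illustration; the only data used are the times listed in the message.

```python
import statistics

# Wall-clock times in seconds, copied from the 40 runs listed in the message.
times = [
    128.27, 119.84, 120.39, 119.93, 118.78, 120.51, 123.08, 120.39,
    122.38, 124.70, 118.64, 119.54, 121.44, 126.51, 121.97, 125.08,
    129.34, 119.35, 120.95, 122.59, 126.15, 119.88, 121.73, 121.52,
    126.94, 119.06, 123.98, 121.84, 122.37, 122.21, 121.48, 136.54,
    120.85, 121.71, 125.87, 130.96, 128.00, 120.68, 128.20, 119.74,
]


def summarize(samples):
    """Hypothetical helper: median, extremes, and the largest deviation
    from the median expressed as a percentage of the median."""
    med = statistics.median(samples)
    lo, hi = min(samples), max(samples)
    spread_pct = 100.0 * max(med - lo, hi - med) / med
    return {"n": len(samples), "median": med, "min": lo, "max": hi,
            "spread_pct": spread_pct}


all_runs = summarize(times)

# The 136.54 s run is the one the author says was disturbed by an ssh login,
# so a second summary leaves it out.
undisturbed = summarize([t for t in times if t != 136.54])

for label, stats in (("all runs", all_runs), ("without 136.54", undisturbed)):
    print(f"{label}: n={stats['n']} median={stats['median']:.2f} "
          f"min={stats['min']:.2f} max={stats['max']:.2f} "
          f"max deviation={stats['spread_pct']:.1f}%")
```

The worst-case deviation is a deliberately pessimistic measure; a standard deviation or interquartile range would give a smaller figure, so the number printed here is not directly comparable to the author's rough "about 5%".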
More information about the Beowulf mailing list
