FFT & Beowulf (summary)
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
Martin Siegert siegert at sfu.caThu Oct 19 20:20:18 PDT 2000
- Previous message: [ANNOUNCE] SGI Performance Co-Pilot 2.1.10 now available
- Next message: question about Bonding tools
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hello, I concluded my "experiments" with FFTs on a cluster with switched fast ethernet. The full results can be found at http://www.sfu.ca/~siegert/fft-timings.html Short summary: I have tested 3 different algorithms for 2d FFTs with 3 different MPI distributions (mpich-1.2.1, mpipro-1.5b7-3t, lam-6.3.2). The algorithms are as follows: (1) fftwnd_mpi from the fftw library. This routine does a) column transforms b) 2d matrix transpose using MPI_Alltoall c) row transforms (2) as (1), but in step b) use MPI_Irecv and MPI_Isend between all processor pairs. (3) a) within a loop over all columns do i) 1d FFT ii) send/receive the just obtained data to/from the other processors b) 2d matrix transpose (doesn't require communication anymore since all data are already in place) c) row transforms (as above) For the 1d FFTs the fftw routine from the fftw library is used in all cases. Whereas (1) and (2) are fairly similar, algorithm (3) is different: The message size in (1) and (2) is (Lx/np)*(Ly/np) complex numbers [Lx*Ly system size, np # of processors]. For (3) the message size is Ly/np, i.e., much smaller. However, in (1) and (2) computation and communication are separated: steps a) and c) are all computation, whereas step b) is all communication. In (3a) computation and communication is done in parallel. Furthermore, persistent sends/receives can be used. So what is best? It depends :-) Obviously, (3) is only good with a MPI distribution that has very low latencies. The lam distribution has the smallest latencies of the three distributions tested and the combination (3)+lam gives the fastest FFTs for large system sizes and/or small # of processors. It would be interesting to see how the results change for, e.g., 3x100BaseT or better. Also, does the mpich distro that comes with Beowulf2 have smaller latencies? Comments? Cheers, Martin ======================================================================== Martin Siegert Academic Computing Services phone: (604) 291-4691 Simon Fraser University fax: (604) 291-4242 Burnaby, British Columbia email: siegert at sfu.ca Canada V5A 1S6 ========================================================================
- Previous message: [ANNOUNCE] SGI Performance Co-Pilot 2.1.10 now available
- Next message: question about Bonding tools
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
