FFT & Beowulf (summary)
siegert at sfu.ca
Thu Oct 19 20:20:18 PDT 2000
I concluded my "experiments" with FFTs on a cluster with switched fast
ethernet. The full results can be found at
I have tested 3 different algorithms for 2d FFTs with 3 different MPI
distributions (mpich-1.2.1, mpipro-1.5b7-3t, lam-6.3.2).
The algorithms are as follows:
(1) fftwnd_mpi from the fftw library. This routine does
a) column transforms
b) 2d matrix transpose using MPI_Alltoall
c) row transforms
(2) as (1), but in step b) use MPI_Irecv/MPI_Isend pairs between all processors
(3) a) within a loop over all columns do
i) 1d FFT
ii) send/receive the data just obtained to/from the other processors
b) 2d matrix transpose (doesn't require communication anymore since all
data are already in place)
c) row transforms (as above)
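All three algorithms share the same serial core: a 2d FFT decomposed into 1d column transforms, a transpose, and 1d row transforms. A minimal serial sketch of that decomposition (no MPI; the transpose stands in for the MPI_Alltoall / send-receive step, and numpy's FFT stands in for fftw's 1d routine):

```python
import numpy as np

def fft2_by_decomposition(a):
    """2d FFT built from 1d FFTs, mirroring the structure of
    algorithms (1)-(3): column transforms, transpose, row
    transforms (serial sketch; the transpose is where the
    parallel versions communicate)."""
    a = np.fft.fft(a, axis=0)   # a) column transforms
    a = a.T                     # b) transpose (MPI_Alltoall in the parallel case)
    a = np.fft.fft(a, axis=0)   # c) row transforms (columns after the transpose)
    return a.T                  # undo the transpose to restore the usual layout

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
assert np.allclose(fft2_by_decomposition(x), np.fft.fft2(x))
```

The assertion checks the decomposition against numpy's direct 2d FFT; since the 2d transform is separable, the order of the two 1d passes does not matter.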
For the 1d FFTs the routine from the fftw library is used in all cases.
Whereas (1) and (2) are fairly similar, algorithm (3) is different:
The message size in (1) and (2) is (Lx/np)*(Ly/np) complex numbers
[Lx*Ly system size, np # of processors]. For (3) the message size is Ly/np,
i.e., much smaller. However, in (1) and (2) computation and communication
are separated: steps a) and c) are all computation, whereas step b) is all
communication. In (3a), computation and communication are done in parallel.
Furthermore, persistent sends/receives can be used.
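The message-size arithmetic above can be sketched as follows (a sketch only: the per-message count for (3) — one message per local column per partner — is my reading of the loop structure, and the 16 bytes per double-precision complex number is an assumption, not stated in the text):

```python
BYTES_PER_COMPLEX = 16  # double-precision complex (assumed)

def messages(Lx, Ly, np_):
    """Per-processor message size (bytes) and message count for the
    two communication strategies, Lx*Ly system on np_ processors."""
    # (1)/(2): one transpose; each proc exchanges (Lx/np)*(Ly/np)
    # complex numbers with each of the other np-1 procs.
    big = (Lx // np_) * (Ly // np_) * BYTES_PER_COMPLEX
    big_count = np_ - 1
    # (3): inside the column loop each message is only Ly/np complex
    # numbers, but there is one per local column per partner, so the
    # count is much higher and latency dominates.
    small = (Ly // np_) * BYTES_PER_COMPLEX
    small_count = (Lx // np_) * (np_ - 1)
    return big, big_count, small, small_count

big, nb, small, ns = messages(1024, 1024, 8)
print(big, nb, small, ns)
```

Note that the total volume moved per processor is the same either way (big*nb == small*ns); the two strategies differ only in how that volume is split into messages, which is why (3) stands or falls with the latency of the MPI distribution.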
So what is best?
It depends :-)
Obviously, (3) is only good with an MPI distribution that has very low
latencies. The lam distribution has the smallest latencies of the three
distributions tested and the combination (3)+lam gives the fastest FFTs
for large system sizes and/or small # of processors.
It would be interesting to see how the results change for, e.g., 3x100BaseT.
Also, does the mpich distro that comes with Beowulf2 have smaller latencies?
Academic Computing Services phone: (604) 291-4691
Simon Fraser University fax: (604) 291-4242
Burnaby, British Columbia email: siegert at sfu.ca
Canada V5A 1S6