Beowulf & FFT

Martin Siegert siegert at
Mon Jul 24 13:01:24 PDT 2000

On Mon, 24 Jul 2000, Robert G. Brown wrote:

> Sorry about the very late response, but I've been out of town.  
>   a) Have you looked into the various books on parallel algorithms that
> are on the market (and in a few cases on the web)?  I'd wager that this
> problem has been studied as FFT's seem like they'd be fairly important.
> These books also go over things like parallel matrix operations where
> the "best" algorithm can suddenly change to something quite nonintuitive
> depending on matrix scale and the various speeds.  Even optimizing only
> for the relative speed differential of L1/L2/Main memory, ATLAS has to
> work quite hard and changes algorithms and blocksizes altogether several
> times.

I've looked into everything I could get my hands on (which may not be enough).
Anyway, most of the literature on parallel FFTs (yes, at least in the past
FFTs were one of the most studied problems for parallelism) is on parallelizing
the 1d FFT. That is totally uninteresting from a beowulf perspective: the
theoretical speedup (if I remember correctly) is only proportional to log(np)
[np: # of processors], and that is almost certainly eaten up by the
communication overhead on a beowulf. But two and higher dimensional FFTs
are quite promising (again theoretically): 2d FFTs can be split into two
loops of 1d FFTs (one over rows, one over columns). Both loops can be run
in parallel without any communication whatsoever [just choose the fastest
1d FFT routine you can find; I believe it is very hard to beat FFTW here].
The problem is that you have to do a matrix transpose between the two
loops. FFTW does this using MPI_Alltoall. As Tony pointed out this may not
be the best choice. So I've been busy writing a matrix transpose routine
myself - something that is trivial for serial programs is totally nontrivial 
when the matrix is distributed over several processors. If somebody can 
give me a hint or a reference on how this is done for best performance, I'll
be thankful forever :-) [right now I'm trying MPI_Type_vector to define
the datatype to be sent from one processor to the others and then 
MPI_Isend and MPI_Irecv].
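The row-pass / transpose / row-pass structure can be sketched in plain (serial)
Python. This is only a toy illustration of the data movement, not the MPI code
itself: the naive dft() stands in for a fast 1d routine such as FFTW, and
block_transpose() mimics what MPI_Alltoall followed by a local block transpose
would do, with the "processors" simulated as lists of row blocks:

```python
import cmath

def dft(vec):
    """Naive O(n^2) 1d DFT; a stand-in for a fast 1d FFT such as FFTW."""
    n = len(vec)
    return [sum(vec[k] * cmath.exp(-2j * cmath.pi * j * k / n)
                for k in range(n))
            for j in range(n)]

def block_transpose(row_blocks):
    """Transpose an n x n matrix distributed as contiguous row blocks.

    row_blocks[p] holds the rows owned by "processor" p.  Each processor
    cuts its rows into np column blocks; block q is shipped to processor q
    (the MPI_Alltoall step), which transposes each block it receives."""
    nproc = len(row_blocks)
    nb = len(row_blocks[0])          # rows per processor (n = nproc * nb)
    out = []
    for q in range(nproc):           # processor q assembles its new rows
        new_rows = [[] for _ in range(nb)]
        for p in range(nproc):       # block arriving from processor p
            block = [row[q * nb:(q + 1) * nb] for row in row_blocks[p]]
            for i in range(nb):      # local transpose of the nb x nb block
                for j in range(nb):
                    new_rows[i].append(block[j][i])
        out.append(new_rows)
    return out

def fft2(matrix, nproc=2):
    """2d DFT: 1d DFTs over rows, distributed transpose, 1d DFTs again."""
    n = len(matrix)
    nb = n // nproc
    blocks = [matrix[p * nb:(p + 1) * nb] for p in range(nproc)]
    blocks = [[dft(row) for row in b] for b in blocks]   # row pass
    blocks = block_transpose(blocks)                     # communication step
    blocks = [[dft(row) for row in b] for b in blocks]   # column pass
    # undo the transpose so the output is in the usual order (many parallel
    # FFTs skip this and simply leave the result transposed)
    return [row for b in block_transpose(blocks) for row in b]
```

Note that the only interprocessor communication in the whole transform is the
block exchange inside block_transpose(), which is exactly the step where the
choice between MPI_Alltoall and hand-rolled MPI_Type_vector/MPI_Isend/MPI_Irecv
matters.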

>   b) Is there a way of reformulating the problem so that it is
> embarrassingly parallel?  That is, since you are scale limited anyway to
> smallish "runs" that might well fit on a single CPU UP system, is there
> any benefit to just running a bunch of these calculations independently
> in parallel?  I have similar tradeoffs in my Monte Carlo code -- I could
> probably do really large lattices by splitting up the lattices between
> systems and paying a surface-to-volume network IPC penalty, but critical
> slowing down and the brutal quadratic scaling of statistics (need four
> times as many samples to halve the error) kills me long before that
> becomes a viable alternative.  Instead I run lattices that fit easily
> onto single CPU's (which are also small enough that I have a chance of
> accumulating enough independent samples to get high precision results in
> my lifetime) and concentrate on accumulating decent statistics at these
> scales with independent parallel runs.  

Actually, I can: I have to average over random initial conditions. At least
in principle - I find that usually 4 samples are already enough, even a single
sample already gives you the growth exponent quite accurately. That's why
I hoped that I could parallelize a single run - then I could go to
larger system sizes and later times. If it can't be done, I'll do exactly
as you said.
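For what it's worth, the 1/sqrt(N) scaling of statistics that Robert mentions
("need four times as many samples to halve the error") is easy to see
numerically. The snippet below is a throwaway illustration, not part of either
simulation: it estimates the spread of a sample mean empirically and shows the
error roughly halving when the sample count is quadrupled:

```python
import random
import statistics

def error_of_mean(nsamples, ntrials=2000, seed=1):
    """Std. dev. of the mean of nsamples uniform draws,
    estimated empirically over ntrials independent runs."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.random() for _ in range(nsamples))
             for _ in range(ntrials)]
    return statistics.stdev(means)

e100 = error_of_mean(100)
e400 = error_of_mean(400)
# e100 / e400 comes out close to 2: four times the samples, half the error
```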

> From your description it isn't clear that you can do this, but in "many"
> of the problems that will run a really long time one finds that there
> are obvious problem segmentations (exploring a parameter space in
> parallel, accumulating statistics in stochastic simulations in parallel,
> parallel processing independent images or datasets or initial
> conditions) that render them EP.  One hardly ever writes a program to
> just run one (very long) time with just one set of input values.  EP
> programs generally scale (in the parallel speedup sense) nearly
> perfectly on any hardware they'll fit on in the first place...

That's absolutely correct in my case as well (if I just average over initial
conditions). Unfortunately, the PDE doesn't have any parameters anymore -
I can scale everything away. This is different from the Monte-Carlo simulations
that I run to compare the PDE results with: those are indeed embarrassingly
parallel.

Thanks to everybody for the comments.


Martin Siegert
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert at
Canada  V5A 1S6
