Beowulf & FFT

Tue Jul 18 17:02:14 PDT 2000

Folks...

1) FFTW is fast, but not the fastest FFT for all platforms, it is
   quite a nice piece of work however.

   It adapts
   to a platform (such as CISC Intel architecture), by running a sequence
   of different sizes, to build a strategy for the sequential FFTs
   you use.  This makes it competitive to be sure, but not optimal.

   Like ATLAS, competitive is hard to beat unless you write assembler
   and use the same tricks it uses plus more, or else specialize the
   code to RISC or CISC, etc, the cache hierarchy, prefetching, and
   permutation options, etc, etc.

2) The MPI_Alltoall call is far from optimal.  For a 2D matrix transpose
   it is for sure easy to write a call that is more optimal.

   Here are some reasons why:

   a) Each MPI alltoall has to work with derived datatypes to capture the
   semantics of a matrix transpose; these are quite slow with some 
   implementations of MPI.

   b) The usual implementation of all-to-all, based on the semantics of 
   the call, is N-1 sends and N-1 receives at each node.  When you
   know you are actually doing a 2D transpose, you can do better, such
   as gather scatter algorithms, or sequences thereof.  [Greg Lindahl
   spoke of such tradeoffs at one meeting I was at.]

   c) Persistent point-to-point communication (if implemented on a fast
   implementation of MPI that makes them faster than regular point to
   point), provide a good basis for writing a corner-turn (2D tranpose)
   of your own, which you might vary in algorithm as a function of the
   size of the processor set and matrix size.  

Doing the corner turn yourself with your own calls will never be worse
than alltoall, given the state of the algorithms used (which in turn
derives from the difficulty of optimizing alltoall in general).

Surprisingly, people use alltoall as a way to measure the performance
of corner turns on some MPI implementations, even though you can write
a better, faster, still portable corner turn of your own, strictly
in terms of pt2pt calls.

Tony

Anthony Skjellum, PhD, President (tony at mpi-softtech.com) 
MPI Software Technology, Inc., Ste. 33, 101 S. Lafayette, Starkville, MS 39759
+1-(662)320-4300 x15; FAX: +1-(662)320-4301; http://www.mpi-softtech.com
"Best-of-breed Software for Beowulf and Easy-to-Own Commercial Clusters."

On Tue, 18 Jul 2000, Martin Siegert wrote:

> On Tue, 18 Jul 2000, Walter B. Ligon III wrote:
> 
> > > Since I'm in the process of expanding the beowulf, I'm wondering whether
> > > switching to 133 MHz would improve the results (given that I can find
> > > a motherboard that supports ECC - we had that discussion).
> > 
> > Generally, if the problem is that node-to-node communication is too slow,
> > upping your system bus speed isn't going to do much in the way of improving
> > your parallel system performance.
> 
> True. However, I'm not completely convinced that node-to-node communication
> is the only limiting factor: The efficiency for np=2 is miserable as well.
> And that should be shared-memory communication (at least in mpich-1.2.0).
> 
> > The first thing that strikes me is that a 400x400 operation is pretty small
> > for a Beowulf.  Its going to be hard to get good efficiency on a small
> > problem like that.
> 
> This is correct. The efficiency gets better, when I go to 1600x1600, but it's
> still misreable: speedup=1.25 for np=2. Furthermore, in my problem time scales
> with length to third or fourth power. Thus, I can't take advantage of the
> larger system sizes, if don't go to much, much later times.
> 
> > I am not familliar with the FFT library you have, but it may be that it was
> > not designed for a machine like a Beowulf, but rather for a finer grain
> > machine like some of the MPPs.  Thus the best approach may be a different
> > bit of software, or reworking that software.  All-to-All is probably NOT
> > the best implementation on an MPI running over TCP/IP.
> 
> The authors of FFTW claim that this is the fastest FFT you can get. I can only
> say that from all I know this statement is correct, i.e., all other FFTs 
> that I tried were slower. Actually I wasted a month by writing my own FFT
> only to find out that is was slower as well :-(
> Furthermore, I can't really see how the necessary matrix transpose can be
> done more efficiently without the Alltoall.
> 
> > Of course, you could always upgrade to a faster network if you have the $$$.
> 
> Yup. That would be nice.
> 
> Martin
> 
> ========================================================================
> Martin Siegert
> Academic Computing Services                        phone: (604) 291-4691
> Simon Fraser University                            fax:   (604) 291-4242
> Burnaby, British Columbia                          email: siegert at sfu.ca
> Canada  V5A 1S6
> ========================================================================
> 
> 
> _______________________________________________
> Beowulf mailing list
> Beowulf at beowulf.org
> http://www.beowulf.org/mailman/listinfo/beowulf
>