[Beowulf] Multidimensional FFTs
sdm900 at gmail.com
Tue Feb 28 17:15:08 PST 2006
I've tested FFTs rather extensively and run other codes that require
a transpose. In my experience, a well-tuned gig-e network is capable
of giving a speedup, though not necessarily scaling that well. The
most important thing is that you have full bisection bandwidth.
Anything less will reduce your scaling. That is, if you use gig-e
you can't trunk switches; you will need to stay within a single
switch. Typically, I've seen a 16-cpu job on gig-e give about a 10
times speedup. Of course, it is processor/memory/nic dependent.
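The transpose is what stresses the network: a distributed 3D FFT transforms the two locally held axes on each node, then globally transposes the array (an all-to-all) so the remaining axis becomes local. A minimal single-process sketch of that decomposition in numpy (my own illustration, not code from this post; on a cluster the transpose would be an MPI_Alltoall over slabs):

```python
import numpy as np

def fft3d_via_transpose(a):
    """3D FFT computed the way transpose-based parallel codes do it:
    FFT the two locally held axes, globally transpose the array,
    then FFT the remaining axis."""
    a = np.fft.fft(a, axis=1)    # local axis under a z-slab decomposition
    a = np.fft.fft(a, axis=2)    # local axis
    a = a.transpose(2, 1, 0)     # on a cluster this is the all-to-all step
    a = np.fft.fft(a, axis=2)    # this is the original z axis
    return a.transpose(2, 1, 0)  # transpose back to the original layout

# Agrees with the direct 3D transform:
x = np.random.rand(64, 64, 64) + 1j * np.random.rand(64, 64, 64)
assert np.allclose(fft3d_via_transpose(x), np.fft.fftn(x))
```

The single transpose moves the entire array through the network, which is why bisection bandwidth, not per-link latency, tends to dominate.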
I've also run FFTs on Quadrics Elan 3/4, IBM HPS, and SGI Numalink
4. Since these are considerably higher-bandwidth networks, they
perform much better. On a 16-cpu job I've seen around a 14 times
speedup on these higher-bandwidth networks.
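For reference, those speedups correspond to the following parallel efficiencies (simple arithmetic on the numbers above, not figures from the post itself):

```python
def efficiency(speedup, ncpus):
    """Parallel efficiency: achieved speedup divided by ideal speedup."""
    return speedup / ncpus

print(efficiency(10, 16))  # gig-e: 0.625
print(efficiency(14, 16))  # higher-bandwidth interconnects: 0.875
```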
As the size increases (say 256 cpus), the networks that maintain full
bisection bandwidth scale the best. There are very few reasonably
priced gig-e switches that maintain full bisection bandwidth at 256
cpus, while Quadrics and HPS do (and though their starting price is
high, at the larger system sizes they become a realistic
proposition). Numalink falls away a little due to its unusual network
topology (a dual-plane, quad-bristled fat tree), in which network
connectivity per cpu drops as the system gets larger.
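Why full bisection bandwidth dominates: in the global transpose every cpu exchanges data with every other cpu, so roughly half the whole array crosses the network bisection. A back-of-envelope lower bound on one transpose (my own rough model, not from the post; it ignores latency, TCP overhead, and message packing):

```python
def transpose_time_lower_bound(n, p, link_gbits, bytes_per_elem=16):
    """Rough lower bound (seconds) on one global transpose of an
    n^3 double-complex array over p cpus, assuming full bisection
    bandwidth (p/2 links across the bisection) and that about half
    the array must cross it. Ignores latency and protocol overhead."""
    total_bytes = n ** 3 * bytes_per_elem
    bisection_bytes_per_s = (p / 2) * link_gbits * 1e9 / 8
    return (total_bytes / 2) / bisection_bytes_per_s

# A 64^3 double-complex array on 16 cpus over gig-e:
print(transpose_time_lower_bound(64, 16, 1.0))  # ~0.002 s per transpose
```

Halving the bisection bandwidth (e.g. by trunking two switches) doubles this bound, which is why staying within one full-bisection switch matters so much.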
If you want to go with gig-e a few things to be aware of:
* The nic matters (pro1000MT's give 10-15% better performance than
others)
* Go with single-cpu nodes - higher per-cpu network bandwidth
* If you get dual-core cpus, treat the node as a single-core node
(leave the 2nd core free to do all the tcp work)
I've played around with multiply connected nodes (nodes that have
dual-ported nics) and the 2nd nic doesn't give you much (10-15%), and
it requires a fair bit of stuffing around to get it working well. I
think you would be better off running your global fs and other
services over one nic and your mpi traffic over the other. At least
this way, your fs and services shouldn't be stealing your bandwidth.
You may even try running mpi-gamma on the 2nd nic, which should give
you better bandwidth, hence better scaling (I haven't tried this).
If you want real measured numbers, drop me a personal email.
On 01/03/2006, at 2:26, Bill Rankin wrote:
> Hey gang,
> I know that in the past, multidimensional FFTs (in my case, 3D)
> have posed a big challenge in getting them running well on
> clusters, mainly in the areas of scalability. This is somewhat due
> to the need for an All2All communication step in the processing
> (although there seem to be some alternative approaches here).
> There is a research group here at Duke doing some application
> development and they are looking at implementing their codes in a
> cluster environment. The main problem is that 95% of their
> processing time is taken up by medium to large sized 3D FFTs
> (minimum 64 elements on an edge, 256k total elements).
> So I was wondering what the current "state of the art" is in
> clustered 3D FFTs? I've googled around a bit, but most of the
> results seem a little dated. If someone could point me to any
> recent papers or studies, I would be grateful.
> Some specifics that I am interested in would be a good comparison
> of different interconnects on overall performance, as this will
> have a significant impact on the design of their cluster.
Dr Stuart Midgley
sdm900 at gmail.com