The FNN (Flat Neighborhood Network) paradox
tmattox at ieee.org
Wed Feb 21 21:50:02 PST 2001
The FNN (Flat Neighborhood Network) cost paradox goes away once you
look at the right design space and realize that a FNN is not just
a network for lowering costs. Several people have already
discussed this, so I will try to add a few things I think were missed
in the discussion of the costs and benefits of a FNN architecture.
Using a FNN does more than let me build a larger cluster
than the maximum size of an affordable switch. It also gives
a higher bisection bandwidth than a fat tree built from the same
size and number of switches. And remember, it is a "Flat" network:
every pair of nodes is just one switch hop apart, which gives
you lower latency too.
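
The one-hop property is easy to state as a check on the wiring: every
pair of nodes must share at least one switch. Below is a minimal Python
sketch of that check. The six-node, three-switch wiring and the names
`wiring`, `is_flat_neighborhood` and `ports_used` are made up for
illustration; they are not KLAT2's actual layout or part of any
released FNN tool.

```python
from itertools import combinations

# Hypothetical wiring: node -> set of switches its NICs plug into.
# Six nodes, two NICs each, three 4-port switches (A, B, C).
wiring = {
    0: {"A", "B"}, 1: {"A", "B"},
    2: {"A", "C"}, 3: {"A", "C"},
    4: {"B", "C"}, 5: {"B", "C"},
}

def is_flat_neighborhood(wiring):
    """True if every pair of nodes has at least one switch in common."""
    return all(wiring[a] & wiring[b] for a, b in combinations(wiring, 2))

def ports_used(wiring):
    """Count how many ports each switch needs for this wiring."""
    counts = {}
    for switches in wiring.values():
        for s in switches:
            counts[s] = counts.get(s, 0) + 1
    return counts
```

In this toy layout six nodes are all one hop apart using only 4-port
switches, which a single 4-port switch could not manage.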
Some of this can be found in our paper from ALS2000 in the Extreme Linux
"KLAT2's Flat Neighborhood Network"
Basically, the extra costs are having more than one NIC-port pair per
node in your cluster, since you need more than one NIC per node to have
the full connectivity. However, the extra NIC-port pair(s) per node are
not just an added cost. You do get extra usable bandwidth. How much
depends on your application's communication patterns, and on
whether you are doing just "basic routing" or what we call "advanced routing".
We have not yet had time to do the coding to implement
advanced routing on a FNN. Both schemes are discussed in the paper, but
essentially, with a FNN, between many pairs of nodes, there will be
more than one usable 1-hop communications path. Advanced routing
would take advantage of those extra paths, to get what might best
be described as "destination specific channel bonding", but without
the slaved MAC addresses.
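
To make the idea of multiple one-hop paths concrete, here is a rough
Python sketch. It is not the advanced routing described in the paper
(which, as noted, is not yet implemented); the round-robin policy, the
`wiring` table, and the function names are assumptions chosen only to
show where the extra paths come from.

```python
# Hypothetical wiring: node -> set of switches its NICs plug into.
wiring = {
    0: {"A", "B"}, 1: {"A", "B"},
    2: {"A", "C"}, 3: {"A", "C"},
    4: {"B", "C"}, 5: {"B", "C"},
}

def one_hop_paths(wiring, src, dst):
    """Switches shared by src and dst; each is a usable one-hop path."""
    return sorted(wiring[src] & wiring[dst])

def pick_path(wiring, src, dst, msg_index):
    """Spread messages across all shared switches in round-robin order.

    This policy is an assumption for illustration; a simpler scheme
    could always use the first shared switch.
    """
    paths = one_hop_paths(wiring, src, dst)
    return paths[msg_index % len(paths)]
```

In this toy wiring, nodes 0 and 1 share switches A and B, so successive
messages between them alternate; nodes 0 and 2 share only A, so they
gain nothing.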
Even without advanced routing on KLAT2, we have observed quite good
performance on real applications (CFD code), as well as benchmarks.
You can see one of our papers on this in the SC2000 proceedings here:
"High-Cost CFD on a Low-Cost Cluster"
For most clusters smaller than 64 nodes, a FNN is probably not a
great choice, due to its infancy and currently very limited
software support. Given time, both will improve. However, if you
are building a large cluster, a FNN may be able to do great things
for you even now. We hope, in the future, to have all our FNN
design tools in the public domain, and then even smaller clusters
may benefit... You can get a $54 eight-port Compex switch from Buy.com.
Now, if only there were reliable NICs for less than $8 each :)
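
As a back-of-the-envelope check on those prices, the Python sketch
below computes the network cost per node. It assumes every switch port
is in use and takes the wished-for $8 NIC at face value; the function
name and the two-NIC figure are illustrative, not measurements.

```python
def network_cost_per_node(nics_per_node, nic_price, switch_price, switch_ports):
    """Cost of the NIC-port pairs for one node, assuming no idle switch ports."""
    port_price = switch_price / switch_ports
    return nics_per_node * (nic_price + port_price)

# $54 eight-port switch, hypothetical $8 NIC, two NICs per node (assumed).
cost = network_cost_per_node(2, 8.0, 54.0, 8)
```

With these inputs each NIC-port pair works out to $14.75, so a two-NIC
node would carry $29.50 of network hardware.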
As one of the developers of the FNN architecture, I must apologize
to the beowulf community for not yet having our design and implementation
tools out and available for you to use at this time. It is one thing
to hack together a new machine and get it to work. It is another to
make the tools usable by others. Please be patient. Now
that I have seen a real discussion of FNNs on the list, I will
redouble my efforts to get the software tools releasable.
And here are the obligatory URLs:
http://aggregate.org/FNN/ The home page for FNNs
http://aggregate.org/KLAT2/ The first machine with a FNN
You may also wish to look at the Bunyip machine which has a similar network,
developed independently and almost at the exact same time as KLAT2.
P.S. - Anyone care to try to make a version of PVM or MPICH that does
not assume all nodes have just one universally accessible IP address each?
We looked into it, and it seemed to be a bear in there... I ran away :)
I was able to hack LAM MPI over a weekend to get it to work, so PVM and
MPICH are not even on a back-burner for us right now...
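
The difficulty is that on a FNN a node has one address per NIC, and
which address a peer should use depends on which switch the two share.
A minimal Python sketch of such a per-destination lookup follows. The
10.0.x.y addressing scheme, the `wiring` table, and the function names
are assumptions for illustration, not how KLAT2 or the LAM MPI changes
mentioned above are actually set up.

```python
# Hypothetical wiring: node -> set of switches its NICs plug into.
wiring = {
    0: {"A", "B"}, 1: {"A", "B"},
    2: {"A", "C"}, 3: {"A", "C"},
    4: {"B", "C"}, 5: {"B", "C"},
}
SUBNET = {"A": 1, "B": 2, "C": 3}  # assumed: one IP subnet per switch

def address_on(node, switch):
    """Assumed scheme: 10.0.<subnet>.<node + 1>."""
    return f"10.0.{SUBNET[switch]}.{node + 1}"

def peer_table(wiring, src):
    """Map each destination to the address src should use to reach it."""
    table = {}
    for dst in wiring:
        if dst == src:
            continue
        shared = sorted(wiring[src] & wiring[dst])
        if not shared:
            raise ValueError(f"nodes {src} and {dst} share no switch")
        # Always take the first shared switch (a simple fixed choice).
        table[dst] = address_on(dst, shared[0])
    return table
```

Under these assumptions node 2 reaches node 0 at 10.0.1.1 while node 4
reaches the same node at 10.0.2.1, which is exactly the case a
one-address-per-host model cannot express.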
Tim Mattox - tmattox at ieee.org - http://home.earthlink.net/~timattox
http://aggregate.org/KAOS/ - http://advogato.org/person/tmattox/