[Beowulf] How to configure a cluster network
tmattox at gmail.com
Thu Jul 24 20:28:25 PDT 2008
Cool, FNNs are still being mentioned on the Beowulf mailing list...
For those not familiar with the Flat Neighborhood Network (FNN) idea,
check out this URL: http://aggregate.org/FNN/
For those who haven't played with our FNN generator CGI script, do try
it out. Hank (my Ph.D. advisor) enhanced the CGI a while back to generate
pretty multi-color pictures of the resulting FNNs.
Unfortunately, for the particular input parameters from this thread of
six 24-port switches and 50 nodes, each node would need a 3-port HCA (or
3 HCAs) and a 7th switch to generate a Universal FNN. FNNs don't really
shine until you have 3 or 4 NICs/HCAs per compute node.
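
For anyone wondering where the "3 ports and a 7th switch" figures come from, here is a rough counting sketch. It is not what the FNN generator does internally; it only computes two necessary lower bounds, under the assumption that an FNN requires every pair of nodes to share at least one switch. The function name and parameters are made up for this illustration.

```python
import math

def fnn_lower_bounds(nodes, ports_per_switch):
    """Counting lower bounds for a flat neighborhood network.

    Assumption: every pair of nodes must share at least one switch.
    These are necessary conditions only, not a construction.
    """
    # Each NIC puts a node on a switch with at most (ports - 1) other nodes,
    # and the node has to reach all (nodes - 1) others this way.
    min_nics = math.ceil((nodes - 1) / (ports_per_switch - 1))
    # Every NIC occupies one switch port, so total ports bound the switch count.
    min_switches = math.ceil(nodes * min_nics / ports_per_switch)
    return min_nics, min_switches

print(fnn_lower_bounds(50, 24))  # -> (3, 7)
```

With 50 nodes and 24-port switches, two NICs can reach at most 46 of the 49 other nodes, so three are needed, and 150 node ports do not fit on six switches (144 ports). Whether a wiring that meets these bounds actually exists is a separate question that the generator answers.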
Anyway, you would get a LOT more bandwidth with an FNN in this case...
and of course, the "single-switch-latency" that is characteristic of FNNs.
Though, as others have mentioned, IB switch latency is pretty darn small,
so latency would not be the primary reason to use FNNs with IB.
I wonder if anyone has built an FNN using IB... or for that matter,
anything other than Ethernet?
On Thu, Jul 24, 2008 at 5:00 PM, Mark Hahn <hahn at mcmaster.ca> wrote:
>> Well the top configuration (and the one that I suggested) is the one
>> that we have tested and know works. We have implemented it into
>> hundreds of clusters. It also provides redundancy for the core
> just for reference, it's commonly known as "fat tree", and is indeed
> widely used.
>> With any network you need to avoid like the plague any kind of loop,
>> they can cause weird problems and are pretty much unnecessary. for
> well, I don't think that's true - the most I'd say is that given
> the usual spanning-tree protocol for eth switches, loops are a bug.
> but IB doesn't use eth's STP, and even smarter eth networks can take
> good advantage of multiple paths, even loopy ones.
>> instance, why would you put a line between the two core switches? Why
>> would that line carry any traffic?
> indeed - those examples don't make much sense. but there are many others
> that involve loops that could be quite nice. consider 36 nodes: with
> 2x24pt, you get 3:1 blocking (6 inter-switch links). with 3 switches, you
> can do 2:1 blocking (6 interlinks in a triangle, forming a loop.)
> dual-port nics provide even more entertainment (FNN, but also the ability to
> tolerate a leaf-switch failure...)
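
The 36-node blocking figures quoted above can be reproduced with a small sketch. It assumes nodes are spread evenly over the switches, that every spare port is used for inter-switch links split evenly among the other switches (a full mesh, which for three switches is the triangle), and that "blocking" means nodes on one switch divided by links to one neighboring switch. The function name is hypothetical.

```python
from fractions import Fraction

def pairwise_blocking(nodes, switches, ports_per_switch):
    """Worst-case oversubscription between two directly linked switches.

    Assumptions: even node distribution, all spare ports used as
    inter-switch links, links split evenly over the other switches.
    """
    if nodes % switches:
        raise ValueError("sketch assumes nodes divide evenly across switches")
    nodes_per_switch = nodes // switches
    spare_ports = ports_per_switch - nodes_per_switch
    links_per_pair = spare_ports // (switches - 1)
    if links_per_pair <= 0:
        raise ValueError("no ports left for inter-switch links")
    return nodes_per_switch, links_per_pair, Fraction(nodes_per_switch, links_per_pair)

for n_switches in (2, 3):
    per_switch, links, ratio = pairwise_blocking(36, n_switches, 24)
    print(f"{n_switches} switches: {per_switch} nodes/switch, "
          f"{links} links per switch pair, {ratio}:1 blocking")
```

Two switches give 18 nodes per switch over 6 links (3:1); three switches give 12 nodes per switch with 6 links to each neighbor (2:1).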
>> When you consider that it takes 2-4µs for an mpi message to get from
> depends on the nic - mellanox claims ~1 us for connectx (haven't seen it
> myself yet.) I see 4-4.5 us latency (worse than myri 2g mx!) on
> mellanox systems.
>> one node to another on the same switch, each extra hop will only
>> introduce another 0.02µs (I think?) to that latency so it's not really
> with current hardware, I think 100ns per hop is about right. mellanox
> 60ns for the latest stuff.
>> Most applications don't use anything like the full bandwidth of the
>> interconnect so the half bisectionalness of everything can generally
>> be safely ignored.
> everything is simple for single-purpose clusters. for a shared cluster
> with a variety of job types, especially for large user populations, large
> jobs and large clusters, you want to think carefully about how much to
> compromise the fabric. consider, for instance, interference between a
> bw-heavy weather code and some latency-sensitive application (big and/or
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmattox at gmail.com || timattox at open-mpi.org
I'm a bright... http://www.the-brights.net/