[Beowulf] How to configure a cluster network

Thu Jul 24 20:28:25 PDT 2008

Cool, FNNs are still being mentioned on the Beowulf mailing list...
For those not familiar with the Flat Neighborhood Network (FNN) idea,
check out this URL:  http://aggregate.org/FNN/

For those who haven't played with our FNN generator CGI script, do try
it out.  Hank (my Ph.D. advisor) enhanced the CGI a while back to generate
pretty multi-color pictures of the resulting FNNs.

Unfortunately, for the particular input parameters from this thread (six
24-port switches and 50 nodes), each node would need a 3-port HCA (or 3 HCAs),
plus a 7th switch, to generate a Universal FNN.  FNNs don't really shine until
you have 3 or 4 NICs/HCAs per compute node.
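
For anyone who wants to sanity-check those numbers, here is a rough sketch of
two counting bounds.  These are only necessary conditions for a universal FNN
(every pair of nodes sharing at least one switch), not the search the FNN
generator actually performs, and the function names are made up for
illustration.

```python
import math

def min_ports_per_node(nodes, switch_ports):
    # A node with k ports sits on k switches.  Each of those switches has
    # (switch_ports - 1) other ports, so the node can share a switch with
    # at most k * (switch_ports - 1) other nodes.  A universal FNN needs
    # that to cover all (nodes - 1) peers.
    return math.ceil((nodes - 1) / (switch_ports - 1))

def min_switches(nodes, ports_per_node, switch_ports):
    # Every node port occupies exactly one switch port.
    return math.ceil(nodes * ports_per_node / switch_ports)

k = min_ports_per_node(50, 24)   # ceil(49 / 23)
s = min_switches(50, k, 24)      # ceil(50 * k / 24)
print(k, s)
```

With 2 ports per node, a node could reach at most 2 * 23 = 46 of its 49
peers, which is why the bound jumps to 3 ports, and 150 node ports do not fit
in the 144 ports that six 24-port switches provide.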

Anyway, you would get a LOT more bandwidth with an FNN in this case...
and of course, the "single-switch-latency" that is characteristic of FNNs.
Though, as others have mentioned, IB switch latency is pretty darn small,
so latency would not be the primary reason to use FNNs with IB.

I wonder if anyone has built an FNN using IB... or for that matter, any
link technology other than Ethernet?

On Thu, Jul 24, 2008 at 5:00 PM, Mark Hahn <hahn at mcmaster.ca> wrote:
>> Well the top configuration (and the one that I suggested) is the one
>> that we have tested and know works. We have implemented it in
>> hundreds of clusters. It also provides redundancy for the core
>> switches.
>
> just for reference, it's commonly known as "fat tree", and is indeed
> widely used.
>
>> With any network you need to avoid like the plague any kind of loop,
>> they can cause weird problems and are pretty much unnecessary. for
>
> well, I don't think that's true - the most I'd say is that given
> the usual spanning-tree protocol for eth switches, loops are a bug.
> but IB doesn't use eth's STP, and even smarter eth networks can take
> good advantage of multiple paths, even loopy ones.
>
>> instance, why would you put a line between the two core switches? Why
>> would that line carry any traffic?
>
> indeed - those examples don't make much sense.  but there are many others
> that involve loops that could be quite nice.  consider 36 nodes: with
> 2x24pt, you get 3:1 blocking (6 inter-switch links).  with 3 switches, you
> can do 2:1 blocking (6 interlinks in a triangle, forming a loop.)
> dual-port nics provide even more entertainment (FNN, but also the ability to
> tolerate a leaf-switch failure...)
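
The blocking figures in that 36-node example can be reproduced with a small
sketch.  It assumes nodes are spread evenly, every leftover switch port
becomes an inter-switch link, those links are split evenly across the other
switches, and blocking is counted per pair of switches (which is how the
numbers above appear to be counted).  The function name is hypothetical.

```python
from fractions import Fraction

def blocking_ratio(nodes, switches, switch_ports):
    # Assumption: nodes divide evenly across the switches.
    if nodes % switches != 0:
        raise ValueError("nodes must divide evenly across switches")
    nodes_per_switch = nodes // switches
    # Assumption: all remaining ports are used as inter-switch links,
    # shared equally among the other switches (a full mesh).
    free_ports = switch_ports - nodes_per_switch
    links_per_peer = free_ports // (switches - 1)
    if links_per_peer < 1:
        raise ValueError("no ports left for inter-switch links")
    # Ratio of node ports on one switch to links toward one peer switch.
    return Fraction(nodes_per_switch, links_per_peer)

two = blocking_ratio(36, 2, 24)    # 18 nodes per switch, 6 links to the peer
three = blocking_ratio(36, 3, 24)  # 12 nodes per switch, 6 links per peer
print(f"{two}:1 with 2 switches, {three}:1 with 3 switches")
```
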
>
>> When you consider that it takes 2-4 µs for an MPI message to get from
>
> depends on the nic - mellanox claims ~1 us for connectx (haven't seen it
> myself yet.)  I see 4-4.5 us latency (worse than myri 2g mx!) on
> pre-connectx mellanox systems.
>
>> one node to another on the same switch, each extra hop will only
>> introduce another 0.02 µs (I think?) to that latency so it's not really
>
> with current hardware, I think 100ns per hop is about right.  mellanox
> claims 60ns for the latest stuff.
>
>> Most applications don't use anything like the full bandwidth of the
>> interconnect so the half bisectionalness of everything can generally
>> be safely ignored.
>
> everything is simple for single-purpose clusters.  for a shared cluster
> with a variety of job types, especially for large user populations, large
> jobs and large clusters, you want to think carefully about how much to
> compromise the fabric.  consider, for instance, interference between a
> bw-heavy weather code and some latency-sensitive application (big and/or
> tightly-coupled.)

-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmattox at gmail.com || timattox at open-mpi.org
 I'm a bright... http://www.the-brights.net/