[Beowulf] Infiniband modular switches

Don Holmgren djholm at fnal.gov
Fri Jun 13 12:05:22 PDT 2008

On Fri, 13 Jun 2008, Ramiro Alba Queipo wrote:

> On Thu, 2008-06-12 at 10:08 -0500, Don Holmgren wrote:
>> Ramiro -
>> You might want to also consider buying just a single 24-port switch for your 22
>> nodes, and then when you expand either replace with a larger switch, or build a
>> distributed switch fabric with a number of leaf switches connecting into a
>> central spine switch (or switches).  By the time you expand to the larger
>> cluster, switches based on the announced 36-port Mellanox crossbar silicon will
>> be available and perhaps per port prices will have dropped sufficiently to
>> justify the purchase delay and the disruption at the time of expansion.
> Could you explain this solution to me? I did not know about it.

As far as I know, all currently available commercial Infiniband switches are 
based on the Mellanox 24-port non-blocking silicon switch chip (InfiniScale 
III).  The 96, 144, and 288 port modular switches from the various companies use 
a number of these individual chips in a layered (3-hop) design that provides 
full bisection bandwidth.  One can also construct a full bisection bandwidth
144-port (say) switch out of eighteen 24-port switches (twelve leaf switches
and six spine switches): out of the total 432 switch ports, 144 ports connect
to nodes, and 288 ports connect to other switch ports.  The latency should be
essentially identical to that of a 144-port chassis, as both would use three
hops (disregarding the negligible delay, roughly a nanosecond per foot, of the
extra cable length when using 24-port switches).
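
The port arithmetic for such a leaf-and-spine fabric can be checked with a short
Python sketch.  This is only an illustration, not a vendor sizing tool: the
function name and the returned fields are made up here, and it assumes a plain
two-level design in which every leaf switch splits its ports evenly between
nodes and uplinks.  Note that the switch count includes the spine switches as
well as the twelve leaf switches.

```python
import math

def full_bisection_fabric(nodes, radix=24):
    """Size a two-level (leaf/spine) full bisection fabric.

    Hypothetical helper for illustration; assumes every switch has
    `radix` ports and each leaf uses half its ports for nodes.
    """
    down = radix // 2              # leaf ports facing nodes
    up = radix - down              # leaf ports facing spines (1:1 with down)
    max_nodes = radix * down       # at most `radix` leaves, one link per spine
    if not 0 < nodes <= max_nodes:
        raise ValueError(f"two levels of {radix}-port switches reach 1..{max_nodes} nodes")
    leaves = math.ceil(nodes / down)
    uplinks = leaves * up                  # leaf-to-spine cables
    spines = math.ceil(uplinks / radix)    # lower bound from port count
    return {
        "leaves": leaves,
        "spines": spines,
        "switches": leaves + spines,
        "total_ports": (leaves + spines) * radix,
        "node_ports": nodes,
        "inter_switch_ports": 2 * uplinks,  # both ends of every uplink
    }

if __name__ == "__main__":
    for key, value in full_bisection_fabric(144).items():
        print(f"{key}: {value}")
```

For 144 nodes and 24-port switches this gives 12 leaves and 6 spines, with 144
ports facing nodes and 288 ports used for switch-to-switch links.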

Usually the per-port cost for a large switch is less than the per-port cost for
a bunch of 24-port switches.  When you don't need a full 144-port switch, you
can either buy the large chassis and populate it with only a limited number of
blades, or go with a set of 24-port switches.  For smaller networks a set of
24-port switches is cheaper.

The next generation switch silicon will be 36 ports (InfiniScale IV), rather 
than 24.  Obviously I can't predict for certain that the large switches to
be built out of this silicon will be cheaper than the current models, but it
is reasonable to guess that this will be the case.

>> If your applications can tolerate some oversubscription (less than a 1:1 ratio
>> of leaf-to-spine uplinks to leaf-to-node connections), a distributed switch
>> fabric (leaf and spine) has the advantage of shorter (and cheaper) cables
>> between the leaf switches and your nodes, and relatively fewer longer cables
>> from the leaves back to the spine, compared with a single central switch.
> What do you mean by a distributed switch fabric?
> How does it differ from a modular solution?
> Thanks for your answer
> Regards

I think both of these questions are answered above.  But to be clear, by 
"distributed" I mean that instead of one large switch chassis one would use
a number of 24-port switches.  In this case it is very natural to put the
individual switches next to their nodes.  See, for example, the "A New Approach 
to Clustering - Distributed Federated Switches" white paper at the Mellanox
web site.  When the switches are next to the nodes, the cable plant can be a lot
easier to deal with.  Don't underestimate the pain of having 144 fairly hefty
Infiniband cables all terminating into a 10U chassis.

One additional item of note when using a distributed fabric: if your typical 
jobs use a small number of nodes, then it is quite possible to configure your 
batch scheduler so that the nodes belonging to an individual job all connect to 
the same leaf switch.  This means that your messages only have to go through one 
switch hop, so latency is reduced compared with going through three hops in a 
large modular switch chassis (although I seriously doubt that the quarter 
microsecond of latency difference here matters to many codes).  Perhaps of more 
significance, though, is that you can use oversubscription to lower the cost of 
your fabric.  Instead of connecting 12 ports of a leaf switch to nodes and using 
the other 12 ports as uplinks, you might get away with 18 nodes and 6 uplinks, 
or 20 nodes and 4 uplinks.  As core counts are increasing, this is becoming more 
and more viable for some applications.
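
To make the trade-off concrete, here is a small Python sketch of the
oversubscription arithmetic for the leaf splits mentioned above.  It is a rough
illustration only: the function names are invented for this example, the
144-node cluster size is just an assumed figure, and the spine count is a simple
lower bound from port counts that ignores how evenly the uplinks can be spread.

```python
import math
from fractions import Fraction

def leaf_split(node_ports, radix=24):
    """Return (uplinks, oversubscription ratio) for one leaf switch."""
    uplinks = radix - node_ports
    if node_ports <= 0 or uplinks <= 0:
        raise ValueError("need at least one node port and one uplink")
    return uplinks, Fraction(node_ports, uplinks)

def fabric_switches(nodes, node_ports, radix=24):
    """Count leaf and spine switches for a given leaf split (hypothetical helper)."""
    uplinks, _ = leaf_split(node_ports, radix)
    leaves = math.ceil(nodes / node_ports)
    spines = math.ceil(leaves * uplinks / radix)  # lower bound by port count
    return leaves, spines

if __name__ == "__main__":
    nodes = 144  # assumed cluster size, for illustration only
    for node_ports in (12, 18, 20):
        uplinks, ratio = leaf_split(node_ports)
        leaves, spines = fabric_switches(nodes, node_ports)
        print(f"{node_ports} nodes + {uplinks} uplinks per leaf: "
              f"{ratio.numerator}:{ratio.denominator} oversubscription, "
              f"{leaves} leaves + {spines} spines")
```

Under these assumptions, 18 nodes per leaf is 3:1 oversubscription and 20 nodes
per leaf is 5:1, and either split needs noticeably fewer switches and uplink
cables than the 1:1 layout.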

