[Beowulf] Clos networking and myrinet

Wed Jun 8 12:41:52 PDT 2005

Hi Joe,

Mack.Joseph at epamail.epa.gov wrote:
> I've just found out about Clos networks and realised that
> myrinet uses Clos networks. From reading
> 
> http://www.beowulf.org/archive/2002-June/007458.html
> 
> I see that the diameter of the Clos network is log(n).
>>From other reading I find that 3 layer Clos networks
> are blocking but that 5 (and above) layers are non-blocking.

What do you mean by blocking in this context ? The switching elements
are full crossbars doing wormhole routing. The only case when an input
port is blocked is when the requested output port is used by another
packet at that time, and the FIFO on the input port is full. That is the
classical wormhole contention, and it's independant of the dimension of
the Clos. Contention may happen statistically less often on a larger
Clos if the communication load is the same, because there are more
redundant paths that the route dispersion mechanism can use. However,
larger Clos usually means larger number of nodes, so the average
contention would be the same.

> The url above shows that Myrinet uses a 3 layer Clos
> network for 24..128 nodes, but presumably the network
> is blocking. In an mpi type beowulf running a scatter/gather,
> if one node is blocked, then all nodes have to wait.
> Is this OK or am I missing something?

I don't understand the presumption of blocking. Blocking happens in the
crossbar if there is contention for an output port. There is some
buffering in the input port, but backpressure contention will transfer
the blocking to the sender if the packet is large enough. That will
block the sender as long as the packet that is using the output port at
the head of the blocked worm is going through. If you don't have 2
packets that want to go through the same link at the same time, you
don't have blocking.

There are several ways of reducing the overall contention in a Clos
network (or any multi-stage source-routed wormhole fabric):
* load balance the routes over redundant paths: Clos networks provide
multiple paths between 2 nodes. By spreading uniformingly the routes,
you keep the routes/link ratio to a global minimum.
* fragment messages: the unit of switching is a packet, so smaller
packets will reduce the average blocking time. 2-4 KB is a sweet spot.
* use route dispersion: the Clos topology provides a lot of redundant 
paths between all nodes. If you use multiple route per destination, you 
will spread traffic on more links, reducing the statistical contention. 
GM 2.1.x uses round-robin dispersive routing, it changes route after 
each packet. MX 1.0 uses adaptive dispersive routing, the sender changes 
route only after it sensed backpressure flow control (route dispersion 
creates disorder between packets, so it's better to not use it when you 
don't need it).

> With large beowulfs, why is the diameter of the network
> increased from 3 to 5 to 7...? It would be cheaper to
> keep the networking at 3 layers. What falls apart
> that you need to increase the diameter of the network?

There is just a maximum number of nodes you can connect on a Clos of 
diameter n with N-port crossbars.

In 2002, at the time of the referenced email, the Myrinet crossbar we 
shipped had 16 ports. So the scaling table was the following:
Clos diameter:    1  3   5    7
Max nodes count: 16 128 1024 8192

We now use a 32-port crossbar in the 256 ports box, so the crossover 
points have shifted a bit:
Dimension (Clos): 3    5
Max nodes count: 512 (a lot, ouch my head hurts)

> Sorry if this is all a FAQ. I've been reading for about
> 2 weeks now and don't understand Clos networking
> yet.

I have been using them for several years, and my head still hurts :-)
I am far from being an expert of the graph side of the problem, but here 
is what it means on the practical side, at least, AFAIK: The Clos 
topology is a sub-class of fat tree. It is used to sove the classical 
problem of how to interconnect nodes beyond the size of a economically 
viable crossbar. The Clos topology was defined at Bell for doing phone 
switching (you don't want to have a NxN connection grid when N is the 
number of phone subscribers (copper loops) in a city. The Clos network 
have some particular properties:
* you can compute a deadlock-free set of routes by using a up*down* 
algorithm (once you go down in the tree, you cannot go up anymore): that 
makes the mapper not NP-hard.
* shortest path routes are deadlock-free (thus optimal): that makes 
routes selection trivial.
* there is multiple redundant paths between 2 nodes: good for failover.
* there is enough multiple redundant paths between 2 nodes so that there 
is always a set of routes that provide full-bisection (half of the nodes 
talk to the other half without blocking) for a given partition: to 
exploit this property, you need to use multiples routes, as the 
partition is expected to change for every communication pattern.

The last bullet is the motivation behind route dispersion:
- If you use one and only one route between 2 nodes, you may not have 
unused links because the mapper will load balance the usage of one link 
among several pair of nodes, but the chance that everybody will have the 
right set of routes for a specific communication pattern is small.
- If you use multiple routes and you change the route used between 2 
nodes after sending each packet (round-robin), you statistically 
increase your chances of having a good set of routes at one time for a 
given communication pattern (on irregular traffic, there is a lot of 
head-hurting litterature proving that the efficiency will be ~50% due to 
random head-of-line blocking).
- If you use adaptive route dispersion, you will change route until you 
find one that is not blocking. If the communication pattern is short in 
time, you won't have time to stabilize and you end up falling back on
the previous round-robin method. If the communication pattern is long 
enough, each node will find the non-blocking routes that the Clos 
topology guarantees exist.

Hope it helps, even if it looks confusing.

Patrick
-- 

Patrick Geoffray
Myricom, Inc.
http://www.myri.com