[Beowulf] Clos networking and myrinet
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
Patrick Geoffray patrick at myri.comWed Jun 8 12:41:52 PDT 2005
- Previous message: [Beowulf] Clos networking and myrinet
- Next message: [Beowulf] Clos networking and myrinet
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi Joe, Mack.Joseph at epamail.epa.gov wrote: > I've just found out about Clos networks and realised that > myrinet uses Clos networks. From reading > > http://www.beowulf.org/archive/2002-June/007458.html > > I see that the diameter of the Clos network is log(n). >>From other reading I find that 3 layer Clos networks > are blocking but that 5 (and above) layers are non-blocking. What do you mean by blocking in this context ? The switching elements are full crossbars doing wormhole routing. The only case when an input port is blocked is when the requested output port is used by another packet at that time, and the FIFO on the input port is full. That is the classical wormhole contention, and it's independant of the dimension of the Clos. Contention may happen statistically less often on a larger Clos if the communication load is the same, because there are more redundant paths that the route dispersion mechanism can use. However, larger Clos usually means larger number of nodes, so the average contention would be the same. > The url above shows that Myrinet uses a 3 layer Clos > network for 24..128 nodes, but presumably the network > is blocking. In an mpi type beowulf running a scatter/gather, > if one node is blocked, then all nodes have to wait. > Is this OK or am I missing something? I don't understand the presumption of blocking. Blocking happens in the crossbar if there is contention for an output port. There is some buffering in the input port, but backpressure contention will transfer the blocking to the sender if the packet is large enough. That will block the sender as long as the packet that is using the output port at the head of the blocked worm is going through. If you don't have 2 packets that want to go through the same link at the same time, you don't have blocking. There are several ways of reducing the overall contention in a Clos network (or any multi-stage source-routed wormhole fabric): * load balance the routes over redundant paths: Clos networks provide multiple paths between 2 nodes. By spreading uniformingly the routes, you keep the routes/link ratio to a global minimum. * fragment messages: the unit of switching is a packet, so smaller packets will reduce the average blocking time. 2-4 KB is a sweet spot. * use route dispersion: the Clos topology provides a lot of redundant paths between all nodes. If you use multiple route per destination, you will spread traffic on more links, reducing the statistical contention. GM 2.1.x uses round-robin dispersive routing, it changes route after each packet. MX 1.0 uses adaptive dispersive routing, the sender changes route only after it sensed backpressure flow control (route dispersion creates disorder between packets, so it's better to not use it when you don't need it). > With large beowulfs, why is the diameter of the network > increased from 3 to 5 to 7...? It would be cheaper to > keep the networking at 3 layers. What falls apart > that you need to increase the diameter of the network? There is just a maximum number of nodes you can connect on a Clos of diameter n with N-port crossbars. In 2002, at the time of the referenced email, the Myrinet crossbar we shipped had 16 ports. So the scaling table was the following: Clos diameter: 1 3 5 7 Max nodes count: 16 128 1024 8192 We now use a 32-port crossbar in the 256 ports box, so the crossover points have shifted a bit: Dimension (Clos): 3 5 Max nodes count: 512 (a lot, ouch my head hurts) > Sorry if this is all a FAQ. I've been reading for about > 2 weeks now and don't understand Clos networking > yet. I have been using them for several years, and my head still hurts :-) I am far from being an expert of the graph side of the problem, but here is what it means on the practical side, at least, AFAIK: The Clos topology is a sub-class of fat tree. It is used to sove the classical problem of how to interconnect nodes beyond the size of a economically viable crossbar. The Clos topology was defined at Bell for doing phone switching (you don't want to have a NxN connection grid when N is the number of phone subscribers (copper loops) in a city. The Clos network have some particular properties: * you can compute a deadlock-free set of routes by using a up*down* algorithm (once you go down in the tree, you cannot go up anymore): that makes the mapper not NP-hard. * shortest path routes are deadlock-free (thus optimal): that makes routes selection trivial. * there is multiple redundant paths between 2 nodes: good for failover. * there is enough multiple redundant paths between 2 nodes so that there is always a set of routes that provide full-bisection (half of the nodes talk to the other half without blocking) for a given partition: to exploit this property, you need to use multiples routes, as the partition is expected to change for every communication pattern. The last bullet is the motivation behind route dispersion: - If you use one and only one route between 2 nodes, you may not have unused links because the mapper will load balance the usage of one link among several pair of nodes, but the chance that everybody will have the right set of routes for a specific communication pattern is small. - If you use multiple routes and you change the route used between 2 nodes after sending each packet (round-robin), you statistically increase your chances of having a good set of routes at one time for a given communication pattern (on irregular traffic, there is a lot of head-hurting litterature proving that the efficiency will be ~50% due to random head-of-line blocking). - If you use adaptive route dispersion, you will change route until you find one that is not blocking. If the communication pattern is short in time, you won't have time to stabilize and you end up falling back on the previous round-robin method. If the communication pattern is long enough, each node will find the non-blocking routes that the Clos topology guarantees exist. Hope it helps, even if it looks confusing. Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com
- Previous message: [Beowulf] Clos networking and myrinet
- Next message: [Beowulf] Clos networking and myrinet
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
