[Beowulf] Infiniband modular switches
patrick at myri.com
Thu Jun 26 21:49:10 PDT 2008
Gilad Shainer wrote:
> Not only that I was there, but also had conversations afterwards. It is
> a really "fair" comparison when you have different injection
> rate/network capacity parameters. You can also take 10Mb and inject it
> into 10Gb/s network to show the same, and you always can create the
> network pattern to show what you want to show, but you prove nothing
The injection rate is irrelevant in these tests and the network pattern
is well defined: *random* pairwise exchange. In both cases (IB and
Quadrics in the slides), the fabric is full bisection, i.e. there are
enough links in the network to support the aggregate traffic of all
ports. The test consists of measuring the MPI bandwidth between random
pairs of nodes simultaneously.
Logically, you would expect to reach the full bandwidth between all
pairs, because there are enough links in the fabric to support this
traffic. If you measure each pair independently, you will always get the
link rate, no problem. However, if you measure them simultaneously, you
will have contention: a few pairs may still reach full bandwidth but
most will only get a fraction of it. You can measure the min, max and
average of the bandwidth between these pairs for a large number of
different pairs to evaluate the efficiency of the routing.
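To make the measurement concrete, here is a rough sketch of that kind of
experiment as a toy simulation. Everything in it is an assumption for
illustration, not the setup behind the slides: a two-level fat tree with
k leaf switches, k hosts per leaf and k spines; a static,
destination-based route choice (spine = dst mod k); and a crude sharing
model where a flow gets 1/n of its most loaded link. The function names
are made up.

```python
import random
from collections import Counter

def flow_efficiencies(flows, k):
    """Per-flow bandwidth as a fraction of link rate (toy model).

    flows: list of (src, dst) host indices in a two-level fat tree with
    k leaf switches, k hosts per leaf and k spine switches (assumed).
    """
    load = Counter()
    paths = []
    for src, dst in flows:
        src_leaf, dst_leaf = src // k, dst // k
        if src_leaf == dst_leaf:
            path = []  # traffic stays inside the leaf switch
        else:
            spine = dst % k  # static, destination-based route (assumed)
            path = [("up", src_leaf, spine), ("down", spine, dst_leaf)]
        for link in path:
            load[link] += 1
        paths.append(path)
    # Crude sharing: each flow gets 1/n of its most loaded link.
    return [1.0 / max((load[l] for l in path), default=1) for path in paths]

def random_pairwise_flows(k, rng):
    """Pair hosts up at random; each pair exchanges in both directions."""
    hosts = list(range(k * k))
    rng.shuffle(hosts)
    flows = []
    for a, b in zip(hosts[0::2], hosts[1::2]):
        flows += [(a, b), (b, a)]
    return flows

def run_trials(k=16, trials=200, seed=0):
    """Return (min, max, mean) efficiency over many random pairings."""
    rng = random.Random(seed)
    effs = []
    for _ in range(trials):
        effs += flow_efficiencies(random_pairwise_flows(k, rng), k)
    return min(effs), max(effs), sum(effs) / len(effs)

if __name__ == "__main__":
    lo, hi, avg = run_trials()
    print(f"min={lo:.2f} max={hi:.2f} avg={avg:.2f}")
```

In this model a regular pattern such as h -> (h + k) mod k^2 comes out
contention-free, while random pairings do not. The model only counts
flows per link and ignores how HOL blocking propagates through the
switches, so do not expect it to reproduce measured numbers.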
The link bandwidth (injection rate) is irrelevant because the results
are normalized (efficiency). What the slides show is that the efficiency
of Quadrics is better (the average bandwidth is higher despite a lower
link bandwidth) and the bandwidth distribution is very narrow for
Quadrics (spread between min and max pairwise bandwidth). This is a
direct result of adaptive routing in Quadrics vs static routing in IB.
Woven Systems reported similar results at Sandia using adaptive routing
in Ethernet vs static routing in IB.
With static routing, you can find *one* set of routes that will provide
full bandwidth between all pairs for a given set of pairs. If you change
the set of pairs without changing the set of routes, then you will get
much less than full bandwidth. On average, if you measure with enough
random sets of pairs, you will get an aggregate efficiency of ~40% on
several interconnects using full bisection topologies (Clos or Fat
Tree), single virtual channel, wormhole switching and static routing.
It has nothing to do with link rate; it is due to Head-of-Line (HOL)
blocking:
> here. I am not favor of static routing only or adaptive routing only,
> and having both options is the most flexible solution.
It's not as simple as that. If you have a cluster that will run multiple
jobs, most likely at the same time, which routing do you use? If you
use static routing, efficiency may be good for one job, and bad for
another. Worse, the efficiency will change if I run the same job on
different nodes, or depending on what other job is running at the same
time on the cluster. If you use adaptive routing, efficiency will most
likely be higher (maybe not by much) but, more importantly, it will be
more deterministic. Determinism means less load imbalance, more
predictable time to completion, and higher job throughput.
So far, IB has only used static routing. If it still relies on packet order
on the wire for a given Queue Pair, then the only way to do some sort of
adaptive routing is to use a different QP for each possible route (LID).
This is what Panda's group tried in a paper. However, the number of QPs
explodes, each QP is still subject to HOL blocking, and the QP
interleaving is static.
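A back-of-the-envelope sketch of why the QP count explodes, assuming one
connected QP per (peer, route) and every process talking to every other
process. The function name and the example numbers are made up for
illustration and are not taken from the paper.

```python
def qps_per_process(num_processes, routes_per_peer):
    """QPs one process needs with one QP per (peer, route). Assumed model."""
    return (num_processes - 1) * routes_per_peer

# Hypothetical job: 1024 processes, 4 alternative routes (LIDs) per peer.
single_route = qps_per_process(1024, 1)  # 1023 QPs
multi_route = qps_per_process(1024, 4)   # 4092 QPs
print(single_route, multi_route)
```

The count per process grows with the product of peers and routes, so
every extra route you want to spread traffic over multiplies the QP
state on every node.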
>> You can see that the worst case static routing goes quickly
>> below 40%, but the average eventually goes there as well.
> So what is your proof point here? I am sure you will find many cases
> that static routing will do better (definitely on other interconnects)
> and cases for adaptive routing.
No, static routing is static routing, on all interconnects. There is no
magic here, HOL blocking applies to everybody. My point is that under
*random* structured patterns (such as pairwise exchange), static routing
sucks. There are no other cases of random, it's just random.
If you want to argue that structured traffic patterns across multiple
jobs running simultaneously on the same fabric are not equivalent to
random structured traffic, then this will go nowhere.