[Beowulf] Infiniband modular switches

Christian Bell christian.bell at qlogic.com
Mon Jun 16 10:32:17 PDT 2008

On Sun, 15 Jun 2008, Gilad Shainer wrote:

> Static routing is the best approach if your pattern is known. In other
> cases it depends on the applications. LANL and Mellanox have presented a
> paper on static routing and how to get the maximum of it last ISC. There
> are cases where adaptive routing will show a benefit, and this is why we
> see the IB vendors add adaptive routing support as well. But in general,
> the average effective bandwidth is much much higher than the 40% you
> claim.  

As Mark Hahn pointed out, how are so-called "known patterns"
representative of any real system?

Even a single application with a "known pattern" doesn't translate
as-is in practice, even less so on a capacity system.  While the
"shift all-to-all pattern" referred to in the paper you cite is
interesting (on paper) in that it stresses the entire connectivity of
a full bisection bandwidth (FBB) fabric, it remains a simulation
carried out in isolation.
Sticking to simulation, I find the observed switch latency at the
egress ports, as a function of switch loading under random
communication patterns, to be a more interesting data point.
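
To make the contrast between shift and other patterns concrete, here
is a toy sketch, not the model used in the cited paper. It assumes a
two-level fat tree with k hosts and k uplinks per leaf switch, and a
destination-mod-k static routing rule (one common static scheme, chosen
here purely for illustration). The function name and parameters are
hypothetical.

```python
import random
from collections import Counter

def max_link_load(flows, k):
    """Return the largest number of flows sharing one leaf<->spine link.

    Toy two-level fat tree: k hosts and k uplinks per leaf, k spines.
    Assumed static routing: a flow to host dst always crosses spine dst % k.
    """
    load = Counter()
    for src, dst in flows:
        src_leaf, dst_leaf = src // k, dst // k
        if src_leaf == dst_leaf:
            continue  # traffic stays inside the leaf switch
        spine = dst % k
        load[("up", src_leaf, spine)] += 1
        load[("down", spine, dst_leaf)] += 1
    return max(load.values(), default=0)

k, leaves = 4, 8
n = k * leaves

# Shift patterns: host i sends to host (i + s) mod n, for every shift s.
shift_loads = [
    max_link_load([(i, (i + s) % n) for i in range(n)], k)
    for s in range(1, n)
]

# One random permutation (fixed seed so the run is repeatable).
rng = random.Random(0)
perm = list(range(n))
rng.shuffle(perm)
random_load = max_link_load(list(enumerate(perm)), k)

# Hot-spot case: the k hosts of leaf 0 each target a different leaf,
# but every destination maps to spine 0 under the mod-k rule.
hot_load = max_link_load([(i, k * (i + 1)) for i in range(k)], k)

print(max(shift_loads), random_load, hot_load)
```

Under these assumptions every shift keeps each link at one flow, while
the hot-spot case puts all k = 4 flows on a single uplink, so each flow
sees at most a quarter of the link bandwidth even though the rest of
the fabric is idle.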

However, it can be even more revealing to try to scale an expensive
communication operation on a real system, only to notice that this is
where the paper FBB breaks down.  40% looks like a large number but
it's not uncommon to see application writers report large speedups by
breaking down bandwidth-bound problems into latency-sensitive ones.
This looks counter-intuitive because the paper FBB can be available
on the fabric, but suggests that systems don't always deliver FBB
*even if* the pattern is known and the fabric is otherwise idle.
Given a capacity system, there's huge potential in looking at
alternative routing methods -- all of which is orthogonal to system
size and the ability to describe and understand one's communication
patterns.
> There are some vendors that uses only the 24 port switches to build very
> large scale clusters - 3000 nodes and above, without any
> oversubscription, and they find it more cost effective. Using single
> enclosures is easier, but the cables are not expensive and you can use
> the smaller components. I used the 24 ports switches to connect my 96
> node cluster. I will replace my current setup with the new 36 InfiniBand
> port switches this month, since they provide lower latency and adaptive
> routing capabilities. And if you are bandwidth bounded, using IB QDR
> will help. You will be able to drive more than 3GB/s from each server. 
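
For reference, the switch counts behind these sizes can be sketched
with simple arithmetic. This assumes a folded-Clos (fat tree) built
from identical switches with half the ports facing down and half up at
every tier except the top; the clusters mentioned in the thread may be
wired differently, and the helper names are hypothetical.

```python
def max_hosts(radix, tiers):
    """Hosts supported at full bisection by a fat tree of identical switches."""
    return 2 * (radix // 2) ** tiers

def tiers_needed(radix, hosts):
    """Smallest number of tiers whose full-bisection capacity covers hosts."""
    tiers = 1
    while max_hosts(radix, tiers) < hosts:
        tiers += 1
    return tiers

def two_tier_switches(radix, hosts):
    """(leaf, spine) switch counts for a two-tier full-bisection fabric.

    Ignores cabling symmetry; assumes hosts <= max_hosts(radix, 2).
    """
    down = radix // 2                         # host-facing ports per leaf
    leaf_count = -(-hosts // down)            # ceiling division
    spine_count = -(-(leaf_count * down) // radix)
    return leaf_count, spine_count

print(max_hosts(24, 2), max_hosts(24, 3))     # two- and three-tier limits
print(tiers_needed(24, 3000))                 # tiers for a 3000-node fabric
print(two_tier_switches(24, 96))              # the 96-node case
```

With 24-port switches this gives 288 hosts for two tiers and 3456 for
three, so a 3000-node fabric without oversubscription needs three
tiers, and a 96-node cluster needs 8 leaf and 4 spine switches.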

Along similar lines (and with less product placement), buying more
cores and less IB can help you solve (and scale) larger problems.

In a world of quick inferences, one is also permitted to conclude
that implementors find it *performance cost effective* to *not* pay
top dollar for paper FBB when only a fraction of the FBB performance
can be achieved.  Don't need the oversubscription for those cases.

Based on some of the points provided so far, it's as if "known
patterns" are already well served by IB static routing and that the
"other cases" will now benefit from newer IB "adaptive routing
support".  For me, working for an IB vendor, and for anyone out
there who spends a lot of their time solving bandwidth-bound problems
or working on capacity systems, the discussion can't possibly be
reduced to this.  The design space for routing methods based on
modular switches is much larger than what is currently offered by the
Mellanox portfolio.

So I'd suggest that *no*, static routing is not necessarily the best
approach even if your pattern is known.

    . . christian

christian.bell at qlogic.com
QLogic Host Solutions Group
