[Beowulf] Intel pulls networking onto Xeon Phi

atchley tds.net atchley at tds.net
Wed Dec 4 07:49:15 PST 2013


On Tue, Dec 3, 2013 at 10:45 PM, Greg Lindahl <lindahl at pbm.com> wrote:

> On Mon, Dec 02, 2013 at 08:41:26AM -0500, atchley tds.net wrote:
> > On Mon, Dec 2, 2013 at 8:37 AM, atchley tds.net <atchley at tds.net> wrote:
>
> > > I am not sure what Aries currently offers that IB does not.
>
> The IB in question is the True Scale adapter, which does some things
> really fast and other things pretty slowly. Aries has different
> features (quite different and more capable than IB, really), and is
> much larger.
>
> To put this into perspective, I suspect the typical modern Ethernet
> adapter has more gates than True Scale. If you're going to add
> something to a CPU, it had best be small. CPU guys get really irate if
> you reduce their yield.
>

Given that the article said it could borrow features not found in
Ethernet or IB controllers, I don't think it was meant to be True Scale only.

What Aries features do you have in mind that are different and/or more
capable? Are they expressed by the uGNI interface? I find the uGNI
interface to be a strict subset of IB and the hardware has some interesting
limits.


> > > As Myricom showed with MX over Ethernet followed by
> > > Mellanox with RoCE, you can get low latency over Ethernet bypassing the
> > > kernel and the TCP stack.
>
> Indeed, Myricom+MX is quite similar in concept to the IB extension
> found in True Scale. The main difference (in my mind) is that OpenMX
> is hosted on a not-optimized-for-MX generic ethernet adapter, and that
> Myrinet's hardware was not fully optimized to do exactly what MX
> needed, nothing more and nothing less.


Hmm, I was not speaking of OpenMX. It was a minor change to run the native
MX on Myricom's NICs in Ethernet mode. The latency was no different than
running MX on Myrinet until you went through a switch.

OpenMX is API, binary, and wire compatible with MX such that one could run
MX on a Myricom 10G NIC connected to an Ethernet switch and OpenMX on a
non-Myricom 10G Ethernet NIC connected to the same switch. Because OpenMX
does not require specialized hardware, it sits atop the Ethernet driver in
the kernel. It is not kernel-bypass, but it does bypass the full IP stack.
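
To make the "bypass the IP stack" point concrete, here is a minimal sketch of
a raw Ethernet II frame that carries an 8-byte payload with no IP or TCP
header. This is not the MX or OpenMX wire format: the EtherType 0x88B5 (an
IEEE local-experimental value), the MAC addresses, and the build_frame helper
are all placeholders chosen for illustration. It also shows the padding cost
of the 64-byte minimum frame mentioned elsewhere in this thread.

```python
import struct

ETH_HEADER_LEN = 14    # dst MAC (6) + src MAC (6) + EtherType (2)
ETH_MIN_PAYLOAD = 46   # payload is padded so header + payload is 60 bytes
ETH_FCS_LEN = 4        # frame check sequence appended by the NIC


def build_frame(dst: bytes, src: bytes, ethertype: int, payload: bytes) -> bytes:
    """Return an Ethernet II frame (without FCS), zero-padded to the minimum size."""
    if len(dst) != 6 or len(src) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    padded = payload.ljust(ETH_MIN_PAYLOAD, b"\x00")
    return dst + src + struct.pack("!H", ethertype) + padded


# Hypothetical locally administered MAC addresses.
dst = bytes.fromhex("020000000002")
src = bytes.fromhex("020000000001")

# An 8-byte value, e.g. the operand of a 64-bit atomic.
frame = build_frame(dst, src, 0x88B5, struct.pack("!Q", 42))

on_wire = len(frame) + ETH_FCS_LEN   # excludes preamble and inter-frame gap
efficiency = 8 / on_wire             # fraction of the frame that is payload
print(on_wire, efficiency)
```

Preamble and inter-frame gap add further per-frame overhead that this sketch
leaves out.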


> True Scale is the smallest
> possible adapter that supports the basics of what MPI needs.
>
> The fabric is pretty irrelevant, as long as it has flexible
> routing. (See below for comments about SDN.)
>
> > > HPC sends a lot of small messages and various stacks are making use of
> > > 8-byte atomics. It is unhelpful to have a 64 byte minimum frame size in
> > > this case.
>
> Yes, a smaller frame size is quite nice for achieving high message
> rates for tiny packets.
>
> > > Ethernet topology discovery protocols were designed for environments
> > > where equipment can be changed out, expanded, or otherwise altered.
>
> This has changed in the new SDN (software defined networking)
> world. You can think of SDN on Ethernet as Infiniband management
> protocols implemented in ethernet, making many of the same mistakes
> that Infiniband did, plus some new ones.
>

SDN is the Big Data of networking. It can be anything you want. ;-)


> > Ethernet requires a single-path between any two endpoints.
>
> This is not true. It's more accurate to say that ethernet (especially
> TCP) benefits from in-order delivery, which you can ensure either on a
> host-host basis (which is what spanning tree provides) or on a
> per-flow basis (which is what SDN allows.)
>
> Personally, I'm a bit bummed this won't happen until 2015 :-( but I'm
> really excited to see True Scale's basic design continue into another
> generation.


You are confusing a transport layer with the L2 layer. Ethernet does not
care about order. It does not care about reliability. These are provided at
higher layers.

SDN has many uses. You can provide per-flow routing, but it can do much
more.
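
As a rough sketch of what per-flow routing means in practice, the snippet
below hashes a flow identifier to pick one of several equal paths, so every
packet of a given flow takes the same path (and stays in order) while
different flows can spread across paths. The flow tuple layout, the pick_path
name, and the use of CRC32 are assumptions for illustration, not how any
particular SDN controller or switch does it.

```python
import zlib


def pick_path(flow: tuple, n_paths: int) -> int:
    """Map a flow identifier to a path index in [0, n_paths)."""
    if n_paths <= 0:
        raise ValueError("n_paths must be positive")
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % n_paths


# Hypothetical flow: (src MAC, dst MAC, src port, dst port).
flow_a = ("02:00:00:00:00:01", "02:00:00:00:00:02", 5000, 6000)
flow_b = ("02:00:00:00:00:01", "02:00:00:00:00:02", 5001, 6000)

path_a = pick_path(flow_a, 4)
path_b = pick_path(flow_b, 4)
print(path_a, path_b)
```

The choice is deterministic per flow, which is the property that preserves
ordering within a flow; two flows may or may not land on the same path.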

I thought most SDN efforts were to provide routing between fabrics.
Ethernet, by definition, is non-routing and is a common broadcast domain. I
thought STP et al. took care of multiple paths to ensure a single path between
any two MACs. Is it possible for one host to send a broadcast and another
host to receive multiple copies?
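
To illustrate what I mean by STP pruning multiple paths, here is a toy sketch
that takes a looped three-switch topology and keeps only a loop-free tree,
leaving one path between any two switches. It uses a plain breadth-first
search from an arbitrarily chosen root; real STP elects the root by bridge ID
and uses path costs exchanged in BPDUs, none of which is modeled here. The
switch names and the spanning_tree helper are made up.

```python
from collections import deque


def spanning_tree(links, root):
    """Return the set of links kept by a BFS tree rooted at `root`."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    seen = {root}
    tree = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nbr in sorted(adj[node]):
            if nbr not in seen:
                seen.add(nbr)
                tree.add(frozenset((node, nbr)))
                queue.append(nbr)
    return tree


# Three switches wired in a triangle, so there is a loop.
links = [("s1", "s2"), ("s2", "s3"), ("s3", "s1")]
tree = spanning_tree(links, "s1")

# Links not in the tree are the ones that would be blocked.
blocked = {frozenset(link) for link in links} - tree
print(sorted(sorted(link) for link in blocked))
```

With the redundant link blocked, a flooded broadcast reaches each switch over
exactly one path.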

Scott