[Beowulf] SC|05: will 10Ge beat IB?

Patrick Geoffray patrick at myri.com
Sun Nov 27 05:31:04 PST 2005


Hi Gary,

Gary Green wrote:

> Jeff points out one of the two issues slowing down adoption of 10GigE in
> HPC.  The first bottleneck is the lack of a cheap high port count
> non-blocking 10GigE switch.

Add to that "wormhole" and "level-2 only". The Ethernet spec pretty much 
requires a store-and-forward model to implement the spanning tree, but I 
was pleasantly surprised to hear that most existing Ethernet switch 
vendors (in addition to the new kids on the block) can deliver wormhole 
capability. However, the cost of existing switches is often tied to IP 
routing functionality, so a cheap level-2-only design would go a long way.
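
To put a number on why wormhole matters, here is a back-of-the-envelope
per-hop comparison. The frame size, header size and link rate are
illustrative assumptions, not measurements of any real switch:

/* Per-hop delay: store-and-forward vs cut-through ("wormhole").
 * Illustrative numbers only. */
#include <stdio.h>

int main(void)
{
    const double link_gbps = 10.0;   /* 10GigE: 10 bits per ns          */
    const double frame_b   = 1500.0; /* full Ethernet frame, bytes      */
    const double header_b  = 64.0;   /* bytes needed before cut-through */

    /* store-and-forward: the whole frame must arrive before forwarding */
    double saf_ns = frame_b * 8.0 / link_gbps;   /* ~1200 ns per hop */
    /* cut-through: forwarding starts once the header is in */
    double ct_ns  = header_b * 8.0 / link_gbps;  /* ~51 ns per hop   */

    printf("store-and-forward: %.0f ns/hop\n", saf_ns);
    printf("cut-through:       %.0f ns/hop\n", ct_ns);
    return 0;
}

A few hops of store-and-forward on full frames already adds several
microseconds, which is in the same range as the entire end-to-end
latency of the proprietary interconnects.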

> The second issue is the lack of a broadly adopted RDMA implementation.  It
> appears that there is movement forward in this arena as well with the move
> to adopt the verbs interface from the OpenIB group.

I think the RDMA hype is finally losing steam. People realize that, at 
least for MPI, RDMA does not help. Using memory copies (i.e. not doing 
zero-copy, i.e. not doing RDMA) is faster for small/medium messages, 
which represent the vast majority of messages in HPC. Furthermore, an 
RDMA-only design comes with its share of problems (memory scalability is 
a major one) that cannot be ignored much longer.
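
To make the copy-vs-zero-copy trade-off concrete, here is a toy cost
model. The memcpy bandwidth and registration cost are rough assumed
numbers, not measurements:

/* Toy model: eager send (copy into a pre-registered bounce buffer)
 * vs zero-copy/RDMA (must pin the user buffer first). */
#include <stdio.h>

int main(void)
{
    const double copy_gbs = 2.0;  /* assumed memcpy bandwidth, GB/s */
    const double reg_us   = 10.0; /* assumed registration cost, us  */

    for (long bytes = 1024; bytes <= 256 * 1024; bytes *= 2) {
        double copy_us = bytes / (copy_gbs * 1e3); /* GB/s = 1e3 B/us */
        printf("%6ld bytes: copy %6.1f us  %s  pinning %.1f us\n",
               bytes, copy_us,
               copy_us < reg_us ? "beats   " : "loses to", reg_us);
    }
    return 0;
}

With these assumptions the copy wins below roughly 20 KB, right where
most MPI traffic lives; larger messages can amortize the pinning cost,
and a registration cache only helps when buffers are reused.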

With system call overhead below 0.3us these days, OS-bypass may 
eventually join zero-copy/RDMA in the list of features 
once-useful-but-not-so-much-anymore.
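
For anyone who wants to check that claim on their own box, a minimal
timing loop does it (Linux; syscall(SYS_getpid) dodges any libc
caching; the result varies with kernel and CPU):

/* Rough estimate of raw system-call overhead. */
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    const long iters = 1000000;
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    for (long i = 0; i < iters; i++)
        (void)syscall(SYS_getpid);   /* forces a real kernel entry */
    gettimeofday(&t1, NULL);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("%.3f us per system call\n", us / iters);
    return 0;
}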

> But as someone pointed out in a previous email, just as 10GigE will surely
> catch up to where IB is today, by that time, IB will be at the next
> generation with DDR, QDR, 12X, etc, etc, etc...

With DDR, you cut your copper cable length in half, and even more with 
QDR. How will you connect hundreds of nodes with 5 m fire-hose cables? 
Today's fibers can carry 10 Gb/s of data (12.5 Gb/s signal). You can push 
40 Gb/s of data with very expensive optics, but that makes sense only for 
inter-switch links. Furthermore, which IO bus would you use? With a 
PCI-Express 8x slot, you can barely saturate a 10 Gb/s data link (for 
the curious, PCI-Express efficiency is not great: the maximum payload 
per transaction is usually 64 bytes).
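
The arithmetic behind that claim, using assumed-but-typical PCI-Express
1.x numbers (2.5 GT/s lanes, 8b/10b encoding, 64-byte payload plus
about 24 bytes of TLP overhead):

/* Rough PCI-Express x8 efficiency estimate. Illustrative numbers. */
#include <stdio.h>

int main(void)
{
    const int    lanes       = 8;
    const double gt_per_lane = 2.5;        /* GT/s per lane           */
    const double encoding    = 8.0 / 10.0; /* 8b/10b line coding      */
    const double payload     = 64.0;       /* bytes per TLP (typical) */
    const double overhead    = 24.0;       /* header+seq+LCRC+framing */

    double link_gbps = lanes * gt_per_lane * encoding;  /* 16 Gb/s    */
    double eff       = payload / (payload + overhead);  /* ~73%       */
    double data_gbps = link_gbps * eff;                 /* ~11.6 Gb/s */

    printf("after encoding %.1f Gb/s, efficiency %.0f%%, usable %.1f Gb/s\n",
           link_gbps, eff * 100.0, data_gbps);
    return 0;
}

And that still ignores read completions and flow-control traffic, so
the usable number is lower in practice.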

The only way to feed that much data will be to sit on the motherboard, 
through an HT connection or directly on the memory bus. That is not 
commodity anymore, and nobody makes money in the custom-motherboard 
market. And in the end, you will realize that the extra bandwidth buys 
you little.

> There is also the question as to whether GigE will be able to demonstrate
> the ultra low latencies seen in high performance interconnects such as
> Quadrics, InfiniBand and Myrinet.  In the end, there will most likely remain
> a market for high performance interconnects for high end applications.

If Ethernet switch latency comes down to the 200 ns range, pure Ethernet 
will be in the same latency range as everything else.

Patrick
-- 
Patrick Geoffray
Myricom, Inc.
http://www.myri.com


