[Beowulf] Performance characterising a HPC application
patrick at myri.com
Wed Mar 28 22:27:13 PDT 2007
Gilad Shainer wrote:
> So now we can discuss technical terms and not marketing terms such
> as price/performance. InfiniBand uses 10Gb/s and 20Gb/s link signaling
> rate. The coding of the data into the link signaling is 8/10. When
> someone refer to 10 and 20Gb/s, it is for the link speed and there
> is nothing confusing here - this is InfiniBand specification (and a
> standard if I may say).
You don't have a 1.25 Gb/s Ethernet on your laptop, do you? Gigabit
Ethernet signal rate is 1.25 Gb/s per the standard, but data rate is 1
Gb/s after 8b/10b encoding and that's why everybody calls it Gigabit
Ethernet. Same thing with 10 GigE (it's 12.5 Gb/s signal rate) and with
Myrinet 2G (2.5 Gb/s signal rate) or 10G (12.5 Gb/s signal rate), both
according to their respective standard.
Similarly, my laptop has 1 GB of memory, not 1.125 GB of parity memory.
By using signaling rate instead of data rate, you are going against all
conventions in networking. There is no technical basis for that choice,
except that 10 is bigger than 8.
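The 8b/10b arithmetic above can be written out as a quick sanity check. A minimal sketch; the signal rates are the ones quoted in this thread, and the names are just labels:

```python
# 8b/10b encoding carries 8 data bits in every 10 signal bits,
# so data rate = signal rate * 8/10.
def data_rate(signal_gbps):
    return signal_gbps * 8 / 10

links = {
    "Gigabit Ethernet": 1.25,   # signal rate per the standard
    "10 GigE":          12.5,
    "Myrinet 2G":       2.5,
    "Myrinet 10G":      12.5,
    "IB SDR":           10.0,
    "IB DDR":           20.0,
}
for name, sig in links.items():
    print(f"{name}: {sig} Gb/s signal -> {data_rate(sig)} Gb/s data")
```

This is exactly why GigE is called "Gigabit": 1.25 Gb/s of signaling carries 1 Gb/s of data, and likewise IB SDR/DDR carry 8 and 16 Gb/s of data.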
> The PCIe specification is exactly the same. Same link speed and same
> 8/10 data encoding. When you say 13.7Gb/s you confuse between the
> specification and the MTU (data size) that some of the chipsets support.
I just pointed out that your claim that bigger pipes give better
application performance has to be adjusted for the effective
throughput available to the application. IB SDR actually has less
effective bandwidth than 10 GigE, and IB DDR cannot go more than 1.3x
faster today.
> For chipsets that support MTU > 128B, your calculation is wrong and the
> data throughput is higher.
Please, name one PCI Express chipset that implements Read Completions
larger than 128 bytes. Do not confuse this with the Max Payload Size,
which applies to Write operations. Read Completions are as large as the
transaction size on the memory bus (and that makes a lot of sense if you
think about it). Intel chipsets can do PCIe combining to reach 128
bytes, and they could in theory combine into a larger buffer. However,
nobody does that today.
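The effect of the 128-byte completion limit on read bandwidth can be sketched numerically. This assumes (not from the text) roughly 20 bytes of TLP overhead per completion (3-DW header plus framing, sequence number and LCRC); the exact overhead varies, so treat the numbers as illustrative:

```python
# Effective PCIe read bandwidth: payload / (payload + per-TLP overhead),
# with an ASSUMED overhead of ~20 bytes per Read Completion.
def effective_read_gbps(link_gbps, completion_bytes, overhead_bytes=20):
    return link_gbps * completion_bytes / (completion_bytes + overhead_bytes)

# PCIe 1.x x8: 8 lanes * 2.5 Gb/s signal * 8/10 encoding = 16 Gb/s of data
link = 8 * 2.5 * 8 / 10
print(round(effective_read_gbps(link, 128), 1))   # ~13.8 Gb/s with 128 B completions
print(round(effective_read_gbps(link, 256), 1))   # larger completions would help
```

With these assumptions, 128-byte completions land close to the 13.7 Gb/s figure discussed above, well below the nominal 16 Gb/s of an x8 link.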
With PCIe 2.0 doubling the bandwidth, you will be able to say that
IB 16 Gb/s is twice as fast as IB 8 Gb/s for applications, but not today.
> What is also interesting to know, is when one uses InfiniBand 20Gb/s
> Can fully utilized the PCIe x8 link, while in your case, Myricom I/O
> interface is the bottleneck.
If you have a look at the following web page, you will see the effective
bandwidth supported by a large variety of PCI Express chipsets:
This is a pure PCI Express DMA measurement, no network involved. You
will see that some chipsets do not even sustain 10 Gb/s in the Read
direction, and most just barely sustain 20 Gb/s bidirectionally. On many
motherboards, 10G *does* saturate the PCIe x8 link.
> saw from 3 non-bias parties. In all the application benchmarks, Myrinet
> 2G shows poor performance comparing to 10 and 20Gb/s.
> As for the registration cache comment, I would go back to the "famous"
> RDMA paper and the proper responds from IBM and others. The answer
> to this comment is fully described in those responses.
I strongly advise you to read the related posts by Christian at Qlogic
and learn from them. He gets it; you don't. The IBM guy has never
programmed RDMA interconnects or dealt with memory registration (I am
not sure he has ever programmed anything), and apparently neither have you.
Here, try this simple benchmark: pingpong incrementing the send and
receive buffer at each iteration. Tell me if IB beats GigE for, say,
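The point of that benchmark can be shown with a toy model (not real MPI code): shifting the buffer address every iteration means the registration cache never hits, so every message pays the full registration cost again. The costs below are illustrative assumptions, in microseconds:

```python
REG_COST_US = 50.0    # ASSUMED cost to register a buffer with the NIC
SEND_COST_US = 5.0    # ASSUMED per-message wire cost

def pingpong_cost(iterations, shift_buffer):
    registered = set()   # the registration cache: addresses seen so far
    total = 0.0
    addr = 0
    for _ in range(iterations):
        if addr not in registered:   # cache miss: pay registration
            registered.add(addr)
            total += REG_COST_US
        total += SEND_COST_US
        if shift_buffer:
            addr += 1                # new address defeats the cache
    return total

print(pingpong_cost(1000, shift_buffer=False))  # registers once, then all hits
print(pingpong_cost(1000, shift_buffer=True))   # re-registers every iteration
```

With a fixed buffer the cache pays registration once; with a moving buffer it pays it 1000 times, which is the behavior this pingpong variant exposes.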
>> Similarly, on many applications I have checked, Qlogic IB SDR
>> has better performance than Mellanox IB DDR, despite having a
>> smaller pipe (and despite Mellanox claiming the contrary).
> Are you selling Myricom HW or Qlogic HW?
I don't do hardware or sales, believe it or not. I am a software guy.
You tell me that a bigger pipe is better, I reply by example that it's
not about the size of the pipe, it's how you use it. Qlogic have a
smaller one, but they don't mind.
> and not only on pure latency or pure bandwidth. Qlogic till recently (*)
> had the lowest latency number but when it comes to application, the CPU
> overhead is too high. Check some papers on Cluster to see the
You really do not understand MPI implementations. Qlogic's send and
receive overhead is a problem for large messages, but small and medium
messages are much more important for MPI applications. For those message
sizes, it is actually faster to copy on both sides into pre-registered
buffers than to do a rendezvous for zero-copy. What is the difference
between PIO on the send side and a copy on the receive side (what Qlogic
does), versus a copy on both sides?
Qlogic's design does cut corners for large messages, but the tradeoff is
that it keeps the design simple (thus easier to implement) without
hurting application performance too much.
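That eager-copy vs rendezvous tradeoff can be modeled roughly. All constants here are assumptions for illustration (memcpy bandwidth, handshake round-trip, 10 Gb/s wire), not measured numbers:

```python
COPY_BW = 2e9          # ASSUMED memcpy bandwidth, bytes/s
RTT = 10e-6            # ASSUMED rendezvous handshake round-trip, seconds
WIRE_BW = 1.25e9       # ASSUMED 10 Gb/s wire, bytes/s

def eager_us(size):
    # copy on both sides + wire time
    return (2 * size / COPY_BW + size / WIRE_BW) * 1e6

def rendezvous_us(size):
    # handshake round-trip + zero-copy wire time
    return (RTT + size / WIRE_BW) * 1e6

for size in (1024, 16 * 1024, 64 * 1024):
    print(size, round(eager_us(size), 1), round(rendezvous_us(size), 1))
```

With these assumptions the crossover sits around 10 KB: below it, copying into pre-registered buffers wins; above it, the rendezvous handshake amortizes and zero-copy wins. That is why cutting corners on large messages costs less than it sounds for typical MPI traffic.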
Don't get me wrong: Qlogic is my competitor too, and sometimes I
savagely want to cut Greg's hair when he is wrong, but they (and
definitely Quadrics) mostly know what they are doing.