[Beowulf] Performance characterising a HPC application
patrick at myri.com
Wed Mar 28 22:27:13 PDT 2007
Gilad Shainer wrote:
> So now we can discuss technical terms and not marketing terms such
> as price/performance. InfiniBand uses 10Gb/s and 20Gb/s link signaling
> rate. The coding of the data into the link signaling is 8/10. When
> someone refer to 10 and 20Gb/s, it is for the link speed and there
> is nothing confusing here - this is InfiniBand specification (and a
> standard if I may say).
You don't have a 1.25 Gb/s Ethernet on your laptop, do you? Gigabit
Ethernet signal rate is 1.25 Gb/s per the standard, but data rate is 1
Gb/s after 8b/10b encoding and that's why everybody calls it Gigabit
Ethernet. Same thing with 10 GigE (it's 12.5 Gb/s signal rate) and with
Myrinet 2G (2.5 Gb/s signal rate) or 10G (12.5 Gb/s signal rate), both
according to their respective standard.
Similarly, my laptop has 1 GB of memory, not 1.125 GB of parity memory.
By using signaling rate instead of data rate, you are going against all
conventions in networking. There is no technical basis for that choice,
except that 10 is bigger than 8.
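The 8b/10b arithmetic above can be written out as a quick sanity check. A minimal sketch; the signal rates are the ones quoted in this thread, and the names are just labels:

```python
# 8b/10b encoding carries 8 data bits in every 10 signal bits,
# so data rate = signal rate * 8/10.
def data_rate(signal_gbps):
    return signal_gbps * 8 / 10

links = {
    "Gigabit Ethernet": 1.25,   # signal rate per the standard
    "10 GigE":          12.5,
    "Myrinet 2G":       2.5,
    "Myrinet 10G":      12.5,
    "IB SDR":           10.0,
    "IB DDR":           20.0,
}
for name, sig in links.items():
    print(f"{name}: {sig} Gb/s signal -> {data_rate(sig)} Gb/s data")
```

This is exactly why GigE is called "Gigabit": 1.25 Gb/s of signaling carries 1 Gb/s of data, and likewise IB SDR/DDR carry 8 and 16 Gb/s of data.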
> The PCIe specification is exactly the same. Same link speed and same
> 8/10 data encoding. When you say 13.7Gb/s you confuse between the
> specification and the MTU (data size) that some of the chipsets support.
I just pointed out that your claim that bigger pipes give better
application performance has to be adjusted for the effective
throughput available to the application. IB SDR actually has less
effective bandwidth than 10 GigE, and IB DDR cannot go more than 1.3x
faster today.
> For chipsets that support MTU > 128B, your calculation is wrong and the
> data throughput is higher.
Please, name one PCI Express chipset that implements Read Completions
larger than 128 bytes. Do not confuse this with the Max Payload Size,
which applies to Write operations. Read Completions are as large as the
transaction size on the memory bus (and that makes a lot of sense if you
think about it). Intel chipsets can do PCIe combining to reach 128
bytes, and they could in theory combine into a larger buffer. However,
nobody does that today.
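The effect of the 128-byte completion limit on read bandwidth can be sketched numerically. This assumes (not from the text) roughly 20 bytes of TLP overhead per completion (3-DW header plus framing, sequence number and LCRC); the exact overhead varies, so treat the numbers as illustrative:

```python
# Effective PCIe read bandwidth: payload / (payload + per-TLP overhead),
# with an ASSUMED overhead of ~20 bytes per Read Completion.
def effective_read_gbps(link_gbps, completion_bytes, overhead_bytes=20):
    return link_gbps * completion_bytes / (completion_bytes + overhead_bytes)

# PCIe 1.x x8: 8 lanes * 2.5 Gb/s signal * 8/10 encoding = 16 Gb/s of data
link = 8 * 2.5 * 8 / 10
print(round(effective_read_gbps(link, 128), 1))   # ~13.8 Gb/s with 128 B completions
print(round(effective_read_gbps(link, 256), 1))   # larger completions would help
```

With these assumptions, 128-byte completions land close to the 13.7 Gb/s figure discussed above, well below the nominal 16 Gb/s of an x8 link.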
With PCIe 2.0 doubling the bandwidth, you will be able to say that
IB 16 Gb/s is twice as fast as IB 8 Gb/s for applications, but not today.
> What is also interesting to know, is when one uses InfiniBand 20Gb/s
> Can fully utilized the PCIe x8 link, while in your case, Myricom I/O
> interface is the bottleneck.
If you have a look at the following web page, you will see the effective
bandwidth supported by a large variety of PCI Express chipsets:
This is a pure PCI Express DMA measurement, no network involved. You
will see that some chipsets do not even sustain 10 Gb/s in the Read
direction, and most just barely sustain 20 Gb/s bidirectionally. On many
motherboards, 10G *does* saturate the PCIe x8 link.
> saw from 3 non-bias parties. In all the application benchmarks, Myrinet
> 2G shows poor performance comparing to 10 and 20Gb/s.
> As for the registration cache comment, I would go back to the "famous"
> RDMA paper and the proper responds from IBM and others. The answer
> to this comment is fully described in those responses.
I strongly advise you to read the related posts by Christian at Qlogic
and learn from them. He gets it; you don't. The IBM guy has never
programmed RDMA interconnects or dealt with memory registration (I am
not sure he has ever programmed anything), and apparently neither have you.
Here, try this simple benchmark: pingpong incrementing the send and
receive buffer at each iteration. Tell me if IB beats GigE for, say,
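The point of that benchmark can be shown with a toy model (not real MPI code): shifting the buffer address every iteration means the registration cache never hits, so every message pays the full registration cost again. The costs below are illustrative assumptions, in microseconds:

```python
REG_COST_US = 50.0    # ASSUMED cost to register a buffer with the NIC
SEND_COST_US = 5.0    # ASSUMED per-message wire cost

def pingpong_cost(iterations, shift_buffer):
    registered = set()   # the registration cache: addresses seen so far
    total = 0.0
    addr = 0
    for _ in range(iterations):
        if addr not in registered:   # cache miss: pay registration
            registered.add(addr)
            total += REG_COST_US
        total += SEND_COST_US
        if shift_buffer:
            addr += 1                # new address defeats the cache
    return total

print(pingpong_cost(1000, shift_buffer=False))  # registers once, then all hits
print(pingpong_cost(1000, shift_buffer=True))   # re-registers every iteration
```

With a fixed buffer the cache pays registration once; with a moving buffer it pays it 1000 times, which is the behavior this pingpong variant exposes.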
>> Similarly, on many applications I have checked, Qlogic IB SDR
>> has better performance than Mellanox IB DDR, despite having a
>> smaller pipe (and despite Mellanox claiming the contrary).
> Are you selling Myricom HW or Qlogic HW?
I don't do hardware or sales, believe it or not. I am a software guy.
You tell me that a bigger pipe is better, I reply by example that it's
not about the size of the pipe, it's how you use it. Qlogic have a
smaller one, but they don't mind.
> and not only on pure latency or pure bandwidth. Qlogic till recently (*)
> had the lowest latency number but when it comes to application, the CPU
> overhead is too high. Check some papers on Cluster to see the
You really do not understand MPI implementations. Qlogic's send and
receive overhead is a problem for large messages, but small and medium
messages are much more important for MPI applications. For those message
sizes, it is actually faster to copy on both sides into pre-registered
buffers than to do a rendezvous for zero-copy. What is the difference
between PIO on the send side and a copy on the receive side (what Qlogic
does), versus a copy on both sides?
Qlogic's design does cut corners for large messages, but the tradeoff is
that it keeps the design simple (thus easier to implement) without
hurting application performance too much.
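That eager-copy vs rendezvous tradeoff can be modeled roughly. All constants here are assumptions for illustration (memcpy bandwidth, handshake round-trip, 10 Gb/s wire), not measured numbers:

```python
COPY_BW = 2e9          # ASSUMED memcpy bandwidth, bytes/s
RTT = 10e-6            # ASSUMED rendezvous handshake round-trip, seconds
WIRE_BW = 1.25e9       # ASSUMED 10 Gb/s wire, bytes/s

def eager_us(size):
    # copy on both sides + wire time
    return (2 * size / COPY_BW + size / WIRE_BW) * 1e6

def rendezvous_us(size):
    # handshake round-trip + zero-copy wire time
    return (RTT + size / WIRE_BW) * 1e6

for size in (1024, 16 * 1024, 64 * 1024):
    print(size, round(eager_us(size), 1), round(rendezvous_us(size), 1))
```

With these assumptions the crossover sits around 10 KB: below it, copying into pre-registered buffers wins; above it, the rendezvous handshake amortizes and zero-copy wins. That is why cutting corners on large messages costs less than it sounds for typical MPI traffic.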
Don't get me wrong: Qlogic is my competitor too, and sometimes I
savagely want to cut Greg's hair when he is wrong, but they (and
definitely Quadrics) mostly know what they are doing.