[Beowulf] Performance characterising a HPC application
Shainer at mellanox.com
Mon Mar 26 10:04:13 PDT 2007
> Offload, usually implemented by RDMA offload, or the ability
> for a NIC to autonomously send and/or receive data from/to
> memory is certainly a nice feature to tout. If one considers
> RDMA at an interface level (without looking at the
> registration calls required on some interconnects), it's the
> purest and most flexible form of interconnect data transfer.
> Unfortunately, this pure form of data transfer has a few caveats...
When Mellanox refers to transport offload, it means full transport
offload - for all transport semantics. InfiniBand, as you probably
know, provides RDMA AND Send/Receive semantics, and in both cases
you can do Zero-copy operations.
This full flexibility gives the programmer the ability to choose the
best semantics for his use. Some programmers choose Send/Receive and
some RDMA. It all depends on their application.
From your response, I see that Qlogic does not provide this kind
> How the programming model can match up with the semantics of
> RDMA is the real question. A quick sampling suggests that
> global-address space languages fit squarely on top of RDMA,
> whereas MPI-2 almost does if less of its windowing complexity
> is considered. MPI-1, the most popular model out there, has
> the least in common with RDMA offload. Under its simplest
> form in MPI implementations, RDMA can be used for half of the
> communication protocol involved in large messages. In its
> complex form, it can be used to handle small to medium-sized
> messages as shown by a few openib/iwarp MPI implementations
> (although these implementations really implement a complex
> assortment of hybrid RDMA and non-RDMA mechanisms to provide
> scalable performance).
> RDMA offload, depending on the complexity of its
> implementation, can buy you little to lots of communication
> offload (or "total" communication offload in Quadrics' case). But RDMA
> implementations aside, you can only offload what the
> programming model *and* the programmer will let you.
> Programmers must understand data dependencies in their codes
> and know where and how to separate communication initiation
> and completion points. Even well intentioned programmers can
> fail to expose their apps for communication offload --
> complex legacy apps can be intimidating to modify, some apps
> may have strong data dependencies and others may be dominated
> by collectives which are themselves indivisible (i.e.
> blocking). And finally, a programmer who can successfully
> overcome all these hurdles cannot expect to be provided
> with an equal level of overlap on all interconnects.
> There's a good reason that many programmers continue to find
> refuge in simple offload-less primitives like Send/Recv: the
> expectation that it's in the interest of every MPI and
> interconnect vendor to provide the best Send/Recv possible.
> Many competent programmers will reap definite benefits from
> highly specialized implementations of RDMA offload. But then
> again, these programmers will also know how to analyse their
> applications and may come to completely different
> conclusions. For example, they may come to realise that most
> of their codes cannot fully benefit from offload and that the
> interconnect that spends the least time in specific MPI
> primitives is the best choice -- hardware-assisted
> operations, pt-to-pt midsize message performance or
> consistent cluster-wide message latency, etc. Understanding
> the expected performance of specific communication primitives
> is an application-centric view of performance evaluation.
> Assuming that more cores necessarily require fatter pipes,
> pt-to-pt latency measurements, signaling rates, messaging
> rates, etc. are all microbenchmark-centric views of
> interconnect evaluation. Picking on the latter is just too
> simplistic and rarely translates into a general and
> verifiable view of the world, but it's good fodder for
> oneupmanship and insipid (but entertaining) inter-vendor bickering.
> RDMA offload is attractive for many other reasons but in the
> context of today's most popular programming model it isn't as
> vital as one would like. It's reasonable conventional wisdom
> that offload is a desirable feature, but the way programming
> models have been moving (i.e. not moving), interconnects that
> do not offer elaborate communication offload mechanisms are
> not at a loss, far from it.
> Efficiently exploiting a low-level RDMA engine for the
> purposes of message passing would mean enabling its pure data
> transfer capability to percolate through the many levels of
> software stack and programming model semantics mostly
> unscathed. This is an unrealistic expectation.
> I've yet to see a significant number of message-passing
> applications show that an RDMA offload engine, as opposed to
> any other messaging engine, is a stronger performance
> determinant. That's probably because there are other equally
> important and desirable features implemented in other
> messaging engines.
> . . christian
> christian.bell at qlogic.com
> (QLogic SIG, formerly Pathscale)