[Beowulf] Performance characterising a HPC application

Fri Mar 23 16:16:17 PDT 2007

On Fri, 23 Mar 2007, Gilad Shainer wrote:

> Are you selling Myricom HW or Qlogic HW? 

Based on what I know, I think it's perfectly reasonable for Patrick
to expect that one messaging technology can outdo another for reasons
other than higher signaling rates.

> In general, application performance depends on the interconnect
> architecture and not only on pure latency or pure bandwidth.
> Qlogic till recently (*) had the lowest latency number but when it
> comes to application, the CPU overhead is too high. Check some

High CPU overhead, as opposed to the lower overhead expected from
offloading, is a very crude characterisation of interconnect
performance.  Plus, if I remember correctly, this is not true for
many important message sizes and communication patterns.

Offload, usually implemented as RDMA offload (the ability of a
NIC to autonomously send and/or receive data from/to memory), is
certainly a nice feature to tout.  If one considers RDMA at an
interface level (without looking at the registration calls required
on some interconnects), it's the purest and most flexible form of
interconnect data transfer.  Unfortunately, this pure form of data
transfer has a few caveats...

How the programming model can match up with the semantics of RDMA is
the real question.  A quick sampling suggests that global-address
space languages fit squarely on top of RDMA, whereas MPI-2 almost
does if one sets aside some of its windowing complexity.  MPI-1, the
most popular model out there, has the least in common with RDMA
offload.  In its simplest form in MPI implementations, RDMA can be
used for half of the communication protocol involved in large
messages.  In its complex form, it can be used to handle small to
medium-sized messages as shown by a few openib/iwarp MPI
implementations (although these implementations really implement a
complex assortment of hybrid RDMA and non-RDMA mechanisms to provide
scalable performance).
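
As a rough illustration of the "half of the protocol" remark, here is a
minimal Python sketch of how a two-sided send is often split into an
eager path and a rendezvous path.  The 16 KiB threshold, the function
names and the step descriptions are assumptions made up for
illustration and are not taken from any particular MPI implementation.
In this toy model only the bulk data movement of the large-message path
maps onto RDMA; the matching handshake does not.

```python
# Toy model of eager vs. rendezvous message delivery (illustrative only).
EAGER_LIMIT_BYTES = 16 * 1024  # hypothetical threshold, not from a real MPI


def send_steps(nbytes, eager_limit=EAGER_LIMIT_BYTES):
    """Return (step description, uses_rdma) pairs for one two-sided send."""
    if nbytes <= eager_limit:
        # Small message: the payload travels with the envelope and is copied
        # into the user buffer once the receiver has matched it.
        return [
            ("sender: send envelope + payload", False),
            ("receiver: match envelope, copy payload to user buffer", False),
        ]
    # Large message: handshake first, then the bulk transfer.
    return [
        ("sender: send request-to-send (envelope only)", False),
        ("receiver: match envelope, reply with buffer address", False),
        ("sender: RDMA write of payload into receiver buffer", True),
        ("sender: notify receiver of completion", False),
    ]


def rdma_steps(nbytes):
    """Count how many protocol steps the RDMA engine can handle."""
    return sum(1 for _, uses_rdma in send_steps(nbytes) if uses_rdma)


if __name__ == "__main__":
    for size in (512, 1 << 20):
        steps = send_steps(size)
        print(size, "bytes:", rdma_steps(size), "of", len(steps), "steps use RDMA")
```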

RDMA offload, depending on the complexity of its implementation, can
buy you little to lots of communication offload (or "total"
communication offload in Quadrics' case).  But RDMA implementations
aside, you can only offload what the programming model *and* the
programmer will let you.  Programmers must understand data
dependencies in their codes and know where and how to separate
communication initiation and completion points.  Even
well-intentioned programmers can fail to expose their apps to
communication offload -- complex legacy apps can be intimidating to
modify, some apps may have strong data dependencies, and others may be
dominated by collectives which are themselves indivisible (i.e.
blocking).  And finally, a programmer who successfully overcomes
all these hurdles cannot expect to be provided with an equal
level of overlap on all interconnects.
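
To put the argument in numbers, here is a back-of-the-envelope model in
Python.  It is only a sketch: the function name, its parameters and
the linear cost model are my own assumptions, not something measured on
any interconnect.  It captures the idea that the communication time
you can hide is bounded both by what the hardware progresses on its
own and by how much independent computation the programmer manages to
place between initiation and completion.

```python
def iteration_time(compute, comm, independent_fraction, offload_fraction):
    """Estimated time of one compute + communicate iteration.

    compute, comm: time spent computing / communicating (same unit).
    independent_fraction: share of compute that does not depend on the
        data in flight (what the programmer exposes), in [0, 1].
    offload_fraction: share of comm the interconnect progresses without
        the host CPU (what the hardware offers), in [0, 1].
    """
    for name, frac in (("independent_fraction", independent_fraction),
                       ("offload_fraction", offload_fraction)):
        if not 0.0 <= frac <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")

    hideable_comm = comm * offload_fraction   # limited by the hardware
    cover = compute * independent_fraction    # limited by the application
    hidden = min(hideable_comm, cover)        # overlap needs both
    return compute + comm - hidden


if __name__ == "__main__":
    # Hypothetical numbers: 10 units of compute, 4 units of communication.
    for indep, off in ((0.0, 1.0), (1.0, 1.0), (1.0, 0.5), (0.1, 1.0)):
        print(indep, off, iteration_time(10, 4, indep, off))
```

In this model full hardware offload buys nothing when the code exposes
no independent work, and a perfectly restructured code still pays for
whatever the interconnect cannot progress by itself.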

There's a good reason that many programmers continue to find refuge
in simple offload-less primitives like Send/Recv: the expectation
that it's in the interest of every MPI and interconnect vendor to
provide the best Send/Recv possible.

Many competent programmers will reap definite benefits from highly
specialized implementations of RDMA offload.  But then again, these
programmers will also know how to analyse their applications and may
come to completely different conclusions.  For example, they may come
to realise that most of their codes cannot fully benefit from offload
and that the interconnect that spends the least time in specific MPI
primitives is the best choice, whether that means hardware-assisted
operations, pt-to-pt midsize message performance, consistent
cluster-wide message latency, etc.  Understanding the expected performance of
specific communication primitives is an application-centric view of
performance evaluation.  Assuming that more cores necessarily require
fatter pipes, and leaning on pt-to-pt latency measurements, signaling
rates, messaging rates, etc., all belong to a microbenchmark-centric
view of interconnect evaluation.  Relying on the latter is just too
simplistic and rarely translates into a general and verifiable view
of the world, but it's good fodder for one-upmanship and insipid (but
entertaining) inter-vendor bickering.
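
The application-centric view can be sketched in a few lines of Python:
weight the cost of each communication primitive by how often the
application actually calls it.  Every number below is invented for
illustration, as are the names "interconnect_a" and "interconnect_b";
none of it is a measurement of real hardware.  The only point is that
the interconnect with the better small-message figure need not come out
ahead once the application's own mix of primitives is taken into
account.

```python
# Hypothetical per-call costs in microseconds (NOT real measurements).
COSTS_US = {
    "interconnect_a": {"small_send": 1.5, "midsize_send": 12.0, "allreduce": 30.0},
    "interconnect_b": {"small_send": 2.5, "midsize_send": 8.0, "allreduce": 18.0},
}

# Hypothetical application profile: calls per timestep for each primitive.
PROFILE = {"small_send": 100, "midsize_send": 400, "allreduce": 50}


def comm_time_us(costs, profile):
    """Communication time per timestep: per-call cost weighted by call count."""
    return sum(costs[primitive] * calls for primitive, calls in profile.items())


def rank_interconnects(all_costs, profile):
    """Interconnect names ordered from least to most communication time."""
    return sorted(all_costs, key=lambda name: comm_time_us(all_costs[name], profile))


if __name__ == "__main__":
    for name in rank_interconnects(COSTS_US, PROFILE):
        print(name, comm_time_us(COSTS_US[name], PROFILE), "us per timestep")
```

With these made-up inputs, "interconnect_a" wins the small-message
microbenchmark but loses on the application profile, which is dominated
by midsize sends and collectives.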

RDMA offload is attractive for many other reasons, but in the context
of today's most popular programming model it isn't as vital as one
would like.  It's reasonable conventional wisdom that offload is a
desirable feature, but the way programming models have been moving
(i.e. not moving), interconnects that do not offer elaborate
communication offload mechanisms are not at a loss, far from it.
Efficiently exploiting a low-level RDMA engine for the purposes of
message passing would mean enabling its pure data transfer capability
to percolate through the many levels of the software stack and
programming model semantics mostly unscathed.  This is an unrealistic
expectation.

I've yet to see a significant number of message-passing applications
show that an RDMA offload engine, as opposed to any other messaging
engine, is a stronger performance determinant.  That's probably
because there are other equally important and desirable features
implemented in other messaging engines.

cheers,

    . . christian

-- 
christian.bell at qlogic.com
(QLogic SIG, formerly Pathscale)