[Beowulf] Performance characterising a HPC application
patrick at myri.com
Fri Mar 23 00:23:32 PDT 2007
Greg Lindahl wrote:
> Compare the latency numbers in HPC Challenge to the 2-node ping-pong
> latency reported by vendors. For some vendors, it's the same number.
> For others, the latency from using all the nodes is much, much higher.
The ring test in HPCC is rather poorly implemented: only 3 iterations to
measure something of the same order of magnitude as the resolution of
MPI_Wtime(). Someone just failed Benchmarking 101. If you replace a
gettimeofday() implementation of MPI_Wtime() with a cycle-counter one,
then the numbers change quite a bit.
However, I agree with you, this is the right way to measure the
sensitivity to concurrent traffic.
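A rough sketch of the arithmetic behind the timing complaint above, not the HPCC code itself. It assumes a timed block of N iterations is off by about one timer tick, so the relative error is resolution / (N * latency). The 5 us latency and 1 us resolution in the usage lines are made-up illustration values, and all function names are hypothetical.

```python
import time

def timer_resolution(clock, samples=200):
    """Smallest nonzero step observed between two readings of `clock`."""
    smallest = float("inf")
    for _ in range(samples):
        t0 = clock()
        t1 = clock()
        while t1 == t0:          # spin until the clock visibly advances
            t1 = clock()
        smallest = min(smallest, t1 - t0)
    return smallest

def relative_error_percent(event_ns, resolution_ns, iterations):
    """Error (in %) from a one-tick uncertainty on a block of `iterations` events."""
    return 100.0 * resolution_ns / (iterations * event_ns)

def iterations_needed(event_ns, resolution_ns, max_error_percent):
    """Fewest iterations per timed block that keep a one-tick error under the bound."""
    numerator = 100 * resolution_ns
    denominator = max_error_percent * event_ns
    return max(1, -(-numerator // denominator))   # integer ceiling division

if __name__ == "__main__":
    # Assumed values: 5 us per event, 1 us timer resolution.
    print(relative_error_percent(5000, 1000, 3))   # about 6.7 (%) with only 3 iterations
    print(iterations_needed(5000, 1000, 1))        # 20 iterations for a 1% bound
    # Observed step of two Python clocks on this machine (varies by platform).
    print(timer_resolution(time.time), timer_resolution(time.perf_counter))
```

Here time.time and time.perf_counter only stand in for a coarse wall clock and a finer counter; they are not the MPI_Wtime() implementations discussed above.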
> Note that the new MVAPICH has message coalescing, which causes its
It is unbelievable that so few people denounce it. It is clearly
implemented only to cheat on a micro-benchmark. What's next? Checking
that the buffer to send is identical to the previous one to avoid
sending "redundant" messages in ping-pong?!?
> message each to lots of other nodes before synchronizing. Message rate
> benchmarks like "base" HPCC Gups get no benefit from message
HPCC Gups already does some sort of coalescing. If updates are going to
the same process, then they are put in the same bucket. The size of the
messages depends on the number of updates in the buckets, so a smaller
number of nodes means bigger messages. I don't understand why they would
do that; it defeats the goal of scalability testing.
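A small simulation of the bucketing effect described above, under stated assumptions: a fixed batch of 1024 updates, 8 bytes per update, and uniformly random destinations. These values and the function names are illustrative and not taken from the HPCC source.

```python
import random
from collections import Counter

def bucket_sizes(num_updates, num_procs, seed=0):
    """Count how many updates of one batch land in each destination bucket."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(num_procs) for _ in range(num_updates))
    return [counts.get(p, 0) for p in range(num_procs)]

def mean_message_bytes(num_updates, num_procs, bytes_per_update=8):
    """Average size of the messages actually sent (empty buckets send nothing)."""
    nonempty = [s for s in bucket_sizes(num_updates, num_procs) if s > 0]
    return bytes_per_update * sum(nonempty) / len(nonempty)

if __name__ == "__main__":
    # Same batch of updates, increasing number of destination processes.
    for procs in (4, 16, 64, 256):
        print(procs, mean_message_bytes(1024, procs))
```

With the batch size held constant, the average message shrinks roughly as 1/P, so a small run exercises large messages and a large run exercises small ones.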
> HPC Challenge is much better than what has come before, but it too can
I think HPCC is something of a regression compared to, for example, the
NAS benchmarks. The communication benchmarks are too analytic, not
functional enough.
> intra-node. And guess what? HPCC results are hard to come by, even though
> it's pretty easy to run.
And HPCC is a pain in the bottom to compile and run. HPL is not really a
shining example of a straightforward build process or config-less
operation, so why build HPCC on top of it? Is autoconf still too
bleeding edge these days? Argh! And what about the three dozen
parameters in the config file?!? It's just insane.
I like the NAS benchmarks. You can run each of them independently; you
only choose the problem size and the number of processes. Easy to run,
easy to compare. Pallas is nice too, anybody can run it.
> Trust me, I'd love to see microbenchmarks which attack the real issues
> that speed up applications. But usually they miss the mark, and my
> attempt to create a new one (message rate) is now destroyed by message
> coalescing. I should have used an N-node benchmark instead.
If you want to show the impact of concurrent communications, something
latency-based like the HPCC ring test is the best way (possibly with
more nodes). The millions of packets per second of a stream-based
benchmark are lovely for the marketing folks, but have little meaning for
real codes that do at least a minimum of computation. However, an alltoall on many
cores/nodes would exercise the same metric (many sends/recvs on the same
NIC at the same time), but would be harder to cheat and be much more