[Beowulf] Performance characterising a HPC application

Fri Mar 23 09:53:19 PDT 2007

On Fri, 2007-03-23 at 03:23 -0400, Patrick Geoffray wrote:
> It is unbelievable that so few people denounce it. It is clearly 
> implemented only to cheat on a micro-benchmark. What's next ? Checking 
> that the buffer to send is identical to the previous one to avoid 
> sending "redundant" messages in ping-pong ?!?

Far better to check if the buffer you are sending is just many COW
copies of the kernel zero page and, if it is, get the receiver to
mmap() /dev/zero over the recv buffer.  Every benchmark should
initialise transmitted data before it is sent, if only to prevent page
faults inside the timing loop.  We don't do this, of course, but we
often remark that with a lot of benchmarks we could get fairly large
bandwidth numbers if we did.
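
As a minimal sketch of that initialisation step (plain C, no MPI; the
helper names alloc_touched_buffer and buffer_is_nonzero are made up for
illustration and not taken from any benchmark), the send buffer can be
filled with a non-zero pattern before the clock starts.  Writing to
every byte faults each page in up front and guarantees none of them is
still a copy-on-write mapping of the zero page:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a message buffer and write a non-zero
 * pattern to every byte, so each page is faulted in and privately
 * backed before any timing starts.  Returns NULL if malloc fails. */
unsigned char *alloc_touched_buffer(size_t len)
{
    assert(len > 0);
    unsigned char *buf = malloc(len);
    if (buf == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++)
        buf[i] = (unsigned char)((i % 251) + 1);  /* values 1..251, never 0 */
    return buf;
}

/* Hypothetical check: returns 1 if no byte in the buffer is zero. */
int buffer_is_nonzero(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] == 0)
            return 0;
    return 1;
}
```

A benchmark would call alloc_touched_buffer() once per buffer, outside
the timed region, and only then start the ping-pong or streaming loop.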

> If you want to show the impact of concurrent communications, something 
> latency-based like the HPCC ring test is the best way (eventually with 
> more nodes). The millions of packet per second of a stream-based 
> benchmark are lovely for the marketing folks, but has little meaning for 
> real codes that computes a minimum. However, an alltoall on many 
> cores/nodes would exercise the same metric (many sends/recvs on the same 
> NIC at the same time), but would be harder to cheat and be much more 
> meaningful IMHO.

Alltoall is one of the hardest functions to optimise, purely because of
contention in the NIC; the optimisations we do aim to reduce that
contention and avoid hotspots.  It's probably a good thing to benchmark
to get an idea of the capability of a given network.
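
To illustrate what avoiding hotspots can mean, here is a sketch of one
textbook approach, a shifted exchange schedule.  This is an assumption
chosen for illustration, not a description of the optimisations
mentioned above, and the function names are hypothetical.  In step s,
rank r sends to (r + s) mod n, so within any one step every destination
is targeted by exactly one sender:

```c
#include <assert.h>

/* Hypothetical helper: destination that `rank` sends to in `step` of a
 * shifted all-to-all over `nranks` processes. */
int shifted_dest(int rank, int step, int nranks)
{
    assert(nranks > 0 && rank >= 0 && rank < nranks);
    assert(step >= 0 && step < nranks);
    return (rank + step) % nranks;
}

/* Hypothetical helper: source whose message `rank` receives in `step`. */
int shifted_src(int rank, int step, int nranks)
{
    assert(nranks > 0 && rank >= 0 && rank < nranks);
    assert(step >= 0 && step < nranks);
    return (rank - step + nranks) % nranks;
}

/* Count how many ranks target `dest` in a given step.  A result of 1
 * for every destination means no rank is a hotspot in that step. */
int senders_to(int dest, int step, int nranks)
{
    int count = 0;
    for (int r = 0; r < nranks; r++)
        if (shifted_dest(r, step, nranks) == dest)
            count++;
    return count;
}
```

The naive alternative, where every rank walks destinations 0, 1, 2, ...
in the same order, has all n ranks sending to the same destination in
each step, which is exactly the kind of hotspot a schedule like this
spreads out.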

Ashley,