[Beowulf] Performance characterising a HPC application
smulcahy at aplpi.com
Thu Mar 15 09:03:34 PDT 2007
I'm looking for any suggestions people might have on performance
characterising an HPC application (how's that for a broad query :)
We have a 20 node Opteron 270 (2.0GHz dual core, 4GB RAM, diskless)
cluster with gigabit ethernet interconnect. It is used primarily to run
an Oceanography numerical model called ROMS (http://www.myroms.org/ in
case anyone is interested). The nodes are running Debian GNU/Linux Etch
(AMD64 version) and we're using the Portland Group fortran90 compiler and
MPICH2 for our MPI needs. The cluster has been in production mode pretty
much since it was commissioned so I haven't gotten a chance to do much
tuning and benchmarking.
I'm currently trying to characterise the performance of the model, in
particular to determine where it is
1. processor bound.
2. memory bound.
3. interconnect bound.
4. headnode bound.
I'm curious about how others go about this kind of characterisation -
I'm not at all familiar with the model at a code level (my expertise, if
any!, is in the area of Linux and hardware rather than in fortran90
code) so I don't have any particular insights from that perspective. I'm
hoping I can characterise the app from outside using various measurement
tools.
So far, I've used a mix of things including Ganglia, htop, iostat,
vmstat, wireshark, ifstat (and a few others) to try and get a picture
of how the app behaves when running. One of my problems is having too
much data to analyse and not being entirely certain what is significant
and what isn't.
So far I've seen the following characteristics,
On the head node:
* Memory usage is pretty constant at about 1GB while the model is
running. An additional 2-3GB is used in memory buffers and memory
caches, presumably because this node does a lot of I/O.
* Network traffic in averages at about 40 Mbit/sec but peaks to about
940 Mbit/sec (I was surprised by this - I didn't think gigabit was
capable of even approaching this in practice, is this figure dubious or
are bursts at this speed possible on good Gigabit hardware?). Network
traffic out averages about 35 Mbit/sec but peaks to about 200 Mbit/sec.
The peaks are very short (maybe a few seconds in duration, presumably at
the end of an MPI "run" if that is the correct term).
* Processor usage averages about 25% but if I watch htop activity for a
while I see bursts of 80-90% user activity on each core so the average
understates the peaks.
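As a rough sanity check on the 940 Mbit/sec peak, here is a back-of-envelope sketch of the ceiling for TCP payload throughput on gigabit Ethernet. It assumes standard Ethernet framing overheads, IPv4 and TCP headers without extra options other than timestamps, and a 1500 byte MTU; the function name is just for illustration. Which layer a given monitoring tool actually counts bytes at is something I have not verified, so treat the result as a ballpark only.

```python
# Back-of-envelope ceiling for TCP payload throughput on gigabit Ethernet.
# All overheads below are the standard sizes; the timestamp option is an
# assumption about how the TCP stack is configured.

LINE_RATE_MBIT = 1000.0

PREAMBLE_SFD = 8      # preamble + start-of-frame delimiter
ETH_HEADER = 14       # destination MAC, source MAC, ethertype
FCS = 4               # frame check sequence
INTERFRAME_GAP = 12   # mandatory idle time between frames, in byte times
WIRE_OVERHEAD = PREAMBLE_SFD + ETH_HEADER + FCS + INTERFRAME_GAP  # 38 bytes

IP_HEADER = 20
TCP_HEADER = 20
TCP_TIMESTAMP_OPTION = 12  # assumption: TCP timestamps are enabled


def tcp_goodput_mbit(mtu, timestamps=True):
    """Maximum TCP payload rate in Mbit/sec for a given MTU."""
    options = TCP_TIMESTAMP_OPTION if timestamps else 0
    payload = mtu - IP_HEADER - TCP_HEADER - options
    on_wire = mtu + WIRE_OVERHEAD
    return LINE_RATE_MBIT * payload / on_wire


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: about {tcp_goodput_mbit(mtu):.1f} Mbit/sec of TCP payload")
```

With a 1500 byte MTU this works out to roughly 941 Mbit/sec, so under these assumptions a short burst near 940 Mbit/sec looks plausible rather than dubious.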
On a compute node:
* Memory usage is pretty constant at about 700MB while the model is
running with very little used in buffers or caches.
* Network traffic in averages at about 50 Mbit/sec but peaks to about
200 Mbit/sec. Network traffic out averages about 50 Mbit/sec but peaks
to about 200 Mbit/sec. The peaks are very short (maybe a few seconds in
duration, presumably at the end of an MPI "run" if that is the correct
term).
* Processor usage averages about 20% but if I watch htop activity for a
while I see bursts of 50-60% user activity on each core so the average
understates the peaks.
I'm inclined to install sar on these nodes and run it for a while -
although again I'm wary about generating lots of performance data if I'm
not sure what I'm looking for. I'm also a little wary of some of the RRD
based tools which (for space-saving reasons) seem to do a lot of
averaging which may actually hide information about bursts. Given that
the model run here seems to be quite bursty I think that peak
information is important.
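To illustrate the averaging concern, here is a small sketch that consolidates per-second samples into fixed-size buckets and keeps both the average and the maximum for each bucket. The trace is synthetic (made-up numbers loosely shaped like the head node traffic above), and the `summarise` helper is hypothetical, not part of any of the tools mentioned.

```python
def summarise(samples, bucket):
    """Split per-second samples into fixed-size buckets and return
    (averages, maxima), with one value per bucket."""
    averages, maxima = [], []
    for start in range(0, len(samples), bucket):
        chunk = samples[start:start + bucket]
        averages.append(sum(chunk) / len(chunk))
        maxima.append(max(chunk))
    return averages, maxima


# Synthetic one-minute trace in Mbit/sec: 57 quiet seconds, then a 3 second burst.
trace = [40.0] * 57 + [940.0] * 3

averages, maxima = summarise(trace, bucket=60)

if __name__ == "__main__":
    print("60 s average:", averages)
    print("60 s maximum:", maxima)
```

The one-minute average of this trace is 85 Mbit/sec, which gives no hint that the link was briefly close to saturated; the per-bucket maximum preserves that. RRDtool does offer a MAX consolidation function alongside AVERAGE, so whether peaks survive probably depends on how each tool's archives are configured.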
I'm still unsure what the bottleneck currently is. My hunch is that a
faster interconnect *should* give better performance but I'm not sure
how to quantify that. Do others here running MPI jobs see big
improvements in using Infiniband over Gigabit for MPI jobs or does it
really depend on the characteristics of the MPI job? What
characteristics should I be looking for?
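One crude way to start quantifying this is the usual latency plus size-over-bandwidth model for a single message. The sketch below applies it to the two rough message sizes mentioned under goal (a) further down, assuming those sizes are in bytes. The latency and bandwidth figures for both interconnects are illustrative guesses, not measurements from our cluster or from any particular product, and the model ignores contention, protocol overhead and overlap of communication with computation.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bit_s):
    """Simple model: time = fixed latency + serialisation time."""
    return latency_s + size_bytes * 8 / bandwidth_bit_s


# Illustrative assumptions only (not measured):
INTERCONNECTS = {
    "gigabit ethernet": {"latency_s": 50e-6, "bandwidth_bit_s": 1e9},
    "faster interconnect": {"latency_s": 5e-6, "bandwidth_bit_s": 8e9},
}

# Rough message sizes from the text, assumed to be in bytes.
MESSAGE_SIZES = (40_000, 205_000)

if __name__ == "__main__":
    for name, params in INTERCONNECTS.items():
        for size in MESSAGE_SIZES:
            t = transfer_time(size, **params)
            print(f"{name:20s} {size:7d} bytes: {t * 1e6:8.1f} microseconds")
```

Under these assumptions the serialisation term dominates for messages of this size, which would suggest bandwidth matters more than latency here. How much of the total run time is actually spent in communication is a separate question that this model does not answer.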
The goals of this characterisation exercise are two-fold,
a) to identify what parts of the system any tuning exercises should
focus on.
- some possible low hanging fruit includes enabling jumbo frames [some
rough calculations suggest that we have 2 sizes of MPI messages, one at
40k and one at 205k ... use of jumbo frames should significantly reduce
the number of packets needed to transmit a message, but would the gains be
significant?]
- Do people here normally tune the tcp/ip stack? My experience is that
it is very easy to reduce the performance by trying to tweak kernel
buffer sizes due to the trade-offs in memory ... and 2.6 Linux kernels
should be reasonably smart about this.
- Have people had much success with bonding and gigabit or are there
significant overheads in bonding?
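For the jumbo frames item above, the packet counts can be sketched as follows. This assumes the two message sizes are 40,000 and 205,000 bytes, that MPI traffic goes over TCP with timestamps enabled (52 bytes of IP and TCP headers per packet), and that each message is sent on its own rather than coalesced with others. The helper name is hypothetical.

```python
import math

IP_AND_TCP_HEADERS = 20 + 20 + 12  # IPv4 + TCP + timestamp option (assumed)


def packets_needed(message_bytes, mtu):
    """Number of full-or-partial TCP segments needed for one message."""
    payload_per_packet = mtu - IP_AND_TCP_HEADERS
    return math.ceil(message_bytes / payload_per_packet)


# Rough message sizes from the text, assumed to be in bytes.
MESSAGE_SIZES = (40_000, 205_000)

if __name__ == "__main__":
    for size in MESSAGE_SIZES:
        standard = packets_needed(size, mtu=1500)
        jumbo = packets_needed(size, mtu=9000)
        print(f"{size} bytes: {standard} packets at MTU 1500, {jumbo} at MTU 9000")
```

Under these assumptions the packet count drops by roughly a factor of six, which should mean fewer interrupts and less per-packet processing. The payload fraction of each frame improves much less, so I would expect any gain to show up mainly as reduced CPU overhead rather than raw bandwidth. Jumbo frames also need every NIC and switch on the path to support the larger MTU.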
b) to allow us to specify a new cluster which will run the model *faster*!
- from a perusal of past postings it sounds like current Opterons lag
current Xeons in raw numeric performance (but only by a little) but that
the memory controller architecture of Opterons gives them an overall
performance edge in most typical HPC loads, is that a correct 36,000ft
summary or does it still depend very much on the application?
I notice that AMD (and Mellanox and Pathscale/Qlogic) have clusters
available through their developer program for testing. Has anyone
actually used these? It sounds like what we really need before spec'ing
a new system is to list our assumptions and then go and test them on
some similar hardware - these clusters would seem to offer an ideal
environment for doing that but I'm wondering, in practice, how many
hoops one has to jump through to avail of them ... and whether parties
from outside of the US are even allowed access to these.
Apologies for the long-winded email but all feedback welcome. I'll be
happy to summarise any off-list comments back to the list,
Stephen Mulcahy, Applepie Solutions Ltd, Innovation in Business Center,
GMIT, Dublin Rd, Galway, Ireland. http://www.aplpi.com