<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">

<HTML>

<HEAD>

<TITLE>RE: [Beowulf] Performance characterising a HPC application</TITLE>

</HEAD>

<BODY>

<!-- Converted from text/plain format -->


<P><FONT SIZE=2>This is a very interesting topic.<BR>

<BR>

First off, it's interesting how different the head node and the compute nodes are, and that CPU utilisation is relatively low.<BR>

<BR>

What is the runtime of one run?<BR>

<BR>

Have you tried running it only on compute nodes? (mpirun -nolocal)<BR>

<BR>

Have you experimented with running two processes per node versus four processes per node on half the number of nodes? That would show whether a quad-core system gives you an advantage (more MPI traffic stays within the node) or a disadvantage (more MPI traffic squeezing through the interconnect bottleneck).<BR>

<BR>

I presume InfiniBand will be more valuable on quad-core nodes.<BR>

<BR>

Does the app use any scratch space at runtime over NFS?<BR>

<BR>

What size are the input and output files, and how much time is spent reading and writing them?<BR>

<BR>

Michael<BR>

<BR>

 -----Original Message-----<BR>

From:   stephen mulcahy [<A HREF="mailto:smulcahy@aplpi.com">mailto:smulcahy@aplpi.com</A>]<BR>

Sent:   Fri Mar 16 07:36:12 2007<BR>

To:     beowulf@beowulf.org<BR>

Subject:        [Beowulf] Performance characterising a HPC application<BR>

<BR>

Hi,<BR>

<BR>

I'm looking for any suggestions people might have on performance<BR>

characterising an HPC application (how's that for a broad query :)<BR>

<BR>

Background:<BR>

We have a 20-node Opteron 270 (2.0GHz dual core, 4GB RAM, diskless)<BR>

cluster with gigabit ethernet interconnect. It is used primarily to run<BR>

an Oceanography numerical model called ROMS (<A HREF="http://www.myroms.org/">http://www.myroms.org/</A> in<BR>

case anyone is interested). The nodes are running Debian GNU/Linux Etch<BR>

(AMD64 version) and we're using the Portland Group Fortran 90 compiler and<BR>

MPICH2 for our MPI needs. The cluster has been in production mode pretty<BR>

much since it was commissioned so I haven't gotten a chance to do much<BR>

tuning and benchmarking.<BR>

<BR>

I'm currently trying to characterise the performance of the model, in<BR>

particular to determine whether it is<BR>

<BR>

1. processor bound.<BR>

<BR>

2. memory bound.<BR>

<BR>

3. interconnect bound.<BR>

<BR>

4. headnode bound.<BR>

<BR>

I'm curious about how others go about this kind of characterisation -<BR>

I'm not at all familiar with the model at a code level (my expertise, if<BR>

any!, is in the area of Linux and hardware rather than in Fortran 90<BR>

code) so I don't have any particular insights from that perspective. I'm<BR>

hoping I can characterise the app from outside using various measurement<BR>

tools.<BR>

<BR>

So far, I've used a mix of things including Ganglia, htop, iostat,<BR>

vmstat, Wireshark, ifstat (and a few others) to try and get a picture<BR>

of how the app behaves when running. One of my problems is having too<BR>

much data to analyse and not being entirely certain what is significant<BR>

and what isn't.<BR>

<BR>

So far I've seen the following characteristics,<BR>

<BR>

On the head node:<BR>

* Memory usage is pretty constant at about 1GB while the model is<BR>

running. An additional 2-3GB is used in memory buffers and memory<BR>

caches, presumably because this node does a lot of I/O.<BR>

* Inbound network traffic averages about 40 Mbit/sec but peaks at about<BR>

940 Mbit/sec (I was surprised by this - I didn't think gigabit was<BR>

capable of even approaching this in practice, is this figure dubious or<BR>

are bursts at this speed possible on good Gigabit hardware?). Outbound<BR>

network traffic averages about 35 Mbit/sec but peaks at about 200 Mbit/sec.<BR>

The peaks are very short (maybe a few seconds in duration, presumably at<BR>

the end of an MPI "run" if that is the correct term).<BR>

* Processor usage averages about 25% but if I watch htop activity for a<BR>

while I see bursts of 80-90% user activity on each core so the average<BR>

is misleading.<BR>
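<BR>

As a rough sanity check on the 940 Mbit/sec figure, here is a back-of-envelope sketch in Python of the ceiling for TCP payload over gigabit Ethernet. It assumes a standard 1500-byte MTU, IPv4 and TCP headers of 20 bytes each, 12 bytes of TCP timestamp options, and 38 bytes of per-frame Ethernet overhead (preamble, header, FCS and inter-frame gap). The function name is made up for illustration.<BR>

```python
# Per-frame Ethernet overhead on the wire, in bytes.
PREAMBLE_SFD = 8
ETH_HEADER = 14
FCS = 4
INTERFRAME_GAP = 12


def tcp_goodput_mbit(mtu, link_mbit=1000.0, ip_hdr=20, tcp_hdr=20, tcp_opts=12):
    """Upper bound on TCP payload throughput (Mbit/sec) for a given MTU."""
    # Bytes of application data carried by one full-size frame.
    payload = mtu - ip_hdr - tcp_hdr - tcp_opts
    # Bytes of link time consumed by that frame, including framing overhead.
    on_wire = mtu + PREAMBLE_SFD + ETH_HEADER + FCS + INTERFRAME_GAP
    return link_mbit * payload / on_wire


print(round(tcp_goodput_mbit(1500), 1))  # 941.5
```

Under these assumptions the ceiling is about 941.5 Mbit/sec of TCP payload, so short bursts near 940 Mbit/sec look plausible on hardware that can sustain line rate.<BR>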

<BR>

On a compute node:<BR>

* Memory usage is pretty constant at about 700MB while the model is<BR>

running with very little used in buffers or caches.<BR>

* Inbound network traffic averages about 50 Mbit/sec but peaks at about<BR>

200 Mbit/sec. Outbound network traffic averages about 50 Mbit/sec but peaks<BR>

at about 200 Mbit/sec. The peaks are very short (maybe a few seconds in<BR>

duration, presumably at the end of an MPI "run" if that is the correct<BR>

term).<BR>

* Processor usage averages about 20% but if I watch htop activity for a<BR>

while I see bursts of 50-60% user activity on each core so the average<BR>

is misleading.<BR>

<BR>

I'm inclined to install sar on these nodes and run it for a while -<BR>

although again I'm wary about generating lots of performance data if I'm<BR>

not sure what I'm looking for. I'm also a little wary of some of the RRD<BR>

based tools which (for space-saving reasons) seem to do a lot of<BR>

averaging which may actually hide information about bursts. Given that<BR>

the model run here seems to be quite bursty I think that peak<BR>

information is important.<BR>
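<BR>

To illustrate the concern about averaging, here is a minimal Python sketch with made-up numbers: a link idling at 25 Mbit/sec with one 3-second burst at 940 Mbit/sec, consolidated into a single 60-second data point. The function name and the sample values are hypothetical.<BR>

```python
def consolidate(samples, window):
    """Reduce per-second samples to one (average, maximum) pair per window."""
    out = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        out.append((sum(chunk) / len(chunk), max(chunk)))
    return out


# Hypothetical minute of traffic in Mbit/sec: mostly quiet, one 3-second burst.
samples = [25.0] * 57 + [940.0] * 3
print(consolidate(samples, 60))  # [(70.75, 940.0)]
```

The average for that minute comes out at 70.75 Mbit/sec while the maximum is 940, so a tool that stores only averages would not show the burst at all. RRDtool can keep a MAX consolidation alongside AVERAGE, which retains the peaks if the archives are configured that way.<BR>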

<BR>

I'm still unsure what the bottleneck currently is. My hunch is that a<BR>

faster interconnect *should* give better performance but I'm not sure<BR>

how to quantify that. Do others here running MPI jobs see big<BR>

improvements in using Infiniband over Gigabit for MPI jobs or does it<BR>

really depend on the characteristics of the MPI job? What<BR>

characteristics should I be looking for?<BR>

<BR>

The goals of this characterisation exercise are two-fold,<BR>

<BR>

a) to identify what parts of the system any tuning exercises should<BR>

focus on.<BR>

- some possible low hanging fruit includes enabling jumbo frames [some<BR>

rough calculations suggest that we have 2 sizes of MPI messages, one at<BR>

40k and one at 205k ... use of jumbo frames should significantly reduce<BR>

the number of packets to transmit a message, but would the gains be<BR>

significant?].<BR>

- Do people here normally tune the TCP/IP stack? My experience is that<BR>

it is very easy to reduce the performance by trying to tweak kernel<BR>

buffer sizes due to the trade-offs in memory ... and 2.6 Linux kernels<BR>

should be reasonably smart about this.<BR>

- Have people had much success with bonding and gigabit or are there<BR>

significant overheads in bonding?<BR>
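<BR>

On the jumbo frames point above, the packet counts can be estimated with a few lines of Python. This is only a sketch: it treats "40k" and "205k" as 40 and 205 KiB, assumes MPI traffic goes over TCP with 52 bytes of IP and TCP headers per segment, and compares a 1500-byte MTU with a 9000-byte MTU. The function name is hypothetical.<BR>

```python
def frames_needed(message_bytes, mtu, headers=52):
    """Number of full-size TCP segments needed to carry one message."""
    # Assumed headers: 20 bytes IP + 20 bytes TCP + 12 bytes TCP options.
    mss = mtu - headers
    # Integer ceiling division.
    return (message_bytes + mss - 1) // mss


for size_kib in (40, 205):
    size = size_kib * 1024
    print(size_kib, frames_needed(size, 1500), frames_needed(size, 9000))
# 40 29 5
# 205 145 24
```

So the packet count drops by roughly a factor of six, which should cut per-packet interrupt and protocol processing. The share of each frame carrying payload only rises from about 94% to about 99% under these assumptions, so any gain in raw throughput is likely to be modest.<BR>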

<BR>

b) to allow us to specify a new cluster which will run the model *faster*!<BR>

- from a perusal of past postings it sounds like current Opterons lag<BR>

current Xeons in raw numeric performance (but only by a little) but that<BR>

the memory controller architecture of Opterons gives them an overall<BR>

performance edge in most typical HPC loads, is that a correct 36,000ft<BR>

summary or does it still depend very much on the application?<BR>

<BR>

I notice that AMD (and Mellanox and Pathscale/Qlogic) have clusters<BR>

available through their developer program for testing. Has anyone<BR>

actually used these? It sounds like what we really need before spec'ing<BR>

a new system is to list our assumptions and then go and test them on<BR>

some similar hardware - these clusters would seem to offer an ideal<BR>

environment for doing that but I'm wondering, in practice, how many<BR>

hoops one has to jump through to avail of them ... and whether parties<BR>

from outside of the US are even allowed access to these.<BR>

<BR>

Apologies for the long-winded email but all feedback welcome. I'll be<BR>

happy to summarise any off-list comments back to the list,<BR>

<BR>

-stephen<BR>

--<BR>

Stephen Mulcahy, Applepie Solutions Ltd, Innovation in Business Center,<BR>

    GMIT, Dublin Rd, Galway, Ireland.      <A HREF="http://www.aplpi.com">http://www.aplpi.com</A><BR>


</FONT>

</P>


</BODY>

</HTML>