[Beowulf] Performance characterising a HPC application

Mark Hahn hahn at mcmaster.ca
Fri Mar 16 10:20:58 PDT 2007


> 1. processor bound.
> 2. memory bound.

oprofile is the only thing I know of that will give you this distinction.

> 3. interconnect bound.

with ethernet, this is fairly obvious: time the nodes spend waiting on the
interconnect shows up as idle (or system) time, so you can just look at the
user/system/idle breakdown while the job runs.

> 4. headnode bound.

do you mean for NFS traffic?

> * Network traffic in averages at about 40 Mbit/sec but peaks to about 940 
> Mbit/sec (I was surprised by this - I didn't think gigabit was capable of 
> even approaching this in practice, is this figure dubious or are bursts at 
> this speed possible on good Gigabit hardware?).

it's not _that_ hard to hit full wire speed with gigabit - 940 Mbit/sec is
essentially line rate once you allow for framing/TCP overhead.  however, if
you're saturating the wire even in bursts, it's entirely possible that nodes
are being bottlenecked by the network during those bursts.

> * Memory usage is pretty constant at about 700MB while the model is running 
> with very little used in buffers or caches.

if you have lots of unused memory, that implies that you're not doing much
file IO, which would suggest that IO to the headnode is not an issue.  NFS
(like the rest of the page cache) is quite willing to use all your
idle/wasted memory for caching files/metadata, so an empty cache means there
was little worth caching...

> * Network traffic in averages at about 50 Mbit/sec but peaks to about 200 
> Mbit/sec. Network traffic out averages about 50 Mbit/sec but peaks to about 
> 200Mbit/sec. The peaks are very short (maybe a few seconds in duration, 
> presumably at the end of an MPI "run" if that is the correct term).

you don't think the peaks correspond to inter-node communication (_during_
the MPI job)?

> * Processor usage averages about 20% but if I watch htop activity for a while

ouch.  the cluster is doing very badly - a 20% average means the cpus are
spending most of their time waiting - and is clearly bottlenecked on either
inter-node or headnode IO.  I guess I'd be tempted to capture some
representative trace data with tcpdump (but I'm pretty old-fashioned and
fundamentalist about these things.)

> I'm inclined to install sar on these nodes and run it for a while - although

I don't think sar would tell you _what_ your compute cpus are waiting on.
(/proc/$PID/wchan _might_ actually tell you whether the process is blocked
on NFS IO versus sockets/MPI.)
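
if you want to check that across a bunch of pids without eyeballing /proc by
hand, a trivial reader is enough.  a minimal sketch (untested, no
dependencies) - wait channels with nfs/rpc-flavoured names point at NFS,
socket/tcp-flavoured ones at MPI-over-ethernet:

    /* wchan.c - print the kernel wait channel of a process.
     * usage: ./wchan <pid> */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        char path[64], buf[128];
        FILE *f;
        size_t n;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }
        snprintf(path, sizeof(path), "/proc/%s/wchan", argv[1]);
        f = fopen(path, "r");
        if (!f) {
            perror(path);
            return 1;
        }
        n = fread(buf, 1, sizeof(buf) - 1, f);
        buf[n] = '\0';            /* wchan is a bare symbol name, no newline */
        fclose(f);
        printf("%s: %s\n", argv[1], buf);
        return 0;
    }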

> quantify that. Do others here running MPI jobs see big improvements in using 
> Infiniband over Gigabit for MPI jobs or does it really depend on the

jeez: compare a 50 us interconnect to a 4 us one (or 80 MB/s vs >800).

anything which doesn't speed up going from gigabit to IB/10G/quadrics 
is what I would call embarrassingly parallel...

> characteristics of the MPI job? What characteristics should I be looking for?

well, have you run a simple MPI benchmark, to make sure you're seeing
reasonable performance?  single-pair latency, bandwidth and some form of 
group communication are always good to know.
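
a crude single-pair ping-pong sketch (untested - use a real suite like IMB
or the OSU micro-benchmarks for anything serious) would look roughly like
this; on your gigabit setup you'd hope to see numbers in the ~50 us /
~80-100 MB/s ballpark for small and large messages respectively:

    /* pingpong.c - crude single-pair latency/bandwidth test.
     * build/run:  mpicc pingpong.c -o pingpong && mpirun -np 2 ./pingpong */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, i, iters = 1000;
        long size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (size = 1; size <= (1 << 20); size *= 8) {
            char *buf = malloc(size);
            double t0, t;

            MPI_Barrier(MPI_COMM_WORLD);
            t0 = MPI_Wtime();
            for (i = 0; i < iters; i++) {
                if (rank == 0) {
                    MPI_Send(buf, (int)size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, (int)size, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, (int)size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, (int)size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
                }
            }
            t = (MPI_Wtime() - t0) / iters / 2;   /* one-way time per message */
            if (rank == 0)
                printf("%8ld bytes  %10.2f us  %8.1f MB/s\n",
                       size, t * 1e6, size / t / 1e6);
            free(buf);
        }
        MPI_Finalize();
        return 0;
    }

an allreduce or alltoall across all your nodes is worth timing too, since
collectives are usually where gigabit hurts most.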

> a) to identify what parts of the system any tuning exercises should focus on.
> - some possible low hanging fruit includes enabling jumbo frames [some rough

jumbo frames are mainly a way to recover some CPU overhead - most systems,
especially ones sitting at only 20% cpu like yours, can keep up with
back-to-back 1500-byte frames just fine.  it's easy enough to measure
(with ttcp, netperf, etc).
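
if you don't have netperf/ttcp handy on the nodes, even a crude hand-rolled
single-stream test tells you whether one TCP stream gets near wire speed
with standard frames.  a sketch (untested; the port number and the 1 GB
transfer size are arbitrary choices):

    /* tput.c - crude single-stream throughput check, in the spirit of ttcp.
     *   receiver:  ./tput -r
     *   sender:    ./tput -s <receiver-ip>     (prints MB/s when done) */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    #define PORT  5001
    #define CHUNK (64 * 1024)
    #define TOTAL (1024L * 1024 * 1024)

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(int argc, char **argv)
    {
        static char buf[CHUNK];
        struct sockaddr_in addr;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(PORT);

        if (argc == 2 && strcmp(argv[1], "-r") == 0) {   /* receiver: drain */
            int ls = socket(AF_INET, SOCK_STREAM, 0), s;
            addr.sin_addr.s_addr = INADDR_ANY;
            if (bind(ls, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                perror("bind"); return 1;
            }
            listen(ls, 1);
            s = accept(ls, NULL, NULL);
            while (read(s, buf, sizeof(buf)) > 0)
                ;
            close(s);
        } else if (argc == 3 && strcmp(argv[1], "-s") == 0) {  /* sender */
            int s = socket(AF_INET, SOCK_STREAM, 0);
            long sent = 0;
            double t0;
            addr.sin_addr.s_addr = inet_addr(argv[2]);
            if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                perror("connect"); return 1;
            }
            t0 = now();
            while (sent < TOTAL) {
                ssize_t n = write(s, buf, sizeof(buf));
                if (n <= 0) { perror("write"); return 1; }
                sent += n;
            }
            close(s);
            printf("%.1f MB/s\n", sent / (now() - t0) / 1e6);
        } else {
            fprintf(stderr, "usage: %s -r | -s <receiver-ip>\n", argv[0]);
            return 1;
        }
        return 0;
    }

if that already reports something like 110 MB/s without pinning a cpu, jumbo
frames won't buy you much.  (timing at the sender is slightly optimistic
because the last few buffers are still in flight when write() returns, but
over 1 GB it's noise.)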

> - Do people here normally tune the tcp/ip stack? My experience is that it is

I don't think so.  for modernish kernels, it should work OK out of the box.
it's possible you might be hitting some nic-specific settings (like interrupt
coalescing/mitigation).  measuring MPI latency would show that, I'd think.

> very easy to reduce the performance by trying to tweak kernel buffer sizes 
> due to the trade-offs in memory ... and 2.6 Linux kernels should be 
> reasonably smart about this.

buffer sizes are mainly an issue for long fat pipes - with your numbers
(say 50 us one-way, 80 MB/s), the bandwidth-delay product is only about
8 KB in flight at once (80 MB/s x ~100 us round trip), which the default
buffers handle easily.

> - Have people had much success with bonding and gigabit or is there 
> significant overheads in bonding?

most bonding/trunking uses LACP or similar, which is an aggregation system
not a raid0-like striping system.  so multiple streams speed up in total,
but an individual stream is still limited to ~100 MB/s.  obviously, this 
is a major win for your headnode, if it's a hotspot.  whether it would help
the compute nodes is harder to say.

I don't think you mentioned what your network looks like - all into one 
switch?  what kind is it?  have you verified that all the links are at 
1000/fullduplex?
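
ethtool (or mii-tool) on each node will tell you; if you'd rather script it
without those installed, the SIOCETHTOOL ioctl they use is simple enough.
a rough sketch (untested; assumes eth0 unless you pass an interface name):

    /* linkcheck.c - report link speed/duplex via SIOCETHTOOL,
     * roughly what "ethtool eth0" shows. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/sockios.h>
    #include <linux/ethtool.h>

    int main(int argc, char **argv)
    {
        const char *dev = (argc > 1) ? argv[1] : "eth0";
        struct ifreq ifr;
        struct ethtool_cmd ecmd;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        memset(&ifr, 0, sizeof(ifr));
        memset(&ecmd, 0, sizeof(ecmd));
        strncpy(ifr.ifr_name, dev, IFNAMSIZ - 1);
        ecmd.cmd = ETHTOOL_GSET;               /* "get settings" */
        ifr.ifr_data = (char *)&ecmd;
        if (fd < 0 || ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
            perror("SIOCETHTOOL");
            return 1;
        }
        printf("%s: %u Mb/s, %s duplex\n", dev, (unsigned)ecmd.speed,
               ecmd.duplex == DUPLEX_FULL ? "full" : "half");
        return 0;
    }

a node that auto-negotiated down to 100 Mbit or half duplex can easily
produce exactly this kind of slowdown.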

> b) to allow us to specify a new cluster which will run the model *faster*!
> - from a perusal of past postings it sounds like current Opterons lag current 
> Xeons in raw numeric performance (but only by a little) but that the memory

core2 beats opterons by a factor of 2 at the same clock for purely in-cache
flops.

opterons still do very well for less cache-friendly stuff, especially
compared to Intel quad-core chips.

> controller architecture of Opterons give them an overall performance edge in 
> most typical HPC loads, is that a correct 36,000ft summary or does it still 
> depend very much on the application?

it does and always will depend very much on the app ;)

> I notice that AMD (and Mellanox and Pathscale/Qlogic) have clusters available 
> through their developer program for testing. Has anyone actually used these?

I haven't.  but if you'd like to try on our systems, we have quite a range.
(no IB, but our quadrics systems are roughly comparable.)

regards, mark hahn.


