[Beowulf] memory bandwidth estimation

Robert G. Brown rgb at phy.duke.edu
Tue Nov 15 05:28:38 PST 2005


On Mon, 14 Nov 2005, Olivier Marsden wrote:

> Hello everyone,
> this might be a stupid question, but I haven't found a simple
> answer to it yet: is there any "easy" way of looking at the memory
> bandwidth that a program uses? What I basically want to know is
> whether my mpi-based program would benefit from dual-core processors
> or not, and this depends on whether the program's execution is badly
> memory-bandwidth-limited or not.
> So far I have not found anything better than looking at cpu usage thanks to 
> ganglia,
> or 'top' or something like that, and looking for times when the processors 
> are idling,
> but this method is definitely not very precise, and for all I know might be 
> wrong.
> Could anyone give me any suggestions?
> Thanks in advance,

The obvious suggestion is to find a dual core processor cluster that
somebody will loan you and run your application on it.  That way there
is no possibility of mistake.  Penguin usually has a test cluster
running and is pretty good about letting people try out code on it
(especially if they are potential customers, of course, but I think they
are easy about it even if not).  Several other companies ditto.  Whoever
you might actually buy from should be willing to help you here, and on
this list you might well find people with dual core clusters who will
let you bench on their cluster during an idle spell.

Application-specific memory bandwidth isn't terribly easy to measure
because it tends to be SO bound up in the code that is being executed.
One resource that won't answer your question but that will give you some
hints about the problem is the Intel optimization reference for the
Pentium family.  It has an entire chapter devoted to prefetch, for
example, showing how one can take certain code segments and shift them
almost completely from being memory bandwidth bound to being CPU bound
by layering prefetches in just the right pattern, so that the next
numbers to be multiplied are always loaded into the pipeline by the time
they are needed.  It also has nice sections on SSE (SIMD) and several
other optimizations -- a bit dated now, but still relevant.  You can
google it up
from the Intel website.
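
To give the flavor of what that chapter covers, here is a minimal
sketch of software prefetch layered into a dot-product loop.  It uses
gcc's __builtin_prefetch rather than the raw prefetch instructions the
Intel manual discusses, and PF_DIST is purely an illustrative tuning
knob -- the right prefetch distance depends on your memory latency and
on how much work each iteration does, so treat the numbers as
placeholders.

/* Minimal sketch: software prefetch in a dot product.  PF_DIST is an
 * illustrative tuning parameter, not a recommendation; prefetching a
 * little past the end of the arrays is harmless, since prefetch hints
 * never fault. */
#define PF_DIST 64   /* elements to fetch ahead -- tune per machine */

double dot(const double *a, const double *b, long n)
{
    long i;
    double sum = 0.0;

    for (i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + PF_DIST], 0, 0);  /* read, low locality */
        __builtin_prefetch(&b[i + PF_DIST], 0, 0);
        sum += a[i] * b[i];
    }
    return sum;
}

Tuned well, the loads for iteration i+PF_DIST are in flight while the
multiplies for iteration i execute, which is exactly the shift from
memory bandwidth bound toward CPU bound that the manual describes.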

The way I'd usually suggest proceeding is by a) studying your code's
core loop(s).  If they are highly vectorized -- lots of linear algebra
-- then they are relatively likely to be memory bound, and stream is
likely to be your relevant microbenchmark.  If they are almost anything
else, the code is REASONABLY likely to be CPU bound and to scale
linearly, or nearly so, across the cores of an N-core processor, an
N-processor SMP system, or a cluster of N UP nodes.  You can then b)
look at stream results (or SPEC results) and see how the relevant
configurations scale from UP to dual core to SMP in the architecture(s)
you are looking at.  If they scale linearly, then your application
probably will too, although a benchmark is worth a whole lot of
"probably".
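
For calibration, the heart of stream is nothing more exotic than a few
long unit-stride sweeps over arrays too big to cache; the "triad"
kernel looks essentially like the sketch below.  The array size and the
timing here are only illustrative -- grab the real benchmark from the
STREAM site for honest numbers -- but if your core loops look like this
(big arrays, a couple of flops per load), stream is telling you roughly
what to expect.

/* Sketch of the STREAM "triad" kernel: two loads, one store, two flops
 * per element.  N is an illustrative size chosen to be much larger
 * than cache. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N (4 * 1000 * 1000)

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
}

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    double scalar = 3.0, t;
    long i;

    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    t = now();
    for (i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];        /* triad: a = b + s*c */
    t = now() - t;

    /* three arrays of 8-byte doubles move through memory per pass */
    printf("triad ~ %.1f MB/s (check %.1f)\n",
           3.0 * N * 8 / t / 1.0e6, a[N / 2]);
    return 0;
}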

Alternatively, you COULD insert timing calls around your core loops (as
opposed to around the whole application) and measure actual timings of
those loops.  Then, if you could get permission to run on just ONE dual
core system, you could measure those timings when running the
application two copies at a time and one at a time.  If the timings are
the same, then relax: you're CPU bound, not memory bound, on THAT
HARDWARE (YMMV, of course, especially as you switch hardware, e.g. from
Intel to AMD or vice versa).
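
Since your program is already MPI-based, MPI_Wtime() is the path of
least resistance for the loop-level timing; a sketch, with
compute_core_loop() standing in for whatever your real inner kernel is:

/* Sketch of per-loop timing with MPI_Wtime(); compute_core_loop() is a
 * hypothetical stand-in for the real kernel -- the timing scaffolding
 * is the only point here. */
#include <mpi.h>
#include <stdio.h>

void compute_core_loop(void);    /* your real kernel */

void timed_core_loop(int rank)
{
    double t0, t1;

    t0 = MPI_Wtime();
    compute_core_loop();
    t1 = MPI_Wtime();
    printf("rank %d: core loop took %.6f s\n", rank, t1 - t0);
}

Run one copy per chip, then two copies on the two cores of a single
dual core chip, and compare the per-loop times.  If they balloon with
two copies per chip, the cores are fighting over the memory bus.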

Last suggestion: if you ARE memory bound and do lots of linear algebra,
I'd strongly recommend that you use ATLAS, possibly ATLAS tuned up still
further with prefetch.  ATLAS alone can be worth a factor of 2-3 in
speed, as its cache-friendly blocked algorithms are much more likely to
be efficient than a naive flat implementation of a vector loop.  In
fact, you might well NOT be memory bound without ATLAS -- not because
your code shouldn't be, but because the memory accesses are sufficiently
inefficient that the alternation between memory access and instruction
execution in the pipelines on the two processors permits memory to keep
up with both, sort of.  Not exactly the way you want to scale
linearly...
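
To be concrete about the kind of substitution involved, here is a
sketch contrasting a naive flat triple loop with a call to dgemm
through the standard CBLAS interface that ATLAS provides (the link
line -- something like -lcblas -latlas -- varies by installation):

/* Sketch: naive C = A*B versus ATLAS's tuned dgemm, both row-major
 * n x n.  The naive loop strides through B column-wise and thrashes
 * the cache for large n. */
#include <cblas.h>

void matmul_naive(int n, const double *A, const double *B, double *C)
{
    int i, j, k;

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            double s = 0.0;
            for (k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }
}

void matmul_atlas(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
}

The ATLAS version blocks the multiply to stay in cache, which is where
the factor of 2-3 comes from once the matrices no longer fit.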

    rgb

>
> Olivier Marsden
> Ecole Centrale de Lyon
> France
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit 
> http://www.beowulf.org/mailman/listinfo/beowulf
>

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu




