Instrumenting a parallel code

Robert G. Brown rgb at phy.duke.edu
Mon May 21 08:32:14 PDT 2001


On Mon, 21 May 2001, Jared Hodge wrote:

> We are instrumenting a parallel finite element code we have
> been working on.  We want to analyze how the system is being utilized,
> and produce reliable and quantifiable results so that in the future
> we'll know if clusters designed for this code should be designed to
> optimize network bandwidth, CPU speed, memory size, or other variables.
> Basically what we want to do is measure performance degradation as each
> of these decreases (and vice versa).  This won't give us absolute
> numbers for all variables, since there will obviously be plateaus in
> performance, but it's a start.  Here's what I've got in mind for each of
> these areas, please let me know if you have any suggestions.
>
> Memory
> 	This one is pretty easy as far as memory size.  We can just launch
> another application that will allocate a specific amount of memory and
> hold it (with swap off of course).  I'm not sure if adjusting and
> measuring memory latency is feasible or too great a concern.
>
>
> Network
> 	We're writing a series of wrapper functions for the MPI calls that we
> are using, which will time their execution.  This will give us a good
> indication of the blocking nature of communication in the program.
>
> CPU usage
> 	I'm really not sure how we can decrease this one easily other than
> changing the bus multiplier in hardware.  A timeline of CPU usage would
> at least give us a start (like capturing top's output), but this would
> alter the performance too (invasive performance monitor).  We could just
> use the network measurements and assume that whenever a node is not
> communicating or blocked for communication, it's "computing", but that
> is definitely an oversimplification.
>
> Any useful comments or suggestions would be appreciated.  Thanks.
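
The memory "ballast" idea above is simple enough to sketch.  Something
like the following (sizes and names purely illustrative) allocates a
fixed number of megabytes, touches every page so the allocation is
really resident, and then holds it until killed:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
      size_t mb = (argc > 1) ? (size_t) atol(argv[1]) : 64;
      size_t bytes = mb * 1024 * 1024;
      char *ballast = malloc(bytes);

      if (ballast == NULL) {
          fprintf(stderr, "could not allocate %lu MB\n", (unsigned long) mb);
          return 1;
      }
      memset(ballast, 1, bytes);   /* touch every page so it stays resident */
      printf("holding %lu MB; kill me to release it\n", (unsigned long) mb);
      for (;;)
          sleep(60);               /* hold the memory until killed */
  }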

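As for the MPI timing wrappers, the pattern is only a few lines per
call -- a rough sketch follows (the wrapper name is made up, and the
standard PMPI_ profiling interface is another way to hook the same
calls).  Communication time accumulated this way, subtracted from
wall-clock time, also yields the crude "not communicating == computing"
estimate mentioned under CPU usage:

  #include <mpi.h>

  static double comm_time = 0.0;   /* total seconds spent in wrapped MPI calls */

  int timed_MPI_Send(void *buf, int count, MPI_Datatype type,
                     int dest, int tag, MPI_Comm comm)
  {
      double t0 = MPI_Wtime();
      int rc = MPI_Send(buf, count, type, dest, tag, comm);
      comm_time += MPI_Wtime() - t0;
      return rc;
  }
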
Two comments/suggestions:

  a) Look over lmbench (www.bitmover.com) as a microbenchmark basis for
your measurements.  It has tools to explicitly measure just about
anything in your list above and more besides.  It is used by Linus and
the kernel developers to test various kernel subsystems, so it gives you
a common basis for discussion with kernel folks should the need arise.
It might not do everything you need -- in many cases you will be more
interested in stream- or cpu-rate-like measures of performance that
combine the effects of cpu speed and memory speed for certain tasks --
but it does a lot.

  b) Remember the profiling commands (compile with -pg and use gprof
with gcc, for example).  In a lot of cases profiling a simple run will
immediately tell you whether the code is likely to be memory or CPU
bound or bound by transcendental (library) speed or bound by network
speed.  At the very least you can see where it spends its time on
average and then add some timing code to those routines and core loops
to determine what subsystem(s) are the rate limiting bottlenecks.
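
For instance (purely illustrative -- the loop below just stands in for
whatever your element assembly or solver sweep actually looks like),
the sort of hand-inserted timing I mean is nothing fancier than
wall-clock deltas around the suspect code:

  #include <stdio.h>
  #include <sys/time.h>

  static double walltime(void)        /* wall-clock seconds */
  {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return tv.tv_sec + 1.0e-6 * tv.tv_usec;
  }

  int main(void)
  {
      long i;
      double sum = 0.0;
      double t0 = walltime();
      for (i = 0; i < 100000000L; i++)   /* stand-in for a core loop */
          sum += 1.0 / (double) (i + 1);
      printf("core loop: %.3f s (sum = %g)\n", walltime() - t0, sum);
      return 0;
  }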

I actually think that your project is in an area where real "beowulf
research" needs to occur.  I have this vision of a suite of
microbenchmarks built into a kernel module and accessed via
/proc/microbench/whatever that provide any microbenchmark in the suite
on demand (or perhaps more simply an ordinary microbenchmark generating
program that runs at an appropriate runlevel at boot and stores all the
results in e.g. /var/microbench/whatever files).  In either or both
cases I'd expect that both the latest measurement and a running average
with full statistics over invocations of the microbenchmark program
would be provided.

Either way, the microbenchmark results for a given system would become a
permanent part of a common system profile that is available to ALL
programs and programmers after the first bootup.
System comparison and systems/beowulf/software engineering would all be
immeasurably enhanced -- one could write programs that autotune to at
least a first order approximation off of this microbenchmark data, and
for many folks and applications this would be both a large improvement
over flat untuned code and "enough".  More complex/critical programs or
libraries could refine the first order approximation via an ATLAS-like
feedback process.
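
Just to make the autotuning idea concrete (everything below is
hypothetical -- the file name, its one-number format, and the threshold
are all invented for illustration), a first-order tuner need be no
smarter than:

  #include <stdio.h>

  int main(void)
  {
      double mem_bw = 0.0;   /* MB/sec from a hypothetical microbenchmark */
      int block = 64;        /* default blocking factor */
      FILE *fp = fopen("/var/microbench/mem_bw", "r");

      if (fp != NULL) {
          if (fscanf(fp, "%lf", &mem_bw) != 1)
              mem_bw = 0.0;
          fclose(fp);
      }
      if (mem_bw > 500.0)
          block = 256;       /* crude first-order rule of thumb */
      printf("using block size %d (mem_bw = %.1f MB/sec)\n", block, mem_bw);
      return 0;
  }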

   Hope this helps,

     rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu






