[Beowulf] How does one calculate the scalability of one's code?
Robert G. Brown
rgb at phy.duke.edu
Thu Jun 10 08:23:57 PDT 2004
On Thu, 10 Jun 2004 daniel.kidger at quadrics.com wrote:
> > To make the number more clear
> > Run1 Specs
> > Grid Size: 100 x 100 x 100
> > Nodes: 4
> > Time: 605 seconds
> > Run2 Specs
> > Grid Size: 200 x 200 x 200
> > Nodes: 32
> > Time: 944 seconds
> > So Run2 took ~ 1.6 times as long as Run1
> You must also understand what your code is doing. If you code iterates to a converged solution (say using PCG) then the number of iterations rises as the number of freedoms increases. So a 1.6x speedup might be perfect scaling still.
> Also look at load balancing - how are the 200^3 grid cells distributed over the 32 Nodes.
> You may well find that the workload given to each cpu is not equal.
> For the record - what compute nodes are you using - how many cpus does each have. And are you using a high performance interconnect ?- if so it should be able to give you statistics of average message size, number of bytes transferred for each program run etc. This would be interesting.
...and to continue Greg's observation with some direction -- What you
probably want to do is not try to fit Amdahl's law per se, but variants
that are fairly easily derived from the same sort of analysis that
include at least the time required for communications between nodes if
not something a bit more sophisticated as might be required by your
communications model and problem partitioning.
A web reference:
has a whole chapter devoted to Amdahl's law and generalizations,
including some little graphs to help you understand why one doesn't
generally see linear scaling in "real" (as opposed to embarrassingly)
Remarks apropos of your particular problem: If you are doing things
with Maxwell's equations (what, I don't know) you are likely solving
things like Laplace or Helmholtz equations on a grid (with ODE/PDE
solvers). If the problem is a lattice decomposition of a spatial volume
and you use a local method (one where each lattice point has a value
that is updated only on the basis of its nearest neighbors) and each
node contains a subvolume of characteristic length L, then the number of
sites that are updated on the node scales like L^3, right? However,
order L^2 (exact prefactor depends on the lattice geometry) of these
sites have to get their "nearest neighbor" data from another node over
the network instead of from main memory.
The problem is that retrieving some numbers from main memory and doing
some simple numerical operations on those numbers and putting them back
into main memory is fast -- on a cubic lattice we can guestimate order
of tens of nanoseconds for the retrieve and order tens more for the
arithmetic, unless you use e.g. trancendental functions or the like in
the core update. Call it (perhaps, this is a very rough estimate) 100
nanoseconds total per site update, or ten million site updates per
second -- in principle one could update 100x100x100 cube several times a
However, retrieving the numbers over the network is much slower. If one
gets them poorly (send/receive them one at a time) it might well take 50
MICROseconds per number. If one gets the efficiently in a single send
block, then our assumed 100x100x100 lattice needs roughly 6x10^4 ~ 10^5
double precision numbers, or perhaps half a megabyte of data per site
update sweep. Assuming truly optimal transmission and 100 BT and a
clever pattern of sending and receiving, and allowing for overhead in
the message passing process, I'd guess that this would take between 0.05
and 0.1 seconds all by itself. Note that JUST GETTING THE SURFACE DATA
takes a significant fraction of the time spent in doing the volume
As you can see, there are a variety of scaling laws implicit in this
very crude analysis. If one has to go BEYOND nearest neighbors to long
range communications (so to update a point one has to talk to ALL the
nodes and get ALL the other points) then the communications scaling can
be horrible (or not, depending on the actual update algorithm).
In nearest neighbor lattice decomposition, surface becomes increasingly
irrelevant compared to volume as L increases, which means that running
big lattices actually decreases the fraction of time spent communicating
and gives you better scaling. This is actually true in many problems --
problem size is an important parameter when estimating scaling, as
important as the number of nodes. There are further nonlinear effects
to consider as well -- local performance gets a big boost when the
lattice blocks all fit into cache, or when they are organized to be
accessed efficiently in a vector fashion (e.g. in a checkerboard
algorithm for local site updates rather than sequential, especially in 3
dimension). Random access of memory can be an order of magnitude slower
than vector sequential access on many architectures as it effectively
defeats the cache altogether.
In summary, studying your problem and its parallel scaling is a very
good thing, but you need to kick the mathematics up a notch -- don't
just look up formulae and try to fit them, instead learn how to derive
them and understand them and then try to see how they might apply to
your actual numerical code.
Some judicious timing loops inserted in your code (if you are really
STUDYING this problem and want complete understanding) can help you
measure just how long the program takes doing this or that, where this
should probably be "execute the core loop" and that should probably be
"communicate with other nodes". Get these numbers and observe how they
scale with problem size PER node as well as with partitioning and the
number of nodes, and you'll be a good part of the way there.
You might also want to check out Ian Foster's online book -- google it
up or find a reference link on the brahma site. It teaches you a lot
about parallel programming, which is naturally intimately tied to
parallel scaling and cluster design.
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
More information about the Beowulf