[Beowulf] How does one calculate the scalability of one's code?
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
Robert G. Brown rgb at phy.duke.eduThu Jun 10 08:23:57 PDT 2004
- Previous message: [Beowulf] How does one calculate the scalability of one's code?
- Next message: [Beowulf] Re: Redmond is at it, again
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Thu, 10 Jun 2004 daniel.kidger at quadrics.com wrote: > Chris, > > > To make the number more clear > > Run1 Specs > > Grid Size: 100 x 100 x 100 > > Nodes: 4 > > Time: 605 seconds > > > > Run2 Specs > > Grid Size: 200 x 200 x 200 > > Nodes: 32 > > Time: 944 seconds > > > > So Run2 took ~ 1.6 times as long as Run1 > > You must also understand what your code is doing. If you code iterates to a converged solution (say using PCG) then the number of iterations rises as the number of freedoms increases. So a 1.6x speedup might be perfect scaling still. > > Also look at load balancing - how are the 200^3 grid cells distributed over the 32 Nodes. > You may well find that the workload given to each cpu is not equal. > > For the record - what compute nodes are you using - how many cpus does each have. And are you using a high performance interconnect ?- if so it should be able to give you statistics of average message size, number of bytes transferred for each program run etc. This would be interesting. ...and to continue Greg's observation with some direction -- What you probably want to do is not try to fit Amdahl's law per se, but variants that are fairly easily derived from the same sort of analysis that include at least the time required for communications between nodes if not something a bit more sophisticated as might be required by your communications model and problem partitioning. A web reference: http://www.phy.duke.edu/resources/computing/brahma/Resources/beowulf_book.php has a whole chapter devoted to Amdahl's law and generalizations, including some little graphs to help you understand why one doesn't generally see linear scaling in "real" (as opposed to embarrassingly) parallel problems. Remarks apropos of your particular problem: If you are doing things with Maxwell's equations (what, I don't know) you are likely solving things like Laplace or Helmholtz equations on a grid (with ODE/PDE solvers). If the problem is a lattice decomposition of a spatial volume and you use a local method (one where each lattice point has a value that is updated only on the basis of its nearest neighbors) and each node contains a subvolume of characteristic length L, then the number of sites that are updated on the node scales like L^3, right? However, order L^2 (exact prefactor depends on the lattice geometry) of these sites have to get their "nearest neighbor" data from another node over the network instead of from main memory. The problem is that retrieving some numbers from main memory and doing some simple numerical operations on those numbers and putting them back into main memory is fast -- on a cubic lattice we can guestimate order of tens of nanoseconds for the retrieve and order tens more for the arithmetic, unless you use e.g. trancendental functions or the like in the core update. Call it (perhaps, this is a very rough estimate) 100 nanoseconds total per site update, or ten million site updates per second -- in principle one could update 100x100x100 cube several times a second. However, retrieving the numbers over the network is much slower. If one gets them poorly (send/receive them one at a time) it might well take 50 MICROseconds per number. If one gets the efficiently in a single send block, then our assumed 100x100x100 lattice needs roughly 6x10^4 ~ 10^5 double precision numbers, or perhaps half a megabyte of data per site update sweep. Assuming truly optimal transmission and 100 BT and a clever pattern of sending and receiving, and allowing for overhead in the message passing process, I'd guess that this would take between 0.05 and 0.1 seconds all by itself. Note that JUST GETTING THE SURFACE DATA takes a significant fraction of the time spent in doing the volume computation! As you can see, there are a variety of scaling laws implicit in this very crude analysis. If one has to go BEYOND nearest neighbors to long range communications (so to update a point one has to talk to ALL the nodes and get ALL the other points) then the communications scaling can be horrible (or not, depending on the actual update algorithm). In nearest neighbor lattice decomposition, surface becomes increasingly irrelevant compared to volume as L increases, which means that running big lattices actually decreases the fraction of time spent communicating and gives you better scaling. This is actually true in many problems -- problem size is an important parameter when estimating scaling, as important as the number of nodes. There are further nonlinear effects to consider as well -- local performance gets a big boost when the lattice blocks all fit into cache, or when they are organized to be accessed efficiently in a vector fashion (e.g. in a checkerboard algorithm for local site updates rather than sequential, especially in 3 dimension). Random access of memory can be an order of magnitude slower than vector sequential access on many architectures as it effectively defeats the cache altogether. In summary, studying your problem and its parallel scaling is a very good thing, but you need to kick the mathematics up a notch -- don't just look up formulae and try to fit them, instead learn how to derive them and understand them and then try to see how they might apply to your actual numerical code. Some judicious timing loops inserted in your code (if you are really STUDYING this problem and want complete understanding) can help you measure just how long the program takes doing this or that, where this should probably be "execute the core loop" and that should probably be "communicate with other nodes". Get these numbers and observe how they scale with problem size PER node as well as with partitioning and the number of nodes, and you'll be a good part of the way there. You might also want to check out Ian Foster's online book -- google it up or find a reference link on the brahma site. It teaches you a lot about parallel programming, which is naturally intimately tied to parallel scaling and cluster design. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
- Previous message: [Beowulf] How does one calculate the scalability of one's code?
- Next message: [Beowulf] Re: Redmond is at it, again
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
