Cluster benchmark(s)?
Robert G. Brown rgb at phy.duke.edu
Wed Jan 17 13:19:01 PST 2001
On Wed, 17 Jan 2001 Randy_Howard at Dell.com wrote:

>
> Well, my intent was not to establish specific numbers but rather to
> get an idea of "bang for the buck" factors with various hardware
> configurations. For example, I wonder if there is a way of
> predicting up front for a given application whether or not 10/100
> ethernet would be sufficient and not become the primary bottleneck.
> I understand it is a very complex problem and this may not even
> be possible.

Oh, it's possible all right but it isn't easy. As you say, it's (fundamentally!) a complex problem, so you have to learn to understand and manage the complexity. A general methodology might be outlined something like:

a) Study the application to be parallelized. Separate it mentally to the extent possible into serial (must be done on just one computer one instruction after another) and (in principle) parallelizable parts.

b) Profile the original application, and note how long is being spent doing serial work and the work that might be parallelizable. Unless the time spent doing the latter (vastly) exceeds the former, you should probably quit.

c) Consider the work being done in the parallelizable parts, and consult books on parallel algorithms and parallel libraries to get an idea of what the scaling properties of the parallel components are likely to be. In many cases you may have to invent a parallel algorithm and figure everything out yourself, but a lot of things are already in the literature and you don't want to reinvent wheels. Especially not badly.

d) Assemble (if possible) a small prototyping cluster. This probably doesn't need to be a "real beowulf" -- a handful of workstations on a common switched net would likely suffice. It would be ideal if one or more of those systems were "close" to the architecture being considered, e.g. Intels if Intel, Alphas if Alpha. Comparisons are also useful, though (one or more of each).
e) Run benchmarks to determine the raw speed of your network, your CPU platform(s), their memory and other key components. Rerun your task profiles on the different architectures and at different scales to get a decent idea of what bounds its execution. That is (for example) does the application cache well, or is it memory bus bound? If it caches well, why buy bleeding edge memory based systems? If it is memory bus bound, why spend money on CPU clock when you should be buying faster memory?

f) Parallelize the application in what you expect (based on all this) to be the best way. Run test timings on your mini-cluster. Use equations to hopefully extrapolate parallel speedup (scaling) to a large beowulf with any given/candidate hardware or network architecture.

g) If warranted by all this design and prototyping (you get good numbers and you've got the money) then build the "best" architecture. Enjoy.

Note that this is a very sloppy scheme. That is because what you are doing (at heart) is trying to co-optimize a parallel implementation of the task, an architecture to run it on, and your pocketbook when the benefit function is highly nonlinear (in fact, has numerous places where there are jump discontinuities) in all related degrees of freedom. This sort of thing cannot be done by recipe; it is art.

Note also that this isn't as pessimistic a scheme as all that (if this seems daunting). For a lot of problems you'll either immediately reject them as unsuitable for parallelization at all on any architecture (the upper bound speedup given by Amdahl's Law just not worth the hassle) or you'll find that they have an "obvious" parallel decomposition and a very predictable parallel speedup for a given general architecture.
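
As a rough illustration of the Amdahl's Law bound and of the extrapolation in step f), here is a minimal Python sketch. The serial and parallelizable times stand in for the profile numbers from step b). The function names and the linear communication term `comm_per_node` are assumptions made for illustration only; real communication costs depend on the algorithm and the network and have to be measured on the prototype cluster.

```python
def amdahl_speedup(serial_time, parallel_time, nodes):
    """Upper-bound speedup from Amdahl's Law (no communication cost at all)."""
    total = serial_time + parallel_time
    return total / (serial_time + parallel_time / nodes)


def speedup_with_comm(serial_time, parallel_time, nodes, comm_per_node):
    """Same estimate plus an ASSUMED communication cost that grows
    linearly with the number of nodes (a toy model, not a measurement)."""
    total = serial_time + parallel_time
    return total / (serial_time + parallel_time / nodes + comm_per_node * nodes)


if __name__ == "__main__":
    # Hypothetical profile: 10 s of serial work, 90 s of parallelizable work.
    serial, parallel = 10.0, 90.0
    for n in (1, 2, 4, 9, 16, 32, 64):
        ideal = amdahl_speedup(serial, parallel, n)
        # 1.0 s of communication per node is a made-up figure.
        real = speedup_with_comm(serial, parallel, n, comm_per_node=1.0)
        print(f"{n:3d} nodes: ideal {ideal:5.2f}x, with comm {real:5.2f}x")
```

With these made-up numbers the ideal speedup can never exceed 10 (total time divided by serial time), and once the communication term is switched on the speedup rises, peaks, and then falls as nodes are added. That is the sort of behavior the prototype timings are meant to reveal before any money is spent.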
However, there also will exist problems that can be parallelized efficiently, but only if you rewrite the entire application (possibly even inverting part of the logic so it is quite non-intuitive) and then only when matched with a particular, possibly expensive, architecture. There are lots of problems, for example, that will scale well with Myrinet that will scale poorly with only fast ethernet; others where 10Base ethernet is more than adequate. For others CPU clock and cache size will matter. For some memory speed will matter.

Finally, you aren't really trying to optimize "parallel speedup" -- you just want to get the most work done for the least money, which may well mean that a slow, pokey (but Cheap!) architecture may win out over a technically much more advanced and favorable (but Expensive!) architecture. For a lot of programs Alphas (for example) are much faster CPU/memory architectures, but they also tend to cost a lot more. Is it better to buy one Alpha node at $6K or 10 e.g. Duron nodes at $600/each? The only possible answer is "it depends". You'll almost certainly buy "more" raw floating point cycles with 10 Durons, but there are plenty of problems where you'll get much less work done.

Ditto issues like 10Base, 100Base, 1000Base ethernet vs Myrinet -- costs differ, latencies and bandwidths GREATLY differ, bus interfaces and driver efficiencies differ -- and then there are the specific needs of your parallel application using THIS algorithm (given these speeds and latencies) versus THAT on THIS CPU and memory architecture for THIS dollar cost per node. Complex, to be sure, but you get the idea...;-)

So apply the recipe above with love and discretion, and keep your goal clearly in sight. Remember, you don't have to be perfect first try, just good enough.

Also remember that one of your many cost/benefit choices is to NOT do the above design loop yourself. There exist a number of companies (e.g. Paralogics, HPTi) whose whole raison d'etre is to do it for you. In many cases they will pay for themselves many times over by doing a far better job than you would by yourself (especially for a really complex task). In others the tasks are "simple" and you should go it alone. Only you can decide...

   rgb

BTW, there are some useful tools and short papers and observations apropos to all this on brahma, www.phy.duke.edu/brahma.

   rgb

--
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
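
The Alpha versus Duron question in the message above can be put into numbers with an Amdahl-style estimate. The sketch below is only an illustration: the prices ($6K for one Alpha, $600 for each of 10 Durons) come from the post, but the relative Alpha speed of 3.0, the two parallel fractions, and the function name are made-up assumptions, and communication cost is ignored entirely.

```python
def cluster_throughput(node_speed, nodes, parallel_fraction):
    """Work done per unit time, relative to one node of speed 1.0.

    Uses the Amdahl's Law speedup for the given parallel fraction and
    ignores communication cost (an optimistic simplification).
    """
    serial_fraction = 1.0 - parallel_fraction
    speedup = 1.0 / (serial_fraction + parallel_fraction / nodes)
    return node_speed * speedup


if __name__ == "__main__":
    # Prices from the post; both configurations cost $6000 in total.
    alpha_cost, alpha_nodes = 6000, 1
    duron_cost, duron_nodes = 600, 10
    assert alpha_cost * alpha_nodes == duron_cost * duron_nodes

    # ASSUMPTION: one Alpha runs this code 3x faster than one Duron.
    alpha_speed, duron_speed = 3.0, 1.0

    for p in (0.99, 0.50):  # hypothetical parallel fractions
        alpha = cluster_throughput(alpha_speed, alpha_nodes, p)
        durons = cluster_throughput(duron_speed, duron_nodes, p)
        winner = "10 Durons" if durons > alpha else "1 Alpha"
        print(f"parallel fraction {p:.2f}: Alpha {alpha:.2f}, "
              f"Durons {durons:.2f} -> {winner}")
```

Under these assumptions the ten Durons win when almost all of the work parallelizes, and the single Alpha wins when half of the work is serial, for the same money. Which case applies to a real application is exactly what the profiling and prototyping steps are for.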
