Robert G. Brown
rgb at phy.duke.edu
Wed Jan 17 13:19:01 PST 2001
On Wed, 17 Jan 2001 Randy_Howard at Dell.com wrote:
> Well, my intent was not to establish specific numbers but rather to
> get an idea of "bang for the buck" factors with various hardware
> configurations. For example, I wonder if there is a way of
> predicting up front for a given application whether or not 10/100
> ethernet would be sufficient and not become the primary bottleneck.
> I understand it is a very complex problem and this may not even
> be possible.
Oh, it's possible all right but it isn't easy. As you say, it's
(fundamentally!) a complex problem, so you have to learn to understand
and manage the complexity. A general methodology might be outlined as follows:
a) Study the application to be parallelized. Separate it mentally to
the extent possible into serial (must be done on just one computer one
instruction after another) and (in principle) parallelizable parts.
b) Profile the original application, and note how long is being spent
doing serial work and the work that might be parallelizable. Unless
the time spent doing the latter (vastly) exceeds the former, you should probably stop here.
c) Consider the work being done in the parallelizable parts, and
consult books on parallel algorithms and parallel libraries to get an
idea of what the scaling properties of the parallel components are
likely to be. In many cases you may have to invent a parallel algorithm
and figure everything out yourself, but a lot of things are already in
the literature and you don't want to reinvent wheels. Especially not
d) Assemble (if possible) a small prototyping cluster. This probably
doesn't need to be a "real beowulf" -- a handful of workstations on a
common switched net would likely suffice. It would be ideal if one or
more of those systems were "close" to the architecture being considered,
e.g. Intels if Intel, Alphas if Alpha. Comparisons are also useful,
though (one or more of each).
e) Run benchmarks to determine the raw speed of your network, your CPU
platform(s), their memory and other key components. Rerun your task
profiles on the different architectures and at different scales to get a
decent idea of what bounds its execution. That is (for example) does
the application cache well, or is it memory bus bound? If it caches
well, why buy bleeding edge memory based systems? If it is memory bus
bound, why spend money on CPU clock when you should be buying faster memory?
f) Parallelize the application in what you expect (based on all this)
to be the best way. Run test timings on your mini-cluster. Use
equations to hopefully extrapolate parallel speedup (scaling) to a large
beowulf with any given/candidate hardware or network architecture.
g) If warranted by all this design and prototyping (you get good
numbers and you've got the money) then build the "best" architecture.
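As a rough illustration of steps b) and f), here is a minimal sketch of the "equations" one might use to extrapolate from a mini-cluster. It assumes a model that is not in the original post: run time on n nodes is T(n) = Ts + Tp/n + c*(n - 1), where Ts and Tp are the serial and parallelizable times from profiling and c is a per-node communication overhead fitted to the prototype timings. The function names and the sample timings are hypothetical, and a real application may need a different overhead term (logarithmic, quadratic, and so on).

```python
def fit_overhead(t_serial, t_parallel, timings):
    """Least-squares fit of c in T(n) = t_serial + t_parallel/n + c*(n - 1).

    timings maps node count -> measured wall-clock time on the prototype.
    The linear overhead term is an assumption, not a law of nature.
    """
    num = 0.0
    den = 0.0
    for n, t_measured in timings.items():
        x = n - 1
        residual = t_measured - t_serial - t_parallel / n
        num += residual * x
        den += x * x
    if den == 0:
        raise ValueError("need at least one timing with more than one node")
    return num / den


def predicted_time(t_serial, t_parallel, c, n):
    """Predicted run time on n nodes under the assumed model."""
    return t_serial + t_parallel / n + c * (n - 1)


def speedup(t_serial, t_parallel, c, n):
    """Predicted speedup relative to the single-node run."""
    return (t_serial + t_parallel) / predicted_time(t_serial, t_parallel, c, n)


if __name__ == "__main__":
    # Hypothetical numbers: 10 s serial, 990 s parallelizable,
    # and three timings taken on a 2, 4 and 8 node prototype.
    ts, tp = 10.0, 990.0
    prototype = {2: 506.0, 4: 260.5, 8: 140.75}
    c = fit_overhead(ts, tp, prototype)
    for n in (8, 16, 32, 64, 128):
        print(n, round(speedup(ts, tp, c, n), 2))
```

Under this model the speedup peaks and then falls as the overhead term takes over, which is one reason to prototype before buying a large cluster.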
Note that this is a very sloppy scheme. That is because what you are
doing (at heart) is trying to co-optimize a parallel implementation of
the task, an architecture to run it on, and your pocketbook when the
benefit function is highly nonlinear (in fact, has numerous places where
there are jump discontinuities) in all related degrees of freedom. This
sort of thing cannot be done by recipe; it is art.
Note also that this isn't as pessimistic a scheme as all that (if this
seems daunting). For a lot of problems you'll either immediately reject
them as unsuitable for parallelization at all on any architecture (the
upper bound speedup given by Amdahl's Law is just not worth the hassle) or
you'll find that they have an "obvious" parallel decomposition and a
very predictable parallel speedup for a given general architecture.
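For reference, the Amdahl's Law bound mentioned above is easy to compute once profiling has given you the serial fraction. A minimal sketch (function names are my own): with serial fraction s, the ideal speedup on n nodes is 1/(s + (1 - s)/n), which can never exceed 1/s no matter how many nodes you buy. It ignores communication costs entirely, so real speedups will be lower.

```python
def amdahl_speedup(serial_fraction, n_nodes):
    """Ideal speedup on n_nodes for a task with the given serial fraction."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_nodes)


def amdahl_limit(serial_fraction):
    """Upper bound on speedup as the node count goes to infinity."""
    if serial_fraction == 0:
        return float("inf")
    return 1.0 / serial_fraction


if __name__ == "__main__":
    # A task that is 10% serial can never run more than 10x faster.
    for n in (2, 8, 64, 1024):
        print(n, round(amdahl_speedup(0.10, n), 2))
    print("limit", amdahl_limit(0.10))
```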
However, there also will exist problems that can be parallelized
efficiently, but only if you rewrite the entire application (possibly
even inverting part of the logic so it is quite non-intuitive) and then
only when matched with a particular, possibly expensive, architecture.
There are lots of problems, for example, that will scale well with
Myrinet that will scale poorly with only fast ethernet; others where
10Base ethernet is more than adequate. For others CPU clock and cache
size will matter. For some memory speed will matter. Finally, you
aren't really trying to optimize "parallel speedup" -- you just want to
get the most work done for the least money, which may well mean that a
slow, pokey (but Cheap!) architecture may win out over a technically
much more advanced and favorable (but Expensive!) architecture.
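On the original question of whether a given network will be the bottleneck, a crude first estimate is to compare compute time per step against communication time per step. The sketch below assumes a simple latency-plus-bandwidth cost per message, with no overlap of computation and communication; the function names are hypothetical. The bandwidths are the nominal 10 and 100 Mbit/s figures, while the latencies and the workload are placeholders that you should replace with numbers measured on your own hardware.

```python
def comm_time(n_messages, bytes_per_message, latency_s, bandwidth_bytes_per_s):
    """Time to send n_messages, each costing latency plus size/bandwidth."""
    return n_messages * (latency_s + bytes_per_message / bandwidth_bytes_per_s)


def comm_fraction(compute_s, n_messages, bytes_per_message,
                  latency_s, bandwidth_bytes_per_s):
    """Fraction of each step spent communicating (0 = none, 1 = all)."""
    t_comm = comm_time(n_messages, bytes_per_message,
                       latency_s, bandwidth_bytes_per_s)
    return t_comm / (compute_s + t_comm)


if __name__ == "__main__":
    # Nominal bandwidths in bytes/s; latencies are PLACEHOLDERS, not measurements.
    networks = {
        "10Base ethernet": (200e-6, 1.25e6),
        "100Base ethernet": (100e-6, 12.5e6),
    }
    # Hypothetical workload: 0.5 s of compute per step, then 20 messages of 64 KB.
    for name, (latency, bandwidth) in networks.items():
        frac = comm_fraction(0.5, 20, 64 * 1024, latency, bandwidth)
        print(name, "communication fraction per step:", round(frac, 3))
```

If the communication fraction is small on the cheap network, the expensive one buys you little; if it dominates, the network is the place to spend money.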
For a lot of programs Alphas (for example) are much faster CPU/memory
architectures, but they also tend to cost a lot more. Is it better to
buy one Alpha node at $6K or 10 e.g. Duron nodes at $600/each? The only
possible answer is "it depends". You'll almost certainly buy "more" raw
floating point cycles with 10 Durons, but there are plenty of problems
where you'll get much less work done. Ditto issues like 10Base,
100Base, 1000Base ethernet vs Myrinet -- costs differ, latencies and
bandwidths GREATLY differ, bus interfaces and driver efficiencies differ
-- and then there are the specific needs of your parallel application
using THIS algorithm (given these speeds and latencies) versus THAT on
THIS CPU and memory architecture for THIS dollar cost per node.
Complex, to be sure, but you get the idea...;-)
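To make "it depends" concrete, here is a small sketch comparing the two $6K options above. The prices come from the example; the assumption that one Alpha node does three times the work of one Duron node is purely hypothetical, as are the function names, and communication costs are ignored (the cluster is scaled by Amdahl's Law only).

```python
def throughput(n_nodes, node_speed, serial_fraction):
    """Work rate of n_nodes, each of relative speed node_speed,
    with parallel scaling limited by Amdahl's Law."""
    return node_speed / (serial_fraction + (1.0 - serial_fraction) / n_nodes)


def work_per_dollar(n_nodes, node_speed, serial_fraction, price_per_node):
    """Work rate divided by total hardware cost."""
    return throughput(n_nodes, node_speed, serial_fraction) / (n_nodes * price_per_node)


if __name__ == "__main__":
    ALPHA_SPEED = 3.0   # HYPOTHETICAL: one Alpha node = 3 Duron nodes
    DURON_SPEED = 1.0
    for s in (0.01, 0.10, 0.50):
        alpha = work_per_dollar(1, ALPHA_SPEED, s, 6000.0)
        durons = work_per_dollar(10, DURON_SPEED, s, 600.0)
        winner = "10 Durons" if durons > alpha else "1 Alpha"
        print("serial fraction", s, "->", winner)
```

With a nearly fully parallel task the ten cheap nodes win; as the serial fraction grows, the single fast node takes over, even though both options cost the same.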
So apply the recipe above with love and discretion, and keep your goal
clearly in sight. Remember, you don't have to be perfect first try,
just good enough. Also remember that one of your many cost/benefit
choices is to NOT do the above design loop yourself. There exist a
number of companies (e.g. Paralogics, HPTi) whose whole raison d'etre is
to do it for you. In many cases they will pay for themselves many times
over by doing a far better job than you would by yourself (especially
for a really complex task). In others the tasks are "simple" and you
should go it alone. Only you can decide...
BTW, there are some useful tools and short papers and observations
apropos to all this on brahma, www.phy.duke.edu/brahma.
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu