siegert at sfu.ca
Wed Jan 17 15:29:23 PST 2001
On Wed, 17 Jan 2001 Randy_Howard at Dell.com wrote:
>> Well, my intent was not to establish specific numbers but rather to
>> get an idea of "bang for the buck" factors with various hardware
>> configurations. For example, I wonder if there is a way of
>> predicting up front for a given application whether or not 10/100
>> ethernet would be sufficient and not become the primary bottleneck.
>> I understand it is a very complex problem and this may not even
>> be possible.
On Wed, Jan 17, 2001 at 04:19:01PM -0500, Robert G. Brown wrote:
> Oh, it's possible all right but it isn't easy. As you say, it's
> (fundamentally!) a complex problem, so you have to learn to understand
> and manage the complexity. A general methodology might be outlined
> something like:
<snip very valid, but necessarily lengthy procedure>
Hmm. This is all very valid and correct, but unfortunately quite
overwhelming, particularly if you want to build your first
cluster. I'm wondering whether it wouldn't be possible
to establish a database of cluster benchmarks that could provide
hints (these won't be more than hints, but nevertheless these could
Here is the idea: There should be benchmarks and speedup data for different
types of cluster applications:
1. Embarrassingly parallel (e.g., Monte-Carlo simulations).
In this case the benchmark will be dominated by the CPUs and the
interconnect is unimportant; the speedup curve will show linear
scaling for an (almost) unlimited number of processors.
2. Applications with "nearest neighbour" communications
(e.g., finite-difference methods for PDEs). In this case there is
significant communication between processors. However, since the
communication is local (i.e., processor n only talks with n+1 and
n-1), the scaling of the communication time with the # of processors
is not so bad (constant + probably a small linear piece).
In this case you should see a maximum in the speedup curve, the location
of which depends on your interconnect.
3. Applications with pairwise (all-to-all) communications
(e.g., parallel FFT). In this case the time for communication scales
in proportion to the square of the # of processors. The benchmark will
be dominated by the speed of the interconnect, i.e., the speedup curve
will show minimal speedups (or even speedups < 1) for fast ethernet.
There may be a few more cases (but probably not many more).
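
To make the three cases concrete, here is a rough sketch of a toy model
in Python. It assumes the run time on p processors is T(p) = T_comp/p +
T_comm(p), with the communication terms taken from the scaling described
above. All constants (T_COMP and the coefficients in the three comm
functions) are made-up illustration values, not measurements from any
real cluster or interconnect.

```python
T_COMP = 100.0  # hypothetical single-processor run time (seconds)


def speedup(p, t_comp, comm):
    """Speedup S(p) = T(1) / T(p), assuming T(p) = t_comp / p + comm(p)."""
    if p == 1:
        return 1.0  # a single processor does not communicate
    return t_comp / (t_comp / p + comm(p))


def no_comm(p):
    """Case 1, embarrassingly parallel: no communication cost."""
    return 0.0


def nearest_neighbour(p):
    """Case 2: constant plus a small linear piece (assumed coefficients)."""
    return 0.5 + 0.01 * p


def all_to_all(p):
    """Case 3: grows with the square of p (assumed coefficient,
    chosen large to mimic a slow interconnect)."""
    return 20.0 * p * p


def best_processor_count(comm, max_p=256):
    """Processor count in 1..max_p that gives the largest modelled speedup."""
    return max(range(1, max_p + 1), key=lambda p: speedup(p, T_COMP, comm))


if __name__ == "__main__":
    for name, comm in [("embarrassingly parallel", no_comm),
                       ("nearest neighbour", nearest_neighbour),
                       ("all-to-all", all_to_all)]:
        best = best_processor_count(comm)
        print(f"{name}: best p = {best}, "
              f"speedup = {speedup(best, T_COMP, comm):.2f}")
```

With these assumed numbers, case 1 keeps scaling up to the largest p
tried, case 2 peaks at an intermediate p, and case 3 is already slower
on two processors than on one. Real benchmark data would replace the
three comm functions.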
A real application will be a mixture of these three scenarios. But if you
know how, e.g., a PIII/800MHz cluster with fast ethernet scales in these
cases, you at least have some hints as to how your own application may
scale on certain architectures.
Sure, there are complications: The results depend on the MPI distribution
used: e.g., lam works best when small latencies are required, mpipro is
good when high throughput is required, etc.
But nevertheless, I'm sure something like this would have helped me when
I set up my first cluster.
Academic Computing Services phone: (604) 291-4691
Simon Fraser University fax: (604) 291-4242
Burnaby, British Columbia email: siegert at sfu.ca
Canada V5A 1S6