MFLOPS expected

Robert G. Brown rgb at phy.duke.edu
Wed Feb 28 17:16:40 PST 2001


On Wed, 28 Feb 2001 jmggarcia at infosel.net.mx wrote:

> How many MFLOPS can be expected from a Beowulf cluster with something like 16 Pentium III at 1GHz ?
> In my little four nodes cluster prototype, Blas benchmark give me a peak performance of 14 MFlops, isn't too low ?

Ah, the eternal question.  If one car can go 160 kph, how fast will my
garage of four cars go?  First of all, you should ask, "What's a MFLOP
and what does it have to do with my application?"

If your goal (with cars) is to drive to Detroit with one person doing
the driving, it can take you a lot longer to use four cars than just one
(presuming that you MUST get yourself and all four cars to Detroit).  On
the other hand, if your goal is to get your family and your friends to
Detroit (say, 24 people total) it will take a lot less time with four
cars than with one.

So it is with parallel computing.

Thus the essential meaninglessness of your question.  One possible
answer is "N times the number of MFLOPS produced by a single node in
your cluster" (using any of the manifold definitions of a "MFLOP" that
you care two -- they all mean equally much, or little, with respect you
your particular application").  Another answer might be (literally) "1/2
the number of MFLOPS produced by a single node in the cluster".  Or even
worse.  Or (semi-paradoxically, due to possible but unlikely superlinear
speedup in the partitioned job) even better.  To further complicate the
issue, your job might be integer instruction dominated.  It might have
lots of transcendental calls.  It might have short range or long range
communications with a variety of patterns and granularities.  It might
even be dominated by simple linear transforms (so that a BLAS or Linpack
benchmark isn't totally irrelevant).  Even if it IS dominated by linear
transforms, are you using ATLAS?  If not, your BLAS speed (and hence
your "MFLOPS" rating) can vary by a factor of 2-3 quite easily when it
is tuned to your cache and memory.

Clearly nobody on the list can answer your question, and the answer
isn't useful anyway.  So why bother?

There ARE, however, useful questions -- and answers.  Ask instead "How
long does it take MY APPLICATION (which I'm sure you care about much
more than a BLAS or Linpack or cpu-rate benchmark anyway) to complete on
a single node?"  You can measure this easily.  Now ask "How long does it
take to complete in parallel running on two, three, or four of my
nodes?"
You can measure this easily.
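Those two measurements are all you need to compute speedup and parallel
efficiency.  A minimal sketch (the times below are hypothetical
placeholders -- substitute whatever you actually measure on your own
cluster):

```python
# Compute speedup and parallel efficiency from measured wall-clock
# completion times of YOUR application.  The times are hypothetical
# placeholders, not real measurements.

def speedup(t1, tn):
    """Speedup of an n-node run relative to the single-node run."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Parallel efficiency: speedup divided by node count
    (1.0 means perfect linear scaling)."""
    return speedup(t1, tn) / n

# Hypothetical: seconds to complete on 1, 2, 3, 4 nodes.
times = {1: 1000.0, 2: 520.0, 3: 360.0, 4: 280.0}

for n, tn in sorted(times.items()):
    print(f"{n} node(s): speedup {speedup(times[1], tn):.2f}, "
          f"efficiency {efficiency(times[1], tn, n) * 100:.0f}%")
```

Efficiency near 100% on four nodes suggests the job will keep scaling;
efficiency near 25% means four nodes are doing the work of one.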

If the time for four nodes is roughly 1/4 the time for one node, life is
good, your problem has parallelized well (out to at least four nodes,
don't get carried away) and you can reasonably hope to parallelize it
out to more.  If it runs at 1/2 the speed on four nodes that it does on
one (and it might! seriously!) don't be discouraged -- instead figure
out why and see if one of the simple remedies gives you better scaling.

To understand the scaling of completion times (or "effective speed") in
a cluster environment, read almost any of the papers or the online draft
book on brahma:

  http://www.phy.duke.edu/brahma

and look around the "Amdahl's Law" chapter and immediately following,
especially at the figures and equations used to estimate speedup.
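The flavor of those estimates can be sketched with the textbook form of
Amdahl's law (an assumption on my part -- the models in the book are
richer, adding communication and setup terms), where a fraction p of the
work parallelizes and (1 - p) stays serial:

```python
def amdahl_speedup(p, n):
    """Textbook Amdahl's law: speedup on n nodes when a fraction p of
    the work parallelizes perfectly and (1 - p) remains serial."""
    return 1.0 / ((1.0 - p) + p / n)

# Even a modest 5% serial fraction caps the achievable speedup:
for n in (4, 16, 64):
    print(f"{n:3d} nodes: speedup {amdahl_speedup(0.95, n):.2f}")
# As n grows without bound, the speedup saturates at 1/(1 - p),
# i.e. 20 in this example, no matter how many nodes you buy.
```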

Now that that is all said, ONE place to look at ONE measure of
"bogomflops" is on the aforementioned brahma site, where a "cpu-rate"
benchmark I've been working on is discussed and some results presented
with it.  [Note to listvolk: I finally updated the figures and the
source links to the STILL BETA version of the fixed benchmark.  And I
included a legend!;-)]

I don't know how reliable this microbenchmark is -- I just fixed (I
think, pray, hope) a pretty horrendous bug -- but it is still a possibly
useful (or no more useless than the rest:-) measure.  One thing that it
does do that is very useful (I think, presuming that it is at least
moderately accurate at actually measuring what it purports to measure)
is show the tremendous variation in "CPU speed" by any measure you like
as the size of a calculation is varied.  For example, one could claim
speeds of "100 MFLOPS" for a PIII running in L1 cache and only "55
MFLOPS" when running out of main memory.  Factors of 2-3 difference
in speed are
commonplace.
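That variation is easy to reproduce for yourself, at least
qualitatively.  A crude pure-Python sketch (NOT the cpu-rate benchmark
-- interpreter overhead will blunt the cache effect badly compared to
compiled code, so treat the numbers as illustrative only):

```python
import time

def apparent_mflops(n, reps=5):
    """Time reps passes of two flops per element over a length-n
    vector and return the apparent rate in MFLOPS.  Small n fits in
    cache; large n streams from main memory."""
    a = [1.000001] * n
    t0 = time.perf_counter()
    for _ in range(reps):
        a = [2.0 * x + 1.0 for x in a]   # one multiply, one add
    elapsed = time.perf_counter() - t0
    return (2.0 * n * reps) / elapsed / 1e6

for size in (1_000, 100_000, 1_000_000):
    print(f"{size:>9} elements: {apparent_mflops(size):8.1f} MFLOPS")
```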

Even larger factors can occur when comparing the speed of particular
floating point operations.  Addition and multiplication tend to be
"fast".  Division tends to be "slow".  Then there are transcendental
function rates, which depend both on the CPU and the library.  Finally,
integer performance can be important.
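The same kind of crude timing shows the per-operation differences,
though once again pure Python's interpreter overhead swamps much of the
hardware gap that C or Fortran would reveal -- this sketches the
measurement, not the magnitudes:

```python
import math
import time

def rate(op, n=100_000):
    """Apparent operations per second for a unary float operation,
    iterated so each result feeds the next call."""
    x = 1.000001
    t0 = time.perf_counter()
    for _ in range(n):
        x = op(x)
    return n / (time.perf_counter() - t0)

mul_rate = rate(lambda x: x * 0.999999)   # multiplication: "fast"
div_rate = rate(lambda x: x / 1.000001)   # division: "slow" in hardware
sin_rate = rate(math.sin)                 # transcendental: CPU + libm

for name, r in (("mul", mul_rate), ("div", div_rate), ("sin", sin_rate)):
    print(f"{name}: {r / 1e6:.2f} Mops/sec")
```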

To conclude, one last warning.  I really meant it above -- the ONLY
benchmark that matters is your application.  Citing benchmarks outside
of that is only necessary when a) writing a grant proposal (Alas,
Babylon!); or b) trying to engineer a cluster when one CANNOT run one's
application as a benchmark on trial node hardware.  You've already got a
small cluster and your application in hand, which is all you need to
start work on efficient design and parallelization.  Forget BOGOMFLOPS.
My own benchmark, which I thoroughly understand and is quite simple,
isn't a good predictor of the performance of even my own favorite (much
more complex) application...

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
