How many Gflops?

Fri May 11 14:07:32 PDT 2001

On Fri, 11 May 2001, Rob Simac wrote:

> I would like to find out if anyone knows how many Gflops an Athlon
> 1.3GHz CPU can perform at peak.

There is the perpetual question of "what's a gigaflop" that makes this
question ambiguous if not meaningless.  However, I'll give you at least
one answer and you can judge how meaningful it really is for you.

Running out of L2 cache on a 1.2 GHz Tbird Athlon, I've measured about
270-275 peak MFLOPS (double precision) with cpu-rate
(http://www.phy.duke.edu/brahma).  cpu-rate averages the rates at which
addition, subtraction, multiplication and division occur, where
division is generally very slow and a rate-limiting factor.
Extrapolating to a 1.33 GHz Tbird (as is pretty reasonable to do in this
case, in cache) one might get 300 MFLOPS, or only 0.3 of a GFLOPS -- not
even ONE GFLOPS.
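
The extrapolation above is nothing more than linear scaling with clock,
which is only defensible because the benchmark is running in cache.  A
minimal sketch of the arithmetic (the function name is mine, purely for
illustration, and is not part of cpu-rate):

```python
def extrapolate_mflops(measured_mflops, measured_ghz, target_ghz):
    """Scale an in-cache MFLOPS figure linearly with CPU clock.

    Assumes the rate is CPU-bound (working set fits in cache), so that
    memory speed does not enter.
    """
    return measured_mflops * target_ghz / measured_ghz


# Measured range on the 1.2 GHz Tbird, scaled to a 1.33 GHz part.
low = extrapolate_mflops(270.0, 1.2, 1.33)
high = extrapolate_mflops(275.0, 1.2, 1.33)
print(f"{low:.0f}-{high:.0f} MFLOPS, or about {high / 1000:.2f} GFLOPS")
# prints: 299-305 MFLOPS, or about 0.30 GFLOPS
```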

However, as one increases the size of the vectors one operates on
(running out of main memory instead of cache) the rate drops off to
about 115 MFLOPS.  That is really quite good, comparatively -- only
Alphas benchmark faster out there -- and at least part of it is due to
the use of DDR, since in this regime floating point is limited by the
speed of streaming access to large blocks of memory.  (This makes
stream MFLOPS a viable measure of floating point speed if you prefer
them to cpu-rate.)  This is not peak, though.

The cpu-rate numbers aren't peak either.  Aggressive optimization,
different compiler choices, hand-coded assembler, and perhaps the use of
e.g. prefetch could quite possibly improve them.  Then there are the
manufacturer's quoted theoretical peak "maximum FLOPS", which I've never
seen (nor heard of anybody who has) but which might exist.  cpu-rate
also always involves SOME sort of vector addressing -- it doesn't just
multiply four static variables a gazillion times and evaluate the rate,
so it arguably isn't even close to a register-to-register peak rate
with no need to access memory at all.  However, the cpu-rate numbers are
based on straightforward compiled code and are at least MAYBE relevant
to certain common operations in core loops.
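
To make "vector addressing" concrete, here is a rough sketch of the
SHAPE of such a measurement: one add, one subtract, one multiply and one
divide per vector element, timed over several passes.  This is NOT the
cpu-rate source, and the function name and default sizes are assumptions
of mine.  It is written in interpreted Python, so interpreter overhead
swamps the floating point work and the number it prints says nothing
about the hardware rates quoted here.

```python
import time


def mixed_op_mflops(n=10_000, passes=20):
    """Time one +, -, * and / per element of an n-element vector."""
    x = [1.0 + 1e-6 * i for i in range(n)]
    a = 1.000001  # deliberately not an integer or a power of two
    start = time.perf_counter()
    for _ in range(passes):
        for i in range(n):
            v = x[i]      # load from the vector (not a static scalar)
            v = v + a
            v = v - a
            v = v * a
            v = v / a
            x[i] = v      # store back to the vector
    elapsed = time.perf_counter() - start
    flops = 4 * n * passes  # four floating point operations per element
    return flops / elapsed / 1e6


rate = mixed_op_mflops()
print(f"{rate:.1f} MFLOPS (interpreted; not comparable to cpu-rate)")
```

Raising n until the vector no longer fits in cache is how one would look
for the in-cache versus main-memory falloff in a compiled version of the
same loop.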

Then there are LINPACK MFLOPS and probably others.  MFLOPS is really a
pretty meaningless measure, especially given that "peak" MFLOPS will
seriously increase if the operation(s) in question is just addition
and/or multiplication (which are often heavily optimized in the chip
design).  As an example of another trap, I've learned the hard way that
many vendors (Intel, for example) optimize division by integers that are
a power of two so that it is done by a bit shift instead of a full
floating point division algorithm -- a measure of "FLOPS" based on
(floating point!) multiplication or division by numbers that happen to
be integers can be skewed upward by more than a factor of 2.  Are these
"peak" FLOPS?  Or just absurdly unlikely accidents in most real code?

A more useful way of viewing and using measures like MFLOPS, with all
their many possible definitions, is comparatively.  The fact that an
Athlon 1200 Tbird with DDR gets 270 or so peak double precision MFLOPS
on cpu-rate is really pretty irrelevant unless your application EXACTLY
resembles cpu-rate in its main core loop.  However, it is possibly
relevant that it gets 270 peak while an Athlon 800 MHz Tbird with PC133
gets perhaps 177 peak, a 933 MHz PIII with ordinary PC133 gets only a
bit more than 100 peak, and a lowly 466 MHz Celeron gets about 50 peak.
In both families (Athlon and P6) the peak scales nearly perfectly with
CPU clock WITHIN the family, which gives us a certain amount of warm
fuzziness -- the benchmark is insensitive to the (>>very<<
different) main memory speeds, as it should be in this range (for
vectors maybe 40-80K in length that fit easily in all the L2 caches).
It also shows that for code of this type, the Athlon blows the pants off
of the P6.
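
The scaling claim is easy to check by dividing each figure by its clock.
A small sketch, using the approximate numbers quoted above ("a bit more
than 100", "perhaps 177" and "about 50" are entered as 100, 177 and 50,
so the last digit should not be taken seriously):

```python
# (system, clock in MHz, approximate peak cpu-rate MFLOPS)
results = [
    ("Athlon Tbird 1200, DDR", 1200, 270),
    ("Athlon Tbird 800, PC133", 800, 177),
    ("PIII 933, PC133", 933, 100),
    ("Celeron 466", 466, 50),
]


def mflops_per_mhz(clock_mhz, mflops):
    """Normalize a peak rate by CPU clock."""
    return mflops / clock_mhz


for name, clock_mhz, mflops in results:
    print(f"{name:24s} {mflops_per_mhz(clock_mhz, mflops):.3f} MFLOPS/MHz")

# Ratio of the two families at equal clock, for this benchmark only.
athlon = mflops_per_mhz(1200, 270)
p6 = mflops_per_mhz(933, 100)
print(f"Athlon/P6 per-clock ratio: {athlon / p6:.1f}")
```

Both Athlons come out near 0.22 MFLOPS/MHz and both P6-family chips
near 0.11, which is the within-family scaling, and the factor of about
two between families, described above.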

HOWEVER, other code that I run shows the Athlon slightly underperforming
the P6 family at equivalent clock.  Then there are the very different,
and not particularly CPU clock-speed proportional, results that hold
when the vectors are much bigger than L2.  Then there is the fact that
cache sizes differ.  Then there are latency dominated (instead of
streaming vector memory dominated) results to consider.  Your mileage
can and almost certainly will vary.

Aside from this sort of VERY crude comparison, the only really useful
purpose for the FLOPS rating of a system (any of them!) is to put it
into a grant proposal or bandy it around to impress the more ignorant
and impressionable of your friends.  Otherwise you should prototype and
benchmark your actual application, or hope that your code nearly
exactly resembles lmbench, or LINPACK, or stream, or cpu-rate, or any of
the various components of SPEC.

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu