SpecFP and cache

Tue May 16 14:08:51 PDT 2000

On Tue, 16 May 2000, Gerben Roest wrote:

> Hi,
> 
> I always thought that Athlons were so much faster in FP than Pentiums,
> until I saw some results for the SpecFP 95 at spec.org.
> 
> PIII-733: 30.4 (full speed 256 kB cache)
> PIII-600: 15.8 (half speed 512kB cache), normalized for 733: 19.3
> Athlon 750: 32.9 (2/5 speed 512 kB cache), normalized for 733: 31.9
> 
> This questionable comparison shows that the full speed cache of the PIII
> is a big advantage. Probably the new and improved (wider path and so on)
> cache of the CuMine is of big importance to the SpecFP bench.
> 
> My question is: Does anyone have experiences with old and new PIII's (that
> is, old and new cache) running big Fortran jobs? Is the difference there
> as much as the SpecFP bench seems to show?

I'd be very cautious about comparing >>any<< CPUs or systems on paper.
I wouldn't believe a priori that the Athlon is better than a P-anything
or a Celeron.  I wouldn't even take it on faith that an Alpha is as much
better as it is hyped to be.  It all depends on your code.  I've been
doing fairly extensive comparative benchmarking on Celerons, PII's, the
Athlon and 21264 Alpha 667's and hope to do PIII's as soon as we get any
(it's often hard to justify them compared to the Celeron, really).

As far as I can tell, a >>lot<< of the supposed advantage of the Athlon
comes from its superior cache/memory subsystem.  Indeed, my tests find
that the Athlon has a significant advantage over the PII and Celeron in
memory speed (unfortunately I have no PIII to test).  When I look at
"raw floating point speed" (in a small enough loop to fit in L1) the
Athlon is much faster than a Celeron.

However, when I run an actual, not particularly memory-local Monte
Carlo job that contains a mix of double precision float, integer, and
transcendentals (my own research application, which I've been using as
MY most important benchmark for many years now) on Celeron (466 MHz),
PII (400 MHz), Athlon (900 MHz) and 21264 Alpha (667 MHz) I get:

#============================================================
# Benchmark run of On_spin3d on host athlon
# CPU = 900 MHz AMD K7 (athlon), Total RAM = 64 MB
# L = 16
# Time = 28.69user 0.01system 0:29.17elapsed
#============================================================
# Benchmark run of On_spin3d on host lucifer
# CPU = Celeron 466, Total RAM = 128 MB
# L = 16
# Time = 46.48user 0.03system 0:46.67elapsed 
#============================================================
# Benchmark run of On_spin3d on host brahma
# CPU = 400 MHz PII, Total RAM = 512 MB
# L = 16
# Time = 53.01user 0.03system 0:54.55elapsed
#============================================================
# Benchmark run of On_spin3d on host qcd2
# CPU = 667 MHz Alpha EV67, Total RAM = 512 MB
# Compiler = gcc!
# L = 16
# Time = 66.08user 6.14system 1:12.25elapsed 
# Compiler = ccc!
# Time = 23.79user 0.00system 0:23.84elapsed 
#============================================================

Note that on my nasty dirty application (as opposed to a nice, clean
benchmark) the Athlon UNDERPERFORMS both the Celery and the PII at
equivalent clock.  The 900 MHz Athlon is 1.85x faster than the 400 MHz
PII with its half-speed cache, but its clock is 2.25x faster, so it
delivers only 82% of the PII's performance per clock.  It is only 1.62x
faster than the Celeron (with its full speed cache), but there its clock
is 1.93x faster, so again it is only 84% as fast per clock.  A dual
Celery 466 running two instances of my embarrassingly parallel
application will complete both in that same 47 seconds, while the same
two runs take 57 seconds back to back on a single 900 MHz Athlon.

Costing out a dual Celery vs a 900 MHz Athlon, the dual Celery is the
price/performance winner for me by a small amount, because all I care
about is how many embarrassingly parallel runs I can complete, not how
long it takes to do any particular run.  

The XP1000 Alpha (667 MHz) wins on raw speed by a factor of 2.23 over
the PII (PROVIDED that one uses the Compaq compiler ccc and not gcc!).
This beats the ratio of their clocks (1.67), but is still pretty
disappointing -- the Alpha finishes my job only 1.34x faster at
equivalent clock.  Within the Intel family, by contrast, there is
near-perfect clock speed scaling, with Celerons and PII's performing
according to the ratio of their clocks only.

Since the XP1000 costs an easy 6x as much as a dual Celery, determining
the best buy here isn't rocket science.  This is just using ccc -O3 on
the Alpha and gcc -O3 on the others; perhaps there are optimization
secrets that would speed up the code (in any or all cases) but I'm
using the compilers "as is" to compare their more or less standardized
optimizations.

I'm working on running lmbench (which tests a whole suite of
"microscopic" performance indicators, e.g. latencies and bandwidths, but
not generally float/transcendental performance) on all of these systems,
but the version I have won't compile cleanly under ccc and it is pretty
clear that using gcc is a waste of time, so I don't have anything
meaningful yet for the Alpha.  I also suspect on the basis of some of my
tests that ccc-based results "cannot be trusted" when doing simple
things like executing a simple multiply/add/subtract/divide in a loop
where the results are never "used" -- its optimizer seems to just cancel
out the whole thing.  Either that or it is unbelievably fast, since when
I timed:

/* 
 * one addition, subtraction, multiplication and division per pass,
 * scaled up to 2.5x10^12 passes (or ten terafloat ops);
 * k must be wider than a 32-bit signed int (e.g. long on the Alpha)
 */
for(k=1;k<=2500000000;k++){
	for(i=0;i<1000;i++){
		x[i] = (pi + x[i])*(pi - x[i])/2.0;
	}
}

it was still completing in 0.00 seconds.  Maybe if I printed out the
final x[i] vector it would relent and actually do the calculation.

Anyway, my point in all of this is that I'd be VERY suspicious of ANY
benchmarks other than your own code.  If I went by paper results, I'd
expect the Alpha to be 3x faster or even more at common clock, not 1.34x
faster.  I'd also expect (from all of the hype) that the Athlon would be
better per clock than a Celeron (with its tiny 128K L2 cache) by at
least some amount instead of worse.

Your own application, with all of its flaws, is the one benchmark you
can always trust to predict how much work you'll get done (all things
being equal) with a given hardware platform.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu