interesting Athlon/P4 discussion from FreeBSD-Q-l

Don Holmgren djholm at fnal.gov
Thu Apr 19 08:34:44 PDT 2001


I've posted a plot of Athlon vs P4 performance on our "real" code - in
this case, MILC lattice QCD code (quark/gluon simulations).  This
illustrates well the various points under discussion, I think.  See

   http://qcdhome.fnal.gov/benchmarks/athlon_vs_pIV.html

There are four curves: a 1 GHz Athlon with DDR (Gigabyte GA-7DX), a
1.5 GHz Pentium IV with 800 MHz RDRAM (Dell 330, not sure of the
motherboard), and a 1.4 GHz Pentium IV with 800 MHz RDRAM (Intel D850GB
motherboard, home-built, about $1100 w/ 128MB memory), the last shown
both with and without SSE and prefetch optimizations.

The Athlon and 1.5 GHz P4 curves are from binaries built from standard
MILC version 5 code (gcc 2.95.2).  The 1.4 GHz P4 optimized curve is
from a binary built with prefetches added in by hand, and with
hand-coded matrix-matrix and matrix-vector multiplies using SSE.
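
For readers unfamiliar with the kernels mentioned above, here is a
minimal sketch of the kind of operation being hand-coded: a 3x3 complex
matrix times a 3-component complex vector, the basic SU(3) building
block in lattice QCD.  It is plain Python meant only to show the
arithmetic.  It is not the MILC routine and not the SSE version, and
the names mat_vec_3x3 and FLOPS_PER_MAT_VEC are made up for
illustration.

```python
# Illustrative only: 3x3 complex matrix times 3-component complex vector.
# Function and constant names are hypothetical, not MILC identifiers.

def mat_vec_3x3(m, v):
    """Return the product m*v for a 3x3 complex matrix (list of rows)
    and a 3-component complex vector (list)."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

# Real floating-point operations per product:
# 9 complex multiplies (6 flops each) + 6 complex adds (2 flops each).
FLOPS_PER_MAT_VEC = 9 * 6 + 6 * 2

if __name__ == "__main__":
    identity = [[1 + 0j, 0j, 0j],
                [0j, 1 + 0j, 0j],
                [0j, 0j, 1 + 0j]]
    v = [1 + 2j, 3 - 1j, -2j]
    print(mat_vec_3x3(identity, v))   # identity leaves v unchanged
    print(FLOPS_PER_MAT_VEC)
```

With so little arithmetic per matrix loaded, kernels like this tend to
be limited by how fast operands arrive from memory once the data no
longer fits in cache, which is presumably why prefetching helps.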

Where the lattice size just starts to exceed the P4 L2 cache (256 KB),
the Athlon outperforms the P4's running non-SSE code.  At very small
lattice sizes, assuming scaling with clock speed (i.e., comparing a
mythical 1.4 or 1.5 GHz Athlon with the P4's) the Athlon outperforms the
P4's.  We would likely never run at such sizes, however.
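
A rough way to see where a lattice outgrows a 256 KB L2 cache is to
count bytes.  The sketch below makes several assumptions that are not
from the post: an L^4 lattice, single precision, and only the gauge
links counted (four 3x3 complex matrices per site).  The real working
set of the code is larger, since it also holds vectors and temporaries,
so the actual crossover comes earlier than this estimate suggests.  The
names are hypothetical.

```python
# Back-of-the-envelope working-set estimate.  All assumptions here
# (L^4 lattice, gauge links only, single precision) are illustrative.

L2_BYTES = 256 * 1024  # 256 KB L2 cache

def gauge_field_bytes(l, bytes_per_real=4):
    """Bytes needed for the gauge links of an l^4 lattice."""
    reals_per_matrix = 3 * 3 * 2   # 3x3 complex entries, 2 reals each
    links_per_site = 4             # one link per space-time direction
    return l ** 4 * links_per_site * reals_per_matrix * bytes_per_real

if __name__ == "__main__":
    for l in (2, 4, 6, 8):
        size = gauge_field_bytes(l)
        where = "fits in L2" if size <= L2_BYTES else "exceeds L2"
        print(f"L={l}: {size} bytes, {where}")
```

Under these assumptions a 4^4 lattice needs 73,728 bytes for its links
and a 6^4 lattice needs 373,248 bytes, so the crossover falls between
the two.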

In main memory, the higher memory bandwidth of the P4's is
evident.  STREAM copy numbers were 723, 1243, and 1264 MB/sec
respectively for the Athlon and the 1.4 and 1.5 GHz P4's.
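
To put the copy numbers above in perspective, the sketch below
computes the P4-to-Athlon bandwidth ratios and shows a crude
copy-bandwidth measurement in the same spirit.  It is not the STREAM
benchmark itself: it times a Python bytearray copy, and it assumes the
STREAM convention of counting both bytes read and bytes written, with
1 MB taken as 10^6 bytes.  The function name and default sizes are made
up for illustration.

```python
import time

# Copy bandwidths quoted in the post, in MB/sec.
STREAM_COPY = {"Athlon 1.0 GHz": 723, "P4 1.4 GHz": 1243, "P4 1.5 GHz": 1264}

def ratio_to_athlon(name):
    """Bandwidth of the named system relative to the Athlon."""
    return STREAM_COPY[name] / STREAM_COPY["Athlon 1.0 GHz"]

def copy_bandwidth_mb_per_s(n_bytes=16 * 1024 * 1024, repeats=5):
    """Crude copy bandwidth: best of several timed buffer copies.

    Counts bytes read plus bytes written (an assumed convention)."""
    src = bytearray(n_bytes)
    dst = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src                      # one bulk memory copy
        best = min(best, time.perf_counter() - start)
    best = max(best, 1e-9)                # guard against a zero reading
    return 2 * n_bytes / best / 1e6

if __name__ == "__main__":
    for name in ("P4 1.4 GHz", "P4 1.5 GHz"):
        print(f"{name}: {ratio_to_athlon(name):.2f}x the Athlon")
    print(f"this machine: {copy_bandwidth_mb_per_s():.0f} MB/sec")
```

The ratios come out to about 1.72 and 1.75, so both P4 systems move
roughly three quarters more data per second than the Athlon in this
test.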

Don Holmgren
Fermilab




On Wed, 18 Apr 2001, Mark Hahn wrote:

> > Can't vouch for correctness, but seems to have some explanations/info that
> > weren't mentioned here. Feel free to rebut the content of course.
> 
> the P4 has an awesome combination of hardware prefetcher,
> fast FSB, and dram that keeps up with it.  for code that 
> needs bandwidth, this is very attractive.  and it's dramatically
> faster than anything else in the ia32 world: 1.6 GB/s versus
> at most around .8 GB/s for even PC2100 DDR systems (at least 
> so far - I'm hopeful that DDR can manage around 1.2 GB/s when
> tuned, and if the next-gen Athlon contains hardware prefetch.)
> 
> but it's also true that most code, even a lot of computational code,
> is not primarily dram-bandwidth-bound.  the P4 is not exceptional
> when running real code in-cache; this is why on most benchmarks
> other than Stream, recent Athlons beat P4's quite handily.
> 
> and that's why AMD is having such an awesome time in the market now,
> and why Intel is cutting prices so dramatically on the P4.
> 
> regards, mark hahn.
> 
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> 




