Athlon SDR/DDR stats for specific gaussian98 jobs

Wed May 2 20:36:47 PDT 2001

On Wed, May 02, 2001 at 06:16:51PM -0400, Robert G. Brown's all...

> IIRC, somebody on the list (Josip Loncaric?) inserted prefetching into
> at least parts of ATLAS for use with athlons back when they were first
> released.  It apparently made a quite significant difference in
> performance.

Oh, btw, I did include Athlon3DNow!2 stuff in my ATLAS just for a laugh.
I then compared my results to the original non-ATLAS g98 and everything
matched up for all 4 jobs I chose. I don't think it's actually used
for G98 since it's single-precision stuff only, IIRC.

Also, note, we may start using MPQC in a bit - Graydon Hoare (ex Berlin
project, now at Redhat) is hacking on it to speed it up a bunch if
possible. I haven't had a chance to benchmark that (it may demand
TB1333's on DDR boards and I'll be sunk! :)

In fact my stats indicated that a Duron 750 was slightly better for the money
in most cases, but the 900 and 700 are so close, I went with the faster chip
in the hopes that other situations that I can't benchmark now but will
encounter later will favour the faster CPU (also the supply of D700s is
waning).

> in the marginal area where it makes sense to go single.  It may not be
> very easy to tell which >>one<< is truly cost optimal, though, without
> benchmarking your particular code and doing a careful cost comparison
> including hidden costs (e.g. the fact that electricity costs and space
> costs may be 60% higher for lots of singles, each single requires a
> case, its own memory and copy of the OS and network card, and so forth).
> In many cases a dual is only 0.7-0.8 the cost of two singles, although
> the high cost of DDR makes that unlikely in this case.  Still, it is
> under $1/MB, which isn't all THAT bad -- PC133 cost that much only
> months ago.  Months from now DDR may cost little more than SDRAM in
> equivalent amounts.

I started forming my anti SMP bias when I ran a bunch of old G94 jobs
on a dual Celeron board. It *SUCKED* :) 100 MHz RAM was probably
the bottleneck, and the resource locking in Linux 2.2.(early) was not
as nice as it is now in 2.4 (so I hear). I have stats for it actually:

two jobs
""""""""
(MHz ratio = 1.00 for 550 MHz; times are for 2 jobs to be run)

#CPUs  CPU     MHz/bus  RAM    total    speed   MHz     efficiency
                                        ratio   ratio   (speed ratio / MHz ratio)
---------------------------------------------------------------------------------
2      C366A   550/100  128M   4293.6   1.00    2.00    0.50
1      P2 400  400/100  128M   6105.3   0.70    0.73    0.96
1      P3 450  450/100  256M   5675.0   0.76    0.83    0.92
1      C300A   450/100  64M    7253.5   0.59    0.83    0.71
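
For anyone who wants to redo the arithmetic behind the table above, here is a minimal Python sketch. It assumes (inferred from the posted numbers, not stated in the post) that the speed ratio is the dual-Celeron total divided by each run's total, that the MHz ratio counts aggregate clock (number of CPUs times MHz, over 550), and that efficiency is speed ratio divided by MHz ratio. The names `RUNS`, `ratios` and `table` are made up for illustration.

```python
BASE_MHZ = 550.0  # caption: MHz ratio = 1.00 for 550 MHz

# (label, number of CPUs, clock in MHz, total for the two jobs), from the table
RUNS = [
    ("C366A", 2, 550.0, 4293.6),
    ("P2 400", 1, 400.0, 6105.3),
    ("P3 450", 1, 450.0, 5675.0),
    ("C300A", 1, 450.0, 7253.5),
]

def ratios(n_cpus, mhz, total, base_total, base_mhz=BASE_MHZ):
    """Return (speed ratio, MHz ratio, efficiency) for one run."""
    speed = base_total / total            # assumed: baseline total / this total
    mhz_ratio = n_cpus * mhz / base_mhz   # assumed: aggregate clock vs 550 MHz
    return speed, mhz_ratio, speed / mhz_ratio

def table(runs=RUNS):
    """Compute the three ratios for every run, using the first row as baseline."""
    base_total = runs[0][3]
    return {label: ratios(n, mhz, total, base_total)
            for label, n, mhz, total in runs}

if __name__ == "__main__":
    for label, (s, m, e) in table().items():
        print(f"{label:<8} speed={s:.2f} mhz={m:.2f} eff={e:.2f}")
```

Recomputing from the unrounded clocks can differ from the posted figures in the second decimal place, since the posted efficiencies look like they were derived from already-rounded ratios.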

For g94 I saw almost the exact same performance out of a C450 as
a P3-450 (I guess the CPU's SSE extensions were not used by it).
So seeing that my heavily overclocked 550MHz Celerons were only some
25% faster overall for 2 CPUs vs 1 at 450, that's pretty bad. :)
I think that this is the software fighting with the locking and the
shared bus (Abit BP6 IIRC was the board, which probably didn't have
the fastest architecture either. IIRC, dual Celeron was a bad hack).
(As well, the stats above are not that fair as I ran a single job
instead of the 4 different types I'm using now as my base set, not to
mention it's G94 (outdated) and non-ATLAS.)

So it's probably not optimal to even consider SMP stats from dual
Celerons, considering they're so not designed for it. :) Nonetheless,
my benchmarking of various jobs showed me that memory bandwidth is really
important for most Gaussian jobs, and you'd really need a speedy memory
bus to keep up with this. DDR will probably really help SMP for this
kind of thing.

> Sounds like you are dead right on all of this.  Embarrassingly parallel
> jobs, few to no communications, purely CPU bound -- Durons (or whatever
> currently delivers the most raw flops for the least money) are likely to
> be perfect for you.  And for many others, actually.  For a long time I
> like Celerons (or even dual Celerons) for the same reasons, although at
> this point I've converted to AMD-based systems as their cost-benefit has
> overwhelmed Intel's whole product line for my code.

It's interesting, however, to note the improvements per cycle, especially
for DDR boards.  I actually removed this stat from my charts, but I did
compare all the Athlons to a Celeron 450 G98-atlas setup.

These values are MUCH better than the non-ATLAS values - i.e. ATLAS
really improves the efficiency of the jobs, especially for newer CPUs -
I assume by making better use of large L1/L2 caches and filling the longer
pipelines more optimally.

efficiency/cycle, atlas g98:
(each column is normalized to 1.00 for C450 w.r.t that column's data)

                   job 1           job 2           job 3           job 4
                              |               |               |
                non     atlas | non     atlas | non     atlas | non     atlas
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~~~~~~~~~~|~~~~~~~~~~~~~~~|~~~~~~~~~~~~~~
C450            1.00    1.00  | 1.00    1.00  | 1.00    1.00  | 1.00    1.00
A700                    1.05  | 0.88    1.05  | 1.26    1.31  | 1.34    1.39
D750            0.86    1.30  | 0.70    0.77  | 1.05    1.10  | 1.25    1.24
T900            0.89    1.31  | 0.81    0.86  | 1.05    1.08  | 1.23    1.20
T1200DDR                1.08  |         1.04  |         1.30  |         1.42

I think we see these patterns because without optimization for caches
and pipeline, the speed disparity between the CPU and RAM for the non-DDR
machines is very large (up to 4 or 5 CPU cycles per RAM cycle). The cache
helps a fair bit (the Tbird and Athlon fare much better than the Duron),
but when we get ATLAS involved, the improvements are quite noticeable
over the baseline C450.
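
One caveat when reading the table above: since each column is normalized to its own C450 entry, dividing a CPU's atlas value by its non value does not give the raw ATLAS speedup. Assuming the normalization works the way the caption describes, that quotient is the ATLAS gain on that CPU divided by the ATLAS gain on the C450, so a value above 1 means the chip benefits more from ATLAS than the C450 does. A rough Python sketch of that calculation, with the numbers copied from the table and hypothetical names (`TABLE`, `relative_atlas_gain`, `gains`):

```python
# Per-cycle efficiencies from the table, as (non, atlas) pairs for jobs 1-4.
# None marks a cell that is blank in the post.
TABLE = {
    "A700":     [(None, 1.05), (0.88, 1.05), (1.26, 1.31), (1.34, 1.39)],
    "D750":     [(0.86, 1.30), (0.70, 0.77), (1.05, 1.10), (1.25, 1.24)],
    "T900":     [(0.89, 1.31), (0.81, 0.86), (1.05, 1.08), (1.23, 1.20)],
    "T1200DDR": [(None, 1.08), (None, 1.04), (None, 1.30), (None, 1.42)],
}

def relative_atlas_gain(non, atlas):
    """ATLAS gain on this CPU relative to the ATLAS gain on the C450.

    Assumes both inputs are normalized to the C450 value of their own column.
    Returns None when either measurement is missing.
    """
    if non is None or atlas is None:
        return None
    return atlas / non

def gains(table=TABLE):
    """Relative ATLAS gain per CPU, one entry per job."""
    return {cpu: [relative_atlas_gain(n, a) for n, a in jobs]
            for cpu, jobs in table.items()}

if __name__ == "__main__":
    for cpu, row in gains().items():
        cells = ["  -- " if g is None else f"{g:5.2f}" for g in row]
        print(f"{cpu:<9}", *cells)
```

The T1200DDR row comes out empty because only atlas numbers were posted for it.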

> Not yet, but maybe soon.  The fast Tbirds do require a big "certified"
> power supply, but I'm guessing they draw a lot less than they "require"
> except maybe in bursts.  I'm betting they draw around 100-150W running,
> a number that recently got some support on the list.

Bursts of CPU usage? Aren't all our clusters hammering our CPUs as much as
possible? And if ATLAS is really doing its job, aren't we hammering all parts
of the CPU as much as possible? :)

> Not a preset package, but I'm trying to start a collection of sorts:
> 
>   http://www.phy.duke.edu/brahma/dual_athlon/tests.html

Will check it out.

> Hope this all helps or is interesting.  I'm very interested in Athlon
> performance profiles as they seem to be the current
> most-CPU-for-the-least-money winners, and when one buys in bulk (as
> beowulf humans tend to do) this sort of optimization really matters.

What kind of deals can you get in bulk? From AMD themselves? Do you
need to be a big university and have a big press release event to
get these deals from them? How many do you need to get a batch deal?
How deep is the discount?

/kc

> 
>    rgb
> 
> -- 
> Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
> 
> 
> 

-- 
Ken Chase, math at velocet.ca  *  Velocet Communications Inc.  *  Toronto, CANADA 
