Bill Broadley <bill at cse.ucdavis.edu> wrote:

> Dual socket quad core opteron 2350's (2.0 GHz) running the current McCalpin's
> STREAM compiled with pathscale-3.0 -mp -O4:
> Total memory required = 228.9 MB.
> Function      Rate (MB/s)   Avg time     Min time     Max time
> Copy:       15355.3139       0.0104       0.0104       0.0105
> Scale:      15249.5885       0.0105       0.0105       0.0105
> Add:        14954.2883       0.0161       0.0160       0.0162
> Triad:      15061.2389       0.0160       0.0159       0.0160

So with all 8 cores at work across 2 sockets you are seeing about 70% of peak, assuming
you are using 667 MHz DDR2 (as fast as you can get until the "Phenom" comes
out, I think).  That is a little better on a percentage basis than the Socket 940 numbers,
and it meets expectations.  I am surprised by the latency numbers you provide, though.
Latencies of 90 to 100+ ns are quite a bit higher than I expected and are edging
up into the Intel range.  Perhaps this is an L3 cache delay effect: the L3 is a new layer in
the path to memory in Barcelona.  Although I see your 200 series numbers are
up there too ... I thought first-byte latencies were around 65 ns for Opteron.  Am
I confused?
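
For what it is worth, here is a quick sanity check of that 70% figure, as a
rough sketch only.  It assumes things the quoted post does not state: dual-channel
DDR2-667 on each socket, 8 bytes per transfer per channel, and STREAM's
convention that 1 MB = 10^6 bytes.  The constant and function names are
made up for illustration.

```python
# Rough check of "percent of peak" for the quoted STREAM run.
# ASSUMPTIONS (not from the original post): dual-channel DDR2-667 per socket,
# 64-bit (8-byte) channels, 2 sockets, and MB = 10**6 bytes as STREAM reports.
TRANSFERS_PER_SEC = 667e6      # DDR2-667: about 667 million transfers/s
BYTES_PER_TRANSFER = 8         # one 64-bit channel
CHANNELS_PER_SOCKET = 2        # assumed dual-channel memory controller
SOCKETS = 2


def peak_mb_per_s():
    """Theoretical aggregate memory bandwidth in MB/s under the assumptions above."""
    return (TRANSFERS_PER_SEC * BYTES_PER_TRANSFER
            * CHANNELS_PER_SOCKET * SOCKETS) / 1e6


def fraction_of_peak(rate_mb_per_s):
    """Measured STREAM rate divided by the assumed theoretical peak."""
    return rate_mb_per_s / peak_mb_per_s()


# Rates copied from the quoted STREAM output (MB/s).
stream_rates = {
    "Copy": 15355.3139,
    "Scale": 15249.5885,
    "Add": 14954.2883,
    "Triad": 15061.2389,
}

print(f"assumed peak: {peak_mb_per_s():.0f} MB/s")
for kernel, rate in stream_rates.items():
    print(f"{kernel:<6} {rate:10.1f} MB/s  {100 * fraction_of_peak(rate):5.1f}% of peak")
```

On those assumptions the peak works out to 21344 MB/s, and all four kernels
land between roughly 70% and 72% of it.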

Anyway, if the latency numbers hold up, I would say this is not the greatest news for
Barcelona.  We can anticipate faster clocks, which should help, but it makes you wonder what
things would have looked like with a larger shared L2 cache instead of an L3.  This is
a synthetic test, of course; what compilers and users do to strip-mine for cache will
give a more realistic assessment.  Perhaps this was the trade-off driving this
design.  Can I continue to think of AMD as the first-byte latency king? ... ;-) ...



