[Beowulf] Woodcrest Memory bandwidth

Tue Aug 15 07:16:59 PDT 2006

On Tue, Aug 15, 2006 at 12:29:02PM +0100, Kozin, I (Igor) wrote:
> 
> Good point which makes perfect sense to me.
> Given that the theoretical maximum is actually 21.3 GB/s
> the real maximum Triad number must be 21.3/3 = 7.1 GB/s.
> And that's the best number I've heard of.

Then how do you explain a dual opteron with two 6.4GB/sec (peak)
memory systems, 12.8GB/sec total per node, managing 9-10GB/sec?

12.8/3=4.27GB/sec.  People are seeing well over twice that.

If the opteron manages 75% efficiency or so on a 12.8GB/sec
memory system, why does woodcrest manage 32% efficiency?
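
As a quick check of the arithmetic above, here is a minimal Python sketch.
The function name efficiency() and the variable names are made up for
illustration; the bandwidth figures are the ones quoted in this thread.
Measured 9-10GB/sec on a 12.8GB/sec peak works out to roughly 70-78%.

```python
def efficiency(measured_gbs, peak_gbs):
    """Fraction of theoretical peak bandwidth actually delivered."""
    return measured_gbs / peak_gbs

opteron_peak = 2 * 6.4   # two 6.4 GB/s memory systems per node
woodcrest_peak = 21.3    # theoretical maximum quoted above

# Dual opteron: reported stream results of 9-10 GB/s
for measured in (9.0, 10.0):
    eff = efficiency(measured, opteron_peak)
    print(f"Opteron at {measured:.1f} GB/s: {eff:.0%} of peak")

# The peak/3 estimate applied to both systems, for comparison
print(f"Opteron peak/3:   {opteron_peak / 3:.2f} GB/s")
print(f"Woodcrest peak/3: {woodcrest_peak / 3:.2f} GB/s")

# What 32% efficiency means in absolute terms on woodcrest
print(f"Woodcrest at 32%: {0.32 * woodcrest_peak:.1f} GB/s")
```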

> Here is a pointer to some measured latencies
> http://www.anandtech.com/IT/showdoc.aspx?i=2772&p=4

Interesting, the woodcrest latencies are much higher than I've seen
elsewhere.  It's been a while since I looked at the lmbench source;
I seem to recall it used to do a negative stride, but then one of
the architectures detected it and successfully prefetched it.

I'll check.  If it doesn't do true random accesses, I'll post a threaded
benchmark that does.
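
For reference, the access pattern behind a "true random" latency test can
be sketched as follows.  This is not the lmbench code and not the threaded
benchmark mentioned above, just an illustration of the idea: build a table
in which each slot holds the index of the next slot to visit, arranged as
one random cycle, so a stride prefetcher has nothing to detect.  The names
random_cycle() and chase() are hypothetical.  Python's interpreter overhead
swamps memory latency, so a real measurement would walk the same chain in
C, with one element per cache line and a table much larger than cache.

```python
import random

def random_cycle(n, seed=0):
    """Return a list nxt where following nxt[] visits all n slots in one cycle.

    Uses Sattolo's algorithm: like a Fisher-Yates shuffle, but j is drawn
    from [0, i), which guarantees a single cycle of length n.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)          # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def chase(nxt, steps, start=0):
    """Follow the chain; each load depends on the result of the previous one."""
    p = start
    for _ in range(steps):
        p = nxt[p]
    return p

if __name__ == "__main__":
    table = random_cycle(1 << 16)
    # After exactly len(table) dependent loads we are back at the start.
    print(chase(table, len(table)) == 0)
```

Because each load address depends on the previous load, the time per step
in a compiled version approximates the full memory latency rather than
the prefetcher's hit rate.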

> Incidentally, the same site dwells on low latency of Core 2
> http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2795&p=5
> Anybody run stream on it?

Note the 256-byte stride; it looks like a test of the prefetcher more
than of true memory latency.

Note this link Mark H. brought to my attention:
	http://www.anandtech.com/mb/showdoc.aspx?i=2810&p=4

It shows that a mismatch of FSB and 2 x memory speed hurts performance
significantly.  The DDR2-533 + core 2 FSB/1066 significantly outperforms
the DDR2-667 + core 2 FSB/1066.  If this holds true on woodcrest, it would
seem that many of the woodcrest systems available from tier-1 vendors
are shipping with a significantly sub-optimal memory configuration.  It's
rather counterintuitive for the faster (ddr2-667) memory to deliver
only 75% of the slower memory's (ddr2-533's) performance.
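
To make the "FSB vs 2 x memory speed" comparison concrete, here is a small
sketch of the peak numbers.  It assumes a 64-bit (8-byte) data path on
both the FSB and each DDR2 channel, two memory channels, and decimal
GB/sec; the helper name peak_gbs() is made up.  It only shows which
configurations are matched on paper, not why the mismatched one runs
slower.

```python
BUS_BYTES = 8  # assumed 64-bit data path for the FSB and for each DDR2 channel

def peak_gbs(mega_transfers, channels=1):
    """Peak bandwidth in GB/s for a bus running at the given MT/s."""
    return mega_transfers * BUS_BYTES * channels / 1000.0

fsb = peak_gbs(1066)  # core 2 FSB/1066
for name, mts in (("DDR2-533", 533), ("DDR2-667", 667)):
    mem = peak_gbs(mts, channels=2)  # assumed dual-channel memory
    print(f"2 x {name}: {mem:.1f} GB/s, FSB/1066: {fsb:.1f} GB/s, "
          f"memory/FSB ratio {mem / fsb:.2f}")
```

With these assumptions 2 x DDR2-533 matches FSB/1066 exactly, while
2 x DDR2-667 has more peak bandwidth than the FSB can carry.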

Based on that, one might speculate that a DDR2-667 + core2/woodcrest/1333
would score significantly better, although I've yet to find a compiler,
OS, and BIOS that demonstrates significantly better numbers.  Offline I've
had reports from people who:
* Have the FSB snoop filter off by default in BIOS 
* Have the adjacent cache prefetch on (which would likely increase main
  memory latency)
* Have dimms in the wrong slots (4 dimms on 2 channels, not 4 on 4 channels).

I've also seen intel documents on the chipset showing that stream numbers
increase with the number of dimms and ranks: 4 single rank dimms get only
65% of the possible stream bandwidth, 8 single rank dimms get 80%, and
8 double rank dimms get 100%.  Alas, no absolute numbers.

Seems a little strange to get only 65% of possible stream bandwidth
with 4 dimms; after all, their peak bandwidth is 21GB/sec.  Maybe FBdimms
and/or the current chipset only allow a few pages open per bank/rank?
If so, it would take 16 bank/ranks to allow for good stream performance
(read that as allowing the prefetcher to hide 120ns or so of main memory
latency).
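
A rough way to see how much concurrency that implies is Little's law:
bytes in flight = bandwidth x latency.  The sketch below uses the 21.3GB/sec
peak and the 120ns latency mentioned in this thread, and assumes 64-byte
cache lines; the function name lines_in_flight() is made up.  It says
nothing about how the chipset maps requests onto banks and ranks, only how
many outstanding line fetches are needed to keep the memory system busy.

```python
import math

def lines_in_flight(bandwidth_gbs, latency_ns, line_bytes=64):
    """Cache lines that must be outstanding to sustain the given bandwidth.

    GB/s times ns gives bytes directly (1e9 * 1e-9 = 1).
    """
    bytes_in_flight = bandwidth_gbs * latency_ns
    return math.ceil(bytes_in_flight / line_bytes)

# Peak bandwidth and approximate main memory latency from the discussion
print(lines_in_flight(21.3, 120))
```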

I'm guessing the best woodcrest stream numbers will be:
* Pathscale's compiler with -mp -O3 or possibly -mp -Ofast
* 8 dual rank dimms
* FSB snoop filter on (in BIOS)
* Prefetch adjacent cache lines (in BIOS)

--  
Bill Broadley
Computational Science and Engineering
UC Davis