[Beowulf] Barcelona numbers
bill at cse.ucdavis.edu
Tue Sep 11 00:09:08 PDT 2007
Latency for an amd64-2.0:
L2 latency 400KB = 8ns (includes L1 latency)
Main memory 16MB = 63ns (55ns because of memory)
Opteron 275 dual socket:
L2 latency 800KB = 8ns
Main memory = 77ns (82ns because of L2)
I believe 77ns is something along the lines of 55ns for memory, 8ns for L2
latency, 2ns for registered memory (1 cycle @ 400 MHz), and 12ns or so for
hypertransport coherency. 55+8+2+12 = 77
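Spelled out, the additive breakdown guessed at above looks like this; the component values are the ones from the post, and the labels are just my reading of them:

```python
# Hypothetical additive model for the Opteron 275 main-memory latency,
# using the component estimates from the post (all values in ns).
components = {
    "DRAM access": 55,
    "L2 miss handling": 8,
    "registered DIMM (1 cycle @ 400 MHz, per the post)": 2,
    "HyperTransport coherency snoop": 12,
}
total = sum(components.values())
print(total)  # 77, matching the measured main-memory latency
```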
Barcelona:
L2 latency 400KB = 7.5ns
L3 latency 2.25MB = 23ns (includes L2 latency)
Main memory 32MB = 100ns
The earlier 136ns number for 2GB I attribute to TLB thrashing, and it is
hopefully fixable with 1GB pages. I believe the 100ns is something along the
lines of 63ns for memory, 23ns for L3, 2ns for registered memory, and 12ns or
so for hypertransport coherency. Sadly, most DDR2 ECC registered memory I've
seen has a higher latency (in both cycles AND wallclock time) than the rather
mature DDR-400.
Q6600 (2.4 GHz quad, 4MB L2, single socket):
L2 latency 3MB = 12ns
Main memory 32MB = 80ns
Xeon 5310 (1.6GHz quad, 4MB L2, dual socket):
L2 latency 3MB = 15ns
Main memory latency 32MB = 126.77ns
So basically it looks like AMD still has the lead in memory latency (although
I don't have the latest greatest multi-socket Intel quads to compare).
Intel has a bigger transistor budget (with 2 pieces of silicon), yet AMD
looks to have the potential for better throughput with two 64-bit memory
buses per socket. Definitely a good battle that's going to benefit the
end user, at least for the short term.
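For context, single-load latencies like the ones above are typically measured with a dependent pointer chase over a randomly permuted array, so the prefetcher can't help and every load waits on the previous one (this is what lmbench's lat_mem_rd does in C). A minimal sketch of the technique — in Python the interpreter overhead dominates, so the absolute numbers are meaningless; it only illustrates the structure:

```python
import random
import time

def load_latency_ns(size_bytes, line=64, iters=200_000):
    # One element per cache line; a random cyclic permutation defeats
    # the hardware prefetcher, so each load depends on the previous one.
    n = max(2, size_bytes // line)
    perm = list(range(n))
    random.shuffle(perm)
    nxt = [0] * n
    for i in range(n):
        nxt[perm[i]] = perm[(i + 1) % n]
    j = 0
    t0 = time.perf_counter()
    for _ in range(iters):
        j = nxt[j]          # dependent load chain
    return (time.perf_counter() - t0) / iters * 1e9  # ns per load
```

Sweeping size_bytes from under the L2 size up past the L3 size is what produces latency tables like the ones quoted.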
> No, on Opteron it doesn't. The *bandwidth* depends on nearness, the
> *latency* pretty much depends on the last snoop coming back from the
> farthest socket.
I tried to prove that wrong by example, using numactl and related calls...
and failed. I did notice in today's news that Asus is bragging about
a dual socket board for Barcelona that has a split power plane (faster
memory controller and L3 cache) and dual HyperTransport connections between
the sockets.
> On systems with directory-based SMP protocols, things are different.
> That's probably what you're used to seeing -- SGI Origin, for example.
Indeed, a related code that measures bandwidth instead of latency produced this:
Of course my pstream code is embarrassingly parallel: each thread accesses a
local array and only communicates enough to make sure each stage of the
benchmark happens in sync. Hardly a good example to show off the Altix.
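For what it's worth, that structure — per-thread local arrays with just enough synchronization to keep the stages aligned — can be sketched as below. This is a hypothetical illustration, not the actual pstream code, and Python's GIL serializes the work; the real benchmark would be C/OpenMP:

```python
import threading
import time

def stream_copy_worker(n, barrier, results, tid):
    # Each thread allocates and touches its own arrays, so on a NUMA
    # system (with first-touch placement) the data ends up node-local.
    src = [float(i) for i in range(n)]
    dst = [0.0] * n
    barrier.wait()                 # all threads start the stage together
    t0 = time.perf_counter()
    dst[:] = src                   # the "copy" stage
    results[tid] = time.perf_counter() - t0

def run(nthreads=4, n=100_000):
    barrier = threading.Barrier(nthreads)
    results = [None] * nthreads
    threads = [threading.Thread(target=stream_copy_worker,
                                args=(n, barrier, results, t))
               for t in range(nthreads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return results                 # per-thread stage times
```

The barrier is the only inter-thread communication, which is why such a benchmark says little about a machine's interconnect.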