[Beowulf] Memory latency (was woodcrest)
bill at cse.ucdavis.edu
Thu Aug 17 14:52:47 PDT 2006
For those interested in latency:
I wrote a pthread-based latency tester in which each thread accesses N
integers in random order.
All the numbers below are for N=1,000,000 integers. Every integer is
loaded exactly once.
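The tester's source is not included in this post, so the following is only a sketch of one common way to get "every integer loaded exactly once, in random order". It assumes a pointer-chasing layout: the array holds a random permutation that forms a single cycle (built with Sattolo's algorithm), so each load depends on the previous one. The names build_cycle, chase and xorshift32 are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small xorshift PRNG so the sketch does not depend on rand(). */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill next[] with a random permutation that is one single cycle
 * (Sattolo's algorithm). Following next[] from index 0 then visits
 * every element exactly once before returning to 0. */
void build_cycle(int *next, size_t n, uint32_t seed)
{
    assert(n > 0 && seed != 0);
    for (size_t i = 0; i < n; i++)
        next[i] = (int)i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = xorshift32(&seed) % i;   /* j in [0, i-1] */
        int tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Dependent loads: the address of each access comes from the previous
 * one, so the loads cannot be overlapped by the CPU. Returns the index
 * reached after n steps, which is 0 for a full cycle. */
size_t chase(const int *next, size_t n)
{
    size_t idx = 0;
    for (size_t step = 0; step < n; step++)
        idx = (size_t)next[idx];
    return idx;
}
```

In a real run each thread would get its own array, and the time spent in chase() divided by n would give that thread's latency per access.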
The first number is the latency per thread, so it increases with memory
contention. The second number is the "effective" ns, where I divide
the run time of all threads by the number of integers retrieved.
It should decrease as threads are added if the machine has the CPU
and memory system parallelism to avoid contention.
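As a rough illustration of the two numbers, here is a sketch of the arithmetic. It uses the run time definition given in the notes of this post, max(finishtimes)-min(starttimes), and it assumes that "integers retrieved" means N per thread times the number of threads. The function names and the use of nanosecond timestamps stored as doubles are my own choices, not taken from the tester.

```c
#include <assert.h>
#include <stddef.h>

/* Wall-clock span covering all threads: max(finish) - min(start). */
double run_time_ns(const double *start, const double *finish, size_t nthreads)
{
    assert(nthreads > 0);
    double first = start[0], last = finish[0];
    for (size_t t = 1; t < nthreads; t++) {
        if (start[t] < first) first = start[t];
        if (finish[t] > last) last = finish[t];
    }
    return last - first;
}

/* First number: what one thread sees per access. */
double per_thread_ns(double start, double finish, size_t n_per_thread)
{
    return (finish - start) / (double)n_per_thread;
}

/* Second number: total run time divided by all integers retrieved
 * (assumed here to be nthreads * n_per_thread). */
double effective_ns(const double *start, const double *finish,
                    size_t nthreads, size_t n_per_thread)
{
    return run_time_ns(start, finish, nthreads)
           / ((double)nthreads * (double)n_per_thread);
}
```

With one thread the two numbers coincide, which matches the 1 thread column below. With perfect scaling the effective number would be the per-thread number divided by the thread count.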
(per-thread ns / effective ns)
                             1 thread         2 threads        4 threads
Dual Opteron 275             83.69/83.69      80/52.08         85/21.72
Quad opteron 846             108.07/108.07    115/61.39        110/27.89
Dual Woodcrest-2.66          107.18/107.18    108/54.03        118/29.69
Dual core amd64-2.2GHz       89.45/89.45      89.45/44.72      145/52.76
AMD64 3200-2.0GHz            69.74/69.74      69/69.31         137/69.85
Dual socket nacoma 3.4GHz    130.45/130.45    133/66.72        230/67.72
Dual core p4-3.0             115.45/115.46    185/101.03       283/92.67
Dual it2-1.4GHz              200.47/200.47    203/101.92       362/101.57
I'm happy to say that Pathscale, Intel, GCC-3, and GCC-4 all give
nearly identical performance, although I had to be very careful with
Pathscale to keep the benchmark routine from being optimized away.
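The post does not say how the routine was protected from the optimizer. One common approach, shown here only as a sketch, is to fold every loaded value into a result and store that result to a volatile object, since a volatile store has to be performed. The names sink and sum_loads are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A store to a volatile object is observable behavior, so the compiler
 * has to compute the value that gets written here. */
volatile long sink;

/* Read every element and publish the sum through the volatile sink.
 * Because the stored value depends on every element, the compiler
 * cannot discard the loop as dead code even when the caller ignores
 * the return value. */
long sum_loads(const int *data, size_t n)
{
    assert(data != NULL);
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    sink = sum;
    return sum;
}
```

This keeps the work alive but does not stop the compiler from reordering or vectorizing independent loads, so a dependent chain of loads is still needed if the goal is to measure latency rather than bandwidth.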
Anyone have a Rev F opteron handy?
 Where runtime = max(finishtimes)-min(starttimes)
 Dual socket, dual core = 4 cores
 Quad socket, single core = 4 cores
 Single core/single socket = 1 core
 Dual core/single socket = 2 cores
 Dual socket, single core = 2 cores.
Computational Science and Engineering