[Beowulf] Memory latency (was woodcrest)

Kozin, I (Igor) i.kozin at dl.ac.uk
Fri Aug 18 04:07:54 PDT 2006


These are long integers, right? Otherwise, with 4-byte ints, the
1,000,000-element array is about 4 MB and could almost fit into the
4 MB cache of an Itanium or Woodcrest.

However, even long integers would not be enough to overflow the cache
of a Power5, which has a massive 36 MB L3 cache. By the way, we have
Power5 and PowerPC machines available if you want to try them.
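
For what it's worth, here is a rough sketch of the arithmetic behind that
guess. The element sizes (4 bytes for int, 8 bytes for long) and the
reading of "MB" as binary megabytes are my assumptions, and the function
names are made up for illustration. 1,000,000 4-byte ints come to
4,000,000 bytes, which is just under 4 x 2^20 = 4,194,304 bytes, so the
fit is borderline once anything else shares the cache.

```c
#include <stddef.h>
#include <stdio.h>

/* Bytes occupied by n array elements of elem_size bytes each. */
size_t working_set_bytes(size_t n, size_t elem_size)
{
    return n * elem_size;
}

/* Returns 1 if the whole array fits in a cache of cache_bytes, else 0.
   This ignores everything else competing for the cache (assumption). */
int fits_in_cache(size_t n, size_t elem_size, size_t cache_bytes)
{
    return working_set_bytes(n, elem_size) <= cache_bytes;
}

/* Prints one line comparing an array against a cache size. */
void print_fit(const char *label, size_t n, size_t elem_size,
               size_t cache_bytes)
{
    printf("%s: %zu bytes %s a %zu-byte cache\n", label,
           working_set_bytes(n, elem_size),
           fits_in_cache(n, elem_size, cache_bytes) ? "fits in" : "exceeds",
           cache_bytes);
}
```

Calling print_fit("8-byte long", 1000000, 8, (size_t)36 << 20) covers the
Power5 case: 8,000,000 bytes is well inside a 36 MB L3, so a larger N
would be needed there to measure main memory rather than cache.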

> I wrote a pthread-based latency tester in which each thread accesses
> N integers in random order, loading each member of the array exactly
> once.  All the numbers below are for N=1,000,000 integers.
> 
> The first number is the latency per thread, so it increases with memory
> contention.  The second number is the "effective" ns, where I take
> the run time[1] of all threads and divide it by the number of integers
> retrieved.  It should decrease as threads are added if the machine has
> the CPU and memory system parallelism to avoid contention.
> 
> (each cell: per-thread latency / effective latency, in ns)
>
>                               1 thread        2 threads       4 threads
> Dual Opteron 275[2]           83.69/83.69     80/52.08        85/21.72
> Quad Opteron 846[3]           108.07/108.07   115/61.39       110/27.89
> Dual Woodcrest-2.66[2]        107.18/107.18   108/54.03       118/29.69
> Dual core amd64-2.2GHz[5]     89.45/89.45     89.45/44.72     145/52.76
> AMD64 3200[4]-2.0GHz          69.74/69.74     69/69.31        137/69.85
> Dual socket Nocona 3.4GHz[6]  130.45/130.45   133/66.72       230/67.72
> Dual core p4-3.0[6]           115.45/115.46   185/101.03      283/92.67
> Dual it2-1.4GHz[6]            200.47/200.47   203/101.92      362/101.57
> 
> I'm happy to say that Pathscale, Intel, GCC-3, and GCC-4 all deliver
> nearly identical performance, although I had to be very careful with
> Pathscale to keep the benchmark routine from being optimized away.
> 
> Anyone have a Rev F Opteron handy?
> 
> [1] Where runtime = max(finishtimes)-min(starttimes)
> [2] Dual socket, dual core = 4 cores
> [3] Quad socket, single core = 4 cores
> [4] Single core/single socket = 1 core
> [5] Dual core/single socket = 2 cores
> [6] Dual socket, single core = 2 cores.
> 
> -- 
> Bill Broadley
> Computational Science and Engineering
> UC Davis
> 
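
To make the two numbers in the table concrete, here is a minimal sketch
of a tester along the lines described above. It is not Bill's code. In
particular, the random order is produced by chasing indices through a
single random cycle (Sattolo's shuffle), which is my assumption about one
reasonable way to load every element exactly once; the xorshift
generator, the struct layouts and all function names are likewise
illustrative.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

typedef struct {
    size_t n;           /* elements in this thread's array */
    size_t *next;       /* next[i] is the index loaded after i */
    uint64_t seed;      /* per-thread shuffle seed, must be nonzero */
    int64_t start_ns;   /* timestamp just before the timed loop */
    int64_t finish_ns;  /* timestamp just after it */
    size_t sink;        /* last index; storing it keeps the loop alive */
} worker_t;

typedef struct {
    double per_thread_ns; /* mean over threads of (finish - start) / n */
    double effective_ns;  /* (max finish - min start) / total loads */
} result_t;

static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

static uint64_t xorshift64(uint64_t *s)
{
    uint64_t x = *s;
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    return *s = x;
}

/* Sattolo's algorithm: a random permutation made of one cycle of length n,
   so following next[] visits every element exactly once. */
void make_cycle(size_t *next, size_t n, uint64_t *seed)
{
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n; i-- > 1; ) {           /* i runs from n-1 down to 1 */
        size_t j = (size_t)(xorshift64(seed) % i);   /* 0 <= j < i */
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
}

static void *worker(void *arg)
{
    worker_t *w = arg;
    make_cycle(w->next, w->n, &w->seed); /* untimed; first touch is local */
    size_t i = 0;
    w->start_ns = now_ns();
    for (size_t k = 0; k < w->n; k++)
        i = w->next[i];                  /* each load depends on the last */
    w->finish_ns = now_ns();
    w->sink = i;
    return NULL;
}

/* Runs nthreads workers over n elements each; returns 0 on success. */
int run_latency_test(int nthreads, size_t n, result_t *out)
{
    if (nthreads < 1 || n < 1 || out == NULL)
        return -1;
    pthread_t *tid = malloc((size_t)nthreads * sizeof *tid);
    worker_t *w = calloc((size_t)nthreads, sizeof *w);
    if (!tid || !w) { free(tid); free(w); return -1; }
    int ok = 1, started = 0;
    for (int t = 0; t < nthreads; t++) {
        w[t].n = n;
        w[t].seed = 0x9E3779B97F4A7C15ull + (uint64_t)t;
        w[t].next = malloc(n * sizeof *w[t].next);
        if (!w[t].next) ok = 0;
    }
    while (ok && started < nthreads) {
        if (pthread_create(&tid[started], NULL, worker, &w[started]) != 0)
            ok = 0;
        else
            started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
    if (ok) {
        int64_t first = w[0].start_ns, last = w[0].finish_ns;
        double sum = 0.0;
        for (int t = 0; t < nthreads; t++) {
            if (w[t].start_ns < first) first = w[t].start_ns;
            if (w[t].finish_ns > last) last = w[t].finish_ns;
            sum += (double)(w[t].finish_ns - w[t].start_ns) / (double)n;
            if (w[t].sink != 0) ok = 0;  /* a full cycle ends where it began */
        }
        out->per_thread_ns = sum / nthreads;
        out->effective_ns = (double)(last - first) / ((double)nthreads * (double)n);
    }
    for (int t = 0; t < nthreads; t++) free(w[t].next);
    free(tid); free(w);
    return ok ? 0 : -1;
}
```

Something like run_latency_test(4, 1000000, &r), built with cc -O2 -pthread,
would correspond to the "4 threads" column. With perfect overlap the
effective figure approaches the per-thread latency divided by the thread
count, e.g. 85 ns / 4 is about 21 ns, close to the 21.72 ns reported for
the dual Opteron 275. The sketch does not synchronize the start of the
timed loops; a barrier before the first timestamp would tighten that.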



