[Beowulf] latency and bandwidth micro benchmarks
Bill Broadley bill at cse.ucdavis.edu
Mon Aug 28 22:47:51 PDT 2006
On Tue, Aug 15, 2006 at 09:02:12AM -0400, Lawrence Stewart wrote:
> As has been mentioned here, the canonical bandwidth benchmark is
> streams.
Agreed.
> AFAIK, the canonical latency benchmark is lat_mem_rd, which is part of
> the lmbench suite.
Really? It seems like more of a prefetch test than a latency benchmark.
A fixed stride allows a guess at the (n+1)th address before the nth
address is loaded.
I ran the full lmbench:
Host       OS             Description       Mhz   tlb    cache  mem     scal
                                                  pages  line   par     load
                                                         bytes
---------  -------------  ----------------  ----  -----  -----  ------  ----
amd-2214   Linux 2.6.9-3  x86_64-linux-gnu  2199     32    128  4.4800     1
xeon-5150  Linux 2.6.9-3  x86_64-linux-gnu  2653      8    128  5.5500     1
Strangely, the Linux kernel disagrees with lmbench about the cache line
size on the AMD (from dmesg):
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
> Secondarily, streams is a compiler test of loop unrolling, software
> pipelining, and prefetch.
Indeed.
> Streams is easy meat for hardware prefetch units, since the access
> patterns are
> sequential, but that is OK. It is a bandwidth test.
Agreed.
> latency is much harder to get at. lat_mem_rd tries fairly hard to
> defeat hardware
> prefetch units by threading a chain of pointers through a random set
> of cache
> blocks. Other tests that don't do this get screwy results.
A random set of cache blocks?
You mean:
http://www.bitmover.com/lmbench/
I got the newest lmbench3.
The benchmark runs as two nested loops. The outer loop is the stride
size. The inner loop is the array size.
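To make that loop structure concrete, here is a minimal sketch of a
fixed-stride sweep. It is not lmbench's code: the names link_stride,
time_walk and sweep, the stride range (16 to 128 bytes) and the 4 KB
starting size are all assumptions picked for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Link a[0] -> a[stride] -> a[2*stride] -> ... -> a[0].
   Returns the number of loads in one trip round the chain. */
size_t link_stride(size_t *a, size_t n, size_t stride)
{
    assert(stride > 0 && n >= stride);
    size_t last = 0, links = 0;
    for (size_t i = 0; i + stride < n; i += stride) {
        a[i] = i + stride;
        last = i + stride;
        links++;
    }
    a[last] = 0;                     /* close the loop back to index 0 */
    return links + 1;
}

/* Walk the chain `passes` times and return nanoseconds per load.
   The volatile qualifier keeps the compiler from dropping the loads. */
double time_walk(const volatile size_t *a, size_t links, int passes)
{
    size_t p = 0;
    clock_t t0 = clock();
    for (int r = 0; r < passes; r++) {
        do {
            p = a[p];
        } while (p != 0);
    }
    clock_t t1 = clock();
    return 1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC
               / ((double)links * passes);
}

/* Outer loop over stride, inner loop over array size (assumed ranges). */
void sweep(size_t max_bytes)
{
    for (size_t stride = 16; stride <= 128; stride *= 2) {
        for (size_t bytes = 4096; bytes <= max_bytes; bytes *= 2) {
            size_t n = bytes / sizeof(size_t);
            size_t *a = malloc(n * sizeof *a);
            if (!a)
                return;
            size_t links = link_stride(a, n, stride / sizeof(size_t));
            printf("stride %zu size %zu: %.2f ns/load\n",
                   stride, bytes, time_walk(a, links, 10));
            free(a);
        }
    }
}
```

Calling sweep(64u << 20) would print one line per (stride, size) pair.
Since every link is exactly one stride ahead, the next address is
computable without waiting for the load, which is the property a
hardware prefetcher can exploit.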
The memory results:
Memory latencies in nanoseconds - smaller is better
    (WARNING - may not be correct, check graphs)
------------------------------------------------------------------------------
Host       OS             Mhz   L1 $    L2 $    Main mem  Rand mem  Guesses
---------  -------------  ----  ------  ------  --------  --------  -------
amd-2214   Linux 2.6.9-3  2199  1.3650  5.4940      68.4     111.3
xeon-5150  Linux 2.6.9-3  2653  1.1300  5.3000     101.5     114.2
> lat_mem_rd produces a graph, and it is easy to see the L1, L2, and
> main memory plateaus.
>
> This is all leadup to asking for lat_mem_rd results for Woodcrest
> (and Conroe, if there
> are any out there), and for dual-core Opterons (275)
The amd-2214 above is the DDR2 version of the Opteron 275.
My latency numbers with plat are 98.5 ns for a 38 MB array, a bit better
than lmbench's.
> With both streams and lat_mem_rd, one can run one copy or multiple
> copies, or use a
> single copy in multithread mode. Many cited test results I have been
> able to find use
> very vague english to describe exactly what they have tested. I
My code is pretty simple. For an array of N ints I do:
    while (p != 0)
    {
        p = a[p];
    }
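For anyone who wants to try this, here is one way the array could be
filled so that the loop visits every element exactly once in random
order. This is a sketch, not the code behind the numbers above: the
names build_chain and chase are made up, and using Sattolo's algorithm
with rand() is my assumption about how to get a single random cycle.

```c
#include <assert.h>
#include <stdlib.h>

/* Fill a[0..n-1] with a random permutation that forms one single cycle
   (Sattolo's algorithm), so a chase that starts at a[0] touches every
   element once before it comes back to index 0. */
void build_chain(int *a, int n, unsigned seed)
{
    assert(n > 1);
    srand(seed);
    for (int i = 0; i < n; i++)
        a[i] = i;
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % i;          /* 0 <= j < i, never j == i */
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}

/* Follow the chain until it returns to index 0. Each load depends on
   the previous one, so loads cannot overlap. Returns the load count. */
long chase(const int *a)
{
    long loads = 1;
    int p = a[0];
    while (p != 0) {
        p = a[p];
        loads++;
    }
    return loads;
}
```

Timing the call to chase() and dividing by the returned count gives
nanoseconds per load. rand() is a weak generator and is not uniform
for very large arrays, but any j < i still yields a single cycle.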
Chasing a chain like that is, to me, random memory latency. Although
doing a two-stage loop:
    for 0 to N pages
        pick a random page
        for 0 to M (cachelines per page)
            pick a random cacheline
would minimize the time spent on page overhead.
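A sketch of that two-stage ordering, under some assumptions: the
function name build_paged_chain and its parameters are hypothetical,
one int per cache line is used as the link, and ints_per_line would be
16 for 64-byte lines with 4-byte ints.

```c
#include <assert.h>
#include <stdlib.h>

/* Fisher-Yates shuffle of v[0..n-1]. */
static void shuffle(int *v, int n)
{
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int tmp = v[i];
        v[i] = v[j];
        v[j] = tmp;
    }
}

/* Link one int per cache line into a single cycle. Pages are visited
   in random order and, inside each page, lines in random order, so all
   lines of a page are used before moving on to the next page.
   a must hold pages * lines_per_page * ints_per_line ints.
   Returns 0 on success, -1 if allocation fails. */
int build_paged_chain(int *a, int pages, int lines_per_page,
                      int ints_per_line)
{
    assert(pages > 0 && lines_per_page > 0 && ints_per_line > 0);
    int *page_order = malloc(pages * sizeof *page_order);
    int *line_order = malloc(lines_per_page * sizeof *line_order);
    if (!page_order || !line_order) {
        free(page_order);
        free(line_order);
        return -1;
    }
    for (int i = 0; i < pages; i++)
        page_order[i] = i;
    shuffle(page_order, pages);

    int first = -1, prev = -1;
    for (int p = 0; p < pages; p++) {
        for (int i = 0; i < lines_per_page; i++)
            line_order[i] = i;
        shuffle(line_order, lines_per_page);
        for (int i = 0; i < lines_per_page; i++) {
            int idx = (page_order[p] * lines_per_page + line_order[i])
                      * ints_per_line;
            if (prev < 0)
                first = idx;         /* remember where the cycle starts */
            else
                a[prev] = idx;       /* previous line points at this one */
            prev = idx;
        }
    }
    a[prev] = first;                 /* close the cycle */
    free(page_order);
    free(line_order);
    return 0;
}
```

Index 0 is part of the cycle, so the same while (p != 0) loop works
starting from p = a[0]. With 4 KB pages and 64-byte lines that would be
build_paged_chain(a, pages, 64, 16).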
> prefer running
> two copies of stream rather than using OpenMP - I want to measure
> bandwidth, not
> inter-core synchronization.
I prefer it synchronized. Otherwise two streams might get out of sync:
one gets 8GB/sec and the other gets 8GB/sec, but they didn't do it at
the same time. In my benchmark I take the min of all start times and the
max of all stop times. That way there is no cheating.
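As a sketch of that accounting (the struct and function names are made
up for illustration, and the times are assumed to come from one clock
shared by all threads):

```c
#include <assert.h>
#include <stddef.h>

/* Timing record for one thread; hypothetical layout. */
struct run_time {
    double start;   /* seconds, from a clock common to all threads */
    double stop;
    double bytes;   /* bytes moved by this thread */
};

/* Aggregate bandwidth in bytes/sec over the window that runs from the
   earliest start to the latest stop of any thread. */
double aggregate_bandwidth(const struct run_time *t, size_t n)
{
    assert(n > 0);
    double first = t[0].start, last = t[0].stop, total = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (t[i].start < first)
            first = t[i].start;
        if (t[i].stop > last)
            last = t[i].stop;
        total += t[i].bytes;
    }
    assert(last > first);
    return total / (last - first);
}
```

Two threads that each move 8 GB in one second, but one after the other,
come out at 8 GB/sec total rather than 16 GB/sec, because the window
covers both runs.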
> I'm interested in results for a single thread, but I am also
> interested in results for
> multiple threads on dual-core chips and in machines with multiple
> sockets of single
> or dual core chips.
Since you're spending most of your time waiting on DRAM, there isn't much
contention:
http://cse.ucdavis.edu/~bill/intel-1vs4t.png
> The bandwidth of a two-socket single-core machine, for example,
> should be nearly twice
> the bandwidth of a single-socket dual-core machine simply because the
> threads are
> using different memory controllers.
Judge for yourself:
http://cse.ucdavis.edu/~bill/quad-numa.png (quad opteron)
http://cse.ucdavis.edu/~bill/altix-dplace.png
http://cse.ucdavis.edu/~bill/intel-5150.png (woodcrest + ddr2-667)
> Is this borne out by tests?
> Four threads on
> a dual-dual should give similar bandwidth per core to a single socket
> dual-core. True?
Yes. Alas, I don't have graphs of single-socket dual-core systems
handy.
> Next, considering a dual-core chip, to the extent that a single core
> can saturate the memory
> controller, when both cores are active, there should be a substantial
> drop in bandwidth
> per core.
Right.
> Latency is much more difficult. I would expect that dual-core
> lat_mem_rd results with
> both cores active should show only a slight degradation of latency,
> due to occasional
> bus contention or resource scheduling conflicts between the cores. A
> single memory
> controller should be able to handle pointer chasing activity from
> multiple cores. True?
Right, see the graph above for 1 vs 4 threads.
> Our server farm here is all dual-processor single core (Opteron 248)
> and they seem
> to behave as expected: running two copies of stream gives nearly
> double performance,
> and the latency degradation due to running two copies of lat_mem_rd
> is nearly
> indetectable. We don't have any dual-core chips or any Intel chips.
Right.
--
Bill Broadley
Computational Science and Engineering
UC Davis
