[Beowulf] latency and bandwidth micro benchmarks
Lawrence Stewart larry.stewart at sicortex.com
Tue Aug 15 06:02:12 PDT 2006
- Previous message: [Beowulf] DC Power Dist. Yields 20%
- Next message: [Beowulf] latency and bandwidth micro benchmarks
As has been mentioned here, the canonical bandwidth benchmark is STREAM. AFAIK, the canonical latency benchmark is lat_mem_rd, which is part of the lmbench suite.

STREAM is ultimately a test of the bandwidth path between the DRAMs and the core: if you turn the buffer size up sufficiently high, you will overflow any cache. If you keep turning it up well beyond that, you will wash out edge effects such as not needing to write back the dirty cache lines at the end of the test. Secondarily, STREAM is a compiler test of loop unrolling, software pipelining, and prefetch. STREAM is easy meat for hardware prefetch units, since the access patterns are sequential, but that is OK. It is a bandwidth test.

Latency is much harder to get at. lat_mem_rd tries fairly hard to defeat hardware prefetch units by threading a chain of pointers through a random set of cache blocks. Other tests that don't do this get screwy results. lat_mem_rd produces a graph, and it is easy to see the L1, L2, and main memory plateaus.

This is all leadup to asking for lat_mem_rd results for Woodcrest (and Conroe, if there are any out there), and for dual-core Opterons (275).

With both STREAM and lat_mem_rd, one can run one copy or multiple copies, or use a single copy in multithread mode. Many of the cited test results I have been able to find use vague English to describe exactly what was tested. I prefer running two copies of STREAM rather than using OpenMP, because I want to measure bandwidth, not inter-core synchronization. For lat_mem_rd, the -P 2 switch seems fine; it just forks two copies of the test.

I'm interested in results for a single thread, but I am also interested in results for multiple threads on dual-core chips and in machines with multiple sockets of single-core or dual-core chips. The bandwidth of a two-socket single-core machine, for example, should be nearly twice the bandwidth of a single-socket dual-core machine, simply because the threads are using different memory controllers. Is this borne out by tests? Four threads on a dual-socket dual-core machine should give similar bandwidth per core to a single-socket dual-core machine. True?

Next, considering a dual-core chip: to the extent that a single core can saturate the memory controller, there should be a substantial drop in bandwidth per core when both cores are active.

Latency is much more difficult. I would expect dual-core lat_mem_rd results with both cores active to show only a slight degradation of latency, due to occasional bus contention or resource scheduling conflicts between the cores. A single memory controller should be able to handle pointer-chasing activity from multiple cores. True?

Our server farm here is all dual-processor single-core (Opteron 248), and the machines seem to behave as expected: running two copies of STREAM gives nearly double the performance, and the latency degradation due to running two copies of lat_mem_rd is nearly undetectable. We don't have any dual-core chips or any Intel chips.

-Larry
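For anyone who wants to see the pointer-chasing idea in isolation, here is a minimal sketch in C. It is not the lmbench source, only an illustration of the technique described above: one pointer per cache line, linked into a single random cycle so that each load depends on the previous one. The 64-byte line size and the names node, build_chain, chase, and ns_per_load are assumptions made for this example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define LINE_BYTES 64   /* assumed cache line size; adjust for the target CPU */

/* One chain element per (assumed) cache line. */
typedef struct node {
    struct node *next;
    char pad[LINE_BYTES - sizeof(struct node *)];
} node;

/* Link n nodes into ONE random cycle (Sattolo's algorithm), so a walk
 * visits every node before repeating and a sequential prefetcher has
 * no regular stride to follow. rand() is only a placeholder generator;
 * on platforms with a small RAND_MAX the shuffle is weak for large n. */
static void build_chain(node *nodes, size_t n)
{
    assert(n > 1);
    for (size_t i = 0; i < n; i++)
        nodes[i].next = &nodes[i];
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;      /* 0 <= j < i */
        node *tmp = nodes[i].next;
        nodes[i].next = nodes[j].next;
        nodes[j].next = tmp;
    }
}

/* Follow the chain for the given number of dependent loads. */
static node *chase(node *p, size_t steps)
{
    while (steps--)
        p = p->next;
    return p;
}

/* Average nanoseconds per dependent load over `steps` loads. */
static double ns_per_load(node *nodes, size_t n, size_t steps)
{
    struct timespec t0, t1;
    assert(n > 1 && steps > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    node *volatile sink = chase(&nodes[0], steps);  /* keep the result live */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}
```

Calling build_chain on a buffer and then ns_per_load for a range of buffer sizes (from a few KB up to well past the last-level cache) should reproduce the plateau shape in rough form. This sketch does nothing about TLB misses or page placement, so the absolute numbers will not necessarily match lat_mem_rd.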
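On the bandwidth side, the kind of loop STREAM times can be sketched the same way. This is not the official STREAM code and does not follow its run rules; it is a single-threaded triad kernel with a simple best-of-N timer, and the names triad and triad_gbps are made up for the example. The rate counts two reads and one write of a double per element.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Triad-style kernel: a[i] = b[i] + q * c[i]. */
static void triad(double *a, const double *b, const double *c,
                  double q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}

/* Best observed triad rate in GB/s over `reps` passes on arrays of n
 * doubles. Returns -1.0 if allocation fails. n should be large enough
 * that the three arrays together overflow every cache level. */
static double triad_gbps(size_t n, int reps)
{
    assert(n > 0 && reps > 0);
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    if (!a || !b || !c) {
        free(a); free(b); free(c);
        return -1.0;
    }
    /* Touch every page before timing so page faults are not measured. */
    for (size_t i = 0; i < n; i++) {
        a[i] = 0.0; b[i] = 1.0; c[i] = 2.0;
    }
    double best = 0.0;
    for (int r = 0; r < reps; r++) {
        double t0 = now_sec();
        triad(a, b, c, 3.0, n);
        double dt = now_sec() - t0;
        if (dt > 0.0) {
            double rate = 3.0 * (double)n * sizeof(double) / dt / 1e9;
            if (rate > best)
                best = rate;
        }
    }
    volatile double sink = a[n / 2];   /* keep the stores observable */
    (void)sink;
    free(a); free(b); free(c);
    return best;
}
```

To follow the two-copies approach rather than OpenMP, one would build this into a small program and start two separate processes at the same time, then compare the per-process rate against a single-process run.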
