[Beowulf] latency and bandwidth micro benchmarks
larry.stewart at sicortex.com
Tue Aug 15 06:02:12 PDT 2006
As has been mentioned here, the canonical bandwidth benchmark is streams.
AFAIK, the canonical latency benchmark is lat_mem_rd, which is part of
the lmbench suite.
Streams is ultimately a test of the bandwidth path between the drams
and the core
in that if you turn up the buffer size sufficiently high, you will
overflow any cache.
If you keep turning it up enough above that, you will wash out the
such as not needing to write the dirty cache lines at the end of the
Secondarily, streams is a compiler test of loop unrolling, software
Streams is easy meat for hardware prefetch units, since the access
pattern is sequential, but that is OK. It is a bandwidth test.
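To make the buffer-size point concrete, here is a rough sketch of the
arithmetic behind a streams-style number. The function names are mine, not
anything from the benchmark source. The words-per-iteration table follows
the usual streams convention of counting reads plus writes per element,
and the factor of 4 over the largest cache is the commonly cited sizing
rule of thumb; treat both as assumptions and check them against the
version you actually run.

```python
# Sketch only: arithmetic for sizing and reporting a streams-style run.
# Names here are hypothetical helpers, not part of the benchmark itself.

# Array elements read plus written per loop iteration, per kernel
# (assumed to match the usual streams accounting).
WORDS_PER_ITERATION = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def min_array_elements(largest_cache_bytes, element_bytes=8, factor=4):
    """Elements per array so each array is `factor` times the largest cache.

    factor=4 is an assumed rule of thumb; turning it up further washes out
    more of the cache effects.
    """
    return (factor * largest_cache_bytes) // element_bytes


def bandwidth_mb_per_s(kernel, n_elements, seconds, element_bytes=8):
    """Convert one timed pass over the arrays into MB/s (1 MB = 10**6 bytes)."""
    bytes_moved = WORDS_PER_ITERATION[kernel] * element_bytes * n_elements
    return bytes_moved / seconds / 1e6
```

For example, with a 1 MB cache this asks for at least 524288 doubles per
array, and a triad pass over 2,000,000 doubles that takes 0.5 s works out
to 96 MB/s.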
Latency is much harder to get at. lat_mem_rd tries fairly hard to defeat
prefetch units by threading a chain of pointers through a random set of
blocks. Other tests that don't do this get screwy results.
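The pointer-chain idea can be sketched as follows. This is not the
lat_mem_rd source (which is C, and whose details may differ); it only
illustrates how a single random cycle through a set of blocks makes every
load depend on the previous one, so a prefetcher has nothing sequential to
latch onto. build_chain and chase are names I made up. Timing this in
Python would mostly measure the interpreter, so a real measurement should
do the walk in C.

```python
import random


def build_chain(n_blocks, seed=0):
    """Return next_index describing one random cycle over all blocks.

    Starting at any block and following next_index visits every block
    exactly once, in shuffled order, before returning to the start.
    """
    rng = random.Random(seed)
    order = list(range(n_blocks))
    rng.shuffle(order)
    next_index = [0] * n_blocks
    # Link each block to its successor in the shuffled order, wrapping
    # around at the end so the chain closes into a single cycle.
    for i in range(n_blocks):
        next_index[order[i]] = order[(i + 1) % n_blocks]
    return next_index


def chase(next_index, steps, start=0):
    """Follow the chain for `steps` hops; each hop depends on the last."""
    p = start
    for _ in range(steps):
        p = next_index[p]
    return p
```

Walking exactly n_blocks hops from block 0 returns to block 0, which is a
cheap sanity check that the chain really covers the whole buffer rather
than falling into a short loop that fits in cache.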
lat_mem_rd produces a graph, and it is easy to see the L1, L2, and
main memory plateaus.
This is all leadup to asking for lat_mem_rd results for Woodcrest (and
Conroe, if there are any out there), and for dual-core Opterons (275).
With both streams and lat_mem_rd, one can run one copy or multiple
copies, or use a single copy in multithread mode. Many cited test
results I have been able to find use very vague English to describe
exactly what they have tested. I
two copies of stream rather than using OpenMP - I want to measure
inter-core synchronization. For lat_mem_rd, the -P 2 switch seems fine;
it just forks two copies of the test.
I'm interested in results for a single thread, but I am also interested
in results for multiple threads on dual-core chips and in machines with
multiple sockets of single- or dual-core chips.
The bandwidth of a two-socket single-core machine, for example,
should be nearly twice
the bandwidth of a single-socket dual-core machine simply because the
using different memory controllers. Is this borne out by tests?
Four threads on
a dual-dual should give similar bandwidth per core to a single socket
Next, considering a dual-core chip, to the extent that a single core
can saturate the memory
controller, when both cores are active, there should be a substantial
drop in bandwidth
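The expectations above can be written down as a toy model. It assumes one
memory controller per socket, all accesses local to the socket, and a core
that is limited only by its own peak rate or by its share of the
controller. The function name and the numbers in the example are
hypothetical, not measurements.

```python
def expected_bandwidth(sockets, cores_per_socket, single_core_bw, controller_bw):
    """Idealized aggregate and per-core bandwidth, in the units passed in.

    Assumptions (not measured): one memory controller per socket, no
    remote accesses, and cores on a socket share that controller evenly.
    """
    # A socket delivers what its cores ask for, capped by its controller.
    per_socket = min(cores_per_socket * single_core_bw, controller_bw)
    total = sockets * per_socket
    per_core = total / (sockets * cores_per_socket)
    return total, per_core
```

With made-up figures of 4.0 per core and 5.0 per controller, a two-socket
single-core box gives (8.0, 4.0), a single-socket dual-core gives
(5.0, 2.5), and a dual-dual gives (10.0, 2.5): the same per-core figure as
the single-socket dual-core, and a clear per-core drop once one core can
nearly saturate the controller by itself. Whether real machines follow
this is exactly the question.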
Latency is much more difficult. I would expect that dual-core lat_mem_rd
results with both cores active should show only a slight degradation of
latency, due to occasional bus contention or resource scheduling
conflicts between the cores. A memory controller should be able to
handle pointer chasing activity from multiple cores. True?
Our server farm here is all dual-processor single core (Opteron 248)
and they seem
to behave as expected: running two copies of stream gives nearly
and the latency degradation due to running two copies of lat_mem_rd is
undetectable. We don't have any dual-core chips or any Intel chips.