Robert G. Brown
rgb at phy.duke.edu
Wed Mar 12 07:19:58 PST 2003
On Wed, 12 Mar 2003, Mark Hahn wrote:
> > >> PS Pentium 4 sustained performance from memory is about
> > >> 5% of peak (stream triad).
> > >
> > >that should be 50%, I think.
> > Nope ... not "from memory".
> > A 2.8 GHz P4 using SSE2 instructions can deliver two
> > 64-bit floating point results per clock or 5.6 Gflops
> > peak performance at this clock. The stream triad (a
> > from-memory, multiply-add operation) for a 2.8 GHz
> > P4 produces only 200 Mflops (see stream website). The
> > arithmetic is then:
> > 200/5600 = .0357 or 3.57% (so 5% is a gift)
> oh, I see. to me, that's a strange definition of "peak",
> since stream is, by intention, always bottlenecked on
> memory bandwidth, since its FSB is either 3.2 or 4.3 GB/s.
> it'll deliver roughly 50% of that to stream.
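The arithmetic in the exchange above can be checked in a few lines. This is a sketch only. The function names are made up for illustration, and it assumes the usual stream accounting for the triad, a[i] = b[i] + q*c[i]: 2 flops and 24 bytes (three 64-bit words) per iteration. The 5.6 Gflops peak, the 200 Mflops triad figure, the 3.2 and 4.3 GB/s FSB numbers and the "roughly 50%" delivered fraction are the ones quoted above.

```python
# Hypothetical helper names; the input numbers are the ones quoted in the thread.

TRIAD_FLOPS_PER_ITER = 2       # one multiply and one add
TRIAD_BYTES_PER_ITER = 3 * 8   # read b[i] and c[i], write a[i], 64 bits each


def fraction_of_peak(sustained_mflops, peak_mflops):
    """Sustained rate as a fraction of theoretical peak."""
    return sustained_mflops / peak_mflops


def triad_mflops_from_bandwidth(bandwidth_gb_per_s):
    """Triad Mflops if memory traffic were the only limit (assumed model)."""
    iters_per_s = bandwidth_gb_per_s * 1e9 / TRIAD_BYTES_PER_ITER
    return iters_per_s * TRIAD_FLOPS_PER_ITER / 1e6


if __name__ == "__main__":
    peak_mflops = 2 * 2800  # two 64-bit results per clock at 2.8 GHz
    print(f"{100 * fraction_of_peak(200, peak_mflops):.2f}% of peak")
    for fsb in (3.2, 4.3):
        # assume roughly half of the FSB bandwidth is delivered to stream
        bound = triad_mflops_from_bandwidth(0.5 * fsb)
        print(f"FSB {fsb} GB/s -> about {bound:.0f} Mflops triad")
```

Under those assumptions the bandwidth-bound estimates come out at roughly 133 and 179 Mflops, the same order as the 200 Mflops quoted. The two "fractions" in the exchange are therefore measured against different ceilings: the ALUs in one case, the memory bus in the other.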
Stream also generally does not deliver worst case memory bandwidth --
far from it. A "streaming" pass through memory allows all sorts of
prefetch optimizations to be done that partially parallelize the memory
access times, as does e.g. SDRAM's basic architecture (which
accomplishes similar parallelization on a smaller scale).
Worst case performance comes when doing a stream-like test on numbers
stored at randomly selected locations in memory, defeating all such
optimizations and parallelizations, and maybe even adding in additional
penalties as systems don't LIKE for their predictions to all be wrong
and might make you pay extra for flushing all their prefilled/prefilling
cache buffers that on average save them time for streaming access.
This makes a BIG BIG difference. I measure (note "measure" as opposed
to theoretically estimate according to CPU manufacturer's best case
design specifications:-) a factor of 58 difference on an 800 MHz
Celeron laptop: from around 1.2 nanoseconds (roughly the inverse of the
clock) to repeatedly access a single int, out to about 68 nanoseconds
to repeatedly grab a random int from "anywhere" in a vector of 1000000
ints.
I may yet introduce another degree of indirect addressing into the
implementation of stream in my benchmarking tool so that I can do a real
"random stream" test. One will have to be very careful not to compare
apples to oranges in the numbers it returns, though. The system will
definitely implement a straight iteration of sequential accesses to
vector addresses allocated permanently in the data segment of a program
differently from indirectly indexed accesses (one needs a shufflable
indexing vector to provide the next index in the loop) to a vector
malloc'd in the dynamic portion of memory (to permit tests with
variable length vectors), with an extra step of pointer resolution
required per access.
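The extra level of indirection described above might look something like the following. It is a sketch of the idea only, written in Python for brevity rather than the C a real benchmark would use. The names (`shuffled_index`, `triad_direct`, `triad_indirect`, `seconds_per_element`) are hypothetical and not taken from any existing tool. In an interpreter the per-element overhead hides most of the memory effects, so this shows the structure of the loops, not a trustworthy timing.

```python
import random
import time
from array import array


def shuffled_index(n, seed=0):
    """Return the indices 0..n-1 in random order (the 'shufflable' vector)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx


def triad_direct(a, b, c, q):
    """Ordinary stream triad: one sequential pass through the vectors."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_indirect(a, b, c, q, idx):
    """Same arithmetic, but idx supplies the next element to touch."""
    for k in range(len(idx)):
        i = idx[k]              # the extra lookup paid on every access
        a[i] = b[i] + q * c[i]


def seconds_per_element(fn, *args, n):
    """Wall-clock time of one call to fn, divided by the vector length."""
    start = time.perf_counter()
    fn(*args)
    return (time.perf_counter() - start) / n


if __name__ == "__main__":
    n = 1_000_000
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    a_seq = array("d", [0.0]) * n
    a_ident = array("d", [0.0]) * n
    a_rand = array("d", [0.0]) * n

    t_seq = seconds_per_element(triad_direct, a_seq, b, c, 3.0, n=n)
    # Indirect but unshuffled: isolates the cost of the index lookup itself.
    t_ident = seconds_per_element(
        triad_indirect, a_ident, b, c, 3.0, list(range(n)), n=n)
    t_rand = seconds_per_element(
        triad_indirect, a_rand, b, c, 3.0, shuffled_index(n), n=n)

    assert a_seq == a_ident == a_rand  # every order computes the same vector
    print(f"direct sequential:   {t_seq * 1e9:.1f} ns/element")
    print(f"indirect sequential: {t_ident * 1e9:.1f} ns/element")
    print(f"indirect shuffled:   {t_rand * 1e9:.1f} ns/element")
```

The unshuffled indirect pass is there because of the apples-to-oranges problem: the indirect loop pays for the index lookup even when the order is sequential, so the fairer comparison is indirect sequential against indirect shuffled, not direct against shuffled.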
If/when I do, I wouldn't be at all surprised to find that the best
measured bandwidth for a single triad on a vector of length 1 (per
access) is well over 100 times faster than random access triads on a
vector of length 10^8. P4's are likely to be even worse in this regard
than a Celeron because there is just plain more to flush, more to
> > As you suggest, the P4 will (as does the Cray X1) do
> > significantly better when cache use/re-use is a
> > significant factor.
> no, it's not a matter of reuse, but what you consider "peak".
> I think the real take-home message is that this sort of
> fraction-of-theoretical-peak is useless, and you need to look
> at the actual numbers, possibly scaled by price.
Absolutely, dead-on-right, the whole truth and nothing but the truth.
And the actual measured numbers that MATTER are YOUR APPLICATION,
compiled with the compiler you're going to use, hand optimized or not as
you're going to hand optimize it, implemented the way you're going to
implement it. Sure, if you rewrote it, used the right compiler, hand
tuned it with the right assembler, and sacrificed a chicken on the head
node keyboard it might run 10 times as fast -- or not. If you're not
going to do that -- and who really does? -- why worry about whether or
not THIS processor, which perhaps costs twice as much as THAT processor,
"justifies" the additional cost by theoretically executing much faster
at peak if in >>your code<< you aren't ever going to see the difference?
> as a matter of fact, I'm always slightly puzzled by this sort
> of conversation. yes, crays and vector computers in general
> are big/wide memory systems with a light scattering of ALU's.
> a much different ratio than the "cache-based" computing world.
> but if your data is huge and uniform, don't you win big by
> partitioning (data or work grows as dim^2, but communication
> at partitions scaling much slower)? that would argue, for instance,
> that you should run on a cluster of e7205 machines, where each node
> delivers a bit more than the 200 Mflops above under $2k, and should
> scale quite nicely until your interconnect runs out of steam,
> say, several hundred CPUs. the point is really that stream-like
> codes are almost embarrassingly parallel.
> so what's the cost per stream-triad gflop from Cray?
...and, how many people are running code for which the stream-triad is a
fair and accurate measure of their expected performance? Lots, I'm sure
(seriously), but not a majority -- maybe not even a plurality. The
whole reason CPU clock is such a driving factor in system design is that
for a lot of folks, "cache works" and their completion time is CPU clock
linked, not (directly or obviously) memory subsystem linked.
I'm certainly one of them. Although I just LOVE running microbenchmarks
and trying to figure out system speeds, my favorite research application
scales almost perfectly with CPU clock over the entire P6 family,
spanning EDO, SDRAM, RDRAM, and DDR. If anything, RDR and the P4 suffer
a modest hit in performance relative to clock, probably because its
geometry is a poor match for prefetch algorithms so it has to do a lot
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu