hahn at physics.mcmaster.ca
Wed Mar 12 07:23:50 PST 2003
> >> PS Pentium 4 sustained performance from memory is about
> >> 5% of peak (stream triad).
> >that should be 50%, I think.
> Nope ... not "from memory".
> A 2.8 GHz P4 using SSE2 instructions can deliver two
> 64-bit floating point results per clock or 5.6 Gflops
> peak performance at this clock. The stream triad (a
> from-memory, multiply-add operation) for a 2.8 GHz
> P4 produces only 200 Mflops (see stream website). The
> arithmetic is then:
> 200/5600 = .0357 or 3.57% (so 5% is a gift)
oh, I see. to me, that's a strange definition of "peak",
since stream is, by intention, always bottlenecked on
memory bandwidth. the P4's FSB is either 3.2 or 4.3 GB/s,
and it'll deliver roughly 50% of that to stream.
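to make the two definitions of "peak" concrete, here is a rough sketch of the arithmetic. it assumes triad is a[i] = b[i] + q*c[i], counted the way stream counts it: 2 flops and 24 bytes (three 8-byte doubles) per iteration, ignoring any write-allocate traffic. the function name and its parameters are made up for illustration; the inputs are the numbers quoted above.

```python
# sketch only: fraction of flop peak vs. fraction of FSB bandwidth for triad.
# assumption: 2 flops and 24 bytes per triad iteration (stream's own counting),
# so 12 bytes must move for every flop; write-allocate traffic is ignored.

def triad_fractions(clock_ghz, flops_per_clock, triad_mflops, fsb_gbs):
    """Return (fraction of flop peak, GB/s moved, fraction of FSB bandwidth)."""
    peak_mflops = clock_ghz * 1000.0 * flops_per_clock
    bytes_per_flop = 24.0 / 2.0
    triad_gbs = triad_mflops * bytes_per_flop / 1000.0
    return triad_mflops / peak_mflops, triad_gbs, triad_gbs / fsb_gbs

# 2.8 GHz P4, 2 results per clock, 200 Mflops triad, both FSB speeds
for fsb in (3.2, 4.3):
    frac_peak, gbs, frac_bw = triad_fractions(2.8, 2, 200.0, fsb)
    print(f"FSB {fsb} GB/s: {frac_peak:.1%} of flop peak, "
          f"{gbs:.1f} GB/s moved, {frac_bw:.0%} of FSB")
```

under those assumptions 200 Mflops of triad corresponds to about 2.4 GB/s of memory traffic, which is a few percent of the flop peak but more than half of either FSB figure.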
> As you suggest, the P4 will (as does the Cray X1) do
> significantly better when cache use/re-use is a
> significant factor.
no, it's not a matter of reuse, but of what you consider "peak".
I think the real take-home message is that this sort of
fraction-of-theoretical-peak is useless, and you need to look
at the actual numbers, possibly scaled by price.
as a matter of fact, I'm always slightly puzzled by this sort
of conversation. yes, crays and vector computers in general
are big/wide memory systems with a light scattering of ALUs,
a much different ratio than in the "cache-based" computing world.
but if your data is huge and uniform, don't you win big by
partitioning (data or work grows as dim^2, but communication
at partition boundaries scales much more slowly)? that would argue, for instance,
that you should run on a cluster of e7205 machines, where each node
delivers a bit more than the 200 Mflops above for under $2k, and should
scale quite nicely until your interconnect runs out of steam,
say, several hundred CPUs. the point is really that stream-like
codes are almost embarrassingly parallel.
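the partitioning argument can be sketched with a toy model. this assumes a dim x dim grid cut into k x k square blocks, one block per node, with each node exchanging a one-cell-wide halo along its four edges; that layout and the function name are assumptions for illustration, not anything measured.

```python
# toy model: per-node work vs. per-node boundary traffic for a 2D block split.
# assumption: dim x dim grid, k x k square blocks, one-cell-wide halo per edge.

def block_partition(dim, k):
    """Return (cells owned per node, halo cells per node, halo/work ratio)."""
    side = dim / k
    work = side * side   # grows as dim^2 for a fixed number of nodes
    halo = 4 * side      # grows only as dim
    return work, halo, halo / work

# 16 x 16 = 256 nodes, growing problem size
for dim in (1_000, 10_000, 100_000):
    work, halo, ratio = block_partition(dim, 16)
    print(f"dim={dim}: work/node={work:.0f}, halo/node={halo:.0f}, "
          f"halo/work={ratio:.5f}")
```

in this model the halo-to-work ratio is 4k/dim, so for a fixed node count it shrinks as the problem grows, which is the sense in which communication scales much more slowly than the work.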
so what's the cost per stream-triad gflop from Cray?
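for the P4 side of that comparison, a back-of-the-envelope sketch, using only the figures above (roughly 200 Mflops of triad per node, under $2k per node) and ignoring interconnect cost. the function name is made up for illustration, and the Cray number is left as the open question it is.

```python
# sketch: dollars per sustained stream-triad Gflop for one node.
# inputs are the rough figures from this thread; interconnect cost is ignored.

def dollars_per_triad_gflop(node_cost_usd, triad_mflops):
    """Node price divided by sustained triad rate in Gflops."""
    return node_cost_usd / (triad_mflops / 1000.0)

# upper bound for the P4 node: at most $2k, at least 200 Mflops
p4 = dollars_per_triad_gflop(2000.0, 200.0)
print(f"P4 node: at most about ${p4:,.0f} per stream-triad Gflop")
```

that puts the commodity node at no more than about $10k per triad Gflop, since the node costs less than $2k and delivers a bit more than 200 Mflops.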