[Beowulf] Re: vectors vs. loops
eugen at leitl.org
Wed May 4 10:32:28 PDT 2005
On Wed, May 04, 2005 at 09:19:35AM -0600, Josip Loncaric wrote:
> That may work for games, but not for everyone. A common operation like
> C = A + B
> is very fast when A, B, and C are small enough to fit into the cache
> simultaneously. However, for scientific computing, the size of these
> vectors could be 1 GB each (per CPU!), and the problem is memory
> bandwidth bound. Today's memory bandwidths cannot support full CPU
> speed on a problem like this.
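To put rough numbers on the claim quoted above, here is a sketch of the arithmetic. It assumes 8-byte doubles and three memory streams (read A, read B, write C) with no cache reuse, and it ignores write-allocate traffic. The function name and the 6.4 GB/s and 4.4 GFLOP/s figures are illustrative assumptions, not measurements from this thread.

```python
def bandwidth_bound_flops(bandwidth_bytes_per_s, bytes_per_element=8, streams=3):
    """Upper bound on additions per second for C = A + B when every
    element has to travel to or from main memory."""
    # Each addition moves two loads and one store across the memory bus.
    bytes_per_add = bytes_per_element * streams
    return bandwidth_bytes_per_s / bytes_per_add


# Assumed figures, for illustration only.
mem_bw = 6.4e9    # bytes/s, hypothetical peak memory bandwidth
cpu_peak = 4.4e9  # FLOP/s, hypothetical peak CPU rate

bound = bandwidth_bound_flops(mem_bw)
print(f"memory-bound rate: {bound / 1e6:.0f} MFLOP/s")
print(f"fraction of CPU peak: {bound / cpu_peak:.1%}")
```

Under these assumed figures the memory system can feed only a few percent of the CPU's peak rate, which is the sense in which the problem is bandwidth bound.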
There are tricks to optimize available memory bandwidth on modern x86
architectures though, as described in
(and far more in http://leitl.org/docs/comp/AMD64softoptguide.pdf ).
It would be interesting to know whether DDR2 (and the coming DDR4) would
particularly benefit from the above, given that latency is arguably
getting worse (I think the same applies to Rambus-type memory, which
seems to be the default memory for the Cell CPU).
Does anyone have a DDR2 machine and could run the numbers?
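For anyone who wants a quick first number, a crude probe along these lines might do. It is only a sketch: it times a plain block copy between two buffers rather than the optimized routines from the guide, it uses nothing beyond the Python standard library, and the function name and default sizes are my own assumptions. A tuned C or assembly kernel would be the real test.

```python
import time


def measure_copy_bandwidth(n_bytes=64 * 1024 * 1024, repeats=5):
    """Estimate sustained copy bandwidth in bytes/s.

    Copies n_bytes from one buffer to another and counts 2 * n_bytes of
    traffic per pass (one read stream, one write stream). The buffers
    should be much larger than the caches for the result to reflect
    main memory rather than cache.
    """
    src = bytearray(n_bytes)
    dst = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        dst[:] = src  # same-size slice assignment, a bulk memory copy
        best = min(best, time.perf_counter() - t0)
    best = max(best, 1e-9)  # guard against a zero reading from the timer
    return 2 * n_bytes / best


if __name__ == "__main__":
    bw = measure_copy_bandwidth()
    print(f"copy bandwidth: {bw / 1e9:.2f} GB/s")
```

Taking the best of several passes reduces noise from other processes; the figure is still only a lower bound on what the hardware can do.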
> A fact of life in scientific computing, e.g. CFD, is that the workload
> resembles "C=A+B". People try to get better reuse of data in cache, but
> there is only so much that an algorithm will allow. Thus, memory (and
> network) bandwidths remain the main bottleneck.
Eugen* Leitl <a href="http://leitl.org">leitl</a>
ICBM: 48.07078, 11.61144 http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE