[Beowulf] Re: vectors vs. loops

Eugen Leitl eugen at leitl.org
Wed May 4 10:32:28 PDT 2005


On Wed, May 04, 2005 at 09:19:35AM -0600, Josip Loncaric wrote:

> That may work for games, but not for everyone.  A common operation like
> 
> C = A + B
> 
> is very fast when A, B, and C are small enough to fit into the cache 
> simultaneously.  However, for scientific computing, the size of these 
> vectors could be 1 GB each (per CPU!), and the problem is memory 
> bandwidth bound.  Today's memory bandwidths cannot support full CPU 
> speed on a problem like this.

There are tricks to make better use of the available memory bandwidth on
modern x86 architectures though, as described in

http://leitl.org/docs/comp/AMD_block_prefetch_paper.pdf

(and far more in http://leitl.org/docs/comp/AMD64softoptguide.pdf ).
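
As a rough illustration of the general idea (this is not code from the
paper above, just a sketch), C = A + B with software prefetch hints
might look like the following. It assumes GCC or Clang for
__builtin_prefetch and a 64-byte cache line. The name vec_add_prefetch
and the distance PREFETCH_AHEAD are made up for the example, and the
distance would need tuning per machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch.
   64 doubles = 512 bytes; the best value depends on the machine. */
#define PREFETCH_AHEAD 64

/* c[i] = a[i] + b[i], with hints about the reads that are coming up.
   __builtin_prefetch is a GCC/Clang extension, not standard C. */
void vec_add_prefetch(double *c, const double *a, const double *b, size_t n)
{
    assert(c != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        /* One hint per 8 doubles (one 64-byte cache line, assumed). */
        if ((i & 7) == 0 && i + PREFETCH_AHEAD < n) {
            /* Arguments: address, 0 = read, 0 = no temporal locality. */
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 0);
            __builtin_prefetch(&b[i + PREFETCH_AHEAD], 0, 0);
        }
        c[i] = a[i] + b[i];
    }
}
```

Whether this helps at all depends on how well the hardware prefetcher
already handles a simple streaming pattern, so it needs measuring.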

It would be interesting to know whether DDR2 (and the coming DDR4) would
especially profit from the above, given that latency is arguably
getting worse (I think the same applies to RAMBUS-type memories, which
seem to be the default memory for the Cell CPU).

Does anyone have a DDR2 machine, and could run the numbers?
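
For anyone willing to try, a crude way to get a number is to time a few
passes of C = A + B over vectors much larger than the cache. The sketch
below is only a starting point: the function names are invented for the
example, it uses the standard clock() (CPU time, coarse), and it counts
two reads and one write per element without accounting for any
write-allocate traffic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Bytes touched by one pass of c = a + b: two reads and one write
   per element (ignores any write-allocate traffic for c). */
size_t add_bytes_moved(size_t n)
{
    return 3 * n * sizeof(double);
}

/* Apparent bandwidth in bytes/second over `reps` passes on vectors of
   n doubles. Returns 0.0 if allocation fails or the timer is too coarse. */
double measure_add_bandwidth(size_t n, int reps)
{
    assert(n > 0 && reps > 0);
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    double result = 0.0;

    if (a && b && c) {
        /* Touch every page before timing starts. */
        for (size_t i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

        clock_t t0 = clock();
        for (int r = 0; r < reps; r++) {
            for (size_t i = 0; i < n; i++)
                c[i] = a[i] + b[i];
            /* Feed a result back so the compiler cannot drop the passes. */
            a[0] = c[n - 1];
        }
        clock_t t1 = clock();

        double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (secs > 0.0)
            result = (double)reps * (double)add_bytes_moved(n) / secs;
    }
    free(a);
    free(b);
    free(c);
    return result;
}
```

Calling measure_add_bandwidth with n in the tens of millions and
dividing the result by 1e9 gives a figure in GB/s to compare between
machines; running it with n small enough to fit in cache shows the gap.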
 
> A fact of life in scientific computing, e.g. CFD, is that the workload 
> resembles "C=A+B".  People try to get better reuse of data in cache, but 
> there is only so much that an algorithm will allow.  Thus, memory (and 
> network) bandwidths remain the main bottleneck.

-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a>
______________________________________________________________
ICBM: 48.07078, 11.61144            http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
http://moleculardevices.org         http://nanomachines.net