[Beowulf] Re: vectors vs. loops

Wed May 4 09:07:13 PDT 2005

Eugen Leitl wrote:
> On Wed, May 04, 2005 at 09:19:35AM -0600, Josip Loncaric wrote:
> 
> 
>>That may work for games, but not for everyone.  A common operation like
>>
>>C = A + B
>>
>>is very fast when A, B, and C are small enough to fit into the cache 
>>simultaneously.  However, for scientific computing, the size of these 
>>vectors could be 1 GB each (per CPU!), and the problem is memory 
>>bandwidth bound.  Today's memory bandwidths cannot support full CPU 
>>speed on a problem like this.
> 
> 
> There are tricks to optimize available memory bandwidth on modern x86
> architectures though, as described in
> 
> http://leitl.org/docs/comp/AMD_block_prefetch_paper.pdf
> 
> (and far more in http://leitl.org/docs/comp/AMD64softoptguide.pdf ).

Thanks for the links, but prefetching (which I usually recommend)
doesn't fix this problem: 2 GB needs to be read from RAM and 1 GB
written, with only 128 M double-precision floating-point operations.
Each addition loads two 8-byte operands and stores one 8-byte result,
so this example needs 24 bytes of memory traffic per FLOP, much more
than today's RAM can deliver.  If the CPU can issue ADD instructions at
3 GHz, to run at full speed we'd need about 72 GB/s in memory bandwidth.
Unfortunately, today's RAM supplies less than 5% of this requirement.
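
For anyone who wants to check the arithmetic, here is a minimal sketch
in Python.  It assumes 1 GB means 2^30 bytes and that the CPU retires
one ADD per cycle; the function names are made up for illustration.

```python
# Back-of-the-envelope check of the C = A + B numbers.
BYTES_PER_DOUBLE = 8
VECTOR_BYTES = 2**30   # assumption: 1 GB per vector taken as 2^30 bytes
ADD_RATE_HZ = 3e9      # assumption: one ADD issued per cycle at 3 GHz


def bytes_per_flop(bytes_per_element=BYTES_PER_DOUBLE):
    """Two loads (A[i], B[i]) and one store (C[i]) per addition."""
    return 3 * bytes_per_element


def required_bandwidth(add_rate_hz=ADD_RATE_HZ):
    """Bytes/s the memory system must move to keep the adder busy."""
    return bytes_per_flop() * add_rate_hz


def achievable_flops(mem_bandwidth_bytes_per_s):
    """FLOP/s the adder can sustain when memory is the only limit."""
    return mem_bandwidth_bytes_per_s / bytes_per_flop()


elements = VECTOR_BYTES // BYTES_PER_DOUBLE
print("additions (Mi):", elements / 2**20)
print("bytes per FLOP:", bytes_per_flop())
print("needed GB/s   :", required_bandwidth() / 1e9)
# 5% of the requirement, the upper bound quoted for today's RAM:
print("MFLOP/s at 5% :", achievable_flops(0.05 * required_bandwidth()) / 1e6)
```

Plugging in a measured bandwidth for your own machine in place of the
5% figure gives the ceiling for this kernel, no matter how the loop is
tuned.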

Real CFD code can do a bit more work per memory access, and benefits 
from prefetching, but often runs into the same memory bandwidth 
bottleneck as C=A+B.  Prefetching can hide latency problems, but not 
bandwidth bottlenecks.

Sincerely,
Josip