[Beowulf] Re: vectors vs. loops


Josip Loncaric josip at lanl.gov
Wed May 4 09:07:13 PDT 2005


Eugen Leitl wrote:
> On Wed, May 04, 2005 at 09:19:35AM -0600, Josip Loncaric wrote:
> 
> 
>>That may work for games, but not for everyone.  A common operation like
>>
>>C = A + B
>>
>>is very fast when A, B, and C are small enough to fit into the cache 
>>simultaneously.  However, for scientific computing, the size of these 
>>vectors could be 1 GB each (per CPU!), and the problem is memory 
>>bandwidth bound.  Today's memory bandwidths cannot support full CPU 
>>speed on a problem like this.
> 
> 
> There are tricks to optimize available memory bandwidth on modern x86
> architectures though, as described in
> 
> http://leitl.org/docs/comp/AMD_block_prefetch_paper.pdf
> 
> (and far more in http://leitl.org/docs/comp/AMD64softoptguide.pdf ).

Thanks for the links, but prefetching (which I usually recommend) 
doesn't fix this problem: 2 GB needs to be read from RAM and 1 GB 
written, with only 128 M double precision floating point operations. 
This example needs 24 bytes of memory bandwidth per FLOP, much more than 
today's RAM can deliver.  If the CPU can issue ADD instructions at 3 
GHz, to run at full speed we'd need about 72 GB/s in memory bandwidth. 
Unfortunately, today's RAM supplies less than 5% of this requirement.
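
The arithmetic above can be checked with a short Python sketch. It assumes "1 GB" means 2**30 bytes and that each element is an 8-byte double, which is what yields the 128 M figure; the function and variable names are illustrative only and do not come from the original post.

```python
# Back-of-the-envelope check of the C = A + B figures quoted above.
# Assumptions (not stated explicitly in the post): "1 GB" is 2**30 bytes,
# and every element is an 8-byte double.

BYTES_PER_DOUBLE = 8
VECTOR_BYTES = 2**30      # size of each of A, B and C
ADD_RATE_HZ = 3e9         # one ADD issued per cycle at 3 GHz


def stream_add_requirements(vector_bytes, add_rate_hz):
    """Return (element count, bytes moved per FLOP, bandwidth needed in B/s)."""
    elements = vector_bytes // BYTES_PER_DOUBLE
    traffic_bytes = 3 * vector_bytes          # read A, read B, write C
    bytes_per_flop = traffic_bytes / elements
    required_bandwidth = bytes_per_flop * add_rate_hz
    return elements, bytes_per_flop, required_bandwidth


elements, bytes_per_flop, required_bandwidth = stream_add_requirements(
    VECTOR_BYTES, ADD_RATE_HZ
)

print(f"additions:          {elements / 2**20:.0f} M")
print(f"bytes per FLOP:     {bytes_per_flop:.0f}")
print(f"required bandwidth: {required_bandwidth / 1e9:.0f} GB/s")
print(f"5% of requirement:  {0.05 * required_bandwidth / 1e9:.1f} GB/s")
```

The last line gives the sustained bandwidth that corresponds to the "less than 5%" remark, under the same assumptions.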

Real CFD code can do a bit more work per memory access, and benefits 
from prefetching, but often runs into the same memory bandwidth 
bottleneck as C=A+B.  Prefetching can hide latency problems, but not 
bandwidth bottlenecks.
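
A minimal model of why this is so, assuming perfect prefetching (latency fully hidden and memory transfers fully overlapped with computation): the run time is then the larger of the pure compute time and the pure transfer time. The 3.6 GB/s sustained bandwidth used here is simply 5% of the 72 GB/s requirement, taken as an optimistic stand-in for the "less than 5%" figure; it is not a measured value, and all names are illustrative.

```python
# Idealized bound for C = A + B with perfect prefetching.
# Assumptions: latency is completely hidden, transfers overlap fully with
# computation, "1 GB" is 2**30 bytes, and sustained RAM bandwidth is
# 3.6 GB/s (5% of the 72 GB/s requirement; an illustrative figure).

VECTOR_BYTES = 2**30
BYTES_PER_DOUBLE = 8
ADD_RATE_HZ = 3e9
SUSTAINED_BANDWIDTH = 3.6e9   # bytes per second, assumed


def best_case_times(vector_bytes, add_rate_hz, bandwidth):
    """Return (compute time, transfer time, best-case run time) in seconds."""
    elements = vector_bytes // BYTES_PER_DOUBLE
    compute_time = elements / add_rate_hz          # CPU never stalls
    transfer_time = 3 * vector_bytes / bandwidth   # read A, read B, write C
    # With perfect overlap the slower of the two sets the run time.
    return compute_time, transfer_time, max(compute_time, transfer_time)


compute_time, transfer_time, run_time = best_case_times(
    VECTOR_BYTES, ADD_RATE_HZ, SUSTAINED_BANDWIDTH
)
cpu_utilization = compute_time / run_time

print(f"compute time:    {compute_time:.3f} s")
print(f"transfer time:   {transfer_time:.3f} s")
print(f"CPU utilization: {cpu_utilization:.0%}")
```

Even in this best case the transfer time dominates, so the adder sits idle most of the time regardless of how well latency is hidden.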

Sincerely,
Josip