[Beowulf] Re: vectors vs. loops
Robert G. Brown rgb at phy.duke.edu
Wed May 4 12:03:51 PDT 2005
- Previous message: [Beowulf] Re: vectors vs. loops
- Next message: [Beowulf] Re: vectors vs. loops
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Wed, 4 May 2005, Eugen Leitl wrote:

> On Wed, May 04, 2005 at 09:19:35AM -0600, Josip Loncaric wrote:
>
> > That may work for games, but not for everyone.  A common operation like
> >
> >   C = A + B
> >
> > is very fast when A, B, and C are small enough to fit into the cache
> > simultaneously.  However, for scientific computing, the size of these
> > vectors could be 1 GB each (per CPU!), and the problem is memory
> > bandwidth bound.  Today's memory bandwidths cannot support full CPU
> > speed on a problem like this.
>
> There are tricks to optimize available memory bandwidth on modern x86
> architectures though, as described in
>
> http://leitl.org/docs/comp/AMD_block_prefetch_paper.pdf
>
> (and far more in http://leitl.org/docs/comp/AMD64softoptguide.pdf ).

Awesome documents -- very informative!  I'm saving copies for my own
edification (presuming that is permitted by their respective licenses).

Do you have any idea how the "fully optimized loops" in the example code
compare timewise to gcc results for obvious implementations of the same
loops, or ditto for other compilers?  How necessary is it for us to
start inlining assembler in order to get a threefold improvement in
effective throughput in a straightforward core loop?  Do compilers
automatically use block prefetch and three phase implementations of the
floating point involved?

   rgb

> It would be interesting to know whether DDR2 (and coming DDR4) will
> especially profit from above, given that the latency is getting
> arguably worse (I think the same applies to RAMBUS type of memories which
> seem to be the default memory for the Cell CPU).
>
> Does anyone has a DDR2 machine, and could run the numbers?
>
> > A fact of life in scientific computing, e.g. CFD, is that the workload
> > resembles "C=A+B".  People try to get better reuse of data in cache, but
> > there is only so much that an algorithm will allow.  Thus, memory (and
> > network) bandwidths remain the main bottleneck.

-- 
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
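
One way to approach the question raised in the message (how an "obvious" compiled loop compares with a prefetch-assisted one) is to time both on the C = A + B kernel. The following is a minimal sketch in C, not the code from the AMD documents: it uses the GCC/Clang builtin `__builtin_prefetch` as a simpler stand-in for the hand-written block prefetch described there. The names `vec_add`, `vec_add_prefetch`, `measure_gbs`, and the `PREFETCH_AHEAD` distance are all hypothetical choices made for illustration, and the bandwidth figure assumes two reads and one write per element, ignoring any write-allocate traffic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical tuning constant: how many elements ahead to prefetch. */
#define PREFETCH_AHEAD 64

typedef void (*add_fn)(double *restrict, const double *restrict,
                       const double *restrict, size_t);

/* The "obvious" implementation; the compiler is free to vectorize it. */
void vec_add(double *restrict c, const double *restrict a,
             const double *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Same loop with software prefetch hints, issued once per 8 doubles
 * (one 64-byte line, assuming that cache line size). */
void vec_add_prefetch(double *restrict c, const double *restrict a,
                      const double *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        if ((i & 7) == 0 && i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 0);
            __builtin_prefetch(&b[i + PREFETCH_AHEAD], 0, 0);
        }
#endif
        c[i] = a[i] + b[i];
    }
}

/* Runs `add` over n-element vectors `reps` times and returns an estimate
 * of effective bandwidth in GB/s. Returns -1.0 if allocation fails and
 * 0.0 if the run was too short for clock() to resolve. */
double measure_gbs(add_fn add, size_t n, int reps)
{
    assert(n > 0 && reps > 0);
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    if (!a || !b || !c) {
        free(a); free(b); free(c);
        return -1.0;
    }
    for (size_t i = 0; i < n; i++) {
        a[i] = (double)i;
        b[i] = 1.0;
        c[i] = 0.0;
    }

    volatile double sink = 0.0;   /* keeps the work from being optimized away */
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++) {
        add(c, a, b, n);
        sink += c[n / 2];
    }
    clock_t t1 = clock();
    (void)sink;

    free(a); free(b); free(c);

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        return 0.0;
    /* 2 loads + 1 store of a double per element per repetition. */
    return 3.0 * (double)n * sizeof(double) * reps / secs / 1e9;
}
```

Calling `measure_gbs(vec_add, n, reps)` and `measure_gbs(vec_add_prefetch, n, reps)` once with `n` small enough for all three vectors to fit in cache, and once with `n` far larger than the cache, should show the effect described in the quoted text. Whether the prefetch variant helps at all depends on the compiler, optimization flags, and hardware prefetchers, so the numbers need to be measured rather than assumed.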
