[Beowulf] Re: vectors vs. loops
eugen at leitl.org
Fri May 6 05:36:14 PDT 2005
On Wed, May 04, 2005 at 03:03:51PM -0400, Robert G. Brown wrote:
> On Wed, 4 May 2005, Eugen Leitl wrote:
> > There are tricks to optimize available memory bandwidth on modern x86
> > architectures though, as described in
> > http://leitl.org/docs/comp/AMD_block_prefetch_paper.pdf
> > (and far more in http://leitl.org/docs/comp/AMD64softoptguide.pdf ).
> Awesome documents -- very informative! I'm saving copies for my own
Thanks! That's the reason I mirrored them.
> edification (presuming that is permitted by their respective licenses).
These are freely available whitepapers and manuals from AMD. I haven't seen
any license restricting their use.
> Do you have any idea how the "fully optimized loops" in the example code
> compare timewise to gcc results for obvious implementations of the same
> loops, or ditto for other compilers? How necessary is it for us to
I don't recall whether they posted all the benchmarks, but IIRC the pure C
variant doesn't give the 300% (vs naive) boost, as the compiler doesn't
generate the required (MOVNTQ?) instruction.
> start inlining assembler in order to get a threefold improvement in
> effective throughput in a straightforward core loop? Do compilers
I think you have to use inline assembler for the full speed boost.
> automatically use block prefetch and three phase implementations of the
> floating point involved?
Given that gcc 4.0 is ante portas, things might have changed in that respect.
If you do benchmarks, can you please post full numbers?
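For anyone who wants a starting point for such a benchmark, here is a rough sketch of the block-prefetch-plus-streaming-store idea in C with compiler intrinsics instead of inline assembler. It is not the code from the AMD paper. It assumes gcc (or a compatible compiler) for `__builtin_prefetch`, and it uses the SSE2 intrinsic `_mm_stream_si128` (MOVNTDQ) as a stand-in for the MMX MOVNTQ mentioned above. The name `block_copy` and the constants `CACHE_LINE` and `BLOCK_BYTES` are made up for illustration and would need tuning per CPU. Without SSE2, or with an unaligned destination, it falls back to plain `memcpy`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

#define CACHE_LINE  64     /* assumed cache line size in bytes */
#define BLOCK_BYTES 4096   /* assumed block size; tune per CPU */

/* Copy n bytes from src to dst one block at a time.
   Phase 1 issues a prefetch hint for every cache line of the block.
   Phase 2 writes the block out, using non-temporal (streaming) stores
   when SSE2 is available and dst is 16-byte aligned. */
static void block_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n > 0) {
        size_t blk = n < BLOCK_BYTES ? n : BLOCK_BYTES;
        size_t i;

#if defined(__GNUC__)
        /* Phase 1: software prefetch hints, which the CPU may ignore. */
        for (i = 0; i < blk; i += CACHE_LINE)
            __builtin_prefetch(s + i, 0, 0);
#endif

        /* Phase 2: store the block. */
        i = 0;
#if defined(__SSE2__)
        if (((uintptr_t)d % 16) == 0) {
            for (; i + 16 <= blk; i += 16) {
                __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
                _mm_stream_si128((__m128i *)(d + i), v);
            }
        }
#endif
        /* Tail of the block, or the whole block on the fallback path. */
        memcpy(d + i, s + i, blk - i);

        d += blk;
        s += blk;
        n -= blk;
    }

#if defined(__SSE2__)
    _mm_sfence();   /* order the streaming stores before later stores */
#endif
}
```

Timing this against a plain `memcpy` and a naive byte loop on buffers well beyond the cache size would show whether the compiler-generated version gets anywhere near the hand-written assembler.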
Eugen* Leitl <http://leitl.org>
ICBM: 48.07078, 11.61144 http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE