[Beowulf] AMD64 results...

David Mathog mathog at mendel.bio.caltech.edu
Thu Dec 16 14:12:04 PST 2004


<SNIP>

> Well, stream is as much a memory bandwidth test as it is a floating
> point test per se anyway.  I always hope for something dramatic when I
> use faster/wider memory, but usually reality is fairly sedate.

<SNIP>

> 
> Enjoy and prefetch!
> 
> rbw
> 
> gcc-3.2.3  -O4 -Wall -pedantic:
> Function      Rate (MB/s)   RMS time     Min time     Max time
> Copy:        2004.8056       0.0095       0.0080       0.0099
> Scale:       2044.7551       0.0099       0.0078       0.0105
> Add:         2272.3092       0.0133       0.0106       0.0137
> Triad:       2237.3599       0.0134       0.0107       0.0137
> 
> gcc-3.2.3  -O4 -fprefetch-loop-arrays -Wall -pedantic:
> Function      Rate (MB/s)   RMS time     Min time     Max time
> Copy:        3259.9273       0.0049       0.0049       0.0052
> Scale:       3294.9803       0.0049       0.0049       0.0049
> Add:         3306.7241       0.0073       0.0073       0.0073
> Triad:       3349.1914       0.0072       0.0072       0.0072
> 

That is also what happened on the Athlon MP: the prefetch
tricks made a pretty big difference, although apparently not
as large, percentage-wise, as on the Opteron / Athlon64.
There 1.2 was a typical prefetch/normal ratio, whereas here
it is about 1.6 for Copy and Scale and about 1.5 for Add and
Triad.  Prefetch is much better in the newer chips.
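
As a quick check on that ratio, the rates in the two quoted tables
can be divided kernel by kernel.  This is only a sketch: the array
and function names are made up for illustration, and the numbers are
simply copied from the tables above.

```c
#include <assert.h>
#include <stddef.h>

/* Rates in MB/s copied from the quoted tables, in the order
 * Copy, Scale, Add, Triad.  The names are hypothetical. */
static const double rate_plain[4] = {
    2004.8056, 2044.7551, 2272.3092, 2237.3599
};
static const double rate_prefetch[4] = {
    3259.9273, 3294.9803, 3306.7241, 3349.1914
};

/* Prefetch/normal ratio for kernel k (0 = Copy ... 3 = Triad). */
static double prefetch_speedup(size_t k)
{
    assert(k < 4);
    return rate_prefetch[k] / rate_plain[k];
}
```

Copy and Scale come out a little above 1.6; Add and Triad are
closer to 1.5.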

Stream is pretty simple code though, and in the last few years
I have found a couple of instances in other programs where
just enabling -fprefetch-loop-arrays in gcc didn't work that
well: the access pattern was too complex for the compiler to
figure out.  Maybe later versions of gcc have fixed this.
Anyway, to hand-tune prefetches in gcc, add something like
this:

/*
 * Prefetch 192 bytes ahead of the current pointer.
 * The "w" form is for data that will be written.
 * How far ahead to prefetch depends on the code.
 * Prefetch too close and the data won't be in cache when needed.
 * Prefetch too far and it may be evicted before it gets used.
 */
#if  defined(AMD_PREFETCH)
static __inline__ void CPU_prefetchwR(const void *s) {
         __asm__ ("prefetchw 192(%0)" :: "r" ((s)) );
}
static __inline__ void CPU_prefetchR(const void *s) {
         __asm__ ("prefetch 192(%0)" :: "r" ((s)) );
}
#endif

And then sprinkle these in as needed:

#if   defined(AMD_PREFETCH)
           CPU_prefetchR(&a[i]);
           CPU_prefetchwR(&c[i]);
#endif

for instance, before this:

   c[i]=a[i];

For more or less sane code you can generally just do a few
timing runs, varying the 192 (above), to find the prefetch
distance that works best.
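
One way to make the distance easy to vary between runs is to turn it
into a compile-time constant.  The sketch below uses gcc's
__builtin_prefetch instead of the inline asm above, so it is not tied
to the AMD prefetch/prefetchw instructions.  PREFETCH_BYTES and
copy_with_prefetch are hypothetical names, and the plain copy loop is
an assumption standing in for whatever the real code does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many bytes ahead to prefetch.
 * Override at build time, e.g. gcc -DPREFETCH_BYTES=256 ... */
#ifndef PREFETCH_BYTES
#define PREFETCH_BYTES 192
#endif

/* Copy n doubles from src to dst, prefetching a fixed distance ahead. */
static void copy_with_prefetch(double *dst, const double *src, size_t n)
{
    /* Prefetch distance converted from bytes to array elements. */
    const size_t ahead = PREFETCH_BYTES / sizeof(double);
    size_t i;

    assert(dst != NULL && src != NULL);

    for (i = 0; i < n; i++) {
        /* Only form addresses that lie inside the arrays. */
        if (i + ahead < n) {
            __builtin_prefetch(&src[i + ahead], 0, 3); /* will be read    */
            __builtin_prefetch(&dst[i + ahead], 1, 3); /* will be written */
        }
        dst[i] = src[i];
    }
}
```

Rebuild with a few different values of PREFETCH_BYTES and time each
run.  Issuing a prefetch on every iteration is more than is needed
(once per cache line would do), but it keeps the sketch short.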

That's only for sane code though.  For something awful like this:

   b=a[i];
   d=c[b];
   f[d]=e[b];

you have to know a priori what is likely to be in a[] and c[]
to guess ahead of time what b and d will be, so that f[d] and
e[b] can be prefetched.
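
If i is a loop index and a[] is read sequentially (an assumption,
since the fragment above does not say so), one alternative to guessing
is to load the indices a few iterations early and prefetch from them.
The following is a rough sketch of that idea and not something taken
from a real program: the element types, the function name and
LOOKAHEAD are all hypothetical, and it assumes the stores to f[] never
modify a[] or c[].

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many iterations to look ahead. */
#ifndef LOOKAHEAD
#define LOOKAHEAD 8
#endif

/* Performs b = a[i]; d = c[b]; f[d] = e[b]; for i = 0 .. n-1. */
static void indirect_scatter(double *f, const double *e,
                             const int *c, const int *a, size_t n)
{
    size_t i;

    assert(f != NULL && e != NULL && c != NULL && a != NULL);

    for (i = 0; i < n; i++) {
        int b, d;

        /* Far stage: a[] is sequential, so a future b is cheap to
         * load.  Use it to prefetch the c[] and e[] entries. */
        if (i + LOOKAHEAD < n) {
            int b_far = a[i + LOOKAHEAD];
            __builtin_prefetch(&c[b_far], 0, 3);
            __builtin_prefetch(&e[b_far], 0, 3);
        }

        /* Near stage: this c[] entry was prefetched LOOKAHEAD/2
         * iterations ago, so reading it should be cheap by now.
         * Use it to prefetch the f[] entry for writing. */
        if (i + LOOKAHEAD / 2 < n) {
            int d_near = c[a[i + LOOKAHEAD / 2]];
            __builtin_prefetch(&f[d_near], 1, 3);
        }

        /* The actual work for this iteration. */
        b = a[i];
        d = c[b];
        f[d] = e[b];
    }
}
```

Whether this helps depends on how scattered the indices are, so
LOOKAHEAD needs the same kind of trial runs as the 192 above.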


Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


