[Beowulf] AMD64 results...
David Mathog mathog at mendel.bio.caltech.edu
Thu Dec 16 14:12:04 PST 2004
<SNIP>
> Well, stream is as much a memory bandwidth test as it is a floating
> point test per se anyway. I always hope for something dramatic when I
> use faster/wider memory, but usually reality is fairly sedate.
<SNIP>
>
> Enjoy and prefetch!
>
> rbw
>
> gcc-3.2.3 -O4 -Wall -pedantic:
> Function   Rate (MB/s)   RMS time   Min time   Max time
> Copy:        2004.8056     0.0095     0.0080     0.0099
> Scale:       2044.7551     0.0099     0.0078     0.0105
> Add:         2272.3092     0.0133     0.0106     0.0137
> Triad:       2237.3599     0.0134     0.0107     0.0137
>
> gcc-3.2.3 -O4 -fprefetch-loop-arrays -Wall -pedantic:
> Function   Rate (MB/s)   RMS time   Min time   Max time
> Copy:        3259.9273     0.0049     0.0049     0.0052
> Scale:       3294.9803     0.0049     0.0049     0.0049
> Add:         3306.7241     0.0073     0.0073     0.0073
> Triad:       3349.1914     0.0072     0.0072     0.0072
>
That was what happened even on the Athlon MP: the prefetch
tricks made a pretty big difference, although apparently not
as large, percentage-wise, as on the Opteron / Athlon64.
There a typical prefetch/normal ratio was 1.2, whereas in the
numbers above it is about 1.5 to 1.6 (Copy 1.63, Scale 1.61,
Add 1.46, Triad 1.50). Prefetch is much better in the newer chips.
Stream is pretty simple code, though, and in the last few years
I have found a couple of instances in other programs where
just enabling -fprefetch-loop-arrays in gcc didn't work that
well: the access pattern was too complex for the compiler
to figure out where to prefetch. Maybe later versions of gcc
have fixed this. Anyway, to hand-tune prefetches in gcc,
add something like this:
/*
 * Prefetch 192 bytes ahead of the current pointer.
 * The "w" form is for data that will be written.
 * How far ahead to prefetch depends on the code.
 * Prefetch too close and the data won't be in cache when needed.
 * Prefetch too far and it may be evicted before it gets used.
 */
#if defined(AMD_PREFETCH)
static __inline__ void CPU_prefetchwR(const void *s) {
    __asm__ ("prefetchw 192(%0)" :: "r" (s));
}
static __inline__ void CPU_prefetchR(const void *s) {
    __asm__ ("prefetch 192(%0)" :: "r" (s));
}
#endif
And then sprinkle these in as needed:
#if defined(AMD_PREFETCH)
CPU_prefetchR(&a[i]);
CPU_prefetchwR(&c[i]);
#endif
for instance, before this:
c[i]=a[i];
For more or less sane code you can generally just do a few
runs varying the 192 (above) to place the prefetch in the
optimal position.
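
As a rough illustration of that tuning loop, here is a minimal sketch that
uses GCC's __builtin_prefetch instead of the inline asm above, so the
distance can be changed from the command line (for example
-DPREFETCH_AHEAD=32) and the code is not tied to AMD chips. The names
copy_with_prefetch and PREFETCH_AHEAD are made up for this example, and
the default of 24 doubles is just the 192 bytes from above, not a
recommended value.

```c
#include <stddef.h>

/* Hypothetical tuning knob: prefetch distance in elements.
 * 24 doubles = 192 bytes, matching the asm version above. */
#ifndef PREFETCH_AHEAD
#define PREFETCH_AHEAD 24
#endif

/* Copy a[] into c[], hinting the cache PREFETCH_AHEAD elements ahead. */
static void copy_with_prefetch(double *c, const double *a, size_t n)
{
    size_t i = 0;

    /* Main loop: every prefetched address lies inside the arrays. */
    for (; i + PREFETCH_AHEAD < n; i++) {
        __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3); /* will be read */
        __builtin_prefetch(&c[i + PREFETCH_AHEAD], 1, 3); /* will be written */
        c[i] = a[i];
    }

    /* Tail: the remaining elements have already been requested. */
    for (; i < n; i++)
        c[i] = a[i];
}
```

Rebuild with a few different PREFETCH_AHEAD values and time each run.
Like the snippet above, this issues a prefetch on every iteration; one
per cache line may be enough, which is another thing to measure.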
That's only for sane code though. For something awful like this:
b=a[i];
d=c[b];
f[d]=e[b];
you have to know a priori what is likely to be in a[] and c[]
in order to guess ahead of time what b and d will be, so that
f[d] and e[b] can be prefetched.
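
One case where no guessing is needed: if a[] and c[] are already fully
populated before the loop starts and are not modified inside it, the
loop can read the index arrays a few iterations ahead and prefetch from
there. The sketch below is one way to do that, not something measured
for this post. The function name indirect_copy, the LOOKAHEAD value, and
the use of size_t indices are all assumptions made for the example, and
every index stored in a[] and c[] is assumed to be in range.

```c
#include <stddef.h>

/* Hypothetical distance in iterations; tune by measurement. */
#define LOOKAHEAD 16

/* Performs f[c[a[i]]] = e[a[i]] for i in [0, n), with prefetch hints. */
static void indirect_copy(double *f, const double *e,
                          const size_t *c, const size_t *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Stage 1: far ahead only a[] is cheap to read, so request
         * the c[] and e[] entries that index will need. */
        if (i + 2 * LOOKAHEAD < n) {
            size_t b2 = a[i + 2 * LOOKAHEAD];
            __builtin_prefetch(&c[b2], 0, 3);
            __builtin_prefetch(&e[b2], 0, 3);
        }

        /* Stage 2: c[b1] was requested LOOKAHEAD iterations ago, so
         * reading it now should be cheap; request f[] for writing. */
        if (i + LOOKAHEAD < n) {
            size_t b1 = a[i + LOOKAHEAD];
            __builtin_prefetch(&f[c[b1]], 1, 3);
        }

        /* The actual work, unchanged from the plain version. */
        size_t b = a[i];
        size_t d = c[b];
        f[d] = e[b];
    }
}
```

If a[] or c[] is computed on the fly, this does not apply and you are
back to guessing from what you know about the data.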
Regards,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech