BLAS-1, AMD, Pentium, gcc

Fri Apr 12 10:36:22 PDT 2002

On Fri, 12 Apr 2002, Hung Jung Lu wrote:

> Hi,
> 
> I am thinking in migrating some calculation programs
> from Windows to Linux, maybe eventually using a
> Beowulf cluster. However, I am kind of worried after I
> read in the mailing list archive about lack of
> CPU-optimized BLAS-1 code in Linux systems. Currently
> I run on a Wintel (Windows+Pentium) machine, and I
> know it's substantially faster than equivalent AMD
> machine, because I use the Intel's BLAS (MKL) library.
> (I apologize for any misapprehensions in what
> follows... I am only starting to explore in this
> arena.)
> 
> (1) Does anyone know when gcc will have memory
> prefetching features? Any time frame? I can notice
> very significant performance improvement on my Wintel
> machine, and I think it's due to memory prefetching.

If you mean, "when will gcc's optimizer do automatic prefetching?", I
have no idea.  But, many programmers have been doing manual prefetching
with gcc for quite a while. If you don't mind defining and using
assembler macros, gcc handles it just fine now.  Here's an example:

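/* Prefetch the 128-byte-aligned block containing addr, using the
   non-temporal hint (prefetchnta).  The cast goes through unsigned long
   so the pointer is not truncated on 64-bit targets. */
#define prefetch_loc(addr) \
__asm__ __volatile__ ("prefetchnta %0" \
                      : \
                      : \
                      "m" (*(((char*)(((unsigned long)(addr))&~0x7fUL)))))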

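To show where such a hint typically goes in a BLAS-1 style loop, here is
a minimal sketch of a prefetching axpy (y = alpha*x + y).  It is an
illustration, not code from MKL or from our library.  To stay portable
it uses GCC's __builtin_prefetch builtin rather than the inline-assembly
macro, which assumes a GCC release that provides the builtin.  The name
daxpy_prefetch and the PREFETCH_DISTANCE value are hypothetical.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead of the current
   index to prefetch.  The right value must be found by measurement. */
#define PREFETCH_DISTANCE 64

/* y[i] += alpha * x[i] for i in [0, n), with software prefetch hints. */
void daxpy_prefetch(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        /* Issue hints once per 8 doubles (64 bytes), and only while the
           prefetched element is still inside the arrays. */
        if ((i & 7) == 0 && i + PREFETCH_DISTANCE < n) {
            /* Second argument: 0 = read, 1 = write.
               Third argument: 0 = no temporal locality. */
            __builtin_prefetch(&x[i + PREFETCH_DISTANCE], 0, 0);
            __builtin_prefetch(&y[i + PREFETCH_DISTANCE], 1, 0);
        }
        y[i] += alpha * x[i];
    }
}
```

Prefetch hints never change the result, only the timing, so the routine
can be checked against a plain loop.  Whether it is actually faster
depends on the processor and on how far ahead the hints are issued.
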
> (2) I am a bit confused on the following issue: Intel
> does release MKL for Linux. So, does this mean that if
> I use Pentium, I still get full benefit of the
> CPU-optimized features in BLAS-1, despite of gcc does
> not do memory prefetching? How is this possible?

The Intel compiler produces object files compatible with gcc, and vice
versa.  I would assume they implemented the library with the Intel
compiler, which has full SSE/SSE2 support (including prefetching).  They
list the MKL for Linux as compatible with both the GNU and Intel compilers.

> (3) Related to the above: for general linear algebra
> operations, is Pentium processor then better than AMD,
> since Intel has the machine-optimized BLAS library? I
> get contradictory information sometimes... I've seen
> somewhere that Pentium-4 compares unfavorably with AMD
> chips in calculation speed... Any opinions?
> 
> thanks,
> 
> Hung Jung Lu

For the very simple SU3 linear algebra (3x3 complex matrices and 3x1
complex vectors) used in our codes, the Pentium 4 outperforms the Athlon
on most of our SSE-assisted routines.  See the table near the bottom of
   http://qcdhome.fnal.gov/sse/inline.html
for Mflops per gigahertz on various routines for P-III, P4, and Athlon.
Perhaps re-coding in 3DNow! would give the Athlon a boost.
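
For readers unfamiliar with these operations, the following is a plain C
sketch of the basic kernel, a 3x3 complex matrix times a 3x1 complex
vector.  It is a scalar reference version written for illustration, not
one of the SSE routines from the page above, and the type and function
names (cplx, su3_matrix, su3_vector, su3_mat_vec) are assumptions.

```c
/* Single-precision complex number stored as real and imaginary parts. */
typedef struct { float re, im; } cplx;

typedef struct { cplx e[3][3]; } su3_matrix;  /* 3x3 complex matrix */
typedef struct { cplx c[3]; }    su3_vector;  /* 3x1 complex vector */

/* Conventional flop count: 9 complex multiplies at 6 flops each plus
   6 complex adds at 2 flops each. */
enum { SU3_MAT_VEC_FLOPS = 9 * 6 + 6 * 2 };

/* out = a * b.  out must not alias b. */
void su3_mat_vec(const su3_matrix *a, const su3_vector *b, su3_vector *out)
{
    for (int i = 0; i < 3; i++) {
        /* Start from the j = 0 product so no add is spent on zero. */
        float re = a->e[i][0].re * b->c[0].re - a->e[i][0].im * b->c[0].im;
        float im = a->e[i][0].re * b->c[0].im + a->e[i][0].im * b->c[0].re;
        for (int j = 1; j < 3; j++) {
            re += a->e[i][j].re * b->c[j].re - a->e[i][j].im * b->c[j].im;
            im += a->e[i][j].re * b->c[j].im + a->e[i][j].im * b->c[j].re;
        }
        out->c[i].re = re;
        out->c[i].im = im;
    }
}
```

Dividing SU3_MAT_VEC_FLOPS by the measured time per call, and then by
the clock rate, gives a figure in the same units as Mflops per
gigahertz.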

For our codes, which are bound by memory bandwidth, P4s do
significantly better than Athlons because of the faster front side bus
(400 MHz effective).  See
   http://qcdhome.fnal.gov/qcdstream/compare.qcdstream
for a table comparing memory bandwidth and SU3 linear algebra
performance on a 1.2 GHz Athlon, 1.4 GHz P4, and 1.7 GHz P4 (see
   http://qcdhome.fnal.gov/qcdstream/   
for information about this benchmark).
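
To make "bound by memory bandwidth" concrete, here is a rough
back-of-envelope sketch in C.  It assumes a kernel that streams every
operand from main memory and computes the ceiling that the bus places
on the flop rate.  The function name bandwidth_bound_mflops is
hypothetical, and the numbers in the note below are theoretical peaks,
not measurements from the benchmark.

```c
/* Upper bound on the flop rate, in Mflops, of a kernel that moves
   bytes_per_call bytes to and from main memory for every
   flops_per_call floating-point operations. */
double bandwidth_bound_mflops(double bytes_per_sec,
                              double flops_per_call,
                              double bytes_per_call)
{
    double flops_per_byte = flops_per_call / bytes_per_call;
    return bytes_per_sec * flops_per_byte / 1.0e6;
}
```

For example, a 400 MHz effective bus that is 8 bytes wide peaks at
3.2e9 bytes/s.  A single-precision 3x3 complex matrix times 3x1 complex
vector product does 66 flops while reading 24 floats and writing 6, or
120 bytes in all, so bandwidth_bound_mflops(3.2e9, 66.0, 120.0) gives a
ceiling of about 1760 Mflops no matter how fast the processor core is.
Sustained bandwidth is lower than the peak, so real limits are lower
still.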

Don Holmgren
Fermilab