BLAS-1, AMD, Pentium, gcc
Jim Fraser fraser5 at cox.net
Fri Apr 12 16:51:25 PDT 2002
- Previous message: BLAS-1, AMD, Pentium, gcc
- Next message: k7s5a mobo based cluster
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Sure, the optimized BLAS by Intel IS faster (on Intel). The data you
present, while very impressive, are skewed towards Intel because the
libs are optimized ONLY for SSE and Intel chips, while AMD does not
really fully support SSE. BUT you should replace your stale BLAS code
with optimized ATLAS for your AMD chips... it's a whole new world, my
friend! AMD really kicks some butt when the libs are optimized for
cache size. It blew me away. The libs optimize for a specific chip's
cache, detect SSE or 3DNow!, and really exploit it, and the
performance is very impressive (as is the makefile, which runs for
quite some time to produce the libs). Download the latest developer
version, compile, and sit back and smile. WELL WORTH THE EFFORT, no
question.

I got into this to port a CFD code over from
intel/mkl/scalapack/mpi to amd/atlas/scalapack/mpi. For bang for the
buck, there is no comparison with AMD once you run with this package.
BTW, the ATLAS libs also run on Intel (they run on ANY chip, for that
matter) and improved performance over the Intel MKL package as well
(on some chips; about equal on others). I don't have all the numbers
offhand, but I would suggest you re-run your case with ATLAS; your
conclusion may change. Try it. It's free.

(PS: get the developer source and compile it instead of downloading
the binary, the term)

http://www.netlib.org/atlas/

Jim

-----Original Message-----
From: beowulf-admin at beowulf.org
[mailto:beowulf-admin at beowulf.org]On Behalf Of Don Holmgren
Sent: Friday, April 12, 2002 1:36 PM
To: Hung Jung Lu
Cc: beowulf at beowulf.org
Subject: Re: BLAS-1, AMD, Pentium, gcc

On Fri, 12 Apr 2002, Hung Jung Lu wrote:

> Hi,
>
> I am thinking in migrating some calculation programs
> from Windows to Linux, maybe eventually using a
> Beowulf cluster. However, I am kind of worried after I
> read in the mailing list archive about lack of
> CPU-optimized BLAS-1 code in Linux systems.
> Currently I run on a Wintel (Windows+Pentium) machine, and I
> know it's substantially faster than equivalent AMD
> machine, because I use the Intel's BLAS (MKL) library.
> (I apologize for any misapprehensions in what
> follows... I am only starting to explore in this
> arena.)
>
> (1) Does anyone know when gcc will have memory
> prefetching features? Any time frame? I can notice
> very significant performance improvement on my Wintel
> machine, and I think it's due to memory prefetching.

If you mean, "when will gcc's optimizer do automatic prefetching?", I
have no idea. But, many programmers have been doing manual prefetching
with gcc for quite a while. If you don't mind defining and using
assembler macros, gcc handles it just fine now. Here's an example:

  #define prefetch_loc(addr) \
  __asm__ __volatile__ ("prefetchnta %0" \
                        : \
                        : \
                        "m" (*(((char*)(((unsigned int)(addr))&~0x7f)))))

> (2) I am a bit confused on the following issue: Intel
> does release MKL for Linux. So, does this mean that if
> I use Pentium, I still get full benefit of the
> CPU-optimized features in BLAS-1, despite of gcc does
> not do memory prefetching? How is this possible?

The Intel compiler produces object files compatible with gcc, and vice
versa. I would assume they implemented the library with the Intel
compiler, which has full SSE/SSE2 support (including prefetching).
They list the MKL for Linux as compatible with both gnu and Intel
compilers.

> (3) Related to the above: for general linear algebra
> operations, is Pentium processor then better than AMD,
> since Intel has the machine-optimized BLAS library? I
> get contradictory information sometimes... I've seen
> somewhere that Pentium-4 compares unfavorably with AMD
> chips in calculation speed... Any opinions?
>
> thanks,
>
> Hung Jung Lu

For the very simple SU3 linear algebra (3X3 complex matrices and 3X1
complex vectors) used in our codes, the Pentium 4 outperforms the
Athlon on most of our SSE-assisted routines.
See the table near the bottom of
   http://qcdhome.fnal.gov/sse/inline.html
for Mflops per gigahertz on various routines for P-III, P4, and
Athlon. Perhaps re-coding in 3DNow! would give the Athlon a boost.

For our codes, which are bound by memory bandwidth, P4's do
significantly better than Athlons because of the faster front side bus
(400 MHz effective). See
   http://qcdhome.fnal.gov/qcdstream/compare.qcdstream
for a table comparing memory bandwidth and SU3 linear algebra
performance on a 1.2 GHz Athlon, 1.4 GHz P4, and 1.7 GHz P7 (see
http://qcdhome.fnal.gov/qcdstream/ for information about this
benchmark).

Don Holmgren
Fermilab

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
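
For anyone who would rather not write inline assembly, here is a
minimal sketch of the same manual-prefetch idea applied to a BLAS-1
style dot product, using the __builtin_prefetch intrinsic that current
GCC and Clang provide. The function name dot_prefetch, the look-ahead
distance PREFETCH_AHEAD, and the 64-byte cache line are illustrative
assumptions, not something from this thread, and would need tuning per
CPU. A locality argument of 0 asks for "no temporal locality", which
is the closest match to the prefetchnta hint in the prefetch_loc macro
quoted above. Note also that the macro casts the address to unsigned
int, which only works where pointers are 32 bits wide; the builtin
takes the pointer directly.

```c
#include <stddef.h>

/* Assumed tuning knob: how many doubles ahead of the current index to
 * prefetch. 64 doubles = 512 bytes; the best value depends on the CPU. */
#define PREFETCH_AHEAD 64

/* Dot product of x and y (length n) with software prefetch hints.
 * The hints never change the result, only (possibly) the speed. */
double dot_prefetch(const double *x, const double *y, size_t n)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        /* Issue one hint per assumed 64-byte cache line (8 doubles),
         * and only while the prefetched element is still in bounds. */
        if ((i & 7) == 0 && i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(&x[i + PREFETCH_AHEAD], 0, 0); /* read, non-temporal */
            __builtin_prefetch(&y[i + PREFETCH_AHEAD], 0, 0);
        }
        sum += x[i] * y[i];
    }
    return sum;
}
```

Calling dot_prefetch(x, y, n) gives the same value as a plain loop;
whether it is actually faster has to be measured on the target
machine, and a tuned library such as ATLAS will usually do better than
a hand-rolled loop like this.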
