[Beowulf] [gorelsky at stanford.edu: CCL:dual-core Opteron 275 performance]

Mon Jul 4 10:48:28 PDT 2005

In message from Vincent Diepeveen <diep at xs4all.nl> (Mon, 04 Jul 2005 
17:59:40 +0200):
> ...
> ...
>Of course we take a large buffer. Around 400MB is the working set size for
>the hashtable which i use for my chess software (which is reading randomly
>a 8-64 bytes from the cache).
>
>Results:
>   single cpu A64 : 91 ns  (cl2 memory)
>   single cpu P4  : 220 ns (cl2 memory, bus overclocked)
>   dual opteron   : 120 ns 
>   quad opteron   : 133 ns
>   dual xeon      : 280 ns (800Mhz bus)
>   dual xeon      : 400 ns (533Mhz bus)
The latencies should depend on the processor frequencies (although the
RAM contribution is much larger),
so what were the clock frequencies of the A64/P4/Opteron/Xeon?
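
To separate the clock-dependent part from the RAM part, it may help to
express each latency in core clock cycles. A small sketch in Python: the
latencies are the ones quoted above, while ASSUMED_CLOCK_GHZ is only a
placeholder, since the real frequencies are not given in this thread and
would have to be filled in per machine.

```python
# Latencies quoted above, in nanoseconds.
latency_ns = {
    "single cpu A64": 91,
    "single cpu P4": 220,
    "dual opteron": 120,
    "quad opteron": 133,
    "dual xeon (800MHz bus)": 280,
    "dual xeon (533MHz bus)": 400,
}


def ns_to_cycles(ns, clock_ghz):
    """Convert a latency in ns to core clock cycles (1 GHz = 1 cycle per ns)."""
    return ns * clock_ghz


# Hypothetical placeholder, NOT taken from the thread; replace it with the
# real clock of each machine once it is known.
ASSUMED_CLOCK_GHZ = 2.0

for name, ns in latency_ns.items():
    cycles = ns_to_cycles(ns, ASSUMED_CLOCK_GHZ)
    print(f"{name:24s} {ns:4d} ns  ~{cycles:.0f} cycles at {ASSUMED_CLOCK_GHZ} GHz")
```

With the real frequencies filled in, a latency that stays roughly constant
in ns across clock speeds would point at the memory subsystem rather than
the core.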

And do I understand you correctly that you have 1/2/4 threads, each of
which performs "random" reads of a few bytes from main memory?
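
To make clear what I mean by "random" reads: one common way to measure
this kind of latency is a chain of dependent reads through a randomly
permuted array, so that each address is only known after the previous read
has finished. The sketch below is my assumption about the shape of such a
test, not the actual benchmark; the names make_random_cycle and chase are
made up for illustration. In Python the interpreter overhead dominates the
timing, so the absolute numbers are not comparable with the ones quoted
above; the same loop would have to be written in C for that. Only a single
chain is shown; the 2- and 4-thread cases would run one such chain per
thread.

```python
import random
import time
from array import array


def make_random_cycle(n, seed=0):
    """Build a single cycle through all n slots (Sattolo's algorithm), so
    that following nxt[i] visits every slot once before returning to start."""
    rng = random.Random(seed)
    nxt = array("q", range(n))  # 8-byte signed entries
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i is what guarantees one single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Do `steps` dependent reads; return (final index, ns per read)."""
    i = start
    t0 = time.perf_counter()
    for _ in range(steps):
        i = nxt[i]  # the next address depends on the value just read
    elapsed = time.perf_counter() - t0
    return i, elapsed * 1e9 / steps


if __name__ == "__main__":
    # About 2 MB of entries here; the buffer would have to grow toward the
    # ~400MB mentioned above to be sure of missing the caches.
    n = 1 << 18
    nxt = make_random_cycle(n)
    _, ns = chase(nxt, 1_000_000)
    print(f"{ns:.1f} ns per dependent read (includes interpreter overhead)")
```

The single-cycle permutation matters: with an ordinary shuffle the chain
can fall into a short loop that fits in cache, which would make the
measured latency look much better than it is.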

>
>So obviously things that do not fit in L2 cache, the opteron runs away with
>it. Only if the executable is optimized in question by the intel c++
>compiler it will have done stuff to run it faster at intel processors than
>at opteron, then results do not look too bad for P4.
If the results above come from a "bad" (poorly optimizing) compiler,
then in some sense that is the compiler's problem :-) Yes, old
binary-only software will run slowly. But many, many HPC applications
can be compiled from source.
BTW, the better results are with the Intel C++ compiler only - do you
know anything about the PGI and PathScale compilers?

> Yet that's a matter of optimizing
>it for opteron better, which most software dudes do NOT do, as intel
>delivers good support and AMD historically didn't deliver *any* kind of
>support (they are improving now, but even then their math libraries are so
>pathetic compared to the ease of the intel libraries that i can imagine at
>least *that* part of the problems).
ACML 2.1 gives me good results on Opteron in comparison with MKL.

Yours
Mikhail