[Beowulf] Q: AMD Opteron (Barcelona) 2356 vs Intel Xeon 5460

Wed Sep 17 23:53:55 PDT 2008

Hi Bill,

     I'm sorry. I composed the mail in the proper format, but it's not showing
the way I laid it out.

I tested three compilers on the AMD machine only. On the Intel machine I used
only Intel ifort.

Also, some runs have two results each: one from the DL-Poly OUTPUT file and one
from the time command. Not all runs have both, because I forgot to record the
time command results for some of them.
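
As a rough sketch only, the OUTPUT-file timings quoted below can be turned
into parallel speedups (serial time divided by parallel time) and printed as
a space-padded table. The dictionary layout and the function names
speedup_rows and format_table are my own choices for illustration; only the
numbers come from the runs.

```python
# Timings (seconds) copied from the DL-Poly OUTPUT file rows quoted below.
# Keys are (machine, compiler); values are (serial, 4-core, 8-core) times.
timings = {
    ("AMD 2.3 GHz", "gfortran"): (147.719, 39.798, 26.880),
    ("AMD 2.3 GHz", "Pathscale"): (158.158, 44.717, 33.746),
    ("AMD 2.3 GHz", "ifort 10"): (135.729, 36.962, 27.979),
    ("Intel 2.33 GHz", "ifort 10"): (73.952, 32.317, 30.371),
}


def speedup_rows(data):
    """Return (machine, compiler, 4-core speedup, 8-core speedup) tuples."""
    rows = []
    for (machine, compiler), (serial, four, eight) in data.items():
        rows.append((machine, compiler, serial / four, serial / eight))
    return rows


def format_table(rows):
    """Format rows as a plain-text table padded with spaces, never tabs."""
    header = f"{'Machine':<16}{'Compiler':<12}{'4-core':>8}{'8-core':>8}"
    lines = [header, "=" * len(header)]
    for machine, compiler, s4, s8 in rows:
        lines.append(f"{machine:<16}{compiler:<12}{s4:>8.2f}{s8:>8.2f}")
    return "\n".join(lines)


print(format_table(speedup_rows(timings)))
```

Fixed-width format fields keep the columns aligned in any mail client that
uses a monospaced font, whatever its tab settings.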

I hope this helps,

Thanks,
Sangamesh

On Thu, Sep 18, 2008 at 11:59 AM, Bill Broadley <bill at cse.ucdavis.edu> wrote:

>
>
> I tried to understand your post, but failed.  Can you post a link,
> publish a Google spreadsheet or format it differently?
>
> You tried 3 compilers on both machines?  Which times are for which
> CPU/compiler combos?  I tried to match up the columns and rows, but sometimes
> there were 3 columns, and sometimes 4.  None of them line up nicely under
> the CPU or compiler headings.
>
> I (and many other folks) read email as ASCII/text, so a table should
> look like:
>
> Serial run:
>                Compiler A   Compiler B   Compiler C
> =====================================================
> Intel 2.3 GHz     30            29           31
> AMD 2.3 GHz       28            32           32
>
> Note that I used spaces and not tabs so it appears clear to everyone
> regardless of their mail client, ascii/text, html, tab settings, etc.
>
> I've been testing these machines quite a bit lately and have been quite
> impressed with the barcelona memory systems, for instance:
>
> http://cse.ucdavis.edu/bill/fat-node-numa3.png
>
>
> Sangamesh B wrote:
>
>> The scientific application used is DL-Poly 2.17.
>>
>> Tested with Pathscale and Intel compilers on AMD Opteron quad core. The
>> time figures below were taken from the DL-Poly OUTPUT file. I also used
>> the time command. Here are the results:
>>
>>
>>               AMD-2.3GHz (32 GB RAM)                    INTEL-2.33GHz (32 GB RAM)
>>               GNU gfortran Pathscale    Intel 10 ifort  Intel 10 ifort
>> =================================================================================
>> 1. Serial
>> OUTPUT file   147.719 sec  158.158 sec  135.729 sec     73.952 sec
>> Time command  2m27.791s    2m38.268s    -               1m13.972s
>>
>> 2. Parallel, 4 core
>> OUTPUT file   39.798 sec   44.717 sec   36.962 sec      32.317 sec
>> Time command  0m41.527s    0m46.571s    -               0m36.218s
>>
>> 3. Parallel, 8 core
>> OUTPUT file   26.880 sec   33.746 sec   27.979 sec      30.371 sec
>> Time command  0m30.171s (column unclear in the original)
>>
>>
>> The optimization flags used:
>>
>> Intel ifort 10:   -O3 -axW -funroll-loops  (I don't remember the exact
>>                   flag; it was something similar to loop unrolling)
>>
>> Pathscale:        -O3 -OPT:Ofast -ffast-math -fno-math-errno
>>
>> GNU gfortran:     -O3 -ffast-math -funroll-all-loops -ftree-vectorize
>>
>>
>> For further runs I'll try to use this: http://directory.fsf.org/project/time/
>>
>> Thanks,
>> Sangamesh
>>
>>
>> On Thu, Sep 18, 2008 at 6:07 AM, Vincent Diepeveen <diep at xs4all.nl> wrote:
>>
>>> How does all this change when you use a PGO-optimized executable on both
>>> sides?
>>>
>>> Vincent
>>>
>>>
>>> On Sep 18, 2008, at 2:34 AM, Eric Thibodeau wrote:
>>>
>>>  Vincent Diepeveen wrote:
>>>
>>>>> Nah,
>>>>>
>>>>> I guess he's referring to it sometimes using single-precision floating
>>>>> point instead of double precision to get something done, and to it
>>>>> sometimes keeping values in registers.
>>>>>
>>>>> That isn't necessarily a problem, but if I remember well, floating-point
>>>>> state could get wiped out when switching to SSE2.
>>>>>
>>>>> Sometimes you lose your FPU register set in that case.
>>>>>
>>>>> The main problem is that there are so many dangerous optimizations
>>>>> possible to speed up test sets, because floating point in itself is
>>>>> really slow to do in hardware, seen from a hardware viewpoint.
>>>>>
>>>>> Yet in general that has improved a lot in the last generations of
>>>>> Intel compilers.
>>>>>
>>>> Well, running the same code, here is the result discrepancy I got:
>>>>
>>>> FLOPS:
>>>>  my code has to do 7,975,847,125,000 floating-point operations (~8 TFLOP)
>>>>  ... takes 15 minutes on an 8*2-core Opteron with 32 Gigs-o-RAM (thank
>>>>  you OpenMP ;)
>>>>
>>>> The running times (ran it a _few_ times...but not the statistical
>>>> minimum
>>>> of 30):
>>>>  ICC -> runtime == 689.249  ; summed error == 1651.78
>>>>  GCC -> runtime == 1134.404 ; summed error == 0.883501
>>>>
>>>> Compiler Flags:
>>>>  icc -xW -openmp -O3 vqOpenMP.c -o vqOpenMP
>>>>  gcc -lm -fopenmp -O3 -march=native vqOpenMP.c -o vqOpenMP_GCC
>>>>
>>>> No trickery, no smoke and mirrors ;) Just a _huge_ kick ASS k-Means
>>>> parallelized with OpenMP (thank gawd, otherwise it takes hours to run) and
>>>> a rather big database of 1.4 Gigs.
>>>>
>>>> ... So this is what I meant by floating point errors. Yes, the runtime was
>>>> almost halved by ICC (and this is on an *opteron* based system, Tyan VX50).
>>>> I was actually looking at precision skew rather than running time, and
>>>> that's where I fell off my chair.
>>>>
>>>> For the ones itching for a little more specs:
>>>>
>>>> eric at einstein ~ $ icc -V
>>>> Intel(R) C Compiler for applications running on Intel(R) 64, Version 10.1   Build 20080602
>>>> Copyright (C) 1985-2008 Intel Corporation.  All rights reserved.
>>>> FOR NON-COMMERCIAL USE ONLY
>>>>
>>>> eric at einstein ~ $ gcc -v
>>>> Using built-in specs.
>>>> Target: x86_64-pc-linux-gnu
>>>> Configured with:
>>>> /dev/shm/portage/sys-devel/gcc-4.3.1-r1/work/gcc-4.3.1/configure
>>>> --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/4.3.1
>>>> --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.3.1/include
>>>> --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.3.1
>>>> --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.3.1/man
>>>> --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.3.1/info
>>>>
>>>> --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.3.1/include/g++-v4
>>>> --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-altivec
>>>> --enable-nls --without-included-gettext --with-system-zlib
>>>> --disable-checking --disable-werror --enable-secureplt --enable-multilib
>>>> --enable-libmudflap --disable-libssp --enable-cld --disable-libgcj
>>>> --enable-languages=c,c++,treelang,fortran --enable-shared
>>>> --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu
>>>> --with-bugurl=http://bugs.gentoo.org/ --with-pkgversion='Gentoo 4.3.1-r1 p1.1'
>>>> Thread model: posix
>>>> gcc version 4.3.1 (Gentoo 4.3.1-r1 p1.1)
>>>>
>>>>  Vincent
>>>>>
>>>>> On Sep 17, 2008, at 10:25 PM, Greg Lindahl wrote:
>>>>>
>>>>>  On Wed, Sep 17, 2008 at 03:43:36PM -0400, Eric Thibodeau wrote:
>>>>>
>>>>>>> Also, note that I've had issues with icc generating really fast but
>>>>>>> inaccurate code (fp model is not IEEE *by default*, I am sure
>>>>>>> _everyone_ knows this and I am stating the obvious here).
>>>>>>
>>>>>> All modern, high-performance compilers default that way. It's certainly
>>>>>> the case that sometimes it goes more horribly wrong than necessary, but
>>>>>> I wouldn't ding icc for this default. Compare results with IEEE mode.
>>>>>>
>>>>>> -- greg
>>>>>>
>>>>>>
>>>>>>
>>>>  _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>
>>>
>>
>>
>
>
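
The gap in summed error that Eric reports above (1651.78 with ICC against
0.883501 with GCC) fits Greg's point that compilers default to a non-IEEE
floating-point model, which among other things lets them reorder sums. The
sketch below is not Eric's k-means code, and its input values are made up; it
only shows, in plain Python doubles, how accumulation order alone can change
a sum.

```python
import math

# Hypothetical data, not from the thread: one large value followed by many
# small ones, a pattern where accumulation order matters.
values = [1.0e16] + [1.0] * 1000


def naive_sum(xs):
    """Accumulate left to right in double precision."""
    total = 0.0
    for x in xs:
        total += x
    return total


forward = naive_sum(values)             # large value first
backward = naive_sum(reversed(values))  # small values first
exact = math.fsum(values)               # correctly rounded reference sum

print("forward :", forward)
print("backward:", backward)
print("exact   :", exact)
print("forward error :", abs(forward - exact))
print("backward error:", abs(backward - exact))
```

With the large value first, each added 1.0 is smaller than the spacing
between neighbouring doubles at that magnitude and is rounded away, while
summing the small values first keeps them. A compiler that is free to
reassociate a reduction can move a program from one of these orders to the
other.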