[Beowulf] Re: dual core Opteron performance - re suse 9.3
diep at xs4all.nl
Tue Jul 12 09:04:40 PDT 2005
A few questions.
Did you use PGO (profile-guided optimization) with gcc 3.3.4 for your code?
PGO is broken in 3.3.4 for my software: it shows up when I deterministically
compare the single-CPU output of a build compiled with 3.3.4 + PGO. Did you
deterministically compare both executables with each other (running on a
single CPU) and check whether the output is 100% identical?
Note that the 3.3.4 gcc shipped by SuSE may not be 100% identical to stock
gcc 3.3.4. However, the same bug is present in the 3.3.1-3.3.x series of
SuSE's GCC.
In 4.0.0 PGO works better and produces the same output; YMMV there.
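A minimal sketch of the kind of deterministic check meant above, assuming both builds read the same input, run on a single CPU, and write their results to stdout. The helper names and the binary names in the usage note are hypothetical, not anything from the code under discussion.

```python
import subprocess


def run_once(cmd, stdin_data=b""):
    """Run one command and return everything it wrote to stdout, as bytes."""
    result = subprocess.run(cmd, input=stdin_data, stdout=subprocess.PIPE, check=True)
    return result.stdout


def first_difference(a, b):
    """Return the byte offset of the first difference, or None if a == b."""
    if a == b:
        return None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One output is a strict prefix of the other.
    return min(len(a), len(b))


def compare_builds(cmd_plain, cmd_pgo, stdin_data=b""):
    """Run both builds on identical input and report where their outputs diverge."""
    return first_difference(run_once(cmd_plain, stdin_data),
                            run_once(cmd_pgo, stdin_data))
```

Hypothetical usage: `compare_builds(["./prog_plain"], ["./prog_pgo"])` returns `None` when the two outputs are byte-for-byte identical, otherwise the offset of the first differing byte. Output that contains timings or dates would have to be filtered out first.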
At 10:36 AM 7/12/2005 -0500, Don Kinghorn wrote:
>Hi Vincent, ...all,
>The code was built on a SuSE9.2 machine with gcc/g77 3.3.4. The same
>executable was run on both systems.
>Kernel for the 2 dual-node setup was SuSE stock 2.6.8-24-smp
>for the 9.3 setup with the dual-core cpus it was the stock install kernel
>Memory was fully populated on the 2 node setup -- 4 one GB modules per
>there are only 4 slots on the Tyan 2875 (I had mistakenly reported yesterday
I'm not seeing anywhere at Tyan an indication this board can take advantage
It looks like there is 1 shared memory; correct me if I'm wrong. It's
not showing the RAM as working for 1 CPU, but rather for both.
>that there was only 2GB/per board for the benchmark numbers)
>The dual-core system had 4 one GB modules arranged 2 for each cpu.
So you compared a dual Opteron dual core (non-Tiger board)
with a dual Opteron (Tiger).
I assume you used 2 CPUs on both machines to compare the speed of your code.
Currently setting up Gentoo on a quad here.
>Important(?) bios settings were;
>Bank interleaving "Auto"
>Node interleaving "Auto"
>MemoryHole "Disabled" for both hardware and software settings
>The speedup we saw on the dual-core was less than 10% for the most jobs. MP2
>jobs with heavy i/o (worst case) was around a %20 hit (there were twice as
>many processes hitting the raid scratch space at the same time)
Are you now talking about a 4-core machine (dual Opteron, dual core)
compared with a dual Opteron Tiger, which gave a 10% speedup for the added 2
That's an ugly speedup in that case; perhaps improve the code?
Blaming the memory controllers is not a good excuse. The 2 memory
controllers can deliver more data per second than the CPUs deliver gflop
As you can see at sudhian, diep gets a speedup of 3.92 on 4 cores.
Of course, that took years of hard programming.
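To put the two figures side by side, here is a small sketch of the arithmetic. It assumes the quoted 10% is the gain from doubling the core count (2 cores to 4) and treats it as an upper bound, since the quoted text says "less than 10%"; the function name is just for illustration.

```python
def efficiency(speedup, resource_factor):
    """Fraction of the ideal speedup actually achieved.

    speedup          -- measured speedup (e.g. 3.92, or 1.10 for a 10% gain)
    resource_factor  -- how many times more cores were used (e.g. 4, or 2)
    """
    return speedup / resource_factor


# diep: speedup of 3.92 on 4 cores, relative to 1 core.
print(f"diep on 4 cores:    {efficiency(3.92, 4):.2f}")   # prints 0.98

# Quoted benchmark: at most 10% faster with twice the cores (assumed 2 -> 4).
print(f"2 -> 4 cores, +10%: {efficiency(1.10, 2):.2f}")   # prints 0.55
```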
>I still have lots of testing and tuning to do. These tests were just to
>was going to work and how much trouble it was going to be. ( It was a LOT of
>trouble getting SuSE9.3 installed but I think worth it in the end)
Setting up gentoo 2005.0 amd64 universal here now. Will go fine.
>Best to all
>> If you 'did get better performance', that's possibly because
>> you have some kernel 2.6.x now, allowing NUMA, and a new
>> compiler version of gcc like 4.0.1 that has been bugfixed more than
>> the very buggy series 3.3.x and 3.4.x
>> Can you show us the differences between the compiler versions and kernel
>> versions you had and whether it's NUMA?
>> Also how is your memory banks configured, for 64 bits usage or 128 bits
>> single cpu usage, or are all banks filled up?
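One quick way to answer the NUMA question on a 2.6 kernel is to look at the node directories in sysfs. A minimal sketch, assuming sysfs is mounted at /sys and that the kernel lists its nodes under /sys/devices/system/node; the function name is hypothetical.

```python
import os
import re


def numa_nodes(sysfs_node_dir="/sys/devices/system/node"):
    """Return the sorted NUMA node ids the kernel exposes.

    Returns an empty list when the directory does not exist, which is
    what one would expect on a kernel built without NUMA support.
    """
    try:
        entries = os.listdir(sysfs_node_dir)
    except OSError:
        return []
    matches = (re.fullmatch(r"node(\d+)", name) for name in entries)
    return sorted(int(m.group(1)) for m in matches if m)


if __name__ == "__main__":
    nodes = numa_nodes()
    print(f"NUMA nodes seen by the kernel: {nodes}")
```

More than one node in the list suggests the kernel sees the memory of each Opteron as a separate NUMA node; an empty list or a single node suggests it does not.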
>Dr. Donald B. Kinghorn Parallel Quantum Solutions LLC