[Beowulf] Nehalem and Shanghai code performance for our rzf example
diep at xs4all.nl
Mon Jan 19 17:15:42 PST 2009
Unlike you, I'm not limited by knowledge of materials.
I'd argue that if something gives 10x the clock rate, it destroys
everything, even at 1/10th of the transistor capacity.
Current status is:
Phenom2 overclocks better and is really cheap, and when programmed at a
really low level, near assembler, it has a higher IPC than Nehalem.
Especially in SSE2-type codes it's dominant. It's just the compiler that
fools you; it's intel friendly, to put it politely. That seems current
Yet objectively, Q6600 was a quantum leap forward. A brilliant
design when it was released.
Connected L2's or not connected, who cares when it delivers a big punch?
Test-set program tricks like hyperthreading: seen this, done that.
It doesn't work for most HPC-type workloads. It just makes timing your
software more complicated.
All the cpu's are still 4 cores, that's reality. I don't see progress
Newer process technology, from 65 nm to 45 nm, hopefully produces
cpu's more cheaply, yet it hardly clocks any higher at production level.
Only for watercooled overclockers does it suddenly make AMD very
attractive, yet that's not how clusters usually get built (my cluster
is probably a big exception anyway; to give one example, it currently
has 1 node).
Nehalem hardly performs better IPC-wise than Q6600 for integer
workloads, and it does so at a huge power cost. Phenom2 is in fact
0% better integer-wise than Phenom1, which is even more disappointing
in that respect. Just its price is cool: a factor 4 cheaper than the
Nehalem 3.2Ghz, nearly a factor 5, and just a 200Mhz lower default
clock.
I'm quite disappointed by the new cpu's from intel and amd to be honest.
The way these manufacturers 'fix' performance on paper is by using
Most test sets are too L3 oriented and too much subject to
optimization by compiler teams.
If you have a $100 billion company and under 100 'test programs',
that's what you get: for such a huge company with such a huge compiler
team it's too easy to bust everything.
Current new generation cpu's are faster on paper, in reality they
"paper supports everything"
A $100 billion company will bust every test and manage to push in new
tests with a data size that benefits big L3's, whereas in reality big
L3's are just not needed for HPC.
That's just total baloney for matrix calculations, CFD, whatever.
Either your code hardly gets inside L3,
or you need so many gigabytes of RAM that L3 doesn't matter either.
A few MB is enough.
4 MB versus 16 MB is simply no big deal.
Only some 'chosen' working set sizes benefit from L3.
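
To make the working-set point concrete, here is a minimal sketch in
Python. It assumes a deliberately crude model in which a working set
either fits in a cache or it does not, and it ignores associativity,
prefetching, cache sharing between cores and everything else that
matters on real hardware. The function name and the list of sizes are
made up for illustration; only the 4 MB and 16 MB figures come from the
text above.

```python
MB = 1024 * 1024


def benefits_from_bigger_l3(working_set_bytes, small_l3=4 * MB, big_l3=16 * MB):
    """Return True only if the working set is too big for the small L3
    but still fits in the big one (crude fits / does-not-fit model)."""
    return small_l3 < working_set_bytes <= big_l3


# Hypothetical working-set sizes in MB, chosen only to show the pattern.
for size_mb in (1, 2, 8, 64, 4096):
    gain = benefits_from_bigger_l3(size_mb * MB)
    print(f"{size_mb:5d} MB working set: bigger L3 helps = {gain}")
```

Under this model only the sizes that land in the window between the
two cache sizes see any gain; anything smaller already fits, anything
much larger goes to RAM either way.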
A 20Ghz PhenomGAaS will of course destroy everything.
As explained however, that doesn't really matter, because L3's eat
relatively little power compared to
the execution logic, so that is a big bummer in that case.
My plans were for a 128-core multiprocessor (each core low power) that
allows easy porting of HPC codes to it. I argued for, say, 50% of the
total RAM being assigned locally to the cores through a local L2 (not
shared at all with the other cores), plus a very slow, possibly even
off-chip L3 cache in front of a shared memory (the other 50% of the
RAM). It got laughed away by some intel fanboys. If that's the case,
then intel is dead in HPC of course, as nvidia and AMD will take over
with GPU-type supercomputers. I tend to have more faith in engineers,
though, than the fanboys and most professors do. I believe in new
solutions, not in the vicious circles of the past.
A manycore is really complicated to write efficient algorithms for,
whereas it is easy to port codes to some modified multicore-type cpu.
I'd argue that approaching things from the software viewpoint, WHAT IS
EASY TO PORT, might be a rather good idea for future cpu design.
If you now quote something that can run at 10x the clock speed, then
the question is of course: "suppose we made a big building filled with
GaAs processing units: at what price can you build it for me, and what
computing power does it give at what power?"
If the answer is: "the building might explode with odds of 1 in a
million", I'm sure some governments want to take that risk
if it is that much faster. See it as a feature. An ideal feature to sell
to N*SA, I'd argue.
The amount of power it uses is quite important IMHO.
Power should be an ever bigger concern in high-end HPC, I feel.
Right now it is paper demands from governments that
just get false statements in reply; I feel this will be unsellable to
governments in future. The number of watts per gflop matters quite a lot.
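
As a rough illustration of the watts-per-gflop metric, here is a small
Python sketch. The two machines and all their numbers are hypothetical,
invented only to show the arithmetic; they are not measurements of any
cpu mentioned in this thread. The function names are likewise just for
this example.

```python
def watts_per_gflop(power_watts, sustained_gflops):
    """Power drawn per sustained gflop; lower is better."""
    if sustained_gflops <= 0:
        raise ValueError("sustained_gflops must be positive")
    return power_watts / sustained_gflops


def yearly_energy_kwh(power_watts, hours=24 * 365):
    """Energy used when running flat out for the given number of hours."""
    return power_watts * hours / 1000.0


# Hypothetical machines: (name, watts under load, sustained gflops).
machines = [
    ("machine A", 130.0, 40.0),
    ("machine B", 95.0, 38.0),
]

for name, watts, gflops in machines:
    print(f"{name}: {watts_per_gflop(watts, gflops):.2f} W/gflop, "
          f"{yearly_energy_kwh(watts):.0f} kWh/year")
```

In this made-up comparison machine A is slightly faster, yet machine B
does each gflop for less power, which is the figure that ends up on the
electricity bill when you scale to a building full of nodes.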
If it were that easy to produce energy, we would of course already
have electric cars or cars that drive on water.
Of course we want ECC and ECC RAM on every design. Too many errors at
such computing power are not acceptable.
On Jan 19, 2009, at 7:31 PM, Bill Broadley wrote:
> John Hearns wrote:
>> BTW, re the discussion on processor frequency scaling,
>> what finally did happen to Emitter Coupled Logic and gallium
> I followed the exponential "intel killer" for quite some time, although it
> seemed obvious to me from the first slides it was going to be a failure. Sky
> high clock rates, tiny caches, and a poor memory buss seemed to be
> for failure.
> If gallium arsenide or some other material gave us 10x the clock rate per
> watt, but 1/2 the transistors would it really matter? Seemed like even intel
> is begrudgingly admitting it's the memory bus, and finally the nehalem is
> blessed with dramatically more bandwidth.
> Seems like increasingly cores are turning latency limited workloads (for the
> parallel jobs of course) into bandwidth limited ones. Without a memory bus
> that allows for 10x the bandwidth it doesn't really seem like 10x the clock
> rate would be of particular use.