[Beowulf] itanium vs. x86-64

Tue Feb 10 03:34:04 PST 2009

On Feb 10, 2009, at 8:59 AM, Toon Knapen wrote:

>> SGI altix3000 series is basically 280ns (random 8 byte reads from  
>> big buffer) shared memory until 4 sockets.
>
> itanium needs 8 clock-cycles to fetch a double from L2 cache  
> (floats are not stored in L1) and 110 clock-cycles from L3 IIRC,  
> how is that with x86-64?
>

L2 is relevant for instructions more than data.

In my Diep program, for example, the L1d has 0.6-0.9% misses.
0.5 to 0.8% of that goes to the memory controller anyway,
so it also passes through L2 and L3 in 95% of cases.

L1i has 1.34% misses on core2. Obviously it is a lot more than that
on itanium2, as it cannot store instructions in L2.

Memory controller is very relevant for many workloads.

Scientists don't have something that "just fits in L3 cache". They  
use gigabytes of RAM.

Me too.

Latency on opteron at the time was 110 ns (for 2 sockets) and about
170-200 ns for 4 socket opteron when fetching memory
from a RANDOM spot on any of the 4 memory controllers. This was in
fact with cheap ECC/REG DDR ram and quite cheap mainboards.
Dual channel memory mainboards already improve latency dramatically.
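
A rough sketch of how a "random 8 byte reads from a big buffer" number
like the ones above can be measured. This is not the benchmark used for
the figures in this post; the names random_cycle and chase are made up
for illustration. The idea is a pointer chase: build one big random
cycle, so every read depends on the previous one and the prefetcher
cannot guess the next address. In Python the interpreter overhead
dominates the timing, so read it as a description of the access
pattern; a real measurement would do the same loop in C.

```python
import random
import time
from array import array


def random_cycle(n, seed=0):
    """Sattolo's algorithm: a permutation of 0..n-1 that is one single cycle."""
    rng = random.Random(seed)
    nxt = array("q", range(n))  # 8-byte signed entries
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps):
    """Follow the chain from index 0; each read depends on the one before."""
    pos = 0
    start = time.perf_counter()
    for _ in range(steps):
        pos = nxt[pos]
    elapsed = time.perf_counter() - start
    # Return the final position and the average ns per dependent read
    # (in Python this average includes interpreter overhead).
    return pos, elapsed / steps * 1e9
```

Calling chase(random_cycle(1 << 24), 1_000_000) walks a 128 MB table,
which is assumed to be far bigger than any cache on the machine.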

Single socket it was 90 ns for A64.

When just doing local fetches it was roughly 90-110 ns.

The same thing at the itaniums from NWO was 280 ns.

More than a factor 2.5 slower for local RAM, and roughly a factor 2
slower for random fetches.
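
The ratios can be checked with a few lines of arithmetic. The
latencies are the ones quoted in this post. The second function is
only a back-of-envelope model: it assumes the 0.5 to 0.8% from the
Diep numbers is a fraction of all L1d accesses, and it ignores
prefetching and overlapping misses.

```python
ITANIUM_NS = 280.0  # latency quoted for the NWO itaniums
OPTERON_NS = {
    "1 socket": 90.0,
    "2 sockets": 110.0,
    "4 sockets, low": 170.0,
    "4 sockets, high": 200.0,
}


def slowdown(itanium_ns, other_ns):
    """How many times slower the itanium latency is."""
    return itanium_ns / other_ns


def memory_cost_per_access(frac_to_memory, latency_ns):
    """Average ns added per L1d access by the reads that go all the way to RAM."""
    return frac_to_memory * latency_ns


for name, ns in OPTERON_NS.items():
    print(f"{name}: itanium is {slowdown(ITANIUM_NS, ns):.2f}x slower")

for frac in (0.005, 0.008):
    opteron = memory_cost_per_access(frac, 110.0)
    itanium = memory_cost_per_access(frac, ITANIUM_NS)
    print(f"{frac:.1%} to RAM: {opteron:.2f} ns vs {itanium:.2f} ns per access")
```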

You must never focus upon L2 and L3 cache. These caches come AFTER  
L1. L1 is by far most important.
If L1 is real tiny for the instructions, then speed of L2 matters, IF  
IT CAN STORE INSTRUCTIONS.
Itanium2 couldn't do that.

Thing is, a 128 KB L1 is always going to outperform a 64KB L1 bigtime.
Above 128KB L1 things start to get less relevant for majority of  
software products,
except of course when your instruction size is real huge and you work  
with big bundles,
then having 256KB L1 is no luxury.

A real fast L1 cache has its advantage, except when it prevents you
from clocking your cpu very high.

How many 3Ghz+ clocked cpu's have a L1 that's under 4 cycles?  0

If you make a very cheap CPU design such as itanium, obviously having  
a 1 cycle L1 is a simple way to avoid problems.
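
To compare L1 hit times across different clocks, cycles have to be
converted to nanoseconds. A small sketch: the 4 cycles at 3 GHz and
the 1 cycle L1 come from the text above, while the 1.6 GHz itanium2
clock is an assumption added here for illustration.

```python
def cycles_to_ns(cycles, clock_ghz):
    """Latency in ns for a given cycle count at a given clock in GHz."""
    return cycles / clock_ghz


x86_l1_ns = cycles_to_ns(4, 3.0)      # 4-cycle L1 at 3 GHz
itanium_l1_ns = cycles_to_ns(1, 1.6)  # 1-cycle L1, ASSUMED 1.6 GHz clock

print(f"x86-64 L1 hit:   {x86_l1_ns:.3f} ns")
print(f"itanium2 L1 hit: {itanium_l1_ns:.3f} ns")
```

The L1 hit time is only one term. Every other instruction also runs
at the clock of the cpu, which is the point about clock speed here.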

I say this as an objective hardware layman; all the experts are simply
too coloured. Either they work for intel or have a big affiliation
with intel,
or they work for another company and therefore dislike intel. That's
the grim reality.

So for me as a layman it also was a big shock when in January 2003 I
started to realize that itanium2, and any of its future follow-ups,
would not EVER get clocked at the same speed as x86-64 processors.
Even if it had the same process technology at the same time and the
same number of cores, that already renders it obsolete of course.

What i do remember is the discussions years ago about whether one  
should go for a tiny L1 or a tad bigger one.
Where 64KB L1 is tiny and where 256KB L1 is big.

I remember all the postings of intel engineer fanboys who voted for
an ultra tiny L1.
Years later we see the result.

Clocking Tukwila now at 2.x Ghz at 65 nm is maybe a formidable
achievement in the eyes of engineers, but it is simply years too late.
Realize that c2q was clocked at 3.2Ghz years ago in 65 nm.

It is even QUESTIONABLE whether tukwila 2.4ghz will SCALE better for
my chessprogram Diep than a by now old 3.2Ghz QX9775.
And I say that knowing about SMT, which schedules more than 1 process
on 1 processor. I'm even prepared to overlook that
scaling is totally irrelevant compared to actual speedup in time, as
you guys have not much to do with game tree search algorithms.

>
>> f) itanium2 total focussed upon floating point, yet that is about  
>> the least interesting thing for such hardware to do; there always  
>> have been cheaper floating point solutions than itanium2. Let's  
>> forget about the disaster called itanium-1. A wrong focus. Integer  
>> speed simply matters for such expensive cpu's. This is why IBM  
>> could sell power6.
>
> same goes for x86-64, no?
>

I wouldn't qualify x86-64 as a floating point monster. Just some
manufacturers have been totally asleep if you ask me. They all want to
earn so many billions that they 'forget' to produce cpu's which
are relatively easy to produce for a small team, and with which they
can earn say 0.5 billion a year.
The real problem of a custom cpu made by an ultra tiny team is of
course that it is complicated to clock it higher than say 333Mhz.

Yet already years ago it shouldn't have been too hard to produce a
cpu of 500Mhz that basically behaves like a vector processor for
double precision floating point and that gets 0.5 Tflop hands down.
That was years ago.
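
To see what 0.5 Tflop at 500Mhz asks of the hardware, divide the flop
rate by the clock. The sketch below also counts lanes under the
assumption that each lane does one fused multiply-add per cycle,
counted as 2 flops; that FMA assumption is added here and is not part
of the claim above.

```python
def flops_per_cycle(target_flops, clock_hz):
    """Double precision operations the chip must finish every clock cycle."""
    return target_flops / clock_hz


def lanes_needed(target_flops, clock_hz, flops_per_lane_per_cycle=2):
    """Vector lanes needed, ASSUMING each lane does one FMA (2 flops) per cycle."""
    return flops_per_cycle(target_flops, clock_hz) / flops_per_lane_per_cycle


per_cycle = flops_per_cycle(0.5e12, 500e6)
lanes = lanes_needed(0.5e12, 500e6)
print(f"{per_cycle:.0f} flops per cycle, {lanes:.0f} FMA lanes")
```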

Tony, see all the specifications your own organisations make. Would  
they have bought a 500Mhz vector chip?

I bet not.

The cpu has to follow all kinds of specifications. Yet a lot of
software can already be helped by a fast double precision vector
processor.
Call it manycore, call it vector processor. It seems some intel dudes
are pissed now at my naming convention of calling larrabee
a vector processor; gpu's I call manycores, and multicores can do a
hell of a lot more than the first 2.

Maybe, considering the above, the "too small profit" for a floating
point monster chip should boost some initiatives.
Collect a few NSA type hardware nerds to make a cpu design. Then have
it produced commercially,
and you have a 2 Tflop double precision cpu for a small price which
all kinds of governments can order
publicly for their universities.

All these guys have little to do anyway right now, as they hired too
many people in the past few years.

The only thing that is real important is to overspec the cpu. That's
the problem with all these 'big' cpu's if we speak about floating point:
they must be so generic that they are not giving a big punch in
what you need most.

Imagine that you can offload matrix calculations from expensive
itanium/power/sun/x86 supercomputers to some dedicated
floating point vector type hardware. That really solves half the
problem.

The additional problem you have selling that hardware is that
universities suddenly see their sporthall 500 machine
halved in size.

Let's take the dutch example.

Suppose you buy 1 big power6 machine of 105 nodes. It is 60 Tflop.
Far above the machine in Groningen in short.

Now suppose NWO had decided to spend half their money on a 50 node
power6 machine and the other half on some
major league floating point monster. They wouldn't have been happy
about that in Amsterdam, as then in the sporthall 500
the machine in Groningen would've been above their current box.
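
The halving can be put in numbers. The 105 nodes and 60 Tflop are the
figures given above; the sketch assumes the Tflop count scales
linearly with the number of nodes, which is a simplification. The
figure for the Groningen machine is not given in this post, so it is
left out.

```python
TOTAL_TFLOP = 60.0  # the big power6 machine, as quoted above
TOTAL_NODES = 105


def tflop_for_nodes(nodes, total_tflop=TOTAL_TFLOP, total_nodes=TOTAL_NODES):
    """Tflop of a smaller machine, ASSUMING linear scaling with node count."""
    return total_tflop / total_nodes * nodes


print(f"per node: {tflop_for_nodes(1):.3f} Tflop")
print(f"50 nodes: {tflop_for_nodes(50):.1f} Tflop")
```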

That's the grim reality isn't it?

It's all about prestige for the professors to get a higher ranking in  
the sporthall 500 and big money for the manufacturers.

I won't blame the manufacturers, they're not bureaucrats.

Vincent
