[Beowulf] Xeon Phi out as well [kraut]

Vincent Diepeveen diep at xs4all.nl
Tue Nov 13 05:58:58 PST 2012


Interesting article.

Regrettably the writer is a technical noob, which shows clearly in
the German he writes.

He confuses MB with GB, so it's not clear how accurate the rest of
what he writes is. Well, what can you expect from Heise.de in that
sense...

Let's assume that the majority of what he wrote down is OK.

Then we are talking about 60 cores at 1.053 GHz using vectors of
512 bits, so that's 8 doubles per vector I assume, AVX2-like.
The horror architecture previously called Larrabee.

Just more cores. I read nothing about cache coherency anymore, and the
fact they can 'turn off' 2 cores suggests it might not have it. If so,
it no longer has the bottleneck that Larrabee had.

You have to run 4 threads on it simultaneously, says this article.
That's a factor 2 more than today's top GPUs need.
On both AMD and Nvidia you can perform well running 2 'threads'
'at the same time' (they get alternated).

I assume that's for the same reason, namely to hide the latency
between the execution units executing the instructions and the
results being released.
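
A toy model of that latency hiding, as a rough sketch only. It assumes
a core that alternates between its threads round-robin and a thread
that has to wait a fixed number of cycles between two of its own
instructions. The 4-cycle latency below is a made-up illustration
value, not a figure from the article or from Intel.

```python
def issue_utilization(threads, latency_cycles):
    """Fraction of cycles in which the core can issue an instruction.

    Toy assumption: each thread can issue once every latency_cycles
    cycles, and the core alternates between threads round-robin.
    """
    if threads < 1 or latency_cycles < 1:
        raise ValueError("threads and latency_cycles must be >= 1")
    return min(1.0, threads / latency_cycles)


HYPOTHETICAL_LATENCY = 4  # assumed value, for illustration only

for n in (1, 2, 4, 8):
    pct = 100 * issue_utilization(n, HYPOTHETICAL_LATENCY)
    print(f"{n} thread(s): {pct:.0f}% of issue slots used")
```

In this model extra threads beyond the latency buy nothing, which is
why a chip with a longer latency needs more threads in flight.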

From Larrabee we knew that instructions pretty important to HPC did
not have good throughput, eating several cycles each. So it's
difficult to calculate now what is possible to achieve.

Let's assume now that 1 instruction can get executed and retired each
clock cycle.

This is a dangerous assumption, as Intel historically has not had
very good multiply execution units on any architecture when compared
to competitors. Historically the latency on their x86 / x64 CPUs was
also nearly a factor 2 worse than on, for example, AMD's Opterons.
This is for 64-bit integer multiplication. The latest i7 should have
improved there though.

Under this assumption, that throughput is 1 instruction per clock even
though the latency of a multiply-add is several clocks, that gives us:

1.053 GHz * 60 cores * 8 doubles = 505.44 Gflop

Everyone always "lies" a factor 2 on top of that for multiply-add,
even though I bet no one will manage to push them through at 1
instruction per cycle in a nonstop manner. Also the big transforms
using Fourier transforms cannot use multiply-add at all. Yet if we
ignore that, like everyone ignores it, that gives bragging rights of

2 * 505.44 Gflop = 1.01088 Tflop
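
The same back-of-envelope calculation as a small script, so the
numbers are easy to redo with other clocks or core counts. It is only
a sketch of the estimate above: it assumes 1 vector instruction per
core per cycle, and the function and variable names are mine.

```python
CLOCK_GHZ = 1.053    # clock from the article
CORES = 60           # core count from the article
VECTOR_BITS = 512    # vector width from the article
DOUBLE_BITS = 64     # size of one double


def peak_gflop(clock_ghz, cores, lanes, flops_per_lane=1):
    """Peak Gflop assuming 1 vector instruction per core per cycle."""
    return clock_ghz * cores * lanes * flops_per_lane


lanes = VECTOR_BITS // DOUBLE_BITS           # 8 doubles per vector
plain = peak_gflop(CLOCK_GHZ, CORES, lanes)  # no multiply-add doubling
fma = peak_gflop(CLOCK_GHZ, CORES, lanes, flops_per_lane=2)

print(f"without multiply-add: {plain:.2f} Gflop")
print(f"with multiply-add:    {fma / 1000:.5f} Tflop")
```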

This isn't bad at all, considering that K20, which based on a
Moore's Law extrapolation from transistor count to doubling of speed
would have landed near 2 Tflop, appears to be just above 1.0 Tflop
right now.

The fear was of course that the latest Larrabee incarnation, Xeon Phi,
would cost $10k. Yet it seems Intel wants to conquer the HPC market,
and Heise gives a price for it here, the first time I see one:
2649 dollars.

It is only available in 2013 though, which is a disadvantage.

Of course be careful buying this chip if you don't know what AVX2 is.
Many tried to write code for AVX2 and it took them years to get some
prime number transforms to work even a bit on it.

We see that Intel has deviated from their original plan, yet they
still tell reporters the nonsense story that it would be interesting
to run Pentium code on it.

A single i7 will beat it there of course, as to get to the maximum
throughput you need to put your data inside vectors of 8 doubles,
otherwise it will perform horribly.
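
To put a rough number on that: a sketch that scales the 505.44 Gflop
estimate by how many of the 8 lanes an instruction actually fills.
It assumes throughput is simply proportional to the lanes used, which
ignores everything else (memory, latency, instruction mix), and the
names are mine.

```python
LANES = 8            # doubles per 512-bit vector
PEAK_GFLOP = 505.44  # estimate from this mail, without multiply-add


def effective_gflop(doubles_per_instruction, peak=PEAK_GFLOP, lanes=LANES):
    """Peak scaled by the fraction of vector lanes carrying real data."""
    if not 1 <= doubles_per_instruction <= lanes:
        raise ValueError("doubles_per_instruction must be in 1..lanes")
    return peak * doubles_per_instruction / lanes


for used in (1, 2, 4, 8):
    print(f"{used} of {LANES} lanes used: {effective_gflop(used):7.2f} Gflop")
```

Scalar code sits on the first line of that output, at one eighth of
the peak for the whole chip.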

Assuming the Larrabee instruction set survived, it is also possible
to indirectly access each core using special instructions.

Those however had a 7 cycle latency on Larrabee, so it's not very
encouraging to use them.
So doing the same thing you can do reasonably simply on GPUs is
pretty difficult here, yet not impossible.

Of course the only bummer is that it's not yet available.

From a marketing viewpoint it is a good idea, though, for Intel to
announce it now already, as otherwise everyone would already sign a
deal with Nvidia. We know from some years ago how Intel brought
several HPC organisations into big problems by simply not delivering
the Itanium2 CPUs at the appointed time. That took another 6 months
to a year. As they all talk with each other there, I am not sure of
the impact of this.

It's obvious however that Intel wants to compete right now by not
pricing the chip too high. That's good for the HPC community.
Now let's hope that none of the manufacturers gets a total monopoly,
otherwise we'll be paying the $7500 that the 1.5 GHz Itanium2 cost
at introduction.

Financially, these manufacturers can easily offer these CPUs for
$1500 - $2k, as that easily pays back all production and development
costs.

On Nov 13, 2012, at 1:40 PM, Eugen Leitl wrote:

>
> http://www.heise.de/newsticker/meldung/SC12-Intel-bringt-Coprozessor-Xeon-Phi-offiziell-heraus-1747942.html
>
> http://translate.google.com/translate?sl=auto&tl=en&js=n&prev=_t&hl=en&ie=UTF-8&layout=2&eotf=1&u=http%3A%2F%2Fwww.heise.de%2Fnewsticker%2Fmeldung%2FSC12-Intel-bringt-Coprozessor-Xeon-Phi-offiziell-heraus-1747942.html&act=url


