[Beowulf] Xeon Phi questions - does it have *any* future?

Vincent Diepeveen diep at xs4all.nl
Sat Dec 15 14:48:52 PST 2012


On Dec 15, 2012, at 9:58 PM, Mark Hahn wrote:

>> The big question i would like to ask intel architects is whether the
>> Xeon Phi architecture has a future,
>
> it's hard for me to imagine how this is not a silly question.
>

Some very big hardware experts tend to agree with me. See the
reactions at realworldtech - there are in fact bunches of Intel
engineers posting there.

That's why I ask this question.

That Larrabee CPU is cache coherent.

>> so what comes AFTER this Xeon Phi?
>
> more cores.  lower power-per-core.  higher clocks.  more memory  
> bandwidth.
> more per-core cache and/or flops.

More cores with cache coherency is a much bigger problem than you
might guess.

Coherency is centralized stuff. Imagine a cluster where all cpu's
need to communicate with one specific node for *every* shared fetch
from RAM.

x86 compatibility simply means that all 'normal' codes will use
centralized data structures.

See it as a cluster where all cores nonstop synchronize with one
specific node.

Only specially optimized codes, which are no easier to write than
gpgpu codes, can avoid this problem. But then you no longer have an
x86-compatible chip in any meaningful sense.

So either they kick it out, or what?

Time to let the intel engineers explain what their plans are.
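
To make the problem concrete, here is a minimal sketch from my side
(a toy, nobody's real code): a 'normal' centralized counter that
every thread updates, versus padded per-thread counters. In the first
loop the one cache line holding the counter bounces between all cores
on every increment - that is the nonstop synchronization with one
node I mean. In the second loop each core stays in its own line and
the coherency traffic vanishes.

/* coherency_thrash.c - toy illustration, not a benchmark.
   Build: gcc -O2 -fopenmp coherency_thrash.c */
#include <omp.h>
#include <stdio.h>

#define N 100000000L
#define MAXTHREADS 64   /* assumed upper bound on thread count */

long shared_counter = 0;

/* pad each slot to a full 64-byte line so cores don't share lines */
struct { long v; char pad[56]; } percore[MAXTHREADS];

int main(void)
{
    double t0 = omp_get_wtime();
    /* centralized: every core fights over one cache line */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) {
        #pragma omp atomic
        shared_counter++;
    }
    double t1 = omp_get_wtime();

    /* distributed: each core increments its own private line */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < N; i++)
            percore[id].v++;
    }
    double t2 = omp_get_wtime();

    long sum = 0;
    for (int i = 0; i < MAXTHREADS; i++) sum += percore[i].v;
    printf("shared: %.2fs   per-core: %.2fs   (%ld vs %ld)\n",
           t1 - t0, t2 - t1, shared_counter, sum);
    return 0;
}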

>
> this is a place where stacking dram would be a significant win, though
> perhaps it's hard to manage given that modern chips are all normally
> power-limited.



>
>> From what i understand it has cache coherency - otherwise it could
>> not run x86 codes at a slow speed.
>
> it runs x86 because that's such a familiar ISA, and so many
> tools/codes can run without significant change.  (not that ARM would
> be difficult to imagine, but obviously different from NVidia's ISA.)
> cache coherency does not make a core run slowly - indeed, there's no
> reason to believe that cache coherency is not eminently scalable
> (using directories, of course).  what's not scalable is programming
> techniques that thrash the coherency mechanism.
>

That applies to practically all x86 codes, which is my point.

So the x86 compatibility is bloody nonsense. Having an L2 cache on
each core is not, of course. Note that this specific architecture
needs an L2 cache on each core regardless of the x86 design.

The question is more relevant than you might imagine right now.

Realize that some years ago intel had a prototype chip with 80 cores.
See 'the teraflops research chip': http://en.wikipedia.org/wiki/Intel_MIC

They scaled that back to 60 cores and extended the vector size to 512
bits for the Xeon Phi as we know it right now.
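
For a feel of what those 512-bit vectors mean in code, here is a
minimal sketch from my side. I write it with the later AVX-512 style
_mm512_* intrinsics from immintrin.h (Knights Corner's native IMCI
set is a separate, incompatible encoding, but the shape is the same):
one instruction processes 16 floats.

#include <immintrin.h>

/* y[i] = a*x[i] + y[i], 16 floats per step.  AVX-512 spelling of
   the intrinsics; assumes n is a multiple of 16 and that x and y
   are 64-byte aligned. */
void saxpy512(int n, float a, const float *x, float *y)
{
    __m512 va = _mm512_set1_ps(a);          /* broadcast a to 16 lanes */
    for (int i = 0; i < n; i += 16) {
        __m512 vx = _mm512_load_ps(x + i);  /* one 64-byte load */
        __m512 vy = _mm512_load_ps(y + i);
        vy = _mm512_fmadd_ps(va, vx, vy);   /* fused multiply-add */
        _mm512_store_ps(y + i, vy);
    }
}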

> cache coherency is really best thought of as a modest extension of  
> the ways/tag-match architecture of a normal cache.  some of the  
> line state
> transitions involve interaction with other cores' caches, but it  
> isn't inherently expensive in either space, time or power.

It is the dominant cost.

Look for example at AMD's huge crossbar in the Bulldozer architecture.

How big is the crossbar in Xeon Phi?

> Intel has talked a bit about the ring bus and how its design was
> optimized for coherency traffic (dual-direction rings, each a
> cacheline wide, with control/coherency lanes replicated 2x for each;
> one clock per hop).  in spite of being quite wide, the ring doesn't
> seem to dominate the die photos.  onchip point-to-point links are
> not, afaik, difficult, especially short ones.
>

We need a die photo where someone of the De Vries type can figure out
how many transistors they spent on the solution they came up with to
avoid a crossbar that would eat 95% of all transistors of Xeon Phi.

>> The gpgpu hardware doesn't have cache coherency.
>
> well, a Cuda programmer _does_ have to worry about various forms of  
> data synchronization.  in fact, that's really the main reason why  
> the Cuda programming model is so strange to port to.
>

Look, I'm not saying CUDA is easier or harder to program than Xeon
Phi. Both are vector chips.

For example the factoring code (like mfaktc) runs completely inside
the compute units out of the register files, and only now and then
fetches a few bytes from the DIMMs (which happens automatically).

No synchronisation whatsoever - this code is totally embarrassingly
parallel.

Any form of synchronisation there is always manual synchronisation.
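
As a toy sketch of that shape of code (my own illustration, not
mfaktc's actual kernel): trial factoring 2^p - 1, where every
candidate factor q = 2*k*p + 1 is tested independently of all the
others. With p = 11 it prints the factors 23 and 89 (and also the
composite 2047, because real code sieves the candidates first, which
this toy skips):

/* Toy sketch of the mfaktc idea - NOT its real kernel.  Trial
   factoring of the Mersenne number 2^p - 1 for prime p: candidate
   factors have the form q = 2*k*p + 1, and q divides 2^p - 1
   exactly when 2^p mod q == 1.  Every k is independent, so there
   is zero synchronisation between threads. */
#include <stdio.h>
#include <stdint.h>

static uint64_t pow2_mod(uint64_t p, uint64_t q)
{
    /* 2^p mod q by square-and-multiply (gcc/clang __int128) */
    unsigned __int128 r = 1, b = 2 % q;
    while (p) {
        if (p & 1) r = r * b % q;
        b = b * b % q;
        p >>= 1;
    }
    return (uint64_t)r;
}

int main(void)
{
    const uint64_t p = 11;   /* tiny demo exponent: 2^11 - 1 = 2047 */
    #pragma omp parallel for
    for (uint64_t k = 1; k <= 1000; k++) {
        uint64_t q = 2 * k * p + 1;
        if (pow2_mod(p, q) == 1)
            printf("k = %4llu  ->  factor %llu\n",
                   (unsigned long long)k, (unsigned long long)q);
    }
    return 0;
}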

>> This is why we have
>> so many cores in such a short period of time at the
>> gpu's.
>
> eh?

If a core is dead in a manycore chip, you just disable it. That's why
for so many years those gpu's could be so huge without being really
expensive, whereas the CPU's, which have the cache coherency problem,
need very sophisticated techniques to be able to lose cores, so I
understand. It's a big problem. That's why we have seen those cpu's
sit at around 300 mm^2 for years, whereas the gpu's can easily be
1000 mm^2, so to speak. Price scales linearly: if one compute unit
has an error, you just disable it, *no problem*.

That is a much easier production regime than with cache coherency.

CPU's don't scale linearly in production cost; cost goes up
dramatically above 300 mm^2 at today's technology.
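
To put a rough number on that (a back-of-the-envelope sketch with the
textbook Poisson yield model Y = exp(-A * D0); the defect density is
my assumption, not a fab figure): at D0 = 0.4 defects/cm^2, a
300 mm^2 die yields about exp(-1.2) ~ 30%, while a Fermi-sized
576 mm^2 die yields about exp(-2.3) ~ 10%. If every transistor must
work, the big die is three times as expensive per good chip before
you even count the extra area; if you may fuse off a broken compute
unit, most of those 'failed' big dies become sellable parts again.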

>
> let's compare Phi to the Fermi generation.  Phi has 60 macro-cores,
> each 4-threaded and 512b wide.  Fermi has 16 SMs, each with 32  
> pseudo-cores,
> but this is really just SIMD.  from a DP perspective, it's really only
> 16x wide SIMD.  that is, 512b...

That is a fair comparison.

Fermi is around 576 mm^2, and they also use the chip in dirt cheap
gpu's.

Xeon Phi is rumoured to be 350 mm^2. That is already over the
break-even die size of 300 mm^2.

It's problematic to make it larger.

>
> there are some other differences: Phi has conventional-appearing  
> registers
> (32x512b, probably per-thread - 4x) versus Fermi's 32kx32b shared  
> among all threads.  Phi has conventional caches (32k L1, 512K  
> coherent L2); Fermi has 64K L1 storage per SM as well, but can turn  
> off tag-related
> behavior to form per-SM "shared" memory (and 768K shared L2).
>
> I see no reason to think that Phi's performance will be hurt by the  
> directory-based L2 coherency implemented on its ring bus.  Intel's  
> clearly been working on low-latency rings for a long time - they're  
> forgiving, logically simple and can be quite high-performing.
>
> really, the big difference between Cuda and Phi is the vector  
> programming
> model.  Cuda presents a vector architecture as threads, even though  
> you can end up with a block containing 31 annulled threads that  
> still consumes
> cycles (Nvidia talks about the annulling as "divergence".)  Intel  
> provides
> a more conventional masked 512b SIMD - sort of *explicitly* annulling.
>
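
That explicit annulling looks roughly like this from C - a hedged
sketch on my side, again in the AVX-512 spelling of the intrinsics
(Knights Corner's IMCI differs in detail). In CUDA you would write
the plain scalar branch, if (x[i] > 0) y[i] = 2*x[i];, per thread and
let the hardware annul the lanes; on Phi the mask is a value you can
see:

#include <immintrin.h>

/* One 16-wide step of "if (x > 0) y = 2*x".  The divergence is
   just an explicit 16-bit lane mask. */
void masked_step(const float *x, float *y)
{
    __m512    vx = _mm512_load_ps(x);
    __m512    vy = _mm512_load_ps(y);
    __mmask16 m  = _mm512_cmp_ps_mask(vx, _mm512_setzero_ps(),
                                      _CMP_GT_OQ);  /* lanes with x > 0 */
    /* multiply only in active lanes; inactive lanes keep the old y */
    vy = _mm512_mask_mul_ps(vy, m, vx, _mm512_set1_ps(2.0f));
    _mm512_store_ps(y, vy);
}
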
> I haven't seen any good technical discussions about latencies, which
> is what will really bite code that is not EP and low-memory.  in a  
> sense,

Forget the word latency on these platforms.


> both systems are ideal for financial MC, and fairly good for dense  
> matrix-matrix math.  for other stuff (say, adaptive mesh astro code),
> I'd say Phi has an architectural advantage simply because it has  
> more independent cores (60 vs 16) and a somewhat more
> irregularity-forgiving
> memory architecture (distributed coherent L2).
>
>> From answers from engineers i understand the reason why most normal
>> cpu's do not have more cores is because of the cache coherency. More
>
> nonsense.  conventional CPUs do not have more cores because most  
> people want few-fast cores.  look at AMD's bulldozer/piledriver  
> effort: they provided a lot more cores and all the reviews pissed  
> on them.

Look at how big a part of the total chip the crossbar is in modern
cpu's. Take Bulldozer: it has a HUGE crossbar.

You're really underestimating this.

Nvidia could easily build a 2000 mm^2 gpgpu chip if they wanted to.
With Xeon Phi it's tough to even reach 400 mm^2.

>
>> cores are a cache coherency nightmare.
>
> nah.
>
>> Cache coherency is very costly to maintain in a cpu. So the question
>
> nah.
>
>> i want to ask is whether Xeon Phi scales for future
>> generations of releases of it.
>
> I think programming models are extremely important: to make the  
> hardware
> easier to take advantage of.  Phi seems pretty easy for a single  
> board,
> but it's still a bit sticky because it's a PCIE coprocessor  
> (without coherent access to host memory, and without any  
> particularly nice way to
> connect multiple chips, let alone multiple systems.)  I think a big  
> question
> is what level of integration is going to actually drive this market.
> are these chips designed for exascale clusters?  if so, scalable  
> interconnect
> is probably the main concern at this point.  if there's a  
> significant volume market for few-teraflop systems (personal  
> supercomputers), then just putting a few cards on a PCIE would work  
> OK.
>
> figuring out how to use stacked memory to provide really big bandwidth
> has to be in every architect's mind right now.
>
> I also wonder whether AMD has anyone working on this stuff.  it  
> would be fascinating if they took a different approach - say a 20W  
> APU with stacked
> dram that could be tiled by the score onto a single board.  in some  
> sense,
> that's probably a more manufacturable, power-efficient and scalable  
> approach
> than the 300W add-in cards that Nvidia and Intel are pursuing.
>
>> Are they going to modify the AVX2 code to AVX3, so vectors from 1024
>> bits in the future, in order to get some extra performance?
>
> I doubt it.  no code is purely vectorizable, and lots of code is  
> merely
> scalar.  in a meaningful sense, Nvidia's GPUs are the spiritual  
> descendant
> of the Tera/Cray MTA architecture, where programmers almost see an  
> ocean
> of threads representing the dataflow graph of their program.  the  
> hardware
> schedules threads onto cores whenever arguments are available (ie,  
> usually
> at the end of a memory access bubble.)  the current Phi is dealing  
> with the same basic workload characteristics, but the real question  
> is: at what granularity are threads independently scheduled.
>
> on Nvidia, threads in a block are never broken up, in spite of the  
> fact that they may have diverged (logically or due to
> non-coalesced/contiguous
> memory references).  Phi "blocks" are, in some sense, 64-wide, and  
> a single
> core only has 4 of them in-flight - Phi's thread scheduler is  
> probably quite a bit simpler than Nvidia's.  (MTA is, afaict, the
> extreme case, where
> all threads are independent and not run in fixed batches (blocks).)
>
>> I assume more cores is going to be near impossible to keep coherent.
>> 60 is already a lot.
>
> nonsense.  coherency costs nothing in the absence of sharing/conflict.
> even when there is sharing, they could scale using an onchip 2d mesh.
>
> regards, mark hahn.
> not a chip architect, but hey, make an offer ;)



