[Beowulf] A look at the 100-core Tilera Gx

Eugen Leitl eugen at leitl.org
Tue Nov 3 09:37:17 PST 2009


http://www.semiaccurate.com/2009/10/29/look-100-core-tilera-gx/

A look at the 100-core Tilera Gx

It's all about the network(s)

by Charlie Demerjian

October 29, 2009

TILERA IS CLAIMING to have the first commercial CPU to reach 100 cores, and
while this is true, the really interesting technology is in the interconnects.
The overall chip is quite a marvel, and it is unlike any mainstream CPU you
have ever heard of.

Making a lot of cores on a chip isn't very hard. Larrabee, for example, has 32
heavily modified Pentium (P54) cores as the basis of the GPU. If Intel wanted
to, it could put hundreds of cores on a die; that part is actually quite easy.
Keeping those cores fed is the most important problem of modern chipmaking,
and that part is not easy.

Large caches, wide memory busses, ring busses on chip, stacking, and optical
interfaces are all attempts to feed the beast. Everyone thought Intel's
Polaris, also known as the 80 core, 1 TeraFLOPS part from a few years ago,
was about packing cores onto a die. It wasn't; it was a test of routing
algorithms and structures. Routing is where the action is now; packing cores
in is not a big deal.

Routing is where Tilera shines. It has put a great deal of thought into
getting data from core to core with minimal latency and problems. Its rather
unique approach involves five different interconnect networks, programmable
partitioning, accelerators, and simply tons of I/O. Together, these allow
Tilera's third generation Tile-Gx CPUs to scale from 16 to 100 cores without
choking on congestion. They may not have the same single-threaded performance
as a Nehalem or Shanghai core, but they make up for it with volume.

100 core diagram

Tilera 100 core chip

The basic structure is a square array of small cores, 4x4, 6x6, 8x8 or 10x10,
each connected via five (5) on-chip networks, and flanked by some very
interesting accelerators. The cores themselves use a proprietary ISA that was
32-bit in the first two generations of Tilera chips; in the Gx, it is extended
to 64 bits. There are 75 new instructions in the Gx, 20 of which are SIMD, and
the memory controller now sees 64 bits as well.

In previous generations, there was no floating-point (FP) hardware in Tilera
products. The company strongly recommended against using FP code because it
had to be emulated, taking hundreds or thousands of cycles. With the new Gx
series chips, FP code is still frowned upon, but there is some FP hardware to
catch the odd instruction without a huge speed hit. The 100 core part can do
50 GigaFLOPS of FP, which may sound like a large number, but that is only
about 1/50th of what an ATI Cypress HD5870 chip can do.
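
If you want to sanity check that ratio, the commonly cited single-precision
peak for Cypress is 2.72 TeraFLOPS, which works out to roughly the 1/50th
claimed above. A throwaway C check, with the Cypress number being our
assumption rather than anything Tilera said:

    /* Rough sanity check on the "1/50th" figure, assuming the commonly
     * cited 2.72 TFLOPS single-precision peak for ATI's Cypress HD5870. */
    #include <stdio.h>

    int main(void)
    {
        double tilera_gflops  = 50.0;    /* 100-core Gx, per the article */
        double cypress_gflops = 2720.0;  /* assumed HD5870 SP peak       */

        printf("Cypress/Tilera ratio: %.1fx\n",
               cypress_gflops / tilera_gflops);
        /* Prints roughly 54x, i.e. on the order of 1/50th. */
        return 0;
    }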

The majority of the new instructions are aimed at what the Tilera chips do
best, integer calculations. Things like shuffle and DSP-like
multiply-and-accumulate (MAC) functions, including a quad MAC unit, are where
these new chips shine. Basically, the Gx moves information around very
quickly while twiddling bits here and there with integer functions.
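
To make that concrete, the sort of inner loop these instructions target looks
like the dot product below. It is only an illustrative C sketch, nothing
Tilera-specific, but a quad MAC unit is built to chew through exactly this
kind of multiply-and-accumulate work:

    /* Illustrative only: the sort of DSP-style inner loop that maps onto
     * multiply-and-accumulate hardware. A quad MAC unit can retire several
     * of these products per cycle; names and sizes here are invented. */
    #include <stdint.h>
    #include <stddef.h>

    int64_t dot_product(const int16_t *a, const int16_t *b, size_t n)
    {
        int64_t acc = 0;
        for (size_t i = 0; i < n; i++)
            acc += (int32_t)a[i] * b[i];   /* one MAC per element */
        return acc;
    }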

While the cores might not be overly complex, the on-chip busses are. Each Gx
core has 64K of L1 cache, 32K data and 32K instruction, along with a unified
8-way 256KB L2 cache. The cache is totally non-blocking, completely coherent,
and the cache subsystem can reorder requests to other caches or DRAM. On top
of this, the core supports cache pinning to keep often-used data or
instructions in cache. On the 100 core model, the Gx has 32MB of cache.
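
That 32MB presumably comes from simply summing the L1 and L2 across all 100
tiles, which is an assumption on our part, but the arithmetic lines up:

    /* Quick check on the 32MB figure: 100 tiles, each with 64KB of L1
     * (32K data + 32K instruction) and 256KB of L2. */
    #include <stdio.h>

    int main(void)
    {
        int tiles    = 100;
        int l1_kb    = 64;
        int l2_kb    = 256;
        int total_kb = tiles * (l1_kb + l2_kb);

        printf("Total on-chip cache: %d KB (~%d MB)\n",
               total_kb, total_kb / 1000);
        /* 100 * 320KB = 32,000KB, which is the 32MB Tilera quotes. */
        return 0;
    }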

Tile is the name Tilera uses for its basic unit of repetition. The 16 core Gx
has 16 tiles, the 64 core Gx has 64, and so on. A tile consists of a core,
the L1 and L2 caches, and something Tilera calls the Terabit Switch. More
than anything, this switch is the heart of the chip.

Tile diagram

A Tilera tile

Remember when we said that cramming 100 cores on a die is not a big problem,
but feeding them is? The Terabit Switch is how Tilera solves the problem, and
it is a rather unique solution. Instead of one off-core bus, there are five.
Each of them has a dedicated purpose, and that not only gives huge bandwidth,
it also goes a fair way towards minimizing contention. Cache traffic will
never be stepped on by user data, and so on.

The five networks are called QDN, RDN, FDN, IDN and UDN. In the last two
generations of Tilera chips, all of these networks were 32 bits wide, but on
the Gx, the widths vary to give each one more or less bandwidth depending on
their functions.

QDN is the reQuest Dynamic Network, and it is used for memory and cache
requests. QDN is 64 bits wide. RDN is the Response Dynamic Network, and it is
used to feed memory reads back to the cores. RDN is 112 bits wide, an odd
number, 64 + 48 from the look of it.

FDN is the widest at 128 bits, and it is used for cache to cache transfers
and cache coherency. Given the critical nature of cache transactions like
these, the width is no surprise. The last two, the IDN and UDN, are both 32
bits wide. IDN is the I/O Dynamic Network, and it passes data on and off the
chip. With a dedicated channel for off-chip transfers, you can see that
reaching theoretical numbers was a priority at Tilera.

The last network, the UDN, is the User Dynamic Network, basically the one
users get to send their own data around on. QDN, RDN, FDN and IDN are
basically housekeeping; they work in the background. If you want to send
things from point A to point B, you send them across the UDN.
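
What user-level messaging over the UDN might look like is sketched below. The
udn_send()/udn_recv() calls and the tile addressing are invented for
illustration; Tilera ships its own libraries for this, and this is not that
API:

    /* Hypothetical sketch of user-level messaging over the UDN. The
     * udn_send()/udn_recv() calls and the tile addressing scheme are
     * invented for illustration, not Tilera's actual library. */
    #include <stdint.h>

    typedef struct { int x, y; } tile_id_t;       /* position in the mesh */

    void udn_send(tile_id_t dst, const uint32_t *words, int nwords); /* assumed */
    int  udn_recv(uint32_t *words, int max_words);                   /* assumed */

    void producer(void)
    {
        uint32_t payload[4] = { 0xDEADBEEF, 1, 2, 3 };
        tile_id_t consumer_tile = { .x = 3, .y = 0 };
        udn_send(consumer_tile, payload, 4);  /* words travel point to point
                                                 over the UDN, no shared
                                                 memory round trip needed */
    }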

Although Tilera didn't explicitly state it, each hop from router to router
takes one cycle. This means that in a pathological case, corner core to
memory on the far corner, it could take 19 cycles to go from request to
memory, plus the memory round trip time, and then another 19 cycles to get
back. That is what you call a long time in computer speak. Even in an
'average' case, you have a 10 cycle latency, which is very long as well.
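
Those hop counts are easy to model. The sketch below assumes one cycle per
router hop, simple X-Y routing on the 10x10 mesh, and a memory controller
sitting one hop past the far corner, all of which are our assumptions rather
than Tilera's numbers:

    /* Back-of-the-envelope model of the hop latency numbers, assuming one
     * cycle per router hop and simple X-Y routing on the 10x10 mesh. The
     * memory controller is modeled as one extra hop past the far edge. */
    #include <stdio.h>
    #include <stdlib.h>

    int hops(int x0, int y0, int x1, int y1)
    {
        return abs(x1 - x0) + abs(y1 - y0);   /* Manhattan distance */
    }

    int main(void)
    {
        /* Worst case: tile (0,0) to a controller just past tile (9,9). */
        int worst = hops(0, 0, 9, 9) + 1;
        /* A middling case: tile (5,5) to the same corner controller. */
        int mid   = hops(5, 5, 9, 9) + 1;

        printf("worst case: %d cycles each way, middling: %d\n", worst, mid);
        /* Roughly the 19- and 10-cycle figures quoted above, before the
         * DRAM round trip itself is counted. */
        return 0;
    }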

To be fair, the Tilera architecture is not made to run general purpose code.
As it was described when the first generation came out, workloads are meant
to be chunked up, so a single tile does a function, then the data gets passed
to the next tile for more work, and so on and so forth. If your program has
20 steps, you use 20 tiles and pipeline the work.
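
In code, the model is one stage per tile, looping forever and handing its
output to the next tile over. The transport calls below are stand-ins for
whatever tile-to-tile mechanism (the UDN, say) real code would use:

    /* Conceptual sketch of the tile-per-stage pipeline model. Each tile
     * runs one stage in a loop and hands its output to the next tile.
     * The transport and work functions are invented placeholders. */
    typedef struct packet packet_t;

    packet_t *recv_from_prev_tile(void);             /* assumed transport */
    void      send_to_next_tile(packet_t *p);        /* assumed transport */
    void      do_stage_work(int stage, packet_t *p); /* one of the steps  */

    void tile_main(int my_stage)
    {
        for (;;) {
            packet_t *p = recv_from_prev_tile();
            do_stage_work(my_stage, p);   /* this tile's single function */
            send_to_next_tile(p);         /* pipeline the rest downstream */
        }
    }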

This solves many of the problems with variable latency and multi-hop traffic.
The other more elegant solution is the ability to section off chunks of the
chip into sub-units. There is a hypervisor that can partition each Gx chip
into programmable blocks.

Chunking tiles

Sub-sections of tiles

As you can see in the diagram above, each Gx is broken up into sub-chips in
software. You can give each process as much CPU power as it needs, and
arrange it so the output of one block feeds into the input of the next in a
single clock. This example has two Apache web server instances, an intrusion
prevention system (IPS), a secure sockets layer (SSL) stack, a network stack
and a few other processes running next to each other.

The Apache instances have their own memory controller, as do the IPS and the
SSL stack. The network stack is sitting on top of the memory controller for
decreased latency. Basically, the programmer can choose where to put each
process to minimize latency. It doesn't take much to figure out how to apply
these concepts to a database plus web server scenario, or a three-tiered
SAP-like workload.
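
A partition layout like the one in the diagram could be described with
something as simple as the table below. To be clear, the structure and field
names are invented for illustration and are not Tilera's actual hypervisor
configuration format:

    /* Invented illustration of hypervisor-level partitioning: rectangles
     * of tiles assigned to processes, some pinned next to a memory
     * controller. None of these names come from Tilera's tooling. */
    struct partition {
        const char *name;     /* workload running in this block      */
        int x, y, w, h;       /* rectangle of tiles it owns          */
        int memctrl;          /* nearest memory controller, -1 = any */
    };

    static const struct partition layout[] = {
        { "apache-0",  0, 0, 4, 5,  0 },
        { "apache-1",  4, 0, 4, 5,  1 },
        { "ips",       0, 5, 5, 5,  2 },
        { "ssl-stack", 5, 5, 3, 5,  3 },
        { "netstack",  8, 0, 2, 10, 3 },  /* parked on a controller for
                                             lower latency             */
    };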

Basically, Tilera allows you to explicitly place the data and compute
resources where, when and how you need them. The chunks are done at roughly
the same level as hardware VMs are in x86 CPUs, running below the level that
a process can affect. This creates hardware walls to segregate data
transfers, cache coherency traffic, and other tile to tile transfers. If done
correctly, it can minimize latency a lot in addition to keeping processes
from stepping on each other.

Now that you know how the cores work, talk, and are partitioned, what about
the 'uncore'? Talk about that starts with the memory controllers - four
DDR3-2133MHz banks on the 64 and 100 core Gx, two on the 16 and 36 core
models. For the keen-eyed out there, this means Tilera has two different
socket configurations, one for the 64 and 100 core chips, and another for
the 16 and 36 core chips.

DDR3-2133MHz memory is very fast, hugely fast in fact. The math says about
17GBps per controller. Basically, this chip has a lot of available bandwidth.
As you might imagine, the 16 and 36 core variants have only half the
controllers, and so half the bandwidth.
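
The 17GBps figure falls straight out of the transfer rate, assuming each
controller drives a 64-bit channel:

    /* Where the ~17GBps-per-controller figure comes from, assuming each
     * DDR3-2133 controller drives a 64-bit (8-byte) channel. */
    #include <stdio.h>

    int main(void)
    {
        double transfers_per_s  = 2133e6;  /* DDR3-2133 transfers/sec */
        double bytes_per_xfer   = 8.0;     /* 64-bit channel          */
        double per_ctrl = transfers_per_s * bytes_per_xfer / 1e9;

        printf("per controller: %.1f GB/s, four controllers: %.1f GB/s\n",
               per_ctrl, 4.0 * per_ctrl);  /* ~17.1 and ~68.3 GB/s    */
        return 0;
    }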

In addition, there is a generic controller for USB, UARTs, JTAG and I2C.
Given that Tilera chips are basically embedded parts, these are not likely to
be used for much more than booting and diagnostics.

On the core diagram above, there are two other blocks, the orange MiCA and
mPIPE accelerators. These are where the other parts of the Tilera Gx 'magic'
happen. MiCA stands for Multistream iMesh Crypto Accelerator, while mPIPE is
short for multicore Programmable Intelligent Packet Engine. If it isn't
blindingly obvious, the MiCA does the crypto and the mPIPE speeds up I/O.

The mPIPE does a lot of interesting things, all supposedly at wire speed. It
has a programmable packet classification engine, said to be usable at 80Gbps
or 120M packets per second. It can twiddle headers and do other evil things
that would make Comcast drool over the potential for 'network management'
extortion payments.

In addition, it can also load balance across the various I/O lanes, and
redirect tile to tile 'I/O' in a somewhat intelligent fashion. On top of
that, the mPIPE manages buffer sizes, queues, and other housekeeping to keep
latencies low. Think of it as a programmable housekeeping offload engine.

The most interesting bit is that the mPIPE can tag a packet with a 32 bit
header before it sends it onto the internal network. This is where the
programmable part shines. You can set up fields in the I/O packet itself to
pass along pre-decode information and other time-saving tidbits. Since I/O is
fully virtualizable, you could theoretically tag the packets with VM data, or
just about anything else a bored programmer can think of.
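
A tagged descriptor might look something like the sketch below. The field
layout is invented, not Tilera's real descriptor format, but it shows why the
32-bit tag is useful: the receiving tile can branch on it without touching
the payload:

    /* Invented sketch of a packet descriptor carrying the 32-bit tag the
     * mPIPE can prepend before handing a packet to a tile. The layout is
     * illustrative, not Tilera's actual descriptor format. */
    #include <stdint.h>

    struct gx_pkt_desc {
        uint32_t tag;       /* 32-bit classifier result: flow hash, VM id,
                               pre-decode hints, whatever software set up */
        uint16_t length;    /* payload length in bytes                    */
        uint16_t channel;   /* ingress port / queue it arrived on         */
        uint64_t buf_addr;  /* where the mPIPE DMA'd the payload          */
    };

    /* A receiving tile can branch on desc->tag without touching the
     * payload at all, which is the point of classifying at wire speed. */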

The MiCA engines, two on the 64/100 core parts and one on the 16/36 core
parts, are crypto offload engines. They can work either 'inline' or as
full-blown offload engines; that is up to the programmer. The MiCA can pull
data directly from caches or main memory without CPU overhead, basically fire
and forget.
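
The fire-and-forget flow would look roughly like the sketch below. The
descriptor fields and the mica_submit()/mica_poll() calls are invented to
show the shape of the interaction, not Tilera's actual driver API:

    /* Hypothetical fire-and-forget crypto offload flow. Descriptor fields
     * and mica_submit()/mica_poll() are invented for illustration. */
    #include <stdint.h>

    struct mica_req {
        uint64_t src;       /* data already sitting in cache or DRAM */
        uint64_t dst;       /* where the ciphertext should land      */
        uint32_t len;
        uint8_t  op;        /* e.g. AES encrypt, SHA-2 hash, ...     */
    };

    int mica_submit(const struct mica_req *req);  /* assumed: queue it  */
    int mica_poll(int handle);                    /* assumed: done yet? */

    void encrypt_async(uint64_t src, uint64_t dst, uint32_t len)
    {
        struct mica_req req = { .src = src, .dst = dst, .len = len, .op = 1 };
        int h = mica_submit(&req);     /* engine pulls the data itself,  */
        while (!mica_poll(h))          /* no per-byte CPU involvement    */
            ;                          /* real code would do useful work */
    }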

If you like acronyms, the MiCA on the Gx can support AES, 3DES, ARC4, Kasumi
and Snow for crypto, SHA-1, SHA-2, MD5, HMAC and AES-GMAC for hashes, RSA,
DSA, Diffie-Hellman, and Elliptic Curve for public key work, and it has a
true random number generator (RNG). WTF, LOL, ROFL and other netspeak can be
encrypted along with any other text that uses correct grammar. RLY.

Tilera claims that the MiCA engine can do wire speed 40Gbps crypto with full
duplex on the 100 core Gx, and 1024b key RSA at 50K keys per second on the
100 core, 20K keys per second for the 36 core. Not bad at all. In addition,
the MiCA supports a hardware compression engine that uses the tried and true
Deflate algorithm.

The last piece of the puzzle is something that Tilera calls external
acceleration interfaces. This could be as simple as plugging in a PCIe card,
but that lacks elegance. The interesting part is a field programmable gate
array (FPGA) interface. You can take up to 8 lanes of PCIe and connect the
FPGA to the serializer/deserializer (SerDes) unit to enable basically direct
and low latency 32Gbps transfers. Direct transfers to cache and multiple
contexts are supported, meaning you can do quite a bit with an FPGA and a
Tilera Gx chip.
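
The 32Gbps figure lines up with eight Gen2-class 5GT/s SerDes lanes after
8b/10b encoding, though that encoding assumption is ours, not Tilera's:

    /* Where 32Gbps plausibly comes from, assuming Gen2-class 5GT/s SerDes
     * lanes with 8b/10b encoding (an assumption, not stated by Tilera). */
    #include <stdio.h>

    int main(void)
    {
        double lane_gt_s = 5.0;          /* raw 5 GT/s per lane */
        double encoding  = 8.0 / 10.0;   /* 8b/10b overhead     */
        double lanes     = 8.0;

        printf("effective: %.0f Gbps\n", lane_gt_s * encoding * lanes);
        return 0;
    }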

In the end, you have a monster chip for I/O and packet processing. It doesn't
do single-threaded applications all that fast, but it really isn't meant to.
The chip itself is not out yet, nor is there even silicon yet. The first
version out will be the 36 core Gx in Q4 of 2010, followed by the 16 core
later in Q4 or possibly Q1 of 2011. These both share the same socket
configuration and a 35*35mm package.

In Q1 of 2011, the 100 core chip will come out on a new socket and in a
45*45mm package. A bit after that, the 64 core will hit the market. Power
ranges from 10W for the 16 core to 55W for the 100 core, but you can get
power optimized variants that will only suck 35W. Given the programmability
of the parts, power use is likely more dependent on the programs running on
them.

The last bit of information is clock speeds. The 64 and 100 core models will
come in versions that run at 1.25GHz and 1.5GHz, not bad considering how much
there is to synchronize and keep going. The 36 core models will come in
1.0GHz, 1.25GHz and 1.5GHz versions, and the 16 core models will only come in
1.0GHz or 1.25GHz versions. Given the core count, internal interconnections,
memory and I/O capabilities, Tilera will pack a lot of power into these small
packages. S|A
