[Beowulf] A look at the 100-core Tilera Gx
Gerry Creager gerry.creager at tamu.edu
Wed Nov 4 06:03:28 PST 2009
I think it was the recent IEEE Spectrum, where they talk about using the Tilera 100-core chips for HPC, tuned to a specific problem using FPGA for optimizing the chips to the problem. The argument is to use a lower-power system with huge numbers of cores and efficient on-chip switching, to replace the common x86(_64) architecture we've come to know and hate for its energy consumption and heat generation.

Personal take: conventional systems will win this battle (ask SiCortex: <sigh> a great idea overwhelmed by investors who couldn't see its longer-term benefits), but we just might see changes to slower but more efficient cores. Via Epia-10k comes to mind, as do the Atom and several other variants. A little slower switching fabric (gigabit), with some changes to the core thinking of integration designers, will be required, but I think we could make that 20kcore Atom system using gigabit work pretty well compared to a 4k core Nehalem with QDR.

The big thing is reworking our thinking: It costs a LOT (we've said this all before) to create the power and cooling infrastructure for serious HPC, and I'll posit now that "serious" requires at least 4k x86_64 cores in today's logic. If the cost of powering and cooling all this stuff is considered, it's a huge expense, but then, a lot of us are at academic institutions, and don't have to consider infrastructure... or didn't until recently.

Example: I have no place to expand our HPC, since we've maxed out power and cooling in the machine room we're currently in. And, in the only reasonable space I can build out to expand into, power's $90K and cooling another $100K to expand, allowing an additional 20 racks. Of x86_64 and QDR. In fact, while I'll gain 20 racks of space, I'm not sure I can get 20 racks of cooling in place for that.
I'm reasonably sure I can power the stuff for the $90K figure and even add sufficient generator capacity to keep critical elements running (cooling at a reduced level; HPC generally has no requirement for running during a power failure) until a clean shutdown or power's restored.

I like what I've read on the Tilera. I think it's got some potential, but I think it's time we consider taking our breed of HPC toward the Maker side of things, and begin hacking minimalist motherboards, adopting low-power devices, and generally reinventing the hardware stack as we knew it.

gerry

Eugen Leitl wrote:
> http://www.semiaccurate.com/2009/10/29/look-100-core-tilera-gx/
>
> A look at the 100-core Tilera Gx
>
> It's all about the network(s)
>
> by Charlie Demerjian
>
> October 29, 2009
>
> TILERA IS CLAIMING to have the first commercial CPU to reach 100 cores, and while this is true, the real interesting technology is in the interconnects. The overall chip is quite a marvel, and it is unlike any mainstream CPU you have ever heard of.
>
> Making a lot of cores on a chip isn't very hard. Larrabee for example has 32 Pentium (P54) cores, heavily modified, as the basis of the GPU. If Intel wanted to, it could put hundreds of cores on a die, that part is actually quite easy. Keeping those cores fed is the most important problem of modern chipmaking, and that part is not easy.
>
> Large caches, wide memory busses, ring busses on chip, stacking, and optical interfaces all are attempts to feed the beast. Everyone thought Intel's Polaris, also known as the 80 core, 1 TeraFLOPS part from a few years ago, was about packing cores onto a die. It wasn't, it was a test of routing algorithms and structures. Routing is where the action is now, packing cores in is not a big deal.
>
> Routing is where Tilera shines. It has put a great deal of thought into getting data from core to core with minimal latency and problems.
> Its rather unique approach involves five different interconnect networks, programmable partitioning, accelerators, and simply tons of I/O. Together, these allow Tilera's third generation Tile-Gx CPUs to scale from 16 to 100 cores without choking on congestion. They may not have the same single-threaded performance of a Nehalem or Shanghai core, but they make up for it with volume.
>
> [Figure: 100 core diagram. Caption: Tilera 100 core chip]
>
> The basic structure is a square array of small cores, 4x4, 6x6, 8x8 or 10x10, each connected via five (5) on-chip networks, and flanked by some very interesting accelerators. The cores themselves are a proprietary 32-bit ISA in the first two generations of Tilera chips, and in the Gx, it is extended to 64-bit. There are 75 new instructions in the Gx, 20 of which are SIMD, and the memory controller now sees 64 bits as well.
>
> In previous generations, there was no floating-point (FP) hardware in Tilera products. The company strongly recommended against using FP code because it had to be emulated taking hundreds or thousands of cycles. With the new Gx series chips, FP code is still frowned upon, but there is some FP hardware to catch the odd instruction without a huge speed hit. The 100 core part can do 50 GigaFLOPS of FP which may sound like a large number, but that is only about 1/50th of what an ATI Cypress HD5870 chip can do.
>
> The majority of the new instructions are aimed at what the Tilera chips do best, integer calculations. Things like shuffle and DSP-like multiply-and-accumulate (MAC) functions, including a quad MAC unit, are where these new chips shine. Basically, the Gx moves information around very quickly while twiddling bits here and there with integer functions.
>
> While the cores might not be overly complex, the on-chip busses are. Each Gx core has 64K of L1 cache, 32K data and 32K instruction, along with a unified 8-way 256KB L2 cache.
> The cache is totally non-blocking, completely coherent, and the cache subsystem can reorder requests to other caches or DRAM. On top of this, the core supports cache pinning to keep often used data or instructions in cache. On the 100 core model, the Gx has 32MB of cache.
>
> Tiles are the name Tilera uses for a basic unit of repetition. The 16 core Gx has 16 tiles, the 64 core Gx has 64, etc. A tile consists of a core, the L1 and L2 caches, and something Tilera calls the Terabit Switch. More than anything, this switch is the heart of the chip.
>
> [Figure: Tile diagram. Caption: A Tilera tile]
>
> Remember when we said that cramming 100 cores on a die is not a big problem, but feeding them is? The Terabit Switch is how Tilera solves the problem, and it is a rather unique solution. Instead of one off-core bus, there are five. Each of them has a dedicated purpose, and that not only gives huge bandwidth, it also goes a fair way towards minimizing contention. Cache traffic will never be stepped on by user data, and so on.
>
> The five networks are called QDN, RDN, FDN, IDN and UDN. In the last two generations of Tilera chips, all of these networks were 32 bits wide, but on the Gx, the widths vary to give each one more or less bandwidth depending on their functions.
>
> QDN is called the reQuest Dynamic Network, and it is used for memory and cache. QDN is 64 bits wide. RDN is Response Dynamic Network, and it is used to feed memory reads back to the chips. RDN is 112 bits wide, an odd number, 64 + 48 from the look of it.
>
> FDN is the widest at 128 bits, and it is used for cache to cache transfers and cache coherency. Given the critical nature of cache transactions like this, the width is no surprise. The last two, IDN and UDN, are both 32 bits wide. IDN is I/O Dynamic Network, and passes data on and off the chip. With a dedicated channel for off-chip transfers, you can see that reaching theoretical numbers was a priority at Tilera.
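
As a rough sketch of what those widths mean in bandwidth terms, the snippet below multiplies each network's Gx width by the 1.5GHz top clock the article quotes for the 64 and 100 core parts. It assumes one full-width transfer per link per cycle in each direction, which the article does not state, so treat the results as upper bounds and not as Tilera figures. The names `NETWORK_WIDTH_BITS`, `link_bandwidth_gbytes` and `all_link_bandwidths` are made up for illustration.

```python
# Widths (bits) of the five Tile-Gx on-chip networks, as listed in the article.
NETWORK_WIDTH_BITS = {"QDN": 64, "RDN": 112, "FDN": 128, "IDN": 32, "UDN": 32}

# Assumption: top clock from the article (1.5GHz) and one full-width
# transfer per link per cycle. Neither is a confirmed link-level spec.
CLOCK_HZ = 1.5e9


def link_bandwidth_gbytes(width_bits, clock_hz=CLOCK_HZ):
    """Peak one-direction bandwidth of a single link in GB/s (1 GB = 1e9 bytes)."""
    return width_bits / 8 * clock_hz / 1e9


def all_link_bandwidths(clock_hz=CLOCK_HZ):
    """Map each network name to its peak per-link bandwidth in GB/s."""
    return {name: link_bandwidth_gbytes(bits, clock_hz)
            for name, bits in NETWORK_WIDTH_BITS.items()}


if __name__ == "__main__":
    for name, gbytes in all_link_bandwidths().items():
        print(f"{name}: {gbytes:.1f} GB/s per link, per direction")
    total_bits = sum(NETWORK_WIDTH_BITS.values())
    print(f"all five: {link_bandwidth_gbytes(total_bits):.1f} GB/s")
```

Under these assumptions the FDN comes out at 24 GB/s per link and the two 32-bit networks at 6 GB/s each; a lower-clocked part scales the numbers down proportionally.
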
>
> The last network UDN is for User Dynamic Network, basically the one users get to send stuff around on. QDN, RDN, FDN and IDN are basically housekeeping, they work in the background. If you want to send things from point A to point B, you send it across the UDN.
>
> Although Tilera didn't explicitly state it, each hop from router to router takes one cycle. This means that in a pathological case, corner core to memory on the far corner, it could take 19 cycles to go from request to memory, plus the memory round trip time, and then another 19 cycles to get back. That is what you call a long time in computer speak. Even in an 'average' case, you have a 10 cycle latency, which is very long as well.
>
> To be fair, the Tilera architecture is not made to run general purpose code. As it was described when the first generation came out, workloads are meant to be chunked up, so a single tile does a function, then the data gets passed to the next tile for more work, and so on and so forth. If your program has 20 steps, you use 20 tiles and pipeline the work.
>
> This solves many of the problems with variable latency and multi-hop traffic. The other more elegant solution is the ability to section off chunks of the chip into sub-units. There is a hypervisor that can partition each Gx chip into programmable blocks.
>
> [Figure: Chunking tiles. Caption: Sub-sections of tiles]
>
> As you can see in the diagram above, each Gx is broken up into sub-chips in software. You can give each process as much CPU power as it needs, and arrange it so the output of one block feeds into the input of the next in a single clock. This example has two Apache web server instances, an intrusion prevention system (IPS), a secure sockets layer (SSL) stack, a network stack and a few other processes running next to each other.
>
> The Apache instances have their own memory controller, as do the IPS and the SSL stack.
> The network stack is sitting on top of the memory controller for decreased latency. Basically, the programmer can choose where to put each process to minimize latency. It doesn't take much to figure out how to apply these concepts to a database plus web server scenario, or a three-tiered SAP-like workload.
>
> Basically, Tilera allows you to explicitly place the data and compute resources where, when and how you need them. The chunks are done at roughly the same level as hardware VMs are in x86 CPUs, running below the level that a process can affect. This creates hardware walls to segregate data transfers, cache coherency traffic, and other tile to tile transfers. If done correctly, it can minimize latency a lot in addition to keeping processes from stepping on each other.
>
> Now that you know how the cores work, talk, and are partitioned, what about the 'uncore'? Talk about that starts with the memory controllers - four DDR3-2133MHz banks on the 64 and 100 core Gx, two on the 16 and 36 core models. For the keen eyed out there, this means Tilera has two different socket configurations, one for the 64 and 100 core chips, and another one for the 16 and 36 core chips.
>
> DDR3-2133MHz memory is very fast, hugely fast in fact. The math says 17GBps per controller. Basically, this chip has a lot of available bandwidth. As you might imagine, on the 16 and 36 core variants, there are only half the controllers, so half the bandwidth.
>
> In addition, you have a generic controller for USB, UARTs, JTAG and I2C controllers. Given that Tilera chips are basically embedded, these are not likely to be used for much more than booting and diagnostics.
>
> On the core diagram above, there are two other blocks, the orange MiCA and mPIPE accelerators. These are where the other parts of the Tilera Gx 'magic' happen. MiCA stands for Multistream iMesh Crypto Accelerator, while mPIPE is short for multicore Programmable Intelligent Packet Engine.
> If it isn't blindingly obvious, the MiCA does the crypto and the mPIPE speeds up I/O.
>
> The mPIPE does a lot of interesting things, all supposedly at wire speed. It has a programmable packet classification engine, said to be usable at 80Gbps or 120M packets per second. It can twiddle headers and do other evil things that would make Comcast drool with the potential for 'network management' extortion payments.
>
> In addition, it can also load balance across the various I/O lanes, and redirect tile to tile 'I/O' in a somewhat intelligent fashion. On top of that, the mPIPE manages buffer sizes, queues, and other housekeeping to keep latencies low. Think of it as a programmable housekeeping offload engine.
>
> The most interesting bit is that the mPIPE can tag a packet with a 32 bit header before it sends it onto the internal network. This is where the programmable part shines. You can set up fields in the I/O packet itself to pass along pre-decode information and other time-saving tidbits. Since I/O is fully virtualizable, you could theoretically tag the packets with VM data, or just about anything else a bored programmer can think of.
>
> The MiCA engines, two on the 64/100 core, one on 16/36 cores, are crypto offload engines. They can work either 'inline' or as full blown offload engines, that is up to the programmer. The MiCA can pull data directly from caches or main memory without CPU overhead, basically fire and forget.
>
> If you like acronyms, the MiCA on the Gx can support AES, 3DES, ARC4, Kasumi and Snow for crypto, SHA-1, SHA-2, MD5, HMAC and AES-GMAC for hashes, RSA, DSA, Diffie-Hellman, and Elliptic Curve for public key work, and it has a true random number generator (RNG). WTF, LOL, ROFL and other netspeak can be encrypted along with any other text that uses correct grammar. RLY.
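
The 80Gbps and 120M packets per second figures quoted above for the mPIPE classifier line up with a common back-of-envelope check: minimum-size Ethernet frames sent back to back at line rate. The sketch below assumes 64-byte frames plus the usual 20 bytes of preamble and inter-frame gap per frame. The article does not say how Tilera derived its number, so this is only a plausibility check, and the function name `max_packets_per_second` is illustrative.

```python
# Standard Ethernet framing constants.
MIN_FRAME_BYTES = 64       # minimum Ethernet frame, including FCS
PREAMBLE_SFD_BYTES = 8     # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12  # minimum idle time between frames, in byte times


def max_packets_per_second(line_rate_bps, frame_bytes=MIN_FRAME_BYTES):
    """Packets per second when frames of frame_bytes are sent back to back."""
    wire_bytes = frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES
    return line_rate_bps / (wire_bytes * 8)


if __name__ == "__main__":
    pps = max_packets_per_second(80e9)
    print(f"{pps / 1e6:.1f} Mpps at 80Gbps with 64-byte frames")
```

With these assumptions the result is roughly 119M packets per second, close to the 120M quoted. Larger frames give a proportionally lower packet rate at the same line rate.
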
>
> Tilera claims that the MiCA engine can do wire speed 40Gbps crypto with full duplex on the 100 core Gx, and 1024b key RSA at 50K keys per second on the 100 core, 20K keys per second for the 36 core. Not bad at all. In addition, the MiCA supports a hardware compression engine that uses the tried and true Deflate algorithm.
>
> The last piece of the puzzle is something that Tilera calls external acceleration interfaces. This could be as simple as plugging in a PCIe card, but that lacks elegance. The interesting part is a field programmable gate array (FPGA) interface. You can take up to 8 lanes of PCIe and connect the FPGA to the serial deserial unit (SerDes) to enable basically direct and low latency 32Gbps transfers. Direct transfers to cache and multiple contexts are supported, meaning you can do quite a bit with an FPGA and a Tilera-Gx chip.
>
> In the end, you have a monster chip for I/O and packet processing. It doesn't do single-threaded applications all that fast, but it really isn't meant to. The chip itself is not out yet, nor is there even silicon yet. The first version out will be the 36 core Gx in Q4 of 2010, followed by the 16 core later in Q4 or possibly Q1 of 2011. These both share the same socket configuration and a 35*35mm package.
>
> In Q1 of 2011, the 100 core chip will come out on a new socket and in a 45*45mm package. A bit after that, the 64 core will hit the market. Power ranges from 10W for the 16 core to 55W for the 100 core, but you can get power optimized variants that will only suck 35W. Given the programmability of the parts, power use is likely more dependent on the programs running on it.
>
> The last bit of information is clock speeds. The 64 and 100 core models will come in versions that run at 1.25GHz and 1.5GHz, not bad considering how much there is to synchronize and keep going.
> The 36 core models will come in 1.0GHz, 1.25GHz and 1.5GHz versions, and the 16 core models will only come in 1.0GHz or 1.25GHz versions. Given the core count, internal interconnections, memory and I/O capabilities, Tilera will pack a lot of power into these small packages. S|A
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
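
For anyone wanting to redo the memory arithmetic from the article, here is a minimal sketch. It assumes each controller drives a standard 64-bit (8-byte) DDR3 channel, which the article does not state explicitly, and it gives theoretical peak numbers only. The function name `peak_memory_bandwidth_gbytes` is illustrative.

```python
# DDR3-2133 performs 2133 million transfers per second per data pin.
TRANSFERS_PER_SECOND = 2133e6

# Assumption: a standard 64-bit DDR3 channel per controller (8 bytes per transfer).
CHANNEL_BYTES = 8


def peak_memory_bandwidth_gbytes(controllers):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes) across controllers."""
    return controllers * TRANSFERS_PER_SECOND * CHANNEL_BYTES / 1e9


if __name__ == "__main__":
    # Controller counts per model as given in the article.
    for cores, controllers in ((16, 2), (36, 2), (64, 4), (100, 4)):
        print(f"{cores}-core Gx: {peak_memory_bandwidth_gbytes(controllers):.1f} GB/s")
```

One controller works out to about 17 GB/s, matching the article's figure, so the four-controller parts would peak near 68 GB/s and the two-controller parts near 34 GB/s under this assumption.
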
