[Beowulf] Multicore Is Bad News For Supercomputers

Michael Brown spambox at emboss.co.nz
Fri Dec 5 12:36:44 PST 2008


Mark Hahn wrote:
>> (Well, duh).
>
> yeah - the point seems to be that we (still) need to scale memory
> along with core count.  not just memory bandwidth but also concurrency
> (number of banks), though "ieee spectrum online for tech insiders"
> doesn't get into that kind of depth :(

I think this needs to be elaborated a little for those who don't know the 
layout of SDRAM ...

A typical chip that may be used in a 4 GB DIMM would be a 2 Gbit SDRAM chip, 
of which there would be 16 (total 32 Gbits = 4 Gbytes). Each chip 
contributes 8 bits towards the 64-bit DIMM interface, so there are two 
"ranks", each composed of 8 chips. Each rank operates independently of the 
other, but both share (and are limited by) the bandwidth of the memory 
channel. From here I'm going to be using the Micron MT47H128M16 as the SDRAM 
chip, because I have the datasheet, though other chips are probably very 
similar.
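To make that arithmetic concrete, here's a quick sanity check in Python, just re-deriving the figures quoted above (nothing here comes from a datasheet):

```python
# Re-derive the DIMM organization described above.
CHIP_BITS = 2 * 2**30   # 2 Gbit per SDRAM chip
CHIPS_PER_DIMM = 16
DIMM_WIDTH = 64         # bits, the DIMM data interface
CHIP_WIDTH = 8          # bits contributed by each chip

total_bytes = CHIP_BITS * CHIPS_PER_DIMM // 8
chips_per_rank = DIMM_WIDTH // CHIP_WIDTH
ranks = CHIPS_PER_DIMM // chips_per_rank

print(total_bytes // 2**30, "GB")    # 4 GB total
print(chips_per_rank, "chips/rank")  # 8 chips per rank
print(ranks, "ranks")                # 2 ranks
```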

Internally, each SDRAM chip is made up of 8 banks of 32 K * 8 Kbit memory 
arrays. Each bank can be controlled separately but shares the DIMM 
bandwidth, much like each rank does. Before accessing a particular memory 
cell, the whole 8 Kbit "row" needs to be activated. Only one row can be 
active per bank at any point in time. Once the memory controller is done 
with a particular row, it needs to be "precharged", which basically equates 
to writing it back into the main array. Activating and precharging are 
relatively expensive operations - precharging one row and activating another 
takes at least 11 cycles (tRTP + tRP) and 7 cycles (tRCD) respectively at 
top speed (DDR2-1066) for the Micron chips mentioned, during which no data 
can be read from or written to the bank. Precharging takes another 4 cycles 
if you've just written to the bank.
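Spelling out those timing figures (Python; the numbers are the DDR2-1066 values quoted above for the Micron part, plus the "4-ish" cycle CAS latency that shows up later in this post):

```python
# Row-change penalty arithmetic using the figures quoted above.
tRTP_plus_tRP = 11  # precharge the open row (cycles)
tRCD = 7            # activate (open) the new row (cycles)
CAS = 4             # column access latency, "4-ish" (cycles)

row_change = tRTP_plus_tRP + tRCD
print(row_change)        # 18 cycles before the bank is usable again
print(row_change + CAS)  # 22 cycles to the first data of the new row
```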

The second thing to know is that processors operate in cacheline sized 
blocks. Current x86 cache lines are 64 bytes, IIRC. In a dual-channel system 
with channel interleaving, odd-numbered cachelines come from one channel, 
and even-numbered cachelines from the other. So each cacheline fill requires 
8 bytes read per chip (which fits in nicely with the standard burst length 
of 8, since each read is 8 bits), coming out to 128 cachelines per row. Like 
channel interleaving, bank interleaving is also used. So:
[] Cacheline 0 comes from channel 0, bank 0
[] Cacheline 1 comes from channel 1, bank 0
[] Cacheline 2 comes from channel 0, bank 1
[] Cacheline 3 comes from channel 1, bank 1
:
:
[] Cacheline 14 comes from channel 0, bank 7
[] Cacheline 15 comes from channel 1, bank 7
So this pattern repeats every 1 KB, and every 128 KB a new row needs to be 
opened on each bank. IIRC, rank interleaving is done on AMD quad-core 
processors, but not the older dual-core processors nor Intel's discrete 
northbridges. I'm not sure about Nehalem.
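For the curious, that mapping can be sketched in a few lines of Python. This is a simplification (real controllers shuffle address bits in fancier ways), and the constants are just the ones assumed in this post: 64-byte cachelines, 2 channels, 8 banks, and 128 cachelines per row per bank:

```python
# Channel + bank interleaved mapping, as described above.
CACHELINE = 64
CHANNELS = 2
BANKS = 8
LINES_PER_ROW = 128  # cachelines per row, per bank, per channel

def interleaved_map(addr):
    """Map a physical address to (channel, bank, row)."""
    line = addr // CACHELINE
    channel = line % CHANNELS                         # alternate channels
    bank = (line // CHANNELS) % BANKS                 # then alternate banks
    row = line // (CHANNELS * BANKS * LINES_PER_ROW)  # new row every 128 KB
    return channel, bank, row

print(interleaved_map(0 * 64))    # cacheline 0  -> (0, 0, 0)
print(interleaved_map(3 * 64))    # cacheline 3  -> (1, 1, 0)
print(interleaved_map(14 * 64))   # cacheline 14 -> (0, 7, 0)
print(interleaved_map(128 * 1024))  # row 1: a new row every 128 KB
```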

This is all fine and dandy on a single-core system. Bank interleaving 
keeps the channel busy by reading from another bank while one bank is being 
activated or precharged. With a good prefetcher, you can hit close to 100% 
utilization of the channel. However, it can cause problems on a multi-core 
system. Say you have two cores, each scanning through separate 1 MB 
blocks of memory. Each core is demanding a different row from the same bank, 
so the memory controller has to keep on changing rows. This may not appear 
to be an issue at first glance - after all, we have 128 cycles between each 
CPU hitting a particular bank (8 bursts * 8 cycles per burst * 2 processors 
sharing bandwidth), so we've got 64 cycles between row changes. That's over 
twice what we need (unless we're using 1 GB or smaller DIMMs, which only 
have 4 banks, so things become tight).
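Spelling out that back-of-envelope calculation (Python, using the 8-cycles-per-burst figure from above):

```python
# How often does each bank see a row change with two cores streaming
# through different rows of the same banks?
BURST_CYCLES = 8  # cycles per 8-beat burst (the figure used above)
BANKS = 8
CORES = 2

# Cycles between one core's successive hits on the same bank:
revisit = BANKS * BURST_CYCLES * CORES
# The bank alternates between the two cores' rows, so it changes row
# twice per revisit interval:
per_row_change = revisit // CORES

print(revisit)         # 128 cycles between each core's hits on a bank
print(per_row_change)  # 64 cycles between row changes
```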

The killer though is latency - instead of 4-ish cycles CAS delay per read, 
we're now looking at 22 for a precharge + activate + CAS. In a streaming 
situation, this doesn't hurt too much as a good prefetcher would already be 
indicating it needs the next cacheline. But if you've got access patterns 
that aren't extremely prefetcher-friendly, you're going to suffer.

Simply cranking up the number of banks doesn't help this. You've still got 
thrashing, you're just thrashing more banks. Turning up the cacheline size 
can help, as you transfer more data per stall. The extreme solution is to 
turn off bank interleaving. Our memory layout now looks like:
[] Cacheline 0 comes from channel 0, bank 0, row 0, offset 0 bits
[] Cacheline 1 comes from channel 1, bank 0, row 0, offset 0 bits
[] Cacheline 2 comes from channel 0, bank 0, row 0, offset 64 bits
[] Cacheline 3 comes from channel 1, bank 0, row 0, offset 64 bits
:
:
[] Cacheline 254 comes from channel 0, bank 0, row 0, offset 8 K - 64 bits
[] Cacheline 255 comes from channel 1, bank 0, row 0, offset 8 K - 64 bits
[] Cacheline 256 comes from channel 0, bank 0, row 1, offset 0 bits
[] Cacheline 257 comes from channel 1, bank 0, row 1, offset 0 bits
So a new row every 16 KB, and a new bank every 512 MB (and a new rank every 
4 GB).
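The non-interleaved layout sketched the same way (again a simplification, with the constants assumed in this post: 64-byte lines, 2 channels, 8 banks, 32 K rows of 8 KB per bank):

```python
# Channel-only interleaving: bank and row come from high address bits.
CACHELINE = 64
CHANNELS = 2
LINES_PER_ROW = 128       # cachelines per row, per channel
ROWS_PER_BANK = 32 * 1024

def linear_map(addr):
    """Map a physical address to (channel, bank, row), no bank interleave."""
    line = addr // CACHELINE
    channel = line % CHANNELS       # channels still alternate per cacheline
    col = line // CHANNELS
    row = (col // LINES_PER_ROW) % ROWS_PER_BANK
    bank = col // (LINES_PER_ROW * ROWS_PER_BANK)
    return channel, bank, row

print(linear_map(0))            # cacheline 0   -> (0, 0, 0)
print(linear_map(256 * 64))     # cacheline 256 -> new row at 16 KB
print(linear_map(512 * 2**20))  # new bank at 512 MB
```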

For a single core, this generally doesn't have a big effect, since the 18 
cycle precharge+activate delay can often be hidden by a good prefetcher, and 
in any case only comes around every 16 KB (as opposed to every 128 KB with 
bank interleaving - a bit more frequent, but for large memory blocks it's a 
wash). However, this is a big killer for multicore - if you
have two cores walking through the same 512 MB area, they'll be thrashing 
the same bank. Not only does latency suffer, but bandwidth as well since the 
other 7 banks can't be used to cover up the wasted time. Every 8 cycles of 
reading will require 18 cycles of sitting around waiting for the bank, 
dropping bandwidth by about 70%.
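That "about 70%" figure, spelled out (Python, same cycle counts as above):

```python
# Worst-case thrashing: every burst pays a full row change because the
# two cores keep evicting each other's row in the one shared bank.
BURST = 8        # cycles of useful data transfer per cacheline
ROW_CHANGE = 18  # precharge + activate (cycles)

utilization = BURST / (BURST + ROW_CHANGE)
print(round(utilization, 2))      # ~0.31 of peak bandwidth
print(round(1 - utilization, 2))  # ~0.69, i.e. "about 70%" lost
```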

However, with proper OS support this can be a bit of a win. By associating 
banks (512 MB memory blocks) to cores in the standard NUMA way, each core 
can be operating out of its own bank. There's no bank thrashing at all, 
which allows much looser requirements on activation and precharge, which in 
turn can allow higher speeds. With channel interleaving, we can have up to 8 
cores/threads operating in this way. With independent channels (a la 
Barcelona) we can do 16. Of course, this isn't ideal either. A row change 
will stall the associated CPU and can't be hidden, so ideally we want at 
least 2 banks per CPU, interleaved. Also, shared memory will be hurt under 
this scheme (bandwidth and latency) since it will experience bank thrashing 
and will only have 2 banks. To cover the activate and precharge times, we 
need at least 4 banks, so for a quad core CPU we need a total of 16 memory 
banks in the system, partly interleaved. 8 banks per core can improve 
performance further with certain access patterns. Also, to keep good 
single-core performance, we'll need to use both channels. In this case, 
4-way bank interleaving per channel (so two sets of 4-way interleaves), with 
channel interleaving and no rank interleaving would work, though again 8-way 
bank interleaving would be better if there's enough to go around.

This setup is electronically obtainable in current systems, if you use two 
dual-rank DIMMs per channel and no rank interleaving. In this case, you have 
8-way bank interleaving, with channel interleaving and with the 4 ranks in 
contiguous memory blocks. With AMD's Barcelona, you can get away with a 
single dual-rank DIMM per channel if you run the two channels independently 
(though in this case single-threaded performance is compromised, because 
each core will tend to only access memory on a single controller). An 
8-thread system like Nehalem + hyperthreading would ideally like 64 banks. 
Because of Nehalem's wonky memory controller (seriously, who was the guy in 
charge who settled on three channels? I can imagine the joy of the memory 
controller engineers when they found out they'd have to implement a 
divide-by-three in a critical path) it'd be a little more difficult to get 
working there, though there are still enough banks to go around (12 banks per 
thread).
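As a sanity check on that "12 banks per thread" figure (Python; the two-dual-rank-DIMMs-per-channel configuration is my assumption about what's being counted here):

```python
# Bank count for a three-channel Nehalem box, assuming two dual-rank
# DIMMs per channel and 8 banks per rank (DDR3, like DDR2 2 Gbit parts).
channels, dimms_per_channel, ranks_per_dimm, banks_per_rank = 3, 2, 2, 8
threads = 8  # quad-core + hyperthreading

total_banks = channels * dimms_per_channel * ranks_per_dimm * banks_per_rank
print(total_banks)            # 96 banks in the system
print(total_banks // threads) # 12 banks per thread
```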

However, I'm not aware of any OSes that support this quasi-NUMA. I'm guessing 
it could be hacked into Linux without too much trouble, given that real NUMA 
support is already there. It's something I've been meaning to look into for 
a while, but I've never had the time to really get my hands dirty trying to 
figure out Linux's NUMA architecture.


Cheers,
Michael 



