[Beowulf] bizarre scaling behavior on a Nehalem

Tue Aug 11 16:07:40 PDT 2009

On Tue, Aug 11, 2009 at 5:57 PM, Bruno Coutinho<coutinho at dcc.ufmg.br> wrote:
> Nehalem and Barcelona have the following cache architecture:
>
> L1 cache: 64 KB (32 KB data, 32 KB instruction), per core
> L2 cache: Barcelona: 512 KB, Nehalem: 256 KB, per core
> L3 cache: Barcelona: 2 MB, Nehalem: 8 MB, shared among all cores.
>
>
> Both in Barcelona and Nehalem, the "uncore" (everything outside a core, like
> L3 and memory controllers) runs at lower speed than the cores and all cores
> communicate through L3, so it must handle some coherence signals too.
> This makes it impossible for L3 to feed all cores at full speed if the L2
> caches have big miss ratios.
>
> So, what is happening with your program is something like:
>
> The working set fits in Barcelona's 512 KB L2 cache, so it has a 10% miss rate,
> but it doesn't fit in Nehalem's 256 KB L2 cache, so it has a 50% miss rate.
> So on Nehalem the shared L3 cache has to handle many more requests from all
> cores than on Barcelona, becoming a big bottleneck.
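
To put rough numbers on the scenario quoted above, here is a minimal sketch in Python. The cache sizes and the 10% / 50% miss rates are the ones from Bruno's message (he offers the miss rates as "something like", not as measurements). The 400 KB working set, the 8-core count, the access count, and the function names are all made up for illustration. Comparing a per-core working set against the full L3 size is also a simplification, since L3 is shared among all cores.

```python
# Cache sizes in KB, taken from the figures quoted above.
# L2 is per core; L3 is shared among all cores.
CACHES_KB = {
    "Barcelona": {"L2": 512, "L3": 2048},
    "Nehalem": {"L2": 256, "L3": 8192},
}


def smallest_fitting_level(chip, working_set_kb):
    """Return the smallest cache level that can hold the working set.

    Simplification: ignores L1 and treats the shared L3 as if one core
    had it to itself.
    """
    sizes = CACHES_KB[chip]
    if working_set_kb <= sizes["L2"]:
        return "L2"
    if working_set_kb <= sizes["L3"]:
        return "L3"
    return "DRAM"


def l3_requests(cores, l2_accesses_per_core, l2_miss_rate):
    """Every L2 miss becomes an L3 request, summed over all cores."""
    return cores * l2_accesses_per_core * l2_miss_rate


# Hypothetical working set of 400 KB per core (not a measured value).
for chip in CACHES_KB:
    print(chip, smallest_fitting_level(chip, 400))

# Illustrative miss rates from the quoted text: 10% vs 50%.
# Core count and access count are assumptions.
barcelona = l3_requests(cores=8, l2_accesses_per_core=1_000_000, l2_miss_rate=0.10)
nehalem = l3_requests(cores=8, l2_accesses_per_core=1_000_000, l2_miss_rate=0.50)
print("L3 request ratio (Nehalem / Barcelona):", nehalem / barcelona)
```

With these assumed numbers the hypothetical working set lands in L2 on Barcelona but spills to L3 on Nehalem, and the shared L3 sees about five times as many requests for the same number of L2 accesses.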

Thanks Bruno! That makes a lot of sense now. Assuming that is what is
happening, is there any way of still using the Nehalems fruitfully for
this code? Any smart tricks / hacks?

The reason is that the Nehalems seem to scale and perform beautifully
for my other codes.

The only other option is to fall back to the AMDs. I believe the
Shanghai would be a choice, or an Istanbul. I assume the cache
structure there is as good as the Barcelona's, if not better! Any
experiences with these chips on the group?

Funnily, I haven't heard of any such Nehalem (-ive) stories anywhere
else. Am I the first one to hit this cache bottleneck? I doubt it. Any
other cache-heavy users?

-- 
Rahul