<br><br><div class="gmail_quote">2009/8/11 Rahul Nabar <span dir="ltr"><<a href="mailto:rpnabar@gmail.com" target="_blank">rpnabar@gmail.com</a>></span><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


On Tue, Aug 11, 2009 at 12:06 PM, Bill Broadley<<a href="mailto:bill@cse.ucdavis.edu" target="_blank">bill@cse.ucdavis.edu</a>> wrote:<br>

> Looks to me like you fit in the barcelona 512KB L2 cache (and get good<br>

> scaling) and do not fit in the nehalem 256KB L2 cache (and get poor scaling).<br>

<br>

Thanks Bill! I never realized that the L2 cache of the Nehalem is<br>

actually smaller than that of the Barcelona!<br>

<br>

I have an E5520 and a X5550. Both have the 8 MB L3 cache I believe.<br>

The size of the L2 cache is fixed across the steppings of the Nehalem<br>

isn't it?</blockquote><div><br>I think it will probably only be fixed in newer models, or only in Westmere (the Nehalem shrink to 32 nm).<br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


<br>

<br>

> Were the binaries compiled specifically to target both architectures?  As a<br>

> first guess I suggest trying pathscale (RIP) or open64 for amd, and intel's<br>

> compiler for intel.  But portland group does a good job at both in most cases.<br>

<br>

We used the intel compilers. One of my fellow grad students did the<br>

actual compilation for VASP but I believe he used the "correct" [sic]<br>

flags to the best of our knowledge. I could post them on the list<br>

perhaps. There was no cross-compilation. We compiled a fresh binary<br>

for the Nehalem.<br>

<br>

> I'm curious about the hyperthreading on data point as well.<br>

<br>

Didn't test for VASP yet but for our other two DFT codes i.e. DACAPO<br>

and GPAW hyperthreading "off" seems to be about 10% faster.<br>

<br>

<br>

> A doubling of the can have that effect.  The Intel L3 cannot come anywhere<br>

> close to feeding 4 cores running flat out.<br>

<br>

Could you explain this more? I am a little lost with the processor<br>

dynamics. Does this mean using a quad core for HPC on the Nehalem is<br>

not likely to work well for scaling? Or do you imply a solution so<br>

that I could fix this somehow?<br>

</blockquote><div><br>Nehalem and Barcelona have the following cache architecture:<br><br>L1 cache: Barcelona: 128 KB (64 KB data + 64 KB instruction), Nehalem: 64 KB (32 KB data + 32 KB instruction), per core<br>L2 cache: Barcelona: 512 KB, Nehalem: 256 KB, per core<br>L3 cache: Barcelona: 2 MB, Nehalem: 8 MB, shared among all cores.<br>


<br><br>In both Barcelona and Nehalem, the "uncore" (everything outside the cores, such as the L3 and the memory controllers) runs at a lower clock speed than the cores. All cores also communicate through the L3, so it has to handle some coherence signals too.<br>


This makes it impossible for the L3 to feed all cores at full speed if the L2 caches have high miss ratios.<br><br>So what is happening with your program is probably something like this:<br><br>The working set fits in Barcelona's 512 KB L2 cache, so it has, say, a 10% miss rate,<br>

but it doesn't fit in Nehalem's 256 KB L2 cache, so it has, say, a 50% miss rate.<br>

So on Nehalem the shared L3 cache has to handle many more requests from all cores than on Barcelona, and it becomes a big bottleneck.<br><br></div></div><br>
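<div>To put rough numbers on the argument above, here is a minimal Python sketch. It assumes a toy model in which every L2 miss becomes exactly one request to the shared L3. The function name, the core count and the access rate are hypothetical and only for illustration. The 10% and 50% miss rates are the example figures from this mail, not measurements.<br><br></div>
<pre>
```python
# Toy model: every L2 miss becomes one request to the shared L3.
# All numbers are illustrative assumptions, not measurements.

def l3_requests_per_second(cores, l2_accesses_per_core, l2_miss_rate):
    """Return the total request rate that the shared L3 has to serve."""
    return cores * l2_accesses_per_core * l2_miss_rate


CORES = 4                    # quad-core socket
L2_ACCESSES = 100_000_000    # hypothetical L2 accesses per core per second

barcelona = l3_requests_per_second(CORES, L2_ACCESSES, 0.10)  # fits in 512 KB L2
nehalem = l3_requests_per_second(CORES, L2_ACCESSES, 0.50)    # spills out of 256 KB L2
ratio = nehalem / barcelona

print(f"Barcelona L3 requests/s: {barcelona:.3g}")
print(f"Nehalem   L3 requests/s: {nehalem:.3g}")
print(f"Nehalem / Barcelona:     {ratio:.1f}x")
```
</pre>
<div>With these example miss rates the shared L3 sees about five times as many requests on Nehalem for the same per-core access rate. Real codes will not follow such a simple model, so treat it only as a way to see why the L2 miss rate matters so much.<br></div>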