[Beowulf] bizarre scaling behavior on a Nehalem

Wed Aug 12 19:42:30 PDT 2009

Rahul Nabar wrote:
> On Tue, Aug 11, 2009 at 12:06 PM, Bill Broadley<bill at cse.ucdavis.edu> wrote:
>> Looks to me like you fit in the barcelona 512KB L2 cache (and get good
>> scaling) and do not fit in the nehalem 256KB L2 cache (and get poor scaling).
> 
> Thanks Bill! I never realized that the L2 cache of the Nehalem is
> actually smaller than that of the Barcelona!

Indeed.  Usually a doubling of cache size doesn't make a huge difference, but
of course there are the occasional times when it makes a big difference.

> I have an E5520 and an X5550. Both have the 8 MB L3 cache I believe.
> The size of the L2 cache is fixed across the steppings of the Nehalem
> isn't it?

I believe so, at least so far.

>> Were the binaries compiled specifically to target both architectures?  As a
>> first guess I suggest trying pathscale (RIP) or open64 for amd, and intel's
>> compiler for intel.  But portland group does a good job at both in most cases.
> 
> We used the intel compilers. One of my fellow grad students did the
> actual compilation for VASP but I believe he used the "correct" [sic]
> flags to the best of our knowledge. I could post them on the list
> perhaps. There was no cross-compilation. We compiled a fresh binary
> for the Nehalem.

I'd make sure the compiler is fairly current.  I believe both the
barcelona/shanghai and the core i7/nehalem have some significant tweaks, and
if the compiler isn't aware of the new functionality you leave significant
performance on the table.  In particular the newest SSE features won't be of
any benefit without direct compiler support.
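
As a quick sanity check before worrying about compiler flags, it can help to
see which SSE levels the CPU itself advertises.  The sketch below is an
illustration, not anything from the original runs: it assumes a Linux box
with /proc/cpuinfo, and the function name sse_flags is hypothetical.  It only
reports what the hardware offers, not what a given binary actually uses.

```python
def sse_flags(cpuinfo_text):
    """Return the sorted SSE-related CPU flags found in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Flags are space separated; keep the ones naming an SSE level.
            return sorted(f for f in value.split() if "sse" in f)
    return []


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            print(sse_flags(fh.read()))
    except OSError:
        # Assumption: non-Linux systems will not have this file.
        print("no /proc/cpuinfo on this system")
```

On a nehalem one would expect sse4_2 in the list and on a barcelona sse4a,
but check the output rather than trusting that.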

>> A doubling of the cache can have that effect.  The Intel L3 cannot come anywhere
>> close to feeding 4 cores running flat out.
> 
> Could you explain this more? I am a little lost with the processor
> dynamics.

In general each step through the memory hierarchy (registers, L1, L2, L3, and
main memory) approximately doubles the latency and halves the bandwidth
available.

So for instance if you fit in the L1 caches you might well be able to enjoy
160GB/sec, but if you use more than 1MB on a nehalem chip (4 cores x 256KB of
L2) you will be in L3 with only 48GB/sec or so.
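
To see this kind of step for yourself, a rough sweep over working-set sizes is
enough.  The following is only a minimal single-threaded sketch, not the code
behind the graph linked below: the names copy_bandwidth_gb_s and sweep are
made up for illustration, and the interpreter overhead means the absolute
numbers will be far from what a tuned C benchmark reports.  The point is the
shape of the curve as the buffers fall out of each cache level.

```python
import time


def copy_bandwidth_gb_s(size_bytes, min_seconds=0.05):
    """Estimate copy bandwidth in GB/s for buffers of size_bytes each."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    reps = 0
    start = time.perf_counter()
    while True:
        dst[:] = src  # same-length slice copy: reads src, writes dst
        reps += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            break
    # Each repetition moves size_bytes in and size_bytes out.
    return 2 * size_bytes * reps / elapsed / 1e9


def sweep(sizes, min_seconds=0.05):
    """Map each buffer size to its measured copy bandwidth."""
    return {size: copy_bandwidth_gb_s(size, min_seconds) for size in sizes}


if __name__ == "__main__":
    # The working set is src plus dst, so twice the size listed here.
    for size, gbs in sweep([16 << 10, 128 << 10, 1 << 20, 8 << 20, 64 << 20]).items():
        print(f"{size // 1024:>8} KB  {gbs:8.2f} GB/s")
```

Expect the small sizes to be dominated by per-call overhead; the interesting
part is where the larger sizes drop off.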

Check out: (the slightly updated)
http://cse.ucdavis.edu/bill/pstream.svg

So if you compare the 2MB lines, the core i7 with 4 threads running can handle
47GB/sec.  The dual socket barcelona or shanghai system can handle 128GB/sec.
So even with one of the faster clocks (I tested 2.6 GHz) and perfect scaling,
a dual socket nehalem would only get about 95GB/sec, still well below the amd
score.  Of course there are many other things going on and it might well be
other differences in the architecture that are responsible for the difference.
Even if it is memory bandwidth, there are many other parts of the graph where
the single socket intel does substantially better than half the AMD, and in
the case of accessing main memory the single socket intel is faster than the
dual socket AMD.
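
The arithmetic behind that comparison, spelled out as a small sketch.  The
two bandwidth figures are the ones quoted above for the 2MB point; the
perfect-scaling factor of 2 is an assumption (a best case, not a measurement),
and the function name is just for illustration.

```python
def ideal_dual_socket(single_socket_gb_s, sockets=2):
    """Best-case bandwidth if every socket adds its full single-socket rate."""
    return single_socket_gb_s * sockets


NEHALEM_SINGLE_GB_S = 47   # core i7, 4 threads, 2MB working set (from the graph)
AMD_DUAL_GB_S = 128        # dual socket barcelona/shanghai, same point

ideal_dual = ideal_dual_socket(NEHALEM_SINGLE_GB_S)
fraction_of_amd = ideal_dual / AMD_DUAL_GB_S

if __name__ == "__main__":
    print(f"ideal dual nehalem: {ideal_dual} GB/s")
    print(f"fraction of dual AMD: {fraction_of_amd:.2f}")
```

So at that working-set size even the best case lands at roughly three
quarters of the AMD number (2 x 47 is 94, which I rounded to 95 above).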

So basically it comes down to fun handwaving about the architecture, but if
you are making a price/performance decision, collect a bunch of production
runs and get out a stopwatch.  Your vasp difference in performance and scaling
might well disappear with different inputs.
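
The stopwatch can be as simple as the sketch below.  It is a minimal example
under assumptions: time_command and summarize are hypothetical helper names,
and the command shown is a stand-in so the script runs anywhere.  Substitute
the actual launch line for your production job on each machine.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, repeats=3):
    """Run cmd repeatedly and return the wall-clock seconds of each run."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        times.append(time.perf_counter() - start)
    return times


def summarize(times):
    """Reduce a list of timings to min / median / max."""
    return {"min": min(times),
            "median": statistics.median(times),
            "max": max(times)}


if __name__ == "__main__":
    # Stand-in workload; replace with the real production run.
    cmd = [sys.executable, "-c", "sum(range(10**6))"]
    print(summarize(time_command(cmd)))
```

Run it with several different inputs and core counts on both machines before
drawing conclusions about scaling.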

> Does this mean using a quad core for HPC on the Nehalem is
> not likely to work well for scaling? Or do you imply a solution so
> that I could fix this somehow?

I didn't test a dual socket nehalem because I didn't have access; I hope to
have numbers soonish.  In the meantime contact me off list if you want the
code to try it yourself.