[Beowulf] AMD performance (was 500GB systems)

Sat Jan 12 18:21:27 PST 2013

On 01/12/2013 04:25 PM, Stu Midgley wrote:
> Until the Phi's came along, we were purchasing 1RU, 4 sockets nodes
> with 6276's and 256GB ram.  On all our codes, we found the throughput
> to be greater than any equivalent density Sandy bridge systems
> (usually 2 x dual socket in 1RU) at about 10-15% less energy and
> about 1/3 the price for the actual CPU (save a couple thousand $$ per
> 1RU).

We found similar results for many workloads.  The last few generations
of AMD CPUs have had 4 memory channels per socket.  At first I was
puzzled that even fairly memory-intensive codes scaled well.

Even when following a random pointer chain, performance almost doubled
when I tested with 2 threads per memory channel instead of 1.
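
A minimal sketch of how a random pointer chain of this kind can be set
up, for anyone who wants to try something similar.  This is not the
benchmark used for the numbers above.  The names build_chain and chase
are made up for illustration, and a real latency test would be written
in C with one chain per thread and an array much larger than L3, since
Python interpreter overhead swamps memory latency.

```python
import random


def build_chain(n, seed=0):
    """Return a list nxt where following nxt[i] visits all n slots in one cycle.

    Uses Sattolo's algorithm, which yields a random permutation made of a
    single cycle, so the chase cannot get stuck in a short loop that fits
    in cache.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Follow the chain for `steps` hops; each hop depends on the previous one."""
    i = start
    for _ in range(steps):
        i = nxt[i]
    return i


if __name__ == "__main__":
    chain = build_chain(1 << 16)
    # After exactly len(chain) hops a single-cycle chain is back at the start.
    print(chase(chain, len(chain)))
```

Because every load depends on the result of the previous one, a single
thread can only have one miss outstanding at a time, which is what makes
this a latency test rather than a bandwidth test.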

Then I realized the L3 latency is almost half of the latency to main
memory.  So you get significant throughput advantages by having a queue
of L3 cache misses waiting for the instant any of the memory channels
frees up.
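
A rough back-of-the-envelope model of that effect, as a sketch only.
The 45 ns and 55 ns figures are assumed round numbers chosen so that L3
is "almost half" of a 100 ns total, not measurements, and the model
pretends a channel serves exactly one miss at a time, which ignores
DRAM banks, controller queues and prefetching.  The function name
chases_per_ns is hypothetical.

```python
def chases_per_ns(threads, l3_ns=45.0, dram_ns=55.0):
    """Upper bound on pointer-chase hops per ns through one memory channel.

    Each thread has one miss in flight: it spends l3_ns getting through
    the cache hierarchy (channel idle for that thread), then dram_ns on
    the channel itself.  All timings are assumed values.
    """
    # Limit set by how fast the threads can issue dependent loads.
    latency_bound = threads / (l3_ns + dram_ns)
    # Limit set by the channel being busy dram_ns per miss.
    channel_bound = 1.0 / dram_ns
    return min(latency_bound, channel_bound)


if __name__ == "__main__":
    base = chases_per_ns(1)
    for t in (1, 2, 4):
        print(t, "threads/channel: speedup", round(chases_per_ns(t) / base, 2))
```

With these assumed numbers one thread leaves the channel idle a bit
under half the time, and a second thread fills most of that gap.  The
model is deterministic, so it saturates at 2 threads; in practice miss
timing is irregular, which leaves room for further threads to help.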

In fact, even with 2 jobs per memory channel the memory channel
sometimes goes idle.  Even 4 jobs per memory channel sees some increase.
The good news is that most codes aren't as memory bandwidth/latency
intensive as the related micro benchmarks (and therefore scale better).

I think having more cores per memory channel is a key part of AMD's
improved throughput per socket when compared to Intel.  That's not
always true, of course; again, it's highly application dependent.

> Of course, we are now purchasing Phi's.  First 2 racks meant to turn
> up this week.

Interesting, please report back on anything of interest that you find.