[Beowulf] AMD performance (was 500GB systems)

Thu Jan 10 21:03:38 PST 2013

Over the last few months I've been hearing quite a few negative comments
about AMD.  Seems like most of them are extrapolating from desktop
performance.

Keep in mind that it's quite a stretch going from a desktop (single
socket, 2 memory channels) to a server (dual socket, 4x the cores, 8
memory channels).

Also keep in mind that compilers and kernels can make quite a
difference.  The vector units have changed significantly (a factor of 2)
and the scheduler needs tweaks to account for the various latencies and
NUMA related values.  Using old kernels/compilers may well significantly
impact AMD and/or Intel.

I've found the bandwidth and latency mostly controlled by the socket and
specifically the number of memory channels.  2, 3, and 4 channel per
socket systems have very similar bandwidth and latency for AMD and Intel
systems.

When taking a pragmatic approach to best price/performance I find AMD
competitive.  Normally I figure out how much RAM per CPU and how much
disk is needed, then figure out which Intel chip has the best system
price/system perf on the relevant applications.  Then I do the same for
AMD and buy whichever is better.  Often the result is a 15% improvement
in one direction or another (HIGHLY application dependent).

Of course sometimes a user asks for the "better" system for running a
wide variety of floating point codes.  In such cases I often use the
SPEC CPU2006 FP rate.

In a recent comparison (both perf numbers from HP systems) I looked at:
* AMD 6344,      64GB ram, SpecFPRateBase=333, $2,915, $8.75 per spec
* Intel E5-2620, 64GB ram, SpecFPRateBase=322, $2,990, $9.29 per spec
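
The dollars-per-spec figure is just system price divided by the
SpecFPRateBase score.  A minimal sketch of that arithmetic using the two
quotes above (the dictionary layout and function name are made up for
illustration):

```python
# Prices and SpecFPRateBase scores are the two HP quotes listed above.
# The names used here (systems, dollars_per_spec) are illustrative only.
systems = {
    "AMD 6344": {"price_usd": 2915, "specfp_rate_base": 333},
    "Intel E5-2620": {"price_usd": 2990, "specfp_rate_base": 322},
}


def dollars_per_spec(price_usd, specfp_rate_base):
    """System price divided by SpecFPRateBase (lower is better)."""
    return price_usd / specfp_rate_base


ratios = {name: dollars_per_spec(**s) for name, s in systems.items()}
best = min(ratios, key=ratios.get)

for name, ratio in ratios.items():
    print(f"{name}: ${ratio:.2f} per spec")
print(f"Lowest cost per spec: {best}")
```

The same loop works with any other metric in place of SpecFPRateBase,
e.g. the measured throughput of the application that justifies the
purchase.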

Whenever possible I try to benchmark with the actual applications that
justify the purchase of a cluster.

When using actual end user applications it's about a 50/50 chance that
AMD or Intel will win.

I figured I'd add a few comments:
* Latency for a quad socket AMD is around 64ns to a random piece
  of memory (not 600ns as recently mentioned).
* AMD quad sockets with 512GB ram start around $9k (USD)
* With OpenMP, pthreads, MPI or other parallel friendly code a quad
  socket AMD can look up a random cache line approximately every 2.25ns
  (64 threads banging on 16 memory channels at once).
* I've seen no problems with the AMD memory system; in general
  the 2k pin/4 memory bus AMD sockets seem to perform similarly
  to Intel.
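
To put the 2.25ns figure in more familiar units, here is a rough
back-of-the-envelope conversion.  It assumes a 64-byte cache line, which
is not stated above, and the variable names are made up for
illustration:

```python
# Figures quoted in the list above.
INTERVAL_NS = 2.25       # aggregate time per random cache line, all threads
THREADS = 64             # threads issuing random lookups at once

# Assumption, not from the post: 64-byte cache lines.
CACHE_LINE_BYTES = 64

# Aggregate random cache line lookups per second across the machine.
lines_per_sec = 1e9 / INTERVAL_NS

# Equivalent random-access bandwidth in GB/s (1 GB = 1e9 bytes).
random_bw_gb_s = lines_per_sec * CACHE_LINE_BYTES / 1e9

# Average time between completed lookups as seen by a single thread.
per_thread_interval_ns = INTERVAL_NS * THREADS

print(f"{lines_per_sec / 1e6:.0f} million random cache lines per second")
print(f"{random_bw_gb_s:.1f} GB/s of random-access bandwidth")
print(f"{per_thread_interval_ns:.0f} ns between lookups per thread")
```

Keep in mind that the per-thread interval is measured with all 64
threads loading the memory system at once, so it isn't directly
comparable to the 64ns latency figure.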

An example of AMD's bandwidth scaling on a quad socket with 64 cores:
  http://cse.ucdavis.edu/bill/pstream/bm3-all.png

I don't have a similar Intel system, but I do have a dual socket E5:
  http://cse.ucdavis.edu/bill/pstream/e5-2609.png