[Beowulf] AMD performance (was 500GB systems)
bill at cse.ucdavis.edu
Thu Jan 10 21:03:38 PST 2013
Over the last few months I've been hearing quite a few negative comments
about AMD. Seems like most of them are extrapolating from desktop
Keep in mind that it's quite a stretch going from a desktop (single
socket, 2 memory channels) to a server (dual socket, 4x the cores, 8
Also keep in mind that compilers and kernels can make quite a
difference. The vector units have changed significantly (a factor of 2)
and the scheduler needs tweaks to account for the various latencies and
NUMA-related values. Using old kernels/compilers may well significantly
hurt performance on AMD and/or Intel.
I've found the bandwidth and latency mostly controlled by the socket and
specifically the number of memory channels. 2, 3, and 4 channel per
socket systems have very similar bandwidth and latency for AMD and Intel
When taking a pragmatic approach to best price/performance I find AMD
competitive. Normally I figure out how much RAM per CPU is needed and the
disk needs, then figure out which Intel chip has the best system
price/system perf on the relevant applications. Then I do the same for
AMD and buy whichever is better. Often the result is a 15% improvement in
one direction or another (HIGHLY application dependent).
Of course sometimes a user asks for the "better" system for running a
wide variety of floating point codes. In such cases I often use CPU2006
In a recent comparison (both perf numbers from HP systems) I looked at:
* AMD 6344, 64GB ram, SpecFPRateBase=333 $2,915, $8.75 per spec
* Intel E5-2620, 64GB ram, SpecFPRateBase=322 $2,990, $9.22 per spec
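The price per SPEC point above is just system price divided by SpecFPRateBase. A minimal sketch of that arithmetic, using only the prices and scores listed; the dictionary layout and function name are illustrative, not anything from a real tool. Note that the listed E5-2620 price and score divide out to roughly $9.29, slightly above the $9.22 quoted, so one of the listed figures may be rounded or mistyped.

```python
# Figures copied from the two list items above; nothing here is measured.
systems = {
    "AMD 6344": {"price_usd": 2915, "specfp_rate_base": 333},
    "Intel E5-2620": {"price_usd": 2990, "specfp_rate_base": 322},
}


def dollars_per_spec(price_usd, score):
    """System price divided by benchmark score (lower is better)."""
    return price_usd / score


results = {
    name: dollars_per_spec(s["price_usd"], s["specfp_rate_base"])
    for name, s in systems.items()
}

# Print cheapest-per-point first.
for name, value in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${value:.2f} per SPECfp rate point")

best = min(results, key=results.get)
```

The same division works with any application benchmark in place of the SPEC score, which is closer to how I'd justify a real purchase.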
Whenever possible I try to use actual applications justifying the
purchase of a cluster.
When using actual end user applications it's about a 50/50 chance that
AMD or Intel will win.
I figured I'd add a few comments:
* Latency for a quad socket AMD is around 64ns to a random piece
of memory (not 600ns as recently mentioned).
* AMD quad-socket systems with 512GB RAM start around $9k (USD)
* With OpenMP, pthreads, MPI or other parallel friendly code a quad
socket AMD can look up a random cache line approximately every 2.25ns
(64 threads banging on 16 memory channels at once).
* I've seen no problems with the AMD memory system; in general
the 2k pin/4 memory bus AMD sockets seem to perform similarly
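As a rough back-of-envelope on the 2.25ns figure in the list above, the sketch below converts it into aggregate numbers. The 64-byte cache line size and the "one outstanding miss per thread" reading are assumptions, not something stated in the post, so treat the derived values as illustrative only.

```python
# From the post: one random cache line every ~2.25 ns across the machine,
# with 64 threads hitting 16 memory channels.
NS_PER_LINE = 2.25
THREADS = 64
MEMORY_CHANNELS = 16

# Assumption (not in the post): cache lines are 64 bytes.
CACHE_LINE_BYTES = 64

# Aggregate random lookups per second for the whole system.
lookups_per_sec = 1e9 / NS_PER_LINE

# Bytes moved per second if every lookup pulls a full cache line.
bandwidth_gb_s = lookups_per_sec * CACHE_LINE_BYTES / 1e9

# If each thread has one miss in flight at a time (assumption), this is
# the time between completions as seen by a single thread.
per_thread_ns = NS_PER_LINE * THREADS

# Average time between lines delivered by a single memory channel.
per_channel_ns = NS_PER_LINE * MEMORY_CHANNELS

print(f"{lookups_per_sec / 1e6:.0f} M random lookups/s")
print(f"{bandwidth_gb_s:.1f} GB/s of random cache-line traffic")
print(f"{per_thread_ns:.0f} ns per lookup per thread")
print(f"{per_channel_ns:.0f} ns per line per channel")
```

Under those assumptions each thread sees a longer per-lookup time under full load than the 64ns single-access latency, which is what you would expect with all 16 channels busy.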
An example of AMD's bandwidth scaling on a quad socket with 64 cores:
I don't have a similar Intel, but I do have a dual socket E5: