[Beowulf] AMD performance (was 500GB systems)

Fri Jan 11 16:22:37 PST 2013

On 01/11/2013 04:01 AM, Joshua mora acosta wrote:
> Hi Bill, 
> AMD should pay you for these wise comments ;) 
> 
> But since this list is about providing feedback, and sharing knowledge, I
> would like to add something to your comments, and somewhat HW agnostic. When
> you are running stream benchmark it is an easy way to find out what the memory
> controllers are capable. 

Well, it's my own code; last I checked, stream didn't do dynamic
allocations or use pthreads, not to mention the various tweaks for NUMA,
affinity, and related.

> Stream does minimal computation, at most the triad but it really exposes the
> bottleneck (in negative terms) or the throughput (in positive terms) of the
> processor and platform (when accounting multiple processors connected by some
> type of fabric: cHT, QPI, network) when looking at the aggregated memory
> bandwidth. 

Correct, stream is a lousy benchmark to quantify application performance.
Just wanted to counter some comments I've heard about AMD's memory system.

> The main comment I would like to add is with respect to your stream bandwidth
> results. Looking at your log2 chart, it says that AMD delivers about ~100GB/s
> on 4P system and on Intel it delivers ~30GB/s on 2P systems. I may be reading
> wrong in the chart but it should be about 140GB/s with AMD
> (Interlagos/Abudhabi) with 1600MHz DDR3 memory and about 40GB/s with INTEL
> (Nehalem/Westmere) with memory at 1333MHz DDR3 and about 75GB/s with
> Sandybridge with memory at 1600MHz DDR3.

Well, in my experience there are 3 major numbers for sequential memory
bandwidth:
1) the marketing numbers (clockspeed * width), which come to approximately
   50GB/s per socket for Intel/AMD with 4 channels.
2) Stream returned numbers using good compilers (Intel, Portland
   Group, or Open64), which are only reached with static arrays.  Often
   50-75% or so of the marketing numbers.
3) Stream returned numbers using good compilers using dynamic
   allocation (malloc in c or new in c++) often 25-50% of the marketing
   numbers.  From what I can tell the use of dynamic allocation disables
   non-temporal stores.
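
For reference, the marketing number in (1) is plain arithmetic. A rough
sketch in C, assuming 64-bit (8 byte) DDR3 channels; the function names are
made up for illustration and the percentage bands are simply the ones from
the list above:

```c
#include <stdio.h>

/* Theoretical peak bandwidth of one socket in GB/s:
 * channels * transfers per second * bytes per transfer.
 * Assumes each DDR3 channel is 64 bits (8 bytes) wide. */
double peak_gb_per_s(int channels, double mega_transfers_per_s)
{
    const double bytes_per_transfer = 8.0;
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9;
}

/* Print the peak and the two rule-of-thumb bands from the list above. */
void print_expected_ranges(int channels, double mega_transfers_per_s)
{
    double peak = peak_gb_per_s(channels, mega_transfers_per_s);
    printf("peak:           %.1f GB/s\n", peak);
    printf("static arrays:  %.1f - %.1f GB/s (50-75%%)\n",
           0.50 * peak, 0.75 * peak);
    printf("dynamic arrays: %.1f - %.1f GB/s (25-50%%)\n",
           0.25 * peak, 0.50 * peak);
}
```

Calling print_expected_ranges(4, 1600.0) for four channels of DDR3-1600
gives a peak of 51.2 GB/s, which is where the roughly 50GB per socket
figure comes from.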

GCC usually lands at the dynamic allocation numbers (#3), whether or not
the arrays are dynamically allocated.

I wonder what percentage of bandwidth intensive codes dynamically
allocate memory.

> In order to do so, you want to use non temporal stores, which bypass the
> regular process of cache coherence. Many applications behave that way since
> you have to do a pass through the data and you may access it again (eg. in the

I believe the Intel, Portland Group, and Open64 compilers do this
automatically, even when just doing the obvious:
  for (j=0; j<N; j++) // where N = large array
      c[j] = a[j]+b[j];

Sadly, if a, b, or c is dynamically allocated, that seems to disable the
non-temporal stores.
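
One possible workaround is to ask for the streaming stores explicitly
instead of relying on the compiler. A minimal sketch, assuming an x86
target with SSE2 (it falls back to ordinary stores elsewhere); add_stream
and alloc_doubles are hypothetical helper names, not part of stream or of
any compiler mentioned here, and I have not measured whether this recovers
the static-array numbers:

```c
#include <stddef.h>
#include <stdlib.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* c[j] = a[j] + b[j], writing c with non-temporal stores when SSE2 is
 * available. Assumes c is at least 16-byte aligned. */
void add_stream(double *c, const double *a, const double *b, size_t n)
{
    size_t j = 0;
#if defined(__SSE2__)
    for (; j + 1 < n; j += 2) {
        __m128d va = _mm_loadu_pd(&a[j]);
        __m128d vb = _mm_loadu_pd(&b[j]);
        _mm_stream_pd(&c[j], _mm_add_pd(va, vb)); /* store bypasses cache */
    }
    _mm_sfence(); /* order the streamed stores before later accesses */
#endif
    for (; j < n; j++)      /* odd tail, or the whole loop without SSE2 */
        c[j] = a[j] + b[j];
}

/* malloc-style allocation of n doubles on a 64-byte boundary (C11).
 * The size is rounded up to a multiple of the alignment. Release with
 * free(). */
double *alloc_doubles(size_t n)
{
    size_t bytes = (n * sizeof(double) + 63) / 64 * 64;
    return aligned_alloc(64, bytes);
}
```

The arrays come from the heap, as in the malloc case, but the store
instruction is chosen by hand, so it no longer depends on what the
compiler can prove about the pointers.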

For instance, open64, openmp, 1831MB array:
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:      101336.5507       0.0135       0.0126       0.0146
Scale:      98265.0155       0.0141       0.0130       0.0153
Add:       103543.0881       0.0202       0.0185       0.0225
Triad:     104677.6852       0.0194       0.0183       0.0213

If I switch to using malloc:
97,99c97
< static double	a[N+OFFSET],
< 		b[N+OFFSET],
< 		c[N+OFFSET];
---
> static double *a,*b,*c;
134a133,135
>     a = (double *) malloc ((N+OFFSET)*sizeof(double));
>     b = (double *) malloc ((N+OFFSET)*sizeof(double));
>     c = (double *) malloc ((N+OFFSET)*sizeof(double));

Copy:       74228.2843       0.0178       0.0172       0.0185
Scale:      74310.4782       0.0180       0.0172       0.0189
Add:        82776.3594       0.0240       0.0232       0.0249
Triad:      82598.0664       0.0239       0.0232       0.0250
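
To put the two tables side by side, here is a small sketch that divides
the malloc rates by the static-array rates. The numbers are copied from the
two runs above; the function names are just for illustration:

```c
#include <stdio.h>

/* Rates in MB/s copied from the two open64 runs above. */
static const char *kernel[4] = { "Copy", "Scale", "Add", "Triad" };
static const double static_rate[4] = {
    101336.5507, 98265.0155, 103543.0881, 104677.6852
};
static const double malloc_rate[4] = {
    74228.2843, 74310.4782, 82776.3594, 82598.0664
};

/* Fraction of the static-array rate that the malloc build retains. */
double retained_fraction(int i)
{
    return malloc_rate[i] / static_rate[i];
}

/* Print one line per kernel with the retained percentage. */
void print_retained(void)
{
    for (int i = 0; i < 4; i++)
        printf("%-6s %5.1f%%\n", kernel[i], 100.0 * retained_fraction(i));
}
```

The malloc build keeps roughly 73% (Copy) to 80% (Add) of the static-array
bandwidth in this run.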

> Finally, I have done a chart of performance/dollar for a wide range of
> processor variants, taking as performance both FLOPs and memory bandwidth and
> assuming equal cost of chassis and amount of memory, dividing the performance
> by the cost of the processor. 

I agree that the costs of chassis, ram, motherboard and related are very
similar.  But it seems odd to evaluate price/performance without using
the system (not CPU) price.  The best price/perf CPU will very often
be different from the CPU for the best price/perf node.

While interesting, when making design/purchase decisions I look at
price/performance per node.
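
A toy sketch of why the two rankings can disagree. Everything here is
hypothetical: the struct, the function names, and any prices or scores you
feed it are made up for illustration, not taken from any real price list:

```c
struct cpu {
    const char *name;
    double price; /* dollars per CPU (hypothetical) */
    double perf;  /* performance per CPU, arbitrary units (hypothetical) */
};

/* Performance per dollar counting only the CPU price. */
double perf_per_dollar_cpu(struct cpu c)
{
    return c.perf / c.price;
}

/* Performance per dollar for a whole node: the fixed cost (chassis, ram,
 * motherboard) is paid once, the CPU price once per socket. */
double perf_per_dollar_node(struct cpu c, int sockets, double fixed_cost)
{
    return sockets * c.perf / (fixed_cost + sockets * c.price);
}
```

With made-up inputs such as a $500 CPU scoring 100 and a $1500 CPU scoring
220, in a two socket node with $3000 of fixed cost, the cheap CPU wins per
CPU dollar (0.20 vs about 0.15) but the expensive CPU wins per node dollar
(0.05 vs about 0.073).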

> I am attaching it to this email. I took the cost of the processors from
> publicly available information on both AMD and INTEL processors. I know that
> price varies for each deal but as a fair as possible estimate, I get that
> Perf/$ is 2X on AMD than on INTEL, regardless of looking at FLOP/s or GB/s,
> and comparing similar processor models (ie. 8c INTEL vs 16c AMD). 

Did you intentionally ignore the current generation AMDs?  Personally
I'd find a CPU2006 per $ more interesting (Int or FP rate).

> You can make the chart by yourself if you know how to compute real FLOPs and
> real bandwidth. 

Normally I take wall clock time on an application that justifies the
purchase of a cluster, relative to the cost of a node.