[Beowulf] AMD performance (was 500GB systems)

Vincent Diepeveen diep at xs4all.nl
Fri Jan 11 05:22:25 PST 2013


On Jan 11, 2013, at 6:03 AM, Bill Broadley wrote:

>
> Over the last few months I've been hearing quite a few negative comments
> about AMD.  Seems like most of them are extrapolating from desktop
> performance.
>
> Keep in mind that it's quite a stretch going from a desktop (single
> socket, 2 memory channels) to a server (dual socket, 4x the cores, 8
> memory channels).
>

Bill - a 2 socket system doesn't deliver 512GB of RAM.

A comparison in the 2 socket domain doesn't make sense for someone who
needs 512GB of RAM; the performance of 4 socket systems is totally
different from that of 2 socket systems.

[snip]
>
> I figured I'd add a few comments:
> * Latency for a quad socket AMD is around 64ns to a random piece
>   of memory (not 600ns as recently mentioned).

I wrote a test program for this in 2003.

You obviously have no idea what TLB thrashing accesses look like in the
hundreds of gigabytes range.
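
To put rough numbers on the TLB thrashing point, here is a minimal
sketch of the arithmetic. The 4 KiB page size and the 1024-entry TLB are
assumptions chosen for illustration, not figures from this thread or
from any specific CPU.

```python
# Rough TLB-reach arithmetic. The page size and entry count used in the
# example run are hypothetical values, not measurements of a real CPU.

GIB = 1024 ** 3
KIB = 1024

def pages_needed(working_set_bytes, page_bytes):
    """Number of pages required to map the working set (rounded up)."""
    return -(-working_set_bytes // page_bytes)

def tlb_reach_bytes(tlb_entries, page_bytes):
    """Memory covered by the TLB when every entry maps a distinct page."""
    return tlb_entries * page_bytes

def tlb_hit_fraction(working_set_bytes, tlb_entries, page_bytes):
    """Chance that a uniformly random address is already mapped in the TLB."""
    reach = tlb_reach_bytes(tlb_entries, page_bytes)
    return min(1.0, reach / working_set_bytes)

if __name__ == "__main__":
    working_set = 512 * GIB
    page = 4 * KIB      # assumed page size
    entries = 1024      # assumed TLB entry count (hypothetical)
    print("pages to map:", pages_needed(working_set, page))
    print("TLB reach (MiB):", tlb_reach_bytes(entries, page) // (1024 * 1024))
    print("hit fraction:", tlb_hit_fraction(working_set, entries, page))
```

Under these assumptions nearly every uniformly random read misses the
TLB, so a page-table walk is added on top of the DRAM access itself.
Larger pages would change the numbers considerably.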

There are 0 cheap systems on the planet where you can get a bunch of
random bytes in 64 ns from a random spot out of 500GB of RAM, from a
memory line you haven't opened before and which for sure isn't in the
cache yet. You will be looking at 400+ ns latencies best case.
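
The kind of access pattern described here (every read lands on a random,
previously untouched spot, and the next address is not known until the
current read completes) can be sketched as a pointer chase. This is not
the 2003 test program, only a minimal illustration; the function names
are made up for this sketch, and in Python the interpreter overhead
dominates the timing, so the number it prints says little about the
hardware. A real measurement would need a compiled language and a table
far larger than the caches.

```python
import random
import time

def random_cycle(n, seed=0):
    """Return a permutation of range(n) that forms one single cycle
    (Sattolo's algorithm), so a chase visits every slot before repeating."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def chase(nxt, steps, start=0):
    """Follow the chain: each read depends on the result of the previous one."""
    idx = start
    for _ in range(steps):
        idx = nxt[idx]
    return idx

def ns_per_step(nxt, steps):
    """Average wall-clock time per dependent read, in nanoseconds."""
    t0 = time.perf_counter_ns()
    chase(nxt, steps)
    return (time.perf_counter_ns() - t0) / steps

if __name__ == "__main__":
    table = random_cycle(1 << 20)
    print("ns per dependent read:", ns_per_step(table, 1 << 20))
```

The single-cycle permutation matters: a plain shuffle can produce short
loops that stay in cache, which would make the latency look far better
than it is.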

You won't get it faster on any platform that is affordable (of course
512GB of SRAM would be faster, but let's not go into theoretical
discussions here, as you can't afford 512GB of SRAM).

> * AMD quad sockets with 512GB ram start around $9k ($USA)

You can easily build one with new components from ebay for $2k. Then
add the price of 512GB of RAM to that.
New from a shop the AMD stuff is dirt cheap as well, as a single core of
the new Bulldozer line ain't fast of course; offers for fully assembled,
ready-to-run systems are around the $6k mark, excluding the 512GB of RAM
of course.

Yet it has better latency to a 512 GB block of RAM than Intel's 4
socket systems.

And that will be many many hundreds of nanoseconds of course.

> * With OpenMP, pthreads, MPI or other parallel friendly code a quad
>   socket amd can look up random cache line approximately every 2.25ns.
>   (64 threads banging on 16 memory channels at once).

You still didn't get the picture of TLB thrashing software, huh?

It reads from a random memory location each time. Only at the end of
the calculation does the search space converge a tad, but still it's
random.

A measurement I have from a somewhat older 8 socket Intel box here is
700 ns for similar TLB thrashing behaviour.
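
The quoted 2.25 ns figure is an aggregate rate across 64 threads, while
the figures in this reply are per-access latencies, so the two are not
directly comparable. A small sketch of the conversion, assuming each
thread has exactly one dependent read in flight at a time (an assumption
made here for illustration, not something stated in the thread):

```python
def per_thread_interval_ns(aggregate_interval_ns, threads):
    """Time between completed reads as seen by one thread, given the
    machine-wide interval between completed reads."""
    return aggregate_interval_ns * threads

def aggregate_interval_ns(latency_ns, threads):
    """Machine-wide interval between completed reads when every thread
    waits latency_ns for each of its own reads."""
    return latency_ns / threads

if __name__ == "__main__":
    # Figures taken from the discussion above.
    print(per_thread_interval_ns(2.25, 64))  # per-thread view of 2.25 ns
    print(aggregate_interval_ns(400, 64))    # aggregate view of 400 ns
    print(aggregate_interval_ns(700, 64))    # aggregate view of 700 ns
```

Under that assumption, one cache line every 2.25 ns across 64 threads
corresponds to 144 ns between reads for each individual thread.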

> * I've seen no problems with the AMD memory system, in general
>   the 2k pin/4 memory bus amd sockets seem to perform similarly
>   to Intel.

For random accesses on a single socket or 2 sockets there are huge
differences (all cores busy).

Intel single socket is around 90 ns for my benchmark and Bulldozer
single socket around 150-170 ns (8 cores busy).

You really have no idea what 'random' reads are.

>
> An example of AMD's bandwidth scaling on a quad socket with 64 cores:
>   http://cse.ucdavis.edu/bill/pstream/bm3-all.png
>
> I don't have a similar Intel, but I do have a dual socket e5:
>   http://cse.ucdavis.edu/bill/pstream/e5-2609.png