[Beowulf] [External] Re: AMD and AVX512

Wed Jun 16 20:39:39 UTC 2021

Scott (and Michael and Carlos),

Thanks for your excellent feedback. That's the kind of enlightening 
response I was looking for. It's interesting that the HBM on Fugaku 
exceeds the needs of the processor.

Prentice

On 6/16/21 2:23 PM, Scott Atchley wrote:

> On Wed, Jun 16, 2021 at 1:15 PM Prentice Bisbal via Beowulf 
> <beowulf at beowulf.org> wrote:
>
>     Did anyone else attend this webinar panel discussion with AMD hosted
>     by HPCWire yesterday? It was titled "AMD HPC Solutions: Enabling Your
>     Success in HPC"
>
>     https://www.hpcwire.com/amd-hpc-solutions-enabling-your-success-in-hpc/
>
>     I attended it, and noticed there was no mention of AMD supporting
>     AVX512, so during the question and answer portion of the program, I
>     asked when AMD processors will support AVX512. The answer given, and
>     I'm not making this up, is that AMD listens to their users and gives
>     the users what they want, and right now they're not hearing any
>     demand for AVX512.
>
>     Personally, I call BS on that one. I can't imagine anyone in the HPC
>     community saying "we'd like processors that offer only 1/2 the
>     floating-point performance of Intel processors". Sure, AMD can offer
>     more cores, but with only AVX2, you'd need twice as many cores to
>     match Intel processors, all other things being equal.
>
>     Last fall I evaluated potential new cluster nodes for a large cluster
>     purchase using the HPL benchmark. I compared a server with dual AMD
>     EPYC 7H12 processors (128 cores) to a server with quad Intel Xeon
>     8268 processors (96 cores). I measured 5,389 GFLOPS for the Xeon
>     8268 system and only 3,446 GFLOPS for the AMD 7H12 system. That's a
>     LINPACK score that is only 64% of the Xeon 8268 system's, despite
>     having 33% more cores.
>
>     From what I've heard, the AMD processors run much hotter than the
>     Intel processors, too, so I imagine a FLOPS/Watt comparison would be
>     even less favorable to AMD.
>
>     An argument can be made that calculations that lend themselves to
>     vectorization should be done on GPUs instead of the main processors,
>     but the last time I checked, GPU jobs are still memory-limited, and
>     moving data in and out of GPU memory can still take time, so I can
>     see situations where, for large amounts of data, using CPUs would be
>     preferred over GPUs.
>
>     Your thoughts?
>
>     -- 
>     Prentice
>
>
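The arithmetic behind the HPL comparison quoted above can be checked with a short sketch. The GFLOPS figures and core counts are the ones given in the message; the per-core numbers are derived here and were not stated in the original post. Variable names are illustrative only.

```python
# HPL results (GFLOPS) and core counts per server, as quoted in the message.
xeon_gflops, xeon_cores = 5389.0, 96    # quad Intel Xeon 8268
epyc_gflops, epyc_cores = 3446.0, 128   # dual AMD EPYC 7H12

# Whole-server comparison.
score_ratio = epyc_gflops / xeon_gflops   # EPYC score as a fraction of Xeon score
core_ratio = epyc_cores / xeon_cores      # EPYC cores as a multiple of Xeon cores

# Per-core comparison (derived, not stated in the original message).
xeon_per_core = xeon_gflops / xeon_cores
epyc_per_core = epyc_gflops / epyc_cores
per_core_ratio = epyc_per_core / xeon_per_core

print(f"EPYC score relative to Xeon: {score_ratio:.0%}")
print(f"EPYC cores relative to Xeon: {core_ratio:.2f}x")
print(f"GFLOPS per core: Xeon {xeon_per_core:.1f}, EPYC {epyc_per_core:.1f}")
print(f"Per-core ratio (EPYC / Xeon): {per_core_ratio:.2f}")
```

On these numbers the per-core HPL throughput of the EPYC system comes out at a little under half that of the Xeon system, which is roughly what one would expect from 256-bit versus 512-bit vectors on a compute-bound benchmark, all other things being equal.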
> AMD has studied this quite a bit in DOE's FastForward-2 and 
> PathForward. I think Carlos' comment is on track. Having a unit that 
> cannot be fed data quickly enough is pointless. It is application 
> dependent. If your working set fits in cache, then the vector units 
> work well. If not, you have to move data, which stalls compute 
> pipelines. NERSC saw only a 10% increase in performance when moving 
> from low-core-count Xeon CPUs with AVX2 to Knights Landing with many 
> cores and AVX-512, when it should have seen an order-of-magnitude 
> increase. Although Knights Landing had MCDRAM (Micron's not-quite 
> HBM), other constraints limited performance (e.g., lack of enough 
> memory references in flight, coherence traffic).
>
> Fujitsu's ARM64 chip with 512b SVE in Fugaku does much better than 
> Xeon with AVX-512 (or Knights Landing) because of the attached High 
> Bandwidth Memory (HBM) and, I assume, a larger number of memory 
> references in flight. The downside is the lack of memory capacity 
> (only 32 GB per node). This shows that it is possible to get more 
> performance with a CPU with a 512b vector engine. That said, it is not 
> clear that even this CPU design can extract the most from the memory 
> bandwidth. If you look at the increase in memory bandwidth from Summit 
> to Fugaku, one would expect performance on real apps to increase by 
> that amount as well. From the presentations that I have seen, that is 
> not always the case. For some apps, the GPU architecture, with its 
> coherence on demand rather than with every operation, can extract more 
> performance.
>
> AMD will add 512b vectors if/when it makes sense on real apps.
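The point that a vector unit which cannot be fed data quickly enough is pointless can be illustrated with a simple roofline-style bound. This is a minimal sketch, not anything from the thread: the function name, the bandwidth, and the peak figures are all hypothetical, chosen only to show the shape of the argument.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline bound: the lower of the compute peak and what memory can feed."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# Hypothetical node; these are illustrative values, not measurements.
bandwidth_gbs = 200.0   # sustained memory bandwidth, GB/s
narrow_peak = 1000.0    # peak GFLOPS with 256-bit vectors
wide_peak = 2000.0      # same cores and clock with 512-bit vectors

# Arithmetic intensity (flop per byte moved) for three hypothetical kernels.
for intensity in (0.25, 7.5, 16.0):
    narrow = attainable_gflops(narrow_peak, bandwidth_gbs, intensity)
    wide = attainable_gflops(wide_peak, bandwidth_gbs, intensity)
    print(f"{intensity:5.2f} flop/byte: {narrow:7.1f} -> {wide:7.1f} GFLOPS "
          f"(x{wide / narrow:.2f})")
```

Under these assumed numbers, the bandwidth-bound kernel gains nothing from the wider vectors, and only the kernel with high arithmetic intensity (for example, one whose working set fits in cache) sees the full doubling.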