[Beowulf] AMD and AVX512

Jonathan Engwall engwalljonathanthereal at gmail.com
Mon Jun 21 13:20:00 UTC 2021

I have followed this thinking "square peg, round hole."
You have got it again, Joe. Compilers are your problem.

On Sun, Jun 20, 2021, 10:21 AM Joe Landman <joe.landman at gmail.com> wrote:

> (Note:  not disagreeing at all with Gerald, actually agreeing strongly ...
> also, correct address this time!  Thanks Gerald!)
> On 6/19/21 11:49 AM, Gerald Henriksen wrote:
> On Wed, 16 Jun 2021 13:15:40 -0400, you wrote:
> The answer given, and I'm
> not making this up, is that AMD listens to their users and gives the
> users what they want, and right now they're not hearing any demand for
> AVX512.
> More accurately, there is call for it.  From a very small segment of the
> market.  Ones who buy small quantities of processors (under 100k volume per
> purchase).
> That is, not a significant enough portion of the market to make a huge
> difference to the supplier (Intel).
> And more to the point, AI and HPC joining forces has put the spotlight on
> small matrix multiplies, often with lower precision.  I'm not sure (haven't
> read much on it recently) if AVX512 will be enabling/has enabled support
> for bfloat16/FP16 or similar.  These tend to go to GPUs and other
> accelerators.
> Personally, I call BS on that one. I can't imagine anyone in the HPC
> community saying "we'd like processors that offer only 1/2 the floating
> point performance of Intel processors".
> I suspect that is marketing speak, which roughly translates not to
> "no one has asked for it," but rather that requests haven't reached a
> threshold where they are viewed as significant enough.
> This, precisely.  AMD may be losing the AVX512 users to Intel.  But that's
> a small/minuscule fraction of the overall users of its products.  The
> demand for this is quite constrained.  Moreover, there are often
> significant performance consequences to using AVX512 (downclocking,
> pipeline stalls, etc.) whereby the cost of enabling and using it far
> outweighs the benefit of providing it, for the vast, overwhelming portion
> of the market.
> And, as noted above on the accelerator side, this use case (large vectors)
> is better handled by the accelerators.  There is a cost (engineering, code
> design, etc.) to using accelerators as well.  But it won't directly impact
> the CPUs.
> Sure, AMD can offer more cores,
> but with only AVX2, you'd need twice as many cores as Intel processors,
> all other things being equal.
> ... or you run the GPU versions of the code, which are likely getting more
> active developer attention.  AVX512 applies to only a minuscule number of
> codes/problems.  It's really not a panacea.
> More to the point, have you seen how "well" compilers use AVX2/SSE
> registers and do code gen?  It's not pretty in general.  Would you want the
> compilers to purposefully spit out AVX512 code the way they do AVX2/SSE code
> now?  I've found one has to work very hard with intrinsics to get good
> performance out of AVX2, never mind AVX512.
> Put another way, we've been hearing about "smart" compilers for a while,
> and in all honesty, most can barely implement a standard correctly, never
> mind generate reasonably (near) optimal code for the target system.  This
> has been a problem my entire professional life, and while I wish they were
> better, at the end of the day, this is where human intelligence fits into
> the HPC/AI narrative.
> But of course all other things aren't equal.
> AVX512 is a mess.
> Understated, and yes.
> Look at the Wikipedia page(*) and note that AVX512 means different
> things depending on the processor implementing it.
> I made comments previously about which ISA ARM folks were going to write
> to.  That is, different processors, likely implementing different
> instructions, differently ... you won't really have one equally good compiler
> for all these features.  You'll have a compiler that implements common
> denominators reasonably well.  Which mitigates the benefits of the
> ISA/architecture.
> Intel has the same problem with AVX512.  I know, I know ... feature flags
> on the CPU (see last line of lscpu output).  And how often have certain
> (ahem) compilers ignored the flags, and used a different mechanism to
> determine CPU feature support, specifically targeting their competitor
> offerings to force (literally) low performance paths for those CPUs?
> So what does the poor software developer target?
> Lowest common denominator.  Make the code work correctly first.  Then make
> it fast.  If fast is platform specific, ask how often that platform will be
> used.
> Or that it can, for heat reasons, cause CPU frequency reductions,
> meaning real-world performance may not match theoretical - thus it's easier
> to just go with GPUs.
> The result is that most of the world is quite happily (at least for
> now) ignoring AVX512 and going with GPUs as necessary - particularly
> given the convenient libraries that Nvidia offers.
> Yeah ... like it or not, that battle is over (for now).
> [...]
> An argument can be made that calculations that lend themselves to
> vectorization should be done on GPUs instead of the main processors, but
> the last time I checked, GPU jobs are still memory-limited, and
> moving data in and out of GPU memory can still take time, so I can see
> situations where, for large amounts of data, using CPUs would be preferred
> over GPUs.
> AMD's latest chips support PCIe 4 while Intel is still stuck on PCIe 3,
> which may or may not mean a difference.
> It does.  IO and memory bandwidth/latency are very important, and oft
> overlooked aspects of performance.  If you have a choice of doubling IO and
> memory bandwidth at lower latency (usable by everyone) vs adding an AVX512
> unit or two (usable by a small fraction of a percent of all users), which
> would net you, as an architect, the best "bang for the buck"?
> But despite all of the above and the other replies, it is AMD who
> has been winning the HPC contracts of late, not Intel.
> There's a reason for that.  I will admit I have a devil of a time trying
> to convince people that higher clock frequency for computing matters only
> to a small fraction of operations, especially ones waiting on (slow) RAM
> and (slower) IO.  Make the RAM and IO faster (lower latency, higher
> bandwidth), and the system will be far more performant.
> --
> Joe Landman
> e: joe.landman at gmail.com
> t: @hpcjoe
> w: https://scalability.org
> g: https://github.com/joelandman
> l: https://www.linkedin.com/in/joelandman
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
