From harshscience777 at gmail.com Thu Jun 3 12:07:41 2021 From: harshscience777 at gmail.com (harsh_google lastname) Date: Thu, 3 Jun 2021 17:37:41 +0530 Subject: [Beowulf] Theoretical peak performance of DGX A100 Message-ID: I am calculating the theoretical peak (FP64) performance of the Nvidia DGX A100 system. Now, A100 datasheet lists FP64 performance to be 9.7 TFLOPS. Two AMD 7742 CPUs will give 128 cores x 2.25 GHz base clock x 16 FP64 ops / cycle = 4.6 TFLOPS. This gives a total of 82.2 TFLOPS per DGX-A100. Here is my problem. For any system with DGX A100 on top500.org, numbers just don't add up. For eg: Selene has 560 DGX boxes, but its theoretical peak is listed as 79.2 PFLOPS, whereas I expect it should be 46 PFLOPS (ie 82.2 TFLOPS x560). The same is true for any other DGX based system listed on top500. What am I missing here? Thanks! Harsh Hemani -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlos.bederian at unc.edu.ar Thu Jun 3 12:20:17 2021 From: carlos.bederian at unc.edu.ar (=?UTF-8?Q?Carlos_Bederi=C3=A1n?=) Date: Thu, 3 Jun 2021 09:20:17 -0300 Subject: [Beowulf] Theoretical peak performance of DGX A100 In-Reply-To: References: Message-ID: A100 does 19.5 FP64 TFLOPS using tensor cores. On Thu, Jun 3, 2021 at 9:08 AM harsh_google lastname < harshscience777 at gmail.com> wrote: > I am calculating the theoretical peak (FP64) performance of the Nvidia DGX > A100 system. > > Now, A100 datasheet lists FP64 performance to be 9.7 TFLOPS. > Two AMD 7742 CPUs will give 128 cores x 2.25 GHz base clock x 16 FP64 ops > / cycle = 4.6 TFLOPS. > This gives a total of 82.2 TFLOPS per DGX-A100. > > Here is my problem. For any system with DGX A100 on top500.org, numbers > just don't add up. For eg: Selene has 560 DGX boxes, but its theoretical > peak is listed as 79.2 PFLOPS, whereas I expect it should be 46 PFLOPS (ie > 82.2 TFLOPS x560). The same is true for any other DGX based system listed > on top500. What am I missing here? 
> > Thanks! > > Harsh Hemani > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From harshscience777 at gmail.com Thu Jun 3 12:22:30 2021 From: harshscience777 at gmail.com (harsh_google lastname) Date: Thu, 3 Jun 2021 17:52:30 +0530 Subject: [Beowulf] Theoretical peak performance of DGX A100 In-Reply-To: References: Message-ID: But that would bring the theoretical performance to 160 TFLOPS per box, which also doesn't match! On Thu, Jun 3, 2021, 5:50 PM Carlos Bederián wrote: > A100 does 19.5 FP64 TFLOPS using tensor cores. > > On Thu, Jun 3, 2021 at 9:08 AM harsh_google lastname < > harshscience777 at gmail.com> wrote: > >> I am calculating the theoretical peak (FP64) performance of the Nvidia >> DGX A100 system. >> >> Now, A100 datasheet lists FP64 performance to be 9.7 TFLOPS. >> Two AMD 7742 CPUs will give 128 cores x 2.25 GHz base clock x 16 FP64 ops >> / cycle = 4.6 TFLOPS. >> This gives a total of 82.2 TFLOPS per DGX-A100. >> >> Here is my problem. For any system with DGX A100 on top500.org, numbers >> just don't add up. For eg: Selene has 560 DGX boxes, but its theoretical >> peak is listed as 79.2 PFLOPS, whereas I expect it should be 46 PFLOPS (ie >> 82.2 TFLOPS x560). The same is true for any other DGX based system listed >> on top500. What am I missing here? >> >> Thanks! >> >> Harsh Hemani >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf >> > -------------- next part -------------- An HTML attachment was scrubbed...
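The per-box arithmetic being debated here can be laid out explicitly. This sketch uses only figures quoted in the thread (9.7 FP64 TFLOPS per A100 on the vector units, 19.5 TFLOPS via the FP64 tensor cores, 4.6 TFLOPS for the two EPYC 7742s, 8 GPUs per DGX, 560 boxes in Selene); the 79.2 PFLOPS value is the listed top500.org figure, and the point is that neither GPU number reproduces it exactly.

```python
# Theoretical peak (Rpeak) per DGX A100 box, using the figures from this thread.
GPU_FP64 = 9.7          # TFLOPS, A100 FP64 vector units (datasheet)
GPU_FP64_TENSOR = 19.5  # TFLOPS, A100 FP64 tensor cores
CPU_FP64 = 128 * 2.25 * 16 / 1000  # 2x EPYC 7742: cores * base GHz * FP64 ops/cycle -> TFLOPS

def box_peak(gpu_tflops, gpus_per_box=8):
    # One DGX A100: eight GPUs plus the two host CPUs.
    return gpus_per_box * gpu_tflops + CPU_FP64

SELENE_BOXES = 560
print(f"CPU contribution per box: {CPU_FP64:.2f} TFLOPS")
print(f"per box (vector FP64): {box_peak(GPU_FP64):.1f} TFLOPS "
      f"-> {SELENE_BOXES * box_peak(GPU_FP64) / 1000:.1f} PFLOPS for 560 boxes")
print(f"per box (tensor FP64): {box_peak(GPU_FP64_TENSOR):.1f} TFLOPS "
      f"-> {SELENE_BOXES * box_peak(GPU_FP64_TENSOR) / 1000:.1f} PFLOPS for 560 boxes")
print("Selene listed Rpeak: 79.2 PFLOPS  # between the two, matching neither")
```

The vector-unit figure gives about 46 PFLOPS and the tensor-core figure about 90 PFLOPS, bracketing the listed 79.2 PFLOPS; how the Top500 submission counted the hardware is not resolved in this thread.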
URL: From carlos.bederian at unc.edu.ar Thu Jun 3 12:55:36 2021 From: carlos.bederian at unc.edu.ar (=?UTF-8?Q?Carlos_Bederi=C3=A1n?=) Date: Thu, 3 Jun 2021 09:55:36 -0300 Subject: [Beowulf] Theoretical peak performance of DGX A100 In-Reply-To: References: Message-ID: The Top500 has been listing wrong Rpeak values for most clusters for many years now, so I wouldn't dwell on it... Take a Skylake-based cluster like Frontera. Its listed Rpeak is 38,745.9 TFLOPS = 8008 nodes * 56 cores * 32 ops/cycle * 2.7GHz. But 2.7GHz is the regular base frequency, and to do 32 ops/cycle you need to use AVX-512. All-core AVX-512 frequencies for a Xeon 8280 are 1.8GHz base and 2.4GHz turbo, so the Rpeak is off by 11-33%. On Thu, Jun 3, 2021 at 9:22 AM harsh_google lastname < harshscience777 at gmail.com> wrote: > But that would bring the theoretical performance to 160 TFLOPS per box, > which also doesn't match! > > On Thu, Jun 3, 2021, 5:50 PM Carlos Bederián > wrote: > >> A100 does 19.5 FP64 TFLOPS using tensor cores. >> >> On Thu, Jun 3, 2021 at 9:08 AM harsh_google lastname < >> harshscience777 at gmail.com> wrote: >> >>> I am calculating the theoretical peak (FP64) performance of the Nvidia >>> DGX A100 system. >>> >>> Now, A100 datasheet lists FP64 performance to be 9.7 TFLOPS. >>> Two AMD 7742 CPUs will give 128 cores x 2.25 GHz base clock x 16 FP64 >>> ops / cycle = 4.6 TFLOPS. >>> This gives a total of 82.2 TFLOPS per DGX-A100. >>> >>> Here is my problem. For any system with DGX A100 on top500.org, numbers >>> just don't add up. For eg: Selene has 560 DGX boxes, but its theoretical >>> peak is listed as 79.2 PFLOPS, whereas I expect it should be 46 PFLOPS (ie >>> 82.2 TFLOPS x560). The same is true for any other DGX based system listed >>> on top500. What am I missing here? >>> >>> Thanks!
>>> >>> Harsh Hemani >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >>> To change your subscription (digest mode or unsubscribe) visit >>> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From harshscience777 at gmail.com Thu Jun 3 13:29:50 2021 From: harshscience777 at gmail.com (harsh_google lastname) Date: Thu, 3 Jun 2021 18:59:50 +0530 Subject: [Beowulf] Theoretical peak performance of DGX A100 In-Reply-To: References: Message-ID: Cool, thanks! On Thu, Jun 3, 2021, 6:25 PM Carlos Bederi?n wrote: > The Top500 has been listing wrong Rpeak values for most clusters for many > years now, so I wouldn't dwell on it... > > Take a Skylake-based cluster like Frontera. Its listed Rpeak is 38,745.9 > TFLOPS = 8008 nodes * 56 cores * 32 ops/cycle * 2.7GHz. > But 2.7GHz is the regular base frequency, and to do 32 ops/cycle you need > to use AVX-512. All-core AVX-512 frequencies for a Xeon 8280 are 1.8GHz > base and 2.4GHz turbo, so the Rpeak is off by 12-33%. > > On Thu, Jun 3, 2021 at 9:22 AM harsh_google lastname < > harshscience777 at gmail.com> wrote: > >> But that wouls bring the theoretical performance to 160 TFLOPS per box, >> which also doesn't match! >> >> On Thu, Jun 3, 2021, 5:50 PM Carlos Bederi?n >> wrote: >> >>> A100 does 19.5 FP64 TFLOPS using tensor cores. >>> >>> On Thu, Jun 3, 2021 at 9:08 AM harsh_google lastname < >>> harshscience777 at gmail.com> wrote: >>> >>>> I am calculating the theoretical peak (FP64) performance of the Nvidia >>>> DGX A100 system. >>>> >>>> Now, A100 datasheet lists FP64 performance to be 9.7 TFLOPS. >>>> Two AMD 7742 CPUs will give 128 cores x 2.25 GHz base clock x 16 FP64 >>>> ops / cycle = 4.6 TFLOPS. >>>> This gives a total of 82.2 TFLOPS per DGX-A100. >>>> >>>> Here is my problem. 
For any system with DGX A100 on top500.org, >>>> numbers just don't add up. For eg: Selene has 560 DGX boxes, but its >>>> theoretical peak is listed as 79.2 PFLOPS, whereas I expect it should be 46 >>>> PFLOPS (ie 82.2 TFLOPS x560). The same is true for any other DGX based >>>> system listed on top500. What am I missing here? >>>> >>>> Thanks! >>>> >>>> Harsh Hemani >>>> _______________________________________________ >>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >>>> Computing >>>> To change your subscription (digest mode or unsubscribe) visit >>>> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf >>>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdidomenico4 at gmail.com Mon Jun 14 16:38:50 2021 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Mon, 14 Jun 2021 12:38:50 -0400 Subject: [Beowulf] odd vlan issue Message-ID: i got roped into troubleshooting an odd network issue. we have a mix of cisco (mostly nexus) gear spread over our facility. on one particular vlan it's operating as if it's a hub instead of a switch. has anyone seen this before? i bounced around the net, and my issue appears to be "unicast flooding", but the chief recommended fix seems to be adjusting the ARP/CAM timeouts, and that hasn't worked. i'm sure this is a loaded question given none of you can see the network, but i'm grasping at straws to come up with a reason why this is happening. what's weird is we have two dozen other vlans, which aren't affected. i cannot locate a difference in the config on the switches. any thoughts are appreciated at this point. From lindahl at pbm.com Wed Jun 16 04:48:21 2021 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 16 Jun 2021 04:48:21 +0000 Subject: [Beowulf] odd vlan issue In-Reply-To: References: Message-ID: <20210616044821.GA10207@rd.bx9.net> On Mon, Jun 14, 2021 at 12:38:50PM -0400, Michael Di Domenico wrote: > i got roped into troubleshooting an odd network issue.
we have a mix > of cisco (mostly nexus) gear spread over our facility. on one > particular vlan it's operating as if it's a hub instead of switch. I have run into this situation when I have servers that have incoming UDP traffic and never talk or do TCP. The switches have no idea where the server is, so they broadcast all of the incoming packets. An ARP reply or connecting with TCP tells the switch which port to use. -- greg From mdidomenico4 at gmail.com Wed Jun 16 11:38:27 2021 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed, 16 Jun 2021 07:38:27 -0400 Subject: [Beowulf] odd vlan issue In-Reply-To: <20210616044821.GA10207@rd.bx9.net> References: <20210616044821.GA10207@rd.bx9.net> Message-ID: thanks after two days of digging, i think i finally figured out that we have a layer 2 routing problem. i'm not the network guy so i'm not digging into it deeper, but it appears that there are either malfunctioning LACP trunks or more likely a misconfigured VPC connection inside the menagerie of switches the network team built. the network is too complicated to describe here, but the base issue is that there are two switches 'supposedly' operating jointly, but don't seem to be sharing their CAM/ARP tables correctly. for whatever reason packets get duped to the switch that does not have the destination machine and since there's no arp/cam entry the switch just blasts the packet out all the ports. its not clear why the packets are being sent to layer 2 devices where the device doesn't exist, but it's clear there's something broken in the spanning tree database. it's also not clear why it only affects one of the vlans and not all. but again, not the network guy... and for once it is the network... :) On Wed, Jun 16, 2021 at 12:48 AM Greg Lindahl wrote: > > On Mon, Jun 14, 2021 at 12:38:50PM -0400, Michael Di Domenico wrote: > > i got roped into troubleshooting an odd network issue. we have a mix > > of cisco (mostly nexus) gear spread over our facility. 
on one > > particular vlan it's operating as if it's a hub instead of switch. > > I have run into this situation when I have servers that have incoming > UDP traffic and never talk or do TCP. The switches have no idea where the > server is, so they broadcast all of the incoming packets. > > An ARP reply or connecting with TCP tells the switch which port to use. > > -- greg > > From rgt at wi.mit.edu Wed Jun 16 13:25:18 2021 From: rgt at wi.mit.edu (Robert Taylor) Date: Wed, 16 Jun 2021 09:25:18 -0400 Subject: [Beowulf] odd vlan issue In-Reply-To: References: <20210616044821.GA10207@rd.bx9.net> Message-ID: I've seen it happen with udp; in my case it was a syslog server that hardly ever "spoke", so eventually its MAC address disappears from the CAM table long before it leaves the ARP table, and the UDP packets get flooded hoping to find the server. I've also seen this happen in an HA environment, where it's possible that traffic can take an asymmetric path. It happens enough that Cisco wrote an article about it years ago. https://www.cisco.com/c/en/us/support/docs/switches/catalyst-6000-series-switches/23563-143.html Hope this helps. rgt On Wed, Jun 16, 2021 at 7:38 AM Michael Di Domenico wrote: > thanks after two days of digging, i think i finally figured out that > we have a layer 2 routing problem. i'm not the network guy so i'm not > digging into it deeper, but it appears that there are either > malfunctioning LACP trunks or more likely a misconfigured VPC > connection inside the menagerie of switches the network team built. > > the network is too complicated to describe here, but the base issue is > that there are two switches 'supposedly' operating jointly, but don't > seem to be sharing their CAM/ARP tables correctly. for whatever > reason packets get duped to the switch that does not have the > destination machine and since there's no arp/cam entry the switch just > blasts the packet out all the ports.
> > its not clear why the packets are being sent to layer 2 devices where > the device doesn't exist, but it's clear there's something broken in > the spanning tree database. it's also not clear why it only affects > one of the vlans and not all. > > but again, not the network guy... and for once it is the network... :) > > > > > > > > > > On Wed, Jun 16, 2021 at 12:48 AM Greg Lindahl wrote: > > > > On Mon, Jun 14, 2021 at 12:38:50PM -0400, Michael Di Domenico wrote: > > > i got roped into troubleshooting an odd network issue. we have a mix > > > of cisco (mostly nexus) gear spread over our facility. on one > > > particular vlan it's operating as if it's a hub instead of switch. > > > > I have run into this situation when I have servers that have incoming > > UDP traffic and never talk or do TCP. The switches have no idea where the > > server is, so they broadcast all of the incoming packets. > > > > An ARP reply or connecting with TCP tells the switch which port to use. > > > > -- greg > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdidomenico4 at gmail.com Wed Jun 16 13:41:24 2021 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed, 16 Jun 2021 09:41:24 -0400 Subject: [Beowulf] odd vlan issue In-Reply-To: References: <20210616044821.GA10207@rd.bx9.net> Message-ID: On Wed, Jun 16, 2021 at 9:25 AM Robert Taylor wrote: > > I?ve seen it happen with udp, in my case it was a syslog server, that hardly ever ?spoke? so eventually it?s MAC address disappears from the Cam table long before it leaves the arp table and the UDP packets get flooded hoping to find the server. 
> > I've also seen this happen in an HA environment, where it's possible that traffic can take an asymmetric path. It happens enough that Cisco wrote an article about it years ago. > > https://www.cisco.com/c/en/us/support/docs/switches/catalyst-6000-series-switches/23563-143.html the udp scenario is definitely not what's happening. but the unicast flooding appears to be. the article linked is what led me to find the initial problem and prove the theory yesterday. the messed up asymmetric nature of our network is where the problem lies. but that's what happens when you give network guys an unlimited budget and they buy more/larger network gear than they can handle. From pbisbal at pppl.gov Wed Jun 16 17:15:40 2021 From: pbisbal at pppl.gov (Prentice Bisbal) Date: Wed, 16 Jun 2021 13:15:40 -0400 Subject: [Beowulf] AMD and AVX512 Message-ID: Did anyone else attend this webinar panel discussion with AMD hosted by HPCWire yesterday? It was titled "AMD HPC Solutions: Enabling Your Success in HPC" https://www.hpcwire.com/amd-hpc-solutions-enabling-your-success-in-hpc/ I attended it, and noticed there was no mention of AMD supporting AVX512, so during the question and answer portion of the program, I asked when AMD processors will support AVX512. The answer given, and I'm not making this up, is that AMD listens to their users and gives the users what they want, and right now they're not hearing any demand for AVX512. Personally, I call BS on that one. I can't imagine anyone in the HPC community saying "we'd like processors that offer only 1/2 the floating point performance of Intel processors". Sure, AMD can offer more cores, but with only AVX2, you'd need twice as many cores as Intel processors, all other things being equal. Last fall I evaluated potential new cluster nodes for a large cluster purchase using the HPL benchmark. I compared a server with dual AMD EPYC 7H12 processors (128 cores) to a server with quad Intel Xeon 8268 processors (96 cores).
I measured 5,389 GFLOPS for the Xeon 8268, and only 3,446 GFLOPS for the AMD 7H12. That's a LINPACK score that is only 64% of the Xeon 8268 system's, despite the AMD system having 33% more cores. From what I've heard, the AMD processors run much hotter than the Intel processors, too, so I imagine a FLOPS/Watt comparison would be even less favorable to AMD. An argument can be made that calculations that lend themselves to vectorization should be done on GPUs instead of the main processors, but the last time I checked, GPU jobs are still memory limited, and moving data in and out of GPU memory can still take time, so I can see situations where, for large amounts of data, using CPUs would be preferred over GPUs. Your thoughts? -- Prentice From carlos.bederian at unc.edu.ar Wed Jun 16 17:52:59 2021 From: carlos.bederian at unc.edu.ar (=?UTF-8?Q?Carlos_Bederi=C3=A1n?=) Date: Wed, 16 Jun 2021 14:52:59 -0300 Subject: [Beowulf] AMD and AVX512 In-Reply-To: References: Message-ID: On Wed, Jun 16, 2021 at 2:16 PM Prentice Bisbal via Beowulf < beowulf at beowulf.org> wrote: > Last fall I evaluated potential new cluster nodes for a large cluster > purchase using the HPL benchmark. I compared a server with dual AMD EPYC > 7H12 processors (128 cores) to a server with quad Intel Xeon 8268 > processors (96 cores). I measured 5,389 GFLOPS for the Xeon 8268, and > only 3,446 GFLOPS for the AMD 7H12. That's a LINPACK score that is only > 64% of the Xeon 8268 system's, despite having 33% more cores. > Most of the workloads we see on our clusters have arithmetic intensities much lower than LINPACK's, so all that extra compute gets starved by lack of memory bandwidth. -------------- next part -------------- An HTML attachment was scrubbed...
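Carlos's point about arithmetic intensity is the standard roofline argument: attainable performance is bounded by min(peak FLOPS, arithmetic intensity x memory bandwidth). The sketch below uses round, hypothetical numbers (a 4 TFLOPS node with 300 GB/s of memory bandwidth) purely for illustration; it is not a model of any machine named in the thread.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Naive roofline model: compute is only useful if memory can feed it."""
    return min(peak_gflops, flops_per_byte * bandwidth_gbs)

PEAK = 4000.0  # GFLOPS, hypothetical dual-socket node
BW = 300.0     # GB/s, hypothetical aggregate memory bandwidth

for ai in (0.1, 1.0, 10.0, 100.0):
    g = attainable_gflops(PEAK, BW, ai)
    print(f"AI={ai:>6.1f} FLOP/byte -> {g:8.1f} GFLOPS ({100 * g / PEAK:5.1f}% of peak)")
# Below the ridge point (PEAK/BW ~ 13.3 FLOP/byte here), doubling the vector
# width raises PEAK but leaves attainable GFLOPS completely unchanged.
```

HPL sits far to the right of the ridge point, which is why it rewards wide vector units; most production workloads sit to the left, which is Carlos's point.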
URL: From mdidomenico4 at gmail.com Wed Jun 16 17:53:44 2021 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed, 16 Jun 2021 13:53:44 -0400 Subject: [Beowulf] AMD and AVX512 In-Reply-To: References: Message-ID: AMD's argument is a little unsalesmanlike, but i'd buy it as an explanation. avx512 uptake isn't as profound as intel would lead you to believe, and the push to GPUs for vectors will probably remove the need for most of these high end vector units sooner or later (but that's my opinion, some chip changes need to happen first) i also think your hpl numbers on the amd chip are low, you should be >4000 which would put you closer to intel, but intel will still edge out just because it has a higher base clock. On Wed, Jun 16, 2021 at 1:15 PM Prentice Bisbal via Beowulf wrote: > > Did anyone else attend this webinar panel discussion with AMD hosted by > HPCWire yesterday? It was titled "AMD HPC Solutions: Enabling Your > Success in HPC" > > https://www.hpcwire.com/amd-hpc-solutions-enabling-your-success-in-hpc/ > > I attended it, and noticed there was no mention of AMD supporting > AVX512, so during the question and answer portion of the program, I > asked when AMD processors will support AVX512. The answer given, and I'm > not making this up, is that AMD listens to their users and gives the > users what they want, and right now they're not hearing any demand for > AVX512. > > Personally, I call BS on that one. I can't imagine anyone in the HPC > community saying "we'd like processors that offer only 1/2 the floating > point performance of Intel processors". Sure, AMD can offer more cores, > but with only AVX2, you'd need twice as many cores as Intel processors, > all other things being equal. > > Last fall I evaluated potential new cluster nodes for a large cluster > purchase using the HPL benchmark. I compared a server with dual AMD EPYC > 7H12 processors (128 cores) to a server with quad Intel Xeon 8268 > processors (96 cores).
I measured 5,389 GFLOPS for the Xeon 8268, and > only 3,446.00 GFLOPS for the AMD 7H12. That's LINPACK score that only > 64% of the Xeon 8268 system, despite having 33% more cores. > > From what I've heard, the AMD processors run much hotter than the Intel > processors, too, so I imagine a FLOPS/Watt comparison would be even less > favorable to AMD. > > An argument can be made that for calculations that lend themselves to > vectorization should be done on GPUs, instead of the main processors but > the last time I checked, GPU jobs are still memory is limited, and > moving data in and out of GPU memory can still take time, so I can see > situations where for large amounts of data using CPUs would be preferred > over GPUs. > > Your thoughts? > > -- > Prentice > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf From e.scott.atchley at gmail.com Wed Jun 16 18:23:46 2021 From: e.scott.atchley at gmail.com (Scott Atchley) Date: Wed, 16 Jun 2021 14:23:46 -0400 Subject: [Beowulf] AMD and AVX512 In-Reply-To: References: Message-ID: On Wed, Jun 16, 2021 at 1:15 PM Prentice Bisbal via Beowulf < beowulf at beowulf.org> wrote: > Did anyone else attend this webinar panel discussion with AMD hosted by > HPCWire yesterday? It was titled "AMD HPC Solutions: Enabling Your > Success in HPC" > > https://www.hpcwire.com/amd-hpc-solutions-enabling-your-success-in-hpc/ > > I attended it, and noticed there was no mention of AMD supporting > AVX512, so during the question and answer portion of the program, I > asked when AMD processors will support AVX512. The answer given, and I'm > not making this up, is that AMD listens to their users and gives the > users what they want, and right now they're not hearing any demand for > AVX512. > > Personally, I call BS on that one. 
I can't imagine anyone in the HPC > community saying "we'd like processors that offer only 1/2 the floating > point performance of Intel processors". Sure, AMD can offer more cores, > but with only AVX2, you'd need twice as many cores as Intel processors, > all other things being equal. > > Last fall I evaluated potential new cluster nodes for a large cluster > purchase using the HPL benchmark. I compared a server with dual AMD EPYC > 7H12 processors (128) cores to a server with quad Intel Xeon 8268 > processors (96 cores). I measured 5,389 GFLOPS for the Xeon 8268, and > only 3,446.00 GFLOPS for the AMD 7H12. That's LINPACK score that only > 64% of the Xeon 8268 system, despite having 33% more cores. > > From what I've heard, the AMD processors run much hotter than the Intel > processors, too, so I imagine a FLOPS/Watt comparison would be even less > favorable to AMD. > > An argument can be made that for calculations that lend themselves to > vectorization should be done on GPUs, instead of the main processors but > the last time I checked, GPU jobs are still memory is limited, and > moving data in and out of GPU memory can still take time, so I can see > situations where for large amounts of data using CPUs would be preferred > over GPUs. > > Your thoughts? > > -- > Prentice > AMD has studied this quite a bit in DOE's FastForward-2 and PathForward. I think Carlos' comment is on track. Having a unit that cannot be fed data quick enough is pointless. It is application dependent. If your working set fits in cache, then the vector units work well. If not, you have to move data which stalls compute pipelines. NERSC saw only a 10% increase in performance when moving from low core count Xeon CPUs with AVX2 to Knights Landing with many cores and AVX-512 when it should have seen an order of magnitude increase. 
Although Knights Landing had MCDRAM (Micron's not-quite HBM), other constraints limited performance (e.g., lack of enough memory references in flight, coherence traffic). Fujitsu's ARM64 chip with 512b SVE in Fugaku does much better than Xeon with AVX-512 (or Knights Landing) because of the High Bandwidth Memory (HBM) attached and I assume a larger number of memory references in flight. The downside is the lack of memory capacity (only 32 GB per node). This shows that it is possible to get more performance with a CPU with a 512b vector engine. That said, it is not clear that even this CPU design can extract the most from the memory bandwidth. If you look at the increase in memory bandwidth from Summit to Fugaku, one would expect performance on real apps to increase by that amount as well. From the presentations that I have seen, that is not always the case. For some apps, the GPU architecture, with its coherence on demand rather than with every operation, can extract more performance. AMD will add 512b vectors if/when it makes sense on real apps. -------------- next part -------------- An HTML attachment was scrubbed... URL: From pbisbal at pppl.gov Wed Jun 16 20:39:39 2021 From: pbisbal at pppl.gov (Prentice Bisbal) Date: Wed, 16 Jun 2021 16:39:39 -0400 Subject: [Beowulf] [External] Re: AMD and AVX512 In-Reply-To: References: Message-ID: <24cf21a9-57de-b720-18b1-82da7c75e53b@pppl.gov> Scott (and Michael and Carlos), Thanks for your excellent feedback. That's the kind of enlightening feedback I was looking for. Interesting that the HBM on Fugaku exceeds the needs of the processor. Prentice On 6/16/21 2:23 PM, Scott Atchley wrote: > On Wed, Jun 16, 2021 at 1:15 PM Prentice Bisbal via Beowulf > > wrote: > > Did anyone else attend this webinar panel discussion with AMD > hosted by > HPCWire yesterday? 
It was titled "AMD HPC Solutions: Enabling Your > Success in HPC" > > https://www.hpcwire.com/amd-hpc-solutions-enabling-your-success-in-hpc/ > > > I attended it, and noticed there was no mention of AMD supporting > AVX512, so during the question and answer portion of the program, I > asked when AMD processors will support AVX512. The answer given, > and I'm > not making this up, is that AMD listens to their users and gives the > users what they want, and right now they're not hearing any demand > for > AVX512. > > Personally, I call BS on that one. I can't imagine anyone in the HPC > community saying "we'd like processors that offer only 1/2 the > floating > point performance of Intel processors". Sure, AMD can offer more > cores, > but with only AVX2, you'd need twice as many cores as Intel > processors, > all other things being equal. > > Last fall I evaluated potential new cluster nodes for a large cluster > purchase using the HPL benchmark. I compared a server with dual > AMD EPYC > 7H12 processors (128) cores to a server with quad Intel Xeon 8268 > processors (96 cores). I measured 5,389 GFLOPS for the Xeon 8268, and > only 3,446.00 GFLOPS for the AMD 7H12. That's LINPACK score that only > 64% of the Xeon 8268 system, despite having 33% more cores. > > ?From what I've heard, the AMD processors run much hotter than the > Intel > processors, too, so I imagine a FLOPS/Watt comparison would be > even less > favorable to AMD. > > An argument can be made that for calculations that lend themselves to > vectorization should be done on GPUs, instead of the main > processors but > the last time I checked, GPU jobs are still memory is limited, and > moving data in and out of GPU memory can still take time, so I can > see > situations where for large amounts of data using CPUs would be > preferred > over GPUs. > > Your thoughts? > > -- > Prentice > > > AMD has studied this quite a bit in DOE's FastForward-2 and > PathForward. I think Carlos' comment is on track. 
Having a unit that > cannot be fed data quick enough is pointless. It is application > dependent. If your working set fits in cache, then the vector units > work well. If not, you have to move data which stalls compute > pipelines. NERSC saw only a 10% increase in performance when moving > from low core count Xeon CPUs with AVX2 to Knights Landing with many > cores and AVX-512 when it should have seen an order of magnitude > increase. Although Knights Landing had MCDRAM (Micron's not-quite > HBM), other constraints limited performance (e.g., lack of enough > memory references in flight, coherence traffic). > > Fujitsu's ARM64 chip with 512b SVE in Fugaku does much better than > Xeon with AVX-512 (or Knights Landing) because of the High Bandwidth > Memory (HBM) attached and I assume a larger number of memory > references in flight. The downside is the lack of memory capacity > (only 32 GB per node). This shows that it is possible to get more > performance with a CPU with a 512b vector engine. That said, it is not > clear that even this CPU design can extract the most from the memory > bandwidth. If you look at the increase in memory bandwidth from Summit > to Fugaku, one would expect performance on real apps to increase by > that amount as well. From the presentations that I have seen, that is > not always the case. For some apps, the GPU architecture, with its > coherence on demand rather than with every operation, can extract more > performance. > > AMD will add 512b vectors if/when it makes sense on real apps. -------------- next part -------------- An HTML attachment was scrubbed... 
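The HPL numbers discussed in this thread can be sanity-checked against nominal peaks computed as cores x clock x FP64 ops/cycle. This is a sketch, not a benchmark: the ops/cycle values (32 for AVX-512 with two FMA units, 16 for AVX2 on Zen 2) and the base clocks (2.9 GHz for the Xeon 8268, 2.6 GHz for the EPYC 7H12) are assumed spec-sheet figures, not numbers taken from the thread.

```python
def node_peak_gflops(cores, ghz, fp64_ops_per_cycle):
    # Nominal Rpeak: ignores AVX-512 downclocking, turbo, and memory limits.
    return cores * ghz * fp64_ops_per_cycle

# Assumed spec values (not from the thread): base clocks and FP64 ops/cycle.
xeon = node_peak_gflops(96, 2.9, 32)   # quad Xeon 8268; AVX-512: 2 FMA units x 8 lanes x 2 ops
epyc = node_peak_gflops(128, 2.6, 16)  # dual EPYC 7H12; AVX2: 2 FMA units x 4 lanes x 2 ops

print(f"Xeon 8268 node nominal peak: {xeon:7.1f} GFLOPS; HPL 5,389 -> {100 * 5389 / xeon:.0f}% efficiency")
print(f"EPYC 7H12 node nominal peak: {epyc:7.1f} GFLOPS; HPL 3,446 -> {100 * 3446 / epyc:.0f}% efficiency")
# Against nominal (non-downclocked) peaks the two systems land in a broadly
# similar efficiency range, since real AVX-512 all-core clocks sit well below 2.9 GHz.
```

On this arithmetic the measured gap roughly tracks the nominal peak ratio, which is consistent with Michael's view that the published AMD figure looks low for the part rather than the architecture being at fault.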
URL: From pbisbal at pppl.gov Wed Jun 16 20:46:49 2021 From: pbisbal at pppl.gov (Prentice Bisbal) Date: Wed, 16 Jun 2021 16:46:49 -0400 Subject: [Beowulf] [External] Re: AMD and AVX512 In-Reply-To: References: Message-ID: <6cf3edb7-fe83-5c34-0101-64a3eec70a62@pppl.gov> i also think your hpl numbers on the amd chip are low, you should be > 4000 which would put you closer to intel, but intel will still edge out just because it has a higher base clock. I think I could probably get better numbers out of the AMD chip now, too. I've done some testing since then; compiler and library choice can make a noticeable difference for the AMD processors. Unfortunately, I no longer have access to that 7H12 system to test again. Prentice On 6/16/21 1:53 PM, Michael Di Domenico wrote: > AMD's argument is a little unsalesmanlike, but i'd buy it as an > explanation. avx512 uptake isn't as profound as intel would lead you > to believe and the push to GPUs for vectors will probably remove the > need for most of these high end vector units sooner or later (but that's my > opinion, some chip changes need to happen first) > > i also think your hpl numbers on the amd chip are low, you should be >> 4000 which would put you closer to intel, but intel will still edge > out just because it has a higher base clock. > > On Wed, Jun 16, 2021 at 1:15 PM Prentice Bisbal via Beowulf > wrote: >> Did anyone else attend this webinar panel discussion with AMD hosted by >> HPCWire yesterday? It was titled "AMD HPC Solutions: Enabling Your >> Success in HPC" >> >> https://www.hpcwire.com/amd-hpc-solutions-enabling-your-success-in-hpc/ >> >> I attended it, and noticed there was no mention of AMD supporting >> AVX512, so during the question and answer portion of the program, I >> asked when AMD processors will support AVX512. The answer given, and I'm >> not making this up, is that AMD listens to their users and gives the >> users what they want, and right now they're not hearing any demand for >> AVX512.
>> >> Personally, I call BS on that one. I can't imagine anyone in the HPC >> community saying "we'd like processors that offer only 1/2 the floating >> point performance of Intel processors". Sure, AMD can offer more cores, >> but with only AVX2, you'd need twice as many cores as Intel processors, >> all other things being equal. >> >> Last fall I evaluated potential new cluster nodes for a large cluster >> purchase using the HPL benchmark. I compared a server with dual AMD EPYC >> 7H12 processors (128 cores) to a server with quad Intel Xeon 8268 >> processors (96 cores). I measured 5,389 GFLOPS for the Xeon 8268, and >> only 3,446 GFLOPS for the AMD 7H12. That's a LINPACK score that is only >> 64% of the Xeon 8268 system's, despite having 33% more cores. >> >> From what I've heard, the AMD processors run much hotter than the Intel >> processors, too, so I imagine a FLOPS/Watt comparison would be even less >> favorable to AMD. >> >> An argument can be made that calculations that lend themselves to >> vectorization should be done on GPUs instead of the main processors, but >> the last time I checked, GPU jobs are still memory-limited, and >> moving data in and out of GPU memory can still take time, so I can see >> situations where for large amounts of data using CPUs would be preferred >> over GPUs. >> >> Your thoughts? 
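The peak numbers behind this comparison are easy to reconstruct. A back-of-envelope sketch, assuming spec-sheet base clocks and per-core FLOPs per cycle (16 for AVX2 with two FMA units, 32 for AVX-512; turbo and AVX downclocking are ignored, so treat the results as rough):

```python
# Theoretical FP64 peak: sockets * cores * GHz * FLOPs/cycle/core.
def peak_gflops(sockets, cores, ghz, flops_per_cycle):
    return sockets * cores * ghz * flops_per_cycle

amd_peak = peak_gflops(2, 64, 2.6, 16)    # dual EPYC 7H12 -> ~5325 GFLOPS
intel_peak = peak_gflops(4, 24, 2.9, 32)  # quad Xeon 8268 -> ~8909 GFLOPS

# Measured HPL numbers quoted in the thread:
amd_hpl, intel_hpl = 3446.0, 5389.0
print(f"AMD  HPL efficiency: {amd_hpl / amd_peak:.0%}")      # ~65%
print(f"Intel HPL efficiency: {intel_hpl / intel_peak:.0%}")  # ~60%
```

On these assumptions both runs sit below the HPL efficiency a well-tuned system usually reaches, which supports the suggestion elsewhere in the thread that the AMD number should be above 4000 GFLOPS with better BLAS and compiler choices.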
>> >> -- >> Prentice >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf From sdm900 at gmail.com Thu Jun 17 02:53:04 2021 From: sdm900 at gmail.com (Stu Midgley) Date: Thu, 17 Jun 2021 10:53:04 +0800 Subject: [Beowulf] AMD and AVX512 In-Reply-To: References: Message-ID: I've told AMD brass that we need AVX512 many many times. I've also told them that we need more memory bandwidth and that adding dimms is not the answer. We don't need more capacity - just more bandwidth. We have a stack load of KNL systems and have invested heavily in AVX512 (writing with intrinsics) and shifting those codes away from it would be considerable work. Bring on Sapphire Rapids :) On Thu, Jun 17, 2021 at 1:16 AM Prentice Bisbal via Beowulf < beowulf at beowulf.org> wrote: > Did anyone else attend this webinar panel discussion with AMD hosted by > HPCWire yesterday? It was titled "AMD HPC Solutions: Enabling Your > Success in HPC" > > https://www.hpcwire.com/amd-hpc-solutions-enabling-your-success-in-hpc/ > > I attended it, and noticed there was no mention of AMD supporting > AVX512, so during the question and answer portion of the program, I > asked when AMD processors will support AVX512. The answer given, and I'm > not making this up, is that AMD listens to their users and gives the > users what they want, and right now they're not hearing any demand for > AVX512. > > Personally, I call BS on that one. 
I can't imagine anyone in the HPC > community saying "we'd like processors that offer only 1/2 the floating > point performance of Intel processors". Sure, AMD can offer more cores, > but with only AVX2, you'd need twice as many cores as Intel processors, > all other things being equal. > > Last fall I evaluated potential new cluster nodes for a large cluster > purchase using the HPL benchmark. I compared a server with dual AMD EPYC > 7H12 processors (128) cores to a server with quad Intel Xeon 8268 > processors (96 cores). I measured 5,389 GFLOPS for the Xeon 8268, and > only 3,446.00 GFLOPS for the AMD 7H12. That's LINPACK score that only > 64% of the Xeon 8268 system, despite having 33% more cores. > > From what I've heard, the AMD processors run much hotter than the Intel > processors, too, so I imagine a FLOPS/Watt comparison would be even less > favorable to AMD. > > An argument can be made that for calculations that lend themselves to > vectorization should be done on GPUs, instead of the main processors but > the last time I checked, GPU jobs are still memory is limited, and > moving data in and out of GPU memory can still take time, so I can see > situations where for large amounts of data using CPUs would be preferred > over GPUs. > > Your thoughts? > > -- > Prentice > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -- Dr Stuart Midgley sdm900 at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ghenriks at gmail.com Sat Jun 19 15:49:06 2021 From: ghenriks at gmail.com (Gerald Henriksen) Date: Sat, 19 Jun 2021 11:49:06 -0400 Subject: [Beowulf] AMD and AVX512 In-Reply-To: References: Message-ID: On Wed, 16 Jun 2021 13:15:40 -0400, you wrote: >The answer given, and I'm >not making this up, is that AMD listens to their users and gives the >users what they want, and right now they're not hearing any demand for >AVX512. > >Personally, I call BS on that one. I can't imagine anyone in the HPC >community saying "we'd like processors that offer only 1/2 the floating >point performance of Intel processors". I suspect that is marketing speak, which roughly translates not to "no one has asked for it", but rather that requests haven't reached a threshold where they are viewed as significant enough. > Sure, AMD can offer more cores, >but with only AVX2, you'd need twice as many cores as Intel processors, >all other things being equal. But of course all other things aren't equal. AVX512 is a mess. Look at the Wikipedia page(*) and note that AVX512 means different things depending on the processor implementing it. So what does the poor software developer target? Or that, for heat reasons, it can cause CPU frequency reductions, meaning real world performance may not match theoretical - thus easier to just go with GPU's. The result is that most of the world is quite happily (at least for now) ignoring AVX512 and going with GPU's as necessary - particularly given the convenient libraries that Nvidia offers. > I compared a server with dual AMD EPYC >7H12 processors (128) > quad Intel Xeon 8268 >processors (96 cores). > From what I've heard, the AMD processors run much hotter than the Intel >processors, too, so I imagine a FLOPS/Watt comparison would be even less >favorable to AMD. Spec sheets would indicate AMD runs hotter, but then again you benchmarked twice as many Intel processors. 
So, per spec sheets for your processors above: AMD - 280W - 2 processors means system 560W Intel - 205W - 4 processors means system 820W (and then you also need to factor in purchase price). >An argument can be made that for calculations that lend themselves to >vectorization should be done on GPUs, instead of the main processors but >the last time I checked, GPU jobs are still memory is limited, and >moving data in and out of GPU memory can still take time, so I can see >situations where for large amounts of data using CPUs would be preferred >over GPUs. AMD's latest chips support PCI 4 while Intel is still stuck on PCI 3, which may or may not make a difference. But despite all of the above and the other replies, it is AMD who has been winning the HPC contracts of late, not Intel. * - https://en.wikipedia.org/wiki/Advanced_Vector_Extensions From tjrc at sanger.ac.uk Sun Jun 20 01:04:03 2021 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Sun, 20 Jun 2021 01:04:03 +0000 Subject: [Beowulf] AMD and AVX512 [EXT] In-Reply-To: References: Message-ID: I think that's a major, important point. Even if the whole of the HPC market were clamouring for it (which they're not, judging by this discussion) that's still a very small proportion of the worldwide CPU market. We have to remember that we in the HPC community are a niche market. I recall at SC a couple of years ago someone from Intel pointing out that mobile devices and IoT were what was driving IT technology; the volume dwarfs everything else. Hence the drive to NVRAM - not to make things faster for HPC (although that was the benefit being presented through that talk), but the fundamental driver was to increase phone battery life. 
Tim -- Tim Cutts Head of Scientific Computing Wellcome Sanger Institute On 19 Jun 2021, at 16:49, Gerald Henriksen > wrote: I suspect that is marketing speak, which roughly translates to not that no one has asked for it, but rather requests haven't reached a threshold where the requests are viewed as significant enough. -- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hearnsj at gmail.com Sun Jun 20 05:38:06 2021 From: hearnsj at gmail.com (John Hearns) Date: Sun, 20 Jun 2021 06:38:06 +0100 Subject: [Beowulf] AMD and AVX512 In-Reply-To: References: Message-ID: Regarding benchmarking real world codes on AMD , every year Martyn Guest presents a comprehensive set of benchmark studies to the UK Computing Insights Conference. I suggest a Sunday afternoon with the beverage of your choice is a good time to settle down and take time to read these or watch the presentation. 2019 https://www.scd.stfc.ac.uk/SiteAssets/Pages/CIUK-2019-Presentations/Martyn_Guest.pdf 2020 Video session https://ukri.zoom.us/rec/share/ajvsxdJ8RM1wzpJtnlcypw4OyrZ9J27nqsfAG7eW49Ehq_Z5igat_7gj21Ge8gWu.78Cd9I1DNIjVViPV?startTime=1607008552000 Skylake / Cascade Lake / AMD Rome The slides for 2020 do exist - as I remember all the slides from all talks are grouped together, but I cannot find them. Watch the video - it is an excellent presentation. On Sat, 19 Jun 2021 at 16:49, Gerald Henriksen wrote: > On Wed, 16 Jun 2021 13:15:40 -0400, you wrote: > > >The answer given, and I'm > >not making this up, is that AMD listens to their users and gives the > >users what they want, and right now they're not hearing any demand for > >AVX512. > > > >Personally, I call BS on that one. 
I can't imagine anyone in the HPC > >community saying "we'd like processors that offer only 1/2 the floating > >point performance of Intel processors". > > I suspect that is marketing speak, which roughly translates to not > that no one has asked for it, but rather requests haven't reached a > threshold where the requests are viewed as significant enough. > > > Sure, AMD can offer more cores, > >but with only AVX2, you'd need twice as many cores as Intel processors, > >all other things being equal. > > But of course all other things aren't equal. > > AVX512 is a mess. > > Look at the Wikipedia page(*) and note that AVX512 means different > things depending on the processor implementing it. > > So what does the poor software developer target? > > Or that it can for heat reasons cause CPU frequency reductions, > meaning real world performance may not match theoritical - thus easier > to just go with GPU's. > > The result is that most of the world is quite happily (at least for > now) ignoring AVX512 and going with GPU's as necessary - particularly > given the convenient libraries that Nvidia offers. > > > I compared a server with dual AMD EPYC >7H12 processors (128) > > quad Intel Xeon 8268 >processors (96 cores). > > > From what I've heard, the AMD processors run much hotter than the Intel > >processors, too, so I imagine a FLOPS/Watt comparison would be even less > >favorable to AMD. > > Spec sheets would indicate AMD runs hotter, but then again you > benchmarked twice as many Intel processors. > > So, per spec sheets for you processors above: > > AMD - 280W - 2 processors means system 560W > Intel - 205W - 4 processors means system 820W > > (and then you also need to factor in purchase price). 
> > >An argument can be made that for calculations that lend themselves to > >vectorization should be done on GPUs, instead of the main processors but > >the last time I checked, GPU jobs are still memory is limited, and > >moving data in and out of GPU memory can still take time, so I can see > >situations where for large amounts of data using CPUs would be preferred > >over GPUs. > > AMD's latest chips support PCI 4 while Intel is still stuck on PCI 3, > which may or may not mean a difference. > > But what despite all of the above and the other replies, it is AMD who > has been winning the HPC contracts of late, not Intel. > > * - https://en.wikipedia.org/wiki/Advanced_Vector_Extensions > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hearnsj at gmail.com Sun Jun 20 05:51:58 2021 From: hearnsj at gmail.com (John Hearns) Date: Sun, 20 Jun 2021 06:51:58 +0100 Subject: [Beowulf] AMD and AVX512 [EXT] In-Reply-To: References: Message-ID: That is a very interesting point! I never thought of that. Also mobile drives ARM development - yes I know the CPUs in Isambard and Fugaku will not be seen in your mobile phone but the ecosystem is propped up by having a diverse market and also the power saving priorities of mobile will influence HPC ARM CPUs. On Sun, 20 Jun 2021 at 02:04, Tim Cutts wrote: > I think that?s a major important point. Even if the whole of the HPC > market were clamouring for it (which they?re not, judging by this > discussion) that?s still a very small proportion of the worldwide CPU > market. We have to remember that we in the HPC community are a niche > market. 
I recall at SC a couple of years ago someone from Intel pointing > out that mobile devices and IoT were what was driving IT technology; the > volume dwarfs everything else. Hence the drive to NVRAM - not to make > things faster for HPC (although that was the benefit being presented > through that talk), but the fundamental driver was to increase phone > battery life. > > Tim > > -- > Tim Cutts > Head of Scientific Computing > Wellcome Sanger Institute > > > On 19 Jun 2021, at 16:49, Gerald Henriksen wrote: > > I suspect that is marketing speak, which roughly translates to not > that no one has asked for it, but rather requests haven't reached a > threshold where the requests are viewed as significant enough. > > > -- The Wellcome Sanger Institute is operated by Genome Research Limited, a > charity registered in England with number 1021457 and a company registered > in England with number 2742969, whose registered office is 215 Euston Road, > London, NW1 2BE. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amacater at einval.com Sun Jun 20 12:13:21 2021 From: amacater at einval.com (Andrew M.A. Cater) Date: Sun, 20 Jun 2021 12:13:21 +0000 Subject: [Beowulf] Just a quick heads up: new Beowulf coming ... Message-ID: The folks over at Devuan have chosen to name their next release (based on upcoming Debian 11) beowulf. I don't think it will impact this list - but you never know. For anyone running Debian on HPC / in labs: the latest Debian point release 10.10 was yesterday. Debian 11 should be released in about six weeks on 31 July 2021. 
Thanks for the excellence in this list All best, as ever, Andy Cater From sdm900 at gmail.com Sun Jun 20 14:38:42 2021 From: sdm900 at gmail.com (Stu Midgley) Date: Sun, 20 Jun 2021 22:38:42 +0800 Subject: [Beowulf] AMD and AVX512 In-Reply-To: References: Message-ID: we should be upto about EV12 by now... On Sun, Jun 20, 2021 at 1:38 PM John Hearns wrote: > Regarding benchmarking real world codes on AMD , every year Martyn Guest > presents a comprehensive set of benchmark studies to the UK Computing > Insights Conference. > I suggest a Sunday afternoon with the beverage of your choice is a good > time to settle down and take time to read these or watch the presentation. > > 2019 > > https://www.scd.stfc.ac.uk/SiteAssets/Pages/CIUK-2019-Presentations/Martyn_Guest.pdf > > > 2020 Video session > > https://ukri.zoom.us/rec/share/ajvsxdJ8RM1wzpJtnlcypw4OyrZ9J27nqsfAG7eW49Ehq_Z5igat_7gj21Ge8gWu.78Cd9I1DNIjVViPV?startTime=1607008552000 > > Skylake / Cascade Lake / AMD Rome > > The slides for 2020 do exist - as I remember all the slides from all talks > are grouped together, but I cannot find them. > Watch the video - it is an excellent presentation. > > > > > > > > > > > > > > > > > > > On Sat, 19 Jun 2021 at 16:49, Gerald Henriksen wrote: > >> On Wed, 16 Jun 2021 13:15:40 -0400, you wrote: >> >> >The answer given, and I'm >> >not making this up, is that AMD listens to their users and gives the >> >users what they want, and right now they're not hearing any demand for >> >AVX512. >> > >> >Personally, I call BS on that one. I can't imagine anyone in the HPC >> >community saying "we'd like processors that offer only 1/2 the floating >> >point performance of Intel processors". >> >> I suspect that is marketing speak, which roughly translates to not >> that no one has asked for it, but rather requests haven't reached a >> threshold where the requests are viewed as significant enough. 
>> >> > Sure, AMD can offer more cores, >> >but with only AVX2, you'd need twice as many cores as Intel processors, >> >all other things being equal. >> >> But of course all other things aren't equal. >> >> AVX512 is a mess. >> >> Look at the Wikipedia page(*) and note that AVX512 means different >> things depending on the processor implementing it. >> >> So what does the poor software developer target? >> >> Or that it can for heat reasons cause CPU frequency reductions, >> meaning real world performance may not match theoritical - thus easier >> to just go with GPU's. >> >> The result is that most of the world is quite happily (at least for >> now) ignoring AVX512 and going with GPU's as necessary - particularly >> given the convenient libraries that Nvidia offers. >> >> > I compared a server with dual AMD EPYC >7H12 processors (128) >> > quad Intel Xeon 8268 >processors (96 cores). >> >> > From what I've heard, the AMD processors run much hotter than the Intel >> >processors, too, so I imagine a FLOPS/Watt comparison would be even less >> >favorable to AMD. >> >> Spec sheets would indicate AMD runs hotter, but then again you >> benchmarked twice as many Intel processors. >> >> So, per spec sheets for you processors above: >> >> AMD - 280W - 2 processors means system 560W >> Intel - 205W - 4 processors means system 820W >> >> (and then you also need to factor in purchase price). >> >> >An argument can be made that for calculations that lend themselves to >> >vectorization should be done on GPUs, instead of the main processors but >> >the last time I checked, GPU jobs are still memory is limited, and >> >moving data in and out of GPU memory can still take time, so I can see >> >situations where for large amounts of data using CPUs would be preferred >> >over GPUs. >> >> AMD's latest chips support PCI 4 while Intel is still stuck on PCI 3, >> which may or may not mean a difference. 
>> >> But what despite all of the above and the other replies, it is AMD who >> has been winning the HPC contracts of late, not Intel. >> >> * - https://en.wikipedia.org/wiki/Advanced_Vector_Extensions >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf >> > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -- Dr Stuart Midgley sdm900 at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From ghenriks at gmail.com Sun Jun 20 14:44:12 2021 From: ghenriks at gmail.com (Gerald Henriksen) Date: Sun, 20 Jun 2021 10:44:12 -0400 Subject: [Beowulf] AMD and AVX512 [EXT] In-Reply-To: References: Message-ID: On Sun, 20 Jun 2021 06:51:58 +0100, you wrote: >That is a very interesting point! I never thought of that. >Also mobile drives ARM development - yes I know the CPUs in Isambard and >Fugaku will not be seen in your mobile phone but the ecosystem is propped >up by having a diverse market and also the power saving priorities of >mobile will influence HPC ARM CPUs. I think the danger is in thinking of ARM (or going forward RISC-V) in the same way that we have traditionally considered CPU families like the x86 / x64 / Power families. One of the things hobbling x64 is that it is effectively one design that Intel (and to a lesser extent AMD) try to fit into multiple roles - often without success. Consider the now-abandoned attempts to get Intel chips into phones and tablets. ARM has no such constraints - they are quite happy to develop new designs for specific markets that are entirely unsuitable for their existing strengths. 
Hence, as part of the ARM push into HPC, the new Neoverse V1 - a design for HPC that probably won't appear in phones. https://www.arm.com/blogs/blueprint/neoverse-v1 Or consider that the ARM ecosystem has shunned making multiple-bitness CPUs/SOCs - they essentially made a clean break with 64-bit only chips that sit alongside the 32-bit only chips - vendors choose the hardware for their needs and don't carry along legacy stuff that eats up silicon space and power. ARM is about taking ARM IP and creating custom designs for specific markets. From joe.landman at gmail.com Sun Jun 20 17:21:15 2021 From: joe.landman at gmail.com (Joe Landman) Date: Sun, 20 Jun 2021 13:21:15 -0400 Subject: [Beowulf] AMD and AVX512 In-Reply-To: <61925435-be88-efec-dfdc-8895f3e79616@gmail.com> References: <61925435-be88-efec-dfdc-8895f3e79616@gmail.com> Message-ID: <0f1a31ff-b922-1fad-9dbf-9d87ca9c44ca@gmail.com> (Note: not disagreeing at all with Gerald, actually agreeing strongly ... also, correct address this time! Thanks Gerald!) On 6/19/21 11:49 AM, Gerald Henriksen wrote: > On Wed, 16 Jun 2021 13:15:40 -0400, you wrote: > >> The answer given, and I'm >> not making this up, is that AMD listens to their users and gives the >> users what they want, and right now they're not hearing any demand for >> AVX512. More accurately, there is call for it. From a very small segment of the market. Ones who buy small quantities of processors (under 100k volume per purchase). That is, not a significant enough portion of the market to make a huge difference to the supplier (Intel). And more to the point, AI and HPC joining forces has put the spotlight on small matrix multiplies, often with lower precision. I'm not sure (haven't read much on it recently) if AVX512 will be enabling/has enabled support for bfloat16/FP16 or similar. These tend to go to GPUs and other accelerators. 
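Since bfloat16 came up: it is just FP32 with the mantissa cut to 7 bits, keeping the full 8-bit exponent, so it trades precision for FP32's dynamic range. (Intel did, for the record, bolt this onto AVX-512 as the AVX512_BF16 extension in Cooper Lake Xeons.) A small sketch of the truncating conversion, with round-to-nearest-even and NaN handling omitted for brevity:

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """bfloat16 by truncation: keep the top 16 bits of the FP32 encoding."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_to_fp32(bits16: int) -> float:
    """Widen bfloat16 back to FP32 by zero-filling the dropped mantissa bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return x

print(bf16_to_fp32(fp32_to_bf16_bits(1.0)))      # 1.0 (exactly representable)
print(bf16_to_fp32(fp32_to_bf16_bits(3.14159)))  # 3.140625 (mantissa truncated)
print(bf16_to_fp32(fp32_to_bf16_bits(1e38)))     # huge values survive, unlike FP16
```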
I can't imagine anyone in the HPC >> community saying "we'd like processors that offer only 1/2 the floating >> point performance of Intel processors". > I suspect that is marketing speak, which roughly translates to not > that no one has asked for it, but rather requests haven't reached a > threshold where the requests are viewed as significant enough. This, precisely.? AMD may be losing the AVX512 users to Intel. But that's a small/miniscule fraction of the overall users of its products.? The demand for this is quite constrained. Moreover, there are often significant performance consequences to using AVX512 (downclocking, pipeline stalls, etc.) whereby the cost of enabling it and using it, far outweighs the benefits of providing it, for the vast, overwhelming portion of the market. And, as noted above on the accelerator side, this use case (large vectors) are better handled by the accelerators.? There is a cost (engineering, code design, etc.) to using accelerators as well.? But it won't directly impact the CPUs. >> Sure, AMD can offer more cores, >> but with only AVX2, you'd need twice as many cores as Intel processors, >> all other things being equal. ... or you run the GPU versions of the code, which are likely getting more active developer attention.? AVX512 applies to only a miniscule number of codes/problems.? Its really not a panacea. More to the point, have you seen how "well" compilers use AVX2/SSE registers and do code gen?? Its not pretty in general. Would you want the compilers to purposefully spit out AVX512 code the way the do AVX2/SSE code now?? I've found one has to work very hard with intrinsics to get good performance out of AVX2, never mind AVX512. Put another way, we've been hearing about "smart" compilers for a while, and in all honesty, most can barely implement a standard correctly, never mind generate reasonably (near) optimal code for the target system.? 
This has been a problem my entire professional life, and while I wish they were better, at the end of the day, this is where human intelligence fits into the HPC/AI narrative. > But of course all other things aren't equal. > > AVX512 is a mess. Understated, and yes. > Look at the Wikipedia page(*) and note that AVX512 means different > things depending on the processor implementing it. I made comments previously about which ISA ARM folks were going to write to. That is, different processors, likely implementing different instructions, differently ... you won't really have 1 equally good compiler for all these features. You'll have a compiler that implements common denominators reasonably well. Which mitigates the benefits of the ISA/architecture. Intel has the same problem with AVX512. I know, I know ... feature flags on the CPU (see last line of lscpu output). And how often have certain (ahem) compilers ignored the flags, and used a different mechanism to determine CPU feature support, specifically targeting their competitor offerings to force (literally) low performance paths for those CPUs? > So what does the poor software developer target? Lowest common denominator. Make the code work correctly first. Then make it fast. If fast is platform specific, ask how often that platform will be used. > Or that it can for heat reasons cause CPU frequency reductions, > meaning real world performance may not match theoritical - thus easier > to just go with GPU's. > > The result is that most of the world is quite happily (at least for > now) ignoring AVX512 and going with GPU's as necessary - particularly > given the convenient libraries that Nvidia offers. Yeah ... like it or not, that battle is over (for now). [...] 
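In practice, "feature flags plus lowest common denominator" turns into runtime dispatch: probe the CPU once, then route the hot kernels through the widest path the machine actually reports, with a scalar fallback. A sketch of the pattern, where the parsing mirrors the `flags` line of Linux `/proc/cpuinfo` and the kernel names are hypothetical:

```python
def parse_flags(cpuinfo_text: str) -> set:
    """Collect ISA feature flags from /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
    return flags

def select_kernel(flags: set) -> str:
    """Pick the widest vector path the CPU advertises; scalar is the fallback."""
    if {"avx512f", "avx512vl"} <= flags:   # a subset check -- AVX512 is a family
        return "dot_avx512"
    if "avx2" in flags:
        return "dot_avx2"
    return "dot_scalar"

sample = "flags\t\t: fpu sse sse2 avx avx2 fma\n"
print(select_kernel(parse_flags(sample)))   # dot_avx2
```

This is also exactly where the "different mechanism to determine CPU feature support" trick does its damage: dispatch keyed on vendor ID instead of the flags silently sends competing CPUs down the slow path.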
> >> An argument can be made that for calculations that lend themselves to >> vectorization should be done on GPUs, instead of the main processors but >> the last time I checked, GPU jobs are still memory is limited, and >> moving data in and out of GPU memory can still take time, so I can see >> situations where for large amounts of data using CPUs would be preferred >> over GPUs. > AMD's latest chips support PCI 4 while Intel is still stuck on PCI 3, > which may or may not mean a difference. It does. IO and memory bandwidth/latency are very important, and oft-overlooked, aspects of performance. If you have a choice of doubling IO and memory bandwidth at lower latency (usable by everyone) vs adding an AVX512 unit or two (usable by a small fraction of a percent of all users), which would net you, as an architect, the best "bang for the buck"? > But what despite all of the above and the other replies, it is AMD who > has been winning the HPC contracts of late, not Intel. There's a reason for that. I will admit I have a devil of a time trying to convince people that higher clock frequency for computing matters only to a small fraction of operations, especially ones waiting on (slow) RAM and (slower) IO. Make the RAM and IO faster (lower latency, higher bandwidth), and the system will be far more performant. -- Joe Landman e:joe.landman at gmail.com t: @hpcjoe w:https://scalability.org g:https://github.com/joelandman l:https://www.linkedin.com/in/joelandman -------------- next part -------------- An HTML attachment was scrubbed... 
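The clock-frequency point yields to the same arithmetic: only the compute portion of a kernel's runtime scales with clock, the memory-wait portion does not. A toy Amdahl-style model, where the 90%/10% splits are illustrative rather than measurements:

```python
# The fraction of runtime spent waiting on memory does not speed up with clock.
def clock_speedup(mem_fraction: float, clock_boost: float) -> float:
    compute = (1.0 - mem_fraction) / clock_boost
    return 1.0 / (compute + mem_fraction)

# 20% higher clock on a kernel that waits on RAM 90% of the time:
print(round(clock_speedup(0.90, 1.20), 3))   # 1.017 -- under 2% faster
# The same 20% on a cache-resident kernel:
print(round(clock_speedup(0.10, 1.20), 3))   # 1.176
```

Doubling memory bandwidth instead roughly halves the 90% term, for a ~1.8x gain on the same kernel — the "make the RAM and IO faster" point in numbers.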
It seems to me that a reasonable basis for discussing AMD EPYC performance could be the specified performance data in the Daresburg University benchmark from M.Guest. Yes, newer versions of AMD EPYC and Xeon Scalable processors have appeared since then, and new compiler versions. However, Intel already had AVX-512 support, and AMD - AVX-256. Of course, peak performanceis is not so important as application performance. There are applications where performance is not limited to working with vectors - there AVX-512 may not be needed. And in AI tasks, working with vectors is actual - and GPUs are often used there. For AI, the Daresburg benchmark, on the other hand, is less relevant. And in Zen 4, AMD seemed to be going to support 512 bit vectors. But performance of linear algebra does not always require work with GPU. In quantum chemistry, you can get acceleration due to vectors on the V100, let's say a 2 times - how much more expensive is the GPU? Of course, support for 512 bit vectors is a plus, but you really need to look to application performance and cost (including power consumption). I prefer to see to the A64FX now, although there may need to be rebuild applications. Servers w/A64FX sold now, but the price is very important. In message from John Hearns (Sun, 20 Jun 2021 06:38:06 +0100): > Regarding benchmarking real world codes on AMD , every year Martyn >Guest > presents a comprehensive set of benchmark studies to the UK Computing > Insights Conference. > I suggest a Sunday afternoon with the beverage of your choice is a >good > time to settle down and take time to read these or watch the >presentation. 
> > 2019 > https://www.scd.stfc.ac.uk/SiteAssets/Pages/CIUK-2019-Presentations/Martyn_Guest.pdf > > > 2020 Video session > https://ukri.zoom.us/rec/share/ajvsxdJ8RM1wzpJtnlcypw4OyrZ9J27nqsfAG7eW49Ehq_Z5igat_7gj21Ge8gWu.78Cd9I1DNIjVViPV?startTime=1607008552000 > > Skylake / Cascade Lake / AMD Rome > > The slides for 2020 do exist - as I remember all the slides from all >talks > are grouped together, but I cannot find them. > Watch the video - it is an excellent presentation. > > > > > > > > > > > > > > > > > > > On Sat, 19 Jun 2021 at 16:49, Gerald Henriksen >wrote: > >> On Wed, 16 Jun 2021 13:15:40 -0400, you wrote: >> >> >The answer given, and I'm >> >not making this up, is that AMD listens to their users and gives the >> >users what they want, and right now they're not hearing any demand >>for >> >AVX512. >> > >> >Personally, I call BS on that one. I can't imagine anyone in the HPC >> >community saying "we'd like processors that offer only 1/2 the >>floating >> >point performance of Intel processors". >> >> I suspect that is marketing speak, which roughly translates to not >> that no one has asked for it, but rather requests haven't reached a >> threshold where the requests are viewed as significant enough. >> >> > Sure, AMD can offer more cores, >> >but with only AVX2, you'd need twice as many cores as Intel >>processors, >> >all other things being equal. >> >> But of course all other things aren't equal. >> >> AVX512 is a mess. >> >> Look at the Wikipedia page(*) and note that AVX512 means different >> things depending on the processor implementing it. >> >> So what does the poor software developer target? >> >> Or that it can for heat reasons cause CPU frequency reductions, >> meaning real world performance may not match theoritical - thus >>easier >> to just go with GPU's. 
>>
>> The result is that most of the world is quite happily (at least for
>> now) ignoring AVX512 and going with GPUs as necessary - particularly
>> given the convenient libraries that Nvidia offers.
>>
>> > I compared a server with dual AMD EPYC 7H12 processors (128 cores)
>> > with quad Intel Xeon 8268 processors (96 cores).
>>
>> > From what I've heard, the AMD processors run much hotter than the Intel
>> >processors, too, so I imagine a FLOPS/Watt comparison would be even less
>> >favorable to AMD.
>>
>> Spec sheets would indicate AMD runs hotter, but then again you
>> benchmarked twice as many Intel processors.
>>
>> So, per spec sheets for your processors above:
>>
>> AMD - 280W - 2 processors means system 560W
>> Intel - 205W - 4 processors means system 820W
>>
>> (and then you also need to factor in purchase price).
>>
>> >An argument can be made that calculations that lend themselves to
>> >vectorization should be done on GPUs, instead of the main processors, but
>> >the last time I checked, GPU jobs are still memory-limited, and
>> >moving data in and out of GPU memory can still take time, so I can see
>> >situations where for large amounts of data using CPUs would be preferred
>> >over GPUs.
>>
>> AMD's latest chips support PCIe 4 while Intel is still stuck on PCIe 3,
>> which may or may not mean a difference.
>>
>> But despite all of the above and the other replies, it is AMD who
>> has been winning the HPC contracts of late, not Intel.
>>
>> * - https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
From sassy-work at sassy.formativ.net Sun Jun 20 22:45:26 2021
From: sassy-work at sassy.formativ.net (=?ISO-8859-1?Q?J=F6rg_Sa=DFmannshausen?=)
Date: Sun, 20 Jun 2021 23:45:26 +0100
Subject: [Beowulf] AMD and AVX512
In-Reply-To: References:
Message-ID: <4123509.Cr6JYYXRPx@deepblue>

Dear all,

same here, I should have joined the discussion earlier, but currently I am
recovering from a trapped ulnar nerve operation, so long stretches of typing
are something I need to avoid.

As it is quite apt, I think, I would like to inform you about this upcoming
talk (copy&pasta):

**********
*Performance Optimizations & Best Practices for AMD Rome and Milan CPUs in
HPC Environments*

- date & time: Fri July 2nd 2021 - 16:00-17:30 UTC
- speakers: Evan Burness and Jithin Jose (Principal Program Managers for
High-Performance Computing in Microsoft Azure)

More information available at
https://github.com/easybuilders/easybuild/wiki/EasyBuild-tech-talks-IV:-AMD-Rome-&-Milan

The talk will be presented via a Zoom session, which registered attendees
can join, and will be streamed (+ recorded) via the EasyBuild YouTube
channel. Q&A via the #tech-talks channel in the EasyBuild Slack.

Please register (free of charge) if you plan to attend, via:
https://webappsx.ugent.be/eventManager/events/ebtechtalkamdromemilan

The Zoom link will only be shared with registered attendees.
**********

These talks are really tech talks, not sales talks, and all of the ones I
have been to were very informative and friendly. So it might be a good idea
to ask some questions there.

All the best

Jörg

Am Sonntag, 20. Juni 2021, 18:28:25 BST schrieb Mikhail Kuzminsky:
> I apologize - I should have written earlier, but I can't always work
> with my broken right hand. It seems to me that a reasonable basis for
> discussing AMD EPYC performance could be the specified performance
> data in the Daresbury benchmark studies from M. Guest.
> [...]
> >> * - https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

From engwalljonathanthereal at gmail.com Mon Jun 21 13:20:00 2021
From: engwalljonathanthereal at gmail.com (Jonathan Engwall)
Date: Mon, 21 Jun 2021 06:20:00 -0700
Subject: [Beowulf] AMD and AVX512
In-Reply-To: <0f1a31ff-b922-1fad-9dbf-9d87ca9c44ca@gmail.com>
References: <61925435-be88-efec-dfdc-8895f3e79616@gmail.com>
 <0f1a31ff-b922-1fad-9dbf-9d87ca9c44ca@gmail.com>
Message-ID:

I have followed this thinking "square peg, round hole."
You have got it again, Joe. Compilers are your problem.

On Sun, Jun 20, 2021, 10:21 AM Joe Landman wrote:

> (Note: not disagreeing at all with Gerald, actually agreeing strongly ...
> also, correct address this time! Thanks Gerald!)
>
> On 6/19/21 11:49 AM, Gerald Henriksen wrote:
>
> On Wed, 16 Jun 2021 13:15:40 -0400, you wrote:
>
> The answer given, and I'm
> not making this up, is that AMD listens to their users and gives the
> users what they want, and right now they're not hearing any demand for
> AVX512.
>
> More accurately, there is call for it. From a very small segment of the
> market. Ones who buy small quantities of processors (under 100k volume
> per purchase).
>
> That is, not a significant enough portion of the market to make a huge
> difference to the supplier (Intel).
>
> And more to the point, AI and HPC joining forces has put the spotlight on
> small matrix multiplies, often with lower precision.
I'm not sure (haven't
> read much on it recently) if AVX512 will be enabling/has enabled support
> for bfloat16/FP16 or similar. These tend to go to GPUs and other
> accelerators.
>
> Personally, I call BS on that one. I can't imagine anyone in the HPC
> community saying "we'd like processors that offer only 1/2 the floating
> point performance of Intel processors".
>
> I suspect that is marketing speak, which roughly translates to not
> that no one has asked for it, but rather requests haven't reached a
> threshold where the requests are viewed as significant enough.
>
> This, precisely. AMD may be losing the AVX512 users to Intel. But that's
> a small/minuscule fraction of the overall users of its products. The
> demand for this is quite constrained. Moreover, there are often
> significant performance consequences to using AVX512 (downclocking,
> pipeline stalls, etc.) whereby the cost of enabling and using it far
> outweighs the benefits of providing it, for the vast, overwhelming
> portion of the market.
>
> And, as noted above on the accelerator side, this use case (large
> vectors) is better handled by the accelerators. There is a cost
> (engineering, code design, etc.) to using accelerators as well. But it
> won't directly impact the CPUs.
>
> Sure, AMD can offer more cores,
> but with only AVX2, you'd need twice as many cores as Intel processors,
> all other things being equal.
>
> ... or you run the GPU versions of the code, which are likely getting
> more active developer attention. AVX512 applies to only a minuscule
> number of codes/problems. It's really not a panacea.
>
> More to the point, have you seen how "well" compilers use AVX2/SSE
> registers and do code gen? It's not pretty in general. Would you want
> the compilers to purposefully spit out AVX512 code the way they do
> AVX2/SSE code now? I've found one has to work very hard with intrinsics
> to get good performance out of AVX2, never mind AVX512.
>
> Put another way, we've been hearing about "smart" compilers for a while,
> and in all honesty, most can barely implement a standard correctly,
> never mind generate reasonably (near) optimal code for the target
> system. This has been a problem my entire professional life, and while I
> wish they were better, at the end of the day, this is where human
> intelligence fits into the HPC/AI narrative.
>
> But of course all other things aren't equal.
>
> AVX512 is a mess.
>
> Understated, and yes.
>
> Look at the Wikipedia page(*) and note that AVX512 means different
> things depending on the processor implementing it.
>
> I made comments previously about which ISA ARM folks were going to write
> to. That is, different processors, likely implementing different
> instructions, differently ... you won't really have one equally good
> compiler for all these features. You'll have a compiler that implements
> common denominators reasonably well. Which mitigates the benefits of the
> ISA/architecture.
>
> Intel has the same problem with AVX512. I know, I know ... feature flags
> on the CPU (see last line of lscpu output). And how often have certain
> (ahem) compilers ignored the flags, and used a different mechanism to
> determine CPU feature support, specifically targeting their competitor's
> offerings to force (literally) low-performance paths for those CPUs?
>
> So what does the poor software developer target?
>
> Lowest common denominator. Make the code work correctly first. Then make
> it fast. If fast is platform-specific, ask how often that platform will
> be used.
>
> Or that it can for heat reasons cause CPU frequency reductions,
> meaning real world performance may not match theoretical - thus easier
> to just go with GPUs.
>
> The result is that most of the world is quite happily (at least for
> now) ignoring AVX512 and going with GPUs as necessary - particularly
> given the convenient libraries that Nvidia offers.
>
> Yeah ...
like it or not, that battle is over (for now).
>
> [...]
>
> An argument can be made that calculations that lend themselves to
> vectorization should be done on GPUs, instead of the main processors,
> but the last time I checked, GPU jobs are still memory-limited, and
> moving data in and out of GPU memory can still take time, so I can see
> situations where for large amounts of data using CPUs would be preferred
> over GPUs.
>
> AMD's latest chips support PCIe 4 while Intel is still stuck on PCIe 3,
> which may or may not mean a difference.
>
> It does. IO and memory bandwidth/latency are very important, and
> oft-overlooked aspects of performance. If you have a choice of doubling
> IO and memory bandwidth at lower latency (usable by everyone) vs adding
> an AVX512 unit or two (usable by a small fraction of a percent of all
> users), which would net you, as an architect, the best "bang for the
> buck"?
>
> But despite all of the above and the other replies, it is AMD who
> has been winning the HPC contracts of late, not Intel.
>
> There's a reason for that. I will admit I have a devil of a time trying
> to convince people that higher clock frequency for computing matters
> only to a small fraction of operations, especially ones waiting on
> (slow) RAM and (slower) IO. Make the RAM and IO faster (lower latency,
> higher bandwidth), and the system will be far more performant.
>
> --
> Joe Landman
> e: joe.landman at gmail.com
> t: @hpcjoe
> w: https://scalability.org
> g: https://github.com/joelandman
> l: https://www.linkedin.com/in/joelandman
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From joe.landman at gmail.com Mon Jun 21 13:46:30 2021
From: joe.landman at gmail.com (Joe Landman)
Date: Mon, 21 Jun 2021 09:46:30 -0400
Subject: [Beowulf] AMD and AVX512
In-Reply-To: References: <61925435-be88-efec-dfdc-8895f3e79616@gmail.com>
 <0f1a31ff-b922-1fad-9dbf-9d87ca9c44ca@gmail.com>
Message-ID: <5fad1acd-31d1-b624-34cd-9e53742c251b@gmail.com>

On 6/21/21 9:20 AM, Jonathan Engwall wrote:
> I have followed this thinking "square peg, round hole."
> You have got it again, Joe. Compilers are your problem.

Erp ... did I mess up again?

System architecture has been a problem ... making a processing unit
10-100x as fast as its support components means you have to code with
that in mind. A simple `gfortran -O3 mycode.f` won't necessarily
generate optimal code for the system (but I swear ... -O3 ... it says
it on the package!)

Way back at Scalable, our secret sauce was largely increasing IO
bandwidth and lowering IO latency while coupling computing more tightly
to this massive IO/network pipe set, combined with intelligence in the
kernel on how to better use the resources. It was simply a better
architecture. We used the same CPUs. We simply exploited the design
better.

End result was codes that ran on our systems with off-cpu work (storage,
networking, etc.) could push our systems far harder than competitors.
And you didn't have to use a different ISA to get these benefits. No
recompilation needed, though we did show the folks who were interested
how to get even better performance.

Architecture matters, as does implementation of that architecture.
There are costs to every decision within an architecture. For AVX512,
along comes lots of other baggage associated with downclocking, etc.
You have to do a cost-benefit analysis on whether or not it is worth
paying for that baggage, with the benefits you get from doing so. Some
folks have made that decision towards AVX512, and have been enjoying the
benefits of doing so (e.g. willing to pay the costs).
For the general audience, these costs represent a (significant) hurdle one
must overcome.

Here's where awesome compiler support would help. FWIW, gcc isn't that
great a compiler. It's not performance-minded for HPC. It's a reasonable
general-purpose, standards-compliant (for some subset of standards)
compilation system. LLVM is IMO a better compiler system, and its
clang/flang are developing nicely, albeit still not really HPC focused.
Then you have variants built on that, like the Cray compiler, Nvidia
compiler and AMD compiler. These are HPC focused, and actually do quite
well with some codes (though the AMD version lags the Cray and Nvidia
compilers). You've got the Intel compiler, which would be a good general
compiler if it wasn't more of a marketing vehicle for Intel processors and
their features (hey, you got an AMD chip? you will take the slowest code
path even if you support the features needed for the high performance code
path).

Maybe, someday, we'll get a great HPC compiler for C/Fortran.

--
Joe Landman
e: joe.landman at gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From amacater at einval.com Mon Jun 21 14:46:53 2021
From: amacater at einval.com (Andrew M.A. Cater)
Date: Mon, 21 Jun 2021 14:46:53 +0000
Subject: [Beowulf] AMD and AVX512
In-Reply-To: <5fad1acd-31d1-b624-34cd-9e53742c251b@gmail.com>
References: <61925435-be88-efec-dfdc-8895f3e79616@gmail.com>
 <0f1a31ff-b922-1fad-9dbf-9d87ca9c44ca@gmail.com>
 <5fad1acd-31d1-b624-34cd-9e53742c251b@gmail.com>
Message-ID:

On Mon, Jun 21, 2021 at 09:46:30AM -0400, Joe Landman wrote:
> On 6/21/21 9:20 AM, Jonathan Engwall wrote:
> > I have followed this thinking "square peg, round hole."
> > You have got it again, Joe. Compilers are your problem.
>
> Erp ... did I mess up again?
>
> [...]
>
> Maybe, someday, we'll get a great HPC compiler for C/Fortran.
>
The problem is that, maybe, the HPC market is still not _quite_ big enough
to merit a dedicated set of compilers and is diverse enough in its problem
sets that we still need a dozen or more specialist use cases to work well.

You would think there would be a cross-over point where massively parallel
scalable cloud infrastructure would intersect with HPC, but that doesn't
seem to be happening. Parallelisation is the great bugbear anyway.

Most of the experts I know on all of this are the regulars on this list:
paging Greg Lindahl ...
All the best,

Andy Cater

> --
> Joe Landman
> e: joe.landman at gmail.com
> t: @hpcjoe
> w: https://scalability.org
> g: https://github.com/joelandman
> l: https://www.linkedin.com/in/joelandman
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

From engwalljonathanthereal at gmail.com Mon Jun 21 14:58:51 2021
From: engwalljonathanthereal at gmail.com (Jonathan Engwall)
Date: Mon, 21 Jun 2021 07:58:51 -0700
Subject: [Beowulf] AMD and AVX512
In-Reply-To: References: <61925435-be88-efec-dfdc-8895f3e79616@gmail.com>
 <0f1a31ff-b922-1fad-9dbf-9d87ca9c44ca@gmail.com>
 <5fad1acd-31d1-b624-34cd-9e53742c251b@gmail.com>
Message-ID:

AVX-512 is SIMD, and in that respect compiled Intel routines will run
almost automatically on Intel processors. It's not like I was answering the
question. I realize, or under-realize, the implementation problems. You
need to do a side-by-side comparison of the die.

On Mon, Jun 21, 2021, 7:47 AM Andrew M.A. Cater wrote:

> On Mon, Jun 21, 2021 at 09:46:30AM -0400, Joe Landman wrote:
> > [...]
> >
> > Maybe, someday, we'll get a great HPC compiler for C/Fortran.
> >
> The problem is that, maybe, the HPC market is still not _quite_ big
> enough to merit a dedicated set of compilers and is diverse enough in its
> problem sets that we still need a dozen or more specialist use cases to
> work well.
>
> [...]
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sassy-work at sassy.formativ.net Mon Jun 21 17:11:37 2021
From: sassy-work at sassy.formativ.net (=?ISO-8859-1?Q?J=F6rg_Sa=DFmannshausen?=)
Date: Mon, 21 Jun 2021 18:11:37 +0100
Subject: [Beowulf] AMD and AVX512
In-Reply-To: <5fad1acd-31d1-b624-34cd-9e53742c251b@gmail.com>
References: <61925435-be88-efec-dfdc-8895f3e79616@gmail.com>
 <5fad1acd-31d1-b624-34cd-9e53742c251b@gmail.com>
Message-ID: <3326969.2YInZGPpZj@deepblue>

Dear all

> System architecture has been a problem ... making a processing unit
> 10-100x as fast as its support components means you have to code with
> that in mind. A simple `gfortran -O3 mycode.f` won't necessarily
> generate optimal code for the system (but I swear ... -O3 ... it says
> it on the package!)

From a computational chemist's perspective, I agree. In an ideal world, you
want to get the right hardware for the program you want to use. Some codes
run entirely in memory; others use disk space for offloading files.

This is, in my humble opinion, also the big problem CPUs are facing. They
are built to tackle all possible scenarios, from simple integer to floating
point, from in-memory to disk I/O. In some respects it would have been
better to stick with a separate math unit which could then be selected
according to the workload you want to run on that server. I guess this is
where the GPUs are trying to fit in here, or maybe ARM.

I also agree with the compiler "problem". If you start to push some
compilers too hard, the code runs very fast but the results are simply
wrong. Again, in an ideal world we would have a compiler suited to the
given hardware and the job you want to run. The problem here is not "is
that possible?"; the problem is more "how much does it cost?" From what I
understand, some big server farms are actually not using commodity HPC
stuff; they are designing what they need themselves.
Maybe the whole climate problem will finally push HPC into more bespoke
systems where the components are fit for the job in question, say weather
modeling for example, simply as that would be more energy efficient and
faster.

Before somebody comes along with "but but but it costs!": think about how
much money is being spent simply to kill people, or on other wasteful
projects like Brexit etc.

My 2 shillings for what it is worth! :D

Jörg

Am Montag, 21. Juni 2021, 14:46:30 BST schrieb Joe Landman:
> [...]
>
> Maybe, someday, we'll get a great HPC compiler for C/Fortran.

From bdobbins at gmail.com Mon Jun 21 18:39:06 2021
From: bdobbins at gmail.com (Brian Dobbins)
Date: Mon, 21 Jun 2021 12:39:06 -0600
Subject: [Beowulf] AMD and AVX512
In-Reply-To: <3326969.2YInZGPpZj@deepblue>
References: <61925435-be88-efec-dfdc-8895f3e79616@gmail.com>
 <5fad1acd-31d1-b624-34cd-9e53742c251b@gmail.com>
 <3326969.2YInZGPpZj@deepblue>
Message-ID:

Hi all,

> This is, in my humble opinion, also the big problem CPUs are facing. They
> are built to tackle all possible scenarios, from simple integer to
> floating point, from in-memory to disk I/O.
In some respects it would have been better to
> stick with a separate math unit which could then be selected according to
> the workload you want to run on that server. I guess this is where the
> GPUs are trying to fit in here, or maybe ARM.

I recall a few years ago the rumors that the Argonne "A18" system was going
to use the 'Configurable Spatial Accelerators' that Intel was developing,
with the idea being you *could* reconfigure based on the needs of the code.
In principle, it sounds like the Holy Grail, but in practice it seems quite
difficult, and I don't believe I've heard much more about the CSA approach
since.

WikiChip on the CSA:
https://en.wikichip.org/wiki/intel/configurable_spatial_accelerator
NextPlatform article:
https://www.nextplatform.com/2018/08/30/intels-exascale-dataflow-engine-drops-x86-and-von-neuman/

I have to imagine that research hasn't gone fully quiet, especially with
Intel's moves towards oneAPI and their FPGA experiences, but I haven't seen
anything about it in a while. Of course....

> I also agree with the compiler "problem". If you are starting to push
> some compilers too much, the code is running very fast but the results
> are simply wrong. Again, in an ideal world we have a compiler for the job
> for the given hardware, which also depends on the job you want to run.

... It exacerbates the compiler issues, *I think*. I hesitate to say it
does so definitively, since the patent write-up talks about how the CSA
architecture uses a representation very similar to what the (now old) Intel
compilers created as an IR (intermediate representation). In my opinion,
having a compiler that can 'do everything' is like having an AI that can do
everything - we're good at very, *very* specific use-cases, but not
generality. So configurable systems are a big challenge. (I'm *way* out of
my depth on compilers, though - maybe they're improving massively?)
> Maybe the whole climate problem will finally push HPC into the more
> bespoke system where the components are fit for the job in question,
> say weather modeling for example, simply as that would be more energy
> efficient and faster.

I can't speak to whether climate research will influence hardware, but
back to the *original* theme of this thread, I actually had some data
-very *limited* data, mind you!- on how NCAR's climate model, CESM, run
in an 'F2000climo' case (one of many, many cases, and very atmospheric
focused) at 2-degree atmosphere resolution (*very* coarse) on a 36-core
Xeon Skylake performs across AVX2, AVX512 and AVX512+FMA. By default,
FMA is turned off in these cases due to numerical sensitivity.

So, that's a *very* specific case, but on the off chance people are
curious, here's what it looks like - note that this is *noisy* data,
because the model also does a lot of I/O, hence why I tend to look at
the median times below:

  SKX (AWS C5N.18xlarge) Performance Comparison
  CESM Case: F2000climo @ f19_g17 resolution
  (36 cores each component / 10 model day run, skipping 1st and last)

  Flags    AVX2 (no FMA)   AVX512 (no FMA)   AVX512 + FMA
  Min      60.18           60.24             59.16
  Max      66.26           60.47             59.40
  Median   60.28           60.38             59.32

The take-away? We're not really benefiting *at all* (at this resolution,
for this compset, etc) from AVX512 here. Maybe at higher resolution?
Maybe with more vertical levels, or chemistry, or something like that?
*Maybe*, but differences seem indistinguishable from noise here, and
possibly negative! Now, give us more *memory bandwidth*, and that's
fantastic. Could this code be rewritten to take better advantage of
larger vectors? Sure, and some *really* capable people do work on that
sort of stuff, and it helps, but as an *evolution* in performance, not a
revolution in it.

(Also, I'm always horrified by presenting one-off tests as examples of
anything, but it's the only data I have on-hand! Other cases may indeed
vary.)
> Before somebody comes along with: but but but it costs! Think about how
> much money is being spent simply to kill people, or on other wasteful
> projects like Brexit etc.

One can only hope. When it comes to spending on research, I recall the
quote: "If you think education is expensive, try ignorance!"

Cheers,
 - Brian

On Monday, 21 June 2021 at 14:46:30 BST, Joe Landman wrote:

> > On 6/21/21 9:20 AM, Jonathan Engwall wrote:
> >
> > > I have followed this thinking "square peg, round hole."
> > > You have got it again, Joe. Compilers are your problem.
> >
> > Erp ... did I mess up again?
> >
> > System architecture has been a problem ... making a processing unit
> > 10-100x as fast as its support components means you have to code with
> > that in mind. A simple `gfortran -O3 mycode.f` won't necessarily
> > generate optimal code for the system (but I swear ... -O3 ... it says
> > it on the package!)
> >
> > Way back at Scalable, our secret sauce was largely increasing IO
> > bandwidth and lowering IO latency while coupling computing more
> > tightly to this massive IO/network pipe set, combined with
> > intelligence in the kernel on how to better use the resources. It was
> > simply a better architecture. We used the same CPUs. We simply
> > exploited the design better.
> >
> > End result was codes that ran on our systems with off-cpu work
> > (storage, networking, etc.) could push our systems far harder than
> > competitors. And you didn't have to use a different ISA to get these
> > benefits. No recompilation needed, though we did show the folks who
> > were interested, how to get even better performance.
> >
> > Architecture matters, as does implementation of that architecture.
> > There are costs to every decision within an architecture. For AVX512,
> > along comes lots of other baggage associated with downclocking, etc.
> > You have to do a cost-benefit analysis on whether or not it is worth
> > paying for that baggage, with the benefits you get from doing so.
> > Some folks have made that decision towards AVX512, and have been
> > enjoying the benefits of doing so (e.g. willing to pay the costs).
> > For the general audience, these costs represent a (significant)
> > hurdle one must overcome.
> >
> > Here's where awesome compiler support would help. FWIW, gcc isn't
> > that great a compiler. It's not performance minded for HPC. It's a
> > reasonable general purpose standards compliant (for some subset of
> > standards) compilation system. LLVM is IMO a better compiler system,
> > and its clang/flang are developing nicely, albeit still not really
> > HPC focused. Then you have variants built on that. Like the Cray
> > compiler, Nvidia compiler and AMD compiler. These are HPC focused,
> > and actually do quite well with some codes (though the AMD version
> > lags the Cray and Nvidia compilers). You've got the Intel compiler,
> > which would be a good general compiler if it wasn't more of a
> > marketing vehicle for Intel processors and their features (hey you
> > got an AMD chip? you will take the slowest code path even if you
> > support the features needed for the high performance code path).
> >
> > Maybe, someday, we'll get a great HPC compiler for C/Fortran.
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pbisbal at pppl.gov  Mon Jun 21 20:56:08 2021
From: pbisbal at pppl.gov (Prentice Bisbal)
Date: Mon, 21 Jun 2021 16:56:08 -0400
Subject: [Beowulf] [External] Re: AMD and AVX512
In-Reply-To: 
References: 
Message-ID: <5433f49a-7331-4e10-1b4f-e1e5c6878920@pppl.gov>

Thanks for the input. I have looked at that Wikipedia page before, but
never checked it that closely. I just looked mainly to see what
processors supported what extensions.
After taking a closer look at AVX-512, and all the different
subdivisions, I see exactly what you're saying. It's a mess! Compare
that to AVX and AVX2, where it's an all-or-nothing thing. Makes a lot
more sense.

Prentice

On 6/19/21 11:49 AM, Gerald Henriksen wrote:
> On Wed, 16 Jun 2021 13:15:40 -0400, you wrote:
>
>> The answer given, and I'm not making this up, is that AMD listens to
>> their users and gives the users what they want, and right now they're
>> not hearing any demand for AVX512.
>>
>> Personally, I call BS on that one. I can't imagine anyone in the HPC
>> community saying "we'd like processors that offer only 1/2 the
>> floating point performance of Intel processors".
> I suspect that is marketing speak, which roughly translates to not
> that no one has asked for it, but rather requests haven't reached a
> threshold where the requests are viewed as significant enough.
>
>> Sure, AMD can offer more cores, but with only AVX2, you'd need twice
>> as many cores as Intel processors, all other things being equal.
> But of course all other things aren't equal.
>
> AVX512 is a mess.
>
> Look at the Wikipedia page(*) and note that AVX512 means different
> things depending on the processor implementing it.
>
> So what does the poor software developer target?
>
> Or that it can for heat reasons cause CPU frequency reductions,
> meaning real world performance may not match theoretical - thus easier
> to just go with GPU's.
>
> The result is that most of the world is quite happily (at least for
> now) ignoring AVX512 and going with GPU's as necessary - particularly
> given the convenient libraries that Nvidia offers.
>
>> I compared a server with dual AMD EPYC 7H12 processors (128 cores) to
>> a server with quad Intel Xeon 8268 processors (96 cores).
>>
>> From what I've heard, the AMD processors run much hotter than the
>> Intel processors, too, so I imagine a FLOPS/Watt comparison would be
>> even less favorable to AMD.
> Spec sheets would indicate AMD runs hotter, but then again you
> benchmarked twice as many Intel processors.
>
> So, per the spec sheets for your processors above:
>
> AMD - 280W - 2 processors means system 560W
> Intel - 205W - 4 processors means system 820W
>
> (and then you also need to factor in purchase price).
>
>> An argument can be made that calculations that lend themselves to
>> vectorization should be done on GPUs, instead of the main processors,
>> but the last time I checked, GPU jobs are still memory limited, and
>> moving data in and out of GPU memory can still take time, so I can
>> see situations where for large amounts of data using CPUs would be
>> preferred over GPUs.
> AMD's latest chips support PCI 4 while Intel is still stuck on PCI 3,
> which may or may not mean a difference.
>
> But despite all of the above and the other replies, it is AMD who has
> been winning the HPC contracts of late, not Intel.
>
> * - https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

From deadline at eadline.org  Mon Jun 21 21:39:08 2021
From: deadline at eadline.org (Douglas Eadline)
Date: Mon, 21 Jun 2021 17:39:08 -0400
Subject: [Beowulf] AMD and AVX512
In-Reply-To: 
References: 
Message-ID: 

> On Wed, 16 Jun 2021 13:15:40 -0400, you wrote:
>
>> The answer given, and I'm not making this up, is that AMD listens to
>> their users and gives the users what they want, and right now they're
>> not hearing any demand for AVX512.
>>
>> Personally, I call BS on that one. I can't imagine anyone in the HPC
>> community saying "we'd like processors that offer only 1/2 the
>> floating point performance of Intel processors".
> I suspect that is marketing speak, which roughly translates to not
> that no one has asked for it, but rather requests haven't reached a
> threshold where the requests are viewed as significant enough.

Exactly, or "Right now cloud based servers are the biggest market.
These customers need as many cores/threads as possible on die with
"adequate" memory bandwidth. Oh, and they buy them by the boatload.
What did you say you do again?"

--
Doug

From pbisbal at pppl.gov  Tue Jun 22 16:01:57 2021
From: pbisbal at pppl.gov (Prentice Bisbal)
Date: Tue, 22 Jun 2021 12:01:57 -0400
Subject: [Beowulf] [External] Just a quick heads up: new Beowulf coming ...
In-Reply-To: 
References: 
Message-ID: 

I doubt it will impact us much. At worst, one or two people might find
this list and post inappropriate questions. At best, it will be
entertaining, similar to when someone posts a question about trees to
https://www.reddit.com/r/trees, which is about smoking marijuana, or a
marijuana enthusiast posts something to
https://www.reddit.com/r/MarijuanaEnthusiasts, which is about trees. I
would consider both of those links NSFW.

Prentice

On 6/20/21 8:13 AM, Andrew M.A. Cater wrote:
> The folks over at Devuan have chosen to name their next code release
> (based on upcoming Debian 11) beowulf. I don't think it will impact
> this list - but you never know.
>
> For anyone running Debian on HPC / in labs: the latest Debian point
> release 10.10 was yesterday. Debian 11 should be released in about six
> weeks on 31 July 2021.
> Thanks for the excellence in this list
>
> All best, as ever,
>
> Andy Cater
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

From pbisbal at pppl.gov  Tue Jun 22 16:08:50 2021
From: pbisbal at pppl.gov (Prentice Bisbal)
Date: Tue, 22 Jun 2021 12:08:50 -0400
Subject: [Beowulf] [External] Re: AMD and AVX512
In-Reply-To: 
References: 
Message-ID: <899a3ebb-3797-7317-4122-15c82ec184e9@pppl.gov>

Thanks for the resources. I will definitely read/watch them when I can
block out some time. I see he uses LAMMPS as one of his benchmarks. I
was considering adding LAMMPS to my testing regimen, since it's a code
I have familiarity with, and my own background is in chemistry.

Prentice

On 6/20/21 1:38 AM, John Hearns wrote:
> Regarding benchmarking real world codes on AMD, every year Martyn
> Guest presents a comprehensive set of benchmark studies to the UK
> Computing Insights Conference.
> I suggest a Sunday afternoon with the beverage of your choice is a
> good time to settle down and take time to read these or watch the
> presentation.
>
> 2019
> https://www.scd.stfc.ac.uk/SiteAssets/Pages/CIUK-2019-Presentations/Martyn_Guest.pdf
>
> 2020 Video session
> https://ukri.zoom.us/rec/share/ajvsxdJ8RM1wzpJtnlcypw4OyrZ9J27nqsfAG7eW49Ehq_Z5igat_7gj21Ge8gWu.78Cd9I1DNIjVViPV?startTime=1607008552000
>
> Skylake / Cascade Lake / AMD Rome
>
> The slides for 2020 do exist - as I remember all the slides from all
> talks are grouped together, but I cannot find them.
> Watch the video - it is an excellent presentation.
>
> On Sat, 19 Jun 2021 at 16:49, Gerald Henriksen wrote:
>
> On Wed, 16 Jun 2021 13:15:40 -0400, you wrote:
>
> >The answer given, and I'm not making this up, is that AMD listens to
> >their users and gives the users what they want, and right now they're
> >not hearing any demand for AVX512.
> >
> >Personally, I call BS on that one. I can't imagine anyone in the HPC
> >community saying "we'd like processors that offer only 1/2 the
> >floating point performance of Intel processors".
>
> I suspect that is marketing speak, which roughly translates to not
> that no one has asked for it, but rather requests haven't reached a
> threshold where the requests are viewed as significant enough.
>
> >Sure, AMD can offer more cores, but with only AVX2, you'd need twice
> >as many cores as Intel processors, all other things being equal.
>
> But of course all other things aren't equal.
>
> AVX512 is a mess.
>
> Look at the Wikipedia page(*) and note that AVX512 means different
> things depending on the processor implementing it.
>
> So what does the poor software developer target?
>
> Or that it can for heat reasons cause CPU frequency reductions,
> meaning real world performance may not match theoretical - thus easier
> to just go with GPU's.
>
> The result is that most of the world is quite happily (at least for
> now) ignoring AVX512 and going with GPU's as necessary - particularly
> given the convenient libraries that Nvidia offers.
>
> >I compared a server with dual AMD EPYC 7H12 processors (128 cores) to
> >a server with quad Intel Xeon 8268 processors (96 cores).
> >
> >From what I've heard, the AMD processors run much hotter than the
> >Intel processors, too, so I imagine a FLOPS/Watt comparison would be
> >even less favorable to AMD.
>
> Spec sheets would indicate AMD runs hotter, but then again you
> benchmarked twice as many Intel processors.
> So, per the spec sheets for your processors above:
>
> AMD - 280W - 2 processors means system 560W
> Intel - 205W - 4 processors means system 820W
>
> (and then you also need to factor in purchase price).
>
> >An argument can be made that calculations that lend themselves to
> >vectorization should be done on GPUs, instead of the main processors,
> >but the last time I checked, GPU jobs are still memory limited, and
> >moving data in and out of GPU memory can still take time, so I can
> >see situations where for large amounts of data using CPUs would be
> >preferred over GPUs.
>
> AMD's latest chips support PCI 4 while Intel is still stuck on PCI 3,
> which may or may not mean a difference.
>
> But despite all of the above and the other replies, it is AMD who has
> been winning the HPC contracts of late, not Intel.
>
> * - https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From peter.st.john at gmail.com  Tue Jun 22 18:35:11 2021
From: peter.st.john at gmail.com (Peter St. John)
Date: Tue, 22 Jun 2021 14:35:11 -0400
Subject: [Beowulf] [External] Just a quick heads up: new Beowulf coming ...
In-Reply-To: 
References: 
Message-ID: 

How confusing, I thought you were talking about *trees*:
https://en.wikipedia.org/wiki/Tree_(data_structure) :-)

On Tue, Jun 22, 2021 at 12:02 PM Prentice Bisbal via Beowulf <
beowulf at beowulf.org> wrote:

> I doubt it will impact us much.
> At worst, one or two people might find
> this list and post inappropriate questions. At best, it will be
> entertaining, similar to when someone posts a question about trees to
> https://www.reddit.com/r/trees, which is about smoking marijuana, or a
> marijuana enthusiast posts something to
> https://www.reddit.com/r/MarijuanaEnthusiasts, which is about trees. I
> would consider both of those links NSFW.
>
> Prentice
>
> On 6/20/21 8:13 AM, Andrew M.A. Cater wrote:
> > The folks over at Devuan have chosen to name their next code release
> > (based on upcoming Debian 11) beowulf. I don't think it will impact
> > this list - but you never know.
> >
> > For anyone running Debian on HPC / in labs: the latest Debian point
> > release 10.10 was yesterday. Debian 11 should be released in about
> > six weeks on 31 July 2021.
> >
> > Thanks for the excellence in this list
> >
> > All best, as ever,
> >
> > Andy Cater
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> > To change your subscription (digest mode or unsubscribe) visit
> > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From james.p.lux at jpl.nasa.gov  Wed Jun 23 00:11:57 2021
From: james.p.lux at jpl.nasa.gov (Lux, Jim (US 7140))
Date: Wed, 23 Jun 2021 00:11:57 +0000
Subject: [Beowulf] [EXTERNAL] Re: AMD and AVX512
In-Reply-To: <5fad1acd-31d1-b624-34cd-9e53742c251b@gmail.com>
References: <61925435-be88-efec-dfdc-8895f3e79616@gmail.com>
 <0f1a31ff-b922-1fad-9dbf-9d87ca9c44ca@gmail.com>
 <5fad1acd-31d1-b624-34cd-9e53742c251b@gmail.com>
Message-ID: <319DC392-FDB0-4789-93B2-D5C7370681FD@jpl.caltech.edu>

From: Beowulf on behalf of Joe Landman
Date: Monday, June 21, 2021 at 6:46 AM
To: Jonathan Engwall
Cc: "beowulf at beowulf.org"
Subject: [EXTERNAL] Re: [Beowulf] AMD and AVX512

On 6/21/21 9:20 AM, Jonathan Engwall wrote:
I have followed this thinking "square peg, round hole."
You have got it again, Joe. Compilers are your problem.

To date, I don't know that *compilers* pay much attention to things like
IO (that's buried in some library call no doubt).

>> Maybe, someday, we'll get a great HPC compiler for C/Fortran.

Wasn't the Fortran compiler for the 7600 highly optimized? Did vector
unrolling and all that. And those compilers for the FPS boxes? I think
you mean great HPC compilers for chips that are available and fast.

I think, too, that the comments about ARM vs x86 vs whatever are
interesting. We've moved a long way from clusters where the ethernet
interconnect was rate limiting, and the nodes were single core, single
memory, single disk (if any). When you start getting into processors
with hundreds of cores, or you start looking at "nanojoules/instruction"
(or is instruction even the right thing to be counting.. maybe it's
nanojoules/data operation - where that could be a read/write from
memory, disk, or interprocessor link).

Look at the (probably) specious claim that Tesla has the 5th fastest
supercomputer - articles are very light on details, but I think it's a
whole bunch of GPUs - but their "number of cores" isn't very big
compared to even #100 on the "Top 500" list.
However, it might well be that for Tesla's specific processing load,
that 5000 GPU cores *is* faster than most Top 500 clusters.

And, given the recent news about miners consuming all those joules -
maybe our metrics should be looking at more than raw speed.

Jim (who has not just 1, but TWO, ARM based clusters on the shelf behind
his desk.. Yes, Beaglebones, but it's an ARM, it's 4 nodes, and I use
various cluster tools to manipulate them - the connection fabric for one
is kind of slow (802.11))

-------------- next part --------------
An HTML attachment was scrubbed...
URL: