From harshscience777 at gmail.com Thu Jun 3 12:07:41 2021 From: harshscience777 at gmail.com (harsh_google lastname) Date: Thu, 3 Jun 2021 17:37:41 +0530 Subject: [Beowulf] Theoretical peak performance of DGX A100 Message-ID: I am calculating the theoretical peak (FP64) performance of the Nvidia DGX A100 system. Now, A100 datasheet lists FP64 performance to be 9.7 TFLOPS. Two AMD 7742 CPUs will give 128 cores x 2.25 GHz base clock x 16 FP64 ops / cycle = 4.6 TFLOPS. This gives a total of 82.2 TFLOPS per DGX-A100. Here is my problem. For any system with DGX A100 on top500.org, numbers just don't add up. For eg: Selene has 560 DGX boxes, but its theoretical peak is listed as 79.2 PFLOPS, whereas I expect it should be 46 PFLOPS (ie 82.2 TFLOPS x560). The same is true for any other DGX based system listed on top500. What am I missing here? Thanks! Harsh Hemani -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlos.bederian at unc.edu.ar Thu Jun 3 12:20:17 2021 From: carlos.bederian at unc.edu.ar (=?UTF-8?Q?Carlos_Bederi=C3=A1n?=) Date: Thu, 3 Jun 2021 09:20:17 -0300 Subject: [Beowulf] Theoretical peak performance of DGX A100 In-Reply-To: References: Message-ID: A100 does 19.5 FP64 TFLOPS using tensor cores. On Thu, Jun 3, 2021 at 9:08 AM harsh_google lastname < harshscience777 at gmail.com> wrote: > I am calculating the theoretical peak (FP64) performance of the Nvidia DGX > A100 system. > > Now, A100 datasheet lists FP64 performance to be 9.7 TFLOPS. > Two AMD 7742 CPUs will give 128 cores x 2.25 GHz base clock x 16 FP64 ops > / cycle = 4.6 TFLOPS. > This gives a total of 82.2 TFLOPS per DGX-A100. > > Here is my problem. For any system with DGX A100 on top500.org, numbers > just don't add up. For eg: Selene has 560 DGX boxes, but its theoretical > peak is listed as 79.2 PFLOPS, whereas I expect it should be 46 PFLOPS (ie > 82.2 TFLOPS x560). The same is true for any other DGX based system listed > on top500. What am I missing here? 
> > Thanks! > > Harsh Hemani > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From harshscience777 at gmail.com Thu Jun 3 12:22:30 2021 From: harshscience777 at gmail.com (harsh_google lastname) Date: Thu, 3 Jun 2021 17:52:30 +0530 Subject: [Beowulf] Theoretical peak performance of DGX A100 In-Reply-To: References: Message-ID: But that would bring the theoretical performance to 160 TFLOPS per box, which also doesn't match! On Thu, Jun 3, 2021, 5:50 PM Carlos Bederián wrote: > A100 does 19.5 FP64 TFLOPS using tensor cores. > > On Thu, Jun 3, 2021 at 9:08 AM harsh_google lastname < > harshscience777 at gmail.com> wrote: > >> I am calculating the theoretical peak (FP64) performance of the Nvidia >> DGX A100 system. >> >> Now, A100 datasheet lists FP64 performance to be 9.7 TFLOPS. >> Two AMD 7742 CPUs will give 128 cores x 2.25 GHz base clock x 16 FP64 ops >> / cycle = 4.6 TFLOPS. >> This gives a total of 82.2 TFLOPS per DGX-A100. >> >> Here is my problem. For any system with DGX A100 on top500.org, numbers >> just don't add up. For eg: Selene has 560 DGX boxes, but its theoretical >> peak is listed as 79.2 PFLOPS, whereas I expect it should be 46 PFLOPS (ie >> 82.2 TFLOPS x560). The same is true for any other DGX based system listed >> on top500. What am I missing here? >> >> Thanks! >> >> Harsh Hemani >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf >> > -------------- next part -------------- An HTML attachment was scrubbed...
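The per-box arithmetic being debated here can be laid out explicitly. This sketch uses only figures quoted in the thread (9.7 FP64 TFLOPS per A100 on the vector units, 19.5 TFLOPS via the FP64 tensor cores, 4.6 TFLOPS for the two EPYC 7742s, 8 GPUs per DGX, 560 boxes in Selene); the 79.2 PFLOPS value is the listed top500.org figure, and the point is that neither GPU number reproduces it exactly.

```python
# Theoretical peak (Rpeak) per DGX A100 box, using the figures from this thread.
GPU_FP64 = 9.7          # TFLOPS, A100 FP64 vector units (datasheet)
GPU_FP64_TENSOR = 19.5  # TFLOPS, A100 FP64 tensor cores
CPU_FP64 = 128 * 2.25 * 16 / 1000  # 2x EPYC 7742: cores * base GHz * FP64 ops/cycle -> TFLOPS

def box_peak(gpu_tflops, gpus_per_box=8):
    # One DGX A100: eight GPUs plus the two host CPUs.
    return gpus_per_box * gpu_tflops + CPU_FP64

SELENE_BOXES = 560
print(f"CPU contribution per box: {CPU_FP64:.2f} TFLOPS")
print(f"per box (vector FP64): {box_peak(GPU_FP64):.1f} TFLOPS "
      f"-> {SELENE_BOXES * box_peak(GPU_FP64) / 1000:.1f} PFLOPS for 560 boxes")
print(f"per box (tensor FP64): {box_peak(GPU_FP64_TENSOR):.1f} TFLOPS "
      f"-> {SELENE_BOXES * box_peak(GPU_FP64_TENSOR) / 1000:.1f} PFLOPS for 560 boxes")
print("Selene listed Rpeak: 79.2 PFLOPS  # between the two, matching neither")
```

The vector-unit figure gives about 46 PFLOPS and the tensor-core figure about 90 PFLOPS, bracketing the listed 79.2 PFLOPS; how the Top500 submission counted the hardware is not resolved in this thread.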
URL: From carlos.bederian at unc.edu.ar Thu Jun 3 12:55:36 2021 From: carlos.bederian at unc.edu.ar (=?UTF-8?Q?Carlos_Bederi=C3=A1n?=) Date: Thu, 3 Jun 2021 09:55:36 -0300 Subject: [Beowulf] Theoretical peak performance of DGX A100 In-Reply-To: References: Message-ID: The Top500 has been listing wrong Rpeak values for most clusters for many years now, so I wouldn't dwell on it... Take a Skylake-based cluster like Frontera. Its listed Rpeak is 38,745.9 TFLOPS = 8008 nodes * 56 cores * 32 ops/cycle * 2.7GHz. But 2.7GHz is the regular base frequency, and to do 32 ops/cycle you need to use AVX-512. All-core AVX-512 frequencies for a Xeon 8280 are 1.8GHz base and 2.4GHz turbo, so the Rpeak is off by 11-33%. On Thu, Jun 3, 2021 at 9:22 AM harsh_google lastname < harshscience777 at gmail.com> wrote: > But that would bring the theoretical performance to 160 TFLOPS per box, > which also doesn't match! > > On Thu, Jun 3, 2021, 5:50 PM Carlos Bederián > wrote: > >> A100 does 19.5 FP64 TFLOPS using tensor cores. >> >> On Thu, Jun 3, 2021 at 9:08 AM harsh_google lastname < >> harshscience777 at gmail.com> wrote: >> >>> I am calculating the theoretical peak (FP64) performance of the Nvidia >>> DGX A100 system. >>> >>> Now, A100 datasheet lists FP64 performance to be 9.7 TFLOPS. >>> Two AMD 7742 CPUs will give 128 cores x 2.25 GHz base clock x 16 FP64 >>> ops / cycle = 4.6 TFLOPS. >>> This gives a total of 82.2 TFLOPS per DGX-A100. >>> >>> Here is my problem. For any system with DGX A100 on top500.org, numbers >>> just don't add up. For eg: Selene has 560 DGX boxes, but its theoretical >>> peak is listed as 79.2 PFLOPS, whereas I expect it should be 46 PFLOPS (ie >>> 82.2 TFLOPS x560). The same is true for any other DGX based system listed >>> on top500. What am I missing here? >>> >>> Thanks!
>>> >>> Harsh Hemani >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >>> To change your subscription (digest mode or unsubscribe) visit >>> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From harshscience777 at gmail.com Thu Jun 3 13:29:50 2021 From: harshscience777 at gmail.com (harsh_google lastname) Date: Thu, 3 Jun 2021 18:59:50 +0530 Subject: [Beowulf] Theoretical peak performance of DGX A100 In-Reply-To: References: Message-ID: Cool, thanks! On Thu, Jun 3, 2021, 6:25 PM Carlos Bederi?n wrote: > The Top500 has been listing wrong Rpeak values for most clusters for many > years now, so I wouldn't dwell on it... > > Take a Skylake-based cluster like Frontera. Its listed Rpeak is 38,745.9 > TFLOPS = 8008 nodes * 56 cores * 32 ops/cycle * 2.7GHz. > But 2.7GHz is the regular base frequency, and to do 32 ops/cycle you need > to use AVX-512. All-core AVX-512 frequencies for a Xeon 8280 are 1.8GHz > base and 2.4GHz turbo, so the Rpeak is off by 12-33%. > > On Thu, Jun 3, 2021 at 9:22 AM harsh_google lastname < > harshscience777 at gmail.com> wrote: > >> But that wouls bring the theoretical performance to 160 TFLOPS per box, >> which also doesn't match! >> >> On Thu, Jun 3, 2021, 5:50 PM Carlos Bederi?n >> wrote: >> >>> A100 does 19.5 FP64 TFLOPS using tensor cores. >>> >>> On Thu, Jun 3, 2021 at 9:08 AM harsh_google lastname < >>> harshscience777 at gmail.com> wrote: >>> >>>> I am calculating the theoretical peak (FP64) performance of the Nvidia >>>> DGX A100 system. >>>> >>>> Now, A100 datasheet lists FP64 performance to be 9.7 TFLOPS. >>>> Two AMD 7742 CPUs will give 128 cores x 2.25 GHz base clock x 16 FP64 >>>> ops / cycle = 4.6 TFLOPS. >>>> This gives a total of 82.2 TFLOPS per DGX-A100. >>>> >>>> Here is my problem. 
For any system with DGX A100 on top500.org, >>>> numbers just don't add up. For eg: Selene has 560 DGX boxes, but its >>>> theoretical peak is listed as 79.2 PFLOPS, whereas I expect it should be 46 >>>> PFLOPS (ie 82.2 TFLOPS x560). The same is true for any other DGX based >>>> system listed on top500. What am I missing here? >>>> >>>> Thanks! >>>> >>>> Harsh Hemani >>>> _______________________________________________ >>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >>>> Computing >>>> To change your subscription (digest mode or unsubscribe) visit >>>> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf >>>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdidomenico4 at gmail.com Mon Jun 14 16:38:50 2021 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Mon, 14 Jun 2021 12:38:50 -0400 Subject: [Beowulf] odd vlan issue Message-ID: i got roped into troubleshooting an odd network issue. we have a mix of cisco (mostly nexus) gear spread over our facility. on one particular vlan it's operating as if it's a hub instead of a switch. has anyone seen this before? i bounced around the net, and my issue appears to be "unicast flooding", but the chief recommended fix seems to be adjusting the ARP/CAM timeouts, and that hasn't worked. i'm sure this is a loaded question given none of you can see the network, but i'm grasping at straws to come up with a reason why this is happening. what's weird is we have two dozen other vlans, which aren't affected. i cannot locate a difference in the config on the switches. any thoughts are appreciated at this point. From lindahl at pbm.com Wed Jun 16 04:48:21 2021 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 16 Jun 2021 04:48:21 +0000 Subject: [Beowulf] odd vlan issue In-Reply-To: References: Message-ID: <20210616044821.GA10207@rd.bx9.net> On Mon, Jun 14, 2021 at 12:38:50PM -0400, Michael Di Domenico wrote: > i got roped into troubleshooting an odd network issue.
we have a mix > of cisco (mostly nexus) gear spread over our facility. on one > particular vlan it's operating as if it's a hub instead of switch. I have run into this situation when I have servers that have incoming UDP traffic and never talk or do TCP. The switches have no idea where the server is, so they broadcast all of the incoming packets. An ARP reply or connecting with TCP tells the switch which port to use. -- greg From mdidomenico4 at gmail.com Wed Jun 16 11:38:27 2021 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed, 16 Jun 2021 07:38:27 -0400 Subject: [Beowulf] odd vlan issue In-Reply-To: <20210616044821.GA10207@rd.bx9.net> References: <20210616044821.GA10207@rd.bx9.net> Message-ID: thanks after two days of digging, i think i finally figured out that we have a layer 2 routing problem. i'm not the network guy so i'm not digging into it deeper, but it appears that there are either malfunctioning LACP trunks or more likely a misconfigured VPC connection inside the menagerie of switches the network team built. the network is too complicated to describe here, but the base issue is that there are two switches 'supposedly' operating jointly, but don't seem to be sharing their CAM/ARP tables correctly. for whatever reason packets get duped to the switch that does not have the destination machine and since there's no arp/cam entry the switch just blasts the packet out all the ports. its not clear why the packets are being sent to layer 2 devices where the device doesn't exist, but it's clear there's something broken in the spanning tree database. it's also not clear why it only affects one of the vlans and not all. but again, not the network guy... and for once it is the network... :) On Wed, Jun 16, 2021 at 12:48 AM Greg Lindahl wrote: > > On Mon, Jun 14, 2021 at 12:38:50PM -0400, Michael Di Domenico wrote: > > i got roped into troubleshooting an odd network issue. we have a mix > > of cisco (mostly nexus) gear spread over our facility. 
on one > > particular vlan it's operating as if it's a hub instead of switch. > > I have run into this situation when I have servers that have incoming > UDP traffic and never talk or do TCP. The switches have no idea where the > server is, so they broadcast all of the incoming packets. > > An ARP reply or connecting with TCP tells the switch which port to use. > > -- greg > > From rgt at wi.mit.edu Wed Jun 16 13:25:18 2021 From: rgt at wi.mit.edu (Robert Taylor) Date: Wed, 16 Jun 2021 09:25:18 -0400 Subject: [Beowulf] odd vlan issue In-Reply-To: References: <20210616044821.GA10207@rd.bx9.net> Message-ID: I've seen it happen with udp; in my case it was a syslog server that hardly ever "spoke", so eventually its MAC address disappears from the CAM table long before it leaves the ARP table, and the UDP packets get flooded hoping to find the server. I've also seen this happen in an HA environment, where it's possible that traffic can take an asymmetric path. It happens enough that Cisco wrote an article about it years ago. https://www.cisco.com/c/en/us/support/docs/switches/catalyst-6000-series-switches/23563-143.html Hope this helps. rgt On Wed, Jun 16, 2021 at 7:38 AM Michael Di Domenico wrote: > thanks after two days of digging, i think i finally figured out that > we have a layer 2 routing problem. i'm not the network guy so i'm not > digging into it deeper, but it appears that there are either > malfunctioning LACP trunks or more likely a misconfigured VPC > connection inside the menagerie of switches the network team built. > > the network is too complicated to describe here, but the base issue is > that there are two switches 'supposedly' operating jointly, but don't > seem to be sharing their CAM/ARP tables correctly. for whatever > reason packets get duped to the switch that does not have the > destination machine and since there's no arp/cam entry the switch just > blasts the packet out all the ports.
> > its not clear why the packets are being sent to layer 2 devices where > the device doesn't exist, but it's clear there's something broken in > the spanning tree database. it's also not clear why it only affects > one of the vlans and not all. > > but again, not the network guy... and for once it is the network... :) > > > > > > > > > > On Wed, Jun 16, 2021 at 12:48 AM Greg Lindahl wrote: > > > > On Mon, Jun 14, 2021 at 12:38:50PM -0400, Michael Di Domenico wrote: > > > i got roped into troubleshooting an odd network issue. we have a mix > > > of cisco (mostly nexus) gear spread over our facility. on one > > > particular vlan it's operating as if it's a hub instead of switch. > > > > I have run into this situation when I have servers that have incoming > > UDP traffic and never talk or do TCP. The switches have no idea where the > > server is, so they broadcast all of the incoming packets. > > > > An ARP reply or connecting with TCP tells the switch which port to use. > > > > -- greg > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdidomenico4 at gmail.com Wed Jun 16 13:41:24 2021 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed, 16 Jun 2021 09:41:24 -0400 Subject: [Beowulf] odd vlan issue In-Reply-To: References: <20210616044821.GA10207@rd.bx9.net> Message-ID: On Wed, Jun 16, 2021 at 9:25 AM Robert Taylor wrote: > > I?ve seen it happen with udp, in my case it was a syslog server, that hardly ever ?spoke? so eventually it?s MAC address disappears from the Cam table long before it leaves the arp table and the UDP packets get flooded hoping to find the server. 
> > I've also seen this happen in an HA environment, where it's possible that traffic can take an asymmetric path. It happens enough that Cisco wrote an article about it years ago. > > https://www.cisco.com/c/en/us/support/docs/switches/catalyst-6000-series-switches/23563-143.html the udp scenario is definitely not what's happening. but the unicast flooding appears to be. the article linked is what led me to find the initial problem and prove the theory yesterday. the messed up asymmetric nature of our network is where the problem lies. but that's what happens when you give network guys an unlimited budget and they buy more/larger network gear than they can handle. From pbisbal at pppl.gov Wed Jun 16 17:15:40 2021 From: pbisbal at pppl.gov (Prentice Bisbal) Date: Wed, 16 Jun 2021 13:15:40 -0400 Subject: [Beowulf] AMD and AVX512 Message-ID: Did anyone else attend this webinar panel discussion with AMD hosted by HPCWire yesterday? It was titled "AMD HPC Solutions: Enabling Your Success in HPC" https://www.hpcwire.com/amd-hpc-solutions-enabling-your-success-in-hpc/ I attended it, and noticed there was no mention of AMD supporting AVX512, so during the question and answer portion of the program, I asked when AMD processors will support AVX512. The answer given, and I'm not making this up, is that AMD listens to their users and gives the users what they want, and right now they're not hearing any demand for AVX512. Personally, I call BS on that one. I can't imagine anyone in the HPC community saying "we'd like processors that offer only 1/2 the floating point performance of Intel processors". Sure, AMD can offer more cores, but with only AVX2, you'd need twice as many cores as Intel processors, all other things being equal. Last fall I evaluated potential new cluster nodes for a large cluster purchase using the HPL benchmark. I compared a server with dual AMD EPYC 7H12 processors (128 cores) to a server with quad Intel Xeon 8268 processors (96 cores).
I measured 5,389 GFLOPS for the Xeon 8268, and only 3,446 GFLOPS for the AMD 7H12. That's a LINPACK score that is only 64% of the Xeon 8268 system's, despite the AMD system having 33% more cores. From what I've heard, the AMD processors run much hotter than the Intel processors, too, so I imagine a FLOPS/Watt comparison would be even less favorable to AMD. An argument can be made that calculations that lend themselves to vectorization should be done on GPUs instead of the main processors, but the last time I checked, GPU jobs are still memory limited, and moving data in and out of GPU memory can still take time, so I can see situations where, for large amounts of data, using CPUs would be preferred over GPUs. Your thoughts? -- Prentice From carlos.bederian at unc.edu.ar Wed Jun 16 17:52:59 2021 From: carlos.bederian at unc.edu.ar (=?UTF-8?Q?Carlos_Bederi=C3=A1n?=) Date: Wed, 16 Jun 2021 14:52:59 -0300 Subject: [Beowulf] AMD and AVX512 In-Reply-To: References: Message-ID: On Wed, Jun 16, 2021 at 2:16 PM Prentice Bisbal via Beowulf < beowulf at beowulf.org> wrote: > Last fall I evaluated potential new cluster nodes for a large cluster > purchase using the HPL benchmark. I compared a server with dual AMD EPYC > 7H12 processors (128 cores) to a server with quad Intel Xeon 8268 > processors (96 cores). I measured 5,389 GFLOPS for the Xeon 8268, and > only 3,446 GFLOPS for the AMD 7H12. That's a LINPACK score that is only > 64% of the Xeon 8268 system's, despite having 33% more cores. > Most of the workloads we see on our clusters have arithmetic intensities much lower than LINPACK's, so all that extra compute gets starved by lack of memory bandwidth. -------------- next part -------------- An HTML attachment was scrubbed...
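Carlos's point about arithmetic intensity is the standard roofline argument: attainable performance is bounded by min(peak FLOPS, arithmetic intensity x memory bandwidth). The sketch below uses round, hypothetical numbers (a 4 TFLOPS node with 300 GB/s of memory bandwidth) purely for illustration; it is not a model of any machine named in the thread.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Naive roofline model: compute is only useful if memory can feed it."""
    return min(peak_gflops, flops_per_byte * bandwidth_gbs)

PEAK = 4000.0  # GFLOPS, hypothetical dual-socket node
BW = 300.0     # GB/s, hypothetical aggregate memory bandwidth

for ai in (0.1, 1.0, 10.0, 100.0):
    g = attainable_gflops(PEAK, BW, ai)
    print(f"AI={ai:>6.1f} FLOP/byte -> {g:8.1f} GFLOPS ({100 * g / PEAK:5.1f}% of peak)")
# Below the ridge point (PEAK/BW ~ 13.3 FLOP/byte here), doubling the vector
# width raises PEAK but leaves attainable GFLOPS completely unchanged.
```

HPL sits far to the right of the ridge point, which is why it rewards wide vector units; most production workloads sit to the left, which is Carlos's point.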
URL: From mdidomenico4 at gmail.com Wed Jun 16 17:53:44 2021 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed, 16 Jun 2021 13:53:44 -0400 Subject: [Beowulf] AMD and AVX512 In-Reply-To: References: Message-ID: AMD's argument is a little unsalesmanlike, but i'd buy it as an explanation. avx512 uptake isn't as profound as intel would lead you to believe, and the push to GPUs for vectors will probably remove the need for most of these high end vector units sooner or later (but that's my opinion, some chip changes need to happen first) i also think your hpl numbers on the amd chip are low, you should be >4000 which would put you closer to intel, but intel will still edge out just because it has a higher base clock. On Wed, Jun 16, 2021 at 1:15 PM Prentice Bisbal via Beowulf wrote: > > Did anyone else attend this webinar panel discussion with AMD hosted by > HPCWire yesterday? It was titled "AMD HPC Solutions: Enabling Your > Success in HPC" > > https://www.hpcwire.com/amd-hpc-solutions-enabling-your-success-in-hpc/ > > I attended it, and noticed there was no mention of AMD supporting > AVX512, so during the question and answer portion of the program, I > asked when AMD processors will support AVX512. The answer given, and I'm > not making this up, is that AMD listens to their users and gives the > users what they want, and right now they're not hearing any demand for > AVX512. > > Personally, I call BS on that one. I can't imagine anyone in the HPC > community saying "we'd like processors that offer only 1/2 the floating > point performance of Intel processors". Sure, AMD can offer more cores, > but with only AVX2, you'd need twice as many cores as Intel processors, > all other things being equal. > > Last fall I evaluated potential new cluster nodes for a large cluster > purchase using the HPL benchmark. I compared a server with dual AMD EPYC > 7H12 processors (128 cores) to a server with quad Intel Xeon 8268 > processors (96 cores).
I measured 5,389 GFLOPS for the Xeon 8268, and > only 3,446.00 GFLOPS for the AMD 7H12. That's LINPACK score that only > 64% of the Xeon 8268 system, despite having 33% more cores. > > From what I've heard, the AMD processors run much hotter than the Intel > processors, too, so I imagine a FLOPS/Watt comparison would be even less > favorable to AMD. > > An argument can be made that for calculations that lend themselves to > vectorization should be done on GPUs, instead of the main processors but > the last time I checked, GPU jobs are still memory is limited, and > moving data in and out of GPU memory can still take time, so I can see > situations where for large amounts of data using CPUs would be preferred > over GPUs. > > Your thoughts? > > -- > Prentice > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf From e.scott.atchley at gmail.com Wed Jun 16 18:23:46 2021 From: e.scott.atchley at gmail.com (Scott Atchley) Date: Wed, 16 Jun 2021 14:23:46 -0400 Subject: [Beowulf] AMD and AVX512 In-Reply-To: References: Message-ID: On Wed, Jun 16, 2021 at 1:15 PM Prentice Bisbal via Beowulf < beowulf at beowulf.org> wrote: > Did anyone else attend this webinar panel discussion with AMD hosted by > HPCWire yesterday? It was titled "AMD HPC Solutions: Enabling Your > Success in HPC" > > https://www.hpcwire.com/amd-hpc-solutions-enabling-your-success-in-hpc/ > > I attended it, and noticed there was no mention of AMD supporting > AVX512, so during the question and answer portion of the program, I > asked when AMD processors will support AVX512. The answer given, and I'm > not making this up, is that AMD listens to their users and gives the > users what they want, and right now they're not hearing any demand for > AVX512. > > Personally, I call BS on that one. 
I can't imagine anyone in the HPC > community saying "we'd like processors that offer only 1/2 the floating > point performance of Intel processors". Sure, AMD can offer more cores, > but with only AVX2, you'd need twice as many cores as Intel processors, > all other things being equal. > > Last fall I evaluated potential new cluster nodes for a large cluster > purchase using the HPL benchmark. I compared a server with dual AMD EPYC > 7H12 processors (128) cores to a server with quad Intel Xeon 8268 > processors (96 cores). I measured 5,389 GFLOPS for the Xeon 8268, and > only 3,446.00 GFLOPS for the AMD 7H12. That's LINPACK score that only > 64% of the Xeon 8268 system, despite having 33% more cores. > > From what I've heard, the AMD processors run much hotter than the Intel > processors, too, so I imagine a FLOPS/Watt comparison would be even less > favorable to AMD. > > An argument can be made that for calculations that lend themselves to > vectorization should be done on GPUs, instead of the main processors but > the last time I checked, GPU jobs are still memory is limited, and > moving data in and out of GPU memory can still take time, so I can see > situations where for large amounts of data using CPUs would be preferred > over GPUs. > > Your thoughts? > > -- > Prentice > AMD has studied this quite a bit in DOE's FastForward-2 and PathForward. I think Carlos' comment is on track. Having a unit that cannot be fed data quick enough is pointless. It is application dependent. If your working set fits in cache, then the vector units work well. If not, you have to move data which stalls compute pipelines. NERSC saw only a 10% increase in performance when moving from low core count Xeon CPUs with AVX2 to Knights Landing with many cores and AVX-512 when it should have seen an order of magnitude increase. 
Although Knights Landing had MCDRAM (Micron's not-quite HBM), other constraints limited performance (e.g., lack of enough memory references in flight, coherence traffic). Fujitsu's ARM64 chip with 512b SVE in Fugaku does much better than Xeon with AVX-512 (or Knights Landing) because of the High Bandwidth Memory (HBM) attached and I assume a larger number of memory references in flight. The downside is the lack of memory capacity (only 32 GB per node). This shows that it is possible to get more performance with a CPU with a 512b vector engine. That said, it is not clear that even this CPU design can extract the most from the memory bandwidth. If you look at the increase in memory bandwidth from Summit to Fugaku, one would expect performance on real apps to increase by that amount as well. From the presentations that I have seen, that is not always the case. For some apps, the GPU architecture, with its coherence on demand rather than with every operation, can extract more performance. AMD will add 512b vectors if/when it makes sense on real apps. -------------- next part -------------- An HTML attachment was scrubbed... URL: From pbisbal at pppl.gov Wed Jun 16 20:39:39 2021 From: pbisbal at pppl.gov (Prentice Bisbal) Date: Wed, 16 Jun 2021 16:39:39 -0400 Subject: [Beowulf] [External] Re: AMD and AVX512 In-Reply-To: References: Message-ID: <24cf21a9-57de-b720-18b1-82da7c75e53b@pppl.gov> Scott (and Michael and Carlos), Thanks for your excellent feedback. That's the kind of enlightening feedback I was looking for. Interesting that the HBM on Fugaku exceeds the needs of the processor. Prentice On 6/16/21 2:23 PM, Scott Atchley wrote: > On Wed, Jun 16, 2021 at 1:15 PM Prentice Bisbal via Beowulf > > wrote: > > Did anyone else attend this webinar panel discussion with AMD > hosted by > HPCWire yesterday? 
It was titled "AMD HPC Solutions: Enabling Your > Success in HPC" > > https://www.hpcwire.com/amd-hpc-solutions-enabling-your-success-in-hpc/ > > > I attended it, and noticed there was no mention of AMD supporting > AVX512, so during the question and answer portion of the program, I > asked when AMD processors will support AVX512. The answer given, > and I'm > not making this up, is that AMD listens to their users and gives the > users what they want, and right now they're not hearing any demand > for > AVX512. > > Personally, I call BS on that one. I can't imagine anyone in the HPC > community saying "we'd like processors that offer only 1/2 the > floating > point performance of Intel processors". Sure, AMD can offer more > cores, > but with only AVX2, you'd need twice as many cores as Intel > processors, > all other things being equal. > > Last fall I evaluated potential new cluster nodes for a large cluster > purchase using the HPL benchmark. I compared a server with dual > AMD EPYC > 7H12 processors (128) cores to a server with quad Intel Xeon 8268 > processors (96 cores). I measured 5,389 GFLOPS for the Xeon 8268, and > only 3,446.00 GFLOPS for the AMD 7H12. That's LINPACK score that only > 64% of the Xeon 8268 system, despite having 33% more cores. > > ?From what I've heard, the AMD processors run much hotter than the > Intel > processors, too, so I imagine a FLOPS/Watt comparison would be > even less > favorable to AMD. > > An argument can be made that for calculations that lend themselves to > vectorization should be done on GPUs, instead of the main > processors but > the last time I checked, GPU jobs are still memory is limited, and > moving data in and out of GPU memory can still take time, so I can > see > situations where for large amounts of data using CPUs would be > preferred > over GPUs. > > Your thoughts? > > -- > Prentice > > > AMD has studied this quite a bit in DOE's FastForward-2 and > PathForward. I think Carlos' comment is on track. 
Having a unit that > cannot be fed data quick enough is pointless. It is application > dependent. If your working set fits in cache, then the vector units > work well. If not, you have to move data which stalls compute > pipelines. NERSC saw only a 10% increase in performance when moving > from low core count Xeon CPUs with AVX2 to Knights Landing with many > cores and AVX-512 when it should have seen an order of magnitude > increase. Although Knights Landing had MCDRAM (Micron's not-quite > HBM), other constraints limited performance (e.g., lack of enough > memory references in flight, coherence traffic). > > Fujitsu's ARM64 chip with 512b SVE in Fugaku does much better than > Xeon with AVX-512 (or Knights Landing) because of the High Bandwidth > Memory (HBM) attached and I assume a larger number of memory > references in flight. The downside is the lack of memory capacity > (only 32 GB per node). This shows that it is possible to get more > performance with a CPU with a 512b vector engine. That said, it is not > clear that even this CPU design can extract the most from the memory > bandwidth. If you look at the increase in memory bandwidth from Summit > to Fugaku, one would expect performance on real apps to increase by > that amount as well. From the presentations that I have seen, that is > not always the case. For some apps, the GPU architecture, with its > coherence on demand rather than with every operation, can extract more > performance. > > AMD will add 512b vectors if/when it makes sense on real apps. -------------- next part -------------- An HTML attachment was scrubbed... 
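The HPL numbers discussed in this thread can be sanity-checked against nominal peaks computed as cores x clock x FP64 ops/cycle. This is a sketch, not a benchmark: the ops/cycle values (32 for AVX-512 with two FMA units, 16 for AVX2 on Zen 2) and the base clocks (2.9 GHz for the Xeon 8268, 2.6 GHz for the EPYC 7H12) are assumed spec-sheet figures, not numbers taken from the thread.

```python
def node_peak_gflops(cores, ghz, fp64_ops_per_cycle):
    # Nominal Rpeak: ignores AVX-512 downclocking, turbo, and memory limits.
    return cores * ghz * fp64_ops_per_cycle

# Assumed spec values (not from the thread): base clocks and FP64 ops/cycle.
xeon = node_peak_gflops(96, 2.9, 32)   # quad Xeon 8268; AVX-512: 2 FMA units x 8 lanes x 2 ops
epyc = node_peak_gflops(128, 2.6, 16)  # dual EPYC 7H12; AVX2: 2 FMA units x 4 lanes x 2 ops

print(f"Xeon 8268 node nominal peak: {xeon:7.1f} GFLOPS; HPL 5,389 -> {100 * 5389 / xeon:.0f}% efficiency")
print(f"EPYC 7H12 node nominal peak: {epyc:7.1f} GFLOPS; HPL 3,446 -> {100 * 3446 / epyc:.0f}% efficiency")
# Against nominal (non-downclocked) peaks the two systems land in a broadly
# similar efficiency range, since real AVX-512 all-core clocks sit well below 2.9 GHz.
```

On this arithmetic the measured gap roughly tracks the nominal peak ratio, which is consistent with Michael's view that the published AMD figure looks low for the part rather than the architecture being at fault.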
URL: From pbisbal at pppl.gov Wed Jun 16 20:46:49 2021 From: pbisbal at pppl.gov (Prentice Bisbal) Date: Wed, 16 Jun 2021 16:46:49 -0400 Subject: [Beowulf] [External] Re: AMD and AVX512 In-Reply-To: References: Message-ID: <6cf3edb7-fe83-5c34-0101-64a3eec70a62@pppl.gov> i also think your hpl numbers on the amd chip are low, you should be > 4000 which would put you closer to intel, but intel will still edge out just because it has a higher base clock. I think I could probably get better numbers out of the AMD chip now, too. I've done some testing since then; compiler and library choice can make a noticeable difference for the AMD processors. Unfortunately, I no longer have access to that 7H12 system to test again. Prentice On 6/16/21 1:53 PM, Michael Di Domenico wrote: > AMD's argument is a little unsalesmanlike, but i'd buy it as an > explanation. avx512 uptake isn't as profound as intel would lead you > to believe and the push to GPUs for vectors will probably remove the > need for most of these high end vector units sooner or later (but that's my > opinion, some chip changes need to happen first) > > i also think your hpl numbers on the amd chip are low, you should be >> 4000 which would put you closer to intel, but intel will still edge > out just because it has a higher base clock. > > On Wed, Jun 16, 2021 at 1:15 PM Prentice Bisbal via Beowulf > wrote: >> Did anyone else attend this webinar panel discussion with AMD hosted by >> HPCWire yesterday? It was titled "AMD HPC Solutions: Enabling Your >> Success in HPC" >> >> https://www.hpcwire.com/amd-hpc-solutions-enabling-your-success-in-hpc/ >> >> I attended it, and noticed there was no mention of AMD supporting >> AVX512, so during the question and answer portion of the program, I >> asked when AMD processors will support AVX512. The answer given, and I'm >> not making this up, is that AMD listens to their users and gives the >> users what they want, and right now they're not hearing any demand for >> AVX512.
>> >> Personally, I call BS on that one. I can't imagine anyone in the HPC >> community saying "we'd like processors that offer only 1/2 the floating >> point performance of Intel processors". Sure, AMD can offer more cores, >> but with only AVX2, you'd need twice as many cores as Intel processors, >> all other things being equal. >> >> Last fall I evaluated potential new cluster nodes for a large cluster >> purchase using the HPL benchmark. I compared a server with dual AMD EPYC >> 7H12 processors (128 cores) to a server with quad Intel Xeon 8268 >> processors (96 cores). I measured 5,389 GFLOPS for the Xeon 8268, and >> only 3,446 GFLOPS for the AMD 7H12. That's a LINPACK score that is only >> 64% of the Xeon 8268 system's, despite having 33% more cores. >> >> From what I've heard, the AMD processors run much hotter than the Intel >> processors, too, so I imagine a FLOPS/Watt comparison would be even less >> favorable to AMD. >> >> An argument can be made that calculations that lend themselves to >> vectorization should be done on GPUs instead of the main processors, but >> the last time I checked, GPU jobs are still memory-limited, and >> moving data in and out of GPU memory can still take time, so I can see >> situations where for large amounts of data using CPUs would be preferred >> over GPUs. >> >> Your thoughts? 
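The peak numbers behind this comparison are easy to reconstruct. A back-of-envelope sketch, assuming spec-sheet base clocks and per-core FLOPs per cycle (16 for AVX2 with two FMA units, 32 for AVX-512; turbo and AVX downclocking are ignored, so treat the results as rough):

```python
# Theoretical FP64 peak: sockets * cores * GHz * FLOPs/cycle/core.
def peak_gflops(sockets, cores, ghz, flops_per_cycle):
    return sockets * cores * ghz * flops_per_cycle

amd_peak = peak_gflops(2, 64, 2.6, 16)    # dual EPYC 7H12 -> ~5325 GFLOPS
intel_peak = peak_gflops(4, 24, 2.9, 32)  # quad Xeon 8268 -> ~8909 GFLOPS

# Measured HPL numbers quoted in the thread:
amd_hpl, intel_hpl = 3446.0, 5389.0
print(f"AMD  HPL efficiency: {amd_hpl / amd_peak:.0%}")      # ~65%
print(f"Intel HPL efficiency: {intel_hpl / intel_peak:.0%}")  # ~60%
```

On these assumptions both runs sit below the HPL efficiency a well-tuned system usually reaches, which supports the suggestion elsewhere in the thread that the AMD number should be above 4000 GFLOPS with better BLAS and compiler choices.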
>> >> -- >> Prentice >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf From sdm900 at gmail.com Thu Jun 17 02:53:04 2021 From: sdm900 at gmail.com (Stu Midgley) Date: Thu, 17 Jun 2021 10:53:04 +0800 Subject: [Beowulf] AMD and AVX512 In-Reply-To: References: Message-ID: I've told AMD brass that we need AVX512 many many times. I've also told them that we need more memory bandwidth and that adding dimms is not the answer. We don't need more capacity - just more bandwidth. We have a stack load of KNL systems and have invested heavily in AVX512 (writing with intrinsics) and shifting those codes away from it would be considerable work. Bring on Sapphire Rapids :) On Thu, Jun 17, 2021 at 1:16 AM Prentice Bisbal via Beowulf < beowulf at beowulf.org> wrote: > Did anyone else attend this webinar panel discussion with AMD hosted by > HPCWire yesterday? It was titled "AMD HPC Solutions: Enabling Your > Success in HPC" > > https://www.hpcwire.com/amd-hpc-solutions-enabling-your-success-in-hpc/ > > I attended it, and noticed there was no mention of AMD supporting > AVX512, so during the question and answer portion of the program, I > asked when AMD processors will support AVX512. The answer given, and I'm > not making this up, is that AMD listens to their users and gives the > users what they want, and right now they're not hearing any demand for > AVX512. > > Personally, I call BS on that one. 
I can't imagine anyone in the HPC > community saying "we'd like processors that offer only 1/2 the floating > point performance of Intel processors". Sure, AMD can offer more cores, > but with only AVX2, you'd need twice as many cores as Intel processors, > all other things being equal. > > Last fall I evaluated potential new cluster nodes for a large cluster > purchase using the HPL benchmark. I compared a server with dual AMD EPYC > 7H12 processors (128) cores to a server with quad Intel Xeon 8268 > processors (96 cores). I measured 5,389 GFLOPS for the Xeon 8268, and > only 3,446.00 GFLOPS for the AMD 7H12. That's LINPACK score that only > 64% of the Xeon 8268 system, despite having 33% more cores. > > From what I've heard, the AMD processors run much hotter than the Intel > processors, too, so I imagine a FLOPS/Watt comparison would be even less > favorable to AMD. > > An argument can be made that for calculations that lend themselves to > vectorization should be done on GPUs, instead of the main processors but > the last time I checked, GPU jobs are still memory is limited, and > moving data in and out of GPU memory can still take time, so I can see > situations where for large amounts of data using CPUs would be preferred > over GPUs. > > Your thoughts? > > -- > Prentice > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -- Dr Stuart Midgley sdm900 at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ghenriks at gmail.com Sat Jun 19 15:49:06 2021 From: ghenriks at gmail.com (Gerald Henriksen) Date: Sat, 19 Jun 2021 11:49:06 -0400 Subject: [Beowulf] AMD and AVX512 In-Reply-To: References: Message-ID: On Wed, 16 Jun 2021 13:15:40 -0400, you wrote: >The answer given, and I'm >not making this up, is that AMD listens to their users and gives the >users what they want, and right now they're not hearing any demand for >AVX512. > >Personally, I call BS on that one. I can't imagine anyone in the HPC >community saying "we'd like processors that offer only 1/2 the floating >point performance of Intel processors". I suspect that is marketing speak, which roughly translates not to "no one has asked for it", but rather that requests haven't reached a threshold where they are viewed as significant enough. > Sure, AMD can offer more cores, >but with only AVX2, you'd need twice as many cores as Intel processors, >all other things being equal. But of course all other things aren't equal. AVX512 is a mess. Look at the Wikipedia page(*) and note that AVX512 means different things depending on the processor implementing it. So what does the poor software developer target? Or that, for heat reasons, it can cause CPU frequency reductions, meaning real world performance may not match theoretical - thus easier to just go with GPU's. The result is that most of the world is quite happily (at least for now) ignoring AVX512 and going with GPU's as necessary - particularly given the convenient libraries that Nvidia offers. > I compared a server with dual AMD EPYC >7H12 processors (128) > quad Intel Xeon 8268 >processors (96 cores). > From what I've heard, the AMD processors run much hotter than the Intel >processors, too, so I imagine a FLOPS/Watt comparison would be even less >favorable to AMD. Spec sheets would indicate AMD runs hotter, but then again you benchmarked twice as many Intel processors. 
So, per spec sheets for your processors above: AMD - 280W - 2 processors means system 560W Intel - 205W - 4 processors means system 820W (and then you also need to factor in purchase price). >An argument can be made that for calculations that lend themselves to >vectorization should be done on GPUs, instead of the main processors but >the last time I checked, GPU jobs are still memory is limited, and >moving data in and out of GPU memory can still take time, so I can see >situations where for large amounts of data using CPUs would be preferred >over GPUs. AMD's latest chips support PCI 4 while Intel is still stuck on PCI 3, which may or may not make a difference. But despite all of the above and the other replies, it is AMD who has been winning the HPC contracts of late, not Intel. * - https://en.wikipedia.org/wiki/Advanced_Vector_Extensions From tjrc at sanger.ac.uk Sun Jun 20 01:04:03 2021 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Sun, 20 Jun 2021 01:04:03 +0000 Subject: [Beowulf] AMD and AVX512 [EXT] In-Reply-To: References: Message-ID: I think that's a major, important point. Even if the whole of the HPC market were clamouring for it (which they're not, judging by this discussion) that's still a very small proportion of the worldwide CPU market. We have to remember that we in the HPC community are a niche market. I recall at SC a couple of years ago someone from Intel pointing out that mobile devices and IoT were what was driving IT technology; the volume dwarfs everything else. Hence the drive to NVRAM - not to make things faster for HPC (although that was the benefit being presented through that talk), but the fundamental driver was to increase phone battery life. 
Tim -- Tim Cutts Head of Scientific Computing Wellcome Sanger Institute On 19 Jun 2021, at 16:49, Gerald Henriksen > wrote: I suspect that is marketing speak, which roughly translates to not that no one has asked for it, but rather requests haven't reached a threshold where the requests are viewed as significant enough. -- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hearnsj at gmail.com Sun Jun 20 05:38:06 2021 From: hearnsj at gmail.com (John Hearns) Date: Sun, 20 Jun 2021 06:38:06 +0100 Subject: [Beowulf] AMD and AVX512 In-Reply-To: References: Message-ID: Regarding benchmarking real world codes on AMD , every year Martyn Guest presents a comprehensive set of benchmark studies to the UK Computing Insights Conference. I suggest a Sunday afternoon with the beverage of your choice is a good time to settle down and take time to read these or watch the presentation. 2019 https://www.scd.stfc.ac.uk/SiteAssets/Pages/CIUK-2019-Presentations/Martyn_Guest.pdf 2020 Video session https://ukri.zoom.us/rec/share/ajvsxdJ8RM1wzpJtnlcypw4OyrZ9J27nqsfAG7eW49Ehq_Z5igat_7gj21Ge8gWu.78Cd9I1DNIjVViPV?startTime=1607008552000 Skylake / Cascade Lake / AMD Rome The slides for 2020 do exist - as I remember all the slides from all talks are grouped together, but I cannot find them. Watch the video - it is an excellent presentation. On Sat, 19 Jun 2021 at 16:49, Gerald Henriksen wrote: > On Wed, 16 Jun 2021 13:15:40 -0400, you wrote: > > >The answer given, and I'm > >not making this up, is that AMD listens to their users and gives the > >users what they want, and right now they're not hearing any demand for > >AVX512. > > > >Personally, I call BS on that one. 
I can't imagine anyone in the HPC > >community saying "we'd like processors that offer only 1/2 the floating > >point performance of Intel processors". > > I suspect that is marketing speak, which roughly translates to not > that no one has asked for it, but rather requests haven't reached a > threshold where the requests are viewed as significant enough. > > > Sure, AMD can offer more cores, > >but with only AVX2, you'd need twice as many cores as Intel processors, > >all other things being equal. > > But of course all other things aren't equal. > > AVX512 is a mess. > > Look at the Wikipedia page(*) and note that AVX512 means different > things depending on the processor implementing it. > > So what does the poor software developer target? > > Or that it can for heat reasons cause CPU frequency reductions, > meaning real world performance may not match theoritical - thus easier > to just go with GPU's. > > The result is that most of the world is quite happily (at least for > now) ignoring AVX512 and going with GPU's as necessary - particularly > given the convenient libraries that Nvidia offers. > > > I compared a server with dual AMD EPYC >7H12 processors (128) > > quad Intel Xeon 8268 >processors (96 cores). > > > From what I've heard, the AMD processors run much hotter than the Intel > >processors, too, so I imagine a FLOPS/Watt comparison would be even less > >favorable to AMD. > > Spec sheets would indicate AMD runs hotter, but then again you > benchmarked twice as many Intel processors. > > So, per spec sheets for you processors above: > > AMD - 280W - 2 processors means system 560W > Intel - 205W - 4 processors means system 820W > > (and then you also need to factor in purchase price). 
> > >An argument can be made that for calculations that lend themselves to > >vectorization should be done on GPUs, instead of the main processors but > >the last time I checked, GPU jobs are still memory is limited, and > >moving data in and out of GPU memory can still take time, so I can see > >situations where for large amounts of data using CPUs would be preferred > >over GPUs. > > AMD's latest chips support PCI 4 while Intel is still stuck on PCI 3, > which may or may not mean a difference. > > But what despite all of the above and the other replies, it is AMD who > has been winning the HPC contracts of late, not Intel. > > * - https://en.wikipedia.org/wiki/Advanced_Vector_Extensions > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hearnsj at gmail.com Sun Jun 20 05:51:58 2021 From: hearnsj at gmail.com (John Hearns) Date: Sun, 20 Jun 2021 06:51:58 +0100 Subject: [Beowulf] AMD and AVX512 [EXT] In-Reply-To: References: Message-ID: That is a very interesting point! I never thought of that. Also mobile drives ARM development - yes I know the CPUs in Isambard and Fugaku will not be seen in your mobile phone but the ecosystem is propped up by having a diverse market and also the power saving priorities of mobile will influence HPC ARM CPUs. On Sun, 20 Jun 2021 at 02:04, Tim Cutts wrote: > I think that?s a major important point. Even if the whole of the HPC > market were clamouring for it (which they?re not, judging by this > discussion) that?s still a very small proportion of the worldwide CPU > market. We have to remember that we in the HPC community are a niche > market. 
I recall at SC a couple of years ago someone from Intel pointing > out that mobile devices and IoT were what was driving IT technology; the > volume dwarfs everything else. Hence the drive to NVRAM - not to make > things faster for HPC (although that was the benefit being presented > through that talk), but the fundamental driver was to increase phone > battery life. > > Tim > > -- > Tim Cutts > Head of Scientific Computing > Wellcome Sanger Institute > > > On 19 Jun 2021, at 16:49, Gerald Henriksen wrote: > > I suspect that is marketing speak, which roughly translates to not > that no one has asked for it, but rather requests haven't reached a > threshold where the requests are viewed as significant enough. > > > -- The Wellcome Sanger Institute is operated by Genome Research Limited, a > charity registered in England with number 1021457 and a company registered > in England with number 2742969, whose registered office is 215 Euston Road, > London, NW1 2BE. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amacater at einval.com Sun Jun 20 12:13:21 2021 From: amacater at einval.com (Andrew M.A. Cater) Date: Sun, 20 Jun 2021 12:13:21 +0000 Subject: [Beowulf] Just a quick heads up: new Beowulf coming ... Message-ID: The folks over at Devuan have chosen to name their next release (based on upcoming Debian 11) beowulf. I don't think it will impact this list - but you never know. For anyone running Debian on HPC / in labs: the latest Debian point release 10.10 was yesterday. Debian 11 should be released in about six weeks on 31 July 2021. 
Thanks for the excellence in this list All best, as ever, Andy Cater From sdm900 at gmail.com Sun Jun 20 14:38:42 2021 From: sdm900 at gmail.com (Stu Midgley) Date: Sun, 20 Jun 2021 22:38:42 +0800 Subject: [Beowulf] AMD and AVX512 In-Reply-To: References: Message-ID: we should be upto about EV12 by now... On Sun, Jun 20, 2021 at 1:38 PM John Hearns wrote: > Regarding benchmarking real world codes on AMD , every year Martyn Guest > presents a comprehensive set of benchmark studies to the UK Computing > Insights Conference. > I suggest a Sunday afternoon with the beverage of your choice is a good > time to settle down and take time to read these or watch the presentation. > > 2019 > > https://www.scd.stfc.ac.uk/SiteAssets/Pages/CIUK-2019-Presentations/Martyn_Guest.pdf > > > 2020 Video session > > https://ukri.zoom.us/rec/share/ajvsxdJ8RM1wzpJtnlcypw4OyrZ9J27nqsfAG7eW49Ehq_Z5igat_7gj21Ge8gWu.78Cd9I1DNIjVViPV?startTime=1607008552000 > > Skylake / Cascade Lake / AMD Rome > > The slides for 2020 do exist - as I remember all the slides from all talks > are grouped together, but I cannot find them. > Watch the video - it is an excellent presentation. > > > > > > > > > > > > > > > > > > > On Sat, 19 Jun 2021 at 16:49, Gerald Henriksen wrote: > >> On Wed, 16 Jun 2021 13:15:40 -0400, you wrote: >> >> >The answer given, and I'm >> >not making this up, is that AMD listens to their users and gives the >> >users what they want, and right now they're not hearing any demand for >> >AVX512. >> > >> >Personally, I call BS on that one. I can't imagine anyone in the HPC >> >community saying "we'd like processors that offer only 1/2 the floating >> >point performance of Intel processors". >> >> I suspect that is marketing speak, which roughly translates to not >> that no one has asked for it, but rather requests haven't reached a >> threshold where the requests are viewed as significant enough. 
>> >> > Sure, AMD can offer more cores, >> >but with only AVX2, you'd need twice as many cores as Intel processors, >> >all other things being equal. >> >> But of course all other things aren't equal. >> >> AVX512 is a mess. >> >> Look at the Wikipedia page(*) and note that AVX512 means different >> things depending on the processor implementing it. >> >> So what does the poor software developer target? >> >> Or that it can for heat reasons cause CPU frequency reductions, >> meaning real world performance may not match theoritical - thus easier >> to just go with GPU's. >> >> The result is that most of the world is quite happily (at least for >> now) ignoring AVX512 and going with GPU's as necessary - particularly >> given the convenient libraries that Nvidia offers. >> >> > I compared a server with dual AMD EPYC >7H12 processors (128) >> > quad Intel Xeon 8268 >processors (96 cores). >> >> > From what I've heard, the AMD processors run much hotter than the Intel >> >processors, too, so I imagine a FLOPS/Watt comparison would be even less >> >favorable to AMD. >> >> Spec sheets would indicate AMD runs hotter, but then again you >> benchmarked twice as many Intel processors. >> >> So, per spec sheets for you processors above: >> >> AMD - 280W - 2 processors means system 560W >> Intel - 205W - 4 processors means system 820W >> >> (and then you also need to factor in purchase price). >> >> >An argument can be made that for calculations that lend themselves to >> >vectorization should be done on GPUs, instead of the main processors but >> >the last time I checked, GPU jobs are still memory is limited, and >> >moving data in and out of GPU memory can still take time, so I can see >> >situations where for large amounts of data using CPUs would be preferred >> >over GPUs. >> >> AMD's latest chips support PCI 4 while Intel is still stuck on PCI 3, >> which may or may not mean a difference. 
>> >> But what despite all of the above and the other replies, it is AMD who >> has been winning the HPC contracts of late, not Intel. >> >> * - https://en.wikipedia.org/wiki/Advanced_Vector_Extensions >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf >> > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -- Dr Stuart Midgley sdm900 at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From ghenriks at gmail.com Sun Jun 20 14:44:12 2021 From: ghenriks at gmail.com (Gerald Henriksen) Date: Sun, 20 Jun 2021 10:44:12 -0400 Subject: [Beowulf] AMD and AVX512 [EXT] In-Reply-To: References: Message-ID: On Sun, 20 Jun 2021 06:51:58 +0100, you wrote: >That is a very interesting point! I never thought of that. >Also mobile drives ARM development - yes I know the CPUs in Isambard and >Fugaku will not be seen in your mobile phone but the ecosystem is propped >up by having a diverse market and also the power saving priorities of >mobile will influence HPC ARM CPUs. I think the danger is in thinking of ARM (or going forward RISC-V) in the same way that we have traditionally considered CPU families like the x86 / x64 / Power families. One of the things hobbling x64 is that it is effectively one design that Intel (and to a lesser extent AMD) try to fit into multiple roles - often without success. Consider the now-abandoned attempts to get Intel chips into phones and tablets. ARM has no such constraints - they are quite happy to develop new designs for specific markets that are entirely unsuitable for their existing strengths. 
Hence, as part of the ARM push into HPC, the new Neoverse V1 - a design for HPC that probably won't appear in phones. https://www.arm.com/blogs/blueprint/neoverse-v1 Or consider that the ARM ecosystem has shunned making multiple-bitness CPUs/SOCs - they essentially made a clean break with 64-bit only chips that sit alongside the 32-bit only chips - vendors choose the hardware for their needs and don't carry along legacy stuff that eats up silicon space and power. ARM is about taking ARM IP and creating custom designs for specific markets. From joe.landman at gmail.com Sun Jun 20 17:21:15 2021 From: joe.landman at gmail.com (Joe Landman) Date: Sun, 20 Jun 2021 13:21:15 -0400 Subject: [Beowulf] AMD and AVX512 In-Reply-To: <61925435-be88-efec-dfdc-8895f3e79616@gmail.com> References: <61925435-be88-efec-dfdc-8895f3e79616@gmail.com> Message-ID: <0f1a31ff-b922-1fad-9dbf-9d87ca9c44ca@gmail.com> (Note: not disagreeing at all with Gerald, actually agreeing strongly ... also, correct address this time! Thanks Gerald!) On 6/19/21 11:49 AM, Gerald Henriksen wrote: > On Wed, 16 Jun 2021 13:15:40 -0400, you wrote: > >> The answer given, and I'm >> not making this up, is that AMD listens to their users and gives the >> users what they want, and right now they're not hearing any demand for >> AVX512. More accurately, there is call for it. From a very small segment of the market. Ones who buy small quantities of processors (under 100k volume per purchase). That is, not a significant enough portion of the market to make a huge difference to the supplier (Intel). And more to the point, AI and HPC joining forces has put the spotlight on small matrix multiplies, often with lower precision. I'm not sure (haven't read much on it recently) if AVX512 will be enabling/has enabled support for bfloat16/FP16 or similar. These tend to go to GPUs and other accelerators. 
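Since bfloat16 came up: it is just FP32 with the mantissa cut to 7 bits, keeping the full 8-bit exponent, so it trades precision for FP32's dynamic range. (Intel did, for the record, bolt this onto AVX-512 as the AVX512_BF16 extension in Cooper Lake Xeons.) A small sketch of the truncating conversion, with round-to-nearest-even and NaN handling omitted for brevity:

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """bfloat16 by truncation: keep the top 16 bits of the FP32 encoding."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_to_fp32(bits16: int) -> float:
    """Widen bfloat16 back to FP32 by zero-filling the dropped mantissa bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return x

print(bf16_to_fp32(fp32_to_bf16_bits(1.0)))      # 1.0 (exactly representable)
print(bf16_to_fp32(fp32_to_bf16_bits(3.14159)))  # 3.140625 (mantissa truncated)
print(bf16_to_fp32(fp32_to_bf16_bits(1e38)))     # huge values survive, unlike FP16
```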
I can't imagine anyone in the HPC >> community saying "we'd like processors that offer only 1/2 the floating >> point performance of Intel processors". > I suspect that is marketing speak, which roughly translates to not > that no one has asked for it, but rather requests haven't reached a > threshold where the requests are viewed as significant enough. This, precisely.? AMD may be losing the AVX512 users to Intel. But that's a small/miniscule fraction of the overall users of its products.? The demand for this is quite constrained. Moreover, there are often significant performance consequences to using AVX512 (downclocking, pipeline stalls, etc.) whereby the cost of enabling it and using it, far outweighs the benefits of providing it, for the vast, overwhelming portion of the market. And, as noted above on the accelerator side, this use case (large vectors) are better handled by the accelerators.? There is a cost (engineering, code design, etc.) to using accelerators as well.? But it won't directly impact the CPUs. >> Sure, AMD can offer more cores, >> but with only AVX2, you'd need twice as many cores as Intel processors, >> all other things being equal. ... or you run the GPU versions of the code, which are likely getting more active developer attention.? AVX512 applies to only a miniscule number of codes/problems.? Its really not a panacea. More to the point, have you seen how "well" compilers use AVX2/SSE registers and do code gen?? Its not pretty in general. Would you want the compilers to purposefully spit out AVX512 code the way the do AVX2/SSE code now?? I've found one has to work very hard with intrinsics to get good performance out of AVX2, never mind AVX512. Put another way, we've been hearing about "smart" compilers for a while, and in all honesty, most can barely implement a standard correctly, never mind generate reasonably (near) optimal code for the target system.? 
This has been a problem my entire professional life, and while I wish they were better, at the end of the day, this is where human intelligence fits into the HPC/AI narrative. > But of course all other things aren't equal. > > AVX512 is a mess. Understated, and yes. > Look at the Wikipedia page(*) and note that AVX512 means different > things depending on the processor implementing it. I made comments previously about which ISA ARM folks were going to write to. That is, different processors, likely implementing different instructions, differently ... you won't really have 1 equally good compiler for all these features. You'll have a compiler that implements common denominators reasonably well. Which mitigates the benefits of the ISA/architecture. Intel has the same problem with AVX512. I know, I know ... feature flags on the CPU (see last line of lscpu output). And how often have certain (ahem) compilers ignored the flags, and used a different mechanism to determine CPU feature support, specifically targeting their competitor offerings to force (literally) low performance paths for those CPUs? > So what does the poor software developer target? Lowest common denominator. Make the code work correctly first. Then make it fast. If fast is platform specific, ask how often that platform will be used. > Or that it can for heat reasons cause CPU frequency reductions, > meaning real world performance may not match theoritical - thus easier > to just go with GPU's. > > The result is that most of the world is quite happily (at least for > now) ignoring AVX512 and going with GPU's as necessary - particularly > given the convenient libraries that Nvidia offers. Yeah ... like it or not, that battle is over (for now). [...] 
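In practice, "feature flags plus lowest common denominator" turns into runtime dispatch: probe the CPU once, then route the hot kernels through the widest path the machine actually reports, with a scalar fallback. A sketch of the pattern, where the parsing mirrors the `flags` line of Linux `/proc/cpuinfo` and the kernel names are hypothetical:

```python
def parse_flags(cpuinfo_text: str) -> set:
    """Collect ISA feature flags from /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
    return flags

def select_kernel(flags: set) -> str:
    """Pick the widest vector path the CPU advertises; scalar is the fallback."""
    if {"avx512f", "avx512vl"} <= flags:   # a subset check -- AVX512 is a family
        return "dot_avx512"
    if "avx2" in flags:
        return "dot_avx2"
    return "dot_scalar"

sample = "flags\t\t: fpu sse sse2 avx avx2 fma\n"
print(select_kernel(parse_flags(sample)))   # dot_avx2
```

This is also exactly where the "different mechanism to determine CPU feature support" trick does its damage: dispatch keyed on vendor ID instead of the flags silently sends competing CPUs down the slow path.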
> >> An argument can be made that for calculations that lend themselves to >> vectorization should be done on GPUs, instead of the main processors but >> the last time I checked, GPU jobs are still memory is limited, and >> moving data in and out of GPU memory can still take time, so I can see >> situations where for large amounts of data using CPUs would be preferred >> over GPUs. > AMD's latest chips support PCI 4 while Intel is still stuck on PCI 3, > which may or may not mean a difference. It does. IO and memory bandwidth/latency are very important, and oft-overlooked, aspects of performance. If you have a choice of doubling IO and memory bandwidth at lower latency (usable by everyone) vs adding an AVX512 unit or two (usable by a small fraction of a percent of all users), which would net you, as an architect, the best "bang for the buck"? > But what despite all of the above and the other replies, it is AMD who > has been winning the HPC contracts of late, not Intel. There's a reason for that. I will admit I have a devil of a time trying to convince people that higher clock frequency for computing matters only to a small fraction of operations, especially ones waiting on (slow) RAM and (slower) IO. Make the RAM and IO faster (lower latency, higher bandwidth), and the system will be far more performant. -- Joe Landman e:joe.landman at gmail.com t: @hpcjoe w:https://scalability.org g:https://github.com/joelandman l:https://www.linkedin.com/in/joelandman -------------- next part -------------- An HTML attachment was scrubbed... 
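The clock-frequency point yields to the same arithmetic: only the compute portion of a kernel's runtime scales with clock, the memory-wait portion does not. A toy Amdahl-style model, where the 90%/10% splits are illustrative rather than measurements:

```python
# The fraction of runtime spent waiting on memory does not speed up with clock.
def clock_speedup(mem_fraction: float, clock_boost: float) -> float:
    compute = (1.0 - mem_fraction) / clock_boost
    return 1.0 / (compute + mem_fraction)

# 20% higher clock on a kernel that waits on RAM 90% of the time:
print(round(clock_speedup(0.90, 1.20), 3))   # 1.017 -- under 2% faster
# The same 20% on a cache-resident kernel:
print(round(clock_speedup(0.10, 1.20), 3))   # 1.176
```

Doubling memory bandwidth instead roughly halves the 90% term, for a ~1.8x gain on the same kernel — the "make the RAM and IO faster" point in numbers.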
It seems to me that a reasonable basis for discussing AMD EPYC performance could be the specified performance data in the Daresburg University benchmark from M.Guest. Yes, newer versions of AMD EPYC and Xeon Scalable processors have appeared since then, and new compiler versions. However, Intel already had AVX-512 support, and AMD - AVX-256. Of course, peak performanceis is not so important as application performance. There are applications where performance is not limited to working with vectors - there AVX-512 may not be needed. And in AI tasks, working with vectors is actual - and GPUs are often used there. For AI, the Daresburg benchmark, on the other hand, is less relevant. And in Zen 4, AMD seemed to be going to support 512 bit vectors. But performance of linear algebra does not always require work with GPU. In quantum chemistry, you can get acceleration due to vectors on the V100, let's say a 2 times - how much more expensive is the GPU? Of course, support for 512 bit vectors is a plus, but you really need to look to application performance and cost (including power consumption). I prefer to see to the A64FX now, although there may need to be rebuild applications. Servers w/A64FX sold now, but the price is very important. In message from John Hearns (Sun, 20 Jun 2021 06:38:06 +0100): > Regarding benchmarking real world codes on AMD , every year Martyn >Guest > presents a comprehensive set of benchmark studies to the UK Computing > Insights Conference. > I suggest a Sunday afternoon with the beverage of your choice is a >good > time to settle down and take time to read these or watch the >presentation. 
> > 2019 > https://www.scd.stfc.ac.uk/SiteAssets/Pages/CIUK-2019-Presentations/Martyn_Guest.pdf > > > 2020 Video session > https://ukri.zoom.us/rec/share/ajvsxdJ8RM1wzpJtnlcypw4OyrZ9J27nqsfAG7eW49Ehq_Z5igat_7gj21Ge8gWu.78Cd9I1DNIjVViPV?startTime=1607008552000 > > Skylake / Cascade Lake / AMD Rome > > The slides for 2020 do exist - as I remember all the slides from all >talks > are grouped together, but I cannot find them. > Watch the video - it is an excellent presentation. > > > > > > > > > > > > > > > > > > > On Sat, 19 Jun 2021 at 16:49, Gerald Henriksen >wrote: > >> On Wed, 16 Jun 2021 13:15:40 -0400, you wrote: >> >> >The answer given, and I'm >> >not making this up, is that AMD listens to their users and gives the >> >users what they want, and right now they're not hearing any demand >>for >> >AVX512. >> > >> >Personally, I call BS on that one. I can't imagine anyone in the HPC >> >community saying "we'd like processors that offer only 1/2 the >>floating >> >point performance of Intel processors". >> >> I suspect that is marketing speak, which roughly translates to not >> that no one has asked for it, but rather requests haven't reached a >> threshold where the requests are viewed as significant enough. >> >> > Sure, AMD can offer more cores, >> >but with only AVX2, you'd need twice as many cores as Intel >>processors, >> >all other things being equal. >> >> But of course all other things aren't equal. >> >> AVX512 is a mess. >> >> Look at the Wikipedia page(*) and note that AVX512 means different >> things depending on the processor implementing it. >> >> So what does the poor software developer target? >> >> Or that it can for heat reasons cause CPU frequency reductions, >> meaning real world performance may not match theoritical - thus >>easier >> to just go with GPU's. 
>>
>> The result is that most of the world is quite happily (at least for
>> now) ignoring AVX512 and going with GPUs as necessary - particularly
>> given the convenient libraries that Nvidia offers.
>>
>> > I compared a server with dual AMD EPYC 7H12 processors (128 cores)
>> > with quad Intel Xeon 8268 processors (96 cores).
>>
>> > From what I've heard, the AMD processors run much hotter than the Intel
>> >processors, too, so I imagine a FLOPS/Watt comparison would be even less
>> >favorable to AMD.
>>
>> Spec sheets would indicate AMD runs hotter, but then again you
>> benchmarked twice as many Intel processors.
>>
>> So, per spec sheets for your processors above:
>>
>> AMD - 280W - 2 processors means system 560W
>> Intel - 205W - 4 processors means system 820W
>>
>> (and then you also need to factor in purchase price).
>>
>> >An argument can be made that calculations that lend themselves to
>> >vectorization should be done on GPUs, instead of the main processors, but
>> >the last time I checked, GPU jobs are still memory-limited, and
>> >moving data in and out of GPU memory can still take time, so I can see
>> >situations where for large amounts of data using CPUs would be preferred
>> >over GPUs.
>>
>> AMD's latest chips support PCIe 4 while Intel is still stuck on PCIe 3,
>> which may or may not mean a difference.
>>
>> But despite all of the above and the other replies, it is AMD who
>> has been winning the HPC contracts of late, not Intel.
>>
>> * - https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
From sassy-work at sassy.formativ.net Sun Jun 20 22:45:26 2021
From: sassy-work at sassy.formativ.net (=?ISO-8859-1?Q?J=F6rg_Sa=DFmannshausen?=)
Date: Sun, 20 Jun 2021 23:45:26 +0100
Subject: [Beowulf] AMD and AVX512
In-Reply-To: References:
Message-ID: <4123509.Cr6JYYXRPx@deepblue>

Dear all,

same here, I should have joined the discussion earlier, but currently I am
recovering from a trapped ulnar nerve operation, so long stretches of typing
are something I need to avoid.

As it is quite apt, I think, I would like to inform you about this upcoming
talk (copy&pasta):

**********
*Performance Optimizations & Best Practices for AMD Rome and Milan CPUs in
HPC Environments*

- date & time: Fri July 2nd 2021 - 16:00-17:30 UTC
- speakers: Evan Burness and Jithin Jose (Principal Program Managers for
High-Performance Computing in Microsoft Azure)

More information available at
https://github.com/easybuilders/easybuild/wiki/EasyBuild-tech-talks-IV:-AMD-Rome-&-Milan

The talk will be presented via a Zoom session, which registered attendees
can join, and will be streamed (+ recorded) via the EasyBuild YouTube
channel. Q&A via the #tech-talks channel in the EasyBuild Slack.

Please register (free of charge) if you plan to attend, via:
https://webappsx.ugent.be/eventManager/events/ebtechtalkamdromemilan

The Zoom link will only be shared with registered attendees.
**********

These talks are really tech talks, not sales talks, and all of the ones I
have been to were very informative and friendly. So it might be a good idea
to ask some questions there.

All the best

Jörg

Am Sonntag, 20. Juni 2021, 18:28:25 BST schrieb Mikhail Kuzminsky:
> I apologize - I should have written earlier, but I can't always work
> with my broken right hand. It seems to me that a reasonable basis for
> discussing AMD EPYC performance could be the specified performance
> data in the Daresbury benchmark studies from M. Guest.
> [...]
> >> * - https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

From engwalljonathanthereal at gmail.com Mon Jun 21 13:20:00 2021
From: engwalljonathanthereal at gmail.com (Jonathan Engwall)
Date: Mon, 21 Jun 2021 06:20:00 -0700
Subject: [Beowulf] AMD and AVX512
In-Reply-To: <0f1a31ff-b922-1fad-9dbf-9d87ca9c44ca@gmail.com>
References: <61925435-be88-efec-dfdc-8895f3e79616@gmail.com>
 <0f1a31ff-b922-1fad-9dbf-9d87ca9c44ca@gmail.com>
Message-ID:

I have followed this thinking "square peg, round hole."
You have got it again, Joe. Compilers are your problem.

On Sun, Jun 20, 2021, 10:21 AM Joe Landman wrote:

> (Note: not disagreeing at all with Gerald, actually agreeing strongly ...
> also, correct address this time! Thanks Gerald!)
>
> On 6/19/21 11:49 AM, Gerald Henriksen wrote:
>
> On Wed, 16 Jun 2021 13:15:40 -0400, you wrote:
>
> The answer given, and I'm
> not making this up, is that AMD listens to their users and gives the
> users what they want, and right now they're not hearing any demand for
> AVX512.
>
> More accurately, there is call for it. From a very small segment of the
> market. Ones who buy small quantities of processors (under 100k volume
> per purchase).
>
> That is, not a significant enough portion of the market to make a huge
> difference to the supplier (Intel).
>
> And more to the point, AI and HPC joining forces has put the spotlight on
> small matrix multiplies, often with lower precision.
I'm not sure (haven't
> read much on it recently) if AVX512 will be enabling/has enabled support
> for bfloat16/FP16 or similar. These tend to go to GPUs and other
> accelerators.
>
> Personally, I call BS on that one. I can't imagine anyone in the HPC
> community saying "we'd like processors that offer only 1/2 the floating
> point performance of Intel processors".
>
> I suspect that is marketing speak, which roughly translates to not
> that no one has asked for it, but rather requests haven't reached a
> threshold where the requests are viewed as significant enough.
>
> This, precisely. AMD may be losing the AVX512 users to Intel. But that's
> a small/minuscule fraction of the overall users of its products. The
> demand for this is quite constrained. Moreover, there are often
> significant performance consequences to using AVX512 (downclocking,
> pipeline stalls, etc.) whereby the cost of enabling and using it far
> outweighs the benefits of providing it, for the vast, overwhelming
> portion of the market.
>
> And, as noted above on the accelerator side, this use case (large
> vectors) is better handled by the accelerators. There is a cost
> (engineering, code design, etc.) to using accelerators as well. But it
> won't directly impact the CPUs.
>
> Sure, AMD can offer more cores,
> but with only AVX2, you'd need twice as many cores as Intel processors,
> all other things being equal.
>
> ... or you run the GPU versions of the code, which are likely getting
> more active developer attention. AVX512 applies to only a minuscule
> number of codes/problems. It's really not a panacea.
>
> More to the point, have you seen how "well" compilers use AVX2/SSE
> registers and do code gen? It's not pretty in general. Would you want
> the compilers to purposefully spit out AVX512 code the way they do
> AVX2/SSE code now? I've found one has to work very hard with intrinsics
> to get good performance out of AVX2, never mind AVX512.
>
> Put another way, we've been hearing about "smart" compilers for a while,
> and in all honesty, most can barely implement a standard correctly,
> never mind generate reasonably (near) optimal code for the target
> system. This has been a problem my entire professional life, and while I
> wish they were better, at the end of the day, this is where human
> intelligence fits into the HPC/AI narrative.
>
> But of course all other things aren't equal.
>
> AVX512 is a mess.
>
> Understated, and yes.
>
> Look at the Wikipedia page(*) and note that AVX512 means different
> things depending on the processor implementing it.
>
> I made comments previously about which ISA ARM folks were going to write
> to. That is, different processors, likely implementing different
> instructions, differently ... you won't really have one equally good
> compiler for all these features. You'll have a compiler that implements
> common denominators reasonably well. Which mitigates the benefits of the
> ISA/architecture.
>
> Intel has the same problem with AVX512. I know, I know ... feature flags
> on the CPU (see last line of lscpu output). And how often have certain
> (ahem) compilers ignored the flags, and used a different mechanism to
> determine CPU feature support, specifically targeting their competitor's
> offerings to force (literally) low-performance paths for those CPUs?
>
> So what does the poor software developer target?
>
> Lowest common denominator. Make the code work correctly first. Then make
> it fast. If fast is platform-specific, ask how often that platform will
> be used.
>
> Or that it can for heat reasons cause CPU frequency reductions,
> meaning real world performance may not match theoretical - thus easier
> to just go with GPUs.
>
> The result is that most of the world is quite happily (at least for
> now) ignoring AVX512 and going with GPUs as necessary - particularly
> given the convenient libraries that Nvidia offers.
>
> Yeah ...
like it or not, that battle is over (for now).
>
> [...]
>
> An argument can be made that calculations that lend themselves to
> vectorization should be done on GPUs, instead of the main processors,
> but the last time I checked, GPU jobs are still memory-limited, and
> moving data in and out of GPU memory can still take time, so I can see
> situations where for large amounts of data using CPUs would be preferred
> over GPUs.
>
> AMD's latest chips support PCIe 4 while Intel is still stuck on PCIe 3,
> which may or may not mean a difference.
>
> It does. IO and memory bandwidth/latency are very important, and
> oft-overlooked aspects of performance. If you have a choice of doubling
> IO and memory bandwidth at lower latency (usable by everyone) vs adding
> an AVX512 unit or two (usable by a small fraction of a percent of all
> users), which would net you, as an architect, the best "bang for the
> buck"?
>
> But despite all of the above and the other replies, it is AMD who
> has been winning the HPC contracts of late, not Intel.
>
> There's a reason for that. I will admit I have a devil of a time trying
> to convince people that higher clock frequency for computing matters
> only to a small fraction of operations, especially ones waiting on
> (slow) RAM and (slower) IO. Make the RAM and IO faster (lower latency,
> higher bandwidth), and the system will be far more performant.
>
> --
> Joe Landman
> e: joe.landman at gmail.com
> t: @hpcjoe
> w: https://scalability.org
> g: https://github.com/joelandman
> l: https://www.linkedin.com/in/joelandman
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From joe.landman at gmail.com Mon Jun 21 13:46:30 2021
From: joe.landman at gmail.com (Joe Landman)
Date: Mon, 21 Jun 2021 09:46:30 -0400
Subject: [Beowulf] AMD and AVX512
In-Reply-To: References: <61925435-be88-efec-dfdc-8895f3e79616@gmail.com>
 <0f1a31ff-b922-1fad-9dbf-9d87ca9c44ca@gmail.com>
Message-ID: <5fad1acd-31d1-b624-34cd-9e53742c251b@gmail.com>

On 6/21/21 9:20 AM, Jonathan Engwall wrote:
> I have followed this thinking "square peg, round hole."
> You have got it again, Joe. Compilers are your problem.

Erp ... did I mess up again?

System architecture has been a problem ... making a processing unit
10-100x as fast as its support components means you have to code with
that in mind. A simple `gfortran -O3 mycode.f` won't necessarily
generate optimal code for the system (but I swear ... -O3 ... it says
it on the package!)

Way back at Scalable, our secret sauce was largely increasing IO
bandwidth and lowering IO latency while coupling computing more tightly
to this massive IO/network pipe set, combined with intelligence in the
kernel on how to better use the resources. It was simply a better
architecture. We used the same CPUs. We simply exploited the design
better.

End result was codes that ran on our systems with off-cpu work (storage,
networking, etc.) could push our systems far harder than competitors.
And you didn't have to use a different ISA to get these benefits. No
recompilation needed, though we did show the folks who were interested
how to get even better performance.

Architecture matters, as does implementation of that architecture.
There are costs to every decision within an architecture. For AVX512,
along comes lots of other baggage associated with downclocking, etc.
You have to do a cost-benefit analysis on whether or not it is worth
paying for that baggage, with the benefits you get from doing so. Some
folks have made that decision towards AVX512, and have been enjoying the
benefits of doing so (e.g. willing to pay the costs).
For the general audience, these costs represent a (significant) hurdle one
must overcome.

Here's where awesome compiler support would help. FWIW, gcc isn't that
great a compiler. It's not performance-minded for HPC. It's a reasonable
general-purpose, standards-compliant (for some subset of standards)
compilation system. LLVM is IMO a better compiler system, and its
clang/flang are developing nicely, albeit still not really HPC focused.
Then you have variants built on that, like the Cray compiler, Nvidia
compiler and AMD compiler. These are HPC focused, and actually do quite
well with some codes (though the AMD version lags the Cray and Nvidia
compilers). You've got the Intel compiler, which would be a good general
compiler if it wasn't more of a marketing vehicle for Intel processors and
their features (hey, you got an AMD chip? you will take the slowest code
path even if you support the features needed for the high performance code
path).

Maybe, someday, we'll get a great HPC compiler for C/Fortran.

--
Joe Landman
e: joe.landman at gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From amacater at einval.com Mon Jun 21 14:46:53 2021
From: amacater at einval.com (Andrew M.A. Cater)
Date: Mon, 21 Jun 2021 14:46:53 +0000
Subject: [Beowulf] AMD and AVX512
In-Reply-To: <5fad1acd-31d1-b624-34cd-9e53742c251b@gmail.com>
References: <61925435-be88-efec-dfdc-8895f3e79616@gmail.com>
 <0f1a31ff-b922-1fad-9dbf-9d87ca9c44ca@gmail.com>
 <5fad1acd-31d1-b624-34cd-9e53742c251b@gmail.com>
Message-ID:

On Mon, Jun 21, 2021 at 09:46:30AM -0400, Joe Landman wrote:
> On 6/21/21 9:20 AM, Jonathan Engwall wrote:
> > I have followed this thinking "square peg, round hole."
> > You have got it again, Joe. Compilers are your problem.
>
> Erp ... did I mess up again?
>
> [...]
>
> Maybe, someday, we'll get a great HPC compiler for C/Fortran.
>
The problem is that, maybe, the HPC market is still not _quite_ big enough
to merit a dedicated set of compilers and is diverse enough in its problem
sets that we still need a dozen or more specialist use cases to work well.

You would think there would be a cross-over point where massively parallel
scalable cloud infrastructure would intersect with HPC, but that doesn't
seem to be happening. Parallelisation is the great bugbear anyway.

Most of the experts I know on all of this are the regulars on this list:
paging Greg Lindahl ...
All the best,

Andy Cater

> --
> Joe Landman
> e: joe.landman at gmail.com
> t: @hpcjoe
> w: https://scalability.org
> g: https://github.com/joelandman
> l: https://www.linkedin.com/in/joelandman
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

From engwalljonathanthereal at gmail.com Mon Jun 21 14:58:51 2021
From: engwalljonathanthereal at gmail.com (Jonathan Engwall)
Date: Mon, 21 Jun 2021 07:58:51 -0700
Subject: [Beowulf] AMD and AVX512
In-Reply-To: References: <61925435-be88-efec-dfdc-8895f3e79616@gmail.com>
 <0f1a31ff-b922-1fad-9dbf-9d87ca9c44ca@gmail.com>
 <5fad1acd-31d1-b624-34cd-9e53742c251b@gmail.com>
Message-ID:

AVX-512 is SIMD, and in that respect compiled Intel routines will run
almost automatically on Intel processors. It's not like I was answering the
question. I realize, or under-realize, the implementation problems. You
need to do a side-by-side comparison of the die.

On Mon, Jun 21, 2021, 7:47 AM Andrew M.A. Cater wrote:

> On Mon, Jun 21, 2021 at 09:46:30AM -0400, Joe Landman wrote:
> > [...]
> >
> > Maybe, someday, we'll get a great HPC compiler for C/Fortran.
> >
> The problem is that, maybe, the HPC market is still not _quite_ big
> enough to merit a dedicated set of compilers and is diverse enough in its
> problem sets that we still need a dozen or more specialist use cases to
> work well.
>
> [...]
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sassy-work at sassy.formativ.net Mon Jun 21 17:11:37 2021
From: sassy-work at sassy.formativ.net (=?ISO-8859-1?Q?J=F6rg_Sa=DFmannshausen?=)
Date: Mon, 21 Jun 2021 18:11:37 +0100
Subject: [Beowulf] AMD and AVX512
In-Reply-To: <5fad1acd-31d1-b624-34cd-9e53742c251b@gmail.com>
References: <61925435-be88-efec-dfdc-8895f3e79616@gmail.com>
 <5fad1acd-31d1-b624-34cd-9e53742c251b@gmail.com>
Message-ID: <3326969.2YInZGPpZj@deepblue>

Dear all

> System architecture has been a problem ... making a processing unit
> 10-100x as fast as its support components means you have to code with
> that in mind. A simple `gfortran -O3 mycode.f` won't necessarily
> generate optimal code for the system (but I swear ... -O3 ... it says
> it on the package!)

From a computational chemist's perspective, I agree. In an ideal world, you
want to get the right hardware for the program you want to use. Some codes
run entirely in memory; others use disk space for offloading files.

This is, in my humble opinion, also the big problem CPUs are facing. They
are built to tackle all possible scenarios, from simple integer to floating
point, from in-memory to disk I/O. In some respects it would have been
better to stick with a separate math unit which could then be selected
according to the workload you want to run on that server. I guess this is
where the GPUs are trying to fit in here, or maybe ARM.

I also agree with the compiler "problem". If you start to push some
compilers too hard, the code runs very fast but the results are simply
wrong. Again, in an ideal world we would have a compiler suited to the
given hardware and the job you want to run. The problem here is not "is
that possible?"; the problem is more "how much does it cost?" From what I
understand, some big server farms are actually not using commodity HPC
stuff; they are designing what they need themselves.
Maybe the whole climate problem will finally push HPC into more bespoke
systems where the components are fit for the job in question, say weather
modeling for example, simply as that would be more energy efficient and
faster.

Before somebody comes along with "but but but it costs!": think about how
much money is being spent simply to kill people, or on other wasteful
projects like Brexit etc.

My 2 shillings for what it is worth! :D

Jörg

Am Montag, 21. Juni 2021, 14:46:30 BST schrieb Joe Landman:
> [...]
>
> Maybe, someday, we'll get a great HPC compiler for C/Fortran.

From bdobbins at gmail.com Mon Jun 21 18:39:06 2021
From: bdobbins at gmail.com (Brian Dobbins)
Date: Mon, 21 Jun 2021 12:39:06 -0600
Subject: [Beowulf] AMD and AVX512
In-Reply-To: <3326969.2YInZGPpZj@deepblue>
References: <61925435-be88-efec-dfdc-8895f3e79616@gmail.com>
 <5fad1acd-31d1-b624-34cd-9e53742c251b@gmail.com>
 <3326969.2YInZGPpZj@deepblue>
Message-ID:

Hi all,

> This is, in my humble opinion, also the big problem CPUs are facing. They
> are built to tackle all possible scenarios, from simple integer to
> floating point, from in-memory to disk I/O.
In some respects it would have been better to
> stick with a separate math unit which could then be selected according to
> the workload you want to run on that server. I guess this is where the
> GPUs are trying to fit in here, or maybe ARM.

I recall a few years ago the rumors that the Argonne "A18" system was going
to use the 'Configurable Spatial Accelerators' that Intel was developing,
with the idea being you *could* reconfigure based on the needs of the code.
In principle, it sounds like the Holy Grail, but in practice it seems quite
difficult, and I don't believe I've heard much more about the CSA approach
since.

WikiChip on the CSA:
https://en.wikichip.org/wiki/intel/configurable_spatial_accelerator
NextPlatform article:
https://www.nextplatform.com/2018/08/30/intels-exascale-dataflow-engine-drops-x86-and-von-neuman/

I have to imagine that research hasn't gone fully quiet, especially with
Intel's moves towards oneAPI and their FPGA experiences, but I haven't seen
anything about it in a while. Of course....

> I also agree with the compiler "problem". If you are starting to push
> some compilers too much, the code is running very fast but the results
> are simply wrong. Again, in an ideal world we have a compiler for the job
> for the given hardware, which also depends on the job you want to run.

... It exacerbates the compiler issues, *I think*. I hesitate to say it
does so definitively, since the patent write-up talks about how the CSA
architecture uses a representation very similar to what the (now old) Intel
compilers created as an IR (intermediate representation). In my opinion,
having a compiler that can 'do everything' is like having an AI that can do
everything - we're good at very, *very* specific use-cases, but not
generality. So configurable systems are a big challenge. (I'm *way* out of
my depth on compilers, though - maybe they're improving massively?)
> Maybe the whole climate problem will finally push HPC into the more
> bespoke system where the components are fit for the job in question,
> say weather modeling for example, simply as that would be more energy
> efficient and faster.

I can't speak to whether climate research will influence hardware, but
back to the *original* theme of this thread, I actually had some data
-very *limited* data, mind you!- on how NCAR's climate model, CESM, run
in an 'F2000climo' case (one of many, many cases, and very atmospheric
focused) at 2-degree atmosphere resolution (*very* coarse) on a 36-core
Xeon Skylake performs across AVX2, AVX512 and AVX512+FMA. By default,
FMA is turned off in these cases due to numerical sensitivity.

So, that's a *very* specific case, but on the off chance people are
curious, here's what it looks like - note that this is *noisy* data,
because the model also does a lot of I/O, hence why I tend to look at
the median times below:

  SKX (AWS C5N.18xlarge) Performance Comparison
  CESM Case: F2000climo @ f19_g17 resolution
  (36 cores each component / 10 model day run, skipping 1st and last)

  Flags    AVX2 (no FMA)   AVX512 (no FMA)   AVX512 + FMA
  Min      60.18           60.24             59.16
  Max      66.26           60.47             59.40
  Median   60.28           60.38             59.32

The take-away? We're not really benefiting *at all* (at this resolution,
for this compset, etc) from AVX512 here. Maybe at higher resolution?
Maybe with more vertical levels, or chemistry, or something like that?
*Maybe*, but differences seem indistinguishable from noise here, and
possibly negative! Now, give us more *memory bandwidth*, and that's
fantastic. Could this code be rewritten to take better advantage of
larger vectors? Sure, and some *really* capable people do work on that
sort of stuff, and it helps, but as an *evolution* in performance, not a
revolution in it.

(Also, I'm always horrified by presenting one-off tests as examples of
anything, but it's the only data I have on-hand! Other cases may indeed
vary.)
> Before somebody comes along with: but but but it costs! Think about how
> much money is being spent simply to kill people, or on other wasteful
> projects like Brexit etc.

One can only hope. When it comes to spending on research, I recall the
quote: "If you think education is expensive, try ignorance!"

Cheers,
 - Brian

On Monday, 21 June 2021 at 14:46:30 BST, Joe Landman wrote:

> > On 6/21/21 9:20 AM, Jonathan Engwall wrote:
> >
> > > I have followed this thinking "square peg, round hole."
> > > You have got it again, Joe. Compilers are your problem.
> >
> > Erp ... did I mess up again?
> >
> > System architecture has been a problem ... making a processing unit
> > 10-100x as fast as its support components means you have to code with
> > that in mind. A simple `gfortran -O3 mycode.f` won't necessarily
> > generate optimal code for the system (but I swear ... -O3 ... it says
> > it on the package!)
> >
> > Way back at Scalable, our secret sauce was largely increasing IO
> > bandwidth and lowering IO latency while coupling computing more
> > tightly to this massive IO/network pipe set, combined with
> > intelligence in the kernel on how to better use the resources. It was
> > simply a better architecture. We used the same CPUs. We simply
> > exploited the design better.
> >
> > End result was codes that ran on our systems with off-cpu work
> > (storage, networking, etc.) could push our systems far harder than
> > competitors. And you didn't have to use a different ISA to get these
> > benefits. No recompilation needed, though we did show the folks who
> > were interested, how to get even better performance.
> >
> > Architecture matters, as does implementation of that architecture.
> > There are costs to every decision within an architecture. For AVX512,
> > along comes lots of other baggage associated with downclocking, etc.
> > You have to do a cost-benefit analysis on whether or not it is worth
> > paying for that baggage, with the benefits you get from doing so.
> > Some folks have made that decision towards AVX512, and have been
> > enjoying the benefits of doing so (e.g. willing to pay the costs).
> > For the general audience, these costs represent a (significant)
> > hurdle one must overcome.
> >
> > Here's where awesome compiler support would help. FWIW, gcc isn't
> > that great a compiler. It's not performance minded for HPC. It's a
> > reasonable general purpose standards compliant (for some subset of
> > standards) compilation system. LLVM is IMO a better compiler system,
> > and its clang/flang are developing nicely, albeit still not really
> > HPC focused. Then you have variants built on that. Like the Cray
> > compiler, Nvidia compiler and AMD compiler. These are HPC focused,
> > and actually do quite well with some codes (though the AMD version
> > lags the Cray and Nvidia compilers). You've got the Intel compiler,
> > which would be a good general compiler if it wasn't more of a
> > marketing vehicle for Intel processors and their features (hey you
> > got an AMD chip? you will take the slowest code path even if you
> > support the features needed for the high performance code path).
> >
> > Maybe, someday, we'll get a great HPC compiler for C/Fortran.
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pbisbal at pppl.gov  Mon Jun 21 20:56:08 2021
From: pbisbal at pppl.gov (Prentice Bisbal)
Date: Mon, 21 Jun 2021 16:56:08 -0400
Subject: [Beowulf] [External] Re: AMD and AVX512
In-Reply-To: 
References: 
Message-ID: <5433f49a-7331-4e10-1b4f-e1e5c6878920@pppl.gov>

Thanks for the input. I have looked at that Wikipedia page before, but
never checked it that closely. I just looked mainly to see what
processors supported what extensions.
After taking a closer look at AVX-512, and all the different
subdivisions, I see exactly what you're saying. It's a mess! Compare
that to AVX and AVX2, where it's an all-or-nothing thing. Makes a lot
more sense.

Prentice

On 6/19/21 11:49 AM, Gerald Henriksen wrote:
> On Wed, 16 Jun 2021 13:15:40 -0400, you wrote:
>
>> The answer given, and I'm not making this up, is that AMD listens to
>> their users and gives the users what they want, and right now they're
>> not hearing any demand for AVX512.
>>
>> Personally, I call BS on that one. I can't imagine anyone in the HPC
>> community saying "we'd like processors that offer only 1/2 the
>> floating point performance of Intel processors".
> I suspect that is marketing speak, which roughly translates to not
> that no one has asked for it, but rather requests haven't reached a
> threshold where the requests are viewed as significant enough.
>
>> Sure, AMD can offer more cores, but with only AVX2, you'd need twice
>> as many cores as Intel processors, all other things being equal.
> But of course all other things aren't equal.
>
> AVX512 is a mess.
>
> Look at the Wikipedia page(*) and note that AVX512 means different
> things depending on the processor implementing it.
>
> So what does the poor software developer target?
>
> Or that it can for heat reasons cause CPU frequency reductions,
> meaning real world performance may not match theoretical - thus easier
> to just go with GPU's.
>
> The result is that most of the world is quite happily (at least for
> now) ignoring AVX512 and going with GPU's as necessary - particularly
> given the convenient libraries that Nvidia offers.
>
>> I compared a server with dual AMD EPYC 7H12 processors (128 cores) to
>> a server with quad Intel Xeon 8268 processors (96 cores).
>>
>> From what I've heard, the AMD processors run much hotter than the
>> Intel processors, too, so I imagine a FLOPS/Watt comparison would be
>> even less favorable to AMD.
> Spec sheets would indicate AMD runs hotter, but then again you
> benchmarked twice as many Intel processors.
>
> So, per the spec sheets for your processors above:
>
> AMD - 280W - 2 processors means system 560W
> Intel - 205W - 4 processors means system 820W
>
> (and then you also need to factor in purchase price).
>
>> An argument can be made that calculations that lend themselves to
>> vectorization should be done on GPUs, instead of the main processors,
>> but the last time I checked, GPU jobs are still memory limited, and
>> moving data in and out of GPU memory can still take time, so I can
>> see situations where for large amounts of data using CPUs would be
>> preferred over GPUs.
> AMD's latest chips support PCI 4 while Intel is still stuck on PCI 3,
> which may or may not mean a difference.
>
> But despite all of the above and the other replies, it is AMD who has
> been winning the HPC contracts of late, not Intel.
>
> * - https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

From deadline at eadline.org  Mon Jun 21 21:39:08 2021
From: deadline at eadline.org (Douglas Eadline)
Date: Mon, 21 Jun 2021 17:39:08 -0400
Subject: [Beowulf] AMD and AVX512
In-Reply-To: 
References: 
Message-ID: 

> On Wed, 16 Jun 2021 13:15:40 -0400, you wrote:
>
>> The answer given, and I'm not making this up, is that AMD listens to
>> their users and gives the users what they want, and right now they're
>> not hearing any demand for AVX512.
>>
>> Personally, I call BS on that one. I can't imagine anyone in the HPC
>> community saying "we'd like processors that offer only 1/2 the
>> floating point performance of Intel processors".
> I suspect that is marketing speak, which roughly translates to not
> that no one has asked for it, but rather requests haven't reached a
> threshold where the requests are viewed as significant enough.

Exactly, or "Right now cloud based servers are the biggest market.
These customers need as many cores/threads as possible on die with
"adequate" memory bandwidth. Oh, and they buy them by the boatload.
What did you say you do again?"

--
Doug

From pbisbal at pppl.gov  Tue Jun 22 16:01:57 2021
From: pbisbal at pppl.gov (Prentice Bisbal)
Date: Tue, 22 Jun 2021 12:01:57 -0400
Subject: [Beowulf] [External] Just a quick heads up: new Beowulf coming ...
In-Reply-To: 
References: 
Message-ID: 

I doubt it will impact us much. At worst, one or two people might find
this list and post inappropriate questions. At best, it will be
entertaining, similar to when someone posts a question about trees to
https://www.reddit.com/r/trees, which is about smoking marijuana, or a
marijuana enthusiast posts something to
https://www.reddit.com/r/MarijuanaEnthusiasts, which is about trees. I
would consider both of those links NSFW.

Prentice

On 6/20/21 8:13 AM, Andrew M.A. Cater wrote:
> The folks over at Devuan have chosen to name their next code release
> (based on upcoming Debian 11) beowulf. I don't think it will impact
> this list - but you never know.
>
> For anyone running Debian on HPC / in labs: the latest Debian point
> release 10.10 was yesterday. Debian 11 should be released in about six
> weeks on 31 July 2021.
> Thanks for the excellence in this list
>
> All best, as ever,
>
> Andy Cater
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

From pbisbal at pppl.gov  Tue Jun 22 16:08:50 2021
From: pbisbal at pppl.gov (Prentice Bisbal)
Date: Tue, 22 Jun 2021 12:08:50 -0400
Subject: [Beowulf] [External] Re: AMD and AVX512
In-Reply-To: 
References: 
Message-ID: <899a3ebb-3797-7317-4122-15c82ec184e9@pppl.gov>

Thanks for the resources. I will definitely read/watch them when I can
block out some time. I see he uses LAMMPS as one of his benchmarks. I
was considering adding LAMMPS to my testing regimen, since it's a code
I have familiarity with, and my own background is in chemistry.

Prentice

On 6/20/21 1:38 AM, John Hearns wrote:
> Regarding benchmarking real world codes on AMD, every year Martyn
> Guest presents a comprehensive set of benchmark studies to the UK
> Computing Insights Conference.
> I suggest a Sunday afternoon with the beverage of your choice is a
> good time to settle down and take time to read these or watch the
> presentation.
>
> 2019
> https://www.scd.stfc.ac.uk/SiteAssets/Pages/CIUK-2019-Presentations/Martyn_Guest.pdf
>
> 2020 Video session
> https://ukri.zoom.us/rec/share/ajvsxdJ8RM1wzpJtnlcypw4OyrZ9J27nqsfAG7eW49Ehq_Z5igat_7gj21Ge8gWu.78Cd9I1DNIjVViPV?startTime=1607008552000
>
> Skylake / Cascade Lake / AMD Rome
>
> The slides for 2020 do exist - as I remember all the slides from all
> talks are grouped together, but I cannot find them.
> Watch the video - it is an excellent presentation.
>
> On Sat, 19 Jun 2021 at 16:49, Gerald Henriksen wrote:
>
> On Wed, 16 Jun 2021 13:15:40 -0400, you wrote:
>
> >The answer given, and I'm not making this up, is that AMD listens to
> >their users and gives the users what they want, and right now they're
> >not hearing any demand for AVX512.
> >
> >Personally, I call BS on that one. I can't imagine anyone in the HPC
> >community saying "we'd like processors that offer only 1/2 the
> >floating point performance of Intel processors".
>
> I suspect that is marketing speak, which roughly translates to not
> that no one has asked for it, but rather requests haven't reached a
> threshold where the requests are viewed as significant enough.
>
> >Sure, AMD can offer more cores, but with only AVX2, you'd need twice
> >as many cores as Intel processors, all other things being equal.
>
> But of course all other things aren't equal.
>
> AVX512 is a mess.
>
> Look at the Wikipedia page(*) and note that AVX512 means different
> things depending on the processor implementing it.
>
> So what does the poor software developer target?
>
> Or that it can for heat reasons cause CPU frequency reductions,
> meaning real world performance may not match theoretical - thus easier
> to just go with GPU's.
>
> The result is that most of the world is quite happily (at least for
> now) ignoring AVX512 and going with GPU's as necessary - particularly
> given the convenient libraries that Nvidia offers.
>
> >I compared a server with dual AMD EPYC 7H12 processors (128 cores) to
> >a server with quad Intel Xeon 8268 processors (96 cores).
> >
> >From what I've heard, the AMD processors run much hotter than the
> >Intel processors, too, so I imagine a FLOPS/Watt comparison would be
> >even less favorable to AMD.
>
> Spec sheets would indicate AMD runs hotter, but then again you
> benchmarked twice as many Intel processors.
> So, per the spec sheets for your processors above:
>
> AMD - 280W - 2 processors means system 560W
> Intel - 205W - 4 processors means system 820W
>
> (and then you also need to factor in purchase price).
>
> >An argument can be made that calculations that lend themselves to
> >vectorization should be done on GPUs, instead of the main processors,
> >but the last time I checked, GPU jobs are still memory limited, and
> >moving data in and out of GPU memory can still take time, so I can
> >see situations where for large amounts of data using CPUs would be
> >preferred over GPUs.
>
> AMD's latest chips support PCI 4 while Intel is still stuck on PCI 3,
> which may or may not mean a difference.
>
> But despite all of the above and the other replies, it is AMD who has
> been winning the HPC contracts of late, not Intel.
>
> * - https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From peter.st.john at gmail.com  Tue Jun 22 18:35:11 2021
From: peter.st.john at gmail.com (Peter St. John)
Date: Tue, 22 Jun 2021 14:35:11 -0400
Subject: [Beowulf] [External] Just a quick heads up: new Beowulf coming ...
In-Reply-To: 
References: 
Message-ID: 

How confusing, I thought you were talking about *trees*:
https://en.wikipedia.org/wiki/Tree_(data_structure) :-)

On Tue, Jun 22, 2021 at 12:02 PM Prentice Bisbal via Beowulf <
beowulf at beowulf.org> wrote:

> I doubt it will impact us much.
> At worst, one or two people might find
> this list and post inappropriate questions. At best, it will be
> entertaining, similar to when someone posts a question about trees to
> https://www.reddit.com/r/trees, which is about smoking marijuana, or a
> marijuana enthusiast posts something to
> https://www.reddit.com/r/MarijuanaEnthusiasts, which is about trees. I
> would consider both of those links NSFW.
>
> Prentice
>
> On 6/20/21 8:13 AM, Andrew M.A. Cater wrote:
> > The folks over at Devuan have chosen to name their next code release
> > (based on upcoming Debian 11) beowulf. I don't think it will impact
> > this list - but you never know.
> >
> > For anyone running Debian on HPC / in labs: the latest Debian point
> > release 10.10 was yesterday. Debian 11 should be released in about
> > six weeks on 31 July 2021.
> >
> > Thanks for the excellence in this list
> >
> > All best, as ever,
> >
> > Andy Cater
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> > To change your subscription (digest mode or unsubscribe) visit
> > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From james.p.lux at jpl.nasa.gov  Wed Jun 23 00:11:57 2021
From: james.p.lux at jpl.nasa.gov (Lux, Jim (US 7140))
Date: Wed, 23 Jun 2021 00:11:57 +0000
Subject: [Beowulf] [EXTERNAL] Re: AMD and AVX512
In-Reply-To: <5fad1acd-31d1-b624-34cd-9e53742c251b@gmail.com>
References: <61925435-be88-efec-dfdc-8895f3e79616@gmail.com>
 <0f1a31ff-b922-1fad-9dbf-9d87ca9c44ca@gmail.com>
 <5fad1acd-31d1-b624-34cd-9e53742c251b@gmail.com>
Message-ID: <319DC392-FDB0-4789-93B2-D5C7370681FD@jpl.caltech.edu>

From: Beowulf on behalf of Joe Landman
Date: Monday, June 21, 2021 at 6:46 AM
To: Jonathan Engwall
Cc: "beowulf at beowulf.org"
Subject: [EXTERNAL] Re: [Beowulf] AMD and AVX512

On 6/21/21 9:20 AM, Jonathan Engwall wrote:
I have followed this thinking "square peg, round hole."
You have got it again, Joe. Compilers are your problem.

To date, I don't know that *compilers* pay much attention to things like
IO (that's buried in some library call no doubt).

>> Maybe, someday, we'll get a great HPC compiler for C/Fortran.

Wasn't the Fortran compiler for the 7600 highly optimized? Did vector
unrolling and all that. And those compilers for the FPS boxes? I think
you mean great HPC compilers for chips that are available and fast.

I think, too, that the comments about ARM vs x86 vs whatever are
interesting. We've moved a long way from clusters where the ethernet
interconnect was rate limiting, and the nodes were single core, single
memory, single disk (if any). When you start getting into processors
with hundreds of cores, or you start looking at "nanojoules/instruction"
(or is instruction even the right thing to be counting.. maybe it's
nanojoules/data operation - where that could be a read/write from
memory, disk, or interprocessor link).

Look at the (probably) specious claim that Tesla has the 5th fastest
supercomputer - articles are very light on details, but I think it's a
whole bunch of GPUs - but their "number of cores" isn't very big
compared to even #100 on the "Top 500" list.
However, it might well be that for Tesla's specific processing load,
that 5000 GPU cores *is* faster than most Top 500 clusters.

And, given the recent news about miners consuming all those joules -
maybe our metrics should be looking at more than raw speed.

Jim (who has not just 1, but TWO, ARM based clusters on the shelf behind
his desk.. Yes, Beaglebones, but it's an ARM, it's 4 nodes, and I use
various cluster tools to manipulate them - the connection fabric for one
is kind of slow (802.11))

-------------- next part --------------
An HTML attachment was scrubbed...
URL: