From brockp at umich.edu Mon Aug 3 05:56:04 2009
From: brockp at umich.edu (Brock Palen)
Date: Mon, 3 Aug 2009 08:56:04 -0400
Subject: [Beowulf] Lustre Featured on Podcast
Message-ID:

Thanks to Andreas for taking an hour out to talk with Jeff Squyres and myself (Brock Palen) about the Lustre cluster filesystem on our podcast, www.rce-cast.com. You can find the whole show at:
http://www.rce-cast.com/index.php/Podcast/rce-14-lustre-cluster-filesystem.html

Thanks again! If any of you have requests for topics you would like to hear, please let me know!

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985

From niftyompi at niftyegg.com Mon Aug 3 22:29:50 2009
From: niftyompi at niftyegg.com (NiftyOMPI Tom Mitchell)
Date: Mon, 3 Aug 2009 22:29:50 -0700
Subject: [Beowulf] Fabric design consideration
In-Reply-To: <3E9B990982B6404CAC116FE42F1FC97240F7BCE1@USFMB1.forest.usf.edu>
References: <3E9B990982B6404CAC116FE42F1FC97240F7BCE1@USFMB1.forest.usf.edu>
Message-ID: <88815dc10908032229n35dc509clba0b1a52ab6af8f1@mail.gmail.com>

On Thu, Jul 30, 2009 at 8:18 AM, Smith, Brian wrote:
> Hi, All,
>
> I've been re-evaluating our existing InfiniBand fabric design for our HPC systems since I've been tasked with determining how we will add more systems in the future as more and more researchers opt to add capacity to our central system. We've already gotten to the point where we've used up all available ports on the 144 port SilverStorm 9120 chassis that we have and we need to expand capacity. One option that we've been floating around -- that I'm not particularly fond of, btw -- is to purchase a second chassis and link them together over 24 ports, two per spine. While a good deal of our workload would be OK with 5:1 blocking and 6 hops (3 across each chassis), I've determined that, for the money, we're definitely not getting the best solution.
>
> The plan that I've put together involves using the SilverStorm as the core in a spine-leaf design. We'll go ahead and purchase a batch of 24 port QDR switches, two for each rack, to connect our 156 existing nodes (with up to 50 additional on the way). Each leaf will have 6 links back to the spine for 3:1 blocking and 5 hops (2 for the leafs, 3 for the spine). This will allow us to scale the fabric out to 432 total nodes before having to purchase another spine switch. At that point, half of the six uplinks will go to the first spine, half to the second. In theory, it looks like we can scale this design -- with future plans to migrate to a 288 port chassis -- to quite a large number of nodes. Also, just to address this up front, we have a very generic workload, with a mix of md, ab initio, cfd, fem, blast, rf, etc.
>
> If the good folks on this list would be kind enough to give me your input regarding these options or possibly propose a third (or fourth) option, I'd very much appreciate it.
>
> Brian Smith

I think the hop count is a smaller design issue than cable length for QDR. Cable length and the physical layout of hosts in the machine room may prove to be the critical issue in your design. Also, since routing is static, some seemingly obvious assumptions about routing, links, cross-sectional bandwidth and blocking can be non-obvious.

Also less obvious to a group like this are your storage, job mix and batch system.

For example, in a single rack with a pair of QDR 24 port switches, you might wish to have two or three links connecting those 24 port switches directly at QDR rates.
Then the remaining three or four links would connect (DDR?) back to the 144-port switch. If the batch system was 'rack aware', jobs that could run on a single rack would stay within it, and jobs that had ranks scattered about would see a lightly loaded central switch.

Adding QDR to the mix as you scale out to 400+ nodes using newer multi-core processor nodes could be fun.

When you knock on vendor doors, ask about optical links... QDR optical links may let you reach beyond some classic fabric layouts as your machine room and cpu core count grows.

--
NiftyOMPI
T o m M i t c h e l l

From brockp at umich.edu Tue Aug 4 11:48:21 2009
From: brockp at umich.edu (Brock Palen)
Date: Tue, 4 Aug 2009 14:48:21 -0400
Subject: [Beowulf] force factory reset of sfs7000 (topspin 120)
Message-ID: <635DE2F6-3A2C-4A58-91F1-072288667650@umich.edu>

We have a Cisco SFS7000 IB switch (maybe still under support, waiting on Cisco), also known as a Topspin 120.

We cannot log in with the password we (thought) we had set it to. I have looked online and found little tonight about forcing the switch back to factory defaults without a login.

The serial console works fine, we just can't log in. We can poke at the firmware a little by stopping the boot; we just don't know what to do from there. If anyone has directions on how to force the SFS7000 back to factory defaults, or can help with password recovery, that would be great.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985

From Greg at keller.net Wed Aug 5 08:41:23 2009
From: Greg at keller.net (Greg Keller)
Date: Wed, 5 Aug 2009 10:41:23 -0500
Subject: [Beowulf] Re NiftyOMPI Tom Mitchell
In-Reply-To: <200908041900.n74J08Em000968@bluewest.scyld.com>
References: <200908041900.n74J08Em000968@bluewest.scyld.com>
Message-ID: <843B6DC4-1123-4AE3-A52E-2D187C67C201@Keller.net>

> Brian,

A 3rd option: upgrade your Chassis to 288 ports. The beauty of SS/Qlogic switches is they all use the same components. The Chassis/Backplane are relatively dumb and cheap. You can re-use your spine switches and leaf switches. You don't even need to add the additional spine switches if 2:1 blocking is OK.

Be very careful which ports you use to link the switches together if you do try and splice 2 chassis together. SMs can have trouble mapping many configurations, and you're probably best off dedicating line cards as "Uplink" or "Compute" (but don't mix/match) if I recall the layouts correctly. With these "multi-tiered" switches the SM sometimes can't figure out which way is up if you mix the ports, apparently.

A 4th Option: 36 Port QDR + DDR
Also note that the QDR switches are based on 36 port chips and are not a huge price jump (per port), so with a "Hybrid" cable for the uplinks, you may be able to purchase the newer technology and block the heck out of it. So adding 48 additional nodes could be as easy as:
Disconnect 48 nodes for uplinks from the core switch
Connect 4 x 36 port QDR with 12 uplinks to each
Connect 48 old, and 48 new nodes to the 36 port QDR "edge"
This leaves you with 96 nodes on each side of a 48 port

Option 3 is the cleanest, and generically my favorite if you can get a chassis for a reasonable price.

Cheers!
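One sanity check, whichever option wins: after the recabling, dump the fabric and make sure the subnet manager routed it the way you intended. The stock OFED diagnostics are enough for a first pass; a rough sketch, assuming the infiniband-diags/ibutils tools are installed (the output file name is just an example):

  ibnetdiscover > fabric.topo   # topology exactly as the SM discovered it
  ibswitches                    # quick count of the switch chips the SM can see
  ibdiagnet                     # per-link error counters and basic fabric sanity checks

Comparing the ibnetdiscover dump before and after the change also makes it obvious if an uplink landed on the wrong leaf.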
Greg
From deadline at eadline.org Wed Aug 5 11:36:53 2009
From: deadline at eadline.org (Douglas Eadline)
Date: Wed, 5 Aug 2009 14:36:53 -0400 (EDT)
Subject: [Beowulf] BProc
In-Reply-To:
References:
Message-ID: <33345.192.168.1.213.1249497413.squirrel@mail.eadline.org>

If you would like to read more from Don, take a look at the newly posted interview:

Don Becker On The State Of HPC
http://www.linux-mag.com/cache/7449/1.html

--
Doug

From douglas.guptill at dal.ca Wed Aug 5 11:52:50 2009
From: douglas.guptill at dal.ca (Douglas Guptill)
Date: Wed, 5 Aug 2009 15:52:50 -0300
Subject: [Beowulf] BProc
In-Reply-To: <33345.192.168.1.213.1249497413.squirrel@mail.eadline.org>
References: <33345.192.168.1.213.1249497413.squirrel@mail.eadline.org>
Message-ID: <20090805185250.GA12440@dome>

On Wed, Aug 05, 2009 at 02:36:53PM -0400, Douglas Eadline wrote:
>
> If you would like to read more from Don, take a look
> at newly posted interview:
>
> Don Becker On The State Of HPC
>
> http://www.linux-mag.com/cache/7449/1.html

Love it.

Thanks,
Douglas.

From kus at free.net Fri Aug 7 11:51:21 2009
From: kus at free.net (Mikhail Kuzminsky)
Date: Fri, 07 Aug 2009 22:51:21 +0400
Subject: [Beowulf] numactl & SuSE11.1
Message-ID:

I have OpenSuSE 11.1 (kernel 2.6.22.5-31) installed on a dual Nehalem (Xeon E5520) server.

numactl --show
says:
libnuma: Warning : /sys not mounted or invalid. Assuming one node: No such file or directory

/sys/devices/system/node contains 2 directories, but they are node0 and node2 (instead of node1, which I expected). How is it possible to correct this situation?

Mikhail Kuzminsky
Computer Assistant to Chemical Research Center
Zelinsky Institute of Organic Chemistry RAS
Moscow

From rpnabar at gmail.com Fri Aug 7 16:59:24 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Fri, 7 Aug 2009 18:59:24 -0500
Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem
Message-ID:

Is it a bad mistake to configure a Nehalem (2 sockets quad core giving a total of 8 cores; E5520) with 16 GB RAM (4 DIMMs of 4GB each)? I know (I think) that the optimized memory for Nehalems is in banks of 6 due to the way the architecture is? I have often seen Nehalems coming with 24 GB memory as 6 DIMMs of 4 GB each.
Our code requirements dictate 2 GB / core is enough. Should I be paying for the additional RAM to make it 24 GB?

Also, are there any other tips for the Nehalems in general to coax out max performance? Maybe some compiler flags or BIOS settings etc? The only thing I did so far was to put the BIOS power setting into a "max performance" mode.

In the past I've gotten about 5% additional performance by changing the power profile to "performance" using cpufreq-set on my AMD Opteron Barcelonas. Any similar gotchas for the Nehalems and HPC?

--
Rahul

From gus at ldeo.columbia.edu Fri Aug 7 17:58:39 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Fri, 07 Aug 2009 20:58:39 -0400
Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem
In-Reply-To:
References:
Message-ID: <4A7CCDBF.7070904@ldeo.columbia.edu>

Hi Rahul, list

In case you haven't read it, this Nehalem memory guide from Dell has good information and the memory configuration details:
http://www.delltechcenter.com/page/04-08-2009+-+Nehalem+and+Memory+Configurations

A researcher here bought a Nehalem workstation (not a cluster) with 24GB RAM also. We followed the article's recommendation, which was also what the vendor suggested. Maybe 24GB is more than needed, but presumably it avoids the performance penalty that would hit a 16GB configuration. Since the computer will mostly run Matlab jobs, and Matlab has no bounds when it comes to memory, it may not have been a waste anyway.

Some people are reporting good results when using the Nehalem hyperthreading feature (activated on the BIOS). When the code permits, this virtually doubles the number of cores on Nehalems. That feature works very well on IBM PPC-6 processors (IBM calls it "simultaneous multi-threading" SMT, IIRR), and scales by a factor of >1.5, at least with the atmospheric model I tried. This may be a useful way to explore your 24GB, say, by running 12 processes on an 8-core node (50% oversubscribed), instead of the 8 processes that you run today on the Barcelonas.

As for compiler flags, if you are using Intel these are probably good:
-wS (which gives you SSE4, but check if there is something fancier now for Nehalem)
-fast, although some of our codes had problems with the -ipo that is part of -fast, and I had to reduce it to -ip plus the other bits and pieces of -fast.

I hope this helps,
Gus Correa

---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

Rahul Nabar wrote:
> Is it a bad mistake to configure a Nehalem (2 sockets quad core giving
> a total of 8 cores; E5520) with 16 GB RAM (4 DIMMs of 4GB each)? I
> know (I think) that the optimized memory for Nehalems is in banks of 6
> due to the way the architecture is? I have often seen Nehalems coming
> with 24 GB memory as 6 DIMMs of 4 GB each.
>
> Our code requirements dictate 2 GB / core is enough. Should I be
> paying for the additional RAM to make it 24 GB?
>
> Also, are there any other tips for the Nehalems in general to coax out
> max performance? Maybe some compiler flags or BIOS settings etc? The
> only thing I did so far was to put the BIOS power setting into a "max
> performance" mode.
>
> In the past I've gotten about 5% additional performance by changing
> the power profile to "performance" using cpufreq-set on my AMD
> Opteron Barcelonas. Any similar gotchas for the Nehalems and HPC?
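P.S. On the power-profile question: the same cpufreq tools you used on the Barcelonas should apply to the Nehalems, assuming your distribution ships cpufrequtils and a cpufreq driver is loaded for these CPUs; a minimal sketch:

  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # what cpu0 is doing right now
  # force the performance governor on every core (as root)
  for g in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
      echo performance > $g
  done

Whether this buys anything beyond the BIOS "max performance" setting is something only a benchmark of your own code will tell.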
> From davidramirezmolina at gmail.com Fri Aug 7 12:42:58 2009 From: davidramirezmolina at gmail.com (David Ramirez) Date: Fri, 7 Aug 2009 14:42:58 -0500 Subject: [Beowulf] Small form computers as cluster nodes - any comments about the Shuttle brand ? Message-ID: Due to space constraints I am considering implementing a 8-node (+ master) HPC cluster project using small form computers. Knowing that Shuttle is a reputable brand, with several years in the market, I wonder if any of you out there has already used them on clusters and how has been your experience (performance, reliability etc.) -- | David Ramirez Molina | davidramirezmolina at gmail.com | Houston, Texas - USA Ancora Imparo (A?n aprendo) - Michelangelo a los 80 a?os -------------- next part -------------- An HTML attachment was scrubbed... URL: From chenyon1 at iit.edu Fri Aug 7 12:55:38 2009 From: chenyon1 at iit.edu (Yong Chen) Date: Fri, 07 Aug 2009 19:55:38 +0000 (GMT) Subject: [Beowulf] [hpc-announce] Call for Attendance: P2S2-2009 Workshop Message-ID: Dear Colleagues, The Second International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2) will be held in Vienna, Austria, on Sept. 22nd, 2009 in conjunction with The 38th International Conference on Parallel Processing (ICPP-2009). The workshop program has been finalized and can be found here: http://www.mcs.anl.gov/events/workshops/p2s2/pro.html (listed below for your reference). We welcome you attend the P2S2-2009 workshop and look forward to seeing you in Vienna, Austria! =============================================================================== Session 1: Opening Time: 09:00 - 10:30, Location: Room F3 (89), Chair: Pavan Balaji, Argonne National Laboratory Opening Remarks (D. K. Panda, Pavan Balaji and Abhinav Vishnu) Invited Keynote by Dr. Pete Beckman, Argonne National Laboratory, "Challenges for System Software on Exascale Platforms" 10:30 - 11:00 Coffee Break Session 2: Software for Large-scale Systems Time: 11:00 - 12:30, Location: Room F3 (89), Chair: Tom Peterka, Argonne National Laboratory 1. "Characterizing the Performance of Big Memory on Blue Gene Linux" Kazutomo Yoshii, Kamil Iskra, P. Chris Broekema, Harish Naik and Pete Beckman 2. "Optimization of Preconditioned Parallel Iterative Solvers for Finite-Element Applications using Hybrid Parallel Programming Models on T2K Open Supercomputer (Todai Combined Cluster)" Kengo Nakajima 3. "Analyzing Checkpointing Trends for Applications on Peta-scale Systems" Harish Naik, Rinku Gupta and Pete Beckman 12:30 - 14:00 Lunch Session 3: Communication and I/O Time: 14:00 - 15:30, Location: Room F3 (89), Chair: Abhinav Vishnu, Pacific Northwest National Laboratory 1. "Designing and Evaluating MPI-2 Dynamic Process Management Support for InfiniBand" Tejus Gangadharappa, Matthew Koop and Dhabaleswar K Panda 2. "CkDirect: Unsynchronized One-Sided Communication in a Message-Driven Paradigm" Eric Bohm, Sayantan Chakravorty, Pritish Jetley, Abhinav Bhatele and Laxmikant Kale 3. "Exploiting Latent I/O Asynchrony in Petascale Science Applications" Patrick Widener, Matthew Wolf, Hasan Abbasi, Scott McManus, Mary Payne, Patrick Bridges and Karsten Schwan 4. "Gears4Net - An Asynchronous Programming Model" Martin Saternus, Torben Weis, Sebastian Holzapfel and Arno Wacker 15:30 - 16:00 Coffee Break Session 4: Software for Multicore Architectures Time: 16:00 - 17:30, Location: Room F3 (89), Chair: Ron Brightwell, Sandia National Laboratory 1. 
"Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms" Changjun Hu, Yali Liu and Jianjiang Li 2. "Open Source Software Support for the OpenMP Runtime API for Profiling" Oscar Hernandez, Van Bui, Richard Kufrin and Barbara Chapman 3. "Just-In-Time Renaming and Lazy Write-Back on the Cell/B.E." Pieter Bellens, Rosa Badia and Jesus Labarta From landman at scalableinformatics.com Sat Aug 8 09:51:58 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Sat, 08 Aug 2009 12:51:58 -0400 Subject: [Beowulf] Small form computers as cluster nodes - any comments about the Shuttle brand ? In-Reply-To: References: Message-ID: <4A7DAD2E.4060004@scalableinformatics.com> David Ramirez wrote: > Due to space constraints I am considering implementing a 8-node (+ > master) HPC cluster project using small form computers. Knowing that > Shuttle is a reputable brand, with several years in the market, I wonder > if any of you out there has already used them on clusters and how has > been your experience (performance, reliability etc.) The down sides (from watching others do this) 1) no ECC ram. You will get bit-flips. ECC protects you (to a degree) against some bit-flippage. If you can get ECC memory (and turn on the ECC support in BIOS), by all means, do so. 2) power. One customer from a while ago did this, and found that the power supplies on the units were not able to supply a machine running the processor and memory (and disk/network etc) at nearly full load for many hours. You have to make sure your entire computing infrastructure (in the box) fits in *under* the power budget from the supply. This may be easier these days using "gamer" rigs which have power to handle GPU cards, but keep this in mind anyway. 3) networks. Sadly, the NICs on the hobby machines aren't usually up to the level of quality on the server systems. You might not get PXE capability (though these days, I haven't seen many boards without it). Just evaluate your options carefully with the specs in hand. You will have design tradeoffs due to the space constraint, just keep in mind your goals as you evaluate them. Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From gerry.creager at tamu.edu Sat Aug 8 10:08:37 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Sat, 08 Aug 2009 12:08:37 -0500 Subject: [Beowulf] Small form computers as cluster nodes - any comments about the Shuttle brand ? In-Reply-To: <4A7DAD2E.4060004@scalableinformatics.com> References: <4A7DAD2E.4060004@scalableinformatics.com> Message-ID: <4A7DB115.1020409@tamu.edu> Joe Landman wrote: > David Ramirez wrote: >> Due to space constraints I am considering implementing a 8-node (+ >> master) HPC cluster project using small form computers. Knowing that >> Shuttle is a reputable brand, with several years in the market, I >> wonder if any of you out there has already used them on clusters and >> how has been your experience (performance, reliability etc.) > > The down sides (from watching others do this) > > 1) no ECC ram. You will get bit-flips. ECC protects you (to a degree) > against some bit-flippage. If you can get ECC memory (and turn on the > ECC support in BIOS), by all means, do so. > > 2) power. 
One customer from a while ago did this, and found that the > power supplies on the units were not able to supply a machine running > the processor and memory (and disk/network etc) at nearly full load for > many hours. You have to make sure your entire computing infrastructure > (in the box) fits in *under* the power budget from the supply. This may > be easier these days using "gamer" rigs which have power to handle GPU > cards, but keep this in mind anyway. > > 3) networks. Sadly, the NICs on the hobby machines aren't usually up to > the level of quality on the server systems. You might not get PXE > capability (though these days, I haven't seen many boards without it). Adding to Joe's comments... and having tried this a couple of years ago, the NICs are not up to the drill. Plan to add Intel gigabit NICs. While they'll likely be TOE NICs, learn how to tune them and to stop TOE functionality on them. I've nothing kind to say about Broadcom NICs, and am even less kind in HPC/HTC environments with hobbyist chipset implementations. > Just evaluate your options carefully with the specs in hand. You will > have design tradeoffs due to the space constraint, just keep in mind > your goals as you evaluate them. I'd be trying to find ways to get 1u systems and, if 8 is the number, you'll find they don't take up much room. gerry -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From bill at cse.ucdavis.edu Sat Aug 8 12:53:01 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Sat, 08 Aug 2009 12:53:01 -0700 Subject: [Beowulf] Small form computers as cluster nodes - any comments about the Shuttle brand ? In-Reply-To: <4A7DB115.1020409@tamu.edu> References: <4A7DAD2E.4060004@scalableinformatics.com> <4A7DB115.1020409@tamu.edu> Message-ID: <4A7DD79D.4070804@cse.ucdavis.edu> Gerry Creager wrote: > I'd be trying to find ways to get 1u systems and, if 8 is the number, > you'll find they don't take up much room. Doubly so if you get one of the 2 nodes in 1U or 4 nodes in 2U. From hahn at mcmaster.ca Sat Aug 8 15:42:45 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Sat, 8 Aug 2009 18:42:45 -0400 (EDT) Subject: [Beowulf] numactl & SuSE11.1 In-Reply-To: References: Message-ID: > I've OpenSuSE 11.1 (kernel 2.6.22.5-31) installed on dual Nehalem (Xeon > E5520) server. > numactl -- show > says > libnuma: Warning : /sys not mounted or invalid. Assuming one node: No such > file or directory > > /sys/devices/system/node contains 2 directories, but they are node0 and node2 > (instead node1 which I expected). sounds like the kernel isn't grokking the cpu; given that 2.6.22.5 dates from 08/22/2007, that's not all that surprising... > How is possible to correct this situation ? I'm guessing a new kernel would do it - since all numactl's can grok amd's opterons, they ought to deal with intel's ;) From hahn at mcmaster.ca Sat Aug 8 15:47:47 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Sat, 8 Aug 2009 18:47:47 -0400 (EDT) Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: Message-ID: > Is it a bad mistake to configure a Nehalem (2 sockets quad core giving > a total of 8 cores; E5520) with 16 GB RAM (4 DIMMs of 4GB each)? I there's no ambiguity here: unpopulated channels decrease bandwidth and/or concurrency. (does anyone know whether nehalem can "ungang" memory channels like opteron can? 
it would be fascinating to see benchmarks showing a benefit to higher memory concurrency for a manycore workload...)

> Our code requirements dictate 2 GB / core is enough. Should I be
> paying for the additional RAM to make it 24 GB?

ram is, historically and relatively, cheap. otoh, can your code get by with 1.5G/core? actually, I tend to see some association of smallish memory footprints (2G/core is definitely not large) with cache-friendliness. this would argue that the higher bandwidth may not make much difference to your code...

regards, mark hahn.

From rpnabar at gmail.com Sat Aug 8 16:50:40 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Sat, 8 Aug 2009 18:50:40 -0500
Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem
In-Reply-To: <4A7CCDBF.7070904@ldeo.columbia.edu>
References: <4A7CCDBF.7070904@ldeo.columbia.edu>
Message-ID:

On Fri, Aug 7, 2009 at 7:58 PM, Gus Correa wrote:
> Some people are reporting good results when using the
> Nehalem hyperthreading feature (activated on the BIOS).
> When the code permits, this virtually doubles the number
> of cores on Nehalems.
> That feature works very well on IBM PPC-6 processors
> (IBM calls it "simultaneous multi-threading" SMT, IIRR),
> and scales by a factor of >1.5, at least with the atmospheric
> model I tried.

Thanks for all the useful comments, Gus! Hyperthreading is confusing the hell out of me. I expected to see 8 cores in cat /proc/cpuinfo. Now I see 16. (This means I must have left hyperthreading on, I guess; I ought to go to the server room, reboot and check the BIOS.)

This is confusing my benchmarking too. Let's say I ran an MPI job with -np 4. If there was no other job on this machine, would hyperthreading bring the other CPUs into play as well?

The reason I ask is this: I have noticed that a single 4 core job is slower than two 4 core jobs run simultaneously. This seems puzzling to me.

--
Rahul

From tegner at renget.se Sat Aug 8 22:42:28 2009
From: tegner at renget.se (Jon Tegner)
Date: Sun, 09 Aug 2009 07:42:28 +0200
Subject: [Beowulf] Small form computers as cluster nodes - any comments about the Shuttle brand ?
In-Reply-To:
References:
Message-ID: <4A7E61C4.1040409@renget.se>

David Ramirez wrote:
> Due to space constraints I am considering implementing a 8-node (+
> master) HPC cluster project using small form computers. Knowing that
> Shuttle is a reputable brand, with several years in the market, I
> wonder if any of you out there has already used them on clusters and
> how has been your experience (performance, reliability etc.)

Not exactly what you asked for, but slightly related anyway: I have been working on small AND silent lately. You can check some pictures at

www.renget.se/bilder/clm1s.jpg
www.renget.se/bilder/clm2s.jpg
www.renget.se/bilder/clm3s.jpg
www.renget.se/bilder/clm4s.jpg

You can judge the size from the mainboards (24.5x24.5 cm) or by the standard 3.5 HD. There are no fans in this system (except for the power bricks), and it is reasonably small. Cooling is achieved by transferring heat from the cpus to a cooling channel, and this heat is then removed by natural convection. The orientation of the boards also improves the cooling of the other components on the boards.

The absence of fans (and the small size) makes a system like this suitable for use in an office (1U systems are generally very noisy).

I'm working on an article for ClusterMonkey, and I'll fill in missing details on that forum.
Regards, /jon From tjrc at sanger.ac.uk Sun Aug 9 00:17:43 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Sun, 9 Aug 2009 08:17:43 +0100 Subject: [Beowulf] Small form computers as cluster nodes - any comments about the Shuttle brand ? In-Reply-To: <4A7DB115.1020409@tamu.edu> References: <4A7DAD2E.4060004@scalableinformatics.com> <4A7DB115.1020409@tamu.edu> Message-ID: <73994BB4-4E0F-43C8-9B1E-8BB718FD6BCA@sanger.ac.uk> If space is a constraint, but up-front cost less so, you might want to consider a small blade chassis; something like an HP c-3000, which can take 8 blades. Especially if all you want is a GigE interconnect, which will fit in the same box. Potentially that will get you 64 cores in 6U, and essentialy no extra space required for infrastructure. Presumably other blade vendors do similar things. Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From a.travis at abdn.ac.uk Sun Aug 9 04:30:11 2009 From: a.travis at abdn.ac.uk (Tony Travis) Date: Sun, 09 Aug 2009 12:30:11 +0100 Subject: [Beowulf] Small form computers as cluster nodes - any comments about the Shuttle brand ? In-Reply-To: <4A7DAD2E.4060004@scalableinformatics.com> References: <4A7DAD2E.4060004@scalableinformatics.com> Message-ID: <4A7EB343.3040403@abdn.ac.uk> Joe Landman wrote: > David Ramirez wrote: >> Due to space constraints I am considering implementing a 8-node (+ >> master) HPC cluster project using small form computers. Knowing that >> Shuttle is a reputable brand, with several years in the market, I wonder >> if any of you out there has already used them on clusters and how has >> been your experience (performance, reliability etc.) > > The down sides (from watching others do this) > > 1) no ECC ram. You will get bit-flips. ECC protects you (to a degree) > against some bit-flippage. If you can get ECC memory (and turn on the > ECC support in BIOS), by all means, do so. Hello, Joe and David. I agree about the ECC RAM, but I used to have six IWill dual Opteron 246 SFF computers in a Beowulf cluster and these do have ECC memory. > 2) power. One customer from a while ago did this, and found that the > power supplies on the units were not able to supply a machine running > the processor and memory (and disk/network etc) at nearly full load for > many hours. You have to make sure your entire computing infrastructure > (in the box) fits in *under* the power budget from the supply. This may > be easier these days using "gamer" rigs which have power to handle GPU > cards, but keep this in mind anyway. Absolutely, and that is why I said "I used to have" above! The Iwill's have custom PSU's that are badly overrun, and when they die it's very expensive to get them repaired. Eventually, they can't be repaired: I've had several returned now as "beyond economic repair", and I've decided to retire the IWill's. It's a pity, because the IWill's are nice machines. However, even with 55W Opteron 248HE's fitted the IWill Zmaxdp can't keep the CPU's cool under load unless they have their extremely noisy fans running at full speed. I've kept one Zmaxd2 for desktop use with dual Opteron 246HE's and it's fine, unless you make it work hard ;-) > 3) networks. Sadly, the NICs on the hobby machines aren't usually up to > the level of quality on the server systems. 
You might not get PXE > capability (though these days, I haven't seen many boards without it). Well, the IWill's are/were server-grade machines with GBit NIC's and they do PXE boot. > Just evaluate your options carefully with the specs in hand. You will > have design tradeoffs due to the space constraint, just keep in mind > your goals as you evaluate them. I really would avoid SFF systems as compute nodes: I've just used Tyan ATX FF S3970 motherboards in pedestal cases on industrial shelving and you bear in mind that standard Shuttle cases are only 50% the size of an ATX case. You can get four ATX cases in the space occupied by your eight Shuttle SFF computers... Bye, Tony. -- Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis at abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt From Bogdan.Costescu at iwr.uni-heidelberg.de Sun Aug 9 06:50:59 2009 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Sun, 9 Aug 2009 15:50:59 +0200 (CEST) Subject: [Beowulf] Small form computers as cluster nodes - any comments about the Shuttle brand ? In-Reply-To: References: Message-ID: On Fri, 7 Aug 2009, David Ramirez wrote: > Due to space constraints I am considering implementing a 8-node (+ master) > HPC cluster project using small form computers. Knowing that Shuttle is a > reputable brand, with several years in the market, I wonder if any of you > out there has already used them on clusters and how has been your experience > (performance, reliability etc.) I've built a cluster of 80 nodes, which will turn 5 this month. Using Shuttle SB75G2, supports ECC, has a GigE on board (Broadcom) and the power supply is more than enough for the CPU (PIV Northwood 3.2GHz), one SATA HDD, a low power and performance graphics card (there's no on board graphics unfortunately) and an extra GigE card (Intel E1000). The decision for adding an extra NIC was not due to problems with the Broadcom chip, but simply to have dedicated networks; the Broadcom is able to do PXE just fine and this is the way these nodes have booted since setting them up. I was pleasantly surprised by the reliability of these computers. Given their tightness, they require attention and good skills when building them, f.e. using good quality thermal paste to avoid local thermal problems and routing cables to avoid transport thermal problems. About 70 of the 80 are still running well today, most of the failed ones stopped working correctly after the 3 years of warranty so I didn't make much effort to find out what is wrong - the main problem being instability under combined CPU and I/O load. Of course, when RAM and HDDs failed and were easy to recognize as causes, they were replaced as needed. As I wrote earlier on this list, the main disadvantage of such SFFs is the lack of IPMI support. There is no serial console support in the BIOS, so changing BIOS settings is a pain. Power control can be achieved with a PDU, but I didn't choose this way because I knew that the nodes should be always up and I wouldn't have to press the power buttons too often ;-) Another thing to keep in mind is that, due to their tightness, they are quite sensitive to the external temperature - if the A/C fails, expect a sharp raise in internal temperature, so setting up monitoring, both environmental and for the builtin sensors, is recommended. Good luck! 
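A minimal way to keep an eye on that, assuming lm_sensors is set up on the nodes and the board's sensor chips are supported (the log path and interval are just examples):

  sensors                                   # one-shot reading of the on-board sensors
  while true; do                            # crude once-a-minute temperature log
      date >> /tmp/node-temp.log
      sensors | grep -i temp >> /tmp/node-temp.log
      sleep 60
  done

Anything fancier (Ganglia or Nagios thresholds, IPMI on machines that have it) builds on the same readings.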
-- Bogdan Costescu IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany Phone: +49 6221 54 8240, Fax: +49 6221 54 8850 E-mail: bogdan.costescu at iwr.uni-heidelberg.de From ljdursi at scinet.utoronto.ca Sun Aug 9 08:52:00 2009 From: ljdursi at scinet.utoronto.ca (Jonathan Dursi) Date: Sun, 9 Aug 2009 11:52:00 -0400 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: Message-ID: <432C3A61-6B51-4887-BE0E-C6848BB8E4BF@scinet.utoronto.ca> On 7-Aug-09, at 7:59PM, Rahul Nabar wrote: > Is it a bad mistake to configure a Nehalem (2 sockets quad core giving > a total of 8 cores; E5520) with 16 GB RAM (4 DIMMs of 4GB each)? It depends. You'll have to do the timings with your codes; with mine (a uniform grid explicit hydrodynamics code; memory limited, with extremely regular memory access patterns) I saw a pretty robust 10% performance difference between a 16GB `unbalanced' and an 18GB `balanced' memory configuration. You'll have to do the measurements and decide if the resulting performance gain is worth the cost... - Jonathan -- Jonathan Dursi From gus at ldeo.columbia.edu Sun Aug 9 19:34:07 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Sun, 09 Aug 2009 22:34:07 -0400 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> Message-ID: <4A7F871F.1050805@ldeo.columbia.edu> Hi Rahul, list See answers inline. Rahul Nabar wrote: > On Fri, Aug 7, 2009 at 7:58 PM, Gus Correa wrote: >> Some people are reporting good results when using the >> Nehalem hypethreading feature (activated on the BIOS). >> When the code permits, this virtually doubles the number >> of cores on Nehalems. >> That feature works very well on IBM PPC-6 processors >> (IBM calls it "simultaneous multi-threading" SMT, IIRR), >> and scales by a factor of >1.5, at least with the atmospheric >> model I tried. > > > Thanks for all the useful comments, Gus! Hyperthreading is confusing > the hell out of me. So it is to me. The good news is that according to all reports I read, hyperthreading in Nehalem works well (by contrast with the old version on Pentium-4 and the corresponding Xeons). I expected to see 8 cores in cat /proc/cpuinfo Now > I see 16. (This means I must have left hyperthreading on I guess; I > ought to go to the server room; reboot and check the BIOS) > Most likely it is on. Maybe it is the BIOS default, or the vendor set it up this way. Unfortunately I don't have access to the Nehalem machine. So, I can't check the /proc/cpuinfo here, play with MPI, etc. I helped a grad student configure it, for his thesis research, but the researcher who he works for is a PITA. Bad politics. > This is confusing my benchmarking too. Let's say I ran an MPI job with > -np 4. If there was no other job on this machine would hyperthreading > bring the other CPUs into play as well? > Which MPI do you use? IIRR, you have Gigabit Ethernet, right? (not Infiniband) If you use OpenMPI, you can set the processor affinity, i.e. bind each MPI process to one "processor" (which was once a CPU, then became a core, and now is probably a virtual processor associated to the hyperthreaded Nehalem core). In my experience (and other people's also) this improves performance. On the Opteron Shanghais we have, "top" shows the process number always paired with the "procesor", which in this case is a core, when processor affinity is set. 
I presume with Nehalem the thing will work, although the processes will be paired with the multithreaded core.

In OpenMPI all this takes is to add the flag:
-mca mpi_paffinity_alone 1
to the mpiexec line. OpenMPI has finer grained control of processor affinity through a file where you make the process-to-processor association. However, the setup above may be good enough for jazz, and is quite simple.

Up to MPICH2 1.0.8p1 there was no such thing in MPICH. However, I haven't checked their latest greatest version 1.1. They may have something now.

> The reason I ask is this: I have noticed that a single 4 core job is
> slower than two 4 core jobs run simultaneously. This seems puzzling to
> me.

It is possible that this is the result of not setting processor affinity. The Linux scheduler may not switch processes across cores/processors efficiently. You may check this out by logging in to a node and using "top", hitting "1" (to show all cores/hyperthreads), then hitting "f" to change the displayed fields, then hitting "j" (check, not sure if it is "j") to show the processor/core/hyperthreaded core.

I would guess you can pair 6 hyperthreaded cores on each socket to 6 processes. This would give a symmetric and probably load balanced distribution of work. This would also handle 12 processes per node, and fully utilize your 24GB of memory, on your production jobs that require 2GB/process. (Not sure you actually have 24GB or 16GB, though. You didn't say how much memory you bought.)

I would be curious to learn what you get with processor affinity on Nehalems. I would guess it should work, like on physical cores. At least on the IBM PPC-6 it does work and improves the performance. I read somebody reporting that it works well with Nehalems too, specifically with an ocean model, getting a decent scaling around 1.4 using 16 processes per node, IIRR.

I hope this helps.
Good luck!

Gus Correa

From rpnabar at gmail.com Sun Aug 9 20:33:09 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Sun, 9 Aug 2009 22:33:09 -0500
Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem
In-Reply-To: <4A7F871F.1050805@ldeo.columbia.edu>
References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu>
Message-ID:

On Sun, Aug 9, 2009 at 9:34 PM, Gus Correa wrote:
> Most likely it is on.
> Maybe it is the BIOS default, or the vendor set it up this way.
>
> Unfortunately I don't have access to the Nehalem machine.
> So, I can't check the /proc/cpuinfo here, play with MPI, etc.
> I helped a grad student configure it, for his thesis research,
> but the researcher who he works for is a PITA. Bad politics.

Is there a way of finding out within Linux if Hyperthreading is on or not? I know there is a BIOS setting, but one of the machines I am testing is remote and I do not have access to the BIOS. I'll ask them, though, but I am impatient to figure it out!

Alternatively, /proc/cpuinfo shows a bunch of cores, say 16. Is there a way to find out if all of these are real cores or hyperthreaded?

--
Rahul

From rpnabar at gmail.com Sun Aug 9 20:42:25 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Sun, 9 Aug 2009 22:42:25 -0500
Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem
In-Reply-To: <4A7F871F.1050805@ldeo.columbia.edu>
References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu>
Message-ID:

On Sun, Aug 9, 2009 at 9:34 PM, Gus Correa wrote:
> See answers inline.

Thanks!

> So it is to me.
> The good news is that according to all reports I read,
> hyperthreading in Nehalem works well

What I am more concerned about is its implications on benchmarking and schedulers.

(a) I am seeing strange scaling behaviours with Nehalem cores, e.g. a specific DFT (Density Functional Theory) code we use is maxing out performance at 2, 4 cpus instead of 8, i.e. runs on 8 cores are actually slower than 2 and 4 cores (depending on setup). It just doesn't make sense to me. We are indeed doing something wrong. And no, it isn't just bad parallelization of this code, since we have run it on AMDs and of course performance increases with cores on a single server for sure.

(b) We usually set up Torque / PBS / Maui to also allow partial server requests, i.e. somebody could say just get 4 cores on a server. The other four cores could go to another job or stay empty. The question is: with hyperthreading this compartmentalization is lost, isn't it? So userA who got 4 cores could end up leeching on the other 4 cores too? Or am I wrong?

>
> Which MPI do you use?

OpenMPI

> IIRR, you have Gigabit Ethernet, right? (not Infiniband)

Yes. That's right. No Infiniband.

> If you use OpenMPI, you can set the processor affinity,
> i.e. bind each MPI process to one "processor" (which was once
> a CPU, then became a core, and now is probably a virtual
> processor associated to the hyperthreaded Nehalem core).
> In my experience (and other people's also) this improves
> performance.

Yup, good point. I have done this with Barcelonas (AMD) and had a 5% boost. Let me try it with the Nehalems too.

>
> It is possible that this is the result of not setting
> processor affinity.
> The Linux scheduler may not switch processes
> across cores/processors efficiently.

So let me double check my understanding. On this Nehalem, if I set the processor affinity, is that akin to disabling hyperthreading too? Or are these two independent concepts?

> (Not sure you actually have 24GB or 16GB, though.
> You didn't say how much memory you bought.)

I am running two tests. machineA has 24 GB, machineB has 16 GB. But other things change too: machineA has the X5550 whereas machineB has the E5520. I'll post the results once I have them for the Nehalems!

Thanks again, Gus. All very helpful.

--
Rahul

From tomislav.maric at gmx.com Mon Aug 10 04:48:02 2009
From: tomislav.maric at gmx.com (Tomislav Maric)
Date: Mon, 10 Aug 2009 13:48:02 +0200
Subject: [Beowulf] Re: [Paraview] VTK under ParaView
In-Reply-To:
References: <4A7FEBD4.4070404@gmx.com>
Message-ID: <4A8008F2.3000506@gmx.com>

Jérôme wrote:
> Hi,
>
> ParaView comes with its own VTK sources. You can find them in the source tree: ./Paraview3/VTK
> The VTK binaries will be put in the ParaView binary tree: ./ParaViewBin/bin
>
> Obviously, the paths depend on how you call it, and on your CMake settings.
>
> Hope that helps
>
> Jerome

Thank you for the advice. I think I have solved the problem by installing the software from the .deb package.

Best regards,
Tomislav

From hahn at mcmaster.ca Mon Aug 10 05:33:27 2009
From: hahn at mcmaster.ca (Mark Hahn)
Date: Mon, 10 Aug 2009 08:33:27 -0400 (EDT)
Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem
In-Reply-To:
References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu>
Message-ID:

> Is there a way of finding out within Linux if Hyperthreading is on or
> not?

in /proc/cpuinfo, I believe it's as simple as siblings > cpu cores. that is, I'm guessing one of your Nehalems shows as having 8 siblings and 4 cpu cores.
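a quick way to check, assuming a kernel that exports those fields in /proc/cpuinfo:

  grep "physical id" /proc/cpuinfo | sort -u | wc -l   # number of populated sockets
  grep "cpu cores" /proc/cpuinfo | sort -u             # real cores per socket
  grep "siblings" /proc/cpuinfo | sort -u              # logical cpus per socket

if siblings comes back larger than cpu cores (e.g. 8 vs 4), HT is enabled.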
From hahn at mcmaster.ca Mon Aug 10 05:41:09 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 10 Aug 2009 08:41:09 -0400 (EDT) Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: > (a) I am seeing strange scaling behaviours with Nehlem cores. eg A > specific DFT (Density Functional Theory) code we use is maxing out > performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are > actually slower than 2 and 4 cores (depending on setup) this is on the machine which reports 16 cores, right? I'm guessing that the kernel is compiled without numa and/or ht, so enumerates virtual cpus first. that would mean that when otherwise idle, a 2-core proc will get virtual cores within the same physical core. and that your 8c test is merely keeping the first socket busy. > other four cores could go to another job or stay empty. Question is > with hyperthreading this compartmentalization is lost isn't it? So > userA who got 4 cores could end up leeching on the other 4 cores too? > Or am I wrong? the kernel/scheduler is smart enough to do mostly the right thing WRT virtual cores. when compiled properly... >> It is possible that this is the result of not setting >> processor affinity. >> The Linux scheduler may not switch processes >> across cores/processors efficiently. > > So let me double check my understanding. On this Nehalem if I set the > processor affinity is that akin to disabling hyperthreading too? Or > are these two independent concepts? processor affinity just means restricting the set of cores a proc can run on. it's orthogonal to the question of choosing the _right_ cores. From hahn at mcmaster.ca Mon Aug 10 08:51:31 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 10 Aug 2009 11:51:31 -0400 (EDT) Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: <20090810154348.GC6915@alice05> References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> <20090810154348.GC6915@alice05> Message-ID: > Googling for 'dmidecode Hyper Thread' I found this 2004 article: the info in /proc/cpuinfo has definitely changed since 2004. From mdidomenico4 at gmail.com Mon Aug 10 09:26:59 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Mon, 10 Aug 2009 12:26:59 -0400 Subject: [Beowulf] sun x4100's with infiniband Message-ID: just cause i've posted it everywhere else, figured i'd make one last ditch effort and see if anyone on this list might know the answer... I have several Sun x4100 with Infiniband servers which appear to be running at 200MB/sec instead of 800MB/sec. It's a freshly reformatted cluster converting from solaris to linux. During the conversion we reset the bios settings with "load optimal defaults" and cleared all the BMC/BIOS events logs and such. Does anyone know which bios setting got changed during the process which dropped the bandwidth? From rpnabar at gmail.com Mon Aug 10 09:29:42 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 10 Aug 2009 11:29:42 -0500 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: On Mon, Aug 10, 2009 at 7:33 AM, Mark Hahn wrote: > in /proc/cpuinfo, I believe it's a simple as siblings > cpu cores. > that is, I'm guessing one of your nehalem's shows as having 8 siblings > and 4 cpu cores. Yes. That works. 
Also looking at the "physical id" helps. I was confused by the ht flag. Apparently that is not relevant. It only indicates whether the CPU can report hyperthreading or not. No wonder all my boxes have that "ht" flag.

--
Rahul

From rpnabar at gmail.com Mon Aug 10 09:41:06 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Mon, 10 Aug 2009 11:41:06 -0500
Subject: [Beowulf] bizarre scaling behavior on a Nehalem
Message-ID:

A while ago Tiago Marques had provided some benchmarking info in a thread ( http://www.beowulf.org/archive/2009-May/025739.html ) and some recent tests that I've been doing made me interested in this snippet again:

>One of the codes, VASP, is very bandwidth limited and loves to run in a
>number of cores multiple of 3. The 5400s are also very bandwidth - memory and
>FSB - limited which causes that they sometimes don't scale well above 6
>cores. They are very fast per core, as someone mentioned, when compared to
>AMD cores.
>These are the times I get from a benchmark I usually run in VASP:
>
>VASP on Core i7:
> - 1 core = 162.453s, 162.778s (no HT)
> - 2 cores = 100s,102s (no HT)
> - 3 cores = 77.835s, 78.195s (no HT)
> - 4 cores = 87.63s, 87.322s (no HT)
> - 6 cores = *76.56s, 76.4s*
> - 6 cores DDR3-1600 CAS9 - 69.654s, 68.816s, 67.7s
>
>HT doesn't add much but DDR3-1600 does. Still, ~78s is very fast with a
>quad-core because our dual 5400s can only do *91s* at best, even using
>tweaks like CPU affinity, which brings it down from 95s, by distributing
>only 3 threads per socket and not 4/2 or having 4 of them constantly jumping
>from socket to socket.

Apparently it shows that the Nehalems for VASP scale well only to 3 cores? Putting 4 cores on the job actually causes the runtime to increase? This seems pretty bizarre to me at first sight, but it is close to what I am getting as well. Have any other people seen similar scaling? (I am trying the cpu affinity flags now to see if that makes a difference.)

How would you explain this? In the past I've seen the codes scale well to core numbers higher than this.

--
Rahul

From rpnabar at gmail.com Mon Aug 10 09:43:22 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Mon, 10 Aug 2009 11:43:22 -0500
Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem
In-Reply-To:
References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu>
Message-ID:

On Mon, Aug 10, 2009 at 7:41 AM, Mark Hahn wrote:
>> (a) I am seeing strange scaling behaviours with Nehalem cores, e.g. a
>> specific DFT (Density Functional Theory) code we use is maxing out
>> performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are
>> actually slower than 2 and 4 cores (depending on setup)
>
> this is on the machine which reports 16 cores, right? I'm guessing
> that the kernel is compiled without numa and/or ht, so enumerates virtual
> cpus first. that would mean that when otherwise idle, a 2-core
> proc will get virtual cores within the same physical core. and that your 8c
> test is merely keeping the first socket busy.

No. On both machines: the one reporting 16 cores and the other reporting 8, i.e. one hyperthreaded and the other not. Both have 8 physical cores.

What is bizarre is I tried using -np 16. That ought to definitely utilize all cores, right? I'd have expected the 16 core performance to be the best. But no, the performance peaks at a smaller number of cores.
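For the next round I'm going to pin the ranks explicitly so the HT siblings can't muddy the comparison; roughly along these lines, assuming OpenMPI's paffinity knob and taskset from util-linux (the binary name and core numbers are placeholders -- the real physical-core numbering has to be read off /proc/cpuinfo first):

  # let Open MPI bind one rank per core
  mpirun -np 8 -mca mpi_paffinity_alone 1 ./dft_code
  # or pin a small test by hand to known physical cores
  taskset -c 0,1,2,3 ./dft_code

If the numbers still peak below 8 cores with clean pinning, then it really is memory bandwidth and not scheduler noise.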
-- Rahul From hahn at mcmaster.ca Mon Aug 10 10:04:56 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 10 Aug 2009 13:04:56 -0400 (EDT) Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: >> this is on the machine which reports 16 cores, right? ?I'm guessing >> that the kernel is compiled without numa and/or ht, so enumerates virtual >> cpus first. ?that would mean that when otherwise idle, a 2-core >> proc will get virtual cores within the same physical core. ?and that your 8c >> test is merely keeping the first socket busy. > > No. On both machines. The one reporting 16 cores and the other > reporting 8. i.e. one hyperthreaded and the other not. Both having 8 > physical cores. > > What is bizarre is I tried using -np 16. THat ought to definitely > utilize all cores, right? I'd have expected the 16 core performance to > be the best. BUt no the performance peaks at a smaller number of > cores. I think I would still invoke kernel miscompilation, since if the kernel isn't aware of the memory/core/socket topology, it probably makes quite poor affinity-oblivious allocations. this is the machine where numactl doesn't do anything sensible, right? From kus at free.net Mon Aug 10 10:43:56 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Mon, 10 Aug 2009 21:43:56 +0400 Subject: [Beowulf] numactl & SuSE11.1 In-Reply-To: <4A7E038A.6080406@gmail.com> Message-ID: I'm sorry for my mistake: the problem is on Nehalem Xeon under SuSE -11.1, but w/kernel 2.6.27.7-9 (w/Supermicro X8DT mobo). For Opteron 2350 w/SuSE 10.3 (w/ more old 2.6.22.5-31 -I erroneously inserted this string in my previous message) numactl works OK (w/Tyan mobo). NUMA is enabled in BIOS. Of course, CONFIG_NUMA (and CONFIG_NUMA_EMU) are setted to "y" in both kernels. Unfortunately I (i.e. root) can't change files in /sys/devices/system/node (or rename directory node2 to node1) :-( - as it's possible w/some files in /proc filesystem. It's interesting, that extraction from dmesg show, that IT WAS NODE1, but then node2 is appear ! ACPI: SRAT BF79A4B0, 0150 (r1 041409 OEMSRAT 1 INTL 1) ACPI: SSDT BF79FAC0, 249F (r1 DpgPmm CpuPm 12 INTL 20051117) ACPI: Local APIC address 0xfee00000 SRAT: PXM 0 -> APIC 0 -> Node 0 SRAT: PXM 0 -> APIC 2 -> Node 0 SRAT: PXM 0 -> APIC 4 -> Node 0 SRAT: PXM 0 -> APIC 6 -> Node 0 SRAT: PXM 1 -> APIC 16 -> Node 1 SRAT: PXM 1 -> APIC 18 -> Node 1 SRAT: PXM 1 -> APIC 20 -> Node 1 SRAT: PXM 1 -> APIC 22 -> Node 1 SRAT: Node 0 PXM 0 0-a0000 SRAT: Node 0 PXM 0 100000-c0000000 SRAT: Node 0 PXM 0 100000000-1c0000000 SRAT: Node 2 PXM 257 1c0000000-340000000 (here !!) NUMA: Allocated memnodemap from 1c000 - 22880 NUMA: Using 20 for the hash shift. 
Bootmem setup node 0 0000000000000000-00000001c0000000 NODE_DATA [0000000000022880 - 000000000003a87f] bootmap [000000000003b000 - 0000000000072fff] pages 38 (8 early reservations) ==> bootmem [0000000000 - 01c0000000] #0 [0000000000 - 0000001000] BIOS data page ==> [0000000000 - 0000001000] #1 [0000006000 - 0000008000] TRAMPOLINE ==> [0000006000 - 0000008000] #2 [0000200000 - 0000bf27b8] TEXT DATA BSS ==> [0000200000 - 0000bf27b8] #3 [0037a3b000 - 0037fef104] RAMDISK ==> [0037a3b000 - 0037fef104] #4 [000009cc00 - 0000100000] BIOS reserved ==> [000009cc00 - 0000100000] #5 [0000010000 - 0000013000] PGTABLE ==> [0000010000 - 0000013000] #6 [0000013000 - 000001c000] PGTABLE ==> [0000013000 - 000001c000] #7 [000001c000 - 0000022880] MEMNODEMAP ==> [000001c000 - 0000022880] Bootmem setup node 2 00000001c0000000-0000000340000000 NODE_DATA [00000001c0000000 - 00000001c0017fff] bootmap [00000001c0018000 - 00000001c0047fff] pages 30 (8 early reservations) ==> bootmem [01c0000000 - 0340000000] #0 [0000000000 - 0000001000] BIOS data page #1 [0000006000 - 0000008000] TRAMPOLINE #2 [0000200000 - 0000bf27b8] TEXT DATA BSS #3 [0037a3b000 - 0037fef104] RAMDISK #4 [000009cc00 - 0000100000] BIOS reserved #5 [0000010000 - 0000013000] PGTABLE #6 [0000013000 - 000001c000] PGTABLE #7 [000001c000 - 0000022880] MEMNODEMAP found SMP MP-table at [ffff8800000ff780] 000ff780 [ffffe20000000000-ffffe20006ffffff] PMD -> [ffff880028200000-ffff88002e1fffff] on node 0 [ffffe20007000000-ffffe2000cffffff] PMD -> [ffff8801c0200000-ffff8801c61fffff] on node 2 Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow From renato.oferenda at gmail.com Mon Aug 10 08:43:49 2009 From: renato.oferenda at gmail.com (Renato Callado Borges) Date: Mon, 10 Aug 2009 12:43:49 -0300 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: <20090810154348.GC6915@alice05> On Mon, Aug 10, 2009 at 08:33:27AM -0400, Mark Hahn wrote: >> Is there a way of finding out within Linux if Hyperthreading is on or >> not? > > in /proc/cpuinfo, I believe it's a simple as siblings > cpu cores. > that is, I'm guessing one of your nehalem's shows as having 8 siblings > and 4 cpu cores. Googling for 'dmidecode Hyper Thread' I found this 2004 article: http://www.linux.com/archive/articles/41088 And it says: "I would have liked to just read /proc/cpuinfo to determine if Hyper-Threading is enabled, but currently that info is not exported to that file. /proc/cpuinfo just displays the number of physical CPUs in the system and ignores Hyper-Threading. The process of using x86info is similar to the process of using dmidecode: execute and parse the output. In this case, x86info will say _The physical package supports 2 logical processors_ if Hyper-Threading is enabled on a standard Xeon system." Installed x86info in my box, ran it and (correctly) it says my box' physical package supports 1 logical processor. (It's a Pentium 4). -- []'s, RCB. 
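Along the same lines, the siblings vs. "cpu cores" comparison Mark suggested is easy to script without dmidecode or x86info; a minimal sketch, assuming a kernel new enough to export both fields in /proc/cpuinfo:

# HT check from /proc/cpuinfo: compare logical siblings to physical cores per package
siblings=$(awk -F: '/^siblings/ {print $2+0; exit}' /proc/cpuinfo)
cores=$(awk -F: '/^cpu cores/ {print $2+0; exit}' /proc/cpuinfo)
echo "siblings=$siblings physical cores=$cores"
# siblings > cores  : Hyper-Threading is enabled (2 logical CPUs per core on Nehalem)
# siblings == cores : HT is off, or the kernel does not report it

On older kernels (the Pentium 4 era that the 2004 article describes) the "cpu cores" line may be missing entirely, which is exactly when x86info or dmidecode become the practical fallback.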
From jlb17 at duke.edu Mon Aug 10 12:09:48 2009 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Mon, 10 Aug 2009 15:09:48 -0400 (EDT) Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: On Mon, 10 Aug 2009 at 11:43am, Rahul Nabar wrote > On Mon, Aug 10, 2009 at 7:41 AM, Mark Hahn wrote: >>> (a) I am seeing strange scaling behaviours with Nehlem cores. eg A >>> specific DFT (Density Functional Theory) code we use is maxing out >>> performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are >>> actually slower than 2 and 4 cores (depending on setup) >> >> this is on the machine which reports 16 cores, right? ?I'm guessing >> that the kernel is compiled without numa and/or ht, so enumerates virtual >> cpus first. ?that would mean that when otherwise idle, a 2-core >> proc will get virtual cores within the same physical core. ?and that your 8c >> test is merely keeping the first socket busy. > > No. On both machines. The one reporting 16 cores and the other > reporting 8. i.e. one hyperthreaded and the other not. Both having 8 > physical cores. > > What is bizarre is I tried using -np 16. THat ought to definitely > utilize all cores, right? I'd have expected the 16 core performance to > be the best. BUt no the performance peaks at a smaller number of > cores. Well, as there are only 8 "real" cores, running a computationally intensive process across 16 should *definitely* do worse than across 8. However, it's not so surprising that you're seeing peak performance with 2-4 threads. Nehalem can actually overclock itself when only some of the cores are busy -- it's called Turbo Mode. That *could* be what you're seeing. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From gus at ldeo.columbia.edu Mon Aug 10 12:40:15 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 10 Aug 2009 15:40:15 -0400 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: <4A80779F.4080300@ldeo.columbia.edu> Joshua Baker-LePain wrote: > On Mon, 10 Aug 2009 at 11:43am, Rahul Nabar wrote > >> On Mon, Aug 10, 2009 at 7:41 AM, Mark Hahn wrote: >>>> (a) I am seeing strange scaling behaviours with Nehlem cores. eg A >>>> specific DFT (Density Functional Theory) code we use is maxing out >>>> performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are >>>> actually slower than 2 and 4 cores (depending on setup) >>> >>> this is on the machine which reports 16 cores, right? I'm guessing >>> that the kernel is compiled without numa and/or ht, so enumerates >>> virtual >>> cpus first. that would mean that when otherwise idle, a 2-core >>> proc will get virtual cores within the same physical core. and that >>> your 8c >>> test is merely keeping the first socket busy. >> >> No. On both machines. The one reporting 16 cores and the other >> reporting 8. i.e. one hyperthreaded and the other not. Both having 8 >> physical cores. >> >> What is bizarre is I tried using -np 16. THat ought to definitely >> utilize all cores, right? I'd have expected the 16 core performance to >> be the best. BUt no the performance peaks at a smaller number of >> cores. > > Well, as there are only 8 "real" cores, running a computationally > intensive process across 16 should *definitely* do worse than across 8. 
> However, it's not so surprising that you're seeing peak performance with > 2-4 threads. Nehalem can actually overclock itself when only some of > the cores are busy -- it's called Turbo Mode. That *could* be what > you're seeing. > Hi Rahul, Joshua, list If Rahul is running these tests with his production jobs, which he says require 2GB/process, and if he has 24GB/node (or is it 16GB/node?), then with 16 processes running on a node memory paging probably kicked in, because the physical memory is less than 32GB. Would this be the reason for the drop in performance, Rahul? In any case, Joshua is right that you can't expect linear scaling from 8 to 16 processes on a node. What I saw on an IBM machine with PPC-6 and SMT (similar to Intel hyperthreading) was a speedup of around 1.4, rather than 2. Still a great deal! If I understand right, hyperthreading opportunistically uses idle execution units on a core to schedule a second thread to use them. As clever and efficient as it is, I would guess this mechanism cannot produce as much work as two physical cores. There is an article about it in Tom's Hardware: http://www.tomshardware.com/reviews/Intel-i7-nehalem-cpu,2041-5.html My $0.02 of guesses Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rpnabar at gmail.com Mon Aug 10 13:02:51 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 10 Aug 2009 15:02:51 -0500 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: On Mon, Aug 10, 2009 at 2:09 PM, Joshua Baker-LePain wrote: > Well, as there are only 8 "real" cores, running a computationally intensive > process across 16 should *definitely* do worse than across 8. However, it's > not so surprising that you're seeing peak performance with 2-4 threads. > ?Nehalem can actually overclock itself when only some of the cores are busy > -- it's called Turbo Mode. ?That *could* be what you're seeing. That could very well be it! Is there any way to test if the CPU has overclocked itself? Or can I turn the "turbo mode" off and check? -- Rahul From jlb17 at duke.edu Mon Aug 10 13:07:00 2009 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Mon, 10 Aug 2009 16:07:00 -0400 (EDT) Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: On Mon, 10 Aug 2009 at 3:02pm, Rahul Nabar wrote > On Mon, Aug 10, 2009 at 2:09 PM, Joshua Baker-LePain wrote: >> Well, as there are only 8 "real" cores, running a computationally intensive >> process across 16 should *definitely* do worse than across 8. However, it's >> not so surprising that you're seeing peak performance with 2-4 threads. >> ?Nehalem can actually overclock itself when only some of the cores are busy >> -- it's called Turbo Mode. ?That *could* be what you're seeing. > > That could very well be it! 
Is there any way to test if the CPU has > overclocked itself? > > Or can I turn the "turbo mode" off and check? You *should* be able to turn off turbo mode in the BIOS. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From Craig.Tierney at noaa.gov Mon Aug 10 13:20:36 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Mon, 10 Aug 2009 14:20:36 -0600 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: <4A808114.1020502@noaa.gov> Joshua Baker-LePain wrote: > On Mon, 10 Aug 2009 at 11:43am, Rahul Nabar wrote > >> On Mon, Aug 10, 2009 at 7:41 AM, Mark Hahn wrote: >>>> (a) I am seeing strange scaling behaviours with Nehlem cores. eg A >>>> specific DFT (Density Functional Theory) code we use is maxing out >>>> performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are >>>> actually slower than 2 and 4 cores (depending on setup) >>> >>> this is on the machine which reports 16 cores, right? I'm guessing >>> that the kernel is compiled without numa and/or ht, so enumerates >>> virtual >>> cpus first. that would mean that when otherwise idle, a 2-core >>> proc will get virtual cores within the same physical core. and that >>> your 8c >>> test is merely keeping the first socket busy. >> >> No. On both machines. The one reporting 16 cores and the other >> reporting 8. i.e. one hyperthreaded and the other not. Both having 8 >> physical cores. >> >> What is bizarre is I tried using -np 16. THat ought to definitely >> utilize all cores, right? I'd have expected the 16 core performance to >> be the best. BUt no the performance peaks at a smaller number of >> cores. > > Well, as there are only 8 "real" cores, running a computationally > intensive process across 16 should *definitely* do worse than across 8. > However, it's not so surprising that you're seeing peak performance with > 2-4 threads. Nehalem can actually overclock itself when only some of > the cores are busy -- it's called Turbo Mode. That *could* be what > you're seeing. > We are seeing that the chips will overclock themselves even with all cores running. The percent increase in speed can be from 2-10% per node. I have never had a run (single node HPL) run as slow as it does when Turbo is turned off. However, with all the variation per node, there isn't much of a win for large jobs as they will generally slow down to the slowest node. Craig > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Craig Tierney (craig.tierney at noaa.gov) From bill at cse.ucdavis.edu Mon Aug 10 13:22:49 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Mon, 10 Aug 2009 13:22:49 -0700 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: <4A808199.4050605@cse.ucdavis.edu> Joshua Baker-LePain wrote: > Well, as there are only 8 "real" cores, running a computationally > intensive process across 16 should *definitely* do worse than across 8. I've seen many cases where that isn't true. The P4 rarely justified turning on HT because throughput would often be lower. 
With the nehalem often it helps, the best way to tell is to try it. > However, it's not so surprising that you're seeing peak performance with > 2-4 threads. Nehalem can actually overclock itself when only some of > the cores are busy -- it's called Turbo Mode. That *could* be what > you're seeing. Indeed. From tom.elken at qlogic.com Mon Aug 10 14:07:23 2009 From: tom.elken at qlogic.com (Tom Elken) Date: Mon, 10 Aug 2009 14:07:23 -0700 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: <35AAF1E4A771E142979F27B51793A48885F2481BA4@AVEXMB1.qlogic.org> > Well, as there are only 8 "real" cores, running a computationally > intensive process across 16 should *definitely* do worse than across 8. Not typically. At the SPEC website there are quite a few SPEC MPI2007 (which is an average across 13 HPC applications) results on Nehalem. Summary: IBM, SGI and Platform have some comparisons on clusters with "SMT On" of running 1 rank for every core compared to running 2 ranks on every core. In general, on low core-counts, like up to 32 there is about an 8% advantage for running 2 ranks per core. At larger core counts, IBM published a pair of results on 64 cores where the 64-rank performance was equal to the 128-rank performance. Not all of these applications scale linearly, so on some of them you lose efficiency at 128 ranks compared to 64 ranks. Details: Results from this year are mostly on Nehalem: http://www.spec.org/mpi2007/results/res2009q3/ (IBM) http://www.spec.org/mpi2007/results/res2009q2/ (Platform) http://www.spec.org/mpi2007/results/res2009q1/ (SGI) (Intel has results with Turbo mode turned on and off in the q2 and q3 results, for a different comparison) Or you can pick out the Xeon 'X5570' and 'X5560' results from the list of all results: http://www.spec.org/mpi2007/results/mpi2007.html In the result index, when " Compute Threads Enabled" = 2x "Compute Cores Enabled", then you know SMT is turned on. In these cases, you can then check that when " MPI Ranks" = " Compute Threads Enabled" then you are running 2 ranks per core. -Tom > However, it's not so surprising that you're seeing peak performance > with > 2-4 threads. Nehalem can actually overclock itself when only some of > the > cores are busy -- it's called Turbo Mode. That *could* be what you're > seeing. > > -- > Joshua Baker-LePain > QB3 Shared Cluster Sysadmin > UCSF From rpnabar at gmail.com Mon Aug 10 15:28:59 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 10 Aug 2009 17:28:59 -0500 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: Message-ID: On Mon, Aug 10, 2009 at 12:48 PM, Bruno Coutinho wrote: > This is often caused by cache competition or memory bandwidth saturation. > If it was cache competition, rising from 4 to 6 threads would make it worse. > As the code became faster with DDR3-1600 and much slower with Xeon 5400, > this code is memory bandwidth bound. > Tweaking CPU affinity to avoid thread jumping among cores of the will not > help much, as the big bottleneck is memory bandwidth. > To this code, CPU affinity will only help in NUMA machines to maintain > memory access in local memory. > > > If the machine has enough bandwidth to feed the cores, it will scale. Exactly! But I thought this was the big advance with the Nehalem that it has removed the CPU<->Cache<->RAM bottleneck. 
So if the code scaled with the AMD Barcelona then it would continue to scale with the Nehalem right? I'm posting a copy of my scaling plot here if it helps. http://dl.getdropbox.com/u/118481/nehalem_scaling.jpg To remove most possible confounding factors this particular Nehalem plot is produced with the following settings: Hyperthreading OFF 24GB memory i.e. 6 banks of 4GB. i.e. optimum memory configuration X5550 Even if we explained away the bizarre performance of the 4 core case as the Turbo effect, what is most confusing is how the 8 core data point could be so much slower than the corresponding 8 core point on an old AMD Barcelona. Something's wrong here that I just do not understand. BTW, any other VASP users here? Anybody have any Nehalem experience? -- Rahul From h-bugge at online.no Tue Aug 11 00:43:03 2009 From: h-bugge at online.no (=?ISO-8859-1?Q?H=E5kon_Bugge?=) Date: Tue, 11 Aug 2009 09:43:03 +0200 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: <35AAF1E4A771E142979F27B51793A48885F2481BA4@AVEXMB1.qlogic.org> References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> <35AAF1E4A771E142979F27B51793A48885F2481BA4@AVEXMB1.qlogic.org> Message-ID: <72371DA9-DC23-4F99-A6FE-F3BC9854041E@online.no> On Aug 10, 2009, at 23:07, Tom Elken wrote: > Summary: > IBM, SGI and Platform have some comparisons on clusters with "SMT > On" of running 1 rank for every core compared to running 2 ranks on > every core. In general, on low core-counts, like up to 32 there is > about an 8% advantage for running 2 ranks per core. At larger core > counts, IBM published a pair of results on 64 cores where the 64- > rank performance was equal to the 128-rank performance. Not all of > these applications scale linearly, so on some of them you lose > efficiency at 128 ranks compared to 64 ranks. > > Details: Results from this year are mostly on Nehalem: > http://www.spec.org/mpi2007/results/res2009q3/ (IBM) > http://www.spec.org/mpi2007/results/res2009q2/ (Platform) > http://www.spec.org/mpi2007/results/res2009q1/ (SGI) > (Intel has results with Turbo mode turned on and off > in the q2 and q3 results, for a different comparison) > > Or you can pick out the Xeon 'X5570' and 'X5560' results from the > list of all results: > http://www.spec.org/mpi2007/results/mpi2007.html > > In the result index, when > " Compute Threads Enabled" = 2x "Compute Cores Enabled", then you > know SMT is turned on. > In these cases, you can then check that when > " MPI Ranks" = " Compute Threads Enabled" then you are running 2 > ranks per core. Tom, Thanks for the neatly compiled information above. I can just add that I have conducted a fairly detailed analysis of Nehalem compared to Harpertown in my paper "An evaluation of Intel's Core i7 architecture using a comparative approach", presented at ISC'09. Here, I look at different aspects of the memory hierarchy of the two processors. The benefits from hyperthreading on the said 13 SPEC MPI2007 applications are also studied, although using only a single node, where the advantage is more pronounced. Thanks, Håkon -------------- next part -------------- An HTML attachment was scrubbed... URL: From deadline at eadline.org Tue Aug 11 06:04:27 2009 From: deadline at eadline.org (Douglas Eadline) Date: Tue, 11 Aug 2009 09:04:27 -0400 (EDT) Subject: [Beowulf] The True Cost of HPC Cluster Ownership Message-ID: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> All, I posted this on ClusterMonkey the other week.
It is actually derived from a white paper I wrote for SiCortex. I'm sure those on this list have some experience/opinions with these issues (and other cluster issues!) The True Cost of HPC Cluster Ownership http://www.clustermonkey.net//content/view/262/1/ -- Doug From Daniel.Pfenniger at unige.ch Tue Aug 11 07:38:19 2009 From: Daniel.Pfenniger at unige.ch (Daniel Pfenniger) Date: Tue, 11 Aug 2009 16:38:19 +0200 Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> Message-ID: <4A81825B.4000303@unige.ch> Douglas Eadline wrote: > All, > > I posted this on ClusterMonkey the other week. > It is actually derived from a white paper I wrote for > SiCortex. I'm sure those on this list have some > experience/opinions with these issues (and other > cluster issues!) > > The True Cost of HPC Cluster Ownership > > http://www.clustermonkey.net//content/view/262/1/ > This article sounds unbalanced and self-serving. While it is clear that self-made clusters imply added new costs compared with turn-key clusters, they also empower the buyer, through standard and open solutions, with increased independence from the vendor, and they also build up the buyer's knowledge for future choices. This aspect is hard to measure in monetary terms, but certainly very important for some users. I have experienced all kinds of clusters (turn-key, mostly self-assembled, and partly vendor assembled and tested), and my conclusion is that the best is when the user has at least the choice to determine the degree of vendor integration/lock-in.
Bad choices occur > because people are badly informed, and the article is so biased > that it doesn't improve objective information on this regard, just > serves as increasing fear and doubt. In our experiences over the last two clusters, one was delivered n a bunch of flat boxes andwe spent several weeks racking, stacking, cabling, loading, testing, tweaking and then releasing. Or, was that months. In our more recent cluster, we took delivery of a purported tested system, then lived through 2+ weeks of vendor cabling (pretty, if an extended time to achieve), hardware failures, replacements, BIOS upgrades (wholesale; shouldn't have been needed), and more hardware failures. Next time, I want to either get a complete turnkey system or buy from the various sources and just do it myself, knowing the pitfalls. gerry -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From dnlombar at ichips.intel.com Tue Aug 11 08:40:56 2009 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Tue, 11 Aug 2009 08:40:56 -0700 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: <20090811154056.GA6987@nlxdcldnl2.cl.intel.com> On Mon, Aug 10, 2009 at 01:02:51PM -0700, Rahul Nabar wrote: > On Mon, Aug 10, 2009 at 2:09 PM, Joshua Baker-LePain wrote: > > Well, as there are only 8 "real" cores, running a computationally intensive > > process across 16 should *definitely* do worse than across 8. Some workloads will benefit materially from SMT, some are neutral, and some will degrade. For those that degrade, simply not oversubscribing the physical cores will get best performance. > > However, it's > > not so surprising that you're seeing peak performance with 2-4 threads. > > ?Nehalem can actually overclock itself when only some of the cores are busy > > -- it's called Turbo Mode. ?That *could* be what you're seeing. > > That could very well be it! Is there any way to test if the CPU has > overclocked itself? There's an application note on the subect at: Be aware this document is very technical, talking about MSRs & performance counters. > Or can I turn the "turbo mode" off and check? That would work, but... Alternately, take a look at -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From landman at scalableinformatics.com Tue Aug 11 09:16:49 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 11 Aug 2009 12:16:49 -0400 Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: <4A818DF4.8040807@tamu.edu> References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> <4A81825B.4000303@unige.ch> <4A818DF4.8040807@tamu.edu> Message-ID: <4A819971.9060905@scalableinformatics.com> Gerry Creager wrote: > Daniel Pfenniger wrote: >> Douglas Eadline wrote: [...] >> This article sounds unbalanced and self-serving. > > I thought it read a bit like a chronicle of my recent experiences. I think that this article is fine, not unbalanced. What I like to point out to customers and partners is There is a cost to *EVERYTHING* Heinlein called it TANSTAAFL. Every single decision you make carries with it a set of costs. 
What purchasing agents, looking at the absolute rock bottom prices do not seem to grasp, is that those costs can *easily* swamp any purported gains from a lower price, and raise the actual landed price, due to expending valuable resource time (Gerry et al) for months on end working to solve problems that *should* have been solved previously. There is a cost to going cheap. This cost is time, and loss of productivity. If your time (your students time) is free, and you don't need to pay for consequences (loss of grants, loss of revenue, loss of productivity, ...) in delayed delivery of results from computing or storage systems, then, by all means, roll these things yourself, and deal with the myriad of debugging issues in making the complex beasts actually work. You have hardware stack issues, software stack issues, interaction issues, ... What I am saying is that Doug is onto something here. It ain't easy. Doug simply expressed that it isn't. As for the article being self serving? I dunno, I don't think so. Doug runs a consultancy called Basement Supercomputing that provides services for such folks. I didn't see overt advertisements, or even, really, covert "hire us" messages. I think this was fine as a white paper, and Doug did note that it started life as one. My $0.02 -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From tjrc at sanger.ac.uk Tue Aug 11 09:32:53 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Tue, 11 Aug 2009 17:32:53 +0100 Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: <4A81825B.4000303@unige.ch> References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> <4A81825B.4000303@unige.ch> Message-ID: <74DEA3F5-CFF1-4D66-8C33-046CF8114732@sanger.ac.uk> On 11 Aug 2009, at 3:38 pm, Daniel Pfenniger wrote: > Douglas Eadline wrote: >> All, >> I posted this on ClusterMonkey the other week. >> It is actually derived from a white paper I wrote for >> SiCortex. I'm sure those on this list have some >> experience/opinions with these issues (and other >> cluster issues!) >> The True Cost of HPC Cluster Ownership >> http://www.clustermonkey.net//content/view/262/1/ > > This article sounds unbalanced and self-serving. > > While it I clear that self-made clusters imply added new costs > in regard of turn-key clusters, they also empower the buyer > using standard and open solutions by an increased independence from > the vendor, and increases also its knowledge for future choices. > This aspect is hard to measure in monetary terms, but certainly very > important for some users. I agree. Some of the biggest IT problems I've encountered have been a direct result of vendor lock-in. Companies get bought, and products crushed. The wind changes direction, and products get dropped, side- lined, changed more or less on a whim. Rash promises made by vendors which never come true. In our case it was the dismembering of DEC during its various acquisitions which hurt us, and that saga contains examples of most of the above. And then the customer has to start again, which can be enormously expensive in terms of researching new ways to go, and migrating services. Just one part of that saga (the abandonment of the AdvFS filesystem) cost us more than six months of continuous work to get past, just copying the data onto something else. 
That was years ago; with the petabytes of data we have now, it would be even worse. Once bitten, twice shy. If you've made the investment in house to have a vendor-agnostic setup, which we now have, we have complete freedom to choose whatever tin vendor we like, at least as far as our compute nodes go. Our configuration, deployment and management software stack works on anything, so it's very little skin off our nose to change vendor. Regards, Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From gerry.creager at tamu.edu Tue Aug 11 10:00:52 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Tue, 11 Aug 2009 12:00:52 -0500 Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: <4A819971.9060905@scalableinformatics.com> References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> <4A81825B.4000303@unige.ch> <4A818DF4.8040807@tamu.edu> <4A819971.9060905@scalableinformatics.com> Message-ID: <4A81A3C4.7000901@tamu.edu> +1 Joe Landman wrote: > Gerry Creager wrote: >> Daniel Pfenniger wrote: >>> Douglas Eadline wrote: > > [...] > >>> This article sounds unbalanced and self-serving. >> >> I thought it read a bit like a chronicle of my recent experiences. > > I think that this article is fine, not unbalanced. What I like to point > out to customers and partners is > > There is a cost to *EVERYTHING* > > Heinlein called it TANSTAAFL. Every single decision you make carries > with it a set of costs. > > What purchasing agents, looking at the absolute rock bottom prices do > not seem to grasp, is that those costs can *easily* swamp any purported > gains from a lower price, and raise the actual landed price, due to > expending valuable resource time (Gerry et al) for months on end working > to solve problems that *should* have been solved previously. > > There is a cost to going cheap. This cost is time, and loss of > productivity. If your time (your students time) is free, and you don't > need to pay for consequences (loss of grants, loss of revenue, loss of > productivity, ...) in delayed delivery of results from computing or > storage systems, then, by all means, roll these things yourself, and > deal with the myriad of debugging issues in making the complex beasts > actually work. You have hardware stack issues, software stack issues, > interaction issues, ... > > What I am saying is that Doug is onto something here. It ain't easy. > Doug simply expressed that it isn't. > > As for the article being self serving? I dunno, I don't think so. Doug > runs a consultancy called Basement Supercomputing that provides services > for such folks. I didn't see overt advertisements, or even, really, > covert "hire us" messages. I think this was fine as a white paper, and > Doug did note that it started life as one. > > My $0.02 > -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From bill at cse.ucdavis.edu Tue Aug 11 10:06:32 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Tue, 11 Aug 2009 10:06:32 -0700 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: Message-ID: <4A81A518.2030805@cse.ucdavis.edu> Rahul Nabar wrote: > Exactly! 
But I thought this was the big advance with the Nehalem that > it has removed the CPU<->Cache<->RAM bottleneck. Not sure I'd say removed, but they have made a huge improvement. To the point where a single socket intel is better than a dual socket barcelona. > So if the code scaled > with the AMD Barcelona then it would continue to scale with the > Nehalem right? That is a gross oversimplification. Sure, with a microbenchmark testing only memory bandwidth that wouldn't be a terrible approximation. Something like VASP is far from a simple microbenchmark. > I'm posting a copy of my scaling plot here if it helps. > > http://dl.getdropbox.com/u/118481/nehalem_scaling.jpg Looks to me like you fit in the barcelona 512KB L2 cache (and get good scaling) and do not fit in the nehalem 256KB L2 cache (and get poor scaling). Were the binaries compiled specifically to target both architectures? As a first guess I suggest trying pathscale (RIP) or open64 for amd, and intel's compiler for intel. But portland group does a good job at both in most cases. > Hyperthreading OFF > 24GB memory i.e. 6 banks of 4GB. i.e. optimum memory configuration > X5550 I'm curious about the hyperthreading-on data point as well. > Even if we explained away the bizarre performance of the 4 core case > as the Turbo effect, what is most confusing is how the 8 core data > point could be so much slower than the corresponding 8 core point on an > old AMD Barcelona. A doubling of the can have that effect. The Intel L3 can not come anywhere close to feeding 4 cores running flat out. From kus at free.net Tue Aug 11 10:19:08 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Tue, 11 Aug 2009 21:19:08 +0400 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: Message-ID: In message from Rahul Nabar (Sun, 9 Aug 2009 22:42:25 -0500): >(a) I am seeing strange scaling behaviours with Nehalem cores. eg A >specific DFT (Density Functional Theory) code we use is maxing out >performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are >actually slower than 2 and 4 cores (depending on setup) If these results are for HyperThreading "ON", it may not be too strange because of "virtual cores" competition. But if these results are with Hyperthreading switched off - it's strange. I usually have good DFT scaling w/number of cores on G03 - about 7 times for 8 cores. Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow From Daniel.Pfenniger at unige.ch Tue Aug 11 10:28:22 2009 From: Daniel.Pfenniger at unige.ch (Daniel Pfenniger) Date: Tue, 11 Aug 2009 19:28:22 +0200 Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: <4A819971.9060905@scalableinformatics.com> References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> <4A81825B.4000303@unige.ch> <4A818DF4.8040807@tamu.edu> <4A819971.9060905@scalableinformatics.com> Message-ID: <4A81AA36.8050503@unige.ch> Joe Landman wrote: > Gerry Creager wrote: >> Daniel Pfenniger wrote: >>> Douglas Eadline wrote: > > [...] > >>> This article sounds unbalanced and self-serving. >> >> I thought it read a bit like a chronicle of my recent experiences. Mine were not so bad, so I found the tone too pessimistic. > I think that this article is fine, not unbalanced. What I like to point > out to customers and partners is > > There is a cost to *EVERYTHING* Well, not really surprising. The point is to be quantitative, not subjective (fear, etc.).
Each solution has a cost and alert people will choose the best one for them, not for the vendor. If many people choose IKEA furniture over traditional vendors it is because the cost differential is favourable for them, even taking all the overheads into account. When commodity clusters came in the 90's the gain was easily a factor 10 at purchase. In my case the maintenance and licenses costs of turn-key locked-in hardware added 20-25% of purchase cost every year. With such a high cost we could have hired an engineer full-time instead, but it was not possible because of the locked-in nature of such machines. The self-made solution was clearly the best. Today one finds intermediate solutions where the hardware is composed of compatible elements, and the software is open source. Some vendors offer almost ready to run and tested hardware for a reasonable margin, adding less than a factor 2 to the original hardware cost, without horrendous maintenance fee and restrictive license. The locked-in effect is low, yet not completely zero. This is probably the best solution for many budget-conscious users. > > Heinlein called it TANSTAAFL. Every single decision you make carries > with it a set of costs. > > What purchasing agents, looking at the absolute rock bottom prices do > not seem to grasp, is that those costs can *easily* swamp any purported > gains from a lower price, and raise the actual landed price, due to > expending valuable resource time (Gerry et al) for months on end working > to solve problems that *should* have been solved previously. > > There is a cost to going cheap. This cost is time, and loss of > productivity. If your time (your students time) is free, and you don't > need to pay for consequences (loss of grants, loss of revenue, loss of > productivity, ...) in delayed delivery of results from computing or > storage systems, then, by all means, roll these things yourself, and > deal with the myriad of debugging issues in making the complex beasts > actually work. You have hardware stack issues, software stack issues, > interaction issues, ... You forget to mention that turn-key locked-in systems in my experience entail inefficiency costs because the user cannot decide what to do when completely ignoring what is going on. Many problems may be solved in minutes when the user controls the cluster, but may need days or weeks for fixes from the vendor. A balanced presentation should weight all the aspects of running a cluster. > > What I am saying is that Doug is onto something here. It ain't easy. > Doug simply expressed that it isn't. > As for the article being self serving? I dunno, I don't think so. Doug > runs a consultancy called Basement Supercomputing that provides services > for such folks. I didn't see overt advertisements, or even, really, > covert "hire us" messages. I think this was fine as a white paper, and > Doug did note that it started life as one. You may have noticed that this article was originally written on demand of SiCortex... Dan From Craig.Tierney at noaa.gov Tue Aug 11 10:40:03 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Tue, 11 Aug 2009 11:40:03 -0600 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: Message-ID: <4A81ACF3.60802@noaa.gov> Rahul Nabar wrote: > On Mon, Aug 10, 2009 at 12:48 PM, Bruno Coutinho wrote: >> This is often caused by cache competition or memory bandwidth saturation. >> If it was cache competition, rising from 4 to 6 threads would make it worse. 
>> As the code became faster with DDR3-1600 and much slower with Xeon 5400, >> this code is memory bandwidth bound. >> Tweaking CPU affinity to avoid thread jumping among cores of the will not >> help much, as the big bottleneck is memory bandwidth. >> To this code, CPU affinity will only help in NUMA machines to maintain >> memory access in local memory. >> >> >> If the machine has enough bandwidth to feed the cores, it will scale. > > Exactly! But I thought this was the big advance with the Nehalem that > it has removed the CPU<->Cache<->RAM bottleneck. So if the code scaled > with the AMD Barcelona then it would continue to scale with the > Nehalem right? > > I'm posting a copy of my scaling plot here if it helps. > > http://dl.getdropbox.com/u/118481/nehalem_scaling.jpg > > To remove most possible confounding factors this particular Nehlem > plot is produced with the following settings: > > Hyperthreading OFF > 24GB memory i.e. 6 banks of 4GB. i.e. optimum memory configuration > X5550 > > Even if we explained away the bizzare performance of the 4 node case > to the Turbo effect what is most confusing is how the 8 core data > point could be so much slower than the corresponding 8 core point on a > old AMD Barcelona. > > Something's wrong here that I just do not understand. BTW, any other > VASP users here? Anybody have any Nehalem experience? > Rahul, What are you doing to ensure that you have both memory and processor affinity enabled? Craig > -- > Rahul > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Craig Tierney (craig.tierney at noaa.gov) From landman at scalableinformatics.com Tue Aug 11 11:01:37 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 11 Aug 2009 14:01:37 -0400 Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: <4A81AA36.8050503@unige.ch> References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> <4A81825B.4000303@unige.ch> <4A818DF4.8040807@tamu.edu> <4A819971.9060905@scalableinformatics.com> <4A81AA36.8050503@unige.ch> Message-ID: <4A81B201.9000206@scalableinformatics.com> Daniel Pfenniger wrote: >> There is a cost to *EVERYTHING* > > Well, not really surprising. The point is to be quantitative, > not subjective (fear, etc.). Each solution has a cost and alert > people will choose the best one for them, not for the vendor. Sadly, not always (choosing the best one for them). *Many* times the solution is dictated to them via some group with an agreement with some vendor. Decisions about which one are best are often seconded behind which brand to select. I've had too many conversations that went "we agree your solution is better but we can't buy it because you aren't brand X". Which is not a good reason for selection or omission of a vendor. > If many people choose IKEA furniture over traditional vendors > it is because the cost differential is favourable for them, > even taking all the overheads into account. Agreed. But furniture is not a computer (though I guess it could be ...) > When commodity clusters came in the 90's the gain was easily a > factor 10 at purchase. In my case the maintenance and licenses > costs of turn-key locked-in hardware added 20-25% of purchase > cost every year. 
With such a high cost we could have hired an > engineer full-time instead, but it was not possible because of > the locked-in nature of such machines. The self-made solution > was clearly the best. For some users, this is the right route. For Guy Coates and his team, for you, and a number of others. I agree it can be good. But there are far too many people that think a cluster is a pile-o-PCs + a cheap switch + a cluster distro. Its the "how do I make it work when it fails" aspect we tend to see people worrying online about. I am arguing for commodity systems. But some gear is just plain junk. Not all switches are created equal. Some inexpensive switches do a far better job than some of the expensive ones. Some brand name machines are wholly inappropriate as compute nodes, yet they are used. A big part of this process is making reasonable selections. Understanding the issues with all of these, understanding the interplay. I am not arguing for vendor locking (believe it or not). I simply argue for sane choices. > Today one finds intermediate solutions where the hardware is > composed of compatible elements, and the software is open source. > Some vendors offer almost ready to run and tested hardware for > a reasonable margin, adding less than a factor 2 to the original > hardware cost, without horrendous maintenance fee and restrictive > license. The locked-in effect is low, yet not completely zero. > This is probably the best solution for many budget-conscious > users. Yes. This is what we stress. We unfortunately have run into purchasing groups that like to try to save a buck, and will buy almost-but-not-quite-the-same-thing for the clusters we have put together, which makes it very hard to pre-build, and pre-test. Worse, when we see what they have purchased, and see that it really didn't come close to the spec we used, well .... I fail to see how being required to purchase the right thing after purchasing the wrong thing that you can't return, saves you money. We have had this happen too many times. >> Heinlein called it TANSTAAFL. Every single decision you make carries >> with it a set of costs. >> >> What purchasing agents, looking at the absolute rock bottom prices do >> not seem to grasp, is that those costs can *easily* swamp any >> purported gains from a lower price, and raise the actual landed price, >> due to expending valuable resource time (Gerry et al) for months on >> end working to solve problems that *should* have been solved previously. >> >> There is a cost to going cheap. This cost is time, and loss of >> productivity. If your time (your students time) is free, and you >> don't need to pay for consequences (loss of grants, loss of revenue, >> loss of productivity, ...) in delayed delivery of results from >> computing or storage systems, then, by all means, roll these things >> yourself, and deal with the myriad of debugging issues in making the >> complex beasts actually work. You have hardware stack issues, >> software stack issues, interaction issues, ... > > You forget to mention that turn-key locked-in systems in my experience > entail > inefficiency costs because the user cannot decide what to do when I can't mention your experience as I don't have a clue as to what you have experienced. Vendor lock in is IMO not a great thing. It increases costs, makes systems more expensive to support, reduces choices later on. Yet, we run head first into vendor lock-in in many purchasing departments. They prefer buying from one vendor with whom they have struck agreements. 
Which don't work to their benefit, but do for the vendors. > completely ignoring what is going on. Many problems may be solved in > minutes when the user controls the cluster, but may need days > or weeks for fixes from the vendor. A balanced presentation should > weight all the aspects of running a cluster. Yes. Doug's presentation did show you one aspect, and if you want more to "balance" the joy of clustered systems, certainly, his work can be expanded and amplified upon. > >> >> What I am saying is that Doug is onto something here. It ain't easy. >> Doug simply expressed that it isn't. > >> As for the article being self serving? I dunno, I don't think so. >> Doug runs a consultancy called Basement Supercomputing that provides >> services for such folks. I didn't see overt advertisements, or even, >> really, covert "hire us" messages. I think this was fine as a white >> paper, and Doug did note that it started life as one. > > You may have noticed that this article was originally written on demand > of SiCortex... That wasn't lost on me :(. Actually one of the things we are actively talking about relative to our high performance storage is "Freedom from bricking". If a theoretical bus hits the company a day after you get your boxes from us, our units are still supportable, and you can pay another organization to support them. We aren't aware of other vendors doing what we are doing that could (honestly) make such a claim. Even the ones that use the (marketing) label of "Open storage solutions". Yeah. Open. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From kus at free.net Tue Aug 11 11:12:32 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Tue, 11 Aug 2009 22:12:32 +0400 Subject: [Beowulf] numactl & SuSE11.1 In-Reply-To: Message-ID: It's interesting, that for this hard&software configuration disabling of NUMA in BIOS gives more high STREAM results in comparison w/"NUMA enabled". I.e. for NUMA "off": 8723/8232/10388/10317 MB/s for NUMA "on": 5620/5217/6795/6767 MB/s (both for OMP_NUM_THREADS=1 and ifort 11.1 compiler). The situation for Opteron's is opposite: NUMA mode gives more high throughput. In message from "Mikhail Kuzminsky" (Mon, 10 Aug 2009 21:43:56 +0400): >I'm sorry for my mistake: >the problem is on Nehalem Xeon under SuSE -11.1, but w/kernel >2.6.27.7-9 (w/Supermicro X8DT mobo). For Opteron 2350 w/SuSE 10.3 (w/ >more old 2.6.22.5-31 -I erroneously inserted this string in my >previous message) numactl works OK (w/Tyan mobo). > >NUMA is enabled in BIOS. Of course, CONFIG_NUMA (and CONFIG_NUMA_EMU) >are setted to "y" in both kernels. > >Unfortunately I (i.e. root) can't change files in >/sys/devices/system/node (or rename directory node2 to node1) :-( - as >it's possible w/some files in /proc filesystem. It's interesting, that >extraction from dmesg show, that IT WAS NODE1, but then node2 is >appear ! 
> >ACPI: SRAT BF79A4B0, 0150 (r1 041409 OEMSRAT 1 INTL 1) >ACPI: SSDT BF79FAC0, 249F (r1 DpgPmm CpuPm 12 INTL 20051117) >ACPI: Local APIC address 0xfee00000 >SRAT: PXM 0 -> APIC 0 -> Node 0 >SRAT: PXM 0 -> APIC 2 -> Node 0 >SRAT: PXM 0 -> APIC 4 -> Node 0 >SRAT: PXM 0 -> APIC 6 -> Node 0 >SRAT: PXM 1 -> APIC 16 -> Node 1 >SRAT: PXM 1 -> APIC 18 -> Node 1 >SRAT: PXM 1 -> APIC 20 -> Node 1 >SRAT: PXM 1 -> APIC 22 -> Node 1 >SRAT: Node 0 PXM 0 0-a0000 >SRAT: Node 0 PXM 0 100000-c0000000 >SRAT: Node 0 PXM 0 100000000-1c0000000 >SRAT: Node 2 PXM 257 1c0000000-340000000 >(here !!) > >NUMA: Allocated memnodemap from 1c000 - 22880 >NUMA: Using 20 for the hash shift. >Bootmem setup node 0 0000000000000000-00000001c0000000 > NODE_DATA [0000000000022880 - 000000000003a87f] > bootmap [000000000003b000 - 0000000000072fff] pages 38 >(8 early reservations) ==> bootmem [0000000000 - 01c0000000] > #0 [0000000000 - 0000001000] BIOS data page ==> [0000000000 - >0000001000] > #1 [0000006000 - 0000008000] TRAMPOLINE ==> [0000006000 - >0000008000] > #2 [0000200000 - 0000bf27b8] TEXT DATA BSS ==> [0000200000 - >0000bf27b8] > #3 [0037a3b000 - 0037fef104] RAMDISK ==> [0037a3b000 - >0037fef104] > #4 [000009cc00 - 0000100000] BIOS reserved ==> [000009cc00 - >0000100000] > #5 [0000010000 - 0000013000] PGTABLE ==> [0000010000 - >0000013000] > #6 [0000013000 - 000001c000] PGTABLE ==> [0000013000 - >000001c000] > #7 [000001c000 - 0000022880] MEMNODEMAP ==> [000001c000 - >0000022880] >Bootmem setup node 2 00000001c0000000-0000000340000000 > NODE_DATA [00000001c0000000 - 00000001c0017fff] > bootmap [00000001c0018000 - 00000001c0047fff] pages 30 >(8 early reservations) ==> bootmem [01c0000000 - 0340000000] > #0 [0000000000 - 0000001000] BIOS data page > #1 [0000006000 - 0000008000] TRAMPOLINE > #2 [0000200000 - 0000bf27b8] TEXT DATA BSS > #3 [0037a3b000 - 0037fef104] RAMDISK > #4 [000009cc00 - 0000100000] BIOS reserved > #5 [0000010000 - 0000013000] PGTABLE > #6 [0000013000 - 000001c000] PGTABLE > #7 [000001c000 - 0000022880] MEMNODEMAP >found SMP MP-table at [ffff8800000ff780] 000ff780 > [ffffe20000000000-ffffe20006ffffff] PMD -> >[ffff880028200000-ffff88002e1fffff] on node 0 > [ffffe20007000000-ffffe2000cffffff] PMD -> >[ffff8801c0200000-ffff8801c61fffff] on node 2 Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow From mathog at caltech.edu Tue Aug 11 11:46:05 2009 From: mathog at caltech.edu (David Mathog) Date: Tue, 11 Aug 2009 11:46:05 -0700 Subject: [Beowulf] The True Cost of HPC Cluster Ownership Message-ID: Joe Landman wrote: > I am arguing for commodity systems. But some gear is just plain junk. > Not all switches are created equal. Some inexpensive switches do a far > better job than some of the expensive ones. Some brand name machines > are wholly inappropriate as compute nodes, yet they are used. > > A big part of this process is making reasonable selections. > Understanding the issues with all of these, understanding the interplay. A lot of this issue boils down to a lack of available information, or perhaps, the cost of obtaining the information. Consider for instance the switches you cited above. How is the average site going to decide which switch is better before purchase? On paper, going by the published specs, they will often look identical. The two companies may be equally reputable. Still, one device may be a piece of junk and the other best in class. On very rare occasions there will be an independent review available. 
Only a large site is likely to have the resources to obtain samples of each switch and test them extensively. The best most of us can do is ask around if "switch XYZ is OK" before making the leap. With compute nodes performance information is more readily available, often in reviews, but again, rarely any reliability information. And we have all seen models which crunch nicely but have innate reliability problems that don't turn up in a 3 day review, and then bite hard during continuous use. Again, large sites can obtain a test unit and beat on it for a few months, but small sites usually cannot. At least in this case knowledge does build up over time in the community, so if one waits for a machine to be in the field for a year, it may be possible to ask around and find out if it is a good idea to buy some. (But don't wait too long, the sales life for computer models is not very long!) For this reason, unless a site is very well funded, buying cutting edge compute nodes is a rather large gamble. If the resources to run these tests isn't present in house, one may essentially buy the expertise by paying enough to a reputable company to run the tests. Either way, knowing costs money. Ideally there would be accepted standards for testing performance and reliability of each class of equipment, and the manufacturers would run these tests themselves, or farm it out to neutral entities, and then publish this information. It would certainly be a compelling sales tool, at least from my perspective. In practice, it usually seems like the manufacturers spend more time hiding equipment defects than they do in proving and publishing its strengths. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From rpnabar at gmail.com Tue Aug 11 11:57:14 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 11 Aug 2009 13:57:14 -0500 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A81ACF3.60802@noaa.gov> References: <4A81ACF3.60802@noaa.gov> Message-ID: On Tue, Aug 11, 2009 at 12:40 PM, Craig Tierney wrote: > What are you doing to ensure that you have both memory and processor > affinity enabled? > All I was using now was the flag: --mca mpi_paffinity_alone 1 Is there anything else I ought to be doing as well? -- Rahul From rpnabar at gmail.com Tue Aug 11 12:04:34 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 11 Aug 2009 14:04:34 -0500 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A81A518.2030805@cse.ucdavis.edu> References: <4A81A518.2030805@cse.ucdavis.edu> Message-ID: On Tue, Aug 11, 2009 at 12:06 PM, Bill Broadley wrote: > Looks to me like you fit in the barcelona 512KB L2 cache (and get good > scaling) and do not fit in the nehalem 256KB L2 cache (and get poor scaling). Thanks Bill! I never realized that the L2 cache of the Nehalem is actually smaller than that of the Barcelona! I have an E5520 and a X5550. Both have the 8 MB L3 cache I believe. THe size of the L2 cache is fixed across the steppings of the Nehlem isn't it? > Were the binaries compiled specifically to target both architectures? ?As a > first guess I suggest trying pathscale (RIP) or open64 for amd, and intel's > compiler for intel. ?But portland group does a good job at both in most cases. We used the intel compilers. One of my fellow grad students did the actual compilation for VASP but I believe he used the "correct" [sic] flags to the best of our knowledge. I could post them on the list perhaps. 
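(Purely as an illustration of the kind of flags in question - not the actual build line, which I'd have to dig out - targeting Nehalem with the Intel 11.x compilers usually comes down to something like:

ifort -O3 -xSSE4.2 source.f90    # generate SSE4.2 code, i.e. a Nehalem target
ifort -O3 -xHost   source.f90    # or simply tune for whatever the build host is

so "correct flags" here mostly means picking the right -x target for each architecture.)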
There was no cross-compilation. We compiled a fresh binary for the Nehalem. > I"m curious about the hyperthreading on data point as well. Didn't test for VASP yet but for our other two DFT codes i.e. DACAPO and GPAW hyperthreading "off" seems to be about 10% faster. > A doubling of the can have that effect. ?The Intel L3 can no come anywhere > close to feeding 4 cores running flat out. Could you explain this more? I am a little lost with the processor dynamics. Does this mean using a quad core for HPC on the Nehlem is not likely to work well for scaling? Or do you imply a solution so that I could fix this somehow? Thanks again! -- Rahul From Craig.Tierney at noaa.gov Tue Aug 11 13:03:51 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Tue, 11 Aug 2009 14:03:51 -0600 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A81ACF3.60802@noaa.gov> Message-ID: <4A81CEA7.3030802@noaa.gov> Rahul Nabar wrote: > On Tue, Aug 11, 2009 at 12:40 PM, Craig Tierney wrote: >> What are you doing to ensure that you have both memory and processor >> affinity enabled? >> > > All I was using now was the flag: > > --mca mpi_paffinity_alone 1 > > Is there anything else I ought to be doing as well? > That should be adequate. Craig -- Craig Tierney (craig.tierney at noaa.gov) From rpnabar at gmail.com Tue Aug 11 15:24:04 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 11 Aug 2009 17:24:04 -0500 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: Message-ID: On Tue, Aug 11, 2009 at 12:19 PM, Mikhail Kuzminsky wrote: > If this results are for HyperThreading "ON", it may be not too strange > because of "virtual cores" competition. > > But if this results are for switched off Hyperthreading - it's strange. > I have usual good DFT scaling w/number of cores on G03 - about in 7 times > for 8 cores. Yes, it is very strange and I still cannot explain it very well. Do you have scaling info for VASP? You did mention DFT codes so I was wondering. I still haven't found much info for VASP on Nehlems. Maybe it is some feature of this particular code. All the other tests make sense. And I just find it hard to believe that the Nehalem which has been so far touted to be such a good proc. can be outperformed by by one year old AMD Barcelonas.......I feel it's something I am doing wrong. -- Rahul From rpnabar at gmail.com Tue Aug 11 15:31:08 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 11 Aug 2009 17:31:08 -0500 Subject: [Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year? In-Reply-To: References: <20090406080836.GC30865@bx9.net> Message-ID: On Thu, Apr 9, 2009 at 11:35 AM, Douglas J. Trainor wrote: > Rahul, > > I think Greg et al. are correct. ?Does your SC1435 have a Delta Electronics > switching power supply? ?I bet you have a 600 watt Delta. > > Intel recently had problems with outsourced 350 watt "FHJ350WPS" switching > power supplies that apparently affected 5% of some server lines. ?These were > loading imbalance problems between the 3.3 volt and 12 volt lines. ?The > affected power supplies had a minimum loading requirement that was not met. > ?The over-voltage protection circuit would kick in on the 3.3V line. > ?However, in these cases, the Intel machines would not reboot. ?Intel is > modifying the 3.3 volt minimum loading from 1.2 amps to 0.2 amps to fix the > problem. A while ago I had posted about these crashing SC1435's that I had. 
I received lots of good suggestions on this group. Thanks all! A lot of persistence with the vendor succeed in making their Engineering team do long-run tests on one of our captured machines. It needed to be tested for over one month and then they finally replicated the failure. Whew! (In the past they had aborted tests way before this time period) They won't give me many internal details but apparantly it is caused by an "hardware issue more likely caused certain motherboards with Opterons" [sic] So, thank again and it does seem that we finally got down to the cause of this irritating problem! Just posted this in case it helps any other SC1435 admins in a similar boat! Cheers! -- Rahul From coutinho at dcc.ufmg.br Tue Aug 11 15:57:11 2009 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Tue, 11 Aug 2009 19:57:11 -0300 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A81A518.2030805@cse.ucdavis.edu> Message-ID: 2009/8/11 Rahul Nabar > On Tue, Aug 11, 2009 at 12:06 PM, Bill Broadley > wrote: > > Looks to me like you fit in the barcelona 512KB L2 cache (and get good > > scaling) and do not fit in the nehalem 256KB L2 cache (and get poor > scaling). > > Thanks Bill! I never realized that the L2 cache of the Nehalem is > actually smaller than that of the Barcelona! > > I have an E5520 and a X5550. Both have the 8 MB L3 cache I believe. > THe size of the L2 cache is fixed across the steppings of the Nehlem > isn't it? I think that probably it only will be fixed on newer models or only in Westmere (Nehalem shrink to 32nm). > > > > Were the binaries compiled specifically to target both architectures? As > a > > first guess I suggest trying pathscale (RIP) or open64 for amd, and > intel's > > compiler for intel. But portland group does a good job at both in most > cases. > > We used the intel compilers. One of my fellow grad students did the > actual compilation for VASP but I believe he used the "correct" [sic] > flags to the best of our knowledge. I could post them on the list > perhaps. There was no cross-compilation. We compiled a fresh binary > for the Nehalem. > > > I"m curious about the hyperthreading on data point as well. > > Didn't test for VASP yet but for our other two DFT codes i.e. DACAPO > and GPAW hyperthreading "off" seems to be about 10% faster. > > > > A doubling of the can have that effect. The Intel L3 can no come > anywhere > > close to feeding 4 cores running flat out. > > Could you explain this more? I am a little lost with the processor > dynamics. Does this mean using a quad core for HPC on the Nehlem is > not likely to work well for scaling? Or do you imply a solution so > that I could fix this somehow? > Nehalem and Barcelona have the following cache architecture: L1 cache: 64KB (32kb data, 32kb instruction), per core L2 cache: Barcelona :512kb, Nehalem: 256kb, per core L3 cache: Barcelona: 2MB, Nehalem: 8MB , shared among all cores. Both in Barcelona and Nehalem, the "uncore" (everything outside a core, like L3 and memory controllers) runs at lower speed than the cores and all cores communicate through L3, so it must handle some coherence signals too. This makes impossible to L3 feed all cores at full speed if L2 caches have big miss ratios. So, what is happening with your program is something like: Working set fits Barcelona 512kb L2 cache, so it has 10% miss rate, but is doesn't fits Nehalem 256km L2 cache, so it has 50% miss rate. 
So in Nehalem the shared L3 cache has to handle many more requests from all cores than in Barcelona, and it becomes a big bottleneck. -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpnabar at gmail.com Tue Aug 11 16:07:40 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 11 Aug 2009 18:07:40 -0500 Subject: Re: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A81A518.2030805@cse.ucdavis.edu> Message-ID: On Tue, Aug 11, 2009 at 5:57 PM, Bruno Coutinho wrote: > Nehalem and Barcelona have the following cache architecture: > > L1 cache: 64KB (32KB data, 32KB instruction), per core > L2 cache: Barcelona: 512KB, Nehalem: 256KB, per core > L3 cache: Barcelona: 2MB, Nehalem: 8MB, shared among all cores. > > Both in Barcelona and Nehalem, the "uncore" (everything outside a core, like > the L3 and the memory controllers) runs at a lower speed than the cores, and all cores > communicate through the L3, so it also has to handle coherence traffic. > This makes it impossible for the L3 to feed all cores at full speed if the L2 caches have > big miss ratios. > > So, what is happening with your program is something like this: > > the working set fits the Barcelona 512KB L2 cache, so it has a 10% miss rate, > but it doesn't fit the Nehalem 256KB L2 cache, so it has a 50% miss rate. > So in Nehalem the shared L3 cache has to handle many more requests from all > cores than in Barcelona, and it becomes a big bottleneck. Thanks Bruno! That makes a lot of sense now.
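A cheap way to test the working-set explanation above before changing any hardware (a sketch only; ./bench stands in for a small single-core test case of the real code, and cachegrind simulates the cache hierarchy, so treat its miss rates as relative rather than exact numbers):

    # back-of-the-envelope first: three double-precision work arrays stay L2-resident
    # only while 3 * N * 8 bytes fits in 256 KB (Nehalem) vs 512 KB (Barcelona),
    # i.e. roughly N < 10,000 vs N < 21,000 elements per core
    valgrind --tool=cachegrind ./bench    # prints D1 and last-level miss rates at exit

The oprofile/PAPI hardware counters suggested later in the thread give the same answer on the real chip, without the large slowdown of cache simulation.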
Assuming that is what is > happening is there any way of still using the Nehalems fruitfully for > this code? Any smart tricks / hacks? You can use profilers that monitor hardware performance counters like oprofile or papi to measure miss ratios and verify if that is what is happening. But solving it is a much larger problem. :) > > > The reason is that the Nehalems seem to scale and perform beautifully > for my other codes. > > The only other option is to relapse back to the AMDs. I believe the > Shanghai would be a choice or an Instanbul. I assume the cache > structure there is as good as the Barcelona if not better! Any > experiences with these chips on the group? > > Funnily, I haven't heard of any such Nehalem (-ive) stories anywhere > else. Am I the first one to hit this cache bottleneck? I doubt it. Any > other cache heavy users? > > -- > Rahul > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rgb at phy.duke.edu Tue Aug 11 18:50:26 2009 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue, 11 Aug 2009 21:50:26 -0400 (EDT) Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: <4A819971.9060905@scalableinformatics.com> References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> <4A81825B.4000303@unige.ch> <4A818DF4.8040807@tamu.edu> <4A819971.9060905@scalableinformatics.com> Message-ID: On Tue, 11 Aug 2009, Joe Landman wrote: > There is a cost to going cheap. This cost is time, and loss of productivity. > If your time (your students time) is free, and you don't need to pay for > consequences (loss of grants, loss of revenue, loss of productivity, ...) in > delayed delivery of results from computing or storage systems, then, by all > means, roll these things yourself, and deal with the myriad of debugging > issues in making the complex beasts actually work. You have hardware stack > issues, software stack issues, interaction issues, ... Oh, damn, might as well demonstrate that I'm not dead yet. I'm getting better. Actually, I'm just getting back from Beaufort and so fish are not calling me and neither is the mountain of unpacking, so I might as well chip in. My own experiences in this regard are that one can span a veritable spectrum of outcomes from great and very cost efficient to horrible and expensive in money and time. Larger projects the odds of the latter go up as what is a small inefficiency in 16-32 systems becomes and enormous and painful one for 1024. I'll skip the actual anecdotes -- most of them are probably in the archives anyway -- and just go straight to the (IMO) conclusions. The price of your systems should scale with the number you buy. Building an 8 node starter cluster? Tiger.com $350 specials are fine. Building a professional/production cluster with 32 or more nodes? Go rackmount (already a modest premium) and start kicking in for service contracts and a few extras to help keep things smooth. Building 32 nodes with an eye on expandibility? Go tier 1 or tier 2 vendor (or a professional and experienced cluster consultant, such as Joe), with four year service, after asking on list to see if the vendor is competent and uses or builds good hardware. IBM nodes are great. Penguin nodes (in my own experience) are great. Dell nodes are "ok", sort of the low end of high end. I don't have much experience with HP in a cluster setting. And do not, not, not, get no-name nodes from a cheap online vendor unless you value pain. 
This is advice that works for anything on the DIY side -- even a 1024 node cluster can be built by just you (or you and a friend/flunky) as long as you allow a realistic amount of time to install it and debug it -- say 10-15 minutes a node in production with electric screwdrivers to rack them from delivery box to rack (after a bit of practice, and you'll get LOTS of practice:-) Call it a day per rack, so yeah, 3-4 human-weeks of FTE. PLUS at least 10 minutes of install/debugging time, on average, per node. These aren't fixed numbers -- I'm sure there are humans who can rack/derack a node in five minutes, and if you get your vendor to premount the rails then ANYBODY can rack a node in two or three minutes in production. Then there are people like me, who might edge over closer to twenty minutes, or circumstances like "oops, this back of rack screws is the wrong size, time for crazed phone calls to and overnights from the vendor" that can ruin your expected average fast. (Software) install time depends on your general competence in linux, clustering, and how much energy you expended ahead of time setting up servers to accomodate the cluster install. If you are a linux god, a cluster god, and have a thoroughly debugged e.g. kickstart server (and got the vendor to default the BIOS to "DHCP boot" on fallthrough from an naked hard drive) then you might knock the install time down to making a table entry and turning on the systems -- and debugging the ones that failed to boot and install, or (in the case of diskless systems) boot and operate. A less gifted and experienced sysadmin might have to hook up a console to each system and hand install it, but nowadays even doing this isn't very time consuming as many installs can proceed in parallel. For non-DIY clusters -- turnkey, or contract built by somebody else -- the same general principles apply. If the cluster is fairly small, you aren't horribly at risk if you get relatively inexpensive nodes, bearing in mind that you're still trading off money SOMEWHERE later to save money now. If you are getting a medium large cluster or if downtime is very expensive, don't skimp on nodes -- get nodes that have a solid vendor standing behind them, with guaranteed onsite service for 3-4 years (the expected service life of your cluster). Here, in addition, you need to be damn sure you get your turnkey cluster from somebody who is not an idiot, who knows what they are doing and can actually deliver a functional cluster no less efficiently than described above and who will stand with you through the inevitable problems that will surface installing a larger cluster. In a nutshell, the "cost of going cheap" isn't linear, with or without student/cheap labor. For small clusters installed by somebody who knows what they are doing and e.g. operated and used by the owner or the owner's lab including students, operated by departmental sysadmins with cluster experience and enough warm bodies to have some opportunity cost labor handy -- sure, go cheap -- if a node or two is DOA or fails, so what? It takes you an extra day or two to get the cluster going, but most of that time is waiting for parts -- OC time is much smaller, and everybody has other things to do while waiting. 
But as clusters get larger, the marginal cost of the differential failure rate between cheap and expensive scales up badly and can easily exceed the OC labor pool's capacity, especially if by bad luck you get a cheap node and it turns out to be a "lemon" and the faraway dot com that sold it to you refuses to fix or replace it. The turnover from cheap to much more expensive than just getting good nodes from a reputable vendor (which don't usually cost THAT much more than cheap) can happen real fast, and the time wasted can go from a few days to months equally fast. So be aware of this. It is easy to find people on this list with horror stories associated with building larger clusters with cheap nodes. With smaller clusters it isn't horrible -- it is annoying. You can often afford to throw e.g. a bad motherboard away and just buy another one and reinstall a better one in the nodes one at a time for eight or a dozen nodes. You can't do this, sanely, for 64, or 128, or 1024. One last thing to be aware of is the politics of grants. Few people out there buy nodes out of pocket. They pay for clusters using OPM (Other People's Money). In many cases it is MUCH EASIER to budget expensive nodes, with onsite service and various guarantees, up front in the initial purchase in a grant than it is to budget less money on more cheaper nodes and then budget ENOUGH money to be able to handle any possible failure contingency in the next three years of the grant cycle. Granting agencies actually might even prefer it this way (they should if they have any sense). Imagine their excitement if halfway through the computation they are funding your entire cluster blows a cheap capacitor made in Taiwan (same one on every motherboard) and your cheap vendor vanishes like dust in the wind, bankrupted like everybody else who sold motherboards with that cap on them. Now they have a choice -- buy you a NEW cluster so you can finish, or write off the whole project (and quite possibly write off you as well, for ever and ever). Conservatism may cost you a few nodes that you could have added if you went cheap, but it is INSURANCE that the nodes you get will be around to reliably complete the computation. rgb > > What I am saying is that Doug is onto something here. It ain't easy. Doug > simply expressed that it isn't. > > As for the article being self serving? I dunno, I don't think so. Doug runs > a consultancy called Basement Supercomputing that provides services for such > folks. I didn't see overt advertisements, or even, really, covert "hire us" > messages. I think this was fine as a white paper, and Doug did note that it > started life as one. > > My $0.02 > > -- > Joseph Landman, Ph.D > Founder and CEO > Scalable Informatics Inc. > email: landman at scalableinformatics.com > web : http://scalableinformatics.com > http://scalableinformatics.com/jackrabbit > phone: +1 734 786 8423 x121 > fax : +1 866 888 3112 > cell : +1 734 612 4615 > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From rpnabar at gmail.com Tue Aug 11 20:30:39 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 11 Aug 2009 22:30:39 -0500 Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: <4A819971.9060905@scalableinformatics.com> References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> <4A81825B.4000303@unige.ch> <4A818DF4.8040807@tamu.edu> <4A819971.9060905@scalableinformatics.com> Message-ID: On Tue, Aug 11, 2009 at 11:16 AM, Joe Landman wrote: > > There is a cost to going cheap. ?This cost is time, and loss of > productivity. ?If your time (your students time) is free, and you don't need > to pay for consequences (loss of grants, loss of revenue, loss of > productivity, ...) in delayed delivery of results from computing or (1) Why always consider it a "loss" of your student's time? I was one such "student" think there is enormous learning potential here. Of course, my systems never did match the uptime / performance of a "turnkey" solution but the skills learnt in setting one up are rarely gained otherwise. At a university research is one goal; but learning is definitely another. (2) A key problem that I don't know how to work around for turnkey solutions: "How do I pharase the contract and performance gurrantee so that I get the vendor to do all the things that I want?" Many of us run codes that are not very high volume nor very standardized. Everybody wants to tweak and do something new. Especially in research. In a such a scenario I don't want the vendor to "just give me boxes with an OS" but also get my code installed, compiled, running and optimized. Plus schedulers and some such. Not just install them but set up fairshares that reflect user situations. Most "turnkey" options seem to do just fine for the early parts (as far as I can see) but those are the easy steps. The problem is that , by their very nature, the later steps are harder to "define". And when a problem lacks definition "turnkey" solutions are hard to spec. What good "in house" sys admin manpower does is handle the hairy issues of the latter steps. And once you invest in developing (or training or hiring) good quality sys-admins then taking intelligent decisions about selection, installation, and commissioning are easy enough for the well-trained guys anyways! And if you don't invest in good-quality internal computer guys then the vendors are gonna take you for a ride all the time! -- Rahul From landman at scalableinformatics.com Tue Aug 11 20:54:36 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 11 Aug 2009 23:54:36 -0400 Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> <4A81825B.4000303@unige.ch> <4A818DF4.8040807@tamu.edu> <4A819971.9060905@scalableinformatics.com> Message-ID: <4A823CFC.5050605@scalableinformatics.com> Rahul Nabar wrote: > On Tue, Aug 11, 2009 at 11:16 AM, Joe > Landman wrote: >> There is a cost to going cheap. This cost is time, and loss of >> productivity. If your time (your students time) is free, and you don't need >> to pay for consequences (loss of grants, loss of revenue, loss of >> productivity, ...) in delayed delivery of results from computing or > > > (1) Why always consider it a "loss" of your student's time? I was one Time is a zero sum game irrespective of how much coffee you consume. 
If you wind up spending large fractions of your time on computing, you spend less time on research. Students as cheap/free labor means they aren't getting their research work done (unless their research is on how to build/maintain the cluster). > such "student" think there is enormous learning potential here. Of Yes, there is much to learn. Even some meta-learning, such as when not to spend time on things. > course, my systems never did match the uptime / performance of a > "turnkey" solution but the skills learnt in setting one up are rarely > gained otherwise. At a university research is one goal; but learning > is definitely another. Hmm.... usually the process of research and the process of learning went hand in hand. I agree that people *should* get a grounding in all aspects of their research, and should get their hands dirty to a degree. But you shouldn't have them spend the time they should be doing research in focusing exclusively on managing resources (unless you are trying to teach them how to do time and resource management, which is a very important skill for scientists). > (2) A key problem that I don't know how to work around for turnkey > solutions: "How do I pharase the contract and performance gurrantee so > that I get the vendor to do all the things that I want?" It starts out with you defining the goals you wish to achieve, and then working through the path to achieve them. Decide which portion you wish to do, and find a partner to help you do what you don't want them to do. A good vendor *will* partner with you to solve real problems. > Many of us run codes that are not very high volume nor very > standardized. Everybody wants to tweak and do something new. > Especially in research. In a such a scenario I don't want the vendor > to "just give me boxes with an OS" but also get my code installed, We like to get our hands on the code early, with test cases, so we understand how it will behave on the hardware before our customer gets it ... precisely so we can help answer questions, and solve problems. Most vendors just want to deliver the boxes. > compiled, running and optimized. Plus schedulers and some such. Not > just install them but set up fairshares that reflect user situations. :) I sometimes joke that you know you have your scheduler set up right when everyone hates you. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From deadline at eadline.org Wed Aug 12 07:29:43 2009 From: deadline at eadline.org (Douglas Eadline) Date: Wed, 12 Aug 2009 10:29:43 -0400 (EDT) Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: <4A81825B.4000303@unige.ch> References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> <4A81825B.4000303@unige.ch> Message-ID: <33802.192.168.1.213.1250087383.squirrel@mail.eadline.org> > Douglas Eadline wrote: >> All, >> >> I posted this on ClusterMonkey the other week. >> It is actually derived from a white paper I wrote for >> SiCortex. I'm sure those on this list have some >> experience/opinions with these issues (and other >> cluster issues!) >> >> The True Cost of HPC Cluster Ownership >> >> http://www.clustermonkey.net//content/view/262/1/ >> > > This article sounds unbalanced and self-serving. 
> > While it I clear that self-made clusters imply added new costs > in regard of turn-key clusters, they also empower the buyer > using standard and open solutions by an increased independence from > the vendor, and increases also its knowledge for future choices. > This aspect is hard to measure in monetary terms, but certainly very > important for some users. > > I have experienced all kinds of clusters (turn-key, mostly > self-assembled, and partly vendor assembled and tested), and my conclusion > is that the best is when the user has at least the choice to determine > the degree of vendor integration/lock-in. Bad choices occur > because people are badly informed, and the article is so biased > that it doesn't improve objective information on this regard, just > serves as increasing fear and doubt. A few comments, First, the paper was originally commissioned by SiCortex. I removed promotional parts and updated the paper to reflect what I believe are valid points worth considering when procuring and HPC cluster. (i.e. I believe informing people about unknown potential costs is a good thing (tm)) The simple premise is "due to the nature of clusters, there are costs that were once part of the HPC purchase price, that are not included anymore and you may have to absorb the cost" Second, I did not advocate lock-in of any kind. Indeed, the thing I like about clusters and open software is there is lock-in protection. I do suggest that many people have two choices: 1) go at it on your own and understand that it will probably be a learning experience. It will in all likelihood take longer than you thought. 2) get qualified help so your cluster is up an running ASAP I don't favor either case, My intention was to assist those who do not understand the nature of cluster computing with some of the issues we have all faced at one time or another. As far as being objective, I have experienced first hand all the situations I mentioned. As far as case 2, there is some form of service lock-in, but I consider this a good thing. If you can find someone that keeps things working for you and/or you don't have the time or personnel and/or you want to focus on science or engineering, then paying them is not a bad idea. And by the way, this is normally the case in most industrial settings, plus they need CYA strategy as well. Finally, as far as self serving, well I can tell you the phone is not ringing off the hook. It was not my intention to generate business from an article on ClusterMonkey. I wish it were that easy. Finally, because ClusterMonkey.net is an open community site I encourage you contribute to the conversation. -- Doug From bill at cse.ucdavis.edu Wed Aug 12 08:14:09 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 12 Aug 2009 08:14:09 -0700 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A81CEA7.3030802@noaa.gov> References: <4A81ACF3.60802@noaa.gov> <4A81CEA7.3030802@noaa.gov> Message-ID: <4A82DC41.2060805@cse.ucdavis.edu> I've been working on a pthread memory benchmark that is loosely modeled on McCalpin's stream. It's been quite a challenge to remove all the noise/lost performance from the benchmark to get close to performance I expected. Some of the obstacles: * For the compilers that tend to be better at stream (open64 and pathscale), you lose the performance if you just replace double a[],b[],c[] with double *a,*b,*c. Patch[1] available. I don't have a work around for this, suggestions welcome. 
Is it really necessary for dynamic arrays to be substantially slower than static? * You have to be very careful with pointer alignment both with cache lines, and each other * cpu_affinity (by CPU id) * numa (by socket id) The results are relatively smooth graphs, here's an example, it's uselessly busy until you toggle off a few graphs (by clicking on the key): http://cse.ucdavis.edu/bill/pstream.svg The biggest puzzle I have now is what the previous generation intel quads, the current generation AMD quads, and numerous other CPUs show a big benefit in L1, while the nehalem shows no benefit. [1] http://cse.ucdavis.edu/bill/stream-malloc.patch From kus at free.net Wed Aug 12 08:14:25 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Wed, 12 Aug 2009 19:14:25 +0400 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A81ACF3.60802@noaa.gov> Message-ID: In message from Craig Tierney (Tue, 11 Aug 2009 11:40:03 -0600): >Rahul Nabar wrote: >> On Mon, Aug 10, 2009 at 12:48 PM, Bruno >>Coutinho wrote: >>> This is often caused by cache competition or memory bandwidth >>>saturation. >>> If it was cache competition, rising from 4 to 6 threads would make >>>it worse. >>> As the code became faster with DDR3-1600 and much slower with Xeon >>>5400, >>> this code is memory bandwidth bound. >>> Tweaking CPU affinity to avoid thread jumping among cores of the >>>will not >>> help much, as the big bottleneck is memory bandwidth. >>> To this code, CPU affinity will only help in NUMA machines to >>>maintain >>> memory access in local memory. >>> >>> >>> If the machine has enough bandwidth to feed the cores, it will >>>scale. >> >> Exactly! But I thought this was the big advance with the Nehalem >>that >> it has removed the CPU<->Cache<->RAM bottleneck. So if the code >>scaled >> with the AMD Barcelona then it would continue to scale with the >> Nehalem right? >> >> I'm posting a copy of my scaling plot here if it helps. >> >> http://dl.getdropbox.com/u/118481/nehalem_scaling.jpg >> >> To remove most possible confounding factors this particular Nehlem >> plot is produced with the following settings: >> >> Hyperthreading OFF >> 24GB memory i.e. 6 banks of 4GB. i.e. optimum memory configuration >> X5550 >> >> Even if we explained away the bizzare performance of the 4 node case >> to the Turbo effect what is most confusing is how the 8 core data >> point could be so much slower than the corresponding 8 core point on >>a >> old AMD Barcelona. >> >> Something's wrong here that I just do not understand. BTW, any other >> VASP users here? Anybody have any Nehalem experience? >> > >Rahul, >What are you doing to ensure that you have both memory and processor >affinity enabled? >Craig As I mentioned here in "numactl&SuSE11.1' thread, on some kernels there is wrong behaviour for Nehalem (bad /sys/devices/system/node directory content). This bug is presented, in particular, in default OpenSuSE 11 kernels (2.6.27.7-9 and 2.6.29-6), and (as it was writted in the corresponding thread discussion) in FC11 2.6.29 kernel. I found that in such situation disabling of NUMA in BIOS gives only increase of STREAM throughput. Therefore I think this (Rahul) problem is not due to BIOS settings. Unfortunately I've no data about VASP itself. It's interesting, do somebody have "normally working" w/Nehalem - in the sense of NUMA - kernels ? AFAIK more old 2.6 kernels (from SuSE 10.3) works OK, but I didn't check. May be error in NUMA support is the reason of Rahul problem ? 
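A minimal sanity check of what the kernel thinks the NUMA topology is, for anyone wanting to rule out the bad /sys/devices/system/node behaviour described above (a sketch; run it on the Nehalem node and expect exactly two nodes on a two-socket box):

    ls /sys/devices/system/node/                           # should show node0 and node1 only
    grep MemTotal /sys/devices/system/node/node*/meminfo   # each node should own roughly half the RAM
    numactl --hardware                                     # the same information via libnuma
    numastat                                               # numa_hit/numa_miss counters; misses growing
                                                           # quickly under load is a bad sign

The boot log quoted earlier in this digest, with its "SRAT: Node 2 PXM 257" line flagged "(here !!)", is exactly the kind of oddity this check is meant to catch.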
Mikhail > > >> -- >> Rahul >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >>Computing >> To change your subscription (digest mode or unsubscribe) visit >>http://www.beowulf.org/mailman/listinfo/beowulf >> > > >-- >Craig Tierney (craig.tierney at noaa.gov) >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >Computing >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf > >-- >??? ????????? ???? ????????? ?? ??????? ? ??? ??????? >? ????? ???????? ??????????? ??????????? >MailScanner, ? ?? ???????? >??? ??? ?? ???????? ???????????? ????. > From jlb17 at duke.edu Wed Aug 12 08:43:03 2009 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Wed, 12 Aug 2009 11:43:03 -0400 (EDT) Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> <4A81825B.4000303@unige.ch> <4A818DF4.8040807@tamu.edu> <4A819971.9060905@scalableinformatics.com> Message-ID: On Tue, 11 Aug 2009 at 9:50pm, Robert G. Brown wrote > In a nutshell, the "cost of going cheap" isn't linear, with or without > student/cheap labor. For small clusters installed by somebody who knows > what they are doing and e.g. operated and used by the owner or the > owner's lab including students, operated by departmental sysadmins with > cluster experience and enough warm bodies to have some opportunity cost > labor handy -- sure, go cheap -- if a node or two is DOA or fails, so > what? It takes you an extra day or two to get the cluster going, but > most of that time is waiting for parts -- OC time is much smaller, and > everybody has other things to do while waiting. But as clusters get > larger, the marginal cost of the differential failure rate between cheap > and expensive scales up badly and can easily exceed the OC labor pool's > capacity, especially if by bad luck you get a cheap node and it turns > out to be a "lemon" and the faraway dot com that sold it to you refuses > to fix or replace it. The turnover from cheap to much more expensive > than just getting good nodes from a reputable vendor (which don't > usually cost THAT much more than cheap) can happen real fast, and the > time wasted can go from a few days to months equally fast. One thing I haven't seen addressed is to look at the proposed usage of the cluster. If most of the code to be run on the cluster is embarrassingly parallel, then the cost of a node going down or the network being less than optimal is fairly low. In this case, IMO, it's pretty easy to make the argument to go the DIY route (depending on size and available labor pool, of course, as others have mentioned). If, OTOH, you intend to run tightly coupled MPI code across the entire cluster, then it becomes very valuable to ensure that everything is working together just so. There a turn-key vendor (and/or highly skilled third party) can make more sense. In other words, the answer, as always, is "It depends." 
-- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From Craig.Tierney at noaa.gov Wed Aug 12 09:32:08 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Wed, 12 Aug 2009 10:32:08 -0600 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: Message-ID: <4A82EE88.20908@noaa.gov> Mikhail Kuzminsky wrote: > In message from Craig Tierney (Tue, 11 Aug 2009 > 11:40:03 -0600): >> Rahul Nabar wrote: >>> On Mon, Aug 10, 2009 at 12:48 PM, Bruno >>> Coutinho wrote: >>>> This is often caused by cache competition or memory bandwidth >>>> saturation. >>>> If it was cache competition, rising from 4 to 6 threads would make >>>> it worse. >>>> As the code became faster with DDR3-1600 and much slower with Xeon >>>> 5400, >>>> this code is memory bandwidth bound. >>>> Tweaking CPU affinity to avoid thread jumping among cores of the >>>> will not >>>> help much, as the big bottleneck is memory bandwidth. >>>> To this code, CPU affinity will only help in NUMA machines to maintain >>>> memory access in local memory. >>>> >>>> >>>> If the machine has enough bandwidth to feed the cores, it will scale. >>> >>> Exactly! But I thought this was the big advance with the Nehalem that >>> it has removed the CPU<->Cache<->RAM bottleneck. So if the code scaled >>> with the AMD Barcelona then it would continue to scale with the >>> Nehalem right? >>> >>> I'm posting a copy of my scaling plot here if it helps. >>> >>> http://dl.getdropbox.com/u/118481/nehalem_scaling.jpg >>> >>> To remove most possible confounding factors this particular Nehlem >>> plot is produced with the following settings: >>> >>> Hyperthreading OFF >>> 24GB memory i.e. 6 banks of 4GB. i.e. optimum memory configuration >>> X5550 >>> >>> Even if we explained away the bizzare performance of the 4 node case >>> to the Turbo effect what is most confusing is how the 8 core data >>> point could be so much slower than the corresponding 8 core point on a >>> old AMD Barcelona. >>> >>> Something's wrong here that I just do not understand. BTW, any other >>> VASP users here? Anybody have any Nehalem experience? >>> >> >> Rahul, >> What are you doing to ensure that you have both memory and processor >> affinity enabled? >> Craig > > As I mentioned here in "numactl&SuSE11.1' thread, on some kernels there > is wrong behaviour for Nehalem (bad /sys/devices/system/node directory > content). This bug is presented, in particular, in default OpenSuSE 11 > kernels (2.6.27.7-9 and 2.6.29-6), and (as it was writted in the > corresponding thread discussion) in FC11 2.6.29 kernel. > > I found that in such situation disabling of NUMA in BIOS gives only > increase of STREAM throughput. Therefore I think this (Rahul) problem is > not due to BIOS settings. Unfortunately I've no data about VASP itself. > > It's interesting, do somebody have "normally working" w/Nehalem - in the > sense of NUMA - kernels ? AFAIK more old 2.6 kernels (from SuSE 10.3) > works OK, but I didn't check. May be error in NUMA support is the reason > of Rahul problem ? > What do you mean normally? I am running Centos 5.3 with 2.6.18-128.2.1 right now on a 448 node Nehalem cluster. I am so far happy with how things work. The original Centos 5.3 kernel, 2.6.18-128.1.10 had bugs in Nelahem support where nodes would just start randomly run slow. Upgrading the kernel fixed that. But that performance problem was either all or none, I don't recall it exhibiting itself in the way that Rahul described. 
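Since the fix in this case was simply running a newer errata kernel, a quick cluster-wide check that every node actually boots the same kernel can save a lot of head-scratching (a sketch; nodes.txt and a plain ssh loop stand in for whatever hostlist or pdsh setup is in use):

    for n in $(cat nodes.txt); do ssh $n uname -r; done | sort | uniq -c
    # one line of output means every node runs the same kernel;
    # anything else points at the stragglers to update and reboot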
Craig > Mikhail >> >> >>> -- >>> Rahul >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >>> To change your subscription (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >>> >> >> >> -- >> Craig Tierney (craig.tierney at noaa.gov) >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> >> -- >> ??? ????????? ???? ????????? ?? ??????? ? ??? ??????? >> ? ????? ???????? ??????????? ??????????? >> MailScanner, ? ?? ???????? >> ??? ??? ?? ???????? ???????????? ????. >> > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Craig Tierney (craig.tierney at noaa.gov) From rpnabar at gmail.com Wed Aug 12 10:56:11 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 12 Aug 2009 12:56:11 -0500 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A82EE88.20908@noaa.gov> References: <4A82EE88.20908@noaa.gov> Message-ID: On Wed, Aug 12, 2009 at 11:32 AM, Craig Tierney wrote: > What do you mean normally? ?I am running Centos 5.3 with 2.6.18-128.2.1 > right now on a 448 node Nehalem cluster. ?I am so far happy with how things work. > The original Centos 5.3 kernel, 2.6.18-128.1.10 had bugs in Nelahem support > where nodes would just start randomly run slow. ?Upgrading the kernel > fixed that. ?But that performance problem was either all or none, I don't recall > it exhibiting itself in the way that Rahul described. > For me it shows: Linux version 2.6.18-128.el5 (mockbuild at builder10.centos.org) I am a bit confused with the numbering scheme, now. Is this older or newer than Craigs? You are right Craig, I haven't noticed any random slowdowns but my data is statistically sparse. I only have a single Nehalem+CentOS test node right now. -- Rahul From rpnabar at gmail.com Wed Aug 12 10:58:38 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 12 Aug 2009 12:58:38 -0500 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A81ACF3.60802@noaa.gov> Message-ID: On Wed, Aug 12, 2009 at 10:14 AM, Mikhail Kuzminsky wrote: > As I mentioned here in "numactl&SuSE11.1' thread, on some kernels there is > wrong behaviour for Nehalem (bad /sys/devices/system/node directory > content). This bug is presented, in particular, in default OpenSuSE 11 > kernels (2.6.27.7-9 and 2.6.29-6), and (as it was writted in the > corresponding thread discussion) in FC11 2.6.29 kernel. Is there a way to check if I have this bug? ls /sys/devices/system/node/ node0 node1 Don't know what exactly is "bad content" in here. I'll see if I can find a bug report online. > It's interesting, do somebody have "normally working" w/Nehalem - in the > sense of NUMA - kernels ? AFAIK more old 2.6 kernels (from SuSE 10.3) works > OK, but I didn't check. May be error in NUMA support is the reason of Rahul > problem ? > Any way to test? Are there any NUMA support tests or benchmarks? 
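On the numbering question: 2.6.18-128.el5 is the CentOS 5.3 GA kernel, while 2.6.18-128.2.1.el5 is a later errata rebuild of the same base, so the node above is running the older of the two. A minimal way to check for and pull the newer one (standard yum usage, sketched):

    rpm -q kernel              # what is installed
    yum check-update kernel    # lists a newer errata kernel if the mirrors carry one
    yum -y update kernel       # install it; takes effect after a reboot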
-- Rahul From rpnabar at gmail.com Wed Aug 12 11:00:41 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 12 Aug 2009 13:00:41 -0500 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A82EE88.20908@noaa.gov> References: <4A82EE88.20908@noaa.gov> Message-ID: On Wed, Aug 12, 2009 at 11:32 AM, Craig Tierney wrote: > What do you mean normally? ?I am running Centos 5.3 with 2.6.18-128.2.1 > right now on a 448 node Nehalem cluster. ?I am so far happy with how things work. > The original Centos 5.3 kernel, 2.6.18-128.1.10 had bugs in Nelahem support > where nodes would just start randomly run slow. ?Upgrading the kernel > fixed that. ?But that performance problem was either all or none, I don't recall > it exhibiting itself in the way that Rahul described. > I was trying another angle. Playing with the power profiles. Just downloaded cpufreq-utils via yum. Tried to see what profile was loaded: cpufreq-info cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006 Report errors and bugs to cpufreq at vger.kernel.org, please. analyzing CPU 0: no or unknown cpufreq driver is active on this CPU analyzing CPU 1: no or unknown cpufreq driver is active on this CPU analyzing CPU 2: no or unknown cpufreq driver is active on this CPU analyzing CPU 3: no or unknown cpufreq driver is active on this CPU analyzing CPU 4: no or unknown cpufreq driver is active on this CPU analyzing CPU 5: no or unknown cpufreq driver is active on this CPU analyzing CPU 6: no or unknown cpufreq driver is active on this CPU analyzing CPU 7: no or unknown cpufreq driver is active on this CPU Is this lack of the right drivers indicative of a deeper fault or is this fairly local to this issue? This could be a clue or a red herring. Just thought that I ought to post it. -- Rahul From Craig.Tierney at noaa.gov Wed Aug 12 11:02:15 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Wed, 12 Aug 2009 12:02:15 -0600 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A82EE88.20908@noaa.gov> Message-ID: <4A8303A7.50801@noaa.gov> Rahul Nabar wrote: > On Wed, Aug 12, 2009 at 11:32 AM, Craig Tierney wrote: >> What do you mean normally? I am running Centos 5.3 with 2.6.18-128.2.1 >> right now on a 448 node Nehalem cluster. I am so far happy with how things work. >> The original Centos 5.3 kernel, 2.6.18-128.1.10 had bugs in Nelahem support >> where nodes would just start randomly run slow. Upgrading the kernel >> fixed that. But that performance problem was either all or none, I don't recall >> it exhibiting itself in the way that Rahul described. >> > > For me it shows: > > Linux version 2.6.18-128.el5 (mockbuild at builder10.centos.org) > > I am a bit confused with the numbering scheme, now. Is this older or > newer than Craigs? You are right Craig, I haven't noticed any random > slowdowns but my data is statistically sparse. I only have a single > Nehalem+CentOS test node right now. > When you run uname -a you don't get something like: [ctierney at wfe7 serial]$ uname -a Linux wfe7 2.6.18-128.2.1.el5 #1 SMP Thu Aug 6 02:00:18 GMT 2009 x86_64 x86_64 x86_64 GNU/Linux We did build our kernel from source, only because we ripped out the IB so we could build from the latest OFED stack. Try: # rpm -qa | grep kernel And see what version is listed. We have found a few performance problems so far. 1) Nodes would start going slow, really slow. However, when they started to go slow they stayed slow and the problem was cleared by a reboot. 
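One simple NUMA functional test is to pin McCalpin's STREAM (or any bandwidth benchmark) to one socket and compare local against remote memory (a sketch; ./stream is a placeholder for your own STREAM build):

    numactl --hardware                               # two nodes, each with its own memory?
    numactl --cpunodebind=0 --membind=0 ./stream     # cores on socket 0, local memory
    numactl --cpunodebind=0 --membind=1 ./stream     # same cores, remote memory
    # on a healthy dual-socket Nehalem the local run should be clearly faster;
    # near-identical numbers suggest NUMA is disabled in the BIOS or misreported
    # by the kernel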
This problem was resolved by upgrading to the kernel we use now. 2) Nodes are reporting too many System Events that look like single-bit errors. This again would show up as nodes that would start to go slow, and wouldn't be resolved until a reboot. We no longer things we had lots of bad memory, and the latest BIOS may have fixed it. We are upload that bios now and will start checking. The only time I was getting variability in timings was when I wasn't pinning processes and memory correctly. My tests have always used all the cores in a node though. I think that OpenMPI is doing the correct thing with mpi_affinity_alone. For mvapich, we wrote a wrapper script (similar to TACC) that uses numactl directly to pin memory and threads. Craig -- Craig Tierney (craig.tierney at noaa.gov) From Craig.Tierney at noaa.gov Wed Aug 12 11:06:23 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Wed, 12 Aug 2009 12:06:23 -0600 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A82EE88.20908@noaa.gov> Message-ID: <4A83049F.8000009@noaa.gov> Rahul Nabar wrote: > On Wed, Aug 12, 2009 at 11:32 AM, Craig Tierney wrote: >> What do you mean normally? I am running Centos 5.3 with 2.6.18-128.2.1 >> right now on a 448 node Nehalem cluster. I am so far happy with how things work. >> The original Centos 5.3 kernel, 2.6.18-128.1.10 had bugs in Nelahem support >> where nodes would just start randomly run slow. Upgrading the kernel >> fixed that. But that performance problem was either all or none, I don't recall >> it exhibiting itself in the way that Rahul described. >> > > I was trying another angle. Playing with the power profiles. Just > downloaded cpufreq-utils via yum. Tried to see what profile was > loaded: > > cpufreq-info > cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006 > Report errors and bugs to cpufreq at vger.kernel.org, please. > analyzing CPU 0: > no or unknown cpufreq driver is active on this CPU > analyzing CPU 1: > no or unknown cpufreq driver is active on this CPU > analyzing CPU 2: > no or unknown cpufreq driver is active on this CPU > analyzing CPU 3: > no or unknown cpufreq driver is active on this CPU > analyzing CPU 4: > no or unknown cpufreq driver is active on this CPU > analyzing CPU 5: > no or unknown cpufreq driver is active on this CPU > analyzing CPU 6: > no or unknown cpufreq driver is active on this CPU > analyzing CPU 7: > no or unknown cpufreq driver is active on this CPU > > Is this lack of the right drivers indicative of a deeper fault or is > this fairly local to this issue? This could be a clue or a red > herring. Just thought that I ought to post it. > I read there are several different tools to manage the CPU frequency. If you are using Centos/Redhat try: cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors Does it list any? If not, that might be why cpufreq-info can't find anything. Craig -- Craig Tierney (craig.tierney at noaa.gov) From gus at ldeo.columbia.edu Wed Aug 12 11:09:04 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed, 12 Aug 2009 14:09:04 -0400 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A82DC41.2060805@cse.ucdavis.edu> References: <4A81ACF3.60802@noaa.gov> <4A81CEA7.3030802@noaa.gov> <4A82DC41.2060805@cse.ucdavis.edu> Message-ID: <4A830540.1070101@ldeo.columbia.edu> Hi Bill, list Bill: This is very interesting indeed. Thanks for sharing! 
Bill's graph seem to show that Shanghai and Barcelona scale (almost) linearly with the number of cores, whereas Nehalem stops scaling and flattens out at 4 cores. The Nehalem 8 cores and 4 cores curves are virtually indistinguishable, and for very large arrays 4 cores is ahead. Only for huge arrays (>16M) Nehalem gets ahead of Shanghai and Barcelona. Did I interpret the graph right? Wasn't this type of scaling problem that plagued the Clovertown and Harpertown? Any possibility that kernels, BIOS, etc, are not yet ready for Nehalem? Thanks, Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Bill Broadley wrote: > I've been working on a pthread memory benchmark that is loosely modeled on > McCalpin's stream. It's been quite a challenge to remove all the noise/lost > performance from the benchmark to get close to performance I expected. Some > of the obstacles: > * For the compilers that tend to be better at stream (open64 and pathscale), > you lose the performance if you just replace double a[],b[],c[] with > double *a,*b,*c. Patch[1] available. I don't have a work around for > this, suggestions welcome. Is it really necessary for dynamic arrays > to be substantially slower than static? > * You have to be very careful with pointer alignment both with cache lines, > and each other > * cpu_affinity (by CPU id) > * numa (by socket id) > > The results are relatively smooth graphs, here's an example, it's uselessly > busy until you toggle off a few graphs (by clicking on the key): > > http://cse.ucdavis.edu/bill/pstream.svg > > The biggest puzzle I have now is what the previous generation intel quads, the > current generation AMD quads, and numerous other CPUs show a big benefit in > L1, while the nehalem shows no benefit. > > [1] http://cse.ucdavis.edu/bill/stream-malloc.patch > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bill at Princeton.EDU Wed Aug 12 11:11:45 2009 From: bill at Princeton.EDU (Bill Wichser) Date: Wed, 12 Aug 2009 14:11:45 -0400 Subject: [Beowulf] FYI - Dell M610 BIOS 1.0.4 Message-ID: <4A8305E1.1030705@princeton.edu> We just upgraded our BIOS on the Dell M610 blades from 1.0.4 to 1.1.4 and found that memory performance using nodeperf benchmark has almost doubled. If anyone has an old BIOS on these types of nodes I'd HIGHLY recommend updating the BIOS. Bill From rpnabar at gmail.com Wed Aug 12 11:14:29 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 12 Aug 2009 13:14:29 -0500 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A83049F.8000009@noaa.gov> References: <4A82EE88.20908@noaa.gov> <4A83049F.8000009@noaa.gov> Message-ID: On Wed, Aug 12, 2009 at 1:06 PM, Craig Tierney wrote: > I read there are several different tools to manage the CPU frequency. > If you are using Centos/Redhat try: > > cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors > > Does it list any? ?If not, that might be why cpufreq-info can't > find anything. Thanks again Craig! cat: /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors: No such file or directory But I distinctly remember several "power performance " options being listed in the BIOS. 
It is funny that the governors aren't listed. I am not sure how I can fix this. On my older AMD Barcelonas I am used to setting the governor to "performance" and that way it then does not drop its frequency ever. -- Rahul From rpnabar at gmail.com Wed Aug 12 11:16:02 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 12 Aug 2009 13:16:02 -0500 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A8303A7.50801@noaa.gov> References: <4A82EE88.20908@noaa.gov> <4A8303A7.50801@noaa.gov> Message-ID: On Wed, Aug 12, 2009 at 1:02 PM, Craig Tierney wrote: > When you run uname -a you don't get something like: > [ctierney at wfe7 serial]$ uname -a > Linux wfe7 2.6.18-128.2.1.el5 #1 SMP Thu Aug 6 02:00:18 GMT 2009 x86_64 x86_64 x86_64 GNU/Linux uname -a Linux node25 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux So it does seem that yours might be a little newer than mine (the 2.1 suffix?) > > We did build our kernel from source, only because we ripped out > the IB so we could build from the latest OFED stack. We didn't. We used the latest from the CentOS website. > Try: > > # rpm -qa | grep kernel > rpm -qa | grep kernel kernel-devel-2.6.18-128.el5 kernel-headers-2.6.18-128.el5 kernel-2.6.18-128.el5 From bill at cse.ucdavis.edu Wed Aug 12 11:19:59 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 12 Aug 2009 11:19:59 -0700 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A830540.1070101@ldeo.columbia.edu> References: <4A81ACF3.60802@noaa.gov> <4A81CEA7.3030802@noaa.gov> <4A82DC41.2060805@cse.ucdavis.edu> <4A830540.1070101@ldeo.columbia.edu> Message-ID: <4A8307CF.7090700@cse.ucdavis.edu> Gus Correa wrote: > Hi Bill, list > > Bill: This is very interesting indeed. Thanks for sharing! > > Bill's graph seem to show that Shanghai and Barcelona scale > (almost) linearly with the number of cores, whereas Nehalem stops > scaling and flattens out at 4 cores. Right. That's not really surprising since the core i7 has only 4 cores. I wasn't testing a dual socket nehalem. So on a single socket core i7 that I tested the hyperthreading provided no additional performance. None to surprising since hyperthreading is about sharing idle functional units, but doesn't do much when the cache or memory system is saturated. > The Nehalem 8 cores and 4 cores curves are virtually indistinguishable, Yes, but it was 8 threads on 4 cores, vs 4 threads on 4 cores. I'd expect something less memory intensive and more cpu intensive would show a big difference. In fact many of the HPC codes I've tried see a benefit. > and for very large arrays 4 cores is ahead. > Only for huge arrays (>16M) Nehalem gets ahead > of Shanghai and Barcelona. Yes, impressive that a single socket intel has more main memory bandwidth then a dual socket shanghai. > Did I interpret the graph right? > Wasn't this type of scaling problem that plagued > the Clovertown and Harpertown? Heh, the mention single socket core i7 has substantially more (2-4x) memory bandwidth of the previous generation intels. > Any possibility that kernels, BIOS, etc, are not yet ready for Nehalem? They look good for me, still trying to find out why I don't see better performance inside L1 though. 
From Craig.Tierney at noaa.gov Wed Aug 12 11:21:39 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Wed, 12 Aug 2009 12:21:39 -0600 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A82EE88.20908@noaa.gov> <4A83049F.8000009@noaa.gov> Message-ID: <4A830833.7090209@noaa.gov> Rahul Nabar wrote: > On Wed, Aug 12, 2009 at 1:06 PM, Craig Tierney wrote: >> I read there are several different tools to manage the CPU frequency. >> If you are using Centos/Redhat try: >> >> cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors >> >> Does it list any? If not, that might be why cpufreq-info can't >> find anything. > > Thanks again Craig! > > cat: /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors: > No such file or directory > > But I distinctly remember several "power performance " options being > listed in the BIOS. It is funny that the governors aren't listed. I am > not sure how I can fix this. > > On my older AMD Barcelonas I am used to setting the governor to > "performance" and that way it then does not drop its frequency ever. > We have ours set to performance and then copy the setting from scaling_max_freq to scaling_min_freq to make sure the speed is set right. When everything works we will play with modifying that in the prolog/epilog of the batch system to try and reduce power consumption during idle time. Craig -- Craig Tierney (craig.tierney at noaa.gov) From Craig.Tierney at noaa.gov Wed Aug 12 11:22:15 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Wed, 12 Aug 2009 12:22:15 -0600 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A82EE88.20908@noaa.gov> <4A8303A7.50801@noaa.gov> Message-ID: <4A830857.3020104@noaa.gov> Rahul Nabar wrote: > On Wed, Aug 12, 2009 at 1:02 PM, Craig Tierney wrote: >> When you run uname -a you don't get something like: > >> [ctierney at wfe7 serial]$ uname -a >> Linux wfe7 2.6.18-128.2.1.el5 #1 SMP Thu Aug 6 02:00:18 GMT 2009 x86_64 x86_64 x86_64 GNU/Linux > > uname -a > Linux node25 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 > x86_64 x86_64 GNU/Linux > > So it does seem that yours might be a little newer than mine (the 2.1 suffix?) > > >> We did build our kernel from source, only because we ripped out >> the IB so we could build from the latest OFED stack. > > We didn't. We used the latest from the CentOS website. > >> Try: >> >> # rpm -qa | grep kernel >> > > rpm -qa | grep kernel > kernel-devel-2.6.18-128.el5 > kernel-headers-2.6.18-128.el5 > kernel-2.6.18-128.el5 Weird. You might try and see if there are later packages to install. Craig -- Craig Tierney (craig.tierney at noaa.gov) From kus at free.net Wed Aug 12 11:50:16 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Wed, 12 Aug 2009 22:50:16 +0400 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A830540.1070101@ldeo.columbia.edu> Message-ID: In message from Gus Correa (Wed, 12 Aug 2009 14:09:04 -0400): >Hi Bill, list > >Bill: This is very interesting indeed. Thanks for sharing! > >Bill's graph seem to show that Shanghai and Barcelona scale >(almost) linearly with the number of cores, whereas Nehalem stops >scaling and flattens out at 4 cores. >The Nehalem 8 cores and 4 cores curves are virtually >indistinguishable, >and for very large arrays 4 cores is ahead. >Only for huge arrays (>16M) Nehalem gets ahead >of Shanghai and Barcelona. IMHO, if arrays are not "huge", they will fit in cache L3 (8MB !). Or on X axe are presented Mwords ? 
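A quick back-of-envelope on Mikhail's question, with the caveat that I have not checked which unit the plot's x-axis uses: a stream-style run keeps three arrays live, i.e. 24 bytes per element, so an 8 MB L3 covers only about 0.35M doubles in total (roughly 0.1M per array). If the axis is millions of doubles per array, 16M elements is 3 x 128 MB = 384 MB of working set, far beyond any cache; if it is MB per array, three 16 MB arrays are still 48 MB, several times the L3.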
Mikhail > >Did I interpret the graph right? >Wasn't this type of scaling problem that plagued >the Clovertown and Harpertown? >Any possibility that kernels, BIOS, etc, are not yet ready for >Nehalem? > >Thanks, >Gus Correa >--------------------------------------------------------------------- >Gustavo Correa >Lamont-Doherty Earth Observatory - Columbia University >Palisades, NY, 10964-8000 - USA >--------------------------------------------------------------------- > >Bill Broadley wrote: >> I've been working on a pthread memory benchmark that is loosely >>modeled on >> McCalpin's stream. It's been quite a challenge to remove all the >>noise/lost >> performance from the benchmark to get close to performance I >>expected. Some >> of the obstacles: >> * For the compilers that tend to be better at stream (open64 and >>pathscale), >> you lose the performance if you just replace double a[],b[],c[] >>with >> double *a,*b,*c. Patch[1] available. I don't have a work around >>for >> this, suggestions welcome. Is it really necessary for dynamic >>arrays >> to be substantially slower than static? >> * You have to be very careful with pointer alignment both with cache >>lines, >> and each other >> * cpu_affinity (by CPU id) >> * numa (by socket id) >> >> The results are relatively smooth graphs, here's an example, it's >>uselessly >> busy until you toggle off a few graphs (by clicking on the key): >> >> http://cse.ucdavis.edu/bill/pstream.svg >> >> The biggest puzzle I have now is what the previous generation intel >>quads, the >> current generation AMD quads, and numerous other CPUs show a big >>benefit in >> L1, while the nehalem shows no benefit. >> >> [1] http://cse.ucdavis.edu/bill/stream-malloc.patch >> >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >>Computing >> To change your subscription (digest mode or unsubscribe) visit >>http://www.beowulf.org/mailman/listinfo/beowulf > >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >Computing >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf > >-- >??? ????????? ???? ????????? ?? ??????? ? ??? ??????? >? ????? ???????? ??????????? ??????????? >MailScanner, ? ?? ???????? >??? ??? ?? ???????? ???????????? ????. > From rpnabar at gmail.com Wed Aug 12 11:55:45 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 12 Aug 2009 13:55:45 -0500 Subject: [Beowulf] confused about high values of "used" memory under "top" even without running jobs Message-ID: I am a bit confused about the high "used" memory that top is showing on one of my machines? Is this "leaky" memory caused by codes that did not return all their memory? Can I identify who is hogging the memory? Any other ways to "release" this memory? I can see no user processes really (even the load average is close to zero), but yet 7 GB out of our total of 16GB seems to be used. 
################################################ top - 13:45:00 up 4 days, 20:07, 2 users, load average: 0.00, 0.00, 0.00 Tasks: 146 total, 1 running, 145 sleeping, 0 stopped, 0 zombie Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 16508824k total, 7148804k used, 9360020k free, 307040k buffers Swap: 8385920k total, 0k used, 8385920k free, 6380236k cached ########################################################### On the other hand, I recall reading somewhere before that due to the paging mechanism Linux is also supposed to start using as much memory as you give it? Just confused if this is something I need to worry about or not. Incidentally the way I discovered this was because users reported that their codes were running ~30% faster right after a machine reboot as opposed to after a few days running. Do people do anything special to make sure that in a scheduler based environment (say PBS) the last job releases all its memory resources before the new one starts running? [apologize for the multi-posting ; I first posted this on a generic linux list but then thought that HPC guys might be more sensitive and concerned about such memory issues] -- Rahul From mdidomenico4 at gmail.com Wed Aug 12 12:06:08 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed, 12 Aug 2009 15:06:08 -0400 Subject: [Beowulf] confused about high values of "used" memory under "top" even without running jobs In-Reply-To: References: Message-ID: My first guess is that this is likely files being cached. We have NFS mounts and this happens alot... I don't regularly, but on occasion i've sync'd the filesystems and dumped the caches with a prolog script. Linux doesn't seem to empty the page cache faster enough, i'm sure theres a more eloquent way to do this On Wed, Aug 12, 2009 at 2:55 PM, Rahul Nabar wrote: > I am a bit confused about the high "used" memory that top is showing on one > of my machines? Is this "leaky" memory caused by codes that did not return > all their memory? Can I identify who is hogging the memory? Any other ways > to "release" this memory? > > I can see no user processes really (even the load average is close to > zero), but yet 7 GB out of our total of 16GB seems to be used. > > ################################################ > top - 13:45:00 up 4 days, 20:07, 2 users, load average: 0.00, 0.00, 0.00 > Tasks: 146 total, 1 running, 145 sleeping, 0 stopped, 0 zombie > Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, > 0.0%st > Mem: 16508824k total, 7148804k used, 9360020k free, 307040k buffers > Swap: 8385920k total, 0k used, 8385920k free, 6380236k cached > ########################################################### > > On the other hand, I recall reading somewhere before that due to the > paging mechanism > Linux is also supposed to start using as much memory as you give it? Just > confused if this is something I need to worry about or not. > > Incidentally the way I discovered this was because users reported that their > codes were running ~30% faster right after a machine reboot as opposed to > after a few days running. Do people do anything special to make sure > that in a scheduler based environment (say PBS) the last job releases > all its memory resources before the new one starts running? 
> > [apologize for the multi-posting ; I first posted this on a generic > linux list but then thought that HPC guys might be more sensitive and > concerned about such memory issues] > -- > Rahul > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From jlb17 at duke.edu Wed Aug 12 12:07:56 2009 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Wed, 12 Aug 2009 15:07:56 -0400 (EDT) Subject: [Beowulf] confused about high values of "used" memory under "top" even without running jobs In-Reply-To: References: Message-ID: On Wed, 12 Aug 2009 at 1:55pm, Rahul Nabar wrote > Mem: 16508824k total, 7148804k used, 9360020k free, 307040k buffers > Swap: 8385920k total, 0k used, 8385920k free, 6380236k cached > ########################################################### > > On the other hand, I recall reading somewhere before that due to the > paging mechanism > Linux is also supposed to start using as much memory as you give it? Just > confused if this is something I need to worry about or not. Yes, Linux caches as much in memory as it can, and this is a good thing. But that memory gets released when it's needed by an active process. So, above, you have 7148804k used, but the vast majority (6380236k), is cache. If you use 'free', it'll show you the amount not counting buffers+cache. In short -- situation normal, nothing to see here, move along. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From hahn at mcmaster.ca Wed Aug 12 12:07:57 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 12 Aug 2009 15:07:57 -0400 (EDT) Subject: [Beowulf] confused about high values of "used" memory under "top" even without running jobs In-Reply-To: References: Message-ID: > I am a bit confused about the high "used" memory that top is showing on one > of my machines? Is this "leaky" memory caused by codes that did not return > all their memory? Can I identify who is hogging the memory? Any other ways > to "release" this memory? free memory is WASTED memory. linux tries hard to keep only a smallish, limited amount of memory wasted. if you add up rss of all processes, the difference between that and 'used' is normally dominated by kernel page-cache. see /proc/sys/vm/drop_caches on how to force the kernel to throw away FS-related caches. also, I often do this: awk '{print $3*$4,$0}' /proc/slabinfo|sort -rn|head to get a quick snapshot of kinds of memory use. > Linux is also supposed to start using as much memory as you give it? Just > confused if this is something I need to worry about or not. you should never worry about paging (swapping, thrashing) until you see nontrivial swapin (NOT out) traffic. (ie, the 'si' column in "vmstat 1"). > Incidentally the way I discovered this was because users reported that their > codes were running ~30% faster right after a machine reboot as opposed to > after a few days running. isn't this one of the anomalous nehalem machines we've been talking about? if so, it's become clear that the kernel isn't managing the memory numa-aware, so the problem is probably just poor numa-layout/balance of allocations. > that in a scheduler based environment (say PBS) the last job releases > all its memory resources before the new one starts running? you could drop_caches, but this would also hurt you sometimes. 
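To put a number on the "free memory is wasted memory" point for the machine quoted above: what matters is free plus reclaimable page cache, which is what free(1) reports on its "-/+ buffers/cache" line. A small sketch of my own that computes the same thing from the standard /proc/meminfo fields (minimal error handling):

/* meminfo.c -- report "effectively available" memory the way
 * free(1)'s "-/+ buffers/cache" line does.
 */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[256];
    unsigned long total = 0, freemem = 0, buffers = 0, cached = 0;

    if (!f) { perror("/proc/meminfo"); return 1; }
    while (fgets(line, sizeof(line), f)) {
        sscanf(line, "MemTotal: %lu", &total);
        sscanf(line, "MemFree: %lu", &freemem);
        sscanf(line, "Buffers: %lu", &buffers);
        sscanf(line, "Cached: %lu", &cached);
    }
    fclose(f);

    printf("total      %lu kB\n", total);
    printf("page cache %lu kB (reclaimed automatically under pressure)\n",
           buffers + cached);
    printf("available  %lu kB\n", freemem + buffers + cached);
    return 0;
}

Against the top output quoted earlier in the thread it would report roughly 16.0 GB of the 16.5 GB as available, which matches Joshua's "nothing to see here".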
From rpnabar at gmail.com Wed Aug 12 12:12:39 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 12 Aug 2009 14:12:39 -0500 Subject: [Beowulf] confused about high values of "used" memory under "top" even without running jobs In-Reply-To: References: Message-ID: On Wed, Aug 12, 2009 at 2:07 PM, Mark Hahn wrote: > isn't this one of the anomalous nehalem machines we've been talking about? > if so, it's become clear that the kernel isn't managing the memory > numa-aware, so the problem is probably just poor numa-layout/balance > of allocations. Thanks Mark. It is one of those Nehalems. -- Rahul From gus at ldeo.columbia.edu Wed Aug 12 12:27:06 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed, 12 Aug 2009 15:27:06 -0400 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A8307CF.7090700@cse.ucdavis.edu> References: <4A81ACF3.60802@noaa.gov> <4A81CEA7.3030802@noaa.gov> <4A82DC41.2060805@cse.ucdavis.edu> <4A830540.1070101@ldeo.columbia.edu> <4A8307CF.7090700@cse.ucdavis.edu> Message-ID: <4A83178A.7020503@ldeo.columbia.edu> Hi Bill, list Bill: Many thanks for all the answers. Thanks also for the important clarification. So, the graphs you sent before compare dual socket Shanghai and Barcelona, to single socket Nehalem, right? This changes the perception a lot, as one should at most compare the 4-thread Shanghai and Barcelona curves (assuming the threads were running on a single socket) to the 4-thread Nehalem curves, right? The 8-thread curves are different animals. Would you have the (full) comparison to dual socket Nehalem, perhaps using the SMT feature also, and up to 16 threads? The benefit of SMT in HPC codes you mention matches what I saw with SMT on PPC IBM machines running climate models. (I don't have access to Nehalems to try the same codes for now.) Thank you, Gus Correa Bill Broadley wrote: > Gus Correa wrote: >> Hi Bill, list >> >> Bill: This is very interesting indeed. Thanks for sharing! >> >> Bill's graph seem to show that Shanghai and Barcelona scale >> (almost) linearly with the number of cores, whereas Nehalem stops >> scaling and flattens out at 4 cores. > > Right. That's not really surprising since the core i7 has only 4 cores. I > wasn't testing a dual socket nehalem. So on a single socket core i7 that I > tested the hyperthreading provided no additional performance. None to > surprising since hyperthreading is about sharing idle functional units, but > doesn't do much when the cache or memory system is saturated. > >> The Nehalem 8 cores and 4 cores curves are virtually indistinguishable, > > Yes, but it was 8 threads on 4 cores, vs 4 threads on 4 cores. I'd expect > something less memory intensive and more cpu intensive would show a big > difference. In fact many of the HPC codes I've tried see a benefit. > >> and for very large arrays 4 cores is ahead. >> Only for huge arrays (>16M) Nehalem gets ahead >> of Shanghai and Barcelona. > > Yes, impressive that a single socket intel has more main memory bandwidth then > a dual socket shanghai. > >> Did I interpret the graph right? >> Wasn't this type of scaling problem that plagued >> the Clovertown and Harpertown? > > Heh, the mention single socket core i7 has substantially more (2-4x) memory > bandwidth of the previous generation intels. > >> Any possibility that kernels, BIOS, etc, are not yet ready for Nehalem? > > They look good for me, still trying to find out why I don't see better > performance inside L1 though. 
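Related to Mark's remark above about the kernel not laying allocations out NUMA-aware: one way to take the decision away from the kernel is to place memory explicitly with libnuma. A rough sketch of my own, not taken from any benchmark in this thread; the size and the node loop are arbitrary illustration values, and it needs -lnuma.

/* numa_touch.c -- allocate and touch an array on each memory node in
 * turn, instead of letting first-touch put everything on node 0.
 * Build: gcc -O2 numa_touch.c -o numa_touch -lnuma
 */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    size_t n = 32UL << 20;              /* 32M doubles, about 256 MB */
    size_t i;
    double *a;
    int node;

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this kernel/library\n");
        return 1;
    }

    for (node = 0; node <= numa_max_node(); node++) {
        a = numa_alloc_onnode(n * sizeof(double), node);
        if (!a) { perror("numa_alloc_onnode"); return 1; }
        for (i = 0; i < n; i++)         /* touch so pages really land there */
            a[i] = 0.0;
        printf("allocated and touched %zu MB on node %d\n",
               (n * sizeof(double)) >> 20, node);
        numa_free(a, n * sizeof(double));
    }
    return 0;
}

The numactl utility gives similar placement and CPU-binding control from the command line without touching the code.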
From tjrc at sanger.ac.uk Wed Aug 12 14:10:47 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Wed, 12 Aug 2009 22:10:47 +0100 Subject: [Beowulf] confused about high values of "used" memory under "top" even without running jobs In-Reply-To: References: Message-ID: On 12 Aug 2009, at 8:07 pm, Mark Hahn wrote: > also, I often do this: > awk '{print $3*$4,$0}' /proc/slabinfo|sort -rn|head > to get a quick snapshot of kinds of memory use. That's a little gem! Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From bill at cse.ucdavis.edu Wed Aug 12 19:42:30 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 12 Aug 2009 19:42:30 -0700 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A81A518.2030805@cse.ucdavis.edu> Message-ID: <4A837D96.8000506@cse.ucdavis.edu> Rahul Nabar wrote: > On Tue, Aug 11, 2009 at 12:06 PM, Bill Broadley wrote: >> Looks to me like you fit in the barcelona 512KB L2 cache (and get good >> scaling) and do not fit in the nehalem 256KB L2 cache (and get poor scaling). > > Thanks Bill! I never realized that the L2 cache of the Nehalem is > actually smaller than that of the Barcelona! Indeed. Usually a doubling of cache size doesn't make a huge difference, but of course there are the occasional times when it makes a big difference. > I have an E5520 and a X5550. Both have the 8 MB L3 cache I believe. > The size of the L2 cache is fixed across the steppings of the Nehalem, > isn't it? I believe so, at least so far. >> Were the binaries compiled specifically to target both architectures? As a >> first guess I suggest trying pathscale (RIP) or open64 for amd, and intel's >> compiler for intel. But portland group does a good job at both in most cases. > > We used the intel compilers. One of my fellow grad students did the > actual compilation for VASP but I believe he used the "correct" [sic] > flags to the best of our knowledge. I could post them on the list > perhaps. There was no cross-compilation. We compiled a fresh binary > for the Nehalem. I'd make sure the compiler is fairly current. I believe both the barcelona/shanghai and the core i7/nehalem have some significant tweaks; if the compiler isn't aware of the new functionality you leave significant performance on the table. In particular the newest SSE features won't be of any benefit without direct compiler support. >> A doubling of the cache can have that effect. The Intel L3 cannot come anywhere >> close to feeding 4 cores running flat out. > > Could you explain this more? I am a little lost with the processor > dynamics. In general each step through the memory hierarchy (registers, l1, l2, l3, and main memory) approximately doubles latency and halves the available bandwidth. So for instance if you fit in L1 caches you might well be able to enjoy 160GB/sec, but if you use more than 1MB on a nehalem chip you will be in L3 with only 48GB/sec or so. Check out: (the slightly updated) http://cse.ucdavis.edu/bill/pstream.svg So if you compare the 2MB lines the core i7 with 4 threads running can handle 47GB/sec. The dual socket barcelona or shanghai system can handle 128GB/sec. So even a dual socket Nehalem, even with one of the faster clocks (I tested 2.6 GHz) and perfect scaling, would only get 95GB/sec, still well below the amd score.
Of course there are many other things going on and it might well be other differences in the architecture responsible for the difference. Even if it was memory bandwidth there was many other parts of the graph where the single socket intel does substantially better than half the AMD, and in the case of accessing main memory the single socket intel is faster than the dual socket AMD. So basically it comes down to fun handwaving about the architecture, but if you are making a price/performance decision collect a bunch of production runs and get out a stop watch. Your vasp difference in performance and scaling might well disappear with different inputs. > Does this mean using a quad core for HPC on the Nehlem is > not likely to work well for scaling? Or do you imply a solution so > that I could fix this somehow? I didn't test a dual socket nehalem because I didn't have access, I hope to have numbers soonish. In the mean time contact me off list if you want the code to try it yourself. From eugen at leitl.org Thu Aug 13 10:03:07 2009 From: eugen at leitl.org (Eugen Leitl) Date: Thu, 13 Aug 2009 19:03:07 +0200 Subject: [Beowulf] HPC as service Message-ID: <20090813170307.GM25322@leitl.org> (cloud, huh) http://www.penguincomputing.com/POD/HPC_as_a_service HPC as a Service? - A new offering from Penguin Computing Penguin Computing is proud to provide a new HPC as a Service offering for its high performance computing customers. For more details, see Penguin on Demand. Penguin POD Datasheet What is High Performance Computing as a Service? HPC as a Service is a computing model where users have on-demand access to dynamically scalable, high-performance clusters optimized for parallel computing including the expertise needed to set up, optimize and run their applications over the Internet. The traditional barriers associated with high-performance computing such as the initial capital outlay, time to procure the system, effort to optimize the software environment, engineering their system for peak demand and continuing operating costs have been removed. Instead, HPC as a Service users have a virtualized, scalable cluster available on demand that operates and has the same performance characteristics as a physical HPC cluster located in their data room. HPC as a Service inside the Cloud There are different definitions of cloud computing, but at the core ?Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.?1 HPC as a Service extends this model by making concentrated, non-virtualized high-performance computing resources available in the cloud. HPC As A Service HPC as a Service provides users with a number of key benefits: 1. HPC resources scale with demand and are available with no capital outlay ? only the resources used are actually paid for. 2. Experts in high-performance computing help setup and optimize the software environment and can help trouble-shoot issues that might occur. 3. Faster time-to-results especially for computational requirements that greatly exceed the existing computing capacity. 4. Accounts are provided on an individual user basis, and users are billed for the time they use POD. 5. Computing costs are reduced, particularly where the workflow has spikes in demand. 6. Access from anywhere in the worlds with high-speed data transfer in and out. 7. Exchanging hot-swappable 2TB disk drives overnight through Federal Express provides a secure and convenient way to transfer 25GB+ data sets. 
From ellis at runnersroll.com Thu Aug 13 10:30:08 2009 From: ellis at runnersroll.com (Ellis Wilson III) Date: Thu, 13 Aug 2009 17:30:08 +0000 Subject: [Beowulf] HPC as service In-Reply-To: <20090813170307.GM25322@leitl.org> References: <20090813170307.GM25322@leitl.org> Message-ID: <352377585-1250184566-cardhu_decombobulator_blackberry.rim.net-1068022804-@bxe1087.bisx.prod.on.blackberry> Wow. Well so much for my dream of completing my doctorate, buying an abandoned factory, building a nice cluster and offerring this exact service :(. Nice job penguin. Ellis -----Original Message----- From: Eugen Leitl Date: Thu, 13 Aug 2009 19:03:07 To: Subject: [Beowulf] HPC as service (cloud, huh) http://www.penguincomputing.com/POD/HPC_as_a_service HPC as a Service? - A new offering from Penguin Computing Penguin Computing is proud to provide a new HPC as a Service offering for its high performance computing customers. For more details, see Penguin on Demand. Penguin POD Datasheet What is High Performance Computing as a Service? HPC as a Service is a computing model where users have on-demand access to dynamically scalable, high-performance clusters optimized for parallel computing including the expertise needed to set up, optimize and run their applications over the Internet. The traditional barriers associated with high-performance computing such as the initial capital outlay, time to procure the system, effort to optimize the software environment, engineering their system for peak demand and continuing operating costs have been removed. Instead, HPC as a Service users have a virtualized, scalable cluster available on demand that operates and has the same performance characteristics as a physical HPC cluster located in their data room. HPC as a Service inside the Cloud There are different definitions of cloud computing, but at the core ?Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.?1 HPC as a Service extends this model by making concentrated, non-virtualized high-performance computing resources available in the cloud. HPC As A Service HPC as a Service provides users with a number of key benefits: 1. HPC resources scale with demand and are available with no capital outlay ? only the resources used are actually paid for. 2. Experts in high-performance computing help setup and optimize the software environment and can help trouble-shoot issues that might occur. 3. Faster time-to-results especially for computational requirements that greatly exceed the existing computing capacity. 4. Accounts are provided on an individual user basis, and users are billed for the time they use POD. 5. Computing costs are reduced, particularly where the workflow has spikes in demand. 6. Access from anywhere in the worlds with high-speed data transfer in and out. 7. Exchanging hot-swappable 2TB disk drives overnight through Federal Express provides a secure and convenient way to transfer 25GB+ data sets. 
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From amjad11 at gmail.com Thu Aug 13 12:15:04 2009 From: amjad11 at gmail.com (amjad ali) Date: Fri, 14 Aug 2009 00:15:04 +0500 Subject: [Beowulf] parallelization problem Message-ID: <428810f20908131215l6f9d6364w54657ea03c01501d@mail.gmail.com> Hi, all, I am parallelizing a CFD 2D code in FORTRAN+OPENMPI. Suppose that the grid (all triangles) is partitioned among 8 processes using METIS. Each process has different number of neighboring processes. Suppose each process has n elements/faces whose data it needs to sends to corresponding neighboring processes, and it has m number of elements/faces on which it needs to get data from corresponding neighboring processes. Values of n and m are different for each process. Another aim is to hide the communication behind computation. For this I do the following for each process: DO j = 1 to n CALL MPI_ISEND (send_data, num, type, dest(j), tag, MPI_COMM_WORLD, ireq(j), ierr) ENDDO DO k = 1 to m CALL MPI_RECV(recv_data, num, type, source(k), tag, MPI_COMM_WORLD, status, ierr) ENDDO This solves my problem. But it gives memory leakage; RAM gets filled after few thousands of iteration. What is the solution/remedy? How should I tackle this? In another CFD code I removed this problem of memory-filling by following (in that code n=m) : DO j = 1 to n CALL MPI_ISEND (send_data, num, type, dest(j), tag, MPI_COMM_WORLD, ireq(j), ierr) ENDDO CALL MPI_WAITALL(n,ireq,status,ierr) DO k = 1 to n CALL MPI_RECV(recv_data, num, type, source(k), tag, MPI_COMM_WORLD, status, ierr) ENDDO But this is not working in current code; and the previous code was not giving correct results with large number of processes. Please suggest solution. THANKS A LOT FOR YOUR KIND ATTENTION. With best regards, Amjad Ali. -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.janssens at opencfd.co.uk Thu Aug 13 13:45:01 2009 From: m.janssens at opencfd.co.uk (Mattijs Janssens) Date: Thu, 13 Aug 2009 21:45:01 +0100 Subject: [Beowulf] parallelization problem In-Reply-To: <428810f20908131215l6f9d6364w54657ea03c01501d@mail.gmail.com> References: <428810f20908131215l6f9d6364w54657ea03c01501d@mail.gmail.com> Message-ID: <200908132145.01517.m.janssens@opencfd.co.uk> Do a non-blocking receive as well so MPI_IRECV instead of MPI_RECV and make sure you have an MPI_WAITALL for all the requests (both from sending and from receiving). Kind regards, Mattijs On Thursday 13 August 2009 20:15:04 amjad ali wrote: > Hi, all, > > > > I am parallelizing a CFD 2D code in FORTRAN+OPENMPI. Suppose that the grid > (all triangles) is partitioned among 8 processes using METIS. Each process > has different number of neighboring processes. Suppose each process has n > elements/faces whose data it needs to sends to corresponding neighboring > processes, and it has m number of elements/faces on which it needs to get > data from corresponding neighboring processes. Values of n and m are > different for each process. Another aim is to hide the communication behind > computation. 
For this I do the following for each process: > > > > DO j = 1 to n > > CALL MPI_ISEND (send_data, num, type, dest(j), tag, MPI_COMM_WORLD, > ireq(j), ierr) > > ENDDO > > > > DO k = 1 to m > > CALL MPI_RECV(recv_data, num, type, source(k), tag, MPI_COMM_WORLD, status, > ierr) > > ENDDO > > > > > > This solves my problem. But it gives memory leakage; RAM gets filled after > few thousands of iteration. What is the solution/remedy? How should I > tackle this? > > > > In another CFD code I removed this problem of memory-filling by following > (in that code n=m) : > > > > DO j = 1 to n > > CALL MPI_ISEND (send_data, num, type, dest(j), tag, MPI_COMM_WORLD, > ireq(j), ierr) > > ENDDO > > > > CALL MPI_WAITALL(n,ireq,status,ierr) > > > > DO k = 1 to n > > CALL MPI_RECV(recv_data, num, type, source(k), tag, MPI_COMM_WORLD, status, > ierr) > > ENDDO > > > > But this is not working in current code; and the previous code was not > giving correct results with large number of processes. > > > > Please suggest solution. > > > > THANKS A LOT FOR YOUR KIND ATTENTION. > > > > With best regards, > > Amjad Ali. -- Mattijs Janssens OpenCFD Ltd. 9 Albert Road, Caversham, Reading RG4 7AN. Tel: +44 (0)118 9471030 Email: M.Janssens at OpenCFD.co.uk URL: http://www.OpenCFD.co.uk From christian at myri.com Wed Aug 12 08:36:35 2009 From: christian at myri.com (Christian Bell) Date: Wed, 12 Aug 2009 11:36:35 -0400 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A82DC41.2060805@cse.ucdavis.edu> References: <4A81ACF3.60802@noaa.gov> <4A81CEA7.3030802@noaa.gov> <4A82DC41.2060805@cse.ucdavis.edu> Message-ID: On Aug 12, 2009, at 11:14 AM, Bill Broadley wrote: > * For the compilers that tend to be better at stream (open64 and > pathscale), > you lose the performance if you just replace double a[],b[],c[] with > double *a,*b,*c. Patch[1] available. I don't have a work around for > this, suggestions welcome. Is it really necessary for dynamic arrays > to be substantially slower than static? Yes -- when pointers, the compiler assumes (by default) that the pointers can alias each other, which can prevent aggressive optimizations that are otherwise possible with arrays. C99 has introduced the 'restrict' keyword to allow programmers to assert that pointers of the same type cannot alias each other. However, restrict is just a hint and some compilers may or may not take advantage of it. You can also consult your compiler's documentation to see if there are other compiler-specific hints (asserting no loop-carried dependencies, loop fusion/fission). I remember stacking half a dozen pragmas over a 3-line loop on a Cray C compiler years ago to ensure that accesses where suitably optimized (or in this case, vectorized). . . christian From tom.elken at qlogic.com Thu Aug 13 16:37:07 2009 From: tom.elken at qlogic.com (Tom Elken) Date: Thu, 13 Aug 2009 16:37:07 -0700 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A81ACF3.60802@noaa.gov> <4A81CEA7.3030802@noaa.gov> <4A82DC41.2060805@cse.ucdavis.edu> Message-ID: <35AAF1E4A771E142979F27B51793A48885F52B6CC9@AVEXMB1.qlogic.org> > On Behalf Of Christian Bell > On Aug 12, 2009, at 11:14 AM, Bill Broadley wrote: > > > Is it really necessary for dynamic arrays > > to be substantially slower than static? > > Yes -- when pointers, the compiler assumes (by default) that the > pointers can alias each other, which can prevent aggressive > optimizations that are otherwise possible with arrays. ... 
> I remember stacking half a dozen pragmas over a > 3-line loop on a Cray C compiler years ago to ensure that accesses > where suitably optimized (or in this case, vectorized). To add some details to what Christian says, the HPC Challenge version of STREAM uses dynamic arrays and is hard to optimize. I don't know what's best with current compiler versions, but you could try some of these that were used in past HPCC submissions with your program, Bill: PathScale 2.2.1 on Opteron: Base OPT flags: -O3 -OPT:Ofast:fold_reassociate=0 STREAMFLAGS=-O3 -OPT:Ofast:fold_reassociate=0 -OPT:alias=restrict:align_unsafe=on -CG:movnti=1 Intel C/C++ Compiler 10.1 on Harpertown CPUs: Base OPT flags: -O2 -xT -ansi-alias -ip -i-static Intel recently used Intel C/C++ Compiler 11.0.081 on Nehalem CPUs: -O2 -xSSE4.2 -ansi-alias -ip and got good STREAM results in their HPCC submission on their ENdeavor cluster. -Tom > > > . . christian > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From richard.walsh at comcast.net Thu Aug 13 16:56:33 2009 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Thu, 13 Aug 2009 23:56:33 +0000 (UTC) Subject: [Beowulf] Wake on LAN supported on both built-in interfaces ... ?? Message-ID: <1236706861.11245011250207793299.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> All, I have a head node that am trying to get WOL set up on. It is a SuperMicro motherboard (X8DTi-F) with two built in interfaces (eth0, eth1). I am told by SuperMicro support that both interfaces support WOL fully, but when I probe them with ethtool only eth0 indicates that it supports WOL with: ... Supports Wake-on: umbg Wake-on: g ... $ethtool eth1 ... yields: ... Supports Wake-on: d Wake-on: d ... Attempting: ethtool -s eth1 wol g fails indicating (as expected) indicating: "Cannot set new wake-on-lan settings: Operation not supported not setting wol" The same command on eth0 works fine. I have set up eth0 to be my internal (private interface) and eth1 to be the internet facing interface. I would like avoid reworking everything ... ;-) ... Any thoughts? Go arounds? Is SuperMicro telling me the truth about both interfaces providing full support. Perhaps I am missing some configuration option in the BIOS. SuperMicro says I am not, but ... Thanks, rbw -------------- next part -------------- An HTML attachment was scrubbed... URL: From bill at cse.ucdavis.edu Thu Aug 13 17:09:24 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Thu, 13 Aug 2009 17:09:24 -0700 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <35AAF1E4A771E142979F27B51793A48885F52B6CC9@AVEXMB1.qlogic.org> References: <4A81ACF3.60802@noaa.gov> <4A81CEA7.3030802@noaa.gov> <4A82DC41.2060805@cse.ucdavis.edu> <35AAF1E4A771E142979F27B51793A48885F52B6CC9@AVEXMB1.qlogic.org> Message-ID: <4A84AB34.2050209@cse.ucdavis.edu> Tom Elken wrote: > To add some details to what Christian says, the HPC Challenge version of > STREAM uses dynamic arrays and is hard to optimize. I don't know what's > best with current compiler versions, but you could try some of these that > were used in past HPCC submissions with your program, Bill: Thanks for the heads up, I've checked the specbench.org compiler options for hints on where to start with optimization flags, but I didn't know about the dynamic stream. 
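For anyone who wants to see Christian's aliasing point in isolation rather than buried inside stream, here is a minimal sketch of my own: the same triad-style loop with and without the C99 restrict qualifier. Whether a given compiler actually exploits the hint is, as he says, not guaranteed.

/* triad.c -- aliasing vs restrict.  Build: gcc -std=c99 -O3 triad.c
 * Comparing the generated code of the two functions (e.g. with -S)
 * shows whether a particular compiler takes the hint.
 */
#include <stdio.h>
#include <stdlib.h>

/* a, b and c may alias, so a store to c[i] could change a[] or b[];
 * the compiler has to be conservative. */
void triad(double *a, double *b, double *c, double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + s * b[i];
}

/* restrict promises each array is reached only through its own
 * pointer -- the guarantee static arrays give the compiler for free. */
void triad_r(double *restrict a, double *restrict b,
             double *restrict c, double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + s * b[i];
}

int main(void)
{
    size_t n = 1000000, i;
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);

    for (i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }
    triad(a, b, c, 3.0, n);
    triad_r(a, b, c, 3.0, n);
    printf("c[0] = %g\n", c[0]);
    free(a); free(b); free(c);
    return 0;
}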
Is the HPC challenge code open source? > PathScale 2.2.1 on Opteron: > Base OPT flags: -O3 -OPT:Ofast:fold_reassociate=0 > STREAMFLAGS=-O3 -OPT:Ofast:fold_reassociate=0 -OPT:alias=restrict:align_unsafe=on -CG:movnti=1 Alas my pathscale license expired and I believe with sci-cortex's death (RIP) I can't renew it. I tried open64-4.2.2 with those flags and on a nehalem single socket: $ opencc -O4 -fopenmp stream.c -o stream-open64 -static $ opencc -O4 -fopenmp stream-malloc.c -o stream-open64-malloc -static $ ./stream-open64 Total memory required = 457.8 MB. Function Rate (MB/s) Avg time Min time Max time Copy: 22061.4958 0.0145 0.0145 0.0146 Scale: 22228.4705 0.0144 0.0144 0.0145 Add: 20659.2638 0.0233 0.0232 0.0233 Triad: 20511.0888 0.0235 0.0234 0.0235 Dynamic: $ ./stream-open64-malloc Function Rate (MB/s) Avg time Min time Max time Copy: 14436.5155 0.0222 0.0222 0.0222 Scale: 14667.4821 0.0218 0.0218 0.0219 Add: 15739.7070 0.0305 0.0305 0.0305 Triad: 15770.7775 0.0305 0.0304 0.0305 > Intel C/C++ Compiler 10.1 on Harpertown CPUs: > Base OPT flags: -O2 -xT -ansi-alias -ip -i-static > Intel recently used > Intel C/C++ Compiler 11.0.081 on Nehalem CPUs: > -O2 -xSSE4.2 -ansi-alias -ip > and got good STREAM results in their HPCC submission on their ENdeavor cluster. $ icc -O2 -xSSE4.2 -ansi-alias -ip -openmp stream.c -o stream-icc $ icc -O2 -xSSE4.2 -ansi-alias -ip -openmp stream-malloc.c -o stream-icc-malloc $ ./stream-icc | grep ":" STREAM version $Revision: 5.9 $ Copy: 14767.0512 0.0022 0.0022 0.0022 Scale: 14304.3513 0.0022 0.0022 0.0023 Add: 15503.3568 0.0031 0.0031 0.0031 Triad: 15613.9749 0.0031 0.0031 0.0031 $ ./stream-icc-malloc | grep ":" STREAM version $Revision: 5.9 $ Copy: 14604.7582 0.0022 0.0022 0.0022 Scale: 14480.2814 0.0022 0.0022 0.0022 Add: 15414.3321 0.0031 0.0031 0.0031 Triad: 15738.4765 0.0031 0.0030 0.0031 So ICC does manage zero penalty, alas no faster than open64 with the penalty. I'll attempt to track down the HPCC stream source code to see if their dynamic arrays are any friendlier than mine (I just use malloc). In any case many thanks for the pointer. Oh, my dynamic tweak: $ diff stream.c stream-malloc.c 43a44 > # include 97c98 < static double a[N+OFFSET], --- > /* static double a[N+OFFSET], 99c100,102 < c[N+OFFSET]; --- > c[N+OFFSET]; */ > > double *a, *b, *c; 134a138,142 > > a=(double *)malloc(sizeof(double)*(N+OFFSET)); > b=(double *)malloc(sizeof(double)*(N+OFFSET)); > c=(double *)malloc(sizeof(double)*(N+OFFSET)); > 283c291,293 < --- > free(a); > free(b); > free(c); From kus at free.net Fri Aug 14 08:08:32 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Fri, 14 Aug 2009 19:08:32 +0400 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A84AB34.2050209@cse.ucdavis.edu> Message-ID: In message from Bill Broadley (Thu, 13 Aug 2009 17:09:24 -0700): >Tom Elken wrote: >> To add some details to what Christian says, the HPC Challenge >>version of >> STREAM uses dynamic arrays and is hard to optimize. I don't know >>what's >> best with current compiler versions, but you could try some of these >>that >> were used in past HPCC submissions with your program, Bill: > >Thanks for the heads up, I've checked the specbench.org compiler >options for >hints on where to start with optimization flags, but I didn't know >about the >dynamic stream. > >Is the HPC challenge code open source? Yes, they are open. 
> >> PathScale 2.2.1 on Opteron: >> Base OPT flags: -O3 -OPT:Ofast:fold_reassociate=0 >> STREAMFLAGS=-O3 -OPT:Ofast:fold_reassociate=0 >>-OPT:alias=restrict:align_unsafe=on -CG:movnti=1 > >Alas my pathscale license expired and I believe with sci-cortex's >death (RIP) >I can't renew it. Now I understand that I was sage :-) (we purchased perpetual acafemic license). ВТW, do somebody know about Pathscale compilers future (if it will be) ? Mikhail > >I tried open64-4.2.2 with those flags and on a nehalem single socket: > >$ opencc -O4 -fopenmp stream.c -o stream-open64 -static >$ opencc -O4 -fopenmp stream-malloc.c -o stream-open64-malloc -static > >$ ./stream-open64 >Total memory required = 457.8 MB. >Function Rate (MB/s) Avg time Min time Max time >Copy: 22061.4958 0.0145 0.0145 0.0146 >Scale: 22228.4705 0.0144 0.0144 0.0145 >Add: 20659.2638 0.0233 0.0232 0.0233 >Triad: 20511.0888 0.0235 0.0234 0.0235 > >Dynamic: >$ ./stream-open64-malloc > >Function Rate (MB/s) Avg time Min time Max time >Copy: 14436.5155 0.0222 0.0222 0.0222 >Scale: 14667.4821 0.0218 0.0218 0.0219 >Add: 15739.7070 0.0305 0.0305 0.0305 >Triad: 15770.7775 0.0305 0.0304 0.0305 > >> Intel C/C++ Compiler 10.1 on Harpertown CPUs: >> Base OPT flags: -O2 -xT -ansi-alias -ip -i-static >> Intel recently used >> Intel C/C++ Compiler 11.0.081 on Nehalem CPUs: >> -O2 -xSSE4.2 -ansi-alias -ip >> and got good STREAM results in their HPCC submission on their >>ENdeavor cluster. > >$ icc -O2 -xSSE4.2 -ansi-alias -ip -openmp stream.c -o stream-icc >$ icc -O2 -xSSE4.2 -ansi-alias -ip -openmp stream-malloc.c -o >stream-icc-malloc > >$ ./stream-icc | grep ":" >STREAM version $Revision: 5.9 $ >Copy: 14767.0512 0.0022 0.0022 0.0022 >Scale: 14304.3513 0.0022 0.0022 0.0023 >Add: 15503.3568 0.0031 0.0031 0.0031 >Triad: 15613.9749 0.0031 0.0031 0.0031 >$ ./stream-icc-malloc | grep ":" >STREAM version $Revision: 5.9 $ >Copy: 14604.7582 0.0022 0.0022 0.0022 >Scale: 14480.2814 0.0022 0.0022 0.0022 >Add: 15414.3321 0.0031 0.0031 0.0031 >Triad: 15738.4765 0.0031 0.0030 0.0031 > >So ICC does manage zero penalty, alas no faster than open64 with the >penalty. > >I'll attempt to track down the HPCC stream source code to see if >their dynamic >arrays are any friendlier than mine (I just use malloc). > >In any case many thanks for the pointer. > >Oh, my dynamic tweak: >$ diff stream.c stream-malloc.c >43a44 >> # include >97c98 >< static double a[N+OFFSET], >--- >> /* static double a[N+OFFSET], >99c100,102 >< c[N+OFFSET]; >--- >> c[N+OFFSET]; */ >> >> double *a, *b, *c; >134a138,142 >> >> a=(double *)malloc(sizeof(double)*(N+OFFSET)); >> b=(double *)malloc(sizeof(double)*(N+OFFSET)); >> c=(double *)malloc(sizeof(double)*(N+OFFSET)); >> >283c291,293 >< >--- >> free(a); >> free(b); >> free(c); > > > > >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >Computing >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf > >-- >??? ????????? ???? ????????? ?? ??????? ? ??? ??????? >? ????? ???????? ??????????? ??????????? >MailScanner, ? ?? ???????? >??? ??? ?? ???????? ???????????? ????. 
> From kus at free.net Fri Aug 14 11:47:01 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Fri, 14 Aug 2009 22:47:01 +0400 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A84AB34.2050209@cse.ucdavis.edu> Message-ID: In message from Bill Broadley (Thu, 13 Aug 2009 17:09:24 -0700): Do I unerstand correctly that this results are for 4 cores& 4 openmp threads ? And what is DDR3 RAM: DDR3/1066 ? Mikhail >I tried open64-4.2.2 with those flags and on a nehalem single socket: >$ opencc -O4 -fopenmp stream.c -o stream-open64 -static >$ opencc -O4 -fopenmp stream-malloc.c -o stream-open64-malloc -static >$ ./stream-open64 >Total memory required = 457.8 MB. >Function Rate (MB/s) Avg time Min time Max time >Copy: 22061.4958 0.0145 0.0145 0.0146 >Scale: 22228.4705 0.0144 0.0144 0.0145 >Add: 20659.2638 0.0233 0.0232 0.0233 >Triad: 20511.0888 0.0235 0.0234 0.0235 >Dynamic: >$ ./stream-open64-malloc > >Function Rate (MB/s) Avg time Min time Max time >Copy: 14436.5155 0.0222 0.0222 0.0222 >Scale: 14667.4821 0.0218 0.0218 0.0219 >Add: 15739.7070 0.0305 0.0305 0.0305 >Triad: 15770.7775 0.0305 0.0304 0.0305 > >> Intel C/C++ Compiler 10.1 on Harpertown CPUs: >> Base OPT flags: -O2 -xT -ansi-alias -ip -i-static >> Intel recently used >> Intel C/C++ Compiler 11.0.081 on Nehalem CPUs: >> -O2 -xSSE4.2 -ansi-alias -ip >> and got good STREAM results in their HPCC submission on their >>ENdeavor cluster. > >$ icc -O2 -xSSE4.2 -ansi-alias -ip -openmp stream.c -o stream-icc >$ icc -O2 -xSSE4.2 -ansi-alias -ip -openmp stream-malloc.c -o >stream-icc-malloc > >$ ./stream-icc | grep ":" >STREAM version $Revision: 5.9 $ >Copy: 14767.0512 0.0022 0.0022 0.0022 >Scale: 14304.3513 0.0022 0.0022 0.0023 >Add: 15503.3568 0.0031 0.0031 0.0031 >Triad: 15613.9749 0.0031 0.0031 0.0031 >$ ./stream-icc-malloc | grep ":" >STREAM version $Revision: 5.9 $ >Copy: 14604.7582 0.0022 0.0022 0.0022 >Scale: 14480.2814 0.0022 0.0022 0.0022 >Add: 15414.3321 0.0031 0.0031 0.0031 >Triad: 15738.4765 0.0031 0.0030 0.0031 > >So ICC does manage zero penalty, alas no faster than open64 with the >penalty. > >I'll attempt to track down the HPCC stream source code to see if >their dynamic >arrays are any friendlier than mine (I just use malloc). > >In any case many thanks for the pointer. > >Oh, my dynamic tweak: >$ diff stream.c stream-malloc.c >43a44 >> # include >97c98 >< static double a[N+OFFSET], >--- >> /* static double a[N+OFFSET], >99c100,102 >< c[N+OFFSET]; >--- >> c[N+OFFSET]; */ >> >> double *a, *b, *c; >134a138,142 >> >> a=(double *)malloc(sizeof(double)*(N+OFFSET)); >> b=(double *)malloc(sizeof(double)*(N+OFFSET)); >> c=(double *)malloc(sizeof(double)*(N+OFFSET)); >> >283c291,293 >< >--- >> free(a); >> free(b); >> free(c); > > > > >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >Computing >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf > >-- >??? ????????? ???? ????????? ?? ??????? ? ??? ??????? >? ????? ???????? ??????????? ??????????? >MailScanner, ? ?? ???????? >??? ??? ?? ???????? ???????????? ????. 
> From bill at cse.ucdavis.edu Fri Aug 14 12:47:31 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Fri, 14 Aug 2009 12:47:31 -0700 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: Message-ID: <4A85BF53.3010106@cse.ucdavis.edu> Mikhail Kuzminsky wrote: > In message from Bill Broadley (Thu, 13 Aug 2009 > 17:09:24 -0700): > > Do I unerstand correctly that this results are for 4 cores& 4 openmp > threads ? And what is DDR3 RAM: DDR3/1066 ? 4 cores and 8 openmp threads. 4 threads is slightly faster: Function Rate (MB/s) Avg time Min time Max time Copy: 23670.3046 0.0135 0.0135 0.0136 Scale: 23304.9257 0.0138 0.0137 0.0139 Add: 21951.8053 0.0219 0.0219 0.0219 Triad: 21538.2451 0.0223 0.0223 0.0224 I put DDR3-1333 in the machine, but the bios seems to want to run them at 1066, I'm not sure exactly what speed they are running at. From mathog at caltech.edu Fri Aug 14 13:14:18 2009 From: mathog at caltech.edu (David Mathog) Date: Fri, 14 Aug 2009 13:14:18 -0700 Subject: [Beowulf] Wake on LAN supported on both built-in interfaces ... ?? Message-ID: richard.walsh at comcast.net wrote: > I have a head node that am trying to get WOL set up on. > > It is a SuperMicro motherboard (X8DTi-F) with two built > in interfaces (eth0, eth1). I am told by SuperMicro support > that both interfaces support WOL fully, but when I probe them > with ethtool only eth0 indicates that it supports WOL with: > That board has "Intel? 82576 Dual-Port Gigabit Ethernet" and Intel provides some information on that here: http://edc.intel.com/Link.aspx?id=2372 where it says: Wake-on-LAN support Packet recognition and wake-up for LAN on motherboard applications without software configuration and nothing more. That is ambiguous, it requires that at least one interface support WOL, but it does not say explicitly that both do. Most likely the hardware does support on both ports but the driver is confused somehow by the dual chip. Try contacting the author of the linux driver and/or Intel directly. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From tom.elken at qlogic.com Fri Aug 14 13:57:53 2009 From: tom.elken at qlogic.com (Tom Elken) Date: Fri, 14 Aug 2009 13:57:53 -0700 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A85BF53.3010106@cse.ucdavis.edu> References: <4A85BF53.3010106@cse.ucdavis.edu> Message-ID: <35AAF1E4A771E142979F27B51793A48885F52B6D52@AVEXMB1.qlogic.org> > On Behalf Of Bill Broadley > I put DDR3-1333 in the machine, but the bios seems to want to run them > at > 1066, How many dimms per memory channel do you have? My understanding (which may be a few months old) is that if you have more than one dimm per memory channel, DDR3-1333 dimms will run at 1066 speed; i.e. on your 1-CPU system, if you have 6 dimms, you have 2 per memory channel. > I'm not sure exactly what speed they are running at. Your results look excellent, so I wouldn't be surprised if they are running at 1333. 
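For the record, the arithmetic behind Tom's point, assuming standard JEDEC rates: one DDR3 channel peaks at about 10.7 GB/s at 1333 and 8.5 GB/s at 1066, so a 3-channel Nehalem socket tops out around 32 GB/s versus 25.6 GB/s. And with 6 DIMMs on a single socket you necessarily have 2 per channel, which is exactly the configuration that commonly drops 1333 parts back to 1066.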
-Tom > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From kus at free.net Fri Aug 14 16:10:42 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Sat, 15 Aug 2009 03:10:42 +0400 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <35AAF1E4A771E142979F27B51793A48885F52B6D52@AVEXMB1.qlogic.org> Message-ID: In message from Tom Elken (Fri, 14 Aug 2009 13:57:53 -0700): >> On Behalf Of Bill Broadley > >> I put DDR3-1333 in the machine, but the bios seems to want to run >>them >> at >> 1066, > >How many dimms per memory channel do you have? > >My understanding (which may be a few months old) is that if you have >more than one dimm per memory channel, DDR3-1333 dimms will run at >1066 speed; >i.e. on your 1-CPU system, if you have 6 dimms, you have 2 per memory >channel. > >> I'm not sure exactly what speed they are running at. > >Your results look excellent, so I wouldn't be surprised if they are >running at 1333. I have 12-18 GB/s on 4 threads of stream/ifort w/DDR3-1066 on a dual E5520 server. But it works under a "numa-bad" kernel w/o control of numa-efficient allocation. Mikhail > >-Tom > >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >> Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf From kus at free.net Fri Aug 14 16:24:25 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Sat, 15 Aug 2009 03:24:25 +0400 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A85EF91.2050705@cse.ucdavis.edu> Message-ID: In message from Bill Broadley (Fri, 14 Aug 2009 16:13:21 -0700): >Mikhail Kuzminsky wrote: >>> Your results look excellent, so I wouldn't be surprised if they are >>> running at 1333. >> >> I have 12-18 GB/s on 4 threads of stream/ifort w/DDR3-1066 on dual >>E5520 >> server. But it works under "numa-bad" kernel w/o control of >> numa-efficient allocation. > >Sounds pretty bad. > >Why 4 threads? You need 8 cores to keep all 6 memory busses busy. For comparison w/your tests: you have only 4 cores. On 8 threads I have 20-26 GB/s. > >Which compiler? The ifort mentioned above means intel fortran 11.0.38. Mikhail > open64 does substantially better than gcc. From amjad11 at gmail.com Fri Aug 14 16:44:44 2009 From: amjad11 at gmail.com (amjad ali) Date: Sat, 15 Aug 2009 04:44:44 +0500 Subject: [Beowulf] METIS Partitioning within program Message-ID: <428810f20908141644o745f6af8m6c91d19639aaec5f@mail.gmail.com> Hi all, For my parallel code to run, I first do the grid partitioning on the command line, and then for running the parallel code I hard-code the paths of the METIS partition files. It is very cumbersome if I need to run the code with different grids and for different -np values. Please tell me how to call the METIS partitioning routine from within the program, so that it works for whatever -np value is used. THANKS A LOT FOR YOUR ATTENTION. Regards, Amjad Ali.
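As the replies below suggest, METIS can also be linked into the solver and called directly, so the number of parts simply follows the communicator size. A rough sketch of my own in C against the METIS 4.0 interface; the prototype should be checked against the installed metis.h, the mesh arrays are placeholders for whatever the code already reads in, and idxtype is assumed to be a plain int here.

/* partition.c -- rank 0 partitions the dual graph of a triangular mesh
 * into "size" parts, where size comes from MPI_Comm_size, so the
 * partitioning follows -np automatically.
 * Build (roughly): mpicc partition.c -lmetis
 */
#include <metis.h>
#include <mpi.h>

void partition_mesh(int ne, int nn,
                    idxtype *elmnts,   /* ne*3 connectivity, triangles  */
                    idxtype *epart,    /* out: part number per element  */
                    idxtype *npart,    /* out: part number per node     */
                    MPI_Comm comm)
{
    int size, rank, edgecut;
    int etype = 1;      /* 1 = triangles in METIS 4                     */
    int numflag = 0;    /* 0 = C-style numbering (1 for Fortran arrays) */

    MPI_Comm_size(comm, &size);
    MPI_Comm_rank(comm, &rank);

    if (rank == 0)
        METIS_PartMeshDual(&ne, &nn, elmnts, &etype, &numflag,
                           &size, &edgecut, epart, npart);

    /* every rank needs to know which elements/nodes it owns */
    MPI_Bcast(epart, ne, MPI_INT, 0, comm);
    MPI_Bcast(npart, nn, MPI_INT, 0, comm);
}

The alternative Richard describes below, building a command string and handing it to system(), also works, but calling the library keeps everything inside one MPI job.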
-------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.walsh at comcast.net Sat Aug 15 05:57:51 2009 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Sat, 15 Aug 2009 12:57:51 +0000 (UTC) Subject: [Beowulf] METIS Partitioning within program In-Reply-To: <428810f20908141644o745f6af8m6c91d19639aaec5f@mail.gmail.com> Message-ID: <1942209729.9031250341071497.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Amjad, Have you thought of using the system call: "system(const char *string);" Type "man system" for a description. You can pass any string to the shell to be run with this call. For instance: system("date > date.out"); would instruct the shell to place the current date and time in the file date.out. If the command you wish to run changes cyclically y ou would have to manage the changes from inside the program. I am assuming a C program here. Regards, rbw ----- Original Message ----- From: "amjad ali" To: "Beowulf Mailing List" Sent: Friday, August 14, 2009 6:44:44 PM GMT -06:00 US/Canada Central Subject: [Beowulf] METIS Partitioning within program Hi all, For my parallel code to run, I first make grid partitioning on command line then for running the parallel code I give hard-code the path of METIS-partition files. It is very cumbersome if I need to run code with different grids and for different -np value. Please tell me how to call METIS partitioning routine from within the program run so that whatever -np value would be we are at ease. THANKS A LOT FOR YOUR ATTENTION. Regards, Amjad Ali. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.janssens at opencfd.co.uk Mon Aug 17 03:28:13 2009 From: m.janssens at opencfd.co.uk (Mattijs Janssens) Date: Mon, 17 Aug 2009 11:28:13 +0100 Subject: [Beowulf] METIS Partitioning within program In-Reply-To: <428810f20908141644o745f6af8m6c91d19639aaec5f@mail.gmail.com> References: <428810f20908141644o745f6af8m6c91d19639aaec5f@mail.gmail.com> Message-ID: <200908171128.13601.m.janssens@opencfd.co.uk> > For my parallel code to run, I first make grid partitioning on command line > then for running the parallel code I give hard-code the path of > METIS-partition files. It is very cumbersome if I need to run code with > different grids and for different -np value. Please tell me how to call > METIS partitioning routine from within the program run so that whatever -np > value would be we are at ease. You could call the Metis routines directly from your code. See http://glaros.dtc.umn.edu/gkhome/fetch/sw/metis/manual.pdf Regards, Mattijs From hahn at mcmaster.ca Mon Aug 17 09:05:59 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 17 Aug 2009 12:05:59 -0400 (EDT) Subject: [Beowulf] METIS Partitioning within program In-Reply-To: <200908171128.13601.m.janssens@opencfd.co.uk> References: <428810f20908141644o745f6af8m6c91d19639aaec5f@mail.gmail.com> <200908171128.13601.m.janssens@opencfd.co.uk> Message-ID: >> For my parallel code to run, I first make grid partitioning on command line >> then for running the parallel code I give hard-code the path of >> METIS-partition files. It is very cumbersome if I need to run code with >> different grids and for different -np value. 
Please tell me how to call this sort of thing is not uncommon - I normally recommend that the user submit a script which does the setup step, then mpirun's the parallel code. this wastes some cycles (the setup is normally serial, so wastes n-1 cpus for that duration.) alternatively, submitting a serial setup job followed by a (dependent) parallel job makes sense but incurs more queue time. From mmuratet at hudsonalpha.org Fri Aug 14 15:22:00 2009 From: mmuratet at hudsonalpha.org (Michael Muratet) Date: Fri, 14 Aug 2009 17:22:00 -0500 Subject: [Beowulf] newbie beorun question Message-ID: Greetings I thought I understood the operation of beorun --no-local but now I'm not so sure. I started a task with beorun --no-local command and it started on node 0 which is reasonable because the cluster was empty. I started a second task the same way and the second appears to have started on node 0 as well, leaving all the other nodes empty. How can this happen? I've re-read the man page a couple of times now and realized I don't see where it has a guarantee of load balancing. I don't see anything in the beomap man page, either, that balances loads (except manually). We are working towards submitting everything through torque, but in the meantime, is there any load balancing available via beorun? Thanks Mike From tru at pasteur.fr Sat Aug 15 07:28:23 2009 From: tru at pasteur.fr (Tru Huynh) Date: Sat, 15 Aug 2009 16:28:23 +0200 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A82EE88.20908@noaa.gov> <4A8303A7.50801@noaa.gov> Message-ID: <20090815142823.GR18295@sillage.bis.pasteur.fr> On Wed, Aug 12, 2009 at 01:16:02PM -0500, Rahul Nabar wrote: > We didn't. We used the latest from the CentOS website. Then something is not configured properly, the latest CentOS-5 kernel is kernel-2.6.18-128.4.1.el5.x86_64. Linux darwin.localdomain 2.6.18-128.4.1.el5 #1 SMP Tue Aug 4 20:19:25 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux 2.6.18-128.el5 is the GA released version for 5.3, not the current version see http://mirror.centos.org/centos/5/os/x86_64/CentOS http://mirror.centos.org/centos/5/updates/x86_64/RPMS/ Tru -- Dr Tru Huynh | http://www.pasteur.fr/recherche/unites/Binfs/ mailto:tru at pasteur.fr | tel/fax +33 1 45 68 87 37/19 Institut Pasteur, 25-28 rue du Docteur Roux, 75724 Paris CEDEX 15 France From peterffaber at web.de Sat Aug 15 08:31:17 2009 From: peterffaber at web.de (Peter Faber) Date: Sat, 15 Aug 2009 17:31:17 +0200 Subject: [Beowulf] parallelization problem Message-ID: <4A86D4C5.7060800@web.de> amjad ali wrote: > I am parallelizing a CFD 2D code in FORTRAN+OPENMPI. Suppose that the grid > (all triangles) is partitioned among 8 processes using METIS. Each process > has different number of neighboring processes. Suppose each process has n > elements/faces whose data it needs to sends to corresponding neighboring > processes, and it has m number of elements/faces on which it needs to get > data from corresponding neighboring processes. Values of n and m are > different for each process. Another aim is to hide the communication behind > computation. For this I do the following for each process: > DO j = 1 to n > > CALL MPI_ISEND (send_data, num, type, dest(j), tag, MPI_COMM_WORLD, ireq(j), > ierr) > > ENDDO > > DO k = 1 to m > > CALL MPI_RECV(recv_data, num, type, source(k), tag, MPI_COMM_WORLD, status, > ierr) > > ENDDO You may want to place the MPI_WAIT somewhere below the MPI_RECV in order to ensure that all receives can be executed and thus all sends be completed. 
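A minimal sketch of that pattern in C (the Fortran calls map one-to-one; the buffer, count and rank arrays are placeholders, and the tag and datatype are illustrative):

    #include <mpi.h>

    /* Post all sends and receives non-blocking, overlap useful work,
     * then complete every request before the buffers are reused. */
    void exchange(double **sendbuf, double **recvbuf, int num,
                  int n, const int *dest, int m, const int *source)
    {
        MPI_Request reqs[n + m];          /* C99 VLA: n sends + m receives */
        int j, k;

        for (j = 0; j < n; j++)
            MPI_Isend(sendbuf[j], num, MPI_DOUBLE, dest[j], 0,
                      MPI_COMM_WORLD, &reqs[j]);

        for (k = 0; k < m; k++)
            MPI_Irecv(recvbuf[k], num, MPI_DOUBLE, source[k], 0,
                      MPI_COMM_WORLD, &reqs[n + k]);

        /* ... interior computation that needs no halo data goes here ... */

        /* The MPI_WAIT step: no send or receive buffer is safe to touch
         * until the corresponding request has completed. */
        MPI_Waitall(n + m, reqs, MPI_STATUSES_IGNORE);
    }

Posting the receives non-blocking as well, and completing everything with one wait-all after the overlapped work, is what actually lets the communication hide behind computation.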
If your program does not work with an MPI_WAIT in place, there may be something wrong with your values for n, m, dest(j) and/or source(k), which may also explain the memory "leak". Perhaps you can check these values with a smaller number of processes? Just my 2 cents... PFF From codestr0m at osunix.org Sat Aug 15 19:45:54 2009 From: codestr0m at osunix.org (=?ISO-8859-1?Q?=22C=2E_Bergstr=F6m=22?=) Date: Sat, 15 Aug 2009 19:45:54 -0700 Subject: [Beowulf] PathScale (RIP) WAS: bizarre scaling behavior on a Nehalem In-Reply-To: <4A81A518.2030805@cse.ucdavis.edu> References: <4A81A518.2030805@cse.ucdavis.edu> Message-ID: <4A8772E2.70704@netsyncro.com> Bill Broadley wrote: > ... > Were the binaries compiled specifically to target both architectures? As a > first guess I suggest trying pathscale (RIP) or open64 for amd, and intel's > compiler for intel. But portland group does a good job at both in most cases. > I'm just now catching up on my email and really cringed when I read this... Keep an eye out for PathScale related news in the near future. If you were an old PathScale customer please feel free to contact me offline. If by phone after Wednesday of next week is best. preferred email : codestr0m at osunix.org direct : +1 415-269-8386 ./Christopher From codestr0m at osunix.org Sat Aug 15 20:13:42 2009 From: codestr0m at osunix.org (=?ISO-8859-1?Q?=22C=2E_Bergstr=F6m=22?=) Date: Sat, 15 Aug 2009 20:13:42 -0700 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <35AAF1E4A771E142979F27B51793A48885F52B6CC9@AVEXMB1.qlogic.org> References: <4A81ACF3.60802@noaa.gov> <4A81CEA7.3030802@noaa.gov> <4A82DC41.2060805@cse.ucdavis.edu> <35AAF1E4A771E142979F27B51793A48885F52B6CC9@AVEXMB1.qlogic.org> Message-ID: <4A877966.3030307@netsyncro.com> Tom Elken wrote: >> On Behalf Of Christian Bell >> On Aug 12, 2009, at 11:14 AM, Bill Broadley wrote: >> >> >>> Is it really necessary for dynamic arrays >>> to be substantially slower than static? >>> >> Yes -- when pointers, the compiler assumes (by default) that the >> pointers can alias each other, which can prevent aggressive >> optimizations that are otherwise possible with arrays. >> > ... > >> I remember stacking half a dozen pragmas over a >> 3-line loop on a Cray C compiler years ago to ensure that accesses >> where suitably optimized (or in this case, vectorized). >> > > To add some details to what Christian says, the HPC Challenge version of STREAM uses dynamic arrays and is hard to optimize. I don't know what's best with current compiler versions, but you could try some of these that were used in past HPCC submissions with your program, Bill: > > PathScale 2.2.1 on Opteron: > Base OPT flags: -O3 -OPT:Ofast:fold_reassociate=0 > STREAMFLAGS=-O3 -OPT:Ofast:fold_reassociate=0 -OPT:alias=restrict:align_unsafe=on -CG:movnti=1 > > I hope people don't mind me replying specifically to this PathScale related stuff, but last publicly released version was 3.2 (and a follow-up 3.3-beta) If you're a PathScale customer and interested in this update please feel free to contact me off list. ./Christopher From philippe.blaise at cea.fr Mon Aug 17 02:15:31 2009 From: philippe.blaise at cea.fr (Philippe Blaise) Date: Mon, 17 Aug 2009 11:15:31 +0200 Subject: [Beowulf] MAGMA feedback Message-ID: <4A891FB3.2060700@cea.fr> Does anyone heard about the Magma project ? http://icl.cs.utk.edu/magma/people/index.html If somebody has tried using a linux cluster or whatever machine, associated to GPUS, any feedback is welcome. Thanx Phil. 
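Returning to the pointer-aliasing point raised in the STREAM discussion above: a small C illustration of why dynamically allocated, pointer-based arrays can defeat optimization, and how C99 restrict (like the alias-related compiler flags quoted earlier) promises the compiler that the operands do not overlap. The function names are illustrative only.

    /* Without restrict the compiler must assume a, b and c may overlap,
     * so it cannot freely keep values in registers or vectorize. */
    void triad_may_alias(double *a, const double *b, const double *c,
                         double s, long n)
    {
        for (long i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }

    /* With restrict the loop can be optimized much like one over
     * statically declared arrays. */
    void triad_no_alias(double *restrict a, const double *restrict b,
                        const double *restrict c, double s, long n)
    {
        for (long i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }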
From kus at free.net Wed Aug 19 10:12:41 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Wed, 19 Aug 2009 21:12:41 +0400 Subject: [Beowulf] moving of Linux HDD to other node: udev problem at boot Message-ID: As it was discussed here, there are NUMA problems w/Nehalem on a set of Linux distributions/kernels. I was informed that may be old OpenSuSE 10.3 default kernel (2.6.22) works w/Nehalem OK in the sense of NUMA, i.e. gives right /sys/devices/system/node content. I moved Western Digital SATA HDD w/SuSE 10.3 installed (on dual Barcelona server) to dual Nehalem server (master HDD on Nehalem server) with Supermicro X8DTi mobo. But loading of SuSE 10.3 on Nehalem server was not successful. Grub loader (which menu.lst configuration uses "by-id" identification of disk partitions) works OK. But linux kernel booting didn't finish successfully: /boot/04-udev.sh script (which task is udev initialization) - I think, it's from initrd content - do not see root partition (1st partition on HDD) ! At the boot I see the messages .... SCSI subsystem initialized ACPI Exception (processor_core_0787): Processor device isn't present .... ... Trying manual resume from /dev/sda2 /* it's swap partition*/ resume device /dev/sda2 not found (ignoring) ... Waiting for device /dev/disk/by-id/scsi-SATA-WDC_WD-part1 ... /* echo from udev.sh */ and then the proposal to try again. After finish of this script I don't see any HDDs in /dev. BIOS setting for this SATA device is "enhanced". "compatible" mode gives the same result. What may be the source of the problem ? May be HDD driver used by initrd ? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow PS. If I see (after finish of udev.sh script) the content of /sys - it's right in NUMA sense, i.e. /sys/devices/system/node contains normal node0 and node1. From reuti at staff.uni-marburg.de Wed Aug 19 12:07:19 2009 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed, 19 Aug 2009 21:07:19 +0200 Subject: [Beowulf] moving of Linux HDD to other node: udev problem at boot In-Reply-To: References: Message-ID: Am 19.08.2009 um 19:12 schrieb Mikhail Kuzminsky: > As it was discussed here, there are NUMA problems w/Nehalem on a > set of Linux distributions/kernels. I was informed that may be old > OpenSuSE 10.3 default kernel (2.6.22) works w/Nehalem OK in the > sense of NUMA, i.e. gives right /sys/devices/system/node content. > > I moved Western Digital SATA HDD w/SuSE 10.3 installed (on dual > Barcelona server) to dual Nehalem server (master HDD on Nehalem > server) with Supermicro X8DTi mobo. > > But loading of SuSE 10.3 on Nehalem server was not successful. Grub > loader (which menu.lst configuration uses "by-id" identification of > disk partitions) works OK. But linux kernel booting didn't finish > successfully: /boot/04-udev.sh script (which task is udev > initialization) - I think, it's from initrd content - do not see > root partition (1st partition on HDD) ! > At the boot I see the messages > .... > SCSI subsystem initialized > ACPI Exception (processor_core_0787): Processor device isn't present > .... > > ... > Trying manual resume from /dev/sda2 /* it's > swap partition*/ > resume device /dev/sda2 not found (ignoring) There might be an entry "append resume=..." either in lilo.conf or grub's menu.lst > ... > Waiting for device /dev/disk/by-id/scsi-SATA-WDC_WD- > part1 ... /* echo from udev.sh */ Maybe the disk id is different form the one recored in /etc/fstab. 
What about using plain /dev/sda1 or alike, or mounting by volume label? -- Reuti > > and then the proposal to try again. After finish of this script I > don't see any HDDs in /dev. > > > BIOS setting for this SATA device is "enhanced". "compatible" mode > gives the same result. > > What may be the source of the problem ? May be HDD driver used by > initrd ? > Mikhail Kuzminsky > Computer Assistance to Chemical Research Center > Zelinsky Institute of Organic Chemistry RAS > Moscow > PS. If I see (after finish of udev.sh script) the content of /sys > - it's right in NUMA sense, i.e. > /sys/devices/system/node contains normal node0 and node1. > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From kus at free.net Thu Aug 20 07:29:55 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Thu, 20 Aug 2009 18:29:55 +0400 Subject: [Beowulf] moving of Linux HDD to other node: udev problem at boot In-Reply-To: Message-ID: In message from Reuti (Wed, 19 Aug 2009 21:07:19 +0200): >Maybe the disk id is different form the one recored in /etc/fstab. > What about using plain /dev/sda1 or alike, or mounting by volume >label? At the moment of problem /etc/fstab, as I understand, isn't used. And /dev/sda* files are not created by udev :-( Mikhail > >-- Reuti > > >> >> and then the proposal to try again. After finish of this script I >> don't see any HDDs in /dev. >> >> >> BIOS setting for this SATA device is "enhanced". "compatible" mode >> gives the same result. >> >> What may be the source of the problem ? May be HDD driver used by >> initrd ? >> Mikhail Kuzminsky >> Computer Assistance to Chemical Research Center >> Zelinsky Institute of Organic Chemistry RAS >> Moscow >> PS. If I see (after finish of udev.sh script) the content of /sys >> - it's right in NUMA sense, i.e. >> /sys/devices/system/node contains normal node0 and node1. >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >> Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >Computing >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf > From cousins at umit.maine.edu Thu Aug 20 09:01:46 2009 From: cousins at umit.maine.edu (Steve Cousins) Date: Thu, 20 Aug 2009 12:01:46 -0400 (EDT) Subject: [Beowulf] Anybody going with AMD Istanbul 6-core CPU's? Message-ID: I haven't seen anybody here talking about the 6-core AMD CPU's yet. Is anybody trying these out? Anybody have real-world comparisons (say WRF) of scalability of a 12-core system vs. a 16 thread Nehalem system? Thanks, Steve From eagles051387 at gmail.com Thu Aug 20 09:10:43 2009 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Thu, 20 Aug 2009 18:10:43 +0200 Subject: [Beowulf] amd 3 and 6 core processors Message-ID: someone brought this up in another post and instead of hyjacking that post i started a new one.
what are the advantages of having processors that defy the normal development and progressions of cores. 2 4 8 etc like intel follows but amd seems to have done 2 3 4 6 ? what advantage will these kinds of processors have over the intel 4 vs amd 3 and intel 8 vs amd 6? -- Jonathan Aquilina -------------- next part -------------- An HTML attachment was scrubbed... URL: From reuti at staff.uni-marburg.de Thu Aug 20 10:02:49 2009 From: reuti at staff.uni-marburg.de (Reuti) Date: Thu, 20 Aug 2009 19:02:49 +0200 Subject: [Beowulf] moving of Linux HDD to other node: udev problem at boot In-Reply-To: References: Message-ID: <17C4B67F-303C-4C6B-BF47-614AD6D2F84E@staff.uni-marburg.de> Am 20.08.2009 um 16:29 schrieb Mikhail Kuzminsky: > In message from Reuti (Wed, 19 Aug > 2009 21:07:19 +0200): > >> Maybe the disk id is different form the one recored in /etc/fstab. >> What about using plain /dev/sda1 or alike, or mounting by volume >> label? > > At the moment of problem /etc/fstab, as I understand, isn't used. > And /dev/sda* files are not created by udev :-( Is it the same motherboard/chipset in both machines? You might need a different initrd otherwise. -- Reuti > > Mikhail >> >> -- Reuti >> >> >>> >>> and then the proposal to try again. After finish of this script >>> I don't see any HDDs in /dev. >>> >>> >>> BIOS setting for this SATA device is "enhanced". "compatible" >>> mode gives the same result. >>> >>> What may be the source of the problem ? May be HDD driver used >>> by initrd ? >>> Mikhail Kuzminsky >>> Computer Assistance to Chemical Research Center >>> Zelinsky Institute of Organic Chemistry RAS >>> Moscow >>> PS. If I see (after finish of udev.sh script) the content of / >>> sys - it's right in NUMA sense, i.e. >>> /sys/devices/system/node contains normal node0 and node1. >>> >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >>> Computing >>> To change your subscription (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >> Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> >> -- >> ??? ????????? ???? ????????? ?? ??????? ? ??? ??????? >> ? ????? ???????? ??????????? ??????????? >> MailScanner, ? ?? ???????? >> ??? ??? ?? ???????? ???????????? ????. >> > From reuti at staff.uni-marburg.de Thu Aug 20 11:06:07 2009 From: reuti at staff.uni-marburg.de (Reuti) Date: Thu, 20 Aug 2009 20:06:07 +0200 Subject: [Beowulf] moving of Linux HDD to other node: udev problem at boot In-Reply-To: References: Message-ID: <485D2643-4A7C-4D9D-94D8-E80355994908@staff.uni-marburg.de> Am 20.08.2009 um 19:33 schrieb Mikhail Kuzminsky: > In message from Reuti (Thu, 20 Aug > 2009 19:02:49 +0200): >> Am 20.08.2009 um 16:29 schrieb Mikhail Kuzminsky: >> >>> In message from Reuti (Wed, 19 Aug >>> 2009 21:07:19 +0200): >>> >>>> Maybe the disk id is different form the one recored in /etc/ >>>> fstab. What about using plain /dev/sda1 or alike, or mounting >>>> by volume label? >>> >>> At the moment of problem /etc/fstab, as I understand, isn't >>> used. And /dev/sda* files are not created by udev :-( >> >> Is it the same motherboard/chipset in both machines? 
> > Of course, no - taking into account that one ("source" w/10.3) is w/ > Opteron 2350, and 2nd is based on Nehalem :-) > >> You might need a different initrd otherwise. > > AFAIK, initrd (as the kernel itself) is "universal" for EM64T/x86-64, The problem is not the type of CPU, but the chipset (i.e. the necessary kernel module) with which the HDD is accessed. -- Reuti From lindahl at pbm.com Thu Aug 20 11:23:25 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 20 Aug 2009 11:23:25 -0700 Subject: [Beowulf] moving of Linux HDD to other node: udev problem at boot In-Reply-To: <485D2643-4A7C-4D9D-94D8-E80355994908@staff.uni-marburg.de> References: <485D2643-4A7C-4D9D-94D8-E80355994908@staff.uni-marburg.de> Message-ID: <20090820182325.GC22812@bx9.net> On Thu, Aug 20, 2009 at 08:06:07PM +0200, Reuti wrote: >> AFAIK, initrd (as the kernel itself) is "universal" for EM64T/x86-64, > > The problem is not the type of CPU, but the chipset (i.e. the necessary > kernel module) with which the HDD is accessed. There are 2 aspects to this: 1: /etc/modprobe.conf or equivalent 2: the initrd on a non-rescue disk is generally specialized to only include modules for devices in (1). Solution? Boot in a rescue disk, chroot to your system disk, modify /etc/modprobe.conf appropriately, run mkinitrd. -- g From lindahl at pbm.com Thu Aug 20 11:26:21 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 20 Aug 2009 11:26:21 -0700 Subject: [Beowulf] amd 3 and 6 core processors In-Reply-To: References: Message-ID: <20090820182621.GD22812@bx9.net> On Thu, Aug 20, 2009 at 06:10:43PM +0200, Jonathan Aquilina wrote: > someone brought this up in another post and instead of hyjacking that post i > started a new one. what are the advantages of having processors that defy > the normal development and progressions of cores. > > 2 4 8 etc like intel follows but amd seems to have done 2 3 4 6 ? Intel has done 6 core processors in the recent past, and there are currently 3 memory controllers on Nehalem cpus. The point of a 6 core processor is that's all that fits, plus there isn't enough memory bandwidth to support 8 cores with many workloads. The point of 3 core processors was selling at a lower price point, and being able to sell defective chips as good. -- greg From mathog at caltech.edu Thu Aug 20 11:29:17 2009 From: mathog at caltech.edu (David Mathog) Date: Thu, 20 Aug 2009 11:29:17 -0700 Subject: [Beowulf] Re:moving of Linux HDD to other node: udev problem at boot Message-ID: "Mikhail Kuzminsky" wrote: > I moved Western Digital SATA HDD w/SuSE 10.3 installed (on dual > Barcelona server) to dual Nehalem server (master HDD on Nehalem > server) with Supermicro X8DTi mobo. Which means any number of drivers will have to change. The boot could only succeed if all of these new drivers are present in the distro AND the installation isn't hardwired to use information from the previous system. The first may be true, the second is almost certainly false. On Mandriva, and probably Red Hat, and maybe Suse, even cloning between "identical" systems requires that that the file: /etc/udev/rules.d/61-net_config.rules be removed before reboot as it holds a copy of the MAC from the previous system, and no two machines (should) have the same MAC even if they are otherwise identical. There are a lot of other files in the same directory which I believe hold similar machine specific information. Similarly, your /etc/modprobe.conf will almost certainly load modules which are not appropriate for the new system. 
If there is an /etc/sysconfig directory there may be files there that also hold machine specific information. The /etc/sensors.conf configuration will also certainly also be incorrect. And remember, this assumes the distro even has all the right pieces somewhere on the disk - which it may not. Perhaps you can successfully boot the system in safe mode and then run whatever configuration tool Suse provides to reset all of these hardware specific files? Otherwise, it might be easier to wipe the disk and do a clean install. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From eagles051387 at gmail.com Thu Aug 20 11:45:07 2009 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Thu, 20 Aug 2009 20:45:07 +0200 Subject: [Beowulf] amd 3 and 6 core processors In-Reply-To: <20090820182621.GD22812@bx9.net> References: <20090820182621.GD22812@bx9.net> Message-ID: a friend of mine told me that the amd tri cores were quads with one core disbaled? On Thu, Aug 20, 2009 at 8:26 PM, Greg Lindahl wrote: > On Thu, Aug 20, 2009 at 06:10:43PM +0200, Jonathan Aquilina wrote: > > > someone brought this up in another post and instead of hyjacking that > post i > > started a new one. what are the advantages of having processors that defy > > the normal development and progressions of cores. > > > > 2 4 8 etc like intel follows but amd seems to have done 2 3 4 6 ? > > Intel has done 6 core processors in the recent past, and there are > currently 3 memory controllers on Nehalem cpus. > > The point of a 6 core processor is that's all that fits, plus there isn't > enough memory bandwidth to support 8 cores with many workloads. > > The point of 3 core processors was selling at a lower price point, and > being able to sell defective chips as good. > > -- greg > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Jonathan Aquilina -------------- next part -------------- An HTML attachment was scrubbed... URL: From mathog at caltech.edu Thu Aug 20 12:33:38 2009 From: mathog at caltech.edu (David Mathog) Date: Thu, 20 Aug 2009 12:33:38 -0700 Subject: [Beowulf] Re: amd 3 and 6 core processors Message-ID: Jonathan Aquilina wrote: > a friend of mine told me that the amd tri cores were quads with one core > disabled? Probably. It will often be the case that the disabled core is defective, maybe not fully dead, but it did not pass all of its tests. It is common practice to recycle multicore CPUs with one bad CPU and sell it as a lower performance part. Similarly, chips that won't run at full speed, but will pass all tests at a lower speed, may be stamped as a lower performance part and shipped as that. It makes good business sense to do this since it lets them recover the otherwise wasted production costs on these partially defective devices. They may also disable the 4th core even if works perfectly, and sell it as a 3 core device, when they have an order for the tricore that needs to be shipped and not enough quadcore chips on hand with one bad core to fill it. 
Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From coutinho at dcc.ufmg.br Thu Aug 20 12:34:41 2009 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Thu, 20 Aug 2009 16:34:41 -0300 Subject: [Beowulf] amd 3 and 6 core processors In-Reply-To: References: <20090820182621.GD22812@bx9.net> Message-ID: Yes. Amd tri-cores were quad cores with a defective core, so they disable the defective core And amd isn't alone. Nvidia Geforce GTX 260s were GTX 280s with disabled stream processors. 2009/8/20 Jonathan Aquilina > a friend of mine told me that the amd tri cores were quads with one core > disbaled? > > > On Thu, Aug 20, 2009 at 8:26 PM, Greg Lindahl wrote: > >> On Thu, Aug 20, 2009 at 06:10:43PM +0200, Jonathan Aquilina wrote: >> >> > someone brought this up in another post and instead of hyjacking that >> post i >> > started a new one. what are the advantages of having processors that >> defy >> > the normal development and progressions of cores. >> > >> > 2 4 8 etc like intel follows but amd seems to have done 2 3 4 6 ? >> >> Intel has done 6 core processors in the recent past, and there are >> currently 3 memory controllers on Nehalem cpus. >> >> The point of a 6 core processor is that's all that fits, plus there isn't >> enough memory bandwidth to support 8 cores with many workloads. >> >> The point of 3 core processors was selling at a lower price point, and >> being able to sell defective chips as good. >> >> -- greg >> >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > > > > -- > Jonathan Aquilina > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kus at free.net Thu Aug 20 15:23:13 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Fri, 21 Aug 2009 02:23:13 +0400 Subject: [Beowulf] Re: moving of Linux HDD to other node: udev problem at boot In-Reply-To: Message-ID: In message from "David Mathog" (Thu, 20 Aug 2009 11:29:17 -0700): >"Mikhail Kuzminsky" wrote: >> I moved Western Digital SATA HDD w/SuSE 10.3 installed (on dual >> Barcelona server) to dual Nehalem server (master HDD on Nehalem >> server) with Supermicro X8DTi mobo. > >Which means any number of drivers will have to change. The boot >could >only succeed if all of these new drivers are present in the distro >AND >the installation isn't hardwired to use information from the previous >system. The first may be true, It was, of course, the main hope > the second is almost certainly false. ... and the second can be resolved IMHO w/o difficult problems. >On Mandriva, and probably Red Hat, and maybe Suse, even cloning >between >"identical" systems requires that that the file: > > /etc/udev/rules.d/61-net_config.rules > >be removed before reboot as it holds a copy of the MAC from the >previous >system, and no two machines (should) have the same MAC even if they >are >otherwise identical. SuSE have this "problem", but at least 11.1 have special setting to avoid such udev behaviour. And updating of network settings isn't a problem. 
> There are a lot of other files in the same >directory which I believe hold similar machine specific information. >Similarly, your /etc/modprobe.conf will almost certainly load modules >which are not appropriate for the new system. Is there some modules which depends from processors ? The NIC drivers isn't a problem. > If there is an /etc/sysconfig directory there may be files there that also hold machine specific information. The /etc/sensors.conf configuration will also certainly also be incorrect. Of course, lm_sensors and NICs settings have to be changed. But HDDs for example was the same (excluding size). >Perhaps you can successfully boot the system in safe mode and then >run >whatever configuration tool Suse provides to reset all of these >hardware >specific files? The problem don't depends from kind of load (safemode or usual). >David Mathog >mathog at caltech.edu >Manager, Sequence Analysis Facility, Biology Division, Caltech From richard.walsh at comcast.net Thu Aug 20 15:35:34 2009 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Thu, 20 Aug 2009 22:35:34 +0000 (UTC) Subject: [Beowulf] Re: amd 3 and 6 core processors In-Reply-To: Message-ID: <778088694.1715931250807734460.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> >----- Original Message ----- >From: "David Mathog" >To: beowulf at beowulf.org >Sent: Thursday, August 20, 2009 2:33:38 PM GMT -06:00 US/Canada Central >Subject: [Beowulf] Re: amd 3 and 6 core processors > >Jonathan Aquilina wrote: > >> a friend of mine told me that the amd tri cores were quads with one core >> disabled? > >Probably. It will often be the case that the disabled core is >defective, maybe not fully dead, but it did not pass all of its tests. >It is common practice to recycle multicore CPUs with one bad CPU and >sell it as a lower performance part. Similarly, chips that won't run at >full speed, but will pass all tests at a lower speed, may be stamped as >a lower performance part and shipped as that. It makes good business >sense to do this since it lets them recover the otherwise wasted >production costs on these partially defective devices. They may also >disable the 4th core even if works perfectly, and sell it as a 3 core >device, when they have an order for the tricore that needs to be shipped >and not enough quadcore chips on hand with one bad core to fill it. Many good points above and in Greg's earlier note. Its all about yield and what you can fit on the chip at a given line width. In the past, binning by clock was the primary (only?) choice to bring up yields. As chips have grown in size and evolved toward multi-core, degrading cores has been a economic side-benefit. IBM was one of the first to use this approach (first with dual-core too), when they sold dual-core Power series chips with one core disable to give the remaining core maximum bandwidth. There is little benefit in developing processing for real 2, 3, 4, 5, 6, 7, ... etc. core chips. Better to start with a standard process and core-count, and degrade it to fill lower power and performance bins. The Nehalem micro-architecture is available as a dual core offering. It is not clear to me (someone here may know), whether this is not just a degraded quad-core, or a true dual core. This pinout is different, so perhaps it is a true dual-core. I would also like to know how Intel and AMD are disabling/degrading the cores. They very like have built in circuits that they can "burn out" to ensure physical incapacity. Still, perhaps it is done another way. 
With Nehalem and its on-chip power management unit, dynamic "soft" disabling may be all that is needed. As folks here are I am sure aware, Intel will have a true 8-core offering in the next 3 to 6 months which puts them in a position to offer 5 and 7 core degraded processors as well. rbw -------------- next part -------------- An HTML attachment was scrubbed... URL: From kus at free.net Thu Aug 20 15:41:35 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Fri, 21 Aug 2009 02:41:35 +0400 Subject: [Beowulf] moving of Linux HDD to other node: udev problem at boot In-Reply-To: <20090820182325.GC22812@bx9.net> Message-ID: In message from Greg Lindahl (Thu, 20 Aug 2009 11:23:25 -0700): >On Thu, Aug 20, 2009 at 08:06:07PM +0200, Reuti wrote: > >>> AFAIK, initrd (as the kernel itself) is "universal" for >>>EM64T/x86-64, >> >> The problem is not the type of CPU, but the chipset (i.e. the >>necessary >> kernel module) with which the HDD is accessed. > >There are 2 aspects to this: > >1: /etc/modprobe.conf or equivalent >2: the initrd on a non-rescue disk is generally specialized to only >include modules for devices in (1). > >Solution? Boot in a rescue disk, chroot to your system disk, modify >/etc/modprobe.conf appropriately, run mkinitrd. Thanks, it's good idea ! The problem is (I think) just in 10.3 initrd image. Unfortunately it's in some inconsistence w/my source hope - move HDD ASAP (As Simple As Possible :-)) ). Mikhail From eagles051387 at gmail.com Thu Aug 20 16:24:35 2009 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Fri, 21 Aug 2009 01:24:35 +0200 Subject: [Beowulf] Re: amd 3 and 6 core processors In-Reply-To: <778088694.1715931250807734460.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> References: <778088694.1715931250807734460.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: rbw that is actually true 4th quater of this year intel is release its first 8 core with hyperthreading processor to the xenon market. amd currently already has their 6 core out. i understand the reasoning you made about recycling them David, which saves the company money as a whole on manufacturing especially since they wont need another plant to prossibly produce the lower end product. On Fri, Aug 21, 2009 at 12:35 AM, wrote: > > > > > >----- Original Message ----- > >From: "David Mathog" > >To: beowulf at beowulf.org > >Sent: Thursday, August 20, 2009 2:33:38 PM GMT -06:00 US/Canada Central > >Subject: [Beowulf] Re: amd 3 and 6 core processors > > > >Jonathan Aquilina wrote: > > > >> a friend of mine told me that the amd tri cores were quads with one core > >> disabled? > > > >Probably. It will often be the case that the disabled core is > >defective, maybe not fully dead, but it did not pass all of its tests. > >It is common practice to recycle multicore CPUs with one bad CPU and > >sell it as a lower performance part. Similarly, chips that won't run at > >full speed, but will pass all tests at a lower speed, may be stamped as > >a lower performance part and shipped as that. It makes good business > >sense to do this since it lets them recover the otherwise wasted > >production costs on these partially defective devices. They may also > >disable the 4th core even if works perfectly, and sell it as a 3 core > >device, when they have an order for the tricore that needs to be shipped > >and not enough quadcore chips on hand with one bad core to fill it. > > Many good points above and in Greg's earlier note. 
Its all about yield > and what you can fit on the chip at a given line width. > > In the past, binning by clock was the primary (only?) choice to bring up > yields. As chips have grown in size and evolved toward multi-core, > degrading cores has been a economic side-benefit. IBM was one of > the first to use this approach (first with dual-core too), when they sold > dual-core > Power series chips with one core disable to give the remaining core > maximum bandwidth. There is little benefit in developing processing > for real 2, 3, 4, 5, 6, 7, ... etc. core chips. Better to start with a > standard > process and core-count, and degrade it to fill lower power and performance > bins. The Nehalem micro-architecture is available as a dual core offering. > It > is not clear to me (someone here may know), whether this is not just a > degraded quad-core, or a true dual core. This pinout is different, so > perhaps it is a true dual-core. I would also like to know how Intel and > AMD are disabling/degrading the cores. They very like have built > in circuits that they can "burn out" to ensure physical incapacity. Still, > perhaps it is done another way. With Nehalem and its on-chip power > management unit, dynamic "soft" disabling may be all that is needed. > > As folks here are I am sure aware, Intel will have a true 8-core offering > in the next 3 to 6 months which puts them in a position to offer 5 and > 7 core degraded processors as well. > > rbw > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- Jonathan Aquilina -------------- next part -------------- An HTML attachment was scrubbed... URL: From henning.fehrmann at aei.mpg.de Fri Aug 21 06:25:32 2009 From: henning.fehrmann at aei.mpg.de (Henning Fehrmann) Date: Fri, 21 Aug 2009 15:25:32 +0200 Subject: [Beowulf] HD undetectable errors Message-ID: <20090821132532.GA16945@gretchen.aei.mpg.de> Hello, a typical rate for data not recovered in a read operation on a HD is 1 per 10^15 bit reads. If one fills a 100TByte file server the probability of loosing data is of the order of 1. Off course, one could circumvent this problem by using RAID5 or RAID6. Most of the controller do not check the parity if they read data and here the trouble begins. I can't recall the rate for undetectable errors but this might be few orders of magnitude smaller than 1 per 10^15 bit reads. However, given the fact that one deals nowadays with few hundred TBytes of data this might happen from time to time without being realized. One could lower the rate by forcing the RAID controller to check the parity information in a read process. Are there RAID controller which are able to perform this? Another solution might be the useage of file systems which have additional checksums for the blocks like zfs or qfs. This even prevents data corruption due to undetected bit flips on the bus or the RAID controller. Does somebody know the size of the checksum and the rate of undetected errors for qfs? For zfs it is 256 bit per 512Byte data. One option is the fletcher2 algorithm to compute the checksum. Does somebody know the rate of undetectable bit flips for such a setting? Are there any other file systems doing block-wise checksumming? 
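To make the block-checksum idea concrete, here is a generic Fletcher-style checksum sketch in C. It is in the spirit of, but not identical to, ZFS's fletcher2; the word width, struct layout and function name are illustrative assumptions rather than the actual ZFS implementation or on-disk format, and the buffer length is assumed to be a multiple of 8 bytes.

    #include <stdint.h>
    #include <stddef.h>

    struct blk_cksum { uint64_t a, b; };

    /* Two running sums over 64-bit words: the first-order sum catches
     * flipped bits, the second-order sum also catches reordered data. */
    static struct blk_cksum fletcher_like(const void *buf, size_t size)
    {
        const uint64_t *w   = buf;
        const uint64_t *end = w + size / sizeof(uint64_t);
        struct blk_cksum c  = { 0, 0 };

        for (; w < end; w++) {
            c.a += *w;
            c.b += c.a;
        }
        return c;   /* stored separately from the data block itself */
    }

On read the filesystem recomputes this over each block and compares it with the stored copy (ZFS keeps it in the parent block pointer), so corruption introduced by the disk, the controller or the bus is caught even if the RAID layer never verifies parity.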
Thank you, Henning Fehrmann From cap at nsc.liu.se Fri Aug 21 09:33:17 2009 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Fri, 21 Aug 2009 18:33:17 +0200 Subject: [Beowulf] HD undetectable errors In-Reply-To: <20090821132532.GA16945@gretchen.aei.mpg.de> References: <20090821132532.GA16945@gretchen.aei.mpg.de> Message-ID: <200908211833.21485.cap@nsc.liu.se> On Friday 21 August 2009, Henning Fehrmann wrote: > Hello, > > a typical rate for data not recovered in a read operation on a HD is > 1 per 10^15 bit reads. I think Seagate claims 10^15 _sectors_ but I may have mis-read it. > If one fills a 100TByte file server the probability of loosing data > is of the order of 1. > Off course, one could circumvent this problem by using RAID5 or RAID6. > Most of the controller do not check the parity if they read data and > here the trouble begins. Peter Kelemen at CERN has done some interesting things, like: http://cern.ch/Peter.Kelemen/talk/2007/kelemen-2007-C5-Silent_Corruptions.pdf > I can't recall the rate for undetectable errors but this might be few > orders of magnitude smaller than 1 per 10^15 bit reads. However, given > the fact that one deals nowadays with few hundred TBytes of data this > might happen from time to time without being realized. > > One could lower the rate by forcing the RAID controller to check the > parity information in a read process. Are there RAID controller which > are able to perform this? Yes. But most wont and it will hur quite a lot performance wise. I know, for example, that our IBM DS4700 with updated firmware can enable "verify-on-read". > Another solution might be the useage of file systems which have additional > checksums for the blocks like zfs or qfs. This even prevents data > corruption due to undetected bit flips on the bus or the RAID > controller. This is, IMHO, probably a better approach. As you can read in the article I referenced above the controller layer (and really any layer) adds yet another source of silent corruptions so the higher up the better. Also, the file system has information about the data layout that it can use to do this more efficiently than lower layers. > Does somebody know the size of the checksum and the rate of undetected > errors for qfs? Remember that you have to calculate this against the amount of corrupt data, not the total amount of data. /Peter > For zfs it is 256 bit per 512Byte data. > One option is the fletcher2 algorithm to compute the checksum. > Does somebody know the rate of undetectable bit flips for such a > setting? > > Are there any other file systems doing block-wise checksumming? > > > Thank you, > Henning Fehrmann -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. URL: From orion at cora.nwra.com Fri Aug 21 10:22:30 2009 From: orion at cora.nwra.com (Orion Poplawski) Date: Fri, 21 Aug 2009 11:22:30 -0600 Subject: [Beowulf] Help for terrible NFS write performance Message-ID: <4A8ED7D6.8080601@cora.nwra.com> I'm trying to improve the terrible NFS (write in particular) performance I'm seeing. Pure network performance does not appear to be an issue as I can hit 120MB/s reading which should be about the limit for gigE. Perhaps the local disk performance is not what it should be. Any help would be greatly appreciated. Using bonnie++ for benchmarks. 
Server: Dual proc dual core opteron 2GHz 8GB RAM CentOS 4.7 kernel 2.6.9-78.0.22.plus.c4smp 3 8-port Marvell MV88SX6081 SATAII controllers sata_mv 3.6.2 driver Ethernet controller: nVidia Corporation MCP55 Ethernet (rev a3) MTU 8982 Arrays are linux md arrays of 6 disks with 2 on each controller. 64k cunks. ext3 filesystem. "working" - raid0 ST31000340AS 1TB drives local perf: 224-240MB/s write, 135MB/s rewrite, 390-400MB/s read "cora6" - raid5 ST31500341AS 1.5TB drives local perf: 84MB/s write, 42MB/s rewrite, 161-166MB/s read /etc/exports: /export *.cora.nwra.com(rw,sync,fsid=0) /export/cora6 *.cora.nwra.com(rw,sync,nohide) /export/working *.cora.nwra.com(rw,sync,nohide) /etc/sysconfig/nfs: # This entry should be "yes" if you are using RPCSEC_GSS_KRB5 (auth=krb5,krb5i, or krb5p) SECURE_NFS="no" RPCNFSDCOUNT=32 MOUNTD_NFS_V2=no client: Dual Opteron 246 3GB RAM CentOS 5.3 kernel 2.6.18-144.el5 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03) mounts: NFSv4 rsize=32768,wsize=32768 "working" - 60MB/s write, 12MB/s rewrite, 120MB/s read "cora6" - 21MB/s write, 7.7MB/s rewrite, 120MB/s read Thanks again! -- Orion Poplawski Technical Manager 303-415-9701 x222 NWRA/CoRA Division FAX: 303-415-9702 3380 Mitchell Lane orion at cora.nwra.com Boulder, CO 80301 http://www.cora.nwra.com From peter.st.john at gmail.com Fri Aug 21 10:22:54 2009 From: peter.st.john at gmail.com (Peter St. John) Date: Fri, 21 Aug 2009 13:22:54 -0400 Subject: [Beowulf] modular motherboard prototype Message-ID: Slashdot pointed out a prototype modular motherboard; a big mobo is made from snapping together a bunch of little ones: http://www.wired.com/gadgetlab/2009/08/modular-motherboard/ 'David Ackley , associate professor of computer science at the University of New Mexico and one of the contributors to the project [says] ?We have a CPU, RAM, data storage and serial ports for connectivity on every two square inches.? ' ... Each X Machina module has a 72 MHz processor (currently an ARM chip), a solid state drive of 16KB and 128KB of storage in an EEPROM ... chip. There?s also an LED for display output and a button for user interaction.' The modules share power & data through a single connector, recognize the presence of neighbors and can load each others programs automatically. If the connectors are the usual male/female (not certain from the picture) then snapping them together would be a bear, since the vertical pair of pins wouldn't be aligned when the horizontal pair hasn't been snapped in yet, and vice versa. Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: From landman at scalableinformatics.com Fri Aug 21 10:38:17 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 21 Aug 2009 13:38:17 -0400 Subject: [Beowulf] Help for terrible NFS write performance In-Reply-To: <4A8ED7D6.8080601@cora.nwra.com> References: <4A8ED7D6.8080601@cora.nwra.com> Message-ID: <4A8EDB89.7090409@scalableinformatics.com> Orion Poplawski wrote: > I'm trying to improve the terrible NFS (write in particular) performance > I'm seeing. Pure network performance does not appear to be an issue as > I can hit 120MB/s reading which should be about the limit for gigE. > Perhaps the local disk performance is not what it should be. Any help > would be greatly appreciated. Using bonnie++ for benchmarks. 
> > Server: > > Dual proc dual core opteron 2GHz > 8GB RAM > CentOS 4.7 > kernel 2.6.9-78.0.22.plus.c4smp > 3 8-port Marvell MV88SX6081 SATAII controllers > sata_mv 3.6.2 driver > Ethernet controller: nVidia Corporation MCP55 Ethernet (rev a3) > MTU 8982 > > Arrays are linux md arrays of 6 disks with 2 on each controller. 64k > cunks. ext3 filesystem. If I had to bet, ext3 would have much to do with this ... though, honestly, md RAID write performance over NFS is nothing to write home about. We can get ~350MB/s on our DeltaV's, but this takes lots of work. > > "working" - raid0 ST31000340AS 1TB drives > local perf: 224-240MB/s write, 135MB/s rewrite, 390-400MB/s read > "cora6" - raid5 ST31500341AS 1.5TB drives > local perf: 84MB/s write, 42MB/s rewrite, 161-166MB/s read > > /etc/exports: > /export *.cora.nwra.com(rw,sync,fsid=0) > /export/cora6 *.cora.nwra.com(rw,sync,nohide) > /export/working *.cora.nwra.com(rw,sync,nohide) Ok. There it is... Sync. Don't need to see anything else. That is it. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From kus at free.net Fri Aug 21 10:53:16 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Fri, 21 Aug 2009 21:53:16 +0400 Subject: [Beowulf] nearly future of Larrabee Message-ID: AFAIK Larrabee-based product(s) will appear soon - at begin of 2010. Unfortunatley I didn't see enough appropriate technical information. What new is known from SIGGRAPH 2009 ? There was 2 ideas of Larrabee-based hardware a) Whole computers on Larrabee CPU(s) b) GPGPU card. Recently I didn't see any words about Larrabee-based servers - only about graphical cards. If Larrabee will work as CPU - then I beleive that linux kernel developers will work in this direction. But I didn't find anything about Larrabee in 2.6. So Q1. Is there the plans to build Larrabee-based motherboards (in particular in 2010) ? If Larrabee will be in the form of graphical card (the most probable case) - Q2. What will be the interface - one slot PCI-E v.2 x16 ? It's known now, that DP will be hardware supported and (AFAIK) that 512-bit operands (i.e. 8 DP words) will be supported in ISA. Q3. Does it means that Larrabee will give essential speedup also on relative short vectors ? And is there some preliminary articles w/estimation of Larrabee DP performance ? One of declared potential advantages of Larrabee is support by compilers. There is now PGI Fortran w/NVidia GPGPU extensions. PGI Accelerator-2010 will include support of CUDA on the base of OpenMP-like comments to compiler. So Q4. Is there some rumours about direct Larrabee support w/Intel ifort or PGI compilers in 2010 ? (By "direct" I mean automatic compiler vectorization of "pure" Fortran/C source, maximim w/additional commemts). Q5. How much may costs Larrabee-based hardware in 2010 ? I hope it'll be lower $10000. Any more exact predictions ? 
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow From orion at cora.nwra.com Fri Aug 21 10:56:37 2009 From: orion at cora.nwra.com (Orion Poplawski) Date: Fri, 21 Aug 2009 11:56:37 -0600 Subject: [Beowulf] Help for terrible NFS write performance In-Reply-To: <4A8EDB89.7090409@scalableinformatics.com> References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> Message-ID: <4A8EDFD5.9080909@cora.nwra.com> On 08/21/2009 11:38 AM, Joe Landman wrote: > Orion Poplawski wrote: >> /etc/exports: >> /export *.cora.nwra.com(rw,sync,fsid=0) >> /export/cora6 *.cora.nwra.com(rw,sync,nohide) >> /export/working *.cora.nwra.com(rw,sync,nohide) > > Ok. There it is... Sync. > > Don't need to see anything else. > > That is it. Okay, how afraid should I be of using async? man exports states: async This option allows the NFS server to violate the NFS protocol and reply to requests before any changes made by that request have been committed to stable storage (e.g. disc drive). Using this option might improve performance with version 2 only, but at the cost that an unclean server restart (i.e. a crash) can cause data to be lost or corrupted. I tend to like to avoid data loss or corruption. Thanks! -- Orion Poplawski Technical Manager 303-415-9701 x222 NWRA/CoRA Division FAX: 303-415-9702 3380 Mitchell Lane orion at cora.nwra.com Boulder, CO 80301 http://www.cora.nwra.com From lindahl at pbm.com Fri Aug 21 10:59:48 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 21 Aug 2009 10:59:48 -0700 Subject: [Beowulf] HD undetectable errors In-Reply-To: <20090821132532.GA16945@gretchen.aei.mpg.de> References: <20090821132532.GA16945@gretchen.aei.mpg.de> Message-ID: <20090821175948.GA18621@bx9.net> On Fri, Aug 21, 2009 at 03:25:32PM +0200, Henning Fehrmann wrote: > Are there any other file systems doing block-wise checksumming? Cloud filesystems (GFS, HDFS) and Lustre all keep checksums, which end up being user-level checksums in files stored in traditional filesystems. I have a petabyte of disk and wouldn't dream of storing data without checksums. -- greg From skylar at cs.earlham.edu Fri Aug 21 11:03:00 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Fri, 21 Aug 2009 11:03:00 -0700 Subject: [Beowulf] Help for terrible NFS write performance In-Reply-To: <4A8EDFD5.9080909@cora.nwra.com> References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> <4A8EDFD5.9080909@cora.nwra.com> Message-ID: <4A8EE154.2060308@cs.earlham.edu> Orion Poplawski wrote: > Okay, how afraid should I be of using async? man exports states: > > async This option allows the NFS server to violate the NFS > protocol > and reply to requests before any changes made by that > request > have been committed to stable storage (e.g. disc drive). > > Using this option might improve performance with > version 2 > only, but at the cost that an unclean server restart > (i.e. a > crash) can cause data to be lost or corrupted. > > > I tend to like to avoid data loss or corruption. > > Thanks! > It depends on your applications. async doesn't remove the capability of writing synchronously to an NFS export, but it removes the implicit sync after every write. If your applications know to fsync() when they absolutely need to have stuff written to disk (and of course check errno after fsync()), the server will respect that and flush the data for that file and only reply if the data was flushed properly. 
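A minimal sketch of that application-side pattern in C (the path and payload are placeholders; whether the flush really reaches stable storage on the NFS server still depends on the server and export settings discussed in this thread):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Write a buffer and explicitly force it out, checking every step. */
    int write_checkpoint(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return -1; }

        if (write(fd, buf, len) != (ssize_t)len) {   /* short write = trouble */
            perror("write");
            close(fd);
            return -1;
        }
        if (fsync(fd) != 0) {      /* data is not "safe" until this succeeds */
            perror("fsync");
            close(fd);
            return -1;
        }
        return close(fd);          /* close() can also report deferred errors */
    }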
NB: This likely depends on both your NFS server and client implementations. Use with caution. YMMV. -- -- Skylar Thompson (skylar at cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 253 bytes Desc: OpenPGP digital signature URL: From landman at scalableinformatics.com Fri Aug 21 11:09:18 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 21 Aug 2009 14:09:18 -0400 Subject: [Beowulf] Help for terrible NFS write performance In-Reply-To: References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> Message-ID: <4A8EE2CE.7030405@scalableinformatics.com> Cunningham, Dave wrote: > Are you suggesting that running nosync in an hpc environment might be > a good idea ? We had that discussion with our integrator and were > assured that running nosync was the path to damnation and would > probably grow hair on our palms. ... that may be, but the flip side is that your write performance will be terrible with sync. Its a question of which risk is larger. Sync forces the write to wait to return to the user until the data is committed to disk. Which is technically not true, as it is committed to cache on the drive unless you turned off all write caching. Even then, drivers don't necessarily wait and verify that the data got to disk. They verify that the data was flushed to disk, but not that the data on the disk is what you thought it was. This said, we haven't seen this problem (corrupted fs data) as an issue for a crashed NFS server in a while (years). YMMV. > What are your thoughts on the tradeoff ? Performance or "safety". Sync doesn't give you guarantees that your data is on disk, it merely guarantees that the relevant semantics have been honored. We have found that with a good journaling file system (not ext3), that this is not usually an issue. Then again, you are using md raid, which means you don't have a nice battery backed raid behind you to cache IO ops, such as writes that didn't finish making it to disk. So unless you have turned off write caching on the drives themselves, the sync is superfluous. Bug me offline if you want to talk more about this. Joe > > Dave Cunningham > > -----Original Message----- From: beowulf-bounces at beowulf.org > [mailto:beowulf-bounces at beowulf.org] On Behalf Of Joe Landman Sent: > Friday, August 21, 2009 10:38 AM To: Orion Poplawski Cc: Beowulf List > Subject: Re: [Beowulf] Help for terrible NFS write performance > > Orion Poplawski wrote: >> I'm trying to improve the terrible NFS (write in particular) >> performance I'm seeing. Pure network performance does not appear >> to be an issue as I can hit 120MB/s reading which should be about >> the limit for gigE. Perhaps the local disk performance is not what >> it should be. Any help would be greatly appreciated. Using >> bonnie++ for benchmarks. >> >> Server: >> >> Dual proc dual core opteron 2GHz 8GB RAM CentOS 4.7 kernel >> 2.6.9-78.0.22.plus.c4smp 3 8-port Marvell MV88SX6081 SATAII >> controllers sata_mv 3.6.2 driver Ethernet controller: nVidia >> Corporation MCP55 Ethernet (rev a3) MTU 8982 >> >> Arrays are linux md arrays of 6 disks with 2 on each controller. >> 64k cunks. ext3 filesystem. > > If I had to bet, ext3 would have much to do with this ... though, > honestly, md RAID write performance over NFS is nothing to write home > about. We can get ~350MB/s on our DeltaV's, but this takes lots of > work. 
> >> "working" - raid0 ST31000340AS 1TB drives local perf: 224-240MB/s >> write, 135MB/s rewrite, 390-400MB/s read "cora6" - raid5 >> ST31500341AS 1.5TB drives local perf: 84MB/s write, 42MB/s rewrite, >> 161-166MB/s read >> >> /etc/exports: /export *.cora.nwra.com(rw,sync,fsid=0) >> /export/cora6 *.cora.nwra.com(rw,sync,nohide) /export/working >> *.cora.nwra.com(rw,sync,nohide) > > Ok. There it is... Sync. > > Don't need to see anything else. > > That is it. > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From landman at scalableinformatics.com Fri Aug 21 11:14:51 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 21 Aug 2009 14:14:51 -0400 Subject: [Beowulf] Help for terrible NFS write performance In-Reply-To: <4A8EDFD5.9080909@cora.nwra.com> References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> <4A8EDFD5.9080909@cora.nwra.com> Message-ID: <4A8EE41B.7000605@scalableinformatics.com> Orion Poplawski wrote: > On 08/21/2009 11:38 AM, Joe Landman wrote: >> Orion Poplawski wrote: >>> /etc/exports: >>> /export *.cora.nwra.com(rw,sync,fsid=0) >>> /export/cora6 *.cora.nwra.com(rw,sync,nohide) >>> /export/working *.cora.nwra.com(rw,sync,nohide) >> >> Ok. There it is... Sync. >> >> Don't need to see anything else. >> >> That is it. > > Okay, how afraid should I be of using async? man exports states: > > async This option allows the NFS server to violate the NFS protocol > and reply to requests before any changes made by that > request > have been committed to stable storage (e.g. disc drive). > > Using this option might improve performance with version 2 > only, but at the cost that an unclean server restart > (i.e. a > crash) can cause data to be lost or corrupted. If your NFS server crashes, you *could* lose data. Not you *will* lose data. But since you are using md raid without battery backed cache, chances of data loss could be higher (all the RAID calculations happen in RAM). The file system could be properly resilient. What would you do if the server crashed? Would you have users restart their runs? Its all a question of risk, real and imagined (or more correctly ... over-emphasized). > I tend to like to avoid data loss or corruption. So do most people. If your node crashes, do you get data loss? If your server crashes, will you get data loss? I am guessing that if you do an hdparm -W /dev/sd* you will find your write caching on, on each drive. If so, sync is less of an issue, and you have bigger worries. This is BTW an area where having a big fast RAID card with a large cache is a definite advantage over md raid. Data in a battery backed RAID cache won't go away under a reboot/crash. > > Thanks! > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From james.p.lux at jpl.nasa.gov Fri Aug 21 12:04:09 2009 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 21 Aug 2009 12:04:09 -0700 Subject: [Beowulf] modular motherboard prototype In-Reply-To: References: Message-ID: From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Peter St. 
John Sent: Friday, August 21, 2009 10:23 AM To: Beowulf List Subject: [Beowulf] modular motherboard prototype Slashdot pointed out a prototype modular motherboard; a big mobo is made from snapping together a bunch of little ones: http://www.wired.com/gadgetlab/2009/08/modular-motherboard/ 'David Ackley, associate professor of computer science at the University of New Mexico and one of the contributors to the project [says] "We have a CPU, RAM, data storage and serial ports for connectivity on every two square inches." ' ... Each X Machina module has a 72 MHz processor (currently an ARM chip), a solid state drive of 16KB and 128KB of storage in an EEPROM ... chip. There's also an LED for display output and a button for user interaction.' The modules share power & data through a single connector, recognize the presence of neighbors and can load each others programs automatically. If the connectors are the usual male/female (not certain from the picture) then snapping them together would be a bear, since the vertical pair of pins wouldn't be aligned when the horizontal pair hasn't been snapped in yet, and vice versa. Peter ---- There are lots of hermaphroditic connectors around, or, for that matter, you arrange both male and female contacts in the connector. There is probably also a relative orientation requirement for these boards.. e.g. they are all in a plane, and they all have to have the same 2D rotational orientation (North up on all boards, or something similar), so then you could have gendered connectors, and still flexible interconnects.. The left side has female, the right male, for instance. In the case of the Illuminato boards, it looks like they're simple female 0.1" spaced connectors, and you'd put a 7x1 pin header in between to connect them. (it might be a 2x7 array for 14 pins.. the picture isn't totally clear on the website. These days, with high speed serial interconnects, especially with wireless (optical or RF), one should be able to build nodes that can be literally thrown together any old way, except for the power. And there are solutions for power distribution as well. The folks doing swarms and emergent behavior are big into this kind of thing. AS far as the architecture goes, I wonder if these guys ever looked at transputers and their related architectures. And anyone on this list is well aware that the real challenge isn't the hardware, but the software, and making use of that amorphous blob of processing with random interconnects for anything other than "toy" applications. As the Wired article says: "..there are many details that need to be worked out..." "..haven't benchmarked .. power consumption and speed.." We can guess, though.. It's an ARM running at 72 MHz, for which they claim 64 Dhrystone MIPS. Some of those are going to be burned in interprocess communication. The LPC2368 (which is what they use) datasheet says it draws 125mA at 3.3V running at 72MHz. (I'm pretty sure that doesn't include power to drive I/O pins, which is separately supplied). That's about half a watt. So they're getting about 128 DMIPS/Watt. Let's compare to, say, a Core2 Duo at 2.4GHz, with 7000 DMIPS, burning about 35W, or 200 DMIPS/Watt. They're in the same ballpark, but I suspect that in a "real" system doing "real" work, the overhead moving data from one tiny processor to another will consume easily half of the overall resources. SO right now, this is a cute toy. I wouldn't mind having a few dozen modules to fool with. 
(they're about $60 each) (compare also, what if you went out and got a box of Gumstix) Jim From lindahl at pbm.com Fri Aug 21 13:15:54 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 21 Aug 2009 13:15:54 -0700 Subject: [Beowulf] modular motherboard prototype In-Reply-To: References: Message-ID: <20090821201554.GD314@bx9.net> On Fri, Aug 21, 2009 at 12:04:09PM -0700, Lux, Jim (337C) wrote: > We can guess, though.. It's an ARM running at 72 MHz, for which they > claim 64 Dhrystone MIPS. Some of those are going to be burned in > interprocess communication. The LPC2368 (which is what they use) > datasheet says it draws 125mA at 3.3V running at 72MHz. (I'm pretty > sure that doesn't include power to drive I/O pins, which is > separately supplied). That's about half a watt. So they're getting > about 128 DMIPS/Watt In comparison, a Sheevaplug is $100, 1.2 Ghz ARM, 15 watts max for the whole thing, and you can use gigE networking and program it with MPI. I suppose it depends on how weird you like your hobbies to be :-) -- g From gerry.creager at tamu.edu Fri Aug 21 13:28:37 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Fri, 21 Aug 2009 15:28:37 -0500 Subject: [Beowulf] modular motherboard prototype In-Reply-To: <20090821201554.GD314@bx9.net> References: <20090821201554.GD314@bx9.net> Message-ID: <4A8F0375.8010902@tamu.edu> Greg Lindahl wrote: > On Fri, Aug 21, 2009 at 12:04:09PM -0700, Lux, Jim (337C) wrote: > >> We can guess, though.. It's an ARM running at 72 MHz, for which they >> claim 64 Dhrystone MIPS. Some of those are going to be burned in >> interprocess communication. The LPC2368 (which is what they use) >> datasheet says it draws 125mA at 3.3V running at 72MHz. (I'm pretty >> sure that doesn't include power to drive I/O pins, which is >> separately supplied). That's about half a watt. So they're getting >> about 128 DMIPS/Watt > > In comparison, a Sheevaplug is $100, 1.2 Ghz ARM, 15 watts max for the > whole thing, and you can use gigE networking and program it with MPI. > I suppose it depends on how weird you like your hobbies to be :-) For Jim's projects to work, however, he's gotta also have the spacecraft fly information to keep the PowerSquid in formation, so the SheevaPlugs can all get power. Then there's the real long extension cord back to those AC-procuding solar cells.... From prentice at ias.edu Fri Aug 21 14:11:00 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 21 Aug 2009 17:11:00 -0400 Subject: [Beowulf] Help for terrible NFS write performance In-Reply-To: <4A8EE41B.7000605@scalableinformatics.com> References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> <4A8EDFD5.9080909@cora.nwra.com> <4A8EE41B.7000605@scalableinformatics.com> Message-ID: <4A8F0D64.7050702@ias.edu> Joe Landman wrote: > > So do most people. If your node crashes, do you get data loss? If your > server crashes, will you get data loss? I am guessing that if you do an > > hdparm -W /dev/sd* Just following this thread. 
When I try that command, I get an error: hdparm -W /dev/sd* -W: missing value (0/1) /dev/sda: /dev/sda1: /dev/sda2: /dev/sdb: /dev/sdc: /dev/sdd: /dev/sde: /dev/sdf: /dev/sdg: -- Prentice From james.p.lux at jpl.nasa.gov Fri Aug 21 14:24:29 2009 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 21 Aug 2009 14:24:29 -0700 Subject: [Beowulf] modular motherboard prototype In-Reply-To: <4A8F0375.8010902@tamu.edu> References: <20090821201554.GD314@bx9.net> <4A8F0375.8010902@tamu.edu> Message-ID: > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] > On Behalf Of Gerry Creager > Sent: Friday, August 21, 2009 1:29 PM > To: Greg Lindahl > Cc: beowulf at beowulf.org > Subject: Re: [Beowulf] modular motherboard prototype > > Greg Lindahl wrote: > > On Fri, Aug 21, 2009 at 12:04:09PM -0700, Lux, Jim (337C) wrote: > > > >> We can guess, though.. It's an ARM running at 72 MHz, for which they > >> claim 64 Dhrystone MIPS. Some of those are going to be burned in > >> interprocess communication. The LPC2368 (which is what they use) > >> datasheet says it draws 125mA at 3.3V running at 72MHz. (I'm pretty > >> sure that doesn't include power to drive I/O pins, which is > >> separately supplied). That's about half a watt. So they're getting > >> about 128 DMIPS/Watt > > > > In comparison, a Sheevaplug is $100, 1.2 Ghz ARM, 15 watts max for > the > > whole thing, and you can use gigE networking and program it with MPI. > > I suppose it depends on how weird you like your hobbies to be :-) > > > For Jim's projects to work, however, he's gotta also have the spacecraft > fly information to keep the PowerSquid in formation, so the SheevaPlugs > can all get power. Then there's the real long extension cord back to > those AC-procuding solar cells.... But that's work.. For hobbies, it's different. Hmm. 3.3V power.. I wonder if you could string em in series and run them off a rectified and filtered wall socket, just like Christmas tree lights. One could do all kinds of interesting things with individual processors on each bulb.. display patterns, messages. If they had photosensors, you could build your own fireflies, and have them blink in synchronism, etc. From gus at ldeo.columbia.edu Fri Aug 21 14:46:42 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 21 Aug 2009 17:46:42 -0400 Subject: hdparm error: Was Re: [Beowulf] Help for terrible NFS write performance In-Reply-To: <4A8F0D64.7050702@ias.edu> References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> <4A8EDFD5.9080909@cora.nwra.com> <4A8EE41B.7000605@scalableinformatics.com> <4A8F0D64.7050702@ias.edu> Message-ID: <4A8F15C2.7040009@ldeo.columbia.edu> Prentice Bisbal wrote: > Joe Landman wrote: >> So do most people. If your node crashes, do you get data loss? If your >> server crashes, will you get data loss? I am guessing that if you do an >> >> hdparm -W /dev/sd* > > Just following this thread. When I try that command, I get an error: > > hdparm -W /dev/sd* > -W: missing value (0/1) > > /dev/sda: > > > /dev/sdg: > > -- > Prentice > Somehow it works on Fedora 10, but not on CentOS 4 and 5. # uname -r 2.6.27.25-170.2.72.fc10.i686 # hdparm -W /dev/sda /dev/sda: write-caching = 1 (on) ** # uname -r 2.6.18-92.1.22.el5 # hdparm -W /dev/sda -W: missing value (0/1) /dev/sda: ** The hdparm man page says: "Some options may work correctly only with the latest kernels." 
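A small sketch of how to work with the older hdparm builds (the device name is a placeholder, and this assumes the drive answers an ATA identify request):

# hdparm 6.x and earlier only accept -W with an explicit value:
hdparm -W0 /dev/sda      # turn drive write caching off
hdparm -W1 /dev/sda      # turn it back on
# to merely query the current state on those versions, the identify data helps:
hdparm -I /dev/sda | grep -i 'write cache'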
Gus Correa From coutinho at dcc.ufmg.br Fri Aug 21 15:17:34 2009 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Fri, 21 Aug 2009 19:17:34 -0300 Subject: [Beowulf] modular motherboard prototype In-Reply-To: <20090821201554.GD314@bx9.net> References: <20090821201554.GD314@bx9.net> Message-ID: 2009/8/21 Greg Lindahl > On Fri, Aug 21, 2009 at 12:04:09PM -0700, Lux, Jim (337C) wrote: > > > We can guess, though.. It's an ARM running at 72 MHz, for which they > > claim 64 Dhrystone MIPS. Some of those are going to be burned in > > interprocess communication. The LPC2368 (which is what they use) > > datasheet says it draws 125mA at 3.3V running at 72MHz. (I'm pretty > > sure that doesn't include power to drive I/O pins, which is > > separately supplied). That's about half a watt. So they're getting > > about 128 DMIPS/Watt > > In comparison, a Sheevaplug is $100, 1.2 Ghz ARM, 15 watts max for the > whole thing, and you can use gigE networking and program it with MPI. > I suppose it depends on how weird you like your hobbies to be :-) They extended the fast array of wimpy nodes to a matrix of wimpy nodes! :-) -------------- next part -------------- An HTML attachment was scrubbed... URL: From dave.cunningham at lmco.com Fri Aug 21 10:54:18 2009 From: dave.cunningham at lmco.com (Cunningham, Dave) Date: Fri, 21 Aug 2009 11:54:18 -0600 Subject: [Beowulf] Help for terrible NFS write performance In-Reply-To: <4A8EDB89.7090409@scalableinformatics.com> References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> Message-ID: Are you suggesting that running nosync in an hpc environment might be a good idea ? We had that discussion with our integrator and were assured that running nosync was the path to damnation and would probably grow hair on our palms. What are your thoughts on the tradeoff ? Dave Cunningham -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Joe Landman Sent: Friday, August 21, 2009 10:38 AM To: Orion Poplawski Cc: Beowulf List Subject: Re: [Beowulf] Help for terrible NFS write performance Orion Poplawski wrote: > I'm trying to improve the terrible NFS (write in particular) performance > I'm seeing. Pure network performance does not appear to be an issue as > I can hit 120MB/s reading which should be about the limit for gigE. > Perhaps the local disk performance is not what it should be. Any help > would be greatly appreciated. Using bonnie++ for benchmarks. > > Server: > > Dual proc dual core opteron 2GHz > 8GB RAM > CentOS 4.7 > kernel 2.6.9-78.0.22.plus.c4smp > 3 8-port Marvell MV88SX6081 SATAII controllers > sata_mv 3.6.2 driver > Ethernet controller: nVidia Corporation MCP55 Ethernet (rev a3) > MTU 8982 > > Arrays are linux md arrays of 6 disks with 2 on each controller. 64k > cunks. ext3 filesystem. If I had to bet, ext3 would have much to do with this ... though, honestly, md RAID write performance over NFS is nothing to write home about. We can get ~350MB/s on our DeltaV's, but this takes lots of work. > > "working" - raid0 ST31000340AS 1TB drives > local perf: 224-240MB/s write, 135MB/s rewrite, 390-400MB/s read > "cora6" - raid5 ST31500341AS 1.5TB drives > local perf: 84MB/s write, 42MB/s rewrite, 161-166MB/s read > > /etc/exports: > /export *.cora.nwra.com(rw,sync,fsid=0) > /export/cora6 *.cora.nwra.com(rw,sync,nohide) > /export/working *.cora.nwra.com(rw,sync,nohide) Ok. There it is... Sync. Don't need to see anything else. That is it. 
-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at pbm.com Fri Aug 21 22:44:44 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 21 Aug 2009 22:44:44 -0700 Subject: [Beowulf] Help for terrible NFS write performance In-Reply-To: References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> Message-ID: <20090822054444.GA8950@bx9.net> > Are you suggesting that running nosync in an hpc environment might > be a good idea ? We had that discussion with our integrator and > were assured that running nosync was the path to damnation and would > probably grow hair on our palms. Ever since "nosync" was invented, almost all people have used it almost all of the time. It provides a huge benefit with a very modest chance of damnation and hair growth. "sync" still has a chance of damnation; anyone who tells you that it's perfect is lying. I would suggest getting a new integrator. Really. This is basic stuff. -- greg From bcostescu at gmail.com Sat Aug 22 18:17:08 2009 From: bcostescu at gmail.com (Bogdan Costescu) Date: Sun, 23 Aug 2009 03:17:08 +0200 Subject: [Beowulf] nearly future of Larrabee In-Reply-To: References: Message-ID: 2009/8/21 Mikhail Kuzminsky : > Recently I didn't see any words about Larrabee-based servers - only about > graphical cards. I have attended a talk by someone from Intel, but not someone working on Larrabee, so the information might not be current or accurate. He mentioned that the initial launch will only be with graphics cards. Intel sees Larrabee as an addition to the system and not as the main component of the system, so it's unlikely to have it as the main CPU; it's still possible to have it in some other form than a graphics card though. > Q1. Is there the plans to build Larrabee-based motherboards (in particular > in 2010) ? >From what I have seen, it's unclear what a hypothetical Larrabee motherboard should contain. The schematics that I've seen mentioned the ways the cores connect to the shared cache, but only mentioned a (shared) memory controller, no I/O controller. > If Larrabee will be in the form of graphical card (the most probable case) - > Q2. What will be the interface - one slot PCI-E v.2 x16 ? In the presentation that I've seen, it was not clearly specified, but implied, as this is the current way of interfacing with a graphics card. > Q3. Does it means that Larrabee will give essential speedup also on relative > short vectors ? I don't quite understand your question... > And is there some preliminary articles w/estimation of Larrabee DP > performance ? This question was asked, the answer was that there are no figures yet as the only official way to play with a Larrabee core now is through a simulator which makes performance figures irrelevant. > Q4. Is there some rumours about direct Larrabee support w/Intel ifort or PGI > compilers in 2010 ? The core is mainly a P5 with added vector instructions, so most of the code generation should be done already for several years by Intel, PGI and other compilers, even gcc. 
Only the new vector instructions need to be added and the compiler be taught to do vectorization using them; this might not be as easy as it sounds because the new instructions don't only deal with a larger set of bits but also with f.e. scatter/gather/masking array elements. One other detail that was mentioned was a difference with respect with current nVidia & AMD GPUs: these are good at doing the same thing to lots of data in parallel (SIMD), while the Larrabee cores will also be good at doing sequences of operations - workflows - where core 1 always does the same operation on data from memory, core 2 always does the same operation, different from core 1, to data coming from core 1, etc. I haven't kept up to date with compiler technology, so I don't know how fit are current compilers to detect these workflows and generate such code. > Q5. How much may costs Larrabee-based hardware in 2010 ? I hope it'll be > lower $10000. Any more exact predictions ? The launch as graphics cards suggests to me that they will compete in price with similar offerings from nVidia and AMD. Bogdan From trainor at divination.biz Sat Aug 22 02:37:05 2009 From: trainor at divination.biz (Douglas J. Trainor) Date: Sat, 22 Aug 2009 05:37:05 -0400 Subject: [Beowulf] First Workshop on High-Performance Computing in India Message-ID: <418F3128-31A8-40B2-807A-346B2B857759@divination.biz> From: asriniva at cs.fsu.edu Subject: [SIAM-SC] Re: CFP: Student Research Symposium HiPC 2009 CALL FOR PARTICIPATION First Workshop on High-Performance Computing in India Nov. 20, 2009 Portland, OR, USA (to be held in Conjuction with Supercomputing 2009) ATIP's First Workshop on HPC in India will be held on Nov. 20, 2009 at Portland, OR, USA in conjunction with the Supercompting 2009 Conference. The main goal of this workshop is to showcase Indian research on HPC at SC-2009. The workshop also serves several other purposes, including bringing together leading researchers in various disciplines who use HPC extensively, to discuss new developments and needs related to HPC. It would also enable networking with HPC researchers from all over the world which would lead to potential collaboration. Thus the overall objective is to stimulate discussions on the use of HPC, to define the grand challenge problems in these areas, and how one could derive benefits from knowing HPC work in related disciplines. The workshop would include a significant set of presentations and panels from a delegation of researchers from Indian Institutions, Research Laboratories, Industry and Government Agencies. Student posters from graduate students would also be presented at the workshop. An initial list of confirmed speakers include: Indian Government Plans and Programs * N. Balakrishnan (Indian Institute of Science, Bangalore) * Shailesh Nayak (Secretary, Min. of Earth Sciences, Govt. of India) Indian HPC Systems Research and Data Centres * Subrata Chattopadhyay(CDAC, Bangalore) * R. Govindarajan (Indian Institute of Science, Bangalore) * P.K. Sinha (CDAC, Pune) * Sathish Vadhiyar (Indian Institute of Science, Bangalore) Science and Engineering Applications in India * B. Jayaram (Indian Institute of Technology, New Delhi) * Amalendu Chandra (Indian Institute of Technology, Kanpur) * E.D. Jemmis (Indian Institute of Science Education Research, Thiruvananthapuram) * S. K. Mittal (Indian Institute of Technology, Kanpur) * Ravi S. Nanjundiah (Indian Institute of Science, Bangalore) * N. Balakrishnan (Indian Institute of Science, Bangalore) * S. 
Balasubramanian (JN Centre for Advanced Scientific Research, Bangalore) * Saraswathi Vishveshwara (Indian Institute of Science, Bangalore) HPC Vendors * Cray * IBM * Netweb Panel on opportunities for Indo-US collaborations * TBA The Workshop will be open to all SC09 participants. For more information, please visit: http://www.serc.iisc.ernet.in/hpit or http://atip.org/index.php?option=com_content&view=article&id=7069 Please reply to: schpcws at serc.iisc.ernet.in _______________________________________________ SIAM-SC mailing list To post messages to the list please send them to: SIAM-SC at siam.org http://lists.siam.org/mailman/listinfo/siam-sc From kilian.cavalotti.work at gmail.com Sun Aug 23 23:29:47 2009 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Mon, 24 Aug 2009 08:29:47 +0200 Subject: [Beowulf] Re: amd 3 and 6 core processors In-Reply-To: <778088694.1715931250807734460.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> References: <778088694.1715931250807734460.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: On Fri, Aug 21, 2009 at 12:35 AM, wrote: > I would also like to know how Intel and > AMD are disabling/degrading the cores. They very likely have built > in circuits that they can "burn out" to ensure physical incapacity. Still, > perhaps it is done another way. At least for AMD's Phenom II X3, re-enabling a disabled core is a simple matter of changing a BIOS setting. See http://www.tomshardware.com/news/amd-phenom-cpu,7080.html Cheers, -- Kilian From eagles051387 at gmail.com Mon Aug 24 00:41:12 2009 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Mon, 24 Aug 2009 09:41:12 +0200 Subject: [Beowulf] Re: amd 3 and 6 core processors In-Reply-To: References: <778088694.1715931250807734460.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: Kilian, but how would you know that the 4th core isn't faulty? -------------- next part -------------- An HTML attachment was scrubbed... URL: From kilian.cavalotti.work at gmail.com Mon Aug 24 03:10:34 2009 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Mon, 24 Aug 2009 12:10:34 +0200 Subject: [Beowulf] Re: amd 3 and 6 core processors In-Reply-To: References: <778088694.1715931250807734460.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: On Mon, Aug 24, 2009 at 9:41 AM, Jonathan Aquilina wrote: > Kilian, but how would you know that the 4th core isn't faulty? Ah, but you don't! That's the catch. No such thing as a free lunch, remember? :) You may get lucky and find out that your disabled core has been crippled only to supply cheaper 3-core demand, as David Mathog suggested. Or you may even get a slightly disabled core, which won't cause any trouble in your email writing or Crysis gaming. But chances are, if your needs are more HPC-centric, that you won't really be able to fully take advantage of that fourth core after all. Maybe it's worth trying, maybe it's not; it depends on what you do.
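One cheap way to vet a re-enabled core before trusting it with real work, sketched below (the core number is arbitrary and the 'stress' utility is assumed to be installed; this catches crashes and machine checks, not silent arithmetic errors):

# pin a burn-in load to the suspect core, here core 3, for an hour
taskset -c 3 stress --cpu 1 --timeout 3600
# any machine-check events it provokes land in the kernel log
dmesg | grep -i -e 'machine check' -e mce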
Cheers, -- Kilian From gus at ldeo.columbia.edu Mon Aug 24 08:06:17 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 24 Aug 2009 11:06:17 -0400 Subject: hdparm error: Was Re: [Beowulf] Help for terrible NFS write performance In-Reply-To: References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> <4A8EDFD5.9080909@cora.nwra.com> <4A8EE41B.7000605@scalableinformatics.com> <4A8F0D64.7050702@ias.edu> <4A8F15C2.7040009@ldeo.columbia.edu> Message-ID: <4A92AC69.1080906@ldeo.columbia.edu> Mark Hahn wrote: >> Somehow it works on Fedora 10, but not on CentOS 4 and 5. > > what version does hdparm -v mention on each system? Hi Mark, list Better late than never: 8.6 on Fedora 10, 6.6 on CentOS 5, 5.7 on CentOS 4. I'd guess 6.6 and earlier are too old, even the man pages are different. Gus Correa From gus at ldeo.columbia.edu Mon Aug 24 08:10:46 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 24 Aug 2009 11:10:46 -0400 Subject: hdparm error: Was Re: [Beowulf] Help for terrible NFS write performance In-Reply-To: References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> <4A8EDFD5.9080909@cora.nwra.com> <4A8EE41B.7000605@scalableinformatics.com> <4A8F0D64.7050702@ias.edu> <4A8F15C2.7040009@ldeo.columbia.edu> Message-ID: <4A92AD76.3030403@ldeo.columbia.edu> Mark Hahn wrote: >> Somehow it works on Fedora 10, but not on CentOS 4 and 5. > > what version does hdparm -v mention on each system? Hi Mark, list Not sure you want version (-V, sent on previous email) or defaults (-v). hdparm -v doesn't work on either system, returns usage/help, not the defaults, which it was supposed to do. Gus Correa From kus at free.net Mon Aug 24 09:21:40 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Mon, 24 Aug 2009 20:21:40 +0400 Subject: [Beowulf] nearly future of Larrabee In-Reply-To: Message-ID: In message from Bogdan Costescu (Sun, 23 Aug 2009 03:17:08 +0200): >2009/8/21 Mikhail Kuzminsky : >> Q3. Does it means that Larrabee will give essential speedup also on >>relative >> short vectors ? > >I don't quite understand your question... > For example, will DAXPY give essential speedup (percent of peak performance) for N=10 or 100 for example (for matrix and vector), and will DGEMM give high performance for medium sizes of matrices - or will we need large N values - for example, 1000, 10000 etc ? As for gather/scatter etc. for vector processing, the compilers for Cray T90/C90 ... Cray 1, NEC SX-6/5/4... perform, I believe, all the necessary things. Mikhail Mikhail From henning.fehrmann at aei.mpg.de Mon Aug 24 01:58:56 2009 From: henning.fehrmann at aei.mpg.de (Henning Fehrmann) Date: Mon, 24 Aug 2009 10:58:56 +0200 Subject: [Beowulf] HD undetectable errors In-Reply-To: <200908211833.21485.cap@nsc.liu.se> References: <20090821132532.GA16945@gretchen.aei.mpg.de> <200908211833.21485.cap@nsc.liu.se> Message-ID: <20090824085856.GA20141@gretchen.aei.mpg.de> Hello Peter, Thank you for the answer. > > Yes. But most won't, and it will hurt quite a lot performance-wise. I know, for > example, that our IBM DS4700 with updated firmware can > enable "verify-on-read". Is it easily possible to switch this feature on and off? E.g., one does a test from time to time, but for everyday usage one avoids "verify-on-read"? > Remember that you have to calculate this against the amount of corrupt data, > not the total amount of data. My hope is that corrupted data will be repaired by the RAID controller.
The chance to loose the data on a RAID6 system is of the third order should be very small. My guess it that the silent corruption rate is higher if the RAID system does not have the "verify-on-read" feature. I try to get the numbers. I assume in this calculation that no bit flips occur on the buses or the controller which is already sort of naive. Thank you. Cheers, Henning From mmuratet at hudsonalpha.org Mon Aug 24 02:40:22 2009 From: mmuratet at hudsonalpha.org (Michael Muratet) Date: Mon, 24 Aug 2009 04:40:22 -0500 Subject: [Beowulf] Configuring nodes on a scyld cluster Message-ID: <93AC2CF8-3096-487E-BC08-FBC644C5C62C@hudsonalpha.org> Greetings I'm not sure if this is more appropriate for the beowulf or ganglia list, please forgive a cross-post. I have been trying to get ganglia (v 3.0.7) to record info from the nodes of my scyld cluster. gmond was not installed on any of the compute nodes nor was gmond.conf in /etc of any of the compute nodes when we got it from the vendor. I didn't see much in the documentation about configuring nodes but I did find a 'howto' at http://www.krazyworks.com/installing-and-configuring- ganglia/. I have been testing on one of the nodes as follows. I copied gmond from /usr/sbin on the head node to the subject compute node /usr/ sbin. I ran gmond --default_config and saved the output and changed it thus: scyld:etc root$ bpsh 5 cat /etc/gmond.conf /* This configuration is as close to 2.5.x default behavior as possible The values closely match ./gmond/metric.h definitions in 2.5.x */ globals { daemonize = yes setuid = yes user = nobody debug_level = 0 max_udp_msg_len = 1472 mute = no deaf = no host_dmax = 0 /*secs */ cleanup_threshold = 300 /*secs */ gexec = no } /* If a cluster attribute is specified, then all gmond hosts are wrapped inside * of a tag. If you do not specify a cluster tag, then all will * NOT be wrapped inside of a tag. */ cluster { name = "mendel" owner = "unspecified" latlong = "unspecified" url = "unspecified" } /* The host section describes attributes of the host, like the location */ host { location = "unspecified" } /* Feel free to specify as many udp_send_channels as you like. Gmond used to only support having a single channel */ udp_send_channel { port = 8649 host = 10.54.50.150 /* head node's IP */ } /* You can specify as many udp_recv_channels as you like as well. */ /* You can specify as many tcp_accept_channels as you like to share an xml description of the state of the cluster */ tcp_accept_channel { port = 8649 } I modified gmond on the head node thus: /* This configuration is as close to 2.5.x default behavior as possible The values closely match ./gmond/metric.h definitions in 2.5.x */ globals { daemonize = yes setuid = yes user = nobody debug_level = 0 max_udp_msg_len = 1472 mute = no deaf = no host_dmax = 0 /*secs */ cleanup_threshold = 300 /*secs */ gexec = no } /* If a cluster attribute is specified, then all gmond hosts are wrapped inside * of a tag. If you do not specify a cluster tag, then all will * NOT be wrapped inside of a tag. */ cluster { name = "mendel" owner = "unspecified" latlong = "unspecified" url = "unspecified" } /* The host section describes attributes of the host, like the location */ host { location = "unspecified" } /* Feel free to specify as many udp_send_channels as you like. Gmond used to only support having a single channel */ /* You can specify as many udp_recv_channels as you like as well. 
*/ udp_recv_channel { port = 8649 } /* You can specify as many tcp_accept_channels as you like to share an xml description of the state of the cluster */ tcp_accept_channel { port = 8649 } I started gmond on the compute node bpsh 5 gmond and restarted gmond and gmetad. I don't see my node running gmond. ps -elf | grep gmond on the compute node returns nothing. I tried to add gmond as a service on the compute node with the script at the krazy site but I get: scyld:~ root$ bpsh 5 chkconfig --add gmond service gmond does not support chkconfig and scyld:~ root$ bpsh 5 service gmond start /sbin/service: line 3: /etc/init.d/functions: No such file or directory I am at a loss over what to try next, it seems this should work. Any and all suggestions will be appreciated. Thanks Mike Michael Muratet, Ph.D. Senior Scientist HudsonAlpha Institute for Biotechnology mmuratet at hudsonalpha.org (256) 327-0473 (p) (256) 327-0966 (f) Room 4005 601 Genome Way Huntsville, Alabama 35806 From hawson at gmail.com Mon Aug 24 04:53:10 2009 From: hawson at gmail.com (Jesse Becker) Date: Mon, 24 Aug 2009 07:53:10 -0400 Subject: [Beowulf] Re: [Ganglia-general] Configuring nodes on a scyld cluster In-Reply-To: <93AC2CF8-3096-487E-BC08-FBC644C5C62C@hudsonalpha.org> References: <93AC2CF8-3096-487E-BC08-FBC644C5C62C@hudsonalpha.org> Message-ID: On Mon, Aug 24, 2009 at 05:40, Michael Muratet wrote: > Greetings > > I'm not sure if this is more appropriate for the beowulf or ganglia > list, please forgive a cross-post. I have been trying to get ganglia > (v 3.0.7) to record info from the nodes of my scyld cluster. gmond was If I recall, Scyld clusters (and the successor, ClusterWare), run a modified version of Ganglia, mostly for the data display, but not collection. For data collection, they run their own program called 'bproc', which does some of the same things as gmond. There was a short discussion about bproc/gmond in the Beowulf mailing list about a year or year and half ago. Also, the hacked-up version of ganglia that they ship is based off 2.5.7 I think, so there is a good reason to upgrade. However, it should work, but with some tweaking. > not installed on any of the compute nodes nor was gmond.conf in /etc > of any of the compute nodes when we got it from the vendor. I didn't > see much in the documentation about configuring nodes but I did find a > 'howto' at http://www.krazyworks.com/installing-and-configuring- > ganglia/. I have been testing on one of the nodes as follows. I copied > gmond from /usr/sbin on the head node to the subject compute node /usr/ > sbin. I ran gmond --default_config and saved the output and changed it > thus: > > scyld:etc root$ bpsh 5 cat /etc/gmond.conf > /* This configuration is as close to 2.5.x default behavior as possible > ? ?The values closely match ./gmond/metric.h definitions in 2.5.x */ > globals { > ? daemonize = yes > ? setuid = yes > ? user = nobody > ? debug_level = 0 > ? max_udp_msg_len = 1472 > ? mute = no > ? deaf = no > ? host_dmax = 0 /*secs */ > ? cleanup_threshold = 300 /*secs */ > ? gexec = no > } > > /* If a cluster attribute is specified, then all gmond hosts are > wrapped inside > ?* of a tag. ?If you do not specify a cluster tag, then all > will > ?* NOT be wrapped inside of a tag. */ > cluster { > ? name = "mendel" > ? owner = "unspecified" > ? latlong = "unspecified" > ? url = "unspecified" > } > > /* The host section describes attributes of the host, like the > location */ > host { > ? 
location = "unspecified" > } > > /* Feel free to specify as many udp_send_channels as you like. ?Gmond > ? ?used to only support having a single channel */ > udp_send_channel { > ? port = 8649 > ? host = 10.54.50.150 /* head node's IP */ > } > > /* You can specify as many udp_recv_channels as you like as well. */ > > /* You can specify as many tcp_accept_channels as you like to share > ? ?an xml description of the state of the cluster */ > tcp_accept_channel { > ? port = 8649 > } > > I modified gmond on the head node thus: > > /* This configuration is as close to 2.5.x default behavior as possible > ? ?The values closely match ./gmond/metric.h definitions in 2.5.x */ > globals { > ? daemonize = yes > ? setuid = yes > ? user = nobody > ? debug_level = 0 > ? max_udp_msg_len = 1472 > ? mute = no > ? deaf = no > ? host_dmax = 0 /*secs */ > ? cleanup_threshold = 300 /*secs */ > ? gexec = no > } > > /* If a cluster attribute is specified, then all gmond hosts are > wrapped inside > ?* of a tag. ?If you do not specify a cluster tag, then all > will > ?* NOT be wrapped inside of a tag. */ > cluster { > ? name = "mendel" > ? owner = "unspecified" > ? latlong = "unspecified" > ? url = "unspecified" > } > > /* The host section describes attributes of the host, like the > location */ > host { > ? location = "unspecified" > } > > /* Feel free to specify as many udp_send_channels as you like. ?Gmond > ? ?used to only support having a single channel */ > > /* You can specify as many udp_recv_channels as you like as well. */ > udp_recv_channel { > ? port = 8649 > } > > /* You can specify as many tcp_accept_channels as you like to share > ? ?an xml description of the state of the cluster */ > tcp_accept_channel { > ? port = 8649 > } > > I started gmond on the compute node bpsh 5 gmond and restarted gmond > and gmetad. I don't see my node running gmond. ps -elf | grep gmond on > the compute node returns nothing. I tried to add gmond as a service on > the compute node with the script at the krazy site ?but I get: > > scyld:~ root$ bpsh 5 chkconfig --add gmond > service gmond does not support chkconfig Looks like the startup script for gmond doesn't natively support chkconfig. This isn't a huge problem. You will, however, have to manually create symlinks in /etc/rc3.d that point into /etc/init.d. Basically, you want two links that look something like this: /etc/rc3.d/S99gmond -> /etc/init.d/gmond /etc/rc3.d/K01gmond -> /etc/init.d/gmond I'd do this on the head node *only*. Scyld clusters are a bit funny if you have never used them before. There's almost *nothing* on the compute nodes except local data partitions, and they don't run a 'normal' userspace either. > and > > scyld:~ root$ bpsh 5 service gmond start > /sbin/service: line 3: /etc/init.d/functions: No such file or directory Unsurprising... Run 'bpsh 5 ls -l /etc' and you will see why this error occurs: there's probably almost nothing in /etc at all. > I am at a loss over what to try next, it seems this should work. Any > and all suggestions will be appreciated. Try running gmond directly, with debugging turned on (as a test): bpsh 5 /usr/sbin/gmond -c /etc/gmond.conf -d 2 and see what it complains about. 
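If gmond does stay up after that, a quick end-to-end check is to pull the XML its tcp_accept_channel serves, e.g. from the head node (NODE_IP is a placeholder for the compute node's address, and nc is assumed to be available; port 8649 is the one used in the config above):

# gmond answers on its tcp_accept_channel with an XML dump of its metrics
nc NODE_IP 8649 | head -20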
-- Jesse Becker GPG Fingerprint -- BD00 7AA4 4483 AFCC 82D0 2720 0083 0931 9A2B 06A2 From rezamirani at yahoo.com Fri Aug 21 23:38:27 2009 From: rezamirani at yahoo.com (Reza Mirani) Date: Fri, 21 Aug 2009 23:38:27 -0700 (PDT) Subject: [Beowulf] realtime network Message-ID: <525965.78444.qm@web56605.mail.re3.yahoo.com> Dear sir, I want to know more about Realtime networks and architectures. Did have any one who has some information about it ? ************** Reza Mirani HPC Technology Co.Ltd United Arab emirates Dubai Fax : +1- 413-473-1716 ************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From Craig.Tierney at noaa.gov Tue Aug 25 10:09:16 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Tue, 25 Aug 2009 11:09:16 -0600 Subject: [Beowulf] Anybody going with AMD Istanbul 6-core CPU's? In-Reply-To: References: Message-ID: <4A941ABC.6070604@noaa.gov> Steve Cousins wrote: > > I haven't seen anybody here talking about the 6-core AMD CPU's yet. Is > anybody trying these out? Anybody have real-world comparisons (say WRF) > of scalability of a 12-core system vs. a 16 thread Nehalem system? > > Thanks, > > Steve > We looked at them and the processor may have been available when we needed to take delivery. However, the platform that would take advantage of the chip, Fiorano, isn't out yet (or just out). When that gets released, then there will be something worth comparing. As far as using threading, I doubt that threading is going to buy you much for WRF. Minimal testing showed no benefit and it is more likely to cause confusion to the users than a small bump in speed. We would like to test it more in the future, but right now the users need cycles. Craig > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Craig Tierney (craig.tierney at noaa.gov) From peter.st.john at gmail.com Tue Aug 25 11:00:04 2009 From: peter.st.john at gmail.com (Peter St. John) Date: Tue, 25 Aug 2009 14:00:04 -0400 Subject: [Beowulf] realtime network In-Reply-To: <525965.78444.qm@web56605.mail.re3.yahoo.com> References: <525965.78444.qm@web56605.mail.re3.yahoo.com> Message-ID: Reza, I don't know what all this list would have to say about real-time; may be interesting. Almost 20 years ago I used OS 9 (Microware, not to be confused with Mac OS 9 predecessor of OS X) on motorola VME bus, which was purpose-built for real-time; see http://en.wikipedia.org/wiki/OS9 I think that's still alive and you can get a port to intel processors. More recently, I've used VMS (see http://en.wikipedia.org/wiki/OpenVMS) on DEC Alpha, was popular for industrial transaction processing. It used to be that unix wasn't considered reliable for real-time but I'm sure that's changed, there must be a flavor out there somewhere. Maybe look at AIX, but see below. I see Wiki has lists of extant real-time OSes: http://en.wikipedia.org/wiki/Real-time_operating_system#Examples Peter On Sat, Aug 22, 2009 at 2:38 AM, Reza Mirani wrote: > Dear sir, > > I want to know more about Realtime networks and architectures. Did have any > one who has some information about it ? 
> > > ************** > Reza Mirani > HPC Technology Co.Ltd > United Arab emirates > Dubai > Fax : +1- 413-473-1716 > ************** > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From niftyompi at niftyegg.com Tue Aug 25 13:37:57 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Tue, 25 Aug 2009 13:37:57 -0700 Subject: [Beowulf] HD undetectable errors In-Reply-To: <20090821132532.GA16945@gretchen.aei.mpg.de> References: <20090821132532.GA16945@gretchen.aei.mpg.de> Message-ID: <20090825203757.GA2917@compegg> Not an expert on this.... some thoughts below. On Fri, Aug 21, 2009 at 03:25:32PM +0200, Henning Fehrmann wrote: > Hello, > > a typical rate for data not recovered in a read operation on a HD is > 1 per 10^15 bit reads. > > If one fills a 100TByte file server the probability of loosing data > is of the order of 1. > Off course, one could circumvent this problem by using RAID5 or RAID6. > Most of the controller do not check the parity if they read data and > here the trouble begins. > I can't recall the rate for undetectable errors but this might be few > orders of magnitude smaller than 1 per 10^15 bit reads. However, given > the fact that one deals nowadays with few hundred TBytes of data this > might happen from time to time without being realized. > > One could lower the rate by forcing the RAID controller to check the > parity information in a read process. Are there RAID controller which > are able to perform this? > Another solution might be the useage of file systems which have additional > checksums for the blocks like zfs or qfs. This even prevents data > corruption due to undetected bit flips on the bus or the RAID > controller. > Does somebody know the size of the checksum and the rate of undetected > errors for qfs? > For zfs it is 256 bit per 512Byte data. > One option is the fletcher2 algorithm to compute the checksum. > Does somebody know the rate of undetectable bit flips for such a > setting? > > Are there any other file systems doing block-wise checksumming? I do not think you have the statistics correct but the issue is very real. There are many archival and site policies that add their own check sum and error recovery codes to their archives because of the value or sensitivity of the data. All disks I know of have a CRC/ECC code on the media that is checked at read time by hardware, Seagate says one 512 byte sector in 10^16 reads error rate. The RAID however cannot recheck its parity without re-reading all the spindles and recomputing+check of the parity, which is slow, but it could. However, adding the extra read does not solve the issue at two levels * Most RAID devices are designed to react to the disk's reported error the 10^16 number is a value for undetected and unreported errors thus the a RAID will not have it's redundancy mechanism triggered. * Most RAID designs would not be able to recover from an all spindle read and parity recompute+check that detected an error. i.e. the redundancy in common RAIDs cannot discover which of the devices presented bogus data. And it is unknowable if the error is a single bit or many bits. In the simple mirror case when the data does not match -- which is correct, A or B? In most more complex RAID designs the same problem exists. 
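Linux md makes this point concrete: you can ask an array to re-read and compare all members, but the result is only a count of mismatches, with no indication of which member held the bad data (the array name is a placeholder):

# start a scrub of /dev/md0
echo check > /sys/block/md0/md/sync_action
# after it completes, a non-zero count means copies/parity disagreed somewhere
cat /sys/block/md0/md/mismatch_cnt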
In a triple redundant mirror case a majority could rule. At single disk read speeds of 15MB/s one sector in 10^16 reads one error in +100year? With a failure in time on the order of 100 years other issues would seem (to me) to dominate the reliability of a storage system. But statistics do generate unexpected results. I do know of at least one site that has detected a single bit data storage error in a multiple TB RAID that went undetected by hardware and the OS. Compressed data makes this problem even more interesting because many of the stream tools (encryption or compression) fail "badly" and depending on where the bits flip a little or a LOT of data can be lost. More to the point are the number of times the dice are rolled with data. Network link, PCIe, Processor data paths, memory data paths, disk controller data paths, device links, read data paths, write data paths.... Disks are the strong link in this data chain in way too many cases. This question from above is interesting. + Does somebody know the size of the checksum and the rate of undetected + errors for qfs? The error rate is not a simple function of qfs it is most likely a function of the underlying error rate in the hardware involved in qfs. Since QFS can extend its reach from disk to tape, to/from disk cache, to optical to other... each media needs to be understood as well as the statistics associated with all the hidden transfers. With basic error rate info for all the hardware that touches the data some swag on the file system error rate and undetected error rates might begin. I think the Seagate 10^16 number is simply the hash statistics for their ReedSolomon ECC/CRC length and 2^512 permutations of data not the error rate. i.e. the quality of the code not the error rate of the device. However, It does make sense to me to generate and maintain site specific meta data for all valuable data files to include both detection (yes tamper detection too) and recovery codes. I would extend this to all data with the hope that any problems might be seen first on inconsequential files. Tripwire might be a good model for starting out on this. I should note that the three 'big' error rate problems I have worked on in the past 25 years had their root cause in an issue not understood or considered at design time so empirical data from the customer was critical. Data sheets and design document conclusions just missed the issue. These experiences taught me to be cautious with storage statistics. Looming in the dark clouds is a need for owning your own data integrity. It seems obvious to me in the growing business of cloud computing and cloud storage that you need to "trust but verify" the integrity of your data. My thought on this is that external integrity methods are critical in the future. And do remember that "parity is for farmers." -- T o m M i t c h e l l Found me a new hat, now what? a From reuti at staff.uni-marburg.de Tue Aug 25 15:18:17 2009 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed, 26 Aug 2009 00:18:17 +0200 Subject: [Beowulf] HD undetectable errors In-Reply-To: <20090825203757.GA2917@compegg> References: <20090821132532.GA16945@gretchen.aei.mpg.de> <20090825203757.GA2917@compegg> Message-ID: Am 25.08.2009 um 22:37 schrieb Nifty Tom Mitchell: > > However, It does make sense to me to generate and maintain site > specific meta > data for all valuable data files to include both detection (yes tamper > detection too) and recovery codes. 
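The detection half of that can be bolted on today with ordinary tools, e.g. a checksum manifest kept next to each data tree (the paths below are placeholders):

# record checksums once, alongside the data
find /data/archive -type f -print0 | xargs -0 sha256sum > /data/archive.sha256
# re-verify later; silently flipped bits show up as FAILED lines
sha256sum -c /data/archive.sha256 | grep -v ': OK$'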
Sounds like using local par or par2 files along with their hash information about the original files. Maybe this could be implemented as FUSE filesystem for easy handling (which will automatically split the files, create the hashes and any number of par files you like). -- Reuti > I would extend this to all data with > the hope that any problems might be seen first on inconsequential > files. > Tripwire might be a good model for starting out on this. > > I should note that the three 'big' error rate problems I have > worked on > in the past 25 years had their root cause in an issue not understood > or considered at design time so empirical data from the customer was > critical. Data sheets and design document conclusions just missed > the issue. > These experiences taught me to be cautious with storage statistics. > > Looming in the dark clouds is a need for owning your own data > integrity. > It seems obvious to me in the growing business of cloud computing > and cloud storage > that you need to "trust but verify" the integrity of your data. > My thought > on this is that external integrity methods are critical in the future. > > And do remember that "parity is for farmers." > > > > -- > T o m M i t c h e l l > Found me a new hat, now what? > a > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From niftyompi at niftyegg.com Tue Aug 25 23:44:24 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Tue, 25 Aug 2009 23:44:24 -0700 Subject: [Beowulf] HD undetectable errors In-Reply-To: References: <20090821132532.GA16945@gretchen.aei.mpg.de> <20090825203757.GA2917@compegg> Message-ID: <20090826064424.GA2942@compegg> On Wed, Aug 26, 2009 at 12:18:17AM +0200, Reuti wrote: > Am 25.08.2009 um 22:37 schrieb Nifty Tom Mitchell: > >> >> However, It does make sense to me to generate and maintain site >> specific meta >> data for all valuable data files to include both detection (yes tamper >> detection too) and recovery codes. > > Sounds like using local par or par2 files along with their hash > information about the original files. Maybe this could be implemented as > FUSE filesystem for easy handling (which will automatically split the > files, create the hashes and any number of par files you like). > > -- Reuti Interesting... par2 looks close. It may solve my worries with Cloud storage. The differences between par and par2 are interesting. The various issues involving damaged files (even a single bit error) in the initial design of par were a limitation. http://www.par2.net/pardif.php I can see that there has been a lot of work done already making it is a good place to start. And yes slipping this under a FUSE filesystem might hide the pain for local storage. >> I would extend this to all data with >> the hope that any problems might be seen first on inconsequential >> files. >> Tripwire might be a good model for starting out on this. >> >> I should note that the three 'big' error rate problems I have worked on >> in the past 25 years had their root cause in an issue not understood >> or considered at design time so empirical data from the customer was >> critical. Data sheets and design document conclusions just missed the >> issue. >> These experiences taught me to be cautious with storage statistics. >> >> Looming in the dark clouds is a need for owning your own data >> integrity. 
>> It seems obvious to me in the growing business of cloud computing and >> cloud storage >> that you need to "trust but verify" the integrity of your data. My >> thought >> on this is that external integrity methods are critical in the future. >> >> And do remember that "parity is for farmers." >> >> >> >> -- >> T o m M i t c h e l l >> Found me a new hat, now what? >> a >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >> Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > -- T o m M i t c h e l l Found me a new hat, now what? From h-bugge at online.no Wed Aug 26 00:21:56 2009 From: h-bugge at online.no (=?ISO-8859-1?Q?H=E5kon_Bugge?=) Date: Wed, 26 Aug 2009 09:21:56 +0200 Subject: [Beowulf] Anybody going with AMD Istanbul 6-core CPU's? In-Reply-To: <4A941ABC.6070604@noaa.gov> References: <4A941ABC.6070604@noaa.gov> Message-ID: Craig, On Aug 25, 2009, at 19:09 , Craig Tierney wrote: > As far as using threading, I doubt that threading is going to buy you > much for WRF. Minimal testing showed no benefit and it is more likely > to cause confusion to the users than a small bump in speed. We > would like > to test it more in the future, but right now the users need cycles. This is consistent with my findings in "An Evaluation of Intel's Core i7 Architecture using a Comparative Approach". WRF, as embodied into the SPEC MPI2007 suite, ran 2% slower using threading on a single-node, dual-core Nehalem system. Out of the 13 apps constituting the suite, three ran slower (2, 3, and 4%), five ran +10% faster, with 122.tachyon excelling at a 35% speedup from threading. Thanks, Håkon From niftyompi at niftyegg.com Thu Aug 27 07:12:33 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Thu, 27 Aug 2009 07:12:33 -0700 Subject: [Beowulf] Re: amd 3 and 6 core processors In-Reply-To: References: <778088694.1715931250807734460.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: <20090827141233.GA3272@tosh2egg.ca.sanfran.comcast.net> On Mon, Aug 24, 2009 at 08:29:47AM +0200, Kilian CAVALOTTI wrote: > > On Fri, Aug 21, 2009 at 12:35 AM, wrote: > > I would also like to know how Intel and > > AMD are disabling/degrading the cores. They very likely have built > > in circuits that they can "burn out" to ensure physical incapacity. Still, > > perhaps it is done another way. > > At least for AMD's Phenom II X3, re-enabling a disabled core is a > simple matter of changing a BIOS setting. See > http://www.tomshardware.com/news/amd-phenom-cpu,7080.html Has anyone tinkered with disabling one core at a time and benchmarking the remaining set of cores with various parallel tests? I suspect that there is some asymmetry that might color processor affinity if the locality of the disabled core can be exposed. Things like interrupt servicing, cache line interactions, TLB state and IO channel latency come to mind. I guess AMD could just tell us.... -- T o m M i t c h e l l Found me a new hat, now what? From mdidomenico4 at gmail.com Thu Aug 27 13:27:02 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Thu, 27 Aug 2009 16:27:02 -0400 Subject: [Beowulf] lustre 'lctl dl' weirdness? Message-ID: I posted this to the lustre-discuss mailing list, but I have not heard anything all day; just wondering if anyone here might have any ideas?
We had a problem in the datacenter this morning where a bunch of servers went down hard, this included my lustre filesystem and just about every other machine in the building When i try to bring the MDS/MGS back online, it does mount, but an 'lctl dl' shows that everything is there and UP, but i know its not, because i have not mounted the OSS's. Is this some junk left over? Can/Should it be cleared out? If i go through and mount all the OSS's they mount and i can mount the filesystem on the client, but no ls or df of the mountpoint works I've tried various methods of recovery that i know of, but i can't seem to get the MDS/MGS to come up in what appears to be a clean state is there some magic command or file that needs to be deleted to abort everything and restart? Thanks From akerstens at penguincomputing.com Tue Aug 25 11:40:30 2009 From: akerstens at penguincomputing.com (Andre Kerstens) Date: Tue, 25 Aug 2009 11:40:30 -0700 Subject: [Beowulf] Configuring nodes on a scyld cluster In-Reply-To: <200908251801.n7PI1Bln024816@bluewest.scyld.com> References: <200908251801.n7PI1Bln024816@bluewest.scyld.com> Message-ID: <17E468C34A1ACD4FB9DEBBA406BB69C8013AA13E@orca.penguincomputing.com> Michael, On a cluster running Scyld Clusterware (are you running 4 or 5?) there is no need to install any Ganglia components on the compute nodes: the compute nodes communicate cluster information incl. ganglia info to the head node via the beostatus sendstats mechanism. If ganglia is not enabled yet on your cluster, you can do it as follows: Edit /etc/xinetd.d/beostat and change 'disable=yes' to 'disable=no' followed by: /sbin/chkconfig xinetd on /sbin/chkconfig httpd on /sbin/chkconfig gmetad on and service xinetd restart service httpd start service gemetad start Then point your web browser to http://localhost/ganglia and off you go. This information can be found in the release notes document of your Scyld cluster or in the Scyld admin guide. Cheers Andre ------------------------------ Message: 2 Date: Mon, 24 Aug 2009 04:40:22 -0500 From: Michael Muratet Subject: [Beowulf] Configuring nodes on a scyld cluster To: ganglia-general at lists.sourceforge.net Cc: beowulf at beowulf.org Message-ID: <93AC2CF8-3096-487E-BC08-FBC644C5C62C at hudsonalpha.org> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes Greetings I'm not sure if this is more appropriate for the beowulf or ganglia list, please forgive a cross-post. I have been trying to get ganglia (v 3.0.7) to record info from the nodes of my scyld cluster. gmond was not installed on any of the compute nodes nor was gmond.conf in /etc of any of the compute nodes when we got it from the vendor. I didn't see much in the documentation about configuring nodes but I did find a 'howto' at http://www.krazyworks.com/installing-and-configuring- ganglia/. I have been testing on one of the nodes as follows. I copied gmond from /usr/sbin on the head node to the subject compute node /usr/ sbin. I ran gmond --default_config and saved the output and changed it thus: scyld:etc root$ bpsh 5 cat /etc/gmond.conf /* This configuration is as close to 2.5.x default behavior as possible The values closely match ./gmond/metric.h definitions in 2.5.x */ globals { daemonize = yes setuid = yes user = nobody debug_level = 0 max_udp_msg_len = 1472 mute = no deaf = no host_dmax = 0 /*secs */ cleanup_threshold = 300 /*secs */ gexec = no } /* If a cluster attribute is specified, then all gmond hosts are wrapped inside * of a tag. 
If you do not specify a cluster tag, then all will * NOT be wrapped inside of a tag. */ cluster { name = "mendel" owner = "unspecified" latlong = "unspecified" url = "unspecified" } /* The host section describes attributes of the host, like the location */ host { location = "unspecified" } /* Feel free to specify as many udp_send_channels as you like. Gmond used to only support having a single channel */ udp_send_channel { port = 8649 host = 10.54.50.150 /* head node's IP */ } /* You can specify as many udp_recv_channels as you like as well. */ /* You can specify as many tcp_accept_channels as you like to share an xml description of the state of the cluster */ tcp_accept_channel { port = 8649 } I modified gmond on the head node thus: /* This configuration is as close to 2.5.x default behavior as possible The values closely match ./gmond/metric.h definitions in 2.5.x */ globals { daemonize = yes setuid = yes user = nobody debug_level = 0 max_udp_msg_len = 1472 mute = no deaf = no host_dmax = 0 /*secs */ cleanup_threshold = 300 /*secs */ gexec = no } /* If a cluster attribute is specified, then all gmond hosts are wrapped inside * of a tag. If you do not specify a cluster tag, then all will * NOT be wrapped inside of a tag. */ cluster { name = "mendel" owner = "unspecified" latlong = "unspecified" url = "unspecified" } /* The host section describes attributes of the host, like the location */ host { location = "unspecified" } /* Feel free to specify as many udp_send_channels as you like. Gmond used to only support having a single channel */ /* You can specify as many udp_recv_channels as you like as well. */ udp_recv_channel { port = 8649 } /* You can specify as many tcp_accept_channels as you like to share an xml description of the state of the cluster */ tcp_accept_channel { port = 8649 } I started gmond on the compute node bpsh 5 gmond and restarted gmond and gmetad. I don't see my node running gmond. ps -elf | grep gmond on the compute node returns nothing. I tried to add gmond as a service on the compute node with the script at the krazy site but I get: scyld:~ root$ bpsh 5 chkconfig --add gmond service gmond does not support chkconfig and scyld:~ root$ bpsh 5 service gmond start /sbin/service: line 3: /etc/init.d/functions: No such file or directory I am at a loss over what to try next, it seems this should work. Any and all suggestions will be appreciated. Thanks Mike Michael Muratet, Ph.D. Senior Scientist HudsonAlpha Institute for Biotechnology mmuratet at hudsonalpha.org (256) 327-0473 (p) (256) 327-0966 (f) Room 4005 601 Genome Way Huntsville, Alabama 35806 ------------------------------ From jclinton at advancedclustering.com Tue Aug 25 12:49:34 2009 From: jclinton at advancedclustering.com (Jason Clinton) Date: Tue, 25 Aug 2009 14:49:34 -0500 Subject: [Beowulf] Anybody going with AMD Istanbul 6-core CPU's? In-Reply-To: References: Message-ID: <588c11220908251249o120b6b96q4af0512a153d3f7c@mail.gmail.com> On Thu, Aug 20, 2009 at 11:01 AM, Steve Cousins wrote: > I haven't seen anybody here talking about the 6-core AMD CPU's yet. Is > anybody trying these out? Anybody have real-world comparisons (say WRF) of > scalability of a 12-core system vs. a 16 thread Nehalem system? I ran a benchmark awhile ago and published it on our blog: http://www.advancedclustering.com/company-blog/molecular-dynamics-amd-vs-intel.html A co-worker, Shane, is working on a WRF benchmark. In short, it depends on the code, but it's a formidable competitor. -- Jason D. 
Clinton, Advanced Clustering Technologies 913-643-0306, http://twitter.com/HPCClusterTech From mmuratet at hudsonalpha.org Tue Aug 25 15:13:08 2009 From: mmuratet at hudsonalpha.org (Michael Muratet) Date: Tue, 25 Aug 2009 17:13:08 -0500 Subject: [Beowulf] Configuring nodes on a scyld cluster In-Reply-To: <17E468C34A1ACD4FB9DEBBA406BB69C8013AA13E@orca.penguincomputing.com> References: <200908251801.n7PI1Bln024816@bluewest.scyld.com> <17E468C34A1ACD4FB9DEBBA406BB69C8013AA13E@orca.penguincomputing.com> Message-ID: <64985711-A8CD-4643-B0DD-DFD844F84194@hudsonalpha.org> On Aug 25, 2009, at 1:40 PM, Andre Kerstens wrote: > Michael, > > On a cluster running Scyld Clusterware (are you running 4 or 5?) there > is no need to install any Ganglia components on the compute nodes: the > compute nodes communicate cluster information incl. ganglia info to > the > head node via the beostatus sendstats mechanism. If ganglia is not > enabled yet on your cluster, you can do it as follows: > > Edit /etc/xinetd.d/beostat and change 'disable=yes' to 'disable=no' > followed by: > > /sbin/chkconfig xinetd on > /sbin/chkconfig httpd on > /sbin/chkconfig gmetad on > > and > > service xinetd restart > service httpd start > service gemetad start > > Then point your web browser to http://localhost/ganglia and off you > go. Andre Thanks for the info. Yes, I got that far. It is apparently also necessary to reboot the head node, and we're waiting for a slack moment to do that. Cheers Mike > > > This information can be found in the release notes document of your > Scyld cluster or in the Scyld admin guide. > > Cheers > Andre > > ------------------------------ > Message: 2 > Date: Mon, 24 Aug 2009 04:40:22 -0500 > From: Michael Muratet > Subject: [Beowulf] Configuring nodes on a scyld cluster > To: ganglia-general at lists.sourceforge.net > Cc: beowulf at beowulf.org > Message-ID: <93AC2CF8-3096-487E-BC08-FBC644C5C62C at hudsonalpha.org> > Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes > > Greetings > > I'm not sure if this is more appropriate for the beowulf or ganglia > list, please forgive a cross-post. I have been trying to get ganglia > (v > 3.0.7) to record info from the nodes of my scyld cluster. gmond was > not > installed on any of the compute nodes nor was gmond.conf in /etc of > any > of the compute nodes when we got it from the vendor. I didn't see much > in the documentation about configuring nodes but I did find a > 'howto' at > http://www.krazyworks.com/installing-and-configuring- > ganglia/. I have been testing on one of the nodes as follows. I copied > gmond from /usr/sbin on the head node to the subject compute node / > usr/ > sbin. I ran gmond --default_config and saved the output and changed it > thus: > > scyld:etc root$ bpsh 5 cat /etc/gmond.conf > /* This configuration is as close to 2.5.x default behavior as > possible > The values closely match ./gmond/metric.h definitions in 2.5.x */ > globals { > daemonize = yes > setuid = yes > user = nobody > debug_level = 0 > max_udp_msg_len = 1472 > mute = no > deaf = no > host_dmax = 0 /*secs */ > cleanup_threshold = 300 /*secs */ > gexec = no > } > > /* If a cluster attribute is specified, then all gmond hosts are > wrapped > inside > * of a tag. If you do not specify a cluster tag, then all > will > * NOT be wrapped inside of a tag. 
*/ cluster { > name = "mendel" > owner = "unspecified" > latlong = "unspecified" > url = "unspecified" > } > > /* The host section describes attributes of the host, like the > location > */ host { > location = "unspecified" > } > > /* Feel free to specify as many udp_send_channels as you like. Gmond > used to only support having a single channel */ udp_send_channel { > port = 8649 > host = 10.54.50.150 /* head node's IP */ } > > /* You can specify as many udp_recv_channels as you like as well. */ > > /* You can specify as many tcp_accept_channels as you like to share > an xml description of the state of the cluster */ > tcp_accept_channel > { > port = 8649 > } > > I modified gmond on the head node thus: > > /* This configuration is as close to 2.5.x default behavior as > possible > The values closely match ./gmond/metric.h definitions in 2.5.x */ > globals { > daemonize = yes > setuid = yes > user = nobody > debug_level = 0 > max_udp_msg_len = 1472 > mute = no > deaf = no > host_dmax = 0 /*secs */ > cleanup_threshold = 300 /*secs */ > gexec = no > } > > /* If a cluster attribute is specified, then all gmond hosts are > wrapped > inside > * of a tag. If you do not specify a cluster tag, then all > will > * NOT be wrapped inside of a tag. */ cluster { > name = "mendel" > owner = "unspecified" > latlong = "unspecified" > url = "unspecified" > } > > /* The host section describes attributes of the host, like the > location > */ host { > location = "unspecified" > } > > /* Feel free to specify as many udp_send_channels as you like. Gmond > used to only support having a single channel */ > > /* You can specify as many udp_recv_channels as you like as well. */ > udp_recv_channel { > port = 8649 > } > > /* You can specify as many tcp_accept_channels as you like to share > an xml description of the state of the cluster */ > tcp_accept_channel > { > port = 8649 > } > > I started gmond on the compute node bpsh 5 gmond and restarted gmond > and > gmetad. I don't see my node running gmond. ps -elf | grep gmond on the > compute node returns nothing. I tried to add gmond as a service on the > compute node with the script at the krazy site but I get: > > scyld:~ root$ bpsh 5 chkconfig --add gmond service gmond does not > support chkconfig > > and > > scyld:~ root$ bpsh 5 service gmond start > /sbin/service: line 3: /etc/init.d/functions: No such file or > directory > > I am at a loss over what to try next, it seems this should work. Any > and > all suggestions will be appreciated. > > Thanks > > Mike > > Michael Muratet, Ph.D. > Senior Scientist > HudsonAlpha Institute for Biotechnology > mmuratet at hudsonalpha.org > (256) 327-0473 (p) > (256) 327-0966 (f) > > Room 4005 > 601 Genome Way > Huntsville, Alabama 35806 > > > > > > > > ------------------------------ > Michael Muratet, Ph.D. Senior Scientist HudsonAlpha Institute for Biotechnology mmuratet at hudsonalpha.org (256) 327-0473 (p) (256) 327-0966 (f) Room 4005 601 Genome Way Huntsville, Alabama 35806 From madskaddie at gmail.com Tue Aug 25 17:11:36 2009 From: madskaddie at gmail.com (madskaddie at gmail.com) Date: Wed, 26 Aug 2009 01:11:36 +0100 Subject: [Beowulf] Cluster install and admin approach (newbie question) Message-ID: Greetings, I relatively new to cluster environments and I was given a small (7nodes+1head) cluster to admin. So far I only had to maintain what was already installed so few problems to solve (and to think on). 
But new (different: AMD Opteron vs Intel Xeon) machines came and I have to expand the cluster (think and solve problems). The (old) cluster is semi-diskless (all machines do have disks, but they boot from a single image on a central server) with NFS for filesystem sharing. The main problems I had were: * if the /var filesystem is shared, race conditions happen (all nodes want to write to the same files). I had this problem and moved to a local /var filesystem. * if /var is local (which it may be, because the disks do exist), the whole point of a central admin location vanishes, because I would have to create all the /var structure that packages need to work on each node (it would be easier to do "for $node; ssh $install_cmd; done" than to guess which dirs I need to create or which files to copy). * if /var is tmpfs, all forensics are certainly gone after a failure (Murphy told me this one ;). Everything I read on the subject underlines the advantages of diskless approaches but fails to mention this problem and/or how to solve it. On the other side, the distributed-approach tools (where every node is autonomous) seem to be halted (e.g. SystemImager, which is used in the OSCAR project) or discontinued, or truly overblown for my scale (IBM's xCAT); so it really seems that I'm missing something. The question is: what do you do about this? Gil Brandao From jbickhard at gmail.com Wed Aug 26 06:57:09 2009 From: jbickhard at gmail.com (J Bickhard) Date: Wed, 26 Aug 2009 08:57:09 -0500 Subject: [Beowulf] Practicality of a Beowulf Cluster Message-ID: So, I was thinking of making a cluster, but wondered: what are the practical uses of one? I mean, you can't exactly run Windows on these things, and it looks like they're mostly for parallel computing of complex algorithms. Would an average Joe like me have a use for a cluster? From Glen.Beane at jax.org Thu Aug 27 14:45:43 2009 From: Glen.Beane at jax.org (Glen Beane) Date: Thu, 27 Aug 2009 17:45:43 -0400 Subject: [Beowulf] Practicality of a Beowulf Cluster In-Reply-To: References: Message-ID: What use is a screwdriver if you don't have any screws? Sent from my iPhone On Aug 27, 2009, at 5:39 PM, "J Bickhard" wrote: > So, I was thinking of making a cluster, but wondered: what are the > practical uses of one? I mean, you can't exactly run Windows on these > things, and it looks like they're mostly for parallel computing of > complex algorithms. > > Would an average Joe like me have a use for a cluster? > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From landman at scalableinformatics.com Thu Aug 27 14:57:18 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 27 Aug 2009 17:57:18 -0400 Subject: [Beowulf] Practicality of a Beowulf Cluster In-Reply-To: References: Message-ID: <4A97013E.2070700@scalableinformatics.com> J Bickhard wrote: > So, I was thinking of making a cluster, but wondered: what are the > practical uses of one? I mean, you can't exactly run Windows on these > things, and it looks like they're mostly for parallel computing of > complex algorithms. Technically you can run Windows on it, though this raises additional questions that are better answered elsewhere. > > Would an average Joe like me have a use for a cluster? That is the important question, but the answer is a function of what you need to do in a computational sense.
If you are cranking on excel spreadsheets all day long, yeah, chances are, a cluster doesn't make sense. If you are performing very detailed and time sensitive calculations that require hundreds of billions of operations to arrive at an answer, that is more likely the domain of something cluster-like. The real answer is "it depends" and in most cases, the "average Joe" (or :) ) probably doesn't need one. Going forward, many average Joe's workloads will likely be handled on local accelerators such as GPU or similar systems. Clusters provide massively increased processor cycle density per unit time. If this is what your application needs, then by all means, look into clusters. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From dag at sonsorol.org Thu Aug 27 14:58:56 2009 From: dag at sonsorol.org (Chris Dagdigian) Date: Thu, 27 Aug 2009 17:58:56 -0400 Subject: [Beowulf] Practicality of a Beowulf Cluster In-Reply-To: References: Message-ID: <659A31BD-8244-4061-AAC8-0EAA1538F951@sonsorol.org> In a nutshell: Science. Finance. Rendering & Digital content creation. ... if you have interests (work or personal) in any of these areas, you'll find a cluster useful. On Aug 26, 2009, at 9:57 AM, J Bickhard wrote: > So, I was thinking of making a cluster, but wondered: what are the > practical uses of one? I mean, you can't exactly run Windows on these > things, and it looks like they're mostly for parallel computing of > complex algorithms. > > Would an average Joe like me have a use for a cluster? From james.p.lux at jpl.nasa.gov Thu Aug 27 15:26:47 2009 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Thu, 27 Aug 2009 15:26:47 -0700 Subject: [Beowulf] Practicality of a Beowulf Cluster In-Reply-To: <659A31BD-8244-4061-AAC8-0EAA1538F951@sonsorol.org> References: <659A31BD-8244-4061-AAC8-0EAA1538F951@sonsorol.org> Message-ID: > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] > On Behalf Of Chris Dagdigian > Sent: Thursday, August 27, 2009 2:59 PM > To: J Bickhard; Beowulf List > Subject: Re: [Beowulf] Practicality of a Beowulf Cluster > > > In a nutshell: > > Science. > Finance. > Rendering & Digital content creation. > > ... if you have interests (work or personal) in any of these areas, > you'll find a cluster useful. > > Or, if you're interested in developing parallel algorithms, message passing, etc. Anything computationally intensive would be potential grist for a Beowulf. For instance, if you wanted to do video compression, or process lots of video frames to do feature extraction? Wasn't there a news story about somebody trying to tag all photos on the internet with the names of everyone in the photo? There's a computationally large but potentially parallelizable task. From rgb at phy.duke.edu Fri Aug 28 06:05:53 2009 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Fri, 28 Aug 2009 09:05:53 -0400 (EDT) Subject: [Beowulf] Practicality of a Beowulf Cluster In-Reply-To: References: Message-ID: On Wed, 26 Aug 2009, J Bickhard wrote: > So, I was thinking of making a cluster, but wondered: what are the > practical uses of one? I mean, you can't exactly run Windows on these > things, and it looks like they're mostly for parallel computing of > complex algorithms. > > Would an average Joe like me have a use for a cluster? Possibly none. Your statement above is pretty much dead on the money. Why would you need more compute power (which is what a cluster is designed to provide) if you don't need more compute power? Let is imagine that a cluster is anthropomorphic -- one of my favorite metaphors is that it is a room full of monks who are all your servants, ready to do any work for you that you can give them to do. Word processing, for example, is a monk taking dictation according to your (finger and mouse driven) input and transforming your wishes into a lovely illuminated manuscript page. When word processing, though, nearly all of the monks sit idle, because only one monk is needed to listen to you and do all of the work of arranging the letters on the page and filling in all that gold leaf, and THAT monk works much faster than you can type and spends most of his time twiddling his thumbs and picking his teeth. Sure, maybe you have other tasks for your monks that you want done at the same time -- one monk, for example is playing music gently on a violin, but he does have recording equipment and a sound room and it only takes him a second or two to play an hour's worth of music and then he, too, is idle once again. In fact, the first monk can take time out between keystrokes and keep the prerecorded music buffer full and still be making parchment airplanes and sailing them at the other monks while waiting for something to do. There are a TINY HANDFUL of tasks you do in normal quotidian computing -- maybe decoding video streams while handling a complex network and playing video games -- that actually use up a whole monk, or even a monk and a half. Nowadays, however, pretty much all CPUs are dual core, so you always have at least two monks anyway, and quads are increasingly common giving you four (of which two are nearly always idle but available in case you want your house painted and taxes done while you are playing a video game at the same time). But YOU have a hard time organizing more than two or three way multitasking, and interactively you just can't keep those damn monks busy. So what CAN keep a large cluster of monks/CPUs -- tens, hundreds, even thousands -- busy? Big tasks, in particular big tasks that can be split up so that all of the monks stay busy. Computations, metaphorically trying to create an entire illuminated manuscript all at once with every monk working on a single page in parallel, where the abbot hands out an assignment to each of the worker monks that will keep them busy all day, and then handles the results and collates them and arranges for the page monk 32 illuminated to be sent to monk 133 so that the figure he drew can be copied accurately and incorporated into the manuscript page that particular monk is working on and so on. So in order for a cluster to make sense, you need some sort of work that can be done in parallel, work that takes a long time (so that your one monk can't finish it satisfactorily quickly working alone), work that is important enough to justify the expense. Does that make sense to you? 
rgb > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From eagles051387 at gmail.com Fri Aug 28 07:17:23 2009 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Fri, 28 Aug 2009 16:17:23 +0200 Subject: [Beowulf] Cluster install and admin approach (newbie question) In-Reply-To: References: Message-ID: would creating a cron job for each of the nodes to where only one is workign on the files on the head node? On Wed, Aug 26, 2009 at 2:11 AM, wrote: > Greetings, > > > I relatively new to cluster environments and I was given a small > (7nodes+1head) cluster to admin. So far I only had to maintain what > was already installed so few problems to solve (and to think on). But > new (diferent: amd opteron vs intel xeon) machines came and I have to > expand the cluster (think and solve problems). The (old) cluster is > semi-diskless (all machines do have disks but they boot from a single > image on a central server) with nfs for filesystem sharing. The main > problems I had were: > * if the /var filesystem is shared, race conditions happen (all nodes > want to write on the same files). I had this problem and moved to a > local /var filesystem. > * if /var is local (which it may because the disks do exist), the > whole point of central point for easy admin vanishes, because I would > had to create all the /var structure that packages need to work, on > each node (would be easier to do: "for $node; ssh $install_cmd; done", > than guessing which dirs I need to create or files to copy). > * if /var is tmpfs all forensics are certainly gone after failure > (Murphy told me this one ;). > > Everything I read on the subject do underline the advantages of > diskless approaches but miss to alert to this problem and/or to solve > it. On the other side, the distributed approach tools (where every > node is autonomous) seem to be halted (as systemimager - which is used > in the Oscar project) or discontinued, or truly overblown for my > reference scale (IBM's xCat); so it really seems that I'm missing > something. > > The question is what you do about this ? > > Gil Brandao > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Jonathan Aquilina -------------- next part -------------- An HTML attachment was scrubbed... URL: From hahn at mcmaster.ca Fri Aug 28 07:57:26 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 28 Aug 2009 10:57:26 -0400 (EDT) Subject: [Beowulf] Practicality of a Beowulf Cluster In-Reply-To: References: Message-ID: > So, I was thinking of making a cluster, but wondered: what are the > practical uses of one? I don't think you really mean practical here. perhaps "commonplace"? > I mean, you can't exactly run Windows on these > things, but why do you equate practical with "runs windows"? a lot of the wider computer world doesn't run windows. windows is mainly a low-end desktop, low-end server ghetto. large, yes, but not the best (or most "practical") by any measure. 
> and it looks like they're mostly for parallel computing of > complex algorithms. do you mean parallel/complex makes beowulf of limited interest or niche? > Would an average Joe like me have a use for a cluster? do you use any websites? any search engines? beowulf and related technologies are all about scaling, so in a sense, anything big is a beowulf. some may complain that this stretches the term, that "grid" or "cloud" should be used instead, but it's all the same concept. a beowulf is a cluster, normally of x86 commodity parts, normally for compute-intensive research/engineering. a render farm is basically the same but oriented towards making movies (IO intensive, but more "embarassingly" parallel.) a grid means a cluster that's geographically distributed and including multiple administrative domains (hence not suitable for tightly-coupled parallelism.) a cloud is a grid-like facility usually implemented on top of VMs. google/amazon/yahoo/etc are all, in this sense, beowulf-like clusters. windows-based beowulf-like clusters are still beowulf-like - they're just knock-offs using a less appropriate OS. programs on clusters tend not to be all that dependent on the "underware" (vs middleware) of the platform. From joshua_mora at usa.net Thu Aug 27 16:53:29 2009 From: joshua_mora at usa.net (Joshua mora acosta) Date: Fri, 28 Aug 2009 01:53:29 +0200 Subject: [Beowulf] Practicality of a Beowulf Cluster Message-ID: <641NHAX1d8298S08.1251417209@cmsweb08.cms.usa.net> In my personal experience, I developed long time ago CFD software on a single machine with a single core. Once I started complicating my life with more complex problems (eg. from 2D to 3D), the time it was taking to solve the problems was growing exponentially ( from hours to several weeks). Therefore I required at some point more computational infrastructure despite all the sw and numerical things I would do to accelerate the sw. I required more, ie a bunch of computers, or a cluster to get to my performance or productivity goal (eg. reduce a 3 week simulation to a overnight run time). But it took me my time to _realize_ or to get more demanding based on my evolving computational needs. On other cases, you cannot simply fit the data you are crunching on a single node so that is a capacity problem you can solve by distributing the data among more compute nodes and having the aggregated capacity needed. The performance one, can also be seen as the # of arithmetic operations to solve your problem is growing so much that you need the aggregated computing power of multiple computers, again distributing the computation among more processors and nodes. Given this sort of introduction, my advice would be to grow your computational infrastructure along your needs over the time (eg. every 1 or 2 years, and much better if aligned with your trusted HW vendor provider), from a single node which nowadays looks like a cluster 5 years ago. And then if you need more, start adding computational infrastructure, which could be also more storage or more network gear, or more gpus which these days are also used for accelerating the computation of the multicore processors. Making the assumption of very little knowledge on your computational needs and usage it is nearly impossible to guess if a cluster will satisfy you and even harder to size it properly. 
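A toy sketch, not code from this thread, of the "measure a kernel on one node, then multiply up to the target" sizing estimate described in the next paragraph; the 80-hour single-node runtime, 12-hour overnight target and 0.7 parallel efficiency are made-up placeholder numbers, to be replaced by your own measurements:

/* sizing_sketch.c - toy node-count estimate; compile with: cc sizing_sketch.c -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double t_one_node = 80.0;  /* measured kernel runtime on a single node, in hours */
    double t_target   = 12.0;  /* desired "overnight" turnaround, in hours */
    double efficiency = 0.7;   /* assumed parallel efficiency at the target size */

    /* smallest N with t_one_node / (N * efficiency) <= t_target */
    double nodes = ceil(t_one_node / (t_target * efficiency));
    printf("estimated nodes required: %.0f\n", nodes);
    return 0;
}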
The people that use clusters typically have computational needs well understood for many years, so the sizing can be more or less estimated by running "kernels" (the meat) of their applications on a single node and then multiplying the performance achieved on that node by the number of nodes necessary to reach the total performance or productivity or capacity target. Having a cluster without using it for what its been designed/built is a whole waste of money, electric power and time and a bunch of unnecessary headaches on many directions. Finally on your comment on Windows, Microsoft has spent already since 2004 money and people in developing and bringing to the market a HPC solution as well. So yes, there is a windows solution for clusters with same features as you will see on Linux/Unix. You can also run decently on Windows on a single box a compute intensive application.... I hope it helps you clear out whether it makes sense or not for you to build a cluster. This group assumes you are already on it and you need perhaps the analysis/feedback/friendly advice on components or on the way an application stresses that specific component of the cluster (eg. processor, networking, storage, OS, settings), or sw tools for management/debugging of the HW+SW clustered solution, among many other things... Best regards, Joshua Mora. ------ Original Message ------ Received: 11:42 PM CEST, 08/27/2009 From: J Bickhard To: beowulf at beowulf.org Subject: [Beowulf] Practicality of a Beowulf Cluster > So, I was thinking of making a cluster, but wondered: what are the > practical uses of one? I mean, you can't exactly run Windows on these > things, and it looks like they're mostly for parallel computing of > complex algorithms. > > Would an average Joe like me have a use for a cluster? > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From madskaddie at gmail.com Fri Aug 28 03:37:44 2009 From: madskaddie at gmail.com (madskaddie at gmail.com) Date: Fri, 28 Aug 2009 11:37:44 +0100 Subject: [Beowulf] Cluster install and admin approach (newbie question) In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0CF1F9D5@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Fri, Aug 28, 2009 at 10:14 AM, Hearns, John wrote: > Greetings, > > > I relatively new to cluster environments and I was given a small > (7nodes+1head) cluster to admin. So far I only had to maintain what > was already installed so few problems to solve (and to think on). But > new (diferent: amd opteron vs intel xeon) machines came and I have to > expand the cluster (think and solve problems). > > And whose bright idea was this one? > I bet it wasn't yours. > I've seen this before - 'computers' are just assumed to be 'all the > same' by the management - > and the poor techie is the one who has to cope with new hardware for > which drivers don't exist > in the original Linux install, new kernel version are needed, cluster > management and monitoring > won't cope with these ndoes. > I have some real sympathy for you. > > That issue I see it by another point of view: finally I will learn something really new. 
Yes, I will loose time ?but I hope that in the end all players will win: me because I got money and know how and the cluster users because we doubled the capacity (not really because I don't believe that mixing the nodes will possible) so more people can run code. But the question remains unanswered ;) and is not tied to the "heterogeneous cluster" problem: ?If diskless what about "/var" like issues; if not, what do you use to install and manage it Gil Brandao From madskaddie at gmail.com Fri Aug 28 04:53:00 2009 From: madskaddie at gmail.com (madskaddie at gmail.com) Date: Fri, 28 Aug 2009 12:53:00 +0100 Subject: [Beowulf] Cluster install and admin approach (newbie question) In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0CF1FAED@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B0CF1F9D5@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0CF1FAED@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Fri, Aug 28, 2009 at 11:58 AM, Hearns, John wrote: > > Gil, I can't answer your questions because I don't know who supplied > your cluster - > there are many cluster management suites. > The cluster is a Debian based beowulf-like cluster (we have supplied our selves). > > You can set up syslog-ng to copy syslog entries to a central syslog > host, ie. the cluster head node. > You can also use the 'conserver' program to log serial console output to > files on the cluster head node, > this counts for both 'real' serial consoles and IPMI serial over LAN. > From hahn at mcmaster.ca Fri Aug 28 08:07:32 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 28 Aug 2009 11:07:32 -0400 (EDT) Subject: [Beowulf] Cluster install and admin approach (newbie question) In-Reply-To: References: Message-ID: > * if the /var filesystem is shared, race conditions happen (all nodes > want to write on the same files). I had this problem and moved to a > local /var filesystem. indeed, shared /var is simply a bug. non-shared NFS /var is viable, but generally pointless. > * if /var is local (which it may because the disks do exist), the > whole point of central point for easy admin vanishes, because I would eh? > had to create all the /var structure that packages need to work, on > each node (would be easier to do: "for $node; ssh $install_cmd; done", > than guessing which dirs I need to create or files to copy). but if your nodes are nfs-root, you won't be installing anything on them: you'll be installing on the nfs-root. > * if /var is tmpfs all forensics are certainly gone after failure > (Murphy told me this one ;). syslog is very happy to log over the network. > Everything I read on the subject do underline the advantages of > diskless approaches but miss to alert to this problem and/or to solve > it. On the other side, the distributed approach tools (where every > node is autonomous) seem to be halted (as systemimager - which is used > in the Oscar project) or discontinued, or truly overblown for my > reference scale (IBM's xCat); so it really seems that I'm missing there's also OneSIS. > something. > > The question is what you do about this ? setting up your own nfs-root cluster is a simple exercise. if you're not very familiar with *nix booting/daemons/init scripts, it will take a few tries to get the config right, but the end result is pretty simple and robust. remote syslog, preferably with console-over-net (ipmi sol, netconsole) means that there's nothing interesting on the local /var. 
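A minimal sketch of the remote-syslog arrangement suggested above, assuming classic sysklogd on the diskless nodes and rsyslog on a head node named loghost; the host name, log path and per-host template are placeholders, not details from this thread:

# on each compute node, /etc/syslog.conf: forward everything to the head
# node over UDP, so nothing of interest ever lands on the local /var
*.*     @loghost

# on the head node, /etc/rsyslog.conf: listen for UDP syslog and write
# one file per sending host
$ModLoad imudp
$UDPServerRun 514
$template PerHost,"/var/log/nodes/%HOSTNAME%.log"
*.*     ?PerHost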
From gus at ldeo.columbia.edu Fri Aug 28 15:48:06 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 28 Aug 2009 18:48:06 -0400 Subject: hdparm error: Was Re: [Beowulf] Help for terrible NFS write performance In-Reply-To: References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> <4A8EDFD5.9080909@cora.nwra.com> <4A8EE41B.7000605@scalableinformatics.com> <4A8F0D64.7050702@ias.edu> <4A8F15C2.7040009@ldeo.columbia.edu> Message-ID: <4A985EA6.9060100@ldeo.columbia.edu> Mark Hahn wrote: >> Somehow it works on Fedora 10, but not on CentOS 4 and 5. > > what version does hdparm -v mention on each system? Hi Mark, list Thank you Mark. Sorry for my very late answer. Anyway, better late than never. The hdparm versions I have are: 8.6 on Fedora 10, 6.6 on CentOS 5, 5.7 on CentOS 4. I'd guess 6.6 and earlier are too old, even the man pages are different. Gus Correa From amjad11 at gmail.com Sat Aug 29 00:42:36 2009 From: amjad11 at gmail.com (amjad ali) Date: Sat, 29 Aug 2009 12:42:36 +0500 Subject: [Beowulf] GPU question Message-ID: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> Hello All, I perceive following computing setups for GP-GPUs, 1) ONE PC with ONE CPU and ONE GPU, 2) ONE PC with more than one CPUs and ONE GPU 3) ONE PC with one CPU and more than ONE GPUs 4) ONE PC with TWO CPUs (e.g. Xeon Nehalems) and more than ONE GPUs (e.g. Nvidia C1060) 5) Cluster of PCs with each node having ONE CPU and ONE GPU 6) Cluster of PCs with each node having more than one CPUs and ONE GPU 7) Cluster of PCs with each node having ONE CPU and more than ONE GPUs 8) Cluster of PCs with each node having more than one CPUs and more than ONE GPUs. Which of these are good/realistic/practical; which are not? Which are quite ?natural? to use for CUDA based programs? IMPORTANT QUESTION: Will a cuda based program will be equally good for some/all of these setups or we need to write different CUDA based programs for each of these setups to get good efficiency? Comments are welcome also for AMD/ATI FireStream. With best regards, AMJAD ALI. -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsten.aulbert at aei.mpg.de Sat Aug 29 01:53:13 2009 From: carsten.aulbert at aei.mpg.de (Carsten Aulbert) Date: Sat, 29 Aug 2009 10:53:13 +0200 Subject: [Beowulf] GPU question In-Reply-To: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> Message-ID: <200908291053.14044.carsten.aulbert@aei.mpg.de> Hi On Saturday 29 August 2009 09:42:36 amjad ali wrote: > I perceive following computing setups for GP-GPUs, > > 1) ONE PC with ONE CPU and ONE GPU, > > 2) ONE PC with more than one CPUs and ONE GPU > > 3) ONE PC with one CPU and more than ONE GPUs > > 4) ONE PC with TWO CPUs (e.g. Xeon Nehalems) and more than ONE GPUs > (e.g. Nvidia C1060) I think no one will be able to answer your question correctly. It will all boil down how your usage scenario will be, e.g. if your codes run only on the GPUs and you have hardly to do anything except providing the initial data to the GPUs I think you might even go to the extreme with a single box and 7 GPUs in it along with a quad CPU. On the other extreme you have code where only a fraction of the code (say 20%) can be put onto the GPU you will probably have to aim at a different CPU to GPU ratio. 
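Whichever of the single-box ratios above is chosen, the same binary can discover at run time how many GPUs it actually has. A minimal CUDA sketch in the spirit of the SDK's deviceQuery sample, illustrative only and not code from this thread:

/* gpu_probe.cu - enumerate visible CUDA devices; build with: nvcc gpu_probe.cu */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        fprintf(stderr, "no CUDA-capable device found\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        /* name and memory size help tell a display card from a compute card */
        printf("device %d: %s, %.0f MB of global memory\n",
               i, prop.name, prop.totalGlobalMem / (1024.0 * 1024.0));
    }
    return 0;
}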
> > 5) Cluster of PCs with each node having ONE CPU and ONE GPU > > 6) Cluster of PCs with each node having more than one CPUs and ONE GPU > > 7) Cluster of PCs with each node having ONE CPU and more than ONE GPUs > > 8) Cluster of PCs with each node having more than one CPUs and more > than ONE GPUs. > Same as above, it highly depends what will run on the cluster. > > > IMPORTANT QUESTION: Will a cuda based program will be equally good for > some/all of these setups or we need to write different CUDA based programs > for each of these setups to get good efficiency? Not much experience here, sorry Carsten From amjad11 at gmail.com Sat Aug 29 16:35:30 2009 From: amjad11 at gmail.com (amjad ali) Date: Sun, 30 Aug 2009 04:35:30 +0500 Subject: [Beowulf] GPU question In-Reply-To: References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> Message-ID: <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> Hello all, specially Gil Brandao Actually I want to start CUDA programming for my |C.I have 2 options to do: 1) Buy a new PC that will have 1 or 2 CPUs and 2 or 4 GPUs. 2) Add 1 GPUs to each of the Four nodes of my PC-Cluster. Which one is more "natural" and "practical" way? Does a program written for any one of the above will work fine on the other? or we have to re-program for the other? Regards. On Sat, Aug 29, 2009 at 5:48 PM, wrote: > On Sat, Aug 29, 2009 at 8:42 AM, amjad ali wrote: > > Hello All, > > > > > > > > I perceive following computing setups for GP-GPUs, > > > > > > > > 1) ONE PC with ONE CPU and ONE GPU, > > > > 2) ONE PC with more than one CPUs and ONE GPU > > > > 3) ONE PC with one CPU and more than ONE GPUs > > > > 4) ONE PC with TWO CPUs (e.g. Xeon Nehalems) and more than ONE GPUs > > (e.g. Nvidia C1060) > > > > 5) Cluster of PCs with each node having ONE CPU and ONE GPU > > > > 6) Cluster of PCs with each node having more than one CPUs and ONE > GPU > > > > 7) Cluster of PCs with each node having ONE CPU and more than ONE > GPUs > > > > 8) Cluster of PCs with each node having more than one CPUs and more > > than ONE GPUs. > > > > > > > > Which of these are good/realistic/practical; which are not? Which are > quite > > ?natural? to use for CUDA based programs? > > > > CUDA is kind of new technology, so I don't think there is a "natural > use" yet, though I read that there people doing CUDA+MPI and there are > papers on CPU+GPU algorithms. > > > > > IMPORTANT QUESTION: Will a cuda based program will be equally good for > > some/all of these setups or we need to write different CUDA based > programs > > for each of these setups to get good efficiency? > > > > There is no "one size fits all" answer to your question. If you never > developed with CUDA, buy one GPU an try it. If it fits your problems, > scale it with the approach that makes you more comfortable (but > remember that scaling means: making bigger problems or having more > users). If you want a rule of thumb: your code must be > _truly_parallel_. If you are buying for someone else, remember that > this is a niche. The hole thing is starting, I don't thing there isn't > many people that needs much more 1 or 2 GPUs. > > > > > Comments are welcome also for AMD/ATI FireStream. > > > > put it on hold until OpenCL takes of (in the real sense, not in > "standards papers" sense), otherwise you will have to learn another > technology that even fewer people knows. > > > Gil Brandao > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdidomenico4 at gmail.com Sat Aug 29 20:18:37 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Sat, 29 Aug 2009 23:18:37 -0400 Subject: [Beowulf] hpcmp benchmarks Message-ID: Does anyone know if the TI series benchmark packages were ever released in the open? If not, are there any out-of-core solver benchmarks out there? My Google searches didn't turn out as I had expected, so I'm probably searching for the wrong thing. From tjrc at sanger.ac.uk Sun Aug 30 04:11:31 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Sun, 30 Aug 2009 12:11:31 +0100 Subject: [Beowulf] Cluster install and admin approach (newbie question) In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0CF1F9D5@milexchmb1.mil.tagmclarengroup.com> Message-ID: On 28 Aug 2009, at 11:37 am, madskaddie at gmail.com wrote: > That issue I see from another point of view: finally I will learn > something really new. Yes, I will lose time but I hope that in the > end all players will win: me because I got money and know-how, and the > cluster users because we doubled the capacity (not really, because I > don't believe that mixing the nodes will be possible) so more people can > run code. > > But the question remains unanswered ;) and is not tied to the > "heterogeneous cluster" problem: if diskless, what about "/var"-like > issues; if not, what do you use to install and manage it? We also use Debian. We also use a heterogeneous cluster (since our workload is embarrassingly parallel and the individual jobs are mostly single threaded, this doesn't really matter). We use FAI for installation, since our nodes are not diskless, rsyslog for logging to a central log server, which is running Splunk. We use cfengine 2 for configuration management. We don't have diskless nodes, so the /var problem doesn't exist for us. Regards, Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From madskaddie at gmail.com Sun Aug 30 05:41:18 2009 From: madskaddie at gmail.com (madskaddie at gmail.com) Date: Sun, 30 Aug 2009 13:41:18 +0100 Subject: [Beowulf] GPU question In-Reply-To: <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> Message-ID: On Sun, Aug 30, 2009 at 12:35 AM, amjad ali wrote: > Hello all, specially Gil Brandao > > Actually I want to start CUDA programming for my |C. I have 2 options to do it: > 1) Buy a new PC that will have 1 or 2 CPUs and 2 or 4 GPUs. > 2) Add 1 GPU to each of the four nodes of my PC cluster. > > Which one is the more "natural" and "practical" way? If I had to choose, I would go for option 1. It's simpler and I wouldn't have to deal with MPI-like problems. What would be interesting is to study which of the following options is best: - QuadCore / 2 GPUs - DualCore / 4 GPUs - or an N:N relation For now I can not tell you, although I strongly suspect that this is related to the problem being solved. For now the CFD code I'm programming uses only 1:1, but to scale to bigger problems I'll soon be using 1:3 and, depending on the results, I have to check whether I need another CPU to deal with data-logging stuff. And be aware of the motherboards you choose: if your cluster doesn't have PCIe 2.0, buy a new PC with a mobo (and GPU card) that supports it.
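For option 2 (one or more GPUs in every node of the existing cluster), the usual arrangement is one MPI process per GPU, with each process picking a device on its own node. A minimal sketch, illustrative only and not code from this thread, assuming ranks are placed consecutively on each node:

/* mpi_gpu_map.c - map MPI ranks to local GPUs; build with mpicc plus the CUDA runtime */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank = 0, ngpus = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaGetDeviceCount(&ngpus);
    if (ngpus == 0) {
        fprintf(stderr, "rank %d: no GPU visible\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* crude mapping: with consecutive ranks per node, rank % ngpus spreads
       the local processes over the local cards */
    cudaSetDevice(rank % ngpus);

    /* ... allocate device buffers, launch kernels, exchange results via MPI ... */

    MPI_Finalize();
    return 0;
}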
Don't forget that if you put to much GPUs on the same bus and you do not take that in to account, you are likely to have a slow bottleneck. > Does a program written for any one of the above will work fine on the other? > or we have to re-program for the other? Apart from MPI-like programming, I can not see why would you need any other stuff. But it's not fully transparent: you have to explicitly choose what card will run what code. > Regards. > > On Sat, Aug 29, 2009 at 5:48 PM, wrote: >> >> On Sat, Aug 29, 2009 at 8:42 AM, amjad ali wrote: >> > Hello All, >> > >> > >> > >> > I perceive following computing setups for GP-GPUs, >> > >> > >> > >> > 1)????? ONE PC with ONE CPU and ONE GPU, >> > >> > 2)????? ONE PC with more than one CPUs and ONE GPU >> > >> > 3)????? ONE PC with one CPU and more than ONE GPUs >> > >> > 4)????? ONE PC with TWO CPUs (e.g. Xeon Nehalems) and more than ONE GPUs >> > (e.g. Nvidia C1060) >> > >> > 5)????? Cluster of PCs with each node having ONE CPU and ONE GPU >> > >> > 6)????? Cluster of PCs with each node having more than one CPUs and ONE >> > GPU >> > >> > 7)????? Cluster of PCs with each node having ONE CPU and more than ONE >> > GPUs >> > >> > 8)????? Cluster of PCs with each node having more than one CPUs and more >> > than ONE GPUs. >> > >> > >> > >> > Which of these are good/realistic/practical; which are not? Which are >> > quite >> > ?natural? to use for CUDA based programs? >> > >> >> CUDA is kind of new technology, so I don't think there is a "natural >> use" yet, though I read that there people doing CUDA+MPI and there are >> papers on CPU+GPU algorithms. >> >> > >> > IMPORTANT QUESTION: Will a cuda based program will be equally good for >> > some/all of these setups or we need to write different CUDA based >> > programs >> > for each of these setups to get good efficiency? >> > >> >> There is no "one size fits all" answer to your question. If you never >> developed with CUDA, buy one GPU an try it. If it fits your problems, >> scale it with the approach that makes you more comfortable (but >> remember that scaling means: making bigger problems or having more >> users). If you want a rule of thumb: your code must be >> _truly_parallel_. If you are buying for someone else, remember that >> this is a niche. The hole thing is starting, I don't thing there isn't >> many people that needs much more 1 or 2 GPUs. >> >> > >> > Comments are welcome also for AMD/ATI FireStream. >> > >> >> put it on hold until OpenCL takes of ?(in the real sense, not in >> "standards papers" sense), otherwise you will have to learn another >> technology that even fewer people knows. >> >> >> Gil Brandao > > From eagles051387 at gmail.com Sun Aug 30 23:40:49 2009 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Mon, 31 Aug 2009 08:40:49 +0200 Subject: [Beowulf] GPU question In-Reply-To: References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> Message-ID: One thing that has yet to be mentioned is what kind of gpu are we talking about. depending on the problem would tesla gpu's, if you are building the cluster from scratch, be better for a gpu based cluster as they are meant for high performance computing? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From madskaddie at gmail.com Mon Aug 31 06:15:07 2009 From: madskaddie at gmail.com (madskaddie at gmail.com) Date: Mon, 31 Aug 2009 14:15:07 +0100 Subject: [Beowulf] GPU question In-Reply-To: References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> Message-ID: On Mon, Aug 31, 2009 at 7:40 AM, Jonathan Aquilina wrote: > One thing that has yet to be mentioned is what kind of GPU we are talking > about. Depending on the problem, would Tesla GPUs, if you are building the > cluster from scratch, be better for a GPU-based cluster, as they are meant > for high performance computing? Tesla (10-series) solutions have a big bunch of memory (4GB per GPU) and no graphics output. In terms of FLOPS the GeForce 2xx series are also great (but with less memory per GPU). The Tesla C1060 that I work with generates massive heat (they need 200W per card to work), so that is an issue to care about (I have a cold - ~16 Celsius - air flow at the front of the PC). The 1070 is a 4-GPU "all in one" 1U system, so I guess it's probably the optimal solution from a management point of view (don't forget the host PC too). The problem with too much data within the GPUs (16GB total) is the bottlenecks (at the PCIe bus) you may have if you need to download big bunches of data frequently or if the code on GPU A is supposed to interact with GPU B/C/D. One thing that's not mentioned out loud by NVIDIA (I have read it only in the CUDA programming manual) is that if the video system needs more memory than is available (say you change resolution while you're waiting for your process to finish), it will crash your CUDA app, so I advise you to use a second card for display (if you have a Tesla solution, you certainly have a "second" display card). If you are running remotely, this is a non-issue (framebuffers don't need much memory nor change resolution). Gil Brandao From eagles051387 at gmail.com Mon Aug 31 07:13:55 2009 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Mon, 31 Aug 2009 16:13:55 +0200 Subject: [Beowulf] GPU question In-Reply-To: References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> Message-ID: >One thing that's not mentioned out loud by NVIDIA (I have read it only in >the CUDA programming manual) is that if the video system needs more memory >than is available (say you change resolution while you're waiting >for your process to finish), it will crash your CUDA app, so I advise >you to use a second card for display (if you have a Tesla solution, >you certainly have a "second" display card). If you are running >remotely, this is a non-issue (framebuffers don't need much memory >nor change resolution). In this regard, then, why waste a PCIe slot when you can get a board that has graphics integrated, leaving the slots free to use for data processing? Is there any difference in performance between a motherboard that has the graphics integrated and one that does not? -------------- next part -------------- An HTML attachment was scrubbed...
URL: From gus at ldeo.columbia.edu Mon Aug 31 09:28:43 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 31 Aug 2009 12:28:43 -0400 Subject: [Beowulf] GPU question In-Reply-To: <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> Message-ID: <4A9BFA3B.5030809@ldeo.columbia.edu> Hi Amjad 1. Beware of hardware requirements, specially on your existing computers, which may or may not fit a CUDA-ready GPU. Otherwise you may end up with a useless lemon. A) Not all NVidia graphic cards are CUDA-ready. NVidia has lists telling which GPUs are CUDA-ready, which are not. B) Check all the GPU hardware requirements in detail: motherboard, PCIe version and slot, power supply capacity and connectors, etc. See the various GPU models on NVidia site, and the product specs from the specific vendor you choose. C) You need a free PCIe slot, most likely 16x, IIRR. D) Most GPU card models are quite thick, and take up its own PCIe slot and cover the neighbor slot, which cannot be used. Hence, if your motherboard is already crowded, make sure everything will fit. For rackmount a chassis you may need at least 2U height. On a tower PC chassis this shouldn't be a problem. You may need some type of riser card if you plan to mount the GPU parallel to the motherboard. E) If I remember right, you need PCIe version 1.5 (?) or version 2 on your motherboard. F) You also need a power supply with enough extra power to feed the GPU beast. The GPU model specs should tell you how much power you need. Most likely a 600W PS or larger, specially if you have a dual socket server motherboard with lots of memory, disks, etc to feed. G) Depending on the CUDA-ready GPU card, the low end ones require 6-pin PCIe power connectors from the power supply. The higher end models require 8-pin power supply PCIe connectors. You may find and buy molex-to-PCIe connector adapters also, so that you can use the molex (i.e. ATA disk power connectors) if your PS doesn't have the PCIe connectors. However, you need to have enough power to feed the GPU and the system, no matter what. *** 2. Before buying a lot of hardware, I would experiment first with a single GPU on a standalone PC or server (that fits the HW requirements), to check how much programming it takes, and what performance boost you can extract from CUDA/GPU. CUDA requires quite a bit of logistics of shipping data between memory, GPU, CPU, etc. It is perhaps more challenging to program than, say, parallelizing a serial program with MPI, for instance. Codes that are heavy in FFTs or linear algebra operations are probably good candidates, as there are CUDA libraries for both. At some point only 32-bit floating point arrays would take advantage of CUDA/GPU, but not 64-bit arrays. The latter would require additional programming to change between 64/32 bit when going to and coming back from the GPU. Not sure if this still holds true, newer GPU models may have efficient 64-bit capability, but it is worth checking this out, including if performance for 64-bit is as good as for 32-bit. 3. PGI compilers version 9 came out with "GPU directives/pragmas" that are akin to the OpenMPI directives/pragmas, and may simplify the use of CUDA/GPU. At least before the promised OpenCL comes out. Check the PGI web site. Note that this will give you intra-node parallelism exploring the GPU, just like OpenMP does using threads on the CPU/cores. 4. 
CUDA + MPI may be quite a challenge to program. I hope this helps, Gus Correa amjad ali wrote: > Hello all, specially Gil Brandao > > Actually I want to start CUDA programming for my |C.I have 2 options to do: > 1) Buy a new PC that will have 1 or 2 CPUs and 2 or 4 GPUs. > 2) Add 1 GPUs to each of the Four nodes of my PC-Cluster. > > Which one is more "natural" and "practical" way? > Does a program written for any one of the above will work fine on the > other? or we have to re-program for the other? > > Regards. > > On Sat, Aug 29, 2009 at 5:48 PM, > wrote: > > On Sat, Aug 29, 2009 at 8:42 AM, amjad ali > wrote: > > Hello All, > > > > > > > > I perceive following computing setups for GP-GPUs, > > > > > > > > 1) ONE PC with ONE CPU and ONE GPU, > > > > 2) ONE PC with more than one CPUs and ONE GPU > > > > 3) ONE PC with one CPU and more than ONE GPUs > > > > 4) ONE PC with TWO CPUs (e.g. Xeon Nehalems) and more than > ONE GPUs > > (e.g. Nvidia C1060) > > > > 5) Cluster of PCs with each node having ONE CPU and ONE GPU > > > > 6) Cluster of PCs with each node having more than one CPUs > and ONE GPU > > > > 7) Cluster of PCs with each node having ONE CPU and more > than ONE GPUs > > > > 8) Cluster of PCs with each node having more than one CPUs > and more > > than ONE GPUs. > > > > > > > > Which of these are good/realistic/practical; which are not? Which > are quite > > ?natural? to use for CUDA based programs? > > > > CUDA is kind of new technology, so I don't think there is a "natural > use" yet, though I read that there people doing CUDA+MPI and there are > papers on CPU+GPU algorithms. > > > > > IMPORTANT QUESTION: Will a cuda based program will be equally > good for > > some/all of these setups or we need to write different CUDA based > programs > > for each of these setups to get good efficiency? > > > > There is no "one size fits all" answer to your question. If you never > developed with CUDA, buy one GPU an try it. If it fits your problems, > scale it with the approach that makes you more comfortable (but > remember that scaling means: making bigger problems or having more > users). If you want a rule of thumb: your code must be > _truly_parallel_. If you are buying for someone else, remember that > this is a niche. The hole thing is starting, I don't thing there isn't > many people that needs much more 1 or 2 GPUs. > > > > > Comments are welcome also for AMD/ATI FireStream. > > > > put it on hold until OpenCL takes of (in the real sense, not in > "standards papers" sense), otherwise you will have to learn another > technology that even fewer people knows. > > > Gil Brandao > > > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bernard at vanhpc.org Mon Aug 31 11:57:27 2009 From: bernard at vanhpc.org (Bernard Li) Date: Mon, 31 Aug 2009 11:57:27 -0700 Subject: [Beowulf] Cluster install and admin approach (newbie question) In-Reply-To: References: Message-ID: Hi Gil: On Tue, Aug 25, 2009 at 5:11 PM, wrote: > it. 
On the other side, the distributed approach tools (where every > node is autonomous) seem to be halted (as systemimager - which is used > in the Oscar project) or discontinued, or truly overblown for my As far as I know, the OSCAR project is still alive and kicking: http://svn.oscar.openclustergroup.org/trac/oscar There hasn't been a lot of development for SystemImager, but the code should be fairly stable and self-sufficient. If you are looking for deployment systems to try out, you might also want to take a look at Perceus. Out of the box it does stateless provisioning, but I have created patches so that it can perform stateful provisioning as well. Hopefully this will get included in the 1.6 release: http://www.perceus.org Cheers, Bernard From eagles051387 at gmail.com Mon Aug 31 23:58:39 2009 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Tue, 1 Sep 2009 08:58:39 +0200 Subject: [Beowulf] GPU question In-Reply-To: <4A9BFA3B.5030809@ldeo.columbia.edu> References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> <4A9BFA3B.5030809@ldeo.columbia.edu> Message-ID: Gus, wouldn't it be better, with regard to power supplies, to purchase a modular power supply? That way you only plug in the power cables that you will actually use and don't have any extra clutter in your box. -------------- next part -------------- An HTML attachment was scrubbed... URL: