From brockp at umich.edu Mon Aug 3 05:56:04 2009
From: brockp at umich.edu (Brock Palen)
Date: Mon, 3 Aug 2009 08:56:04 -0400
Subject: [Beowulf] Lustre Featured on Podcast
Message-ID:

Thanks to Andreas for taking an hour out to talk with Jeff Squyres and myself (Brock Palen) about the Lustre cluster filesystem on our podcast, www.rce-cast.com. You can find the whole show at:
http://www.rce-cast.com/index.php/Podcast/rce-14-lustre-cluster-filesystem.html

Thanks again! If any of you have requests for topics you would like to hear, please let me know!

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985

From niftyompi at niftyegg.com Mon Aug 3 22:29:50 2009
From: niftyompi at niftyegg.com (NiftyOMPI Tom Mitchell)
Date: Mon, 3 Aug 2009 22:29:50 -0700
Subject: [Beowulf] Fabric design consideration
In-Reply-To: <3E9B990982B6404CAC116FE42F1FC97240F7BCE1@USFMB1.forest.usf.edu>
References: <3E9B990982B6404CAC116FE42F1FC97240F7BCE1@USFMB1.forest.usf.edu>
Message-ID: <88815dc10908032229n35dc509clba0b1a52ab6af8f1@mail.gmail.com>

On Thu, Jul 30, 2009 at 8:18 AM, Smith, Brian wrote:
> Hi, All,
>
> I've been re-evaluating our existing InfiniBand fabric design for our HPC systems since I've been tasked with determining how we will add more systems in the future as more and more researchers opt to add capacity to our central system. We've already gotten to the point where we've used up all available ports on the 144 port SilverStorm 9120 chassis that we have and we need to expand capacity. One option that we've been floating around -- that I'm not particularly fond of, btw -- is to purchase a second chassis and link them together over 24 ports, two per spine. While a good deal of our workload would be OK with 5:1 blocking and 6 hops (3 across each chassis), I've determined that, for the money, we're definitely not getting the best solution.
>
> The plan that I've put together involves using the SilverStorm as the core in a spine-leaf design. We'll go ahead and purchase a batch of 24 port QDR switches, two for each rack, to connect our 156 existing nodes (with up to 50 additional on the way). Each leaf will have 6 links back to the spine for 3:1 blocking and 5 hops (2 for the leafs, 3 for the spine). This will allow us to scale the fabric out to 432 total nodes before having to purchase another spine switch. At that point, half of the six uplinks will go to the first spine, half to the second. In theory, it looks like we can scale this design -- with future plans to migrate to a 288 port chassis -- to quite a large number of nodes. Also, just to address this up front, we have a very generic workload, with a mix of md, ab initio, cfd, fem, blast, rf, etc.
>
> If the good folks on this list would be kind enough to give me your input regarding these options or possibly propose a third (or fourth) option, I'd very much appreciate it.
>
> Brian Smith

I think the hop count is a smaller design issue than cable length for QDR. Cable length and the physical layout of hosts in the machine room may prove to be the critical issue in your design. Also, since routing is static, some seemingly obvious assumptions about routing, links, cross-sectional bandwidth and blocking can be non-obvious.

Also less obvious to a group like this are your storage, job mix and batch system.

For example, in a single rack with a pair of QDR 24 port switches, you might wish to have two or three links connecting those 24 port switches directly at QDR rates.
Then the remaining three or four links would connect (DDR?) back to the 144-port switch. If the batch system was 'rack aware', jobs that could run on a single rack would stay within it, and jobs that had ranks scattered about would see a lightly loaded central switch.

Adding QDR to the mix as you scale out to 400+ nodes using newer multi-core processor nodes could be fun.

When you knock on vendor doors, ask about optical links... QDR optical links may let you reach beyond some classic fabric layouts as your machine room and cpu core count grows.

--
NiftyOMPI
T o m M i t c h e l l

From brockp at umich.edu Tue Aug 4 11:48:21 2009
From: brockp at umich.edu (Brock Palen)
Date: Tue, 4 Aug 2009 14:48:21 -0400
Subject: [Beowulf] force factory reset of sfs7000 (topspin 120)
Message-ID: <635DE2F6-3A2C-4A58-91F1-072288667650@umich.edu>

We have a Cisco SFS7000 IB switch (maybe still under support, waiting on Cisco), also known as a Topspin 120.

We cannot log in with the password we (thought) we had set it to. I have looked online and found little tonight about forcing the switch back to factory defaults without a login.

The serial console works fine, we just can't log in. We can poke at the firmware a little by stopping the boot; we just don't know what to do from there. If anyone has directions on how to force the SFS7000 back to factory defaults, or can help with password recovery, that would be great.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985

From Greg at keller.net Wed Aug 5 08:41:23 2009
From: Greg at keller.net (Greg Keller)
Date: Wed, 5 Aug 2009 10:41:23 -0500
Subject: [Beowulf] Re NiftyOMPI Tom Mitchell
In-Reply-To: <200908041900.n74J08Em000968@bluewest.scyld.com>
References: <200908041900.n74J08Em000968@bluewest.scyld.com>
Message-ID: <843B6DC4-1123-4AE3-A52E-2D187C67C201@Keller.net>

> Brian,

A 3rd option: upgrade your Chassis to 288 ports. The beauty of SS/Qlogic switches is they all use the same components. The Chassis/Backplane are relatively dumb and cheap. You can re-use your spine switches and leaf switches. You don't even need to add the additional spine switches if 2:1 blocking is OK.

Be very careful which ports you use to link the switches together if you do try and splice 2 chassis together. SMs can have trouble mapping many configurations, and you're probably best off dedicating line cards as "Uplink" or "Compute" (but don't mix/match) if I recall the layouts correctly. With these "multi-tiered" switches the SM sometimes can't figure out which way is up if you mix the ports, apparently.

A 4th Option: 36 Port QDR + DDR
Also note that the QDR switches are based on 36 port chips and are not a huge price jump (per port), so with a "Hybrid" cable for the uplinks, you may be able to purchase the newer technology and block the heck out of it. So adding 48 additional nodes could be as easy as:
Disconnect 48 nodes for uplinks from the core switch
Connect 4 x 36 port QDR with 12 uplinks to each
Connect 48 old, and 48 new nodes to the 36 port QDR "edge"
This leaves you with 96 nodes on each side of a 48 port

Option 3 is the cleanest, and generically my favorite if you can get a chassis for a reasonable price.

Cheers!
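One sanity check, whichever option wins: after the recabling, dump the fabric and make sure the subnet manager routed it the way you intended. The stock OFED diagnostics are enough for a first pass; a rough sketch, assuming the infiniband-diags/ibutils tools are installed (the output file name is just an example):

  ibnetdiscover > fabric.topo   # topology exactly as the SM discovered it
  ibswitches                    # quick count of the switch chips the SM can see
  ibdiagnet                     # per-link error counters and basic fabric sanity checks

Comparing the ibnetdiscover dump before and after the change also makes it obvious if an uplink landed on the wrong leaf.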
Greg
From deadline at eadline.org Wed Aug 5 11:36:53 2009
From: deadline at eadline.org (Douglas Eadline)
Date: Wed, 5 Aug 2009 14:36:53 -0400 (EDT)
Subject: [Beowulf] BProc
In-Reply-To:
References:
Message-ID: <33345.192.168.1.213.1249497413.squirrel@mail.eadline.org>

If you would like to read more from Don, take a look at the newly posted interview:

Don Becker On The State Of HPC
http://www.linux-mag.com/cache/7449/1.html

--
Doug

From douglas.guptill at dal.ca Wed Aug 5 11:52:50 2009
From: douglas.guptill at dal.ca (Douglas Guptill)
Date: Wed, 5 Aug 2009 15:52:50 -0300
Subject: [Beowulf] BProc
In-Reply-To: <33345.192.168.1.213.1249497413.squirrel@mail.eadline.org>
References: <33345.192.168.1.213.1249497413.squirrel@mail.eadline.org>
Message-ID: <20090805185250.GA12440@dome>

On Wed, Aug 05, 2009 at 02:36:53PM -0400, Douglas Eadline wrote:
>
> If you would like to read more from Don, take a look
> at newly posted interview:
>
> Don Becker On The State Of HPC
>
> http://www.linux-mag.com/cache/7449/1.html

Love it.

Thanks,
Douglas.

From kus at free.net Fri Aug 7 11:51:21 2009
From: kus at free.net (Mikhail Kuzminsky)
Date: Fri, 07 Aug 2009 22:51:21 +0400
Subject: [Beowulf] numactl & SuSE11.1
Message-ID:

I have OpenSuSE 11.1 (kernel 2.6.22.5-31) installed on a dual Nehalem (Xeon E5520) server.

numactl --show
says:
libnuma: Warning : /sys not mounted or invalid. Assuming one node: No such file or directory

/sys/devices/system/node contains 2 directories, but they are node0 and node2 (instead of node1, which I expected). How is it possible to correct this situation?

Mikhail Kuzminsky
Computer Assistant to Chemical Research Center
Zelinsky Institute of Organic Chemistry RAS
Moscow

From rpnabar at gmail.com Fri Aug 7 16:59:24 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Fri, 7 Aug 2009 18:59:24 -0500
Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem
Message-ID:

Is it a bad mistake to configure a Nehalem (2 sockets quad core giving a total of 8 cores; E5520) with 16 GB RAM (4 DIMMs of 4GB each)? I know (I think) that the optimized memory for Nehalems is in banks of 6 due to the way the architecture is? I have often seen Nehalems coming with 24 GB memory as 6 DIMMs of 4 GB each.
Our code requirements dictate 2 GB / core is enough. Should I be paying for the additional RAM to make it 24 GB?

Also, are there any other tips for the Nehalems in general to coax out max performance? Maybe some compiler flags or BIOS settings etc? The only thing I did so far was to put the BIOS power setting into a "max performance" mode.

In the past I've gotten about 5% additional performance by changing the power profile to "performance" using cpufreq-set on my AMD Opteron Barcelonas. Any similar gotchas for the Nehalems and HPC?

--
Rahul

From gus at ldeo.columbia.edu Fri Aug 7 17:58:39 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Fri, 07 Aug 2009 20:58:39 -0400
Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem
In-Reply-To:
References:
Message-ID: <4A7CCDBF.7070904@ldeo.columbia.edu>

Hi Rahul, list

In case you haven't read it, this Nehalem memory guide from Dell has good information and the memory configuration details:
http://www.delltechcenter.com/page/04-08-2009+-+Nehalem+and+Memory+Configurations

A researcher here bought a Nehalem workstation (not a cluster) with 24GB RAM also. We followed the article's recommendation, which was also what the vendor suggested. Maybe 24GB is more than needed, but presumably it avoids the performance penalty that would hit a 16GB configuration. Since the computer will mostly run Matlab jobs, and Matlab has no bounds when it comes to memory, it may not have been a waste anyway.

Some people are reporting good results when using the Nehalem hyperthreading feature (activated on the BIOS). When the code permits, this virtually doubles the number of cores on Nehalems. That feature works very well on IBM PPC-6 processors (IBM calls it "simultaneous multi-threading" SMT, IIRR), and scales by a factor of >1.5, at least with the atmospheric model I tried. This may be a useful way to explore your 24GB, say, by running 12 processes on an 8-core node (50% oversubscribed), instead of the 8 processes that you run today on the Barcelonas.

As for compiler flags, if you are using Intel these are probably good:
-wS (which gives you SSE4, but check if there is something fancier now for Nehalem)
-fast, although some of our codes had problems with the -ipo that is part of -fast, and I had to reduce it to -ip plus the other bits and pieces of -fast.

I hope this helps,
Gus Correa

---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

Rahul Nabar wrote:
> Is it a bad mistake to configure a Nehalem (2 sockets quad core giving
> a total of 8 cores; E5520) with 16 GB RAM (4 DIMMs of 4GB each)? I
> know (I think) that the optimized memory for Nehalems is in banks of 6
> due to the way the architecture is? I have often seen Nehalems coming
> with 24 GB memory as 6 DIMMs of 4 GB each.
>
> Our code requirements dictate 2 GB / core is enough. Should I be
> paying for the additional RAM to make it 24 GB?
>
> Also, are there any other tips for the Nehalems in general to coax out
> max performance? Maybe some compiler flags or BIOS settings etc? The
> only thing I did so far was to put the BIOS power setting into a "max
> performance" mode.
>
> In the past I've gotten about 5% additional performance by changing
> the power profile to "performance" using cpufreq-set on my AMD
> Opteron Barcelonas. Any similar gotchas for the Nehalems and HPC?
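P.S. On the power-profile question: the same cpufreq tools you used on the Barcelonas should apply to the Nehalems, assuming your distribution ships cpufrequtils and a cpufreq driver is loaded for these CPUs; a minimal sketch:

  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # what cpu0 is doing right now
  # force the performance governor on every core (as root)
  for g in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
      echo performance > $g
  done

Whether this buys anything beyond the BIOS "max performance" setting is something only a benchmark of your own code will tell.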
> From davidramirezmolina at gmail.com Fri Aug 7 12:42:58 2009 From: davidramirezmolina at gmail.com (David Ramirez) Date: Fri, 7 Aug 2009 14:42:58 -0500 Subject: [Beowulf] Small form computers as cluster nodes - any comments about the Shuttle brand ? Message-ID: Due to space constraints I am considering implementing a 8-node (+ master) HPC cluster project using small form computers. Knowing that Shuttle is a reputable brand, with several years in the market, I wonder if any of you out there has already used them on clusters and how has been your experience (performance, reliability etc.) -- | David Ramirez Molina | davidramirezmolina at gmail.com | Houston, Texas - USA Ancora Imparo (A?n aprendo) - Michelangelo a los 80 a?os -------------- next part -------------- An HTML attachment was scrubbed... URL: From chenyon1 at iit.edu Fri Aug 7 12:55:38 2009 From: chenyon1 at iit.edu (Yong Chen) Date: Fri, 07 Aug 2009 19:55:38 +0000 (GMT) Subject: [Beowulf] [hpc-announce] Call for Attendance: P2S2-2009 Workshop Message-ID: Dear Colleagues, The Second International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2) will be held in Vienna, Austria, on Sept. 22nd, 2009 in conjunction with The 38th International Conference on Parallel Processing (ICPP-2009). The workshop program has been finalized and can be found here: http://www.mcs.anl.gov/events/workshops/p2s2/pro.html (listed below for your reference). We welcome you attend the P2S2-2009 workshop and look forward to seeing you in Vienna, Austria! =============================================================================== Session 1: Opening Time: 09:00 - 10:30, Location: Room F3 (89), Chair: Pavan Balaji, Argonne National Laboratory Opening Remarks (D. K. Panda, Pavan Balaji and Abhinav Vishnu) Invited Keynote by Dr. Pete Beckman, Argonne National Laboratory, "Challenges for System Software on Exascale Platforms" 10:30 - 11:00 Coffee Break Session 2: Software for Large-scale Systems Time: 11:00 - 12:30, Location: Room F3 (89), Chair: Tom Peterka, Argonne National Laboratory 1. "Characterizing the Performance of Big Memory on Blue Gene Linux" Kazutomo Yoshii, Kamil Iskra, P. Chris Broekema, Harish Naik and Pete Beckman 2. "Optimization of Preconditioned Parallel Iterative Solvers for Finite-Element Applications using Hybrid Parallel Programming Models on T2K Open Supercomputer (Todai Combined Cluster)" Kengo Nakajima 3. "Analyzing Checkpointing Trends for Applications on Peta-scale Systems" Harish Naik, Rinku Gupta and Pete Beckman 12:30 - 14:00 Lunch Session 3: Communication and I/O Time: 14:00 - 15:30, Location: Room F3 (89), Chair: Abhinav Vishnu, Pacific Northwest National Laboratory 1. "Designing and Evaluating MPI-2 Dynamic Process Management Support for InfiniBand" Tejus Gangadharappa, Matthew Koop and Dhabaleswar K Panda 2. "CkDirect: Unsynchronized One-Sided Communication in a Message-Driven Paradigm" Eric Bohm, Sayantan Chakravorty, Pritish Jetley, Abhinav Bhatele and Laxmikant Kale 3. "Exploiting Latent I/O Asynchrony in Petascale Science Applications" Patrick Widener, Matthew Wolf, Hasan Abbasi, Scott McManus, Mary Payne, Patrick Bridges and Karsten Schwan 4. "Gears4Net - An Asynchronous Programming Model" Martin Saternus, Torben Weis, Sebastian Holzapfel and Arno Wacker 15:30 - 16:00 Coffee Break Session 4: Software for Multicore Architectures Time: 16:00 - 17:30, Location: Room F3 (89), Chair: Ron Brightwell, Sandia National Laboratory 1. 
"Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms" Changjun Hu, Yali Liu and Jianjiang Li 2. "Open Source Software Support for the OpenMP Runtime API for Profiling" Oscar Hernandez, Van Bui, Richard Kufrin and Barbara Chapman 3. "Just-In-Time Renaming and Lazy Write-Back on the Cell/B.E." Pieter Bellens, Rosa Badia and Jesus Labarta From landman at scalableinformatics.com Sat Aug 8 09:51:58 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Sat, 08 Aug 2009 12:51:58 -0400 Subject: [Beowulf] Small form computers as cluster nodes - any comments about the Shuttle brand ? In-Reply-To: References: Message-ID: <4A7DAD2E.4060004@scalableinformatics.com> David Ramirez wrote: > Due to space constraints I am considering implementing a 8-node (+ > master) HPC cluster project using small form computers. Knowing that > Shuttle is a reputable brand, with several years in the market, I wonder > if any of you out there has already used them on clusters and how has > been your experience (performance, reliability etc.) The down sides (from watching others do this) 1) no ECC ram. You will get bit-flips. ECC protects you (to a degree) against some bit-flippage. If you can get ECC memory (and turn on the ECC support in BIOS), by all means, do so. 2) power. One customer from a while ago did this, and found that the power supplies on the units were not able to supply a machine running the processor and memory (and disk/network etc) at nearly full load for many hours. You have to make sure your entire computing infrastructure (in the box) fits in *under* the power budget from the supply. This may be easier these days using "gamer" rigs which have power to handle GPU cards, but keep this in mind anyway. 3) networks. Sadly, the NICs on the hobby machines aren't usually up to the level of quality on the server systems. You might not get PXE capability (though these days, I haven't seen many boards without it). Just evaluate your options carefully with the specs in hand. You will have design tradeoffs due to the space constraint, just keep in mind your goals as you evaluate them. Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From gerry.creager at tamu.edu Sat Aug 8 10:08:37 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Sat, 08 Aug 2009 12:08:37 -0500 Subject: [Beowulf] Small form computers as cluster nodes - any comments about the Shuttle brand ? In-Reply-To: <4A7DAD2E.4060004@scalableinformatics.com> References: <4A7DAD2E.4060004@scalableinformatics.com> Message-ID: <4A7DB115.1020409@tamu.edu> Joe Landman wrote: > David Ramirez wrote: >> Due to space constraints I am considering implementing a 8-node (+ >> master) HPC cluster project using small form computers. Knowing that >> Shuttle is a reputable brand, with several years in the market, I >> wonder if any of you out there has already used them on clusters and >> how has been your experience (performance, reliability etc.) > > The down sides (from watching others do this) > > 1) no ECC ram. You will get bit-flips. ECC protects you (to a degree) > against some bit-flippage. If you can get ECC memory (and turn on the > ECC support in BIOS), by all means, do so. > > 2) power. 
One customer from a while ago did this, and found that the > power supplies on the units were not able to supply a machine running > the processor and memory (and disk/network etc) at nearly full load for > many hours. You have to make sure your entire computing infrastructure > (in the box) fits in *under* the power budget from the supply. This may > be easier these days using "gamer" rigs which have power to handle GPU > cards, but keep this in mind anyway. > > 3) networks. Sadly, the NICs on the hobby machines aren't usually up to > the level of quality on the server systems. You might not get PXE > capability (though these days, I haven't seen many boards without it). Adding to Joe's comments... and having tried this a couple of years ago, the NICs are not up to the drill. Plan to add Intel gigabit NICs. While they'll likely be TOE NICs, learn how to tune them and to stop TOE functionality on them. I've nothing kind to say about Broadcom NICs, and am even less kind in HPC/HTC environments with hobbyist chipset implementations. > Just evaluate your options carefully with the specs in hand. You will > have design tradeoffs due to the space constraint, just keep in mind > your goals as you evaluate them. I'd be trying to find ways to get 1u systems and, if 8 is the number, you'll find they don't take up much room. gerry -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From bill at cse.ucdavis.edu Sat Aug 8 12:53:01 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Sat, 08 Aug 2009 12:53:01 -0700 Subject: [Beowulf] Small form computers as cluster nodes - any comments about the Shuttle brand ? In-Reply-To: <4A7DB115.1020409@tamu.edu> References: <4A7DAD2E.4060004@scalableinformatics.com> <4A7DB115.1020409@tamu.edu> Message-ID: <4A7DD79D.4070804@cse.ucdavis.edu> Gerry Creager wrote: > I'd be trying to find ways to get 1u systems and, if 8 is the number, > you'll find they don't take up much room. Doubly so if you get one of the 2 nodes in 1U or 4 nodes in 2U. From hahn at mcmaster.ca Sat Aug 8 15:42:45 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Sat, 8 Aug 2009 18:42:45 -0400 (EDT) Subject: [Beowulf] numactl & SuSE11.1 In-Reply-To: References: Message-ID: > I've OpenSuSE 11.1 (kernel 2.6.22.5-31) installed on dual Nehalem (Xeon > E5520) server. > numactl -- show > says > libnuma: Warning : /sys not mounted or invalid. Assuming one node: No such > file or directory > > /sys/devices/system/node contains 2 directories, but they are node0 and node2 > (instead node1 which I expected). sounds like the kernel isn't grokking the cpu; given that 2.6.22.5 dates from 08/22/2007, that's not all that surprising... > How is possible to correct this situation ? I'm guessing a new kernel would do it - since all numactl's can grok amd's opterons, they ought to deal with intel's ;) From hahn at mcmaster.ca Sat Aug 8 15:47:47 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Sat, 8 Aug 2009 18:47:47 -0400 (EDT) Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: Message-ID: > Is it a bad mistake to configure a Nehalem (2 sockets quad core giving > a total of 8 cores; E5520) with 16 GB RAM (4 DIMMs of 4GB each)? I there's no ambiguity here: unpopulated channels decrease bandwidth and/or concurrency. (does anyone know whether nehalem can "ungang" memory channels like opteron can? 
it would be fascinating to see benchmarks showing a benefit to higher memory concurrency for a manycore workload...)

> Our code requirements dictate 2 GB / core is enough. Should I be
> paying for the additional RAM to make it 24 GB?

ram is, historically and relatively, cheap. otoh, can your code get by with 1.5G/core? actually, I tend to see some association of smallish memory footprints (2G/core is definitely not large) with cache-friendliness. this would argue that the higher bandwidth may not make much difference to your code...

regards, mark hahn.

From rpnabar at gmail.com Sat Aug 8 16:50:40 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Sat, 8 Aug 2009 18:50:40 -0500
Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem
In-Reply-To: <4A7CCDBF.7070904@ldeo.columbia.edu>
References: <4A7CCDBF.7070904@ldeo.columbia.edu>
Message-ID:

On Fri, Aug 7, 2009 at 7:58 PM, Gus Correa wrote:
> Some people are reporting good results when using the
> Nehalem hyperthreading feature (activated on the BIOS).
> When the code permits, this virtually doubles the number
> of cores on Nehalems.
> That feature works very well on IBM PPC-6 processors
> (IBM calls it "simultaneous multi-threading" SMT, IIRR),
> and scales by a factor of >1.5, at least with the atmospheric
> model I tried.

Thanks for all the useful comments, Gus! Hyperthreading is confusing the hell out of me. I expected to see 8 cores in cat /proc/cpuinfo. Now I see 16. (This means I must have left hyperthreading on, I guess; I ought to go to the server room, reboot and check the BIOS.)

This is confusing my benchmarking too. Let's say I ran an MPI job with -np 4. If there was no other job on this machine, would hyperthreading bring the other CPUs into play as well?

The reason I ask is this: I have noticed that a single 4 core job is slower than two 4 core jobs run simultaneously. This seems puzzling to me.

--
Rahul

From tegner at renget.se Sat Aug 8 22:42:28 2009
From: tegner at renget.se (Jon Tegner)
Date: Sun, 09 Aug 2009 07:42:28 +0200
Subject: [Beowulf] Small form computers as cluster nodes - any comments about the Shuttle brand ?
In-Reply-To:
References:
Message-ID: <4A7E61C4.1040409@renget.se>

David Ramirez wrote:
> Due to space constraints I am considering implementing a 8-node (+
> master) HPC cluster project using small form computers. Knowing that
> Shuttle is a reputable brand, with several years in the market, I
> wonder if any of you out there has already used them on clusters and
> how has been your experience (performance, reliability etc.)

Not exactly what you asked for, but slightly related anyway: I have been working on small AND silent lately. You can check some pictures at

www.renget.se/bilder/clm1s.jpg
www.renget.se/bilder/clm2s.jpg
www.renget.se/bilder/clm3s.jpg
www.renget.se/bilder/clm4s.jpg

You can judge the size from the mainboards (24.5x24.5 cm) or by the standard 3.5 HD. There are no fans in this system (except for the power bricks), and it is reasonably small. Cooling is achieved by transferring heat from the cpus to a cooling channel, and this heat is then removed by natural convection. The orientation of the boards also improves the cooling of the other components on the boards.

The absence of fans (and the small size) makes a system like this suitable for use in an office (1U systems are generally very noisy).

I'm working on an article for ClusterMonkey, and I'll fill in missing details on that forum.
Regards, /jon From tjrc at sanger.ac.uk Sun Aug 9 00:17:43 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Sun, 9 Aug 2009 08:17:43 +0100 Subject: [Beowulf] Small form computers as cluster nodes - any comments about the Shuttle brand ? In-Reply-To: <4A7DB115.1020409@tamu.edu> References: <4A7DAD2E.4060004@scalableinformatics.com> <4A7DB115.1020409@tamu.edu> Message-ID: <73994BB4-4E0F-43C8-9B1E-8BB718FD6BCA@sanger.ac.uk> If space is a constraint, but up-front cost less so, you might want to consider a small blade chassis; something like an HP c-3000, which can take 8 blades. Especially if all you want is a GigE interconnect, which will fit in the same box. Potentially that will get you 64 cores in 6U, and essentialy no extra space required for infrastructure. Presumably other blade vendors do similar things. Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From a.travis at abdn.ac.uk Sun Aug 9 04:30:11 2009 From: a.travis at abdn.ac.uk (Tony Travis) Date: Sun, 09 Aug 2009 12:30:11 +0100 Subject: [Beowulf] Small form computers as cluster nodes - any comments about the Shuttle brand ? In-Reply-To: <4A7DAD2E.4060004@scalableinformatics.com> References: <4A7DAD2E.4060004@scalableinformatics.com> Message-ID: <4A7EB343.3040403@abdn.ac.uk> Joe Landman wrote: > David Ramirez wrote: >> Due to space constraints I am considering implementing a 8-node (+ >> master) HPC cluster project using small form computers. Knowing that >> Shuttle is a reputable brand, with several years in the market, I wonder >> if any of you out there has already used them on clusters and how has >> been your experience (performance, reliability etc.) > > The down sides (from watching others do this) > > 1) no ECC ram. You will get bit-flips. ECC protects you (to a degree) > against some bit-flippage. If you can get ECC memory (and turn on the > ECC support in BIOS), by all means, do so. Hello, Joe and David. I agree about the ECC RAM, but I used to have six IWill dual Opteron 246 SFF computers in a Beowulf cluster and these do have ECC memory. > 2) power. One customer from a while ago did this, and found that the > power supplies on the units were not able to supply a machine running > the processor and memory (and disk/network etc) at nearly full load for > many hours. You have to make sure your entire computing infrastructure > (in the box) fits in *under* the power budget from the supply. This may > be easier these days using "gamer" rigs which have power to handle GPU > cards, but keep this in mind anyway. Absolutely, and that is why I said "I used to have" above! The Iwill's have custom PSU's that are badly overrun, and when they die it's very expensive to get them repaired. Eventually, they can't be repaired: I've had several returned now as "beyond economic repair", and I've decided to retire the IWill's. It's a pity, because the IWill's are nice machines. However, even with 55W Opteron 248HE's fitted the IWill Zmaxdp can't keep the CPU's cool under load unless they have their extremely noisy fans running at full speed. I've kept one Zmaxd2 for desktop use with dual Opteron 246HE's and it's fine, unless you make it work hard ;-) > 3) networks. Sadly, the NICs on the hobby machines aren't usually up to > the level of quality on the server systems. 
You might not get PXE > capability (though these days, I haven't seen many boards without it). Well, the IWill's are/were server-grade machines with GBit NIC's and they do PXE boot. > Just evaluate your options carefully with the specs in hand. You will > have design tradeoffs due to the space constraint, just keep in mind > your goals as you evaluate them. I really would avoid SFF systems as compute nodes: I've just used Tyan ATX FF S3970 motherboards in pedestal cases on industrial shelving and you bear in mind that standard Shuttle cases are only 50% the size of an ATX case. You can get four ATX cases in the space occupied by your eight Shuttle SFF computers... Bye, Tony. -- Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis at abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt From Bogdan.Costescu at iwr.uni-heidelberg.de Sun Aug 9 06:50:59 2009 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Sun, 9 Aug 2009 15:50:59 +0200 (CEST) Subject: [Beowulf] Small form computers as cluster nodes - any comments about the Shuttle brand ? In-Reply-To: References: Message-ID: On Fri, 7 Aug 2009, David Ramirez wrote: > Due to space constraints I am considering implementing a 8-node (+ master) > HPC cluster project using small form computers. Knowing that Shuttle is a > reputable brand, with several years in the market, I wonder if any of you > out there has already used them on clusters and how has been your experience > (performance, reliability etc.) I've built a cluster of 80 nodes, which will turn 5 this month. Using Shuttle SB75G2, supports ECC, has a GigE on board (Broadcom) and the power supply is more than enough for the CPU (PIV Northwood 3.2GHz), one SATA HDD, a low power and performance graphics card (there's no on board graphics unfortunately) and an extra GigE card (Intel E1000). The decision for adding an extra NIC was not due to problems with the Broadcom chip, but simply to have dedicated networks; the Broadcom is able to do PXE just fine and this is the way these nodes have booted since setting them up. I was pleasantly surprised by the reliability of these computers. Given their tightness, they require attention and good skills when building them, f.e. using good quality thermal paste to avoid local thermal problems and routing cables to avoid transport thermal problems. About 70 of the 80 are still running well today, most of the failed ones stopped working correctly after the 3 years of warranty so I didn't make much effort to find out what is wrong - the main problem being instability under combined CPU and I/O load. Of course, when RAM and HDDs failed and were easy to recognize as causes, they were replaced as needed. As I wrote earlier on this list, the main disadvantage of such SFFs is the lack of IPMI support. There is no serial console support in the BIOS, so changing BIOS settings is a pain. Power control can be achieved with a PDU, but I didn't choose this way because I knew that the nodes should be always up and I wouldn't have to press the power buttons too often ;-) Another thing to keep in mind is that, due to their tightness, they are quite sensitive to the external temperature - if the A/C fails, expect a sharp raise in internal temperature, so setting up monitoring, both environmental and for the builtin sensors, is recommended. Good luck! 
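A minimal way to keep an eye on that, assuming lm_sensors is set up on the nodes and the board's sensor chips are supported (the log path and interval are just examples):

  sensors                                   # one-shot reading of the on-board sensors
  while true; do                            # crude once-a-minute temperature log
      date >> /tmp/node-temp.log
      sensors | grep -i temp >> /tmp/node-temp.log
      sleep 60
  done

Anything fancier (Ganglia or Nagios thresholds, IPMI on machines that have it) builds on the same readings.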
-- Bogdan Costescu IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany Phone: +49 6221 54 8240, Fax: +49 6221 54 8850 E-mail: bogdan.costescu at iwr.uni-heidelberg.de From ljdursi at scinet.utoronto.ca Sun Aug 9 08:52:00 2009 From: ljdursi at scinet.utoronto.ca (Jonathan Dursi) Date: Sun, 9 Aug 2009 11:52:00 -0400 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: Message-ID: <432C3A61-6B51-4887-BE0E-C6848BB8E4BF@scinet.utoronto.ca> On 7-Aug-09, at 7:59PM, Rahul Nabar wrote: > Is it a bad mistake to configure a Nehalem (2 sockets quad core giving > a total of 8 cores; E5520) with 16 GB RAM (4 DIMMs of 4GB each)? It depends. You'll have to do the timings with your codes; with mine (a uniform grid explicit hydrodynamics code; memory limited, with extremely regular memory access patterns) I saw a pretty robust 10% performance difference between a 16GB `unbalanced' and an 18GB `balanced' memory configuration. You'll have to do the measurements and decide if the resulting performance gain is worth the cost... - Jonathan -- Jonathan Dursi From gus at ldeo.columbia.edu Sun Aug 9 19:34:07 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Sun, 09 Aug 2009 22:34:07 -0400 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> Message-ID: <4A7F871F.1050805@ldeo.columbia.edu> Hi Rahul, list See answers inline. Rahul Nabar wrote: > On Fri, Aug 7, 2009 at 7:58 PM, Gus Correa wrote: >> Some people are reporting good results when using the >> Nehalem hypethreading feature (activated on the BIOS). >> When the code permits, this virtually doubles the number >> of cores on Nehalems. >> That feature works very well on IBM PPC-6 processors >> (IBM calls it "simultaneous multi-threading" SMT, IIRR), >> and scales by a factor of >1.5, at least with the atmospheric >> model I tried. > > > Thanks for all the useful comments, Gus! Hyperthreading is confusing > the hell out of me. So it is to me. The good news is that according to all reports I read, hyperthreading in Nehalem works well (by contrast with the old version on Pentium-4 and the corresponding Xeons). I expected to see 8 cores in cat /proc/cpuinfo Now > I see 16. (This means I must have left hyperthreading on I guess; I > ought to go to the server room; reboot and check the BIOS) > Most likely it is on. Maybe it is the BIOS default, or the vendor set it up this way. Unfortunately I don't have access to the Nehalem machine. So, I can't check the /proc/cpuinfo here, play with MPI, etc. I helped a grad student configure it, for his thesis research, but the researcher who he works for is a PITA. Bad politics. > This is confusing my benchmarking too. Let's say I ran an MPI job with > -np 4. If there was no other job on this machine would hyperthreading > bring the other CPUs into play as well? > Which MPI do you use? IIRR, you have Gigabit Ethernet, right? (not Infiniband) If you use OpenMPI, you can set the processor affinity, i.e. bind each MPI process to one "processor" (which was once a CPU, then became a core, and now is probably a virtual processor associated to the hyperthreaded Nehalem core). In my experience (and other people's also) this improves performance. On the Opteron Shanghais we have, "top" shows the process number always paired with the "procesor", which in this case is a core, when processor affinity is set. 
I presume with Nehalem the thing will work, although the processes will be paired with the multithreaded core.

In OpenMPI all this takes is to add the flag:
-mca mpi_paffinity_alone 1
to the mpiexec line. OpenMPI has finer grained control of processor affinity through a file where you make the process-to-processor association. However, the setup above may be good enough for jazz, and is quite simple.

Up to MPICH2 1.0.8p1 there was no such thing in MPICH. However, I haven't checked their latest greatest version 1.1. They may have something now.

> The reason I ask is this: I have noticed that a single 4 core job is
> slower than two 4 core jobs run simultaneously. This seems puzzling to
> me.

It is possible that this is the result of not setting processor affinity. The Linux scheduler may not switch processes across cores/processors efficiently. You may check this out by logging in to a node and using "top", hitting "1" (to show all cores/hyperthreads), then hitting "f" to change the displayed fields, then hitting "j" (check, not sure if it is "j") to show the processor/core/hyperthreaded core.

I would guess you can pair 6 hyperthreaded cores on each socket to 6 processes. This would give a symmetric and probably load balanced distribution of work. This would also handle 12 processes per node, and fully utilize your 24GB of memory, on your production jobs that require 2GB/process. (Not sure you actually have 24GB or 16GB, though. You didn't say how much memory you bought.)

I would be curious to learn what you get with processor affinity on Nehalems. I would guess it should work, like on physical cores. At least on the IBM PPC-6 it does work and improves the performance. I read somebody reporting that it works well with Nehalems too, specifically with an ocean model, getting a decent scaling around 1.4 using 16 processes per node, IIRR.

I hope this helps.
Good luck!

Gus Correa

From rpnabar at gmail.com Sun Aug 9 20:33:09 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Sun, 9 Aug 2009 22:33:09 -0500
Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem
In-Reply-To: <4A7F871F.1050805@ldeo.columbia.edu>
References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu>
Message-ID:

On Sun, Aug 9, 2009 at 9:34 PM, Gus Correa wrote:
> Most likely it is on.
> Maybe it is the BIOS default, or the vendor set it up this way.
>
> Unfortunately I don't have access to the Nehalem machine.
> So, I can't check the /proc/cpuinfo here, play with MPI, etc.
> I helped a grad student configure it, for his thesis research,
> but the researcher who he works for is a PITA. Bad politics.

Is there a way of finding out within Linux if Hyperthreading is on or not? I know there is a BIOS setting, but one of the machines I am testing is remote and I do not have access to the BIOS. I'll ask them, though, but I am impatient to figure it out!

Alternatively, /proc/cpuinfo shows a bunch of cores, say 16. Is there a way to find out if all of these are real cores or hyperthreaded?

--
Rahul

From rpnabar at gmail.com Sun Aug 9 20:42:25 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Sun, 9 Aug 2009 22:42:25 -0500
Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem
In-Reply-To: <4A7F871F.1050805@ldeo.columbia.edu>
References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu>
Message-ID:

On Sun, Aug 9, 2009 at 9:34 PM, Gus Correa wrote:
> See answers inline.

Thanks!

> So it is to me.
> The good news is that according to all reports I read,
> hyperthreading in Nehalem works well

What I am more concerned about is its implications on benchmarking and schedulers.

(a) I am seeing strange scaling behaviours with Nehalem cores, e.g. a specific DFT (Density Functional Theory) code we use is maxing out performance at 2, 4 cpus instead of 8, i.e. runs on 8 cores are actually slower than 2 and 4 cores (depending on setup). It just doesn't make sense to me. We are indeed doing something wrong. And no, it isn't just bad parallelization of this code, since we have run it on AMDs and of course performance increases with cores on a single server for sure.

(b) We usually set up Torque / PBS / Maui to also allow partial server requests, i.e. somebody could say just get 4 cores on a server. The other four cores could go to another job or stay empty. The question is: with hyperthreading this compartmentalization is lost, isn't it? So userA who got 4 cores could end up leeching on the other 4 cores too? Or am I wrong?

>
> Which MPI do you use?

OpenMPI

> IIRR, you have Gigabit Ethernet, right? (not Infiniband)

Yes. That's right. No Infiniband.

> If you use OpenMPI, you can set the processor affinity,
> i.e. bind each MPI process to one "processor" (which was once
> a CPU, then became a core, and now is probably a virtual
> processor associated to the hyperthreaded Nehalem core).
> In my experience (and other people's also) this improves
> performance.

Yup, good point. I have done this with Barcelonas (AMD) and had a 5% boost. Let me try it with the Nehalems too.

>
> It is possible that this is the result of not setting
> processor affinity.
> The Linux scheduler may not switch processes
> across cores/processors efficiently.

So let me double check my understanding. On this Nehalem, if I set the processor affinity, is that akin to disabling hyperthreading too? Or are these two independent concepts?

> (Not sure you actually have 24GB or 16GB, though.
> You didn't say how much memory you bought.)

I am running two tests. machineA has 24 GB, machineB has 16 GB. But other things change too: machineA has the X5550 whereas machineB has the E5520. I'll post the results once I have them for the Nehalems!

Thanks again, Gus. All very helpful.

--
Rahul

From tomislav.maric at gmx.com Mon Aug 10 04:48:02 2009
From: tomislav.maric at gmx.com (Tomislav Maric)
Date: Mon, 10 Aug 2009 13:48:02 +0200
Subject: [Beowulf] Re: [Paraview] VTK under ParaView
In-Reply-To:
References: <4A7FEBD4.4070404@gmx.com>
Message-ID: <4A8008F2.3000506@gmx.com>

Jérôme wrote:
> Hi,
>
> ParaView comes with its own VTK sources. You can find them in the source tree: ./Paraview3/VTK
> The VTK binaries will be put in the ParaView binary tree: ./ParaViewBin/bin
>
> Obviously, the paths depend on how you call it, and on your CMake settings.
>
> Hope that helps
>
> Jerome

Thank you for the advice. I think I have solved the problem by installing the software from the .deb package.

Best regards,
Tomislav

From hahn at mcmaster.ca Mon Aug 10 05:33:27 2009
From: hahn at mcmaster.ca (Mark Hahn)
Date: Mon, 10 Aug 2009 08:33:27 -0400 (EDT)
Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem
In-Reply-To:
References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu>
Message-ID:

> Is there a way of finding out within Linux if Hyperthreading is on or
> not?

in /proc/cpuinfo, I believe it's as simple as siblings > cpu cores. that is, I'm guessing one of your Nehalems shows as having 8 siblings and 4 cpu cores.
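a quick way to check, assuming a kernel that exports those fields in /proc/cpuinfo:

  grep "physical id" /proc/cpuinfo | sort -u | wc -l   # number of populated sockets
  grep "cpu cores" /proc/cpuinfo | sort -u             # real cores per socket
  grep "siblings" /proc/cpuinfo | sort -u              # logical cpus per socket

if siblings comes back larger than cpu cores (e.g. 8 vs 4), HT is enabled.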
From hahn at mcmaster.ca Mon Aug 10 05:41:09 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 10 Aug 2009 08:41:09 -0400 (EDT) Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: > (a) I am seeing strange scaling behaviours with Nehlem cores. eg A > specific DFT (Density Functional Theory) code we use is maxing out > performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are > actually slower than 2 and 4 cores (depending on setup) this is on the machine which reports 16 cores, right? I'm guessing that the kernel is compiled without numa and/or ht, so enumerates virtual cpus first. that would mean that when otherwise idle, a 2-core proc will get virtual cores within the same physical core. and that your 8c test is merely keeping the first socket busy. > other four cores could go to another job or stay empty. Question is > with hyperthreading this compartmentalization is lost isn't it? So > userA who got 4 cores could end up leeching on the other 4 cores too? > Or am I wrong? the kernel/scheduler is smart enough to do mostly the right thing WRT virtual cores. when compiled properly... >> It is possible that this is the result of not setting >> processor affinity. >> The Linux scheduler may not switch processes >> across cores/processors efficiently. > > So let me double check my understanding. On this Nehalem if I set the > processor affinity is that akin to disabling hyperthreading too? Or > are these two independent concepts? processor affinity just means restricting the set of cores a proc can run on. it's orthogonal to the question of choosing the _right_ cores. From hahn at mcmaster.ca Mon Aug 10 08:51:31 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 10 Aug 2009 11:51:31 -0400 (EDT) Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: <20090810154348.GC6915@alice05> References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> <20090810154348.GC6915@alice05> Message-ID: > Googling for 'dmidecode Hyper Thread' I found this 2004 article: the info in /proc/cpuinfo has definitely changed since 2004. From mdidomenico4 at gmail.com Mon Aug 10 09:26:59 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Mon, 10 Aug 2009 12:26:59 -0400 Subject: [Beowulf] sun x4100's with infiniband Message-ID: just cause i've posted it everywhere else, figured i'd make one last ditch effort and see if anyone on this list might know the answer... I have several Sun x4100 with Infiniband servers which appear to be running at 200MB/sec instead of 800MB/sec. It's a freshly reformatted cluster converting from solaris to linux. During the conversion we reset the bios settings with "load optimal defaults" and cleared all the BMC/BIOS events logs and such. Does anyone know which bios setting got changed during the process which dropped the bandwidth? From rpnabar at gmail.com Mon Aug 10 09:29:42 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 10 Aug 2009 11:29:42 -0500 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: On Mon, Aug 10, 2009 at 7:33 AM, Mark Hahn wrote: > in /proc/cpuinfo, I believe it's a simple as siblings > cpu cores. > that is, I'm guessing one of your nehalem's shows as having 8 siblings > and 4 cpu cores. Yes. That works. 
Also looking at the "physical id" helps. I was confused by the ht flag. Apparently that is not relevant. It only indicates whether the CPU can report hyperthreading or not. No wonder all my boxes have that "ht" flag.

--
Rahul

From rpnabar at gmail.com Mon Aug 10 09:41:06 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Mon, 10 Aug 2009 11:41:06 -0500
Subject: [Beowulf] bizarre scaling behavior on a Nehalem
Message-ID:

A while ago Tiago Marques had provided some benchmarking info in a thread ( http://www.beowulf.org/archive/2009-May/025739.html ) and some recent tests that I've been doing made me interested in this snippet again:

>One of the codes, VASP, is very bandwidth limited and loves to run in a
>number of cores multiple of 3. The 5400s are also very bandwidth - memory and
>FSB - limited which causes that they sometimes don't scale well above 6
>cores. They are very fast per core, as someone mentioned, when compared to
>AMD cores.
>These are the times I get from a benchmark I usually run in VASP:
>
>VASP on Core i7:
> - 1 core = 162.453s, 162.778s (no HT)
> - 2 cores = 100s,102s (no HT)
> - 3 cores = 77.835s, 78.195s (no HT)
> - 4 cores = 87.63s, 87.322s (no HT)
> - 6 cores = *76.56s, 76.4s*
> - 6 cores DDR3-1600 CAS9 - 69.654s, 68.816s, 67.7s
>
>HT doesn't add much but DDR3-1600 does. Still, ~78s is very fast with a
>quad-core because our dual 5400s can only do *91s* at best, even using
>tweaks like CPU affinity, which brings it down from 95s, by distributing
>only 3 threads per socket and not 4/2 or having 4 of them constantly jumping
>from socket to socket.

Apparently it shows that the Nehalems for VASP scale well only to 3 cores? Putting 4 cores on the job actually causes the runtime to increase? This seems pretty bizarre to me at first sight, but it is close to what I am getting as well. Have any other people seen similar scaling? (I am trying the cpu affinity flags now to see if that makes a difference.)

How would you explain this? In the past I've seen the codes scale well to core numbers higher than this.

--
Rahul

From rpnabar at gmail.com Mon Aug 10 09:43:22 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Mon, 10 Aug 2009 11:43:22 -0500
Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem
In-Reply-To:
References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu>
Message-ID:

On Mon, Aug 10, 2009 at 7:41 AM, Mark Hahn wrote:
>> (a) I am seeing strange scaling behaviours with Nehalem cores, e.g. a
>> specific DFT (Density Functional Theory) code we use is maxing out
>> performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are
>> actually slower than 2 and 4 cores (depending on setup)
>
> this is on the machine which reports 16 cores, right? I'm guessing
> that the kernel is compiled without numa and/or ht, so enumerates virtual
> cpus first. that would mean that when otherwise idle, a 2-core
> proc will get virtual cores within the same physical core. and that your 8c
> test is merely keeping the first socket busy.

No. On both machines: the one reporting 16 cores and the other reporting 8, i.e. one hyperthreaded and the other not. Both have 8 physical cores.

What is bizarre is I tried using -np 16. That ought to definitely utilize all cores, right? I'd have expected the 16 core performance to be the best. But no, the performance peaks at a smaller number of cores.
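For the next round I'm going to pin the ranks explicitly so the HT siblings can't muddy the comparison; roughly along these lines, assuming OpenMPI's paffinity knob and taskset from util-linux (the binary name and core numbers are placeholders -- the real physical-core numbering has to be read off /proc/cpuinfo first):

  # let Open MPI bind one rank per core
  mpirun -np 8 -mca mpi_paffinity_alone 1 ./dft_code
  # or pin a small test by hand to known physical cores
  taskset -c 0,1,2,3 ./dft_code

If the numbers still peak below 8 cores with clean pinning, then it really is memory bandwidth and not scheduler noise.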
-- Rahul From hahn at mcmaster.ca Mon Aug 10 10:04:56 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 10 Aug 2009 13:04:56 -0400 (EDT) Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: >> this is on the machine which reports 16 cores, right? ?I'm guessing >> that the kernel is compiled without numa and/or ht, so enumerates virtual >> cpus first. ?that would mean that when otherwise idle, a 2-core >> proc will get virtual cores within the same physical core. ?and that your 8c >> test is merely keeping the first socket busy. > > No. On both machines. The one reporting 16 cores and the other > reporting 8. i.e. one hyperthreaded and the other not. Both having 8 > physical cores. > > What is bizarre is I tried using -np 16. THat ought to definitely > utilize all cores, right? I'd have expected the 16 core performance to > be the best. BUt no the performance peaks at a smaller number of > cores. I think I would still invoke kernel miscompilation, since if the kernel isn't aware of the memory/core/socket topology, it probably makes quite poor affinity-oblivious allocations. this is the machine where numactl doesn't do anything sensible, right? From kus at free.net Mon Aug 10 10:43:56 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Mon, 10 Aug 2009 21:43:56 +0400 Subject: [Beowulf] numactl & SuSE11.1 In-Reply-To: <4A7E038A.6080406@gmail.com> Message-ID: I'm sorry for my mistake: the problem is on Nehalem Xeon under SuSE -11.1, but w/kernel 2.6.27.7-9 (w/Supermicro X8DT mobo). For Opteron 2350 w/SuSE 10.3 (w/ more old 2.6.22.5-31 -I erroneously inserted this string in my previous message) numactl works OK (w/Tyan mobo). NUMA is enabled in BIOS. Of course, CONFIG_NUMA (and CONFIG_NUMA_EMU) are setted to "y" in both kernels. Unfortunately I (i.e. root) can't change files in /sys/devices/system/node (or rename directory node2 to node1) :-( - as it's possible w/some files in /proc filesystem. It's interesting, that extraction from dmesg show, that IT WAS NODE1, but then node2 is appear ! ACPI: SRAT BF79A4B0, 0150 (r1 041409 OEMSRAT 1 INTL 1) ACPI: SSDT BF79FAC0, 249F (r1 DpgPmm CpuPm 12 INTL 20051117) ACPI: Local APIC address 0xfee00000 SRAT: PXM 0 -> APIC 0 -> Node 0 SRAT: PXM 0 -> APIC 2 -> Node 0 SRAT: PXM 0 -> APIC 4 -> Node 0 SRAT: PXM 0 -> APIC 6 -> Node 0 SRAT: PXM 1 -> APIC 16 -> Node 1 SRAT: PXM 1 -> APIC 18 -> Node 1 SRAT: PXM 1 -> APIC 20 -> Node 1 SRAT: PXM 1 -> APIC 22 -> Node 1 SRAT: Node 0 PXM 0 0-a0000 SRAT: Node 0 PXM 0 100000-c0000000 SRAT: Node 0 PXM 0 100000000-1c0000000 SRAT: Node 2 PXM 257 1c0000000-340000000 (here !!) NUMA: Allocated memnodemap from 1c000 - 22880 NUMA: Using 20 for the hash shift. 
Bootmem setup node 0 0000000000000000-00000001c0000000 NODE_DATA [0000000000022880 - 000000000003a87f] bootmap [000000000003b000 - 0000000000072fff] pages 38 (8 early reservations) ==> bootmem [0000000000 - 01c0000000] #0 [0000000000 - 0000001000] BIOS data page ==> [0000000000 - 0000001000] #1 [0000006000 - 0000008000] TRAMPOLINE ==> [0000006000 - 0000008000] #2 [0000200000 - 0000bf27b8] TEXT DATA BSS ==> [0000200000 - 0000bf27b8] #3 [0037a3b000 - 0037fef104] RAMDISK ==> [0037a3b000 - 0037fef104] #4 [000009cc00 - 0000100000] BIOS reserved ==> [000009cc00 - 0000100000] #5 [0000010000 - 0000013000] PGTABLE ==> [0000010000 - 0000013000] #6 [0000013000 - 000001c000] PGTABLE ==> [0000013000 - 000001c000] #7 [000001c000 - 0000022880] MEMNODEMAP ==> [000001c000 - 0000022880] Bootmem setup node 2 00000001c0000000-0000000340000000 NODE_DATA [00000001c0000000 - 00000001c0017fff] bootmap [00000001c0018000 - 00000001c0047fff] pages 30 (8 early reservations) ==> bootmem [01c0000000 - 0340000000] #0 [0000000000 - 0000001000] BIOS data page #1 [0000006000 - 0000008000] TRAMPOLINE #2 [0000200000 - 0000bf27b8] TEXT DATA BSS #3 [0037a3b000 - 0037fef104] RAMDISK #4 [000009cc00 - 0000100000] BIOS reserved #5 [0000010000 - 0000013000] PGTABLE #6 [0000013000 - 000001c000] PGTABLE #7 [000001c000 - 0000022880] MEMNODEMAP found SMP MP-table at [ffff8800000ff780] 000ff780 [ffffe20000000000-ffffe20006ffffff] PMD -> [ffff880028200000-ffff88002e1fffff] on node 0 [ffffe20007000000-ffffe2000cffffff] PMD -> [ffff8801c0200000-ffff8801c61fffff] on node 2 Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow From renato.oferenda at gmail.com Mon Aug 10 08:43:49 2009 From: renato.oferenda at gmail.com (Renato Callado Borges) Date: Mon, 10 Aug 2009 12:43:49 -0300 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: <20090810154348.GC6915@alice05> On Mon, Aug 10, 2009 at 08:33:27AM -0400, Mark Hahn wrote: >> Is there a way of finding out within Linux if Hyperthreading is on or >> not? > > in /proc/cpuinfo, I believe it's a simple as siblings > cpu cores. > that is, I'm guessing one of your nehalem's shows as having 8 siblings > and 4 cpu cores. Googling for 'dmidecode Hyper Thread' I found this 2004 article: http://www.linux.com/archive/articles/41088 And it says: "I would have liked to just read /proc/cpuinfo to determine if Hyper-Threading is enabled, but currently that info is not exported to that file. /proc/cpuinfo just displays the number of physical CPUs in the system and ignores Hyper-Threading. The process of using x86info is similar to the process of using dmidecode: execute and parse the output. In this case, x86info will say _The physical package supports 2 logical processors_ if Hyper-Threading is enabled on a standard Xeon system." Installed x86info in my box, ran it and (correctly) it says my box' physical package supports 1 logical processor. (It's a Pentium 4). -- []'s, RCB. 
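Along the same lines, the siblings vs. "cpu cores" comparison Mark suggested is easy to script without dmidecode or x86info; a minimal sketch, assuming a kernel new enough to export both fields in /proc/cpuinfo:

# HT check from /proc/cpuinfo: compare logical siblings to physical cores per package
siblings=$(awk -F: '/^siblings/ {print $2+0; exit}' /proc/cpuinfo)
cores=$(awk -F: '/^cpu cores/ {print $2+0; exit}' /proc/cpuinfo)
echo "siblings=$siblings physical cores=$cores"
# siblings > cores  : Hyper-Threading is enabled (2 logical CPUs per core on Nehalem)
# siblings == cores : HT is off, or the kernel does not report it

On older kernels (the Pentium 4 era that the 2004 article describes) the "cpu cores" line may be missing entirely, which is exactly when x86info or dmidecode become the practical fallback.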
From jlb17 at duke.edu Mon Aug 10 12:09:48 2009 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Mon, 10 Aug 2009 15:09:48 -0400 (EDT) Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: On Mon, 10 Aug 2009 at 11:43am, Rahul Nabar wrote > On Mon, Aug 10, 2009 at 7:41 AM, Mark Hahn wrote: >>> (a) I am seeing strange scaling behaviours with Nehlem cores. eg A >>> specific DFT (Density Functional Theory) code we use is maxing out >>> performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are >>> actually slower than 2 and 4 cores (depending on setup) >> >> this is on the machine which reports 16 cores, right? ?I'm guessing >> that the kernel is compiled without numa and/or ht, so enumerates virtual >> cpus first. ?that would mean that when otherwise idle, a 2-core >> proc will get virtual cores within the same physical core. ?and that your 8c >> test is merely keeping the first socket busy. > > No. On both machines. The one reporting 16 cores and the other > reporting 8. i.e. one hyperthreaded and the other not. Both having 8 > physical cores. > > What is bizarre is I tried using -np 16. THat ought to definitely > utilize all cores, right? I'd have expected the 16 core performance to > be the best. BUt no the performance peaks at a smaller number of > cores. Well, as there are only 8 "real" cores, running a computationally intensive process across 16 should *definitely* do worse than across 8. However, it's not so surprising that you're seeing peak performance with 2-4 threads. Nehalem can actually overclock itself when only some of the cores are busy -- it's called Turbo Mode. That *could* be what you're seeing. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From gus at ldeo.columbia.edu Mon Aug 10 12:40:15 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 10 Aug 2009 15:40:15 -0400 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: <4A80779F.4080300@ldeo.columbia.edu> Joshua Baker-LePain wrote: > On Mon, 10 Aug 2009 at 11:43am, Rahul Nabar wrote > >> On Mon, Aug 10, 2009 at 7:41 AM, Mark Hahn wrote: >>>> (a) I am seeing strange scaling behaviours with Nehlem cores. eg A >>>> specific DFT (Density Functional Theory) code we use is maxing out >>>> performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are >>>> actually slower than 2 and 4 cores (depending on setup) >>> >>> this is on the machine which reports 16 cores, right? I'm guessing >>> that the kernel is compiled without numa and/or ht, so enumerates >>> virtual >>> cpus first. that would mean that when otherwise idle, a 2-core >>> proc will get virtual cores within the same physical core. and that >>> your 8c >>> test is merely keeping the first socket busy. >> >> No. On both machines. The one reporting 16 cores and the other >> reporting 8. i.e. one hyperthreaded and the other not. Both having 8 >> physical cores. >> >> What is bizarre is I tried using -np 16. THat ought to definitely >> utilize all cores, right? I'd have expected the 16 core performance to >> be the best. BUt no the performance peaks at a smaller number of >> cores. > > Well, as there are only 8 "real" cores, running a computationally > intensive process across 16 should *definitely* do worse than across 8. 
> However, it's not so surprising that you're seeing peak performance with > 2-4 threads. Nehalem can actually overclock itself when only some of > the cores are busy -- it's called Turbo Mode. That *could* be what > you're seeing. > Hi Rahul, Joshua, list If Rahul is running these tests with his production jobs, which he says require 2GB/process, and if he has 24GB/node (or is it 16GB/node?), then with 16 processes running on a node memory paging probably kicked in, because the physical memory is less than 32GB. Would this be the reason for the drop in performance, Rahul? In any case, Joshua is right that you can't expect linear scaling from 8 to 16 processes on a node. What I saw on an IBM machine with PPC-6 and SMT (similar to Intel hyperthreading) was a speedup of around 1.4, rather than 2. Still a great deal! If I understand right, hyperthreading opportunistically uses idle execution units on a core to schedule a second thread to use them. As clever and efficient as it is, I would guess this mechanism cannot produce as much work as two physical cores. There is an article about it in Tom's Hardware: http://www.tomshardware.com/reviews/Intel-i7-nehalem-cpu,2041-5.html My $0.02 of guesses Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rpnabar at gmail.com Mon Aug 10 13:02:51 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 10 Aug 2009 15:02:51 -0500 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: On Mon, Aug 10, 2009 at 2:09 PM, Joshua Baker-LePain wrote: > Well, as there are only 8 "real" cores, running a computationally intensive > process across 16 should *definitely* do worse than across 8. However, it's > not so surprising that you're seeing peak performance with 2-4 threads. > ?Nehalem can actually overclock itself when only some of the cores are busy > -- it's called Turbo Mode. ?That *could* be what you're seeing. That could very well be it! Is there any way to test if the CPU has overclocked itself? Or can I turn the "turbo mode" off and check? -- Rahul From jlb17 at duke.edu Mon Aug 10 13:07:00 2009 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Mon, 10 Aug 2009 16:07:00 -0400 (EDT) Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: On Mon, 10 Aug 2009 at 3:02pm, Rahul Nabar wrote > On Mon, Aug 10, 2009 at 2:09 PM, Joshua Baker-LePain wrote: >> Well, as there are only 8 "real" cores, running a computationally intensive >> process across 16 should *definitely* do worse than across 8. However, it's >> not so surprising that you're seeing peak performance with 2-4 threads. >> ?Nehalem can actually overclock itself when only some of the cores are busy >> -- it's called Turbo Mode. ?That *could* be what you're seeing. > > That could very well be it! 
Is there any way to test if the CPU has > overclocked itself? > > Or can I turn the "turbo mode" off and check? You *should* be able to turn off turbo mode in the BIOS. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From Craig.Tierney at noaa.gov Mon Aug 10 13:20:36 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Mon, 10 Aug 2009 14:20:36 -0600 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: <4A808114.1020502@noaa.gov> Joshua Baker-LePain wrote: > On Mon, 10 Aug 2009 at 11:43am, Rahul Nabar wrote > >> On Mon, Aug 10, 2009 at 7:41 AM, Mark Hahn wrote: >>>> (a) I am seeing strange scaling behaviours with Nehlem cores. eg A >>>> specific DFT (Density Functional Theory) code we use is maxing out >>>> performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are >>>> actually slower than 2 and 4 cores (depending on setup) >>> >>> this is on the machine which reports 16 cores, right? I'm guessing >>> that the kernel is compiled without numa and/or ht, so enumerates >>> virtual >>> cpus first. that would mean that when otherwise idle, a 2-core >>> proc will get virtual cores within the same physical core. and that >>> your 8c >>> test is merely keeping the first socket busy. >> >> No. On both machines. The one reporting 16 cores and the other >> reporting 8. i.e. one hyperthreaded and the other not. Both having 8 >> physical cores. >> >> What is bizarre is I tried using -np 16. THat ought to definitely >> utilize all cores, right? I'd have expected the 16 core performance to >> be the best. BUt no the performance peaks at a smaller number of >> cores. > > Well, as there are only 8 "real" cores, running a computationally > intensive process across 16 should *definitely* do worse than across 8. > However, it's not so surprising that you're seeing peak performance with > 2-4 threads. Nehalem can actually overclock itself when only some of > the cores are busy -- it's called Turbo Mode. That *could* be what > you're seeing. > We are seeing that the chips will overclock themselves even with all cores running. The percent increase in speed can be from 2-10% per node. I have never had a run (single node HPL) run as slow as it does when Turbo is turned off. However, with all the variation per node, there isn't much of a win for large jobs as they will generally slow down to the slowest node. Craig > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Craig Tierney (craig.tierney at noaa.gov) From bill at cse.ucdavis.edu Mon Aug 10 13:22:49 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Mon, 10 Aug 2009 13:22:49 -0700 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: <4A808199.4050605@cse.ucdavis.edu> Joshua Baker-LePain wrote: > Well, as there are only 8 "real" cores, running a computationally > intensive process across 16 should *definitely* do worse than across 8. I've seen many cases where that isn't true. The P4 rarely justified turning on HT because throughput would often be lower. 
With the nehalem often it helps, the best way to tell is to try it. > However, it's not so surprising that you're seeing peak performance with > 2-4 threads. Nehalem can actually overclock itself when only some of > the cores are busy -- it's called Turbo Mode. That *could* be what > you're seeing. Indeed. From tom.elken at qlogic.com Mon Aug 10 14:07:23 2009 From: tom.elken at qlogic.com (Tom Elken) Date: Mon, 10 Aug 2009 14:07:23 -0700 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: <35AAF1E4A771E142979F27B51793A48885F2481BA4@AVEXMB1.qlogic.org> > Well, as there are only 8 "real" cores, running a computationally > intensive process across 16 should *definitely* do worse than across 8. Not typically. At the SPEC website there are quite a few SPEC MPI2007 (which is an average across 13 HPC applications) results on Nehalem. Summary: IBM, SGI and Platform have some comparisons on clusters with "SMT On" of running 1 rank for every core compared to running 2 ranks on every core. In general, on low core-counts, like up to 32 there is about an 8% advantage for running 2 ranks per core. At larger core counts, IBM published a pair of results on 64 cores where the 64-rank performance was equal to the 128-rank performance. Not all of these applications scale linearly, so on some of them you lose efficiency at 128 ranks compared to 64 ranks. Details: Results from this year are mostly on Nehalem: http://www.spec.org/mpi2007/results/res2009q3/ (IBM) http://www.spec.org/mpi2007/results/res2009q2/ (Platform) http://www.spec.org/mpi2007/results/res2009q1/ (SGI) (Intel has results with Turbo mode turned on and off in the q2 and q3 results, for a different comparison) Or you can pick out the Xeon 'X5570' and 'X5560' results from the list of all results: http://www.spec.org/mpi2007/results/mpi2007.html In the result index, when " Compute Threads Enabled" = 2x "Compute Cores Enabled", then you know SMT is turned on. In these cases, you can then check that when " MPI Ranks" = " Compute Threads Enabled" then you are running 2 ranks per core. -Tom > However, it's not so surprising that you're seeing peak performance > with > 2-4 threads. Nehalem can actually overclock itself when only some of > the > cores are busy -- it's called Turbo Mode. That *could* be what you're > seeing. > > -- > Joshua Baker-LePain > QB3 Shared Cluster Sysadmin > UCSF From rpnabar at gmail.com Mon Aug 10 15:28:59 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 10 Aug 2009 17:28:59 -0500 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: Message-ID: On Mon, Aug 10, 2009 at 12:48 PM, Bruno Coutinho wrote: > This is often caused by cache competition or memory bandwidth saturation. > If it was cache competition, rising from 4 to 6 threads would make it worse. > As the code became faster with DDR3-1600 and much slower with Xeon 5400, > this code is memory bandwidth bound. > Tweaking CPU affinity to avoid thread jumping among cores of the will not > help much, as the big bottleneck is memory bandwidth. > To this code, CPU affinity will only help in NUMA machines to maintain > memory access in local memory. > > > If the machine has enough bandwidth to feed the cores, it will scale. Exactly! But I thought this was the big advance with the Nehalem that it has removed the CPU<->Cache<->RAM bottleneck. 
So if the code scaled with the AMD Barcelona then it would continue to scale with the Nehalem right? I'm posting a copy of my scaling plot here if it helps. http://dl.getdropbox.com/u/118481/nehalem_scaling.jpg To remove most possible confounding factors this particular Nehalem plot is produced with the following settings: Hyperthreading OFF 24GB memory i.e. 6 banks of 4GB. i.e. optimum memory configuration X5550 Even if we explained away the bizarre performance of the 4 core case as the Turbo effect, what is most confusing is how the 8 core data point could be so much slower than the corresponding 8 core point on an old AMD Barcelona. Something's wrong here that I just do not understand. BTW, any other VASP users here? Anybody have any Nehalem experience? -- Rahul From h-bugge at online.no Tue Aug 11 00:43:03 2009 From: h-bugge at online.no (=?ISO-8859-1?Q?H=E5kon_Bugge?=) Date: Tue, 11 Aug 2009 09:43:03 +0200 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: <35AAF1E4A771E142979F27B51793A48885F2481BA4@AVEXMB1.qlogic.org> References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> <35AAF1E4A771E142979F27B51793A48885F2481BA4@AVEXMB1.qlogic.org> Message-ID: <72371DA9-DC23-4F99-A6FE-F3BC9854041E@online.no> On Aug 10, 2009, at 23:07, Tom Elken wrote: > Summary: > IBM, SGI and Platform have some comparisons on clusters with "SMT > On" of running 1 rank for every core compared to running 2 ranks on > every core. In general, on low core-counts, like up to 32 there is > about an 8% advantage for running 2 ranks per core. At larger core > counts, IBM published a pair of results on 64 cores where the 64- > rank performance was equal to the 128-rank performance. Not all of > these applications scale linearly, so on some of them you lose > efficiency at 128 ranks compared to 64 ranks. > > Details: Results from this year are mostly on Nehalem: > http://www.spec.org/mpi2007/results/res2009q3/ (IBM) > http://www.spec.org/mpi2007/results/res2009q2/ (Platform) > http://www.spec.org/mpi2007/results/res2009q1/ (SGI) > (Intel has results with Turbo mode turned on and off > in the q2 and q3 results, for a different comparison) > > Or you can pick out the Xeon 'X5570' and 'X5560' results from the > list of all results: > http://www.spec.org/mpi2007/results/mpi2007.html > > In the result index, when > " Compute Threads Enabled" = 2x "Compute Cores Enabled", then you > know SMT is turned on. > In these cases, you can then check that when > " MPI Ranks" = " Compute Threads Enabled" then you are running 2 > ranks per core. Tom, Thanks for the neatly compiled information above. I can just add that I have conducted a fairly detailed analysis of Nehalem compared to Harpertown in my paper "An evaluation of Intel's Core i7 architecture using a comparative approach", presented at ISC'09. Here, I look at different aspects of the memory hierarchy of the two processors. The benefits from hyperthreading on the said 13 SPEC MPI2007 applications are also studied, although using only a single node, where the advantage is more pronounced. Thanks, Håkon -------------- next part -------------- An HTML attachment was scrubbed... URL: From deadline at eadline.org Tue Aug 11 06:04:27 2009 From: deadline at eadline.org (Douglas Eadline) Date: Tue, 11 Aug 2009 09:04:27 -0400 (EDT) Subject: [Beowulf] The True Cost of HPC Cluster Ownership Message-ID: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> All, I posted this on ClusterMonkey the other week.
It is actually derived from a white paper I wrote for SiCortex. I'm sure those on this list have some experience/opinions with these issues (and other cluster issues!) The True Cost of HPC Cluster Ownership http://www.clustermonkey.net//content/view/262/1/ -- Doug From Daniel.Pfenniger at unige.ch Tue Aug 11 07:38:19 2009 From: Daniel.Pfenniger at unige.ch (Daniel Pfenniger) Date: Tue, 11 Aug 2009 16:38:19 +0200 Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> Message-ID: <4A81825B.4000303@unige.ch> Douglas Eadline wrote: > All, > > I posted this on ClusterMonkey the other week. > It is actually derived from a white paper I wrote for > SiCortex. I'm sure those on this list have some > experience/opinions with these issues (and other > cluster issues!) > > The True Cost of HPC Cluster Ownership > > http://www.clustermonkey.net//content/view/262/1/ > This article sounds unbalanced and self-serving. While it is clear that self-made clusters imply added new costs compared with turn-key clusters, they also empower the buyer, through standard and open solutions, with increased independence from the vendor, and they also build up the buyer's knowledge for future choices. This aspect is hard to measure in monetary terms, but certainly very important for some users. I have experienced all kinds of clusters (turn-key, mostly self-assembled, and partly vendor assembled and tested), and my conclusion is that the best is when the user has at least the choice to determine the degree of vendor integration/lock-in.
Bad choices occur > because people are badly informed, and the article is so biased > that it doesn't improve objective information on this regard, just > serves as increasing fear and doubt. In our experiences over the last two clusters, one was delivered n a bunch of flat boxes andwe spent several weeks racking, stacking, cabling, loading, testing, tweaking and then releasing. Or, was that months. In our more recent cluster, we took delivery of a purported tested system, then lived through 2+ weeks of vendor cabling (pretty, if an extended time to achieve), hardware failures, replacements, BIOS upgrades (wholesale; shouldn't have been needed), and more hardware failures. Next time, I want to either get a complete turnkey system or buy from the various sources and just do it myself, knowing the pitfalls. gerry -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From dnlombar at ichips.intel.com Tue Aug 11 08:40:56 2009 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Tue, 11 Aug 2009 08:40:56 -0700 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: <4A7CCDBF.7070904@ldeo.columbia.edu> <4A7F871F.1050805@ldeo.columbia.edu> Message-ID: <20090811154056.GA6987@nlxdcldnl2.cl.intel.com> On Mon, Aug 10, 2009 at 01:02:51PM -0700, Rahul Nabar wrote: > On Mon, Aug 10, 2009 at 2:09 PM, Joshua Baker-LePain wrote: > > Well, as there are only 8 "real" cores, running a computationally intensive > > process across 16 should *definitely* do worse than across 8. Some workloads will benefit materially from SMT, some are neutral, and some will degrade. For those that degrade, simply not oversubscribing the physical cores will get best performance. > > However, it's > > not so surprising that you're seeing peak performance with 2-4 threads. > > ?Nehalem can actually overclock itself when only some of the cores are busy > > -- it's called Turbo Mode. ?That *could* be what you're seeing. > > That could very well be it! Is there any way to test if the CPU has > overclocked itself? There's an application note on the subect at: Be aware this document is very technical, talking about MSRs & performance counters. > Or can I turn the "turbo mode" off and check? That would work, but... Alternately, take a look at -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From landman at scalableinformatics.com Tue Aug 11 09:16:49 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 11 Aug 2009 12:16:49 -0400 Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: <4A818DF4.8040807@tamu.edu> References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> <4A81825B.4000303@unige.ch> <4A818DF4.8040807@tamu.edu> Message-ID: <4A819971.9060905@scalableinformatics.com> Gerry Creager wrote: > Daniel Pfenniger wrote: >> Douglas Eadline wrote: [...] >> This article sounds unbalanced and self-serving. > > I thought it read a bit like a chronicle of my recent experiences. I think that this article is fine, not unbalanced. What I like to point out to customers and partners is There is a cost to *EVERYTHING* Heinlein called it TANSTAAFL. Every single decision you make carries with it a set of costs. 
What purchasing agents, looking at the absolute rock bottom prices do not seem to grasp, is that those costs can *easily* swamp any purported gains from a lower price, and raise the actual landed price, due to expending valuable resource time (Gerry et al) for months on end working to solve problems that *should* have been solved previously. There is a cost to going cheap. This cost is time, and loss of productivity. If your time (your students time) is free, and you don't need to pay for consequences (loss of grants, loss of revenue, loss of productivity, ...) in delayed delivery of results from computing or storage systems, then, by all means, roll these things yourself, and deal with the myriad of debugging issues in making the complex beasts actually work. You have hardware stack issues, software stack issues, interaction issues, ... What I am saying is that Doug is onto something here. It ain't easy. Doug simply expressed that it isn't. As for the article being self serving? I dunno, I don't think so. Doug runs a consultancy called Basement Supercomputing that provides services for such folks. I didn't see overt advertisements, or even, really, covert "hire us" messages. I think this was fine as a white paper, and Doug did note that it started life as one. My $0.02 -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From tjrc at sanger.ac.uk Tue Aug 11 09:32:53 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Tue, 11 Aug 2009 17:32:53 +0100 Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: <4A81825B.4000303@unige.ch> References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> <4A81825B.4000303@unige.ch> Message-ID: <74DEA3F5-CFF1-4D66-8C33-046CF8114732@sanger.ac.uk> On 11 Aug 2009, at 3:38 pm, Daniel Pfenniger wrote: > Douglas Eadline wrote: >> All, >> I posted this on ClusterMonkey the other week. >> It is actually derived from a white paper I wrote for >> SiCortex. I'm sure those on this list have some >> experience/opinions with these issues (and other >> cluster issues!) >> The True Cost of HPC Cluster Ownership >> http://www.clustermonkey.net//content/view/262/1/ > > This article sounds unbalanced and self-serving. > > While it I clear that self-made clusters imply added new costs > in regard of turn-key clusters, they also empower the buyer > using standard and open solutions by an increased independence from > the vendor, and increases also its knowledge for future choices. > This aspect is hard to measure in monetary terms, but certainly very > important for some users. I agree. Some of the biggest IT problems I've encountered have been a direct result of vendor lock-in. Companies get bought, and products crushed. The wind changes direction, and products get dropped, side- lined, changed more or less on a whim. Rash promises made by vendors which never come true. In our case it was the dismembering of DEC during its various acquisitions which hurt us, and that saga contains examples of most of the above. And then the customer has to start again, which can be enormously expensive in terms of researching new ways to go, and migrating services. Just one part of that saga (the abandonment of the AdvFS filesystem) cost us more than six months of continuous work to get past, just copying the data onto something else. 
That was years ago; with the petabytes of data we have now, it would be even worse. Once bitten, twice shy. If you've made the investment in house to have a vendor-agnostic setup, which we now have, we have complete freedom to choose whatever tin vendor we like, at least as far as our compute nodes go. Our configuration, deployment and management software stack works on anything, so it's very little skin off our nose to change vendor. Regards, Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From gerry.creager at tamu.edu Tue Aug 11 10:00:52 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Tue, 11 Aug 2009 12:00:52 -0500 Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: <4A819971.9060905@scalableinformatics.com> References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> <4A81825B.4000303@unige.ch> <4A818DF4.8040807@tamu.edu> <4A819971.9060905@scalableinformatics.com> Message-ID: <4A81A3C4.7000901@tamu.edu> +1 Joe Landman wrote: > Gerry Creager wrote: >> Daniel Pfenniger wrote: >>> Douglas Eadline wrote: > > [...] > >>> This article sounds unbalanced and self-serving. >> >> I thought it read a bit like a chronicle of my recent experiences. > > I think that this article is fine, not unbalanced. What I like to point > out to customers and partners is > > There is a cost to *EVERYTHING* > > Heinlein called it TANSTAAFL. Every single decision you make carries > with it a set of costs. > > What purchasing agents, looking at the absolute rock bottom prices do > not seem to grasp, is that those costs can *easily* swamp any purported > gains from a lower price, and raise the actual landed price, due to > expending valuable resource time (Gerry et al) for months on end working > to solve problems that *should* have been solved previously. > > There is a cost to going cheap. This cost is time, and loss of > productivity. If your time (your students time) is free, and you don't > need to pay for consequences (loss of grants, loss of revenue, loss of > productivity, ...) in delayed delivery of results from computing or > storage systems, then, by all means, roll these things yourself, and > deal with the myriad of debugging issues in making the complex beasts > actually work. You have hardware stack issues, software stack issues, > interaction issues, ... > > What I am saying is that Doug is onto something here. It ain't easy. > Doug simply expressed that it isn't. > > As for the article being self serving? I dunno, I don't think so. Doug > runs a consultancy called Basement Supercomputing that provides services > for such folks. I didn't see overt advertisements, or even, really, > covert "hire us" messages. I think this was fine as a white paper, and > Doug did note that it started life as one. > > My $0.02 > -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From bill at cse.ucdavis.edu Tue Aug 11 10:06:32 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Tue, 11 Aug 2009 10:06:32 -0700 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: Message-ID: <4A81A518.2030805@cse.ucdavis.edu> Rahul Nabar wrote: > Exactly! 
But I thought this was the big advance with the Nehalem that > it has removed the CPU<->Cache<->RAM bottleneck. Not sure I'd say removed, but they have made a huge improvement. To the point where a single socket intel is better than a dual socket barcelona. > So if the code scaled > with the AMD Barcelona then it would continue to scale with the > Nehalem right? That is a gross oversimplification. Sure, with a microbenchmark testing only memory bandwidth that wouldn't be a terrible approximation. Something like VASP is far from a simple microbenchmark. > I'm posting a copy of my scaling plot here if it helps. > > http://dl.getdropbox.com/u/118481/nehalem_scaling.jpg Looks to me like you fit in the barcelona 512KB L2 cache (and get good scaling) and do not fit in the nehalem 256KB L2 cache (and get poor scaling). Were the binaries compiled specifically to target both architectures? As a first guess I suggest trying pathscale (RIP) or open64 for amd, and intel's compiler for intel. But portland group does a good job at both in most cases. > Hyperthreading OFF > 24GB memory i.e. 6 banks of 4GB. i.e. optimum memory configuration > X5550 I'm curious about the hyperthreading-on data point as well. > Even if we explained away the bizarre performance of the 4 core case > as the Turbo effect, what is most confusing is how the 8 core data > point could be so much slower than the corresponding 8 core point on an > old AMD Barcelona. A doubling of the can have that effect. The Intel L3 can not come anywhere close to feeding 4 cores running flat out. From kus at free.net Tue Aug 11 10:19:08 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Tue, 11 Aug 2009 21:19:08 +0400 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: Message-ID: In message from Rahul Nabar (Sun, 9 Aug 2009 22:42:25 -0500): >(a) I am seeing strange scaling behaviours with Nehalem cores. eg A >specific DFT (Density Functional Theory) code we use is maxing out >performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are >actually slower than 2 and 4 cores (depending on setup) If these results are for HyperThreading "ON", it may not be too strange because of "virtual cores" competition. But if these results are with Hyperthreading switched off - it's strange. I usually have good DFT scaling w/number of cores on G03 - about 7 times for 8 cores. Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow From Daniel.Pfenniger at unige.ch Tue Aug 11 10:28:22 2009 From: Daniel.Pfenniger at unige.ch (Daniel Pfenniger) Date: Tue, 11 Aug 2009 19:28:22 +0200 Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: <4A819971.9060905@scalableinformatics.com> References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> <4A81825B.4000303@unige.ch> <4A818DF4.8040807@tamu.edu> <4A819971.9060905@scalableinformatics.com> Message-ID: <4A81AA36.8050503@unige.ch> Joe Landman wrote: > Gerry Creager wrote: >> Daniel Pfenniger wrote: >>> Douglas Eadline wrote: > > [...] > >>> This article sounds unbalanced and self-serving. >> >> I thought it read a bit like a chronicle of my recent experiences. Mine were not so bad, so I found the tone too pessimistic. > I think that this article is fine, not unbalanced. What I like to point > out to customers and partners is > > There is a cost to *EVERYTHING* Well, not really surprising. The point is to be quantitative, not subjective (fear, etc.).
Each solution has a cost and alert people will choose the best one for them, not for the vendor. If many people choose IKEA furniture over traditional vendors it is because the cost differential is favourable for them, even taking all the overheads into account. When commodity clusters came in the 90's the gain was easily a factor 10 at purchase. In my case the maintenance and licenses costs of turn-key locked-in hardware added 20-25% of purchase cost every year. With such a high cost we could have hired an engineer full-time instead, but it was not possible because of the locked-in nature of such machines. The self-made solution was clearly the best. Today one finds intermediate solutions where the hardware is composed of compatible elements, and the software is open source. Some vendors offer almost ready to run and tested hardware for a reasonable margin, adding less than a factor 2 to the original hardware cost, without horrendous maintenance fee and restrictive license. The locked-in effect is low, yet not completely zero. This is probably the best solution for many budget-conscious users. > > Heinlein called it TANSTAAFL. Every single decision you make carries > with it a set of costs. > > What purchasing agents, looking at the absolute rock bottom prices do > not seem to grasp, is that those costs can *easily* swamp any purported > gains from a lower price, and raise the actual landed price, due to > expending valuable resource time (Gerry et al) for months on end working > to solve problems that *should* have been solved previously. > > There is a cost to going cheap. This cost is time, and loss of > productivity. If your time (your students time) is free, and you don't > need to pay for consequences (loss of grants, loss of revenue, loss of > productivity, ...) in delayed delivery of results from computing or > storage systems, then, by all means, roll these things yourself, and > deal with the myriad of debugging issues in making the complex beasts > actually work. You have hardware stack issues, software stack issues, > interaction issues, ... You forget to mention that turn-key locked-in systems in my experience entail inefficiency costs because the user cannot decide what to do when completely ignoring what is going on. Many problems may be solved in minutes when the user controls the cluster, but may need days or weeks for fixes from the vendor. A balanced presentation should weight all the aspects of running a cluster. > > What I am saying is that Doug is onto something here. It ain't easy. > Doug simply expressed that it isn't. > As for the article being self serving? I dunno, I don't think so. Doug > runs a consultancy called Basement Supercomputing that provides services > for such folks. I didn't see overt advertisements, or even, really, > covert "hire us" messages. I think this was fine as a white paper, and > Doug did note that it started life as one. You may have noticed that this article was originally written on demand of SiCortex... Dan From Craig.Tierney at noaa.gov Tue Aug 11 10:40:03 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Tue, 11 Aug 2009 11:40:03 -0600 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: Message-ID: <4A81ACF3.60802@noaa.gov> Rahul Nabar wrote: > On Mon, Aug 10, 2009 at 12:48 PM, Bruno Coutinho wrote: >> This is often caused by cache competition or memory bandwidth saturation. >> If it was cache competition, rising from 4 to 6 threads would make it worse. 
>> As the code became faster with DDR3-1600 and much slower with Xeon 5400, >> this code is memory bandwidth bound. >> Tweaking CPU affinity to avoid thread jumping among cores of the will not >> help much, as the big bottleneck is memory bandwidth. >> To this code, CPU affinity will only help in NUMA machines to maintain >> memory access in local memory. >> >> >> If the machine has enough bandwidth to feed the cores, it will scale. > > Exactly! But I thought this was the big advance with the Nehalem that > it has removed the CPU<->Cache<->RAM bottleneck. So if the code scaled > with the AMD Barcelona then it would continue to scale with the > Nehalem right? > > I'm posting a copy of my scaling plot here if it helps. > > http://dl.getdropbox.com/u/118481/nehalem_scaling.jpg > > To remove most possible confounding factors this particular Nehlem > plot is produced with the following settings: > > Hyperthreading OFF > 24GB memory i.e. 6 banks of 4GB. i.e. optimum memory configuration > X5550 > > Even if we explained away the bizzare performance of the 4 node case > to the Turbo effect what is most confusing is how the 8 core data > point could be so much slower than the corresponding 8 core point on a > old AMD Barcelona. > > Something's wrong here that I just do not understand. BTW, any other > VASP users here? Anybody have any Nehalem experience? > Rahul, What are you doing to ensure that you have both memory and processor affinity enabled? Craig > -- > Rahul > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Craig Tierney (craig.tierney at noaa.gov) From landman at scalableinformatics.com Tue Aug 11 11:01:37 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 11 Aug 2009 14:01:37 -0400 Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: <4A81AA36.8050503@unige.ch> References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> <4A81825B.4000303@unige.ch> <4A818DF4.8040807@tamu.edu> <4A819971.9060905@scalableinformatics.com> <4A81AA36.8050503@unige.ch> Message-ID: <4A81B201.9000206@scalableinformatics.com> Daniel Pfenniger wrote: >> There is a cost to *EVERYTHING* > > Well, not really surprising. The point is to be quantitative, > not subjective (fear, etc.). Each solution has a cost and alert > people will choose the best one for them, not for the vendor. Sadly, not always (choosing the best one for them). *Many* times the solution is dictated to them via some group with an agreement with some vendor. Decisions about which one are best are often seconded behind which brand to select. I've had too many conversations that went "we agree your solution is better but we can't buy it because you aren't brand X". Which is not a good reason for selection or omission of a vendor. > If many people choose IKEA furniture over traditional vendors > it is because the cost differential is favourable for them, > even taking all the overheads into account. Agreed. But furniture is not a computer (though I guess it could be ...) > When commodity clusters came in the 90's the gain was easily a > factor 10 at purchase. In my case the maintenance and licenses > costs of turn-key locked-in hardware added 20-25% of purchase > cost every year. 
With such a high cost we could have hired an > engineer full-time instead, but it was not possible because of > the locked-in nature of such machines. The self-made solution > was clearly the best. For some users, this is the right route. For Guy Coates and his team, for you, and a number of others. I agree it can be good. But there are far too many people that think a cluster is a pile-o-PCs + a cheap switch + a cluster distro. Its the "how do I make it work when it fails" aspect we tend to see people worrying online about. I am arguing for commodity systems. But some gear is just plain junk. Not all switches are created equal. Some inexpensive switches do a far better job than some of the expensive ones. Some brand name machines are wholly inappropriate as compute nodes, yet they are used. A big part of this process is making reasonable selections. Understanding the issues with all of these, understanding the interplay. I am not arguing for vendor locking (believe it or not). I simply argue for sane choices. > Today one finds intermediate solutions where the hardware is > composed of compatible elements, and the software is open source. > Some vendors offer almost ready to run and tested hardware for > a reasonable margin, adding less than a factor 2 to the original > hardware cost, without horrendous maintenance fee and restrictive > license. The locked-in effect is low, yet not completely zero. > This is probably the best solution for many budget-conscious > users. Yes. This is what we stress. We unfortunately have run into purchasing groups that like to try to save a buck, and will buy almost-but-not-quite-the-same-thing for the clusters we have put together, which makes it very hard to pre-build, and pre-test. Worse, when we see what they have purchased, and see that it really didn't come close to the spec we used, well .... I fail to see how being required to purchase the right thing after purchasing the wrong thing that you can't return, saves you money. We have had this happen too many times. >> Heinlein called it TANSTAAFL. Every single decision you make carries >> with it a set of costs. >> >> What purchasing agents, looking at the absolute rock bottom prices do >> not seem to grasp, is that those costs can *easily* swamp any >> purported gains from a lower price, and raise the actual landed price, >> due to expending valuable resource time (Gerry et al) for months on >> end working to solve problems that *should* have been solved previously. >> >> There is a cost to going cheap. This cost is time, and loss of >> productivity. If your time (your students time) is free, and you >> don't need to pay for consequences (loss of grants, loss of revenue, >> loss of productivity, ...) in delayed delivery of results from >> computing or storage systems, then, by all means, roll these things >> yourself, and deal with the myriad of debugging issues in making the >> complex beasts actually work. You have hardware stack issues, >> software stack issues, interaction issues, ... > > You forget to mention that turn-key locked-in systems in my experience > entail > inefficiency costs because the user cannot decide what to do when I can't mention your experience as I don't have a clue as to what you have experienced. Vendor lock in is IMO not a great thing. It increases costs, makes systems more expensive to support, reduces choices later on. Yet, we run head first into vendor lock-in in many purchasing departments. They prefer buying from one vendor with whom they have struck agreements. 
Which don't work to their benefit, but do for the vendors. > completely ignoring what is going on. Many problems may be solved in > minutes when the user controls the cluster, but may need days > or weeks for fixes from the vendor. A balanced presentation should > weight all the aspects of running a cluster. Yes. Doug's presentation did show you one aspect, and if you want more to "balance" the joy of clustered systems, certainly, his work can be expanded and amplified upon. > >> >> What I am saying is that Doug is onto something here. It ain't easy. >> Doug simply expressed that it isn't. > >> As for the article being self serving? I dunno, I don't think so. >> Doug runs a consultancy called Basement Supercomputing that provides >> services for such folks. I didn't see overt advertisements, or even, >> really, covert "hire us" messages. I think this was fine as a white >> paper, and Doug did note that it started life as one. > > You may have noticed that this article was originally written on demand > of SiCortex... That wasn't lost on me :(. Actually one of the things we are actively talking about relative to our high performance storage is "Freedom from bricking". If a theoretical bus hits the company a day after you get your boxes from us, our units are still supportable, and you can pay another organization to support them. We aren't aware of other vendors doing what we are doing that could (honestly) make such a claim. Even the ones that use the (marketing) label of "Open storage solutions". Yeah. Open. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From kus at free.net Tue Aug 11 11:12:32 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Tue, 11 Aug 2009 22:12:32 +0400 Subject: [Beowulf] numactl & SuSE11.1 In-Reply-To: Message-ID: It's interesting, that for this hard&software configuration disabling of NUMA in BIOS gives more high STREAM results in comparison w/"NUMA enabled". I.e. for NUMA "off": 8723/8232/10388/10317 MB/s for NUMA "on": 5620/5217/6795/6767 MB/s (both for OMP_NUM_THREADS=1 and ifort 11.1 compiler). The situation for Opteron's is opposite: NUMA mode gives more high throughput. In message from "Mikhail Kuzminsky" (Mon, 10 Aug 2009 21:43:56 +0400): >I'm sorry for my mistake: >the problem is on Nehalem Xeon under SuSE -11.1, but w/kernel >2.6.27.7-9 (w/Supermicro X8DT mobo). For Opteron 2350 w/SuSE 10.3 (w/ >more old 2.6.22.5-31 -I erroneously inserted this string in my >previous message) numactl works OK (w/Tyan mobo). > >NUMA is enabled in BIOS. Of course, CONFIG_NUMA (and CONFIG_NUMA_EMU) >are setted to "y" in both kernels. > >Unfortunately I (i.e. root) can't change files in >/sys/devices/system/node (or rename directory node2 to node1) :-( - as >it's possible w/some files in /proc filesystem. It's interesting, that >extraction from dmesg show, that IT WAS NODE1, but then node2 is >appear ! 
> >ACPI: SRAT BF79A4B0, 0150 (r1 041409 OEMSRAT 1 INTL 1) >ACPI: SSDT BF79FAC0, 249F (r1 DpgPmm CpuPm 12 INTL 20051117) >ACPI: Local APIC address 0xfee00000 >SRAT: PXM 0 -> APIC 0 -> Node 0 >SRAT: PXM 0 -> APIC 2 -> Node 0 >SRAT: PXM 0 -> APIC 4 -> Node 0 >SRAT: PXM 0 -> APIC 6 -> Node 0 >SRAT: PXM 1 -> APIC 16 -> Node 1 >SRAT: PXM 1 -> APIC 18 -> Node 1 >SRAT: PXM 1 -> APIC 20 -> Node 1 >SRAT: PXM 1 -> APIC 22 -> Node 1 >SRAT: Node 0 PXM 0 0-a0000 >SRAT: Node 0 PXM 0 100000-c0000000 >SRAT: Node 0 PXM 0 100000000-1c0000000 >SRAT: Node 2 PXM 257 1c0000000-340000000 >(here !!) > >NUMA: Allocated memnodemap from 1c000 - 22880 >NUMA: Using 20 for the hash shift. >Bootmem setup node 0 0000000000000000-00000001c0000000 > NODE_DATA [0000000000022880 - 000000000003a87f] > bootmap [000000000003b000 - 0000000000072fff] pages 38 >(8 early reservations) ==> bootmem [0000000000 - 01c0000000] > #0 [0000000000 - 0000001000] BIOS data page ==> [0000000000 - >0000001000] > #1 [0000006000 - 0000008000] TRAMPOLINE ==> [0000006000 - >0000008000] > #2 [0000200000 - 0000bf27b8] TEXT DATA BSS ==> [0000200000 - >0000bf27b8] > #3 [0037a3b000 - 0037fef104] RAMDISK ==> [0037a3b000 - >0037fef104] > #4 [000009cc00 - 0000100000] BIOS reserved ==> [000009cc00 - >0000100000] > #5 [0000010000 - 0000013000] PGTABLE ==> [0000010000 - >0000013000] > #6 [0000013000 - 000001c000] PGTABLE ==> [0000013000 - >000001c000] > #7 [000001c000 - 0000022880] MEMNODEMAP ==> [000001c000 - >0000022880] >Bootmem setup node 2 00000001c0000000-0000000340000000 > NODE_DATA [00000001c0000000 - 00000001c0017fff] > bootmap [00000001c0018000 - 00000001c0047fff] pages 30 >(8 early reservations) ==> bootmem [01c0000000 - 0340000000] > #0 [0000000000 - 0000001000] BIOS data page > #1 [0000006000 - 0000008000] TRAMPOLINE > #2 [0000200000 - 0000bf27b8] TEXT DATA BSS > #3 [0037a3b000 - 0037fef104] RAMDISK > #4 [000009cc00 - 0000100000] BIOS reserved > #5 [0000010000 - 0000013000] PGTABLE > #6 [0000013000 - 000001c000] PGTABLE > #7 [000001c000 - 0000022880] MEMNODEMAP >found SMP MP-table at [ffff8800000ff780] 000ff780 > [ffffe20000000000-ffffe20006ffffff] PMD -> >[ffff880028200000-ffff88002e1fffff] on node 0 > [ffffe20007000000-ffffe2000cffffff] PMD -> >[ffff8801c0200000-ffff8801c61fffff] on node 2 Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow From mathog at caltech.edu Tue Aug 11 11:46:05 2009 From: mathog at caltech.edu (David Mathog) Date: Tue, 11 Aug 2009 11:46:05 -0700 Subject: [Beowulf] The True Cost of HPC Cluster Ownership Message-ID: Joe Landman wrote: > I am arguing for commodity systems. But some gear is just plain junk. > Not all switches are created equal. Some inexpensive switches do a far > better job than some of the expensive ones. Some brand name machines > are wholly inappropriate as compute nodes, yet they are used. > > A big part of this process is making reasonable selections. > Understanding the issues with all of these, understanding the interplay. A lot of this issue boils down to a lack of available information, or perhaps, the cost of obtaining the information. Consider for instance the switches you cited above. How is the average site going to decide which switch is better before purchase? On paper, going by the published specs, they will often look identical. The two companies may be equally reputable. Still, one device may be a piece of junk and the other best in class. On very rare occasions there will be an independent review available. 
Only a large site is likely to have the resources to obtain samples of each switch and test them extensively. The best most of us can do is ask around if "switch XYZ is OK" before making the leap. With compute nodes performance information is more readily available, often in reviews, but again, rarely any reliability information. And we have all seen models which crunch nicely but have innate reliability problems that don't turn up in a 3 day review, and then bite hard during continuous use. Again, large sites can obtain a test unit and beat on it for a few months, but small sites usually cannot. At least in this case knowledge does build up over time in the community, so if one waits for a machine to be in the field for a year, it may be possible to ask around and find out if it is a good idea to buy some. (But don't wait too long, the sales life for computer models is not very long!) For this reason, unless a site is very well funded, buying cutting edge compute nodes is a rather large gamble. If the resources to run these tests isn't present in house, one may essentially buy the expertise by paying enough to a reputable company to run the tests. Either way, knowing costs money. Ideally there would be accepted standards for testing performance and reliability of each class of equipment, and the manufacturers would run these tests themselves, or farm it out to neutral entities, and then publish this information. It would certainly be a compelling sales tool, at least from my perspective. In practice, it usually seems like the manufacturers spend more time hiding equipment defects than they do in proving and publishing its strengths. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From rpnabar at gmail.com Tue Aug 11 11:57:14 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 11 Aug 2009 13:57:14 -0500 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A81ACF3.60802@noaa.gov> References: <4A81ACF3.60802@noaa.gov> Message-ID: On Tue, Aug 11, 2009 at 12:40 PM, Craig Tierney wrote: > What are you doing to ensure that you have both memory and processor > affinity enabled? > All I was using now was the flag: --mca mpi_paffinity_alone 1 Is there anything else I ought to be doing as well? -- Rahul From rpnabar at gmail.com Tue Aug 11 12:04:34 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 11 Aug 2009 14:04:34 -0500 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A81A518.2030805@cse.ucdavis.edu> References: <4A81A518.2030805@cse.ucdavis.edu> Message-ID: On Tue, Aug 11, 2009 at 12:06 PM, Bill Broadley wrote: > Looks to me like you fit in the barcelona 512KB L2 cache (and get good > scaling) and do not fit in the nehalem 256KB L2 cache (and get poor scaling). Thanks Bill! I never realized that the L2 cache of the Nehalem is actually smaller than that of the Barcelona! I have an E5520 and a X5550. Both have the 8 MB L3 cache I believe. THe size of the L2 cache is fixed across the steppings of the Nehlem isn't it? > Were the binaries compiled specifically to target both architectures? ?As a > first guess I suggest trying pathscale (RIP) or open64 for amd, and intel's > compiler for intel. ?But portland group does a good job at both in most cases. We used the intel compilers. One of my fellow grad students did the actual compilation for VASP but I believe he used the "correct" [sic] flags to the best of our knowledge. I could post them on the list perhaps. 
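(Purely as an illustration of the kind of flags in question - not the actual build line, which I'd have to dig out - targeting Nehalem with the Intel 11.x compilers usually comes down to something like:

ifort -O3 -xSSE4.2 source.f90    # generate SSE4.2 code, i.e. a Nehalem target
ifort -O3 -xHost   source.f90    # or simply tune for whatever the build host is

so "correct flags" here mostly means picking the right -x target for each architecture.)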
There was no cross-compilation. We compiled a fresh binary for the Nehalem. > I"m curious about the hyperthreading on data point as well. Didn't test for VASP yet but for our other two DFT codes i.e. DACAPO and GPAW hyperthreading "off" seems to be about 10% faster. > A doubling of the can have that effect. ?The Intel L3 can no come anywhere > close to feeding 4 cores running flat out. Could you explain this more? I am a little lost with the processor dynamics. Does this mean using a quad core for HPC on the Nehlem is not likely to work well for scaling? Or do you imply a solution so that I could fix this somehow? Thanks again! -- Rahul From Craig.Tierney at noaa.gov Tue Aug 11 13:03:51 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Tue, 11 Aug 2009 14:03:51 -0600 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A81ACF3.60802@noaa.gov> Message-ID: <4A81CEA7.3030802@noaa.gov> Rahul Nabar wrote: > On Tue, Aug 11, 2009 at 12:40 PM, Craig Tierney wrote: >> What are you doing to ensure that you have both memory and processor >> affinity enabled? >> > > All I was using now was the flag: > > --mca mpi_paffinity_alone 1 > > Is there anything else I ought to be doing as well? > That should be adequate. Craig -- Craig Tierney (craig.tierney at noaa.gov) From rpnabar at gmail.com Tue Aug 11 15:24:04 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 11 Aug 2009 17:24:04 -0500 Subject: [Beowulf] performance tweaks and optimum memory configs for a Nehalem In-Reply-To: References: Message-ID: On Tue, Aug 11, 2009 at 12:19 PM, Mikhail Kuzminsky wrote: > If this results are for HyperThreading "ON", it may be not too strange > because of "virtual cores" competition. > > But if this results are for switched off Hyperthreading - it's strange. > I have usual good DFT scaling w/number of cores on G03 - about in 7 times > for 8 cores. Yes, it is very strange and I still cannot explain it very well. Do you have scaling info for VASP? You did mention DFT codes so I was wondering. I still haven't found much info for VASP on Nehlems. Maybe it is some feature of this particular code. All the other tests make sense. And I just find it hard to believe that the Nehalem which has been so far touted to be such a good proc. can be outperformed by by one year old AMD Barcelonas.......I feel it's something I am doing wrong. -- Rahul From rpnabar at gmail.com Tue Aug 11 15:31:08 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 11 Aug 2009 17:31:08 -0500 Subject: [Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year? In-Reply-To: References: <20090406080836.GC30865@bx9.net> Message-ID: On Thu, Apr 9, 2009 at 11:35 AM, Douglas J. Trainor wrote: > Rahul, > > I think Greg et al. are correct. ?Does your SC1435 have a Delta Electronics > switching power supply? ?I bet you have a 600 watt Delta. > > Intel recently had problems with outsourced 350 watt "FHJ350WPS" switching > power supplies that apparently affected 5% of some server lines. ?These were > loading imbalance problems between the 3.3 volt and 12 volt lines. ?The > affected power supplies had a minimum loading requirement that was not met. > ?The over-voltage protection circuit would kick in on the 3.3V line. > ?However, in these cases, the Intel machines would not reboot. ?Intel is > modifying the 3.3 volt minimum loading from 1.2 amps to 0.2 amps to fix the > problem. A while ago I had posted about these crashing SC1435's that I had. 
I received lots of good suggestions on this group. Thanks all! A lot of persistence with the vendor succeed in making their Engineering team do long-run tests on one of our captured machines. It needed to be tested for over one month and then they finally replicated the failure. Whew! (In the past they had aborted tests way before this time period) They won't give me many internal details but apparantly it is caused by an "hardware issue more likely caused certain motherboards with Opterons" [sic] So, thank again and it does seem that we finally got down to the cause of this irritating problem! Just posted this in case it helps any other SC1435 admins in a similar boat! Cheers! -- Rahul From coutinho at dcc.ufmg.br Tue Aug 11 15:57:11 2009 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Tue, 11 Aug 2009 19:57:11 -0300 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A81A518.2030805@cse.ucdavis.edu> Message-ID: 2009/8/11 Rahul Nabar > On Tue, Aug 11, 2009 at 12:06 PM, Bill Broadley > wrote: > > Looks to me like you fit in the barcelona 512KB L2 cache (and get good > > scaling) and do not fit in the nehalem 256KB L2 cache (and get poor > scaling). > > Thanks Bill! I never realized that the L2 cache of the Nehalem is > actually smaller than that of the Barcelona! > > I have an E5520 and a X5550. Both have the 8 MB L3 cache I believe. > THe size of the L2 cache is fixed across the steppings of the Nehlem > isn't it? I think that probably it only will be fixed on newer models or only in Westmere (Nehalem shrink to 32nm). > > > > Were the binaries compiled specifically to target both architectures? As > a > > first guess I suggest trying pathscale (RIP) or open64 for amd, and > intel's > > compiler for intel. But portland group does a good job at both in most > cases. > > We used the intel compilers. One of my fellow grad students did the > actual compilation for VASP but I believe he used the "correct" [sic] > flags to the best of our knowledge. I could post them on the list > perhaps. There was no cross-compilation. We compiled a fresh binary > for the Nehalem. > > > I"m curious about the hyperthreading on data point as well. > > Didn't test for VASP yet but for our other two DFT codes i.e. DACAPO > and GPAW hyperthreading "off" seems to be about 10% faster. > > > > A doubling of the can have that effect. The Intel L3 can no come > anywhere > > close to feeding 4 cores running flat out. > > Could you explain this more? I am a little lost with the processor > dynamics. Does this mean using a quad core for HPC on the Nehlem is > not likely to work well for scaling? Or do you imply a solution so > that I could fix this somehow? > Nehalem and Barcelona have the following cache architecture: L1 cache: 64KB (32kb data, 32kb instruction), per core L2 cache: Barcelona :512kb, Nehalem: 256kb, per core L3 cache: Barcelona: 2MB, Nehalem: 8MB , shared among all cores. Both in Barcelona and Nehalem, the "uncore" (everything outside a core, like L3 and memory controllers) runs at lower speed than the cores and all cores communicate through L3, so it must handle some coherence signals too. This makes impossible to L3 feed all cores at full speed if L2 caches have big miss ratios. So, what is happening with your program is something like: Working set fits Barcelona 512kb L2 cache, so it has 10% miss rate, but is doesn't fits Nehalem 256km L2 cache, so it has 50% miss rate. 
So in Nehalem the shared L3 cache has to handle many more requests from all cores than in Barcelona, and it becomes a big bottleneck. -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpnabar at gmail.com Tue Aug 11 16:07:40 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 11 Aug 2009 18:07:40 -0500 Subject: Re: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A81A518.2030805@cse.ucdavis.edu> Message-ID: On Tue, Aug 11, 2009 at 5:57 PM, Bruno Coutinho wrote: > Nehalem and Barcelona have the following cache architecture: > > L1 cache: 64KB (32KB data, 32KB instruction), per core > L2 cache: Barcelona: 512KB, Nehalem: 256KB, per core > L3 cache: Barcelona: 2MB, Nehalem: 8MB, shared among all cores. > > Both in Barcelona and Nehalem, the "uncore" (everything outside a core, like > the L3 and the memory controllers) runs at a lower speed than the cores, and all cores > communicate through the L3, so it also has to handle coherence traffic. > This makes it impossible for the L3 to feed all cores at full speed if the L2 caches have > big miss ratios. > > So, what is happening with your program is something like this: > > the working set fits the Barcelona 512KB L2 cache, so it has a 10% miss rate, > but it doesn't fit the Nehalem 256KB L2 cache, so it has a 50% miss rate. > So in Nehalem the shared L3 cache has to handle many more requests from all > cores than in Barcelona, and it becomes a big bottleneck. Thanks Bruno! That makes a lot of sense now.
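A cheap way to test the working-set explanation above before changing any hardware (a sketch only; ./bench stands in for a small single-core test case of the real code, and cachegrind simulates the cache hierarchy, so treat its miss rates as relative rather than exact numbers):

    # back-of-the-envelope first: three double-precision work arrays stay L2-resident
    # only while 3 * N * 8 bytes fits in 256 KB (Nehalem) vs 512 KB (Barcelona),
    # i.e. roughly N < 10,000 vs N < 21,000 elements per core
    valgrind --tool=cachegrind ./bench    # prints D1 and last-level miss rates at exit

The oprofile/PAPI hardware counters suggested later in the thread give the same answer on the real chip, without the large slowdown of cache simulation.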
Assuming that is what is > happening is there any way of still using the Nehalems fruitfully for > this code? Any smart tricks / hacks? You can use profilers that monitor hardware performance counters like oprofile or papi to measure miss ratios and verify if that is what is happening. But solving it is a much larger problem. :) > > > The reason is that the Nehalems seem to scale and perform beautifully > for my other codes. > > The only other option is to relapse back to the AMDs. I believe the > Shanghai would be a choice or an Instanbul. I assume the cache > structure there is as good as the Barcelona if not better! Any > experiences with these chips on the group? > > Funnily, I haven't heard of any such Nehalem (-ive) stories anywhere > else. Am I the first one to hit this cache bottleneck? I doubt it. Any > other cache heavy users? > > -- > Rahul > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rgb at phy.duke.edu Tue Aug 11 18:50:26 2009 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue, 11 Aug 2009 21:50:26 -0400 (EDT) Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: <4A819971.9060905@scalableinformatics.com> References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> <4A81825B.4000303@unige.ch> <4A818DF4.8040807@tamu.edu> <4A819971.9060905@scalableinformatics.com> Message-ID: On Tue, 11 Aug 2009, Joe Landman wrote: > There is a cost to going cheap. This cost is time, and loss of productivity. > If your time (your students time) is free, and you don't need to pay for > consequences (loss of grants, loss of revenue, loss of productivity, ...) in > delayed delivery of results from computing or storage systems, then, by all > means, roll these things yourself, and deal with the myriad of debugging > issues in making the complex beasts actually work. You have hardware stack > issues, software stack issues, interaction issues, ... Oh, damn, might as well demonstrate that I'm not dead yet. I'm getting better. Actually, I'm just getting back from Beaufort and so fish are not calling me and neither is the mountain of unpacking, so I might as well chip in. My own experiences in this regard are that one can span a veritable spectrum of outcomes from great and very cost efficient to horrible and expensive in money and time. Larger projects the odds of the latter go up as what is a small inefficiency in 16-32 systems becomes and enormous and painful one for 1024. I'll skip the actual anecdotes -- most of them are probably in the archives anyway -- and just go straight to the (IMO) conclusions. The price of your systems should scale with the number you buy. Building an 8 node starter cluster? Tiger.com $350 specials are fine. Building a professional/production cluster with 32 or more nodes? Go rackmount (already a modest premium) and start kicking in for service contracts and a few extras to help keep things smooth. Building 32 nodes with an eye on expandibility? Go tier 1 or tier 2 vendor (or a professional and experienced cluster consultant, such as Joe), with four year service, after asking on list to see if the vendor is competent and uses or builds good hardware. IBM nodes are great. Penguin nodes (in my own experience) are great. Dell nodes are "ok", sort of the low end of high end. I don't have much experience with HP in a cluster setting. And do not, not, not, get no-name nodes from a cheap online vendor unless you value pain. 
This is advice that works for anything on the DIY side -- even a 1024 node cluster can be built by just you (or you and a friend/flunky) as long as you allow a realistic amount of time to install it and debug it -- say 10-15 minutes a node in production with electric screwdrivers to rack them from delivery box to rack (after a bit of practice, and you'll get LOTS of practice:-) Call it a day per rack, so yeah, 3-4 human-weeks of FTE. PLUS at least 10 minutes of install/debugging time, on average, per node. These aren't fixed numbers -- I'm sure there are humans who can rack/derack a node in five minutes, and if you get your vendor to premount the rails then ANYBODY can rack a node in two or three minutes in production. Then there are people like me, who might edge over closer to twenty minutes, or circumstances like "oops, this back of rack screws is the wrong size, time for crazed phone calls to and overnights from the vendor" that can ruin your expected average fast. (Software) install time depends on your general competence in linux, clustering, and how much energy you expended ahead of time setting up servers to accomodate the cluster install. If you are a linux god, a cluster god, and have a thoroughly debugged e.g. kickstart server (and got the vendor to default the BIOS to "DHCP boot" on fallthrough from an naked hard drive) then you might knock the install time down to making a table entry and turning on the systems -- and debugging the ones that failed to boot and install, or (in the case of diskless systems) boot and operate. A less gifted and experienced sysadmin might have to hook up a console to each system and hand install it, but nowadays even doing this isn't very time consuming as many installs can proceed in parallel. For non-DIY clusters -- turnkey, or contract built by somebody else -- the same general principles apply. If the cluster is fairly small, you aren't horribly at risk if you get relatively inexpensive nodes, bearing in mind that you're still trading off money SOMEWHERE later to save money now. If you are getting a medium large cluster or if downtime is very expensive, don't skimp on nodes -- get nodes that have a solid vendor standing behind them, with guaranteed onsite service for 3-4 years (the expected service life of your cluster). Here, in addition, you need to be damn sure you get your turnkey cluster from somebody who is not an idiot, who knows what they are doing and can actually deliver a functional cluster no less efficiently than described above and who will stand with you through the inevitable problems that will surface installing a larger cluster. In a nutshell, the "cost of going cheap" isn't linear, with or without student/cheap labor. For small clusters installed by somebody who knows what they are doing and e.g. operated and used by the owner or the owner's lab including students, operated by departmental sysadmins with cluster experience and enough warm bodies to have some opportunity cost labor handy -- sure, go cheap -- if a node or two is DOA or fails, so what? It takes you an extra day or two to get the cluster going, but most of that time is waiting for parts -- OC time is much smaller, and everybody has other things to do while waiting. 
But as clusters get larger, the marginal cost of the differential failure rate between cheap and expensive scales up badly and can easily exceed the OC labor pool's capacity, especially if by bad luck you get a cheap node and it turns out to be a "lemon" and the faraway dot com that sold it to you refuses to fix or replace it. The turnover from cheap to much more expensive than just getting good nodes from a reputable vendor (which don't usually cost THAT much more than cheap) can happen real fast, and the time wasted can go from a few days to months equally fast. So be aware of this. It is easy to find people on this list with horror stories associated with building larger clusters with cheap nodes. With smaller clusters it isn't horrible -- it is annoying. You can often afford to throw e.g. a bad motherboard away and just buy another one and reinstall a better one in the nodes one at a time for eight or a dozen nodes. You can't do this, sanely, for 64, or 128, or 1024. One last thing to be aware of is the politics of grants. Few people out there buy nodes out of pocket. They pay for clusters using OPM (Other People's Money). In many cases it is MUCH EASIER to budget expensive nodes, with onsite service and various guarantees, up front in the initial purchase in a grant than it is to budget less money on more cheaper nodes and then budget ENOUGH money to be able to handle any possible failure contingency in the next three years of the grant cycle. Granting agencies actually might even prefer it this way (they should if they have any sense). Imagine their excitement if halfway through the computation they are funding your entire cluster blows a cheap capacitor made in Taiwan (same one on every motherboard) and your cheap vendor vanishes like dust in the wind, bankrupted like everybody else who sold motherboards with that cap on them. Now they have a choice -- buy you a NEW cluster so you can finish, or write off the whole project (and quite possibly write off you as well, for ever and ever). Conservatism may cost you a few nodes that you could have added if you went cheap, but it is INSURANCE that the nodes you get will be around to reliably complete the computation. rgb > > What I am saying is that Doug is onto something here. It ain't easy. Doug > simply expressed that it isn't. > > As for the article being self serving? I dunno, I don't think so. Doug runs > a consultancy called Basement Supercomputing that provides services for such > folks. I didn't see overt advertisements, or even, really, covert "hire us" > messages. I think this was fine as a white paper, and Doug did note that it > started life as one. > > My $0.02 > > -- > Joseph Landman, Ph.D > Founder and CEO > Scalable Informatics Inc. > email: landman at scalableinformatics.com > web : http://scalableinformatics.com > http://scalableinformatics.com/jackrabbit > phone: +1 734 786 8423 x121 > fax : +1 866 888 3112 > cell : +1 734 612 4615 > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From rpnabar at gmail.com Tue Aug 11 20:30:39 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 11 Aug 2009 22:30:39 -0500 Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: <4A819971.9060905@scalableinformatics.com> References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> <4A81825B.4000303@unige.ch> <4A818DF4.8040807@tamu.edu> <4A819971.9060905@scalableinformatics.com> Message-ID: On Tue, Aug 11, 2009 at 11:16 AM, Joe Landman wrote: > > There is a cost to going cheap. ?This cost is time, and loss of > productivity. ?If your time (your students time) is free, and you don't need > to pay for consequences (loss of grants, loss of revenue, loss of > productivity, ...) in delayed delivery of results from computing or (1) Why always consider it a "loss" of your student's time? I was one such "student" think there is enormous learning potential here. Of course, my systems never did match the uptime / performance of a "turnkey" solution but the skills learnt in setting one up are rarely gained otherwise. At a university research is one goal; but learning is definitely another. (2) A key problem that I don't know how to work around for turnkey solutions: "How do I pharase the contract and performance gurrantee so that I get the vendor to do all the things that I want?" Many of us run codes that are not very high volume nor very standardized. Everybody wants to tweak and do something new. Especially in research. In a such a scenario I don't want the vendor to "just give me boxes with an OS" but also get my code installed, compiled, running and optimized. Plus schedulers and some such. Not just install them but set up fairshares that reflect user situations. Most "turnkey" options seem to do just fine for the early parts (as far as I can see) but those are the easy steps. The problem is that , by their very nature, the later steps are harder to "define". And when a problem lacks definition "turnkey" solutions are hard to spec. What good "in house" sys admin manpower does is handle the hairy issues of the latter steps. And once you invest in developing (or training or hiring) good quality sys-admins then taking intelligent decisions about selection, installation, and commissioning are easy enough for the well-trained guys anyways! And if you don't invest in good-quality internal computer guys then the vendors are gonna take you for a ride all the time! -- Rahul From landman at scalableinformatics.com Tue Aug 11 20:54:36 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 11 Aug 2009 23:54:36 -0400 Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> <4A81825B.4000303@unige.ch> <4A818DF4.8040807@tamu.edu> <4A819971.9060905@scalableinformatics.com> Message-ID: <4A823CFC.5050605@scalableinformatics.com> Rahul Nabar wrote: > On Tue, Aug 11, 2009 at 11:16 AM, Joe > Landman wrote: >> There is a cost to going cheap. This cost is time, and loss of >> productivity. If your time (your students time) is free, and you don't need >> to pay for consequences (loss of grants, loss of revenue, loss of >> productivity, ...) in delayed delivery of results from computing or > > > (1) Why always consider it a "loss" of your student's time? I was one Time is a zero sum game irrespective of how much coffee you consume. 
If you wind up spending large fractions of your time on computing, you spend less time on research. Students as cheap/free labor means they aren't getting their research work done (unless their research is on how to build/maintain the cluster). > such "student" think there is enormous learning potential here. Of Yes, there is much to learn. Even some meta-learning, such as when not to spend time on things. > course, my systems never did match the uptime / performance of a > "turnkey" solution but the skills learnt in setting one up are rarely > gained otherwise. At a university research is one goal; but learning > is definitely another. Hmm.... usually the process of research and the process of learning went hand in hand. I agree that people *should* get a grounding in all aspects of their research, and should get their hands dirty to a degree. But you shouldn't have them spend the time they should be doing research in focusing exclusively on managing resources (unless you are trying to teach them how to do time and resource management, which is a very important skill for scientists). > (2) A key problem that I don't know how to work around for turnkey > solutions: "How do I pharase the contract and performance gurrantee so > that I get the vendor to do all the things that I want?" It starts out with you defining the goals you wish to achieve, and then working through the path to achieve them. Decide which portion you wish to do, and find a partner to help you do what you don't want them to do. A good vendor *will* partner with you to solve real problems. > Many of us run codes that are not very high volume nor very > standardized. Everybody wants to tweak and do something new. > Especially in research. In a such a scenario I don't want the vendor > to "just give me boxes with an OS" but also get my code installed, We like to get our hands on the code early, with test cases, so we understand how it will behave on the hardware before our customer gets it ... precisely so we can help answer questions, and solve problems. Most vendors just want to deliver the boxes. > compiled, running and optimized. Plus schedulers and some such. Not > just install them but set up fairshares that reflect user situations. :) I sometimes joke that you know you have your scheduler set up right when everyone hates you. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From deadline at eadline.org Wed Aug 12 07:29:43 2009 From: deadline at eadline.org (Douglas Eadline) Date: Wed, 12 Aug 2009 10:29:43 -0400 (EDT) Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: <4A81825B.4000303@unige.ch> References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> <4A81825B.4000303@unige.ch> Message-ID: <33802.192.168.1.213.1250087383.squirrel@mail.eadline.org> > Douglas Eadline wrote: >> All, >> >> I posted this on ClusterMonkey the other week. >> It is actually derived from a white paper I wrote for >> SiCortex. I'm sure those on this list have some >> experience/opinions with these issues (and other >> cluster issues!) >> >> The True Cost of HPC Cluster Ownership >> >> http://www.clustermonkey.net//content/view/262/1/ >> > > This article sounds unbalanced and self-serving. 
> > While it I clear that self-made clusters imply added new costs > in regard of turn-key clusters, they also empower the buyer > using standard and open solutions by an increased independence from > the vendor, and increases also its knowledge for future choices. > This aspect is hard to measure in monetary terms, but certainly very > important for some users. > > I have experienced all kinds of clusters (turn-key, mostly > self-assembled, and partly vendor assembled and tested), and my conclusion > is that the best is when the user has at least the choice to determine > the degree of vendor integration/lock-in. Bad choices occur > because people are badly informed, and the article is so biased > that it doesn't improve objective information on this regard, just > serves as increasing fear and doubt. A few comments, First, the paper was originally commissioned by SiCortex. I removed promotional parts and updated the paper to reflect what I believe are valid points worth considering when procuring and HPC cluster. (i.e. I believe informing people about unknown potential costs is a good thing (tm)) The simple premise is "due to the nature of clusters, there are costs that were once part of the HPC purchase price, that are not included anymore and you may have to absorb the cost" Second, I did not advocate lock-in of any kind. Indeed, the thing I like about clusters and open software is there is lock-in protection. I do suggest that many people have two choices: 1) go at it on your own and understand that it will probably be a learning experience. It will in all likelihood take longer than you thought. 2) get qualified help so your cluster is up an running ASAP I don't favor either case, My intention was to assist those who do not understand the nature of cluster computing with some of the issues we have all faced at one time or another. As far as being objective, I have experienced first hand all the situations I mentioned. As far as case 2, there is some form of service lock-in, but I consider this a good thing. If you can find someone that keeps things working for you and/or you don't have the time or personnel and/or you want to focus on science or engineering, then paying them is not a bad idea. And by the way, this is normally the case in most industrial settings, plus they need CYA strategy as well. Finally, as far as self serving, well I can tell you the phone is not ringing off the hook. It was not my intention to generate business from an article on ClusterMonkey. I wish it were that easy. Finally, because ClusterMonkey.net is an open community site I encourage you contribute to the conversation. -- Doug From bill at cse.ucdavis.edu Wed Aug 12 08:14:09 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 12 Aug 2009 08:14:09 -0700 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A81CEA7.3030802@noaa.gov> References: <4A81ACF3.60802@noaa.gov> <4A81CEA7.3030802@noaa.gov> Message-ID: <4A82DC41.2060805@cse.ucdavis.edu> I've been working on a pthread memory benchmark that is loosely modeled on McCalpin's stream. It's been quite a challenge to remove all the noise/lost performance from the benchmark to get close to performance I expected. Some of the obstacles: * For the compilers that tend to be better at stream (open64 and pathscale), you lose the performance if you just replace double a[],b[],c[] with double *a,*b,*c. Patch[1] available. I don't have a work around for this, suggestions welcome. 
Is it really necessary for dynamic arrays to be substantially slower than static? * You have to be very careful with pointer alignment both with cache lines, and each other * cpu_affinity (by CPU id) * numa (by socket id) The results are relatively smooth graphs, here's an example, it's uselessly busy until you toggle off a few graphs (by clicking on the key): http://cse.ucdavis.edu/bill/pstream.svg The biggest puzzle I have now is what the previous generation intel quads, the current generation AMD quads, and numerous other CPUs show a big benefit in L1, while the nehalem shows no benefit. [1] http://cse.ucdavis.edu/bill/stream-malloc.patch From kus at free.net Wed Aug 12 08:14:25 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Wed, 12 Aug 2009 19:14:25 +0400 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A81ACF3.60802@noaa.gov> Message-ID: In message from Craig Tierney (Tue, 11 Aug 2009 11:40:03 -0600): >Rahul Nabar wrote: >> On Mon, Aug 10, 2009 at 12:48 PM, Bruno >>Coutinho wrote: >>> This is often caused by cache competition or memory bandwidth >>>saturation. >>> If it was cache competition, rising from 4 to 6 threads would make >>>it worse. >>> As the code became faster with DDR3-1600 and much slower with Xeon >>>5400, >>> this code is memory bandwidth bound. >>> Tweaking CPU affinity to avoid thread jumping among cores of the >>>will not >>> help much, as the big bottleneck is memory bandwidth. >>> To this code, CPU affinity will only help in NUMA machines to >>>maintain >>> memory access in local memory. >>> >>> >>> If the machine has enough bandwidth to feed the cores, it will >>>scale. >> >> Exactly! But I thought this was the big advance with the Nehalem >>that >> it has removed the CPU<->Cache<->RAM bottleneck. So if the code >>scaled >> with the AMD Barcelona then it would continue to scale with the >> Nehalem right? >> >> I'm posting a copy of my scaling plot here if it helps. >> >> http://dl.getdropbox.com/u/118481/nehalem_scaling.jpg >> >> To remove most possible confounding factors this particular Nehlem >> plot is produced with the following settings: >> >> Hyperthreading OFF >> 24GB memory i.e. 6 banks of 4GB. i.e. optimum memory configuration >> X5550 >> >> Even if we explained away the bizzare performance of the 4 node case >> to the Turbo effect what is most confusing is how the 8 core data >> point could be so much slower than the corresponding 8 core point on >>a >> old AMD Barcelona. >> >> Something's wrong here that I just do not understand. BTW, any other >> VASP users here? Anybody have any Nehalem experience? >> > >Rahul, >What are you doing to ensure that you have both memory and processor >affinity enabled? >Craig As I mentioned here in "numactl&SuSE11.1' thread, on some kernels there is wrong behaviour for Nehalem (bad /sys/devices/system/node directory content). This bug is presented, in particular, in default OpenSuSE 11 kernels (2.6.27.7-9 and 2.6.29-6), and (as it was writted in the corresponding thread discussion) in FC11 2.6.29 kernel. I found that in such situation disabling of NUMA in BIOS gives only increase of STREAM throughput. Therefore I think this (Rahul) problem is not due to BIOS settings. Unfortunately I've no data about VASP itself. It's interesting, do somebody have "normally working" w/Nehalem - in the sense of NUMA - kernels ? AFAIK more old 2.6 kernels (from SuSE 10.3) works OK, but I didn't check. May be error in NUMA support is the reason of Rahul problem ? 
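A minimal sanity check of what the kernel thinks the NUMA topology is, for anyone wanting to rule out the bad /sys/devices/system/node behaviour described above (a sketch; run it on the Nehalem node and expect exactly two nodes on a two-socket box):

    ls /sys/devices/system/node/                           # should show node0 and node1 only
    grep MemTotal /sys/devices/system/node/node*/meminfo   # each node should own roughly half the RAM
    numactl --hardware                                     # the same information via libnuma
    numastat                                               # numa_hit/numa_miss counters; misses growing
                                                           # quickly under load is a bad sign

The boot log quoted earlier in this digest, with its "SRAT: Node 2 PXM 257" line flagged "(here !!)", is exactly the kind of oddity this check is meant to catch.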
Mikhail > > >> -- >> Rahul >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >>Computing >> To change your subscription (digest mode or unsubscribe) visit >>http://www.beowulf.org/mailman/listinfo/beowulf >> > > >-- >Craig Tierney (craig.tierney at noaa.gov) >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >Computing >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf > >-- >??? ????????? ???? ????????? ?? ??????? ? ??? ??????? >? ????? ???????? ??????????? ??????????? >MailScanner, ? ?? ???????? >??? ??? ?? ???????? ???????????? ????. > From jlb17 at duke.edu Wed Aug 12 08:43:03 2009 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Wed, 12 Aug 2009 11:43:03 -0400 (EDT) Subject: [Beowulf] The True Cost of HPC Cluster Ownership In-Reply-To: References: <35809.192.168.1.213.1249995867.squirrel@mail.eadline.org> <4A81825B.4000303@unige.ch> <4A818DF4.8040807@tamu.edu> <4A819971.9060905@scalableinformatics.com> Message-ID: On Tue, 11 Aug 2009 at 9:50pm, Robert G. Brown wrote > In a nutshell, the "cost of going cheap" isn't linear, with or without > student/cheap labor. For small clusters installed by somebody who knows > what they are doing and e.g. operated and used by the owner or the > owner's lab including students, operated by departmental sysadmins with > cluster experience and enough warm bodies to have some opportunity cost > labor handy -- sure, go cheap -- if a node or two is DOA or fails, so > what? It takes you an extra day or two to get the cluster going, but > most of that time is waiting for parts -- OC time is much smaller, and > everybody has other things to do while waiting. But as clusters get > larger, the marginal cost of the differential failure rate between cheap > and expensive scales up badly and can easily exceed the OC labor pool's > capacity, especially if by bad luck you get a cheap node and it turns > out to be a "lemon" and the faraway dot com that sold it to you refuses > to fix or replace it. The turnover from cheap to much more expensive > than just getting good nodes from a reputable vendor (which don't > usually cost THAT much more than cheap) can happen real fast, and the > time wasted can go from a few days to months equally fast. One thing I haven't seen addressed is to look at the proposed usage of the cluster. If most of the code to be run on the cluster is embarrassingly parallel, then the cost of a node going down or the network being less than optimal is fairly low. In this case, IMO, it's pretty easy to make the argument to go the DIY route (depending on size and available labor pool, of course, as others have mentioned). If, OTOH, you intend to run tightly coupled MPI code across the entire cluster, then it becomes very valuable to ensure that everything is working together just so. There a turn-key vendor (and/or highly skilled third party) can make more sense. In other words, the answer, as always, is "It depends." 
-- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From Craig.Tierney at noaa.gov Wed Aug 12 09:32:08 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Wed, 12 Aug 2009 10:32:08 -0600 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: Message-ID: <4A82EE88.20908@noaa.gov> Mikhail Kuzminsky wrote: > In message from Craig Tierney (Tue, 11 Aug 2009 > 11:40:03 -0600): >> Rahul Nabar wrote: >>> On Mon, Aug 10, 2009 at 12:48 PM, Bruno >>> Coutinho wrote: >>>> This is often caused by cache competition or memory bandwidth >>>> saturation. >>>> If it was cache competition, rising from 4 to 6 threads would make >>>> it worse. >>>> As the code became faster with DDR3-1600 and much slower with Xeon >>>> 5400, >>>> this code is memory bandwidth bound. >>>> Tweaking CPU affinity to avoid thread jumping among cores of the >>>> will not >>>> help much, as the big bottleneck is memory bandwidth. >>>> To this code, CPU affinity will only help in NUMA machines to maintain >>>> memory access in local memory. >>>> >>>> >>>> If the machine has enough bandwidth to feed the cores, it will scale. >>> >>> Exactly! But I thought this was the big advance with the Nehalem that >>> it has removed the CPU<->Cache<->RAM bottleneck. So if the code scaled >>> with the AMD Barcelona then it would continue to scale with the >>> Nehalem right? >>> >>> I'm posting a copy of my scaling plot here if it helps. >>> >>> http://dl.getdropbox.com/u/118481/nehalem_scaling.jpg >>> >>> To remove most possible confounding factors this particular Nehlem >>> plot is produced with the following settings: >>> >>> Hyperthreading OFF >>> 24GB memory i.e. 6 banks of 4GB. i.e. optimum memory configuration >>> X5550 >>> >>> Even if we explained away the bizzare performance of the 4 node case >>> to the Turbo effect what is most confusing is how the 8 core data >>> point could be so much slower than the corresponding 8 core point on a >>> old AMD Barcelona. >>> >>> Something's wrong here that I just do not understand. BTW, any other >>> VASP users here? Anybody have any Nehalem experience? >>> >> >> Rahul, >> What are you doing to ensure that you have both memory and processor >> affinity enabled? >> Craig > > As I mentioned here in "numactl&SuSE11.1' thread, on some kernels there > is wrong behaviour for Nehalem (bad /sys/devices/system/node directory > content). This bug is presented, in particular, in default OpenSuSE 11 > kernels (2.6.27.7-9 and 2.6.29-6), and (as it was writted in the > corresponding thread discussion) in FC11 2.6.29 kernel. > > I found that in such situation disabling of NUMA in BIOS gives only > increase of STREAM throughput. Therefore I think this (Rahul) problem is > not due to BIOS settings. Unfortunately I've no data about VASP itself. > > It's interesting, do somebody have "normally working" w/Nehalem - in the > sense of NUMA - kernels ? AFAIK more old 2.6 kernels (from SuSE 10.3) > works OK, but I didn't check. May be error in NUMA support is the reason > of Rahul problem ? > What do you mean normally? I am running Centos 5.3 with 2.6.18-128.2.1 right now on a 448 node Nehalem cluster. I am so far happy with how things work. The original Centos 5.3 kernel, 2.6.18-128.1.10 had bugs in Nelahem support where nodes would just start randomly run slow. Upgrading the kernel fixed that. But that performance problem was either all or none, I don't recall it exhibiting itself in the way that Rahul described. 
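Since the fix in this case was simply running a newer errata kernel, a quick cluster-wide check that every node actually boots the same kernel can save a lot of head-scratching (a sketch; nodes.txt and a plain ssh loop stand in for whatever hostlist or pdsh setup is in use):

    for n in $(cat nodes.txt); do ssh $n uname -r; done | sort | uniq -c
    # one line of output means every node runs the same kernel;
    # anything else points at the stragglers to update and reboot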
Craig > Mikhail >> >> >>> -- >>> Rahul >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >>> To change your subscription (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >>> >> >> >> -- >> Craig Tierney (craig.tierney at noaa.gov) >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> >> -- >> ??? ????????? ???? ????????? ?? ??????? ? ??? ??????? >> ? ????? ???????? ??????????? ??????????? >> MailScanner, ? ?? ???????? >> ??? ??? ?? ???????? ???????????? ????. >> > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Craig Tierney (craig.tierney at noaa.gov) From rpnabar at gmail.com Wed Aug 12 10:56:11 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 12 Aug 2009 12:56:11 -0500 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A82EE88.20908@noaa.gov> References: <4A82EE88.20908@noaa.gov> Message-ID: On Wed, Aug 12, 2009 at 11:32 AM, Craig Tierney wrote: > What do you mean normally? ?I am running Centos 5.3 with 2.6.18-128.2.1 > right now on a 448 node Nehalem cluster. ?I am so far happy with how things work. > The original Centos 5.3 kernel, 2.6.18-128.1.10 had bugs in Nelahem support > where nodes would just start randomly run slow. ?Upgrading the kernel > fixed that. ?But that performance problem was either all or none, I don't recall > it exhibiting itself in the way that Rahul described. > For me it shows: Linux version 2.6.18-128.el5 (mockbuild at builder10.centos.org) I am a bit confused with the numbering scheme, now. Is this older or newer than Craigs? You are right Craig, I haven't noticed any random slowdowns but my data is statistically sparse. I only have a single Nehalem+CentOS test node right now. -- Rahul From rpnabar at gmail.com Wed Aug 12 10:58:38 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 12 Aug 2009 12:58:38 -0500 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A81ACF3.60802@noaa.gov> Message-ID: On Wed, Aug 12, 2009 at 10:14 AM, Mikhail Kuzminsky wrote: > As I mentioned here in "numactl&SuSE11.1' thread, on some kernels there is > wrong behaviour for Nehalem (bad /sys/devices/system/node directory > content). This bug is presented, in particular, in default OpenSuSE 11 > kernels (2.6.27.7-9 and 2.6.29-6), and (as it was writted in the > corresponding thread discussion) in FC11 2.6.29 kernel. Is there a way to check if I have this bug? ls /sys/devices/system/node/ node0 node1 Don't know what exactly is "bad content" in here. I'll see if I can find a bug report online. > It's interesting, do somebody have "normally working" w/Nehalem - in the > sense of NUMA - kernels ? AFAIK more old 2.6 kernels (from SuSE 10.3) works > OK, but I didn't check. May be error in NUMA support is the reason of Rahul > problem ? > Any way to test? Are there any NUMA support tests or benchmarks? 
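On the numbering question: 2.6.18-128.el5 is the CentOS 5.3 GA kernel, while 2.6.18-128.2.1.el5 is a later errata rebuild of the same base, so the node above is running the older of the two. A minimal way to check for and pull the newer one (standard yum usage, sketched):

    rpm -q kernel              # what is installed
    yum check-update kernel    # lists a newer errata kernel if the mirrors carry one
    yum -y update kernel       # install it; takes effect after a reboot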
-- Rahul From rpnabar at gmail.com Wed Aug 12 11:00:41 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 12 Aug 2009 13:00:41 -0500 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A82EE88.20908@noaa.gov> References: <4A82EE88.20908@noaa.gov> Message-ID: On Wed, Aug 12, 2009 at 11:32 AM, Craig Tierney wrote: > What do you mean normally? ?I am running Centos 5.3 with 2.6.18-128.2.1 > right now on a 448 node Nehalem cluster. ?I am so far happy with how things work. > The original Centos 5.3 kernel, 2.6.18-128.1.10 had bugs in Nelahem support > where nodes would just start randomly run slow. ?Upgrading the kernel > fixed that. ?But that performance problem was either all or none, I don't recall > it exhibiting itself in the way that Rahul described. > I was trying another angle. Playing with the power profiles. Just downloaded cpufreq-utils via yum. Tried to see what profile was loaded: cpufreq-info cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006 Report errors and bugs to cpufreq at vger.kernel.org, please. analyzing CPU 0: no or unknown cpufreq driver is active on this CPU analyzing CPU 1: no or unknown cpufreq driver is active on this CPU analyzing CPU 2: no or unknown cpufreq driver is active on this CPU analyzing CPU 3: no or unknown cpufreq driver is active on this CPU analyzing CPU 4: no or unknown cpufreq driver is active on this CPU analyzing CPU 5: no or unknown cpufreq driver is active on this CPU analyzing CPU 6: no or unknown cpufreq driver is active on this CPU analyzing CPU 7: no or unknown cpufreq driver is active on this CPU Is this lack of the right drivers indicative of a deeper fault or is this fairly local to this issue? This could be a clue or a red herring. Just thought that I ought to post it. -- Rahul From Craig.Tierney at noaa.gov Wed Aug 12 11:02:15 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Wed, 12 Aug 2009 12:02:15 -0600 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A82EE88.20908@noaa.gov> Message-ID: <4A8303A7.50801@noaa.gov> Rahul Nabar wrote: > On Wed, Aug 12, 2009 at 11:32 AM, Craig Tierney wrote: >> What do you mean normally? I am running Centos 5.3 with 2.6.18-128.2.1 >> right now on a 448 node Nehalem cluster. I am so far happy with how things work. >> The original Centos 5.3 kernel, 2.6.18-128.1.10 had bugs in Nelahem support >> where nodes would just start randomly run slow. Upgrading the kernel >> fixed that. But that performance problem was either all or none, I don't recall >> it exhibiting itself in the way that Rahul described. >> > > For me it shows: > > Linux version 2.6.18-128.el5 (mockbuild at builder10.centos.org) > > I am a bit confused with the numbering scheme, now. Is this older or > newer than Craigs? You are right Craig, I haven't noticed any random > slowdowns but my data is statistically sparse. I only have a single > Nehalem+CentOS test node right now. > When you run uname -a you don't get something like: [ctierney at wfe7 serial]$ uname -a Linux wfe7 2.6.18-128.2.1.el5 #1 SMP Thu Aug 6 02:00:18 GMT 2009 x86_64 x86_64 x86_64 GNU/Linux We did build our kernel from source, only because we ripped out the IB so we could build from the latest OFED stack. Try: # rpm -qa | grep kernel And see what version is listed. We have found a few performance problems so far. 1) Nodes would start going slow, really slow. However, when they started to go slow they stayed slow and the problem was cleared by a reboot. 
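One simple NUMA functional test is to pin McCalpin's STREAM (or any bandwidth benchmark) to one socket and compare local against remote memory (a sketch; ./stream is a placeholder for your own STREAM build):

    numactl --hardware                               # two nodes, each with its own memory?
    numactl --cpunodebind=0 --membind=0 ./stream     # cores on socket 0, local memory
    numactl --cpunodebind=0 --membind=1 ./stream     # same cores, remote memory
    # on a healthy dual-socket Nehalem the local run should be clearly faster;
    # near-identical numbers suggest NUMA is disabled in the BIOS or misreported
    # by the kernel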
This problem was resolved by upgrading to the kernel we use now. 2) Nodes are reporting too many System Events that look like single-bit errors. This again would show up as nodes that would start to go slow, and wouldn't be resolved until a reboot. We no longer things we had lots of bad memory, and the latest BIOS may have fixed it. We are upload that bios now and will start checking. The only time I was getting variability in timings was when I wasn't pinning processes and memory correctly. My tests have always used all the cores in a node though. I think that OpenMPI is doing the correct thing with mpi_affinity_alone. For mvapich, we wrote a wrapper script (similar to TACC) that uses numactl directly to pin memory and threads. Craig -- Craig Tierney (craig.tierney at noaa.gov) From Craig.Tierney at noaa.gov Wed Aug 12 11:06:23 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Wed, 12 Aug 2009 12:06:23 -0600 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A82EE88.20908@noaa.gov> Message-ID: <4A83049F.8000009@noaa.gov> Rahul Nabar wrote: > On Wed, Aug 12, 2009 at 11:32 AM, Craig Tierney wrote: >> What do you mean normally? I am running Centos 5.3 with 2.6.18-128.2.1 >> right now on a 448 node Nehalem cluster. I am so far happy with how things work. >> The original Centos 5.3 kernel, 2.6.18-128.1.10 had bugs in Nelahem support >> where nodes would just start randomly run slow. Upgrading the kernel >> fixed that. But that performance problem was either all or none, I don't recall >> it exhibiting itself in the way that Rahul described. >> > > I was trying another angle. Playing with the power profiles. Just > downloaded cpufreq-utils via yum. Tried to see what profile was > loaded: > > cpufreq-info > cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006 > Report errors and bugs to cpufreq at vger.kernel.org, please. > analyzing CPU 0: > no or unknown cpufreq driver is active on this CPU > analyzing CPU 1: > no or unknown cpufreq driver is active on this CPU > analyzing CPU 2: > no or unknown cpufreq driver is active on this CPU > analyzing CPU 3: > no or unknown cpufreq driver is active on this CPU > analyzing CPU 4: > no or unknown cpufreq driver is active on this CPU > analyzing CPU 5: > no or unknown cpufreq driver is active on this CPU > analyzing CPU 6: > no or unknown cpufreq driver is active on this CPU > analyzing CPU 7: > no or unknown cpufreq driver is active on this CPU > > Is this lack of the right drivers indicative of a deeper fault or is > this fairly local to this issue? This could be a clue or a red > herring. Just thought that I ought to post it. > I read there are several different tools to manage the CPU frequency. If you are using Centos/Redhat try: cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors Does it list any? If not, that might be why cpufreq-info can't find anything. Craig -- Craig Tierney (craig.tierney at noaa.gov) From gus at ldeo.columbia.edu Wed Aug 12 11:09:04 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed, 12 Aug 2009 14:09:04 -0400 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A82DC41.2060805@cse.ucdavis.edu> References: <4A81ACF3.60802@noaa.gov> <4A81CEA7.3030802@noaa.gov> <4A82DC41.2060805@cse.ucdavis.edu> Message-ID: <4A830540.1070101@ldeo.columbia.edu> Hi Bill, list Bill: This is very interesting indeed. Thanks for sharing! 
Bill's graph seem to show that Shanghai and Barcelona scale (almost) linearly with the number of cores, whereas Nehalem stops scaling and flattens out at 4 cores. The Nehalem 8 cores and 4 cores curves are virtually indistinguishable, and for very large arrays 4 cores is ahead. Only for huge arrays (>16M) Nehalem gets ahead of Shanghai and Barcelona. Did I interpret the graph right? Wasn't this type of scaling problem that plagued the Clovertown and Harpertown? Any possibility that kernels, BIOS, etc, are not yet ready for Nehalem? Thanks, Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Bill Broadley wrote: > I've been working on a pthread memory benchmark that is loosely modeled on > McCalpin's stream. It's been quite a challenge to remove all the noise/lost > performance from the benchmark to get close to performance I expected. Some > of the obstacles: > * For the compilers that tend to be better at stream (open64 and pathscale), > you lose the performance if you just replace double a[],b[],c[] with > double *a,*b,*c. Patch[1] available. I don't have a work around for > this, suggestions welcome. Is it really necessary for dynamic arrays > to be substantially slower than static? > * You have to be very careful with pointer alignment both with cache lines, > and each other > * cpu_affinity (by CPU id) > * numa (by socket id) > > The results are relatively smooth graphs, here's an example, it's uselessly > busy until you toggle off a few graphs (by clicking on the key): > > http://cse.ucdavis.edu/bill/pstream.svg > > The biggest puzzle I have now is what the previous generation intel quads, the > current generation AMD quads, and numerous other CPUs show a big benefit in > L1, while the nehalem shows no benefit. > > [1] http://cse.ucdavis.edu/bill/stream-malloc.patch > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bill at Princeton.EDU Wed Aug 12 11:11:45 2009 From: bill at Princeton.EDU (Bill Wichser) Date: Wed, 12 Aug 2009 14:11:45 -0400 Subject: [Beowulf] FYI - Dell M610 BIOS 1.0.4 Message-ID: <4A8305E1.1030705@princeton.edu> We just upgraded our BIOS on the Dell M610 blades from 1.0.4 to 1.1.4 and found that memory performance using nodeperf benchmark has almost doubled. If anyone has an old BIOS on these types of nodes I'd HIGHLY recommend updating the BIOS. Bill From rpnabar at gmail.com Wed Aug 12 11:14:29 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 12 Aug 2009 13:14:29 -0500 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A83049F.8000009@noaa.gov> References: <4A82EE88.20908@noaa.gov> <4A83049F.8000009@noaa.gov> Message-ID: On Wed, Aug 12, 2009 at 1:06 PM, Craig Tierney wrote: > I read there are several different tools to manage the CPU frequency. > If you are using Centos/Redhat try: > > cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors > > Does it list any? ?If not, that might be why cpufreq-info can't > find anything. Thanks again Craig! cat: /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors: No such file or directory But I distinctly remember several "power performance " options being listed in the BIOS. 
It is funny that the governors aren't listed. I am not sure how I can fix this. On my older AMD Barcelonas I am used to setting the governor to "performance" and that way it then does not drop its frequency ever. -- Rahul From rpnabar at gmail.com Wed Aug 12 11:16:02 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 12 Aug 2009 13:16:02 -0500 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A8303A7.50801@noaa.gov> References: <4A82EE88.20908@noaa.gov> <4A8303A7.50801@noaa.gov> Message-ID: On Wed, Aug 12, 2009 at 1:02 PM, Craig Tierney wrote: > When you run uname -a you don't get something like: > [ctierney at wfe7 serial]$ uname -a > Linux wfe7 2.6.18-128.2.1.el5 #1 SMP Thu Aug 6 02:00:18 GMT 2009 x86_64 x86_64 x86_64 GNU/Linux uname -a Linux node25 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux So it does seem that yours might be a little newer than mine (the 2.1 suffix?) > > We did build our kernel from source, only because we ripped out > the IB so we could build from the latest OFED stack. We didn't. We used the latest from the CentOS website. > Try: > > # rpm -qa | grep kernel > rpm -qa | grep kernel kernel-devel-2.6.18-128.el5 kernel-headers-2.6.18-128.el5 kernel-2.6.18-128.el5 From bill at cse.ucdavis.edu Wed Aug 12 11:19:59 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 12 Aug 2009 11:19:59 -0700 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A830540.1070101@ldeo.columbia.edu> References: <4A81ACF3.60802@noaa.gov> <4A81CEA7.3030802@noaa.gov> <4A82DC41.2060805@cse.ucdavis.edu> <4A830540.1070101@ldeo.columbia.edu> Message-ID: <4A8307CF.7090700@cse.ucdavis.edu> Gus Correa wrote: > Hi Bill, list > > Bill: This is very interesting indeed. Thanks for sharing! > > Bill's graph seem to show that Shanghai and Barcelona scale > (almost) linearly with the number of cores, whereas Nehalem stops > scaling and flattens out at 4 cores. Right. That's not really surprising since the core i7 has only 4 cores. I wasn't testing a dual socket nehalem. So on a single socket core i7 that I tested the hyperthreading provided no additional performance. None to surprising since hyperthreading is about sharing idle functional units, but doesn't do much when the cache or memory system is saturated. > The Nehalem 8 cores and 4 cores curves are virtually indistinguishable, Yes, but it was 8 threads on 4 cores, vs 4 threads on 4 cores. I'd expect something less memory intensive and more cpu intensive would show a big difference. In fact many of the HPC codes I've tried see a benefit. > and for very large arrays 4 cores is ahead. > Only for huge arrays (>16M) Nehalem gets ahead > of Shanghai and Barcelona. Yes, impressive that a single socket intel has more main memory bandwidth then a dual socket shanghai. > Did I interpret the graph right? > Wasn't this type of scaling problem that plagued > the Clovertown and Harpertown? Heh, the mention single socket core i7 has substantially more (2-4x) memory bandwidth of the previous generation intels. > Any possibility that kernels, BIOS, etc, are not yet ready for Nehalem? They look good for me, still trying to find out why I don't see better performance inside L1 though. 
From Craig.Tierney at noaa.gov Wed Aug 12 11:21:39 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Wed, 12 Aug 2009 12:21:39 -0600 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A82EE88.20908@noaa.gov> <4A83049F.8000009@noaa.gov> Message-ID: <4A830833.7090209@noaa.gov> Rahul Nabar wrote: > On Wed, Aug 12, 2009 at 1:06 PM, Craig Tierney wrote: >> I read there are several different tools to manage the CPU frequency. >> If you are using Centos/Redhat try: >> >> cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors >> >> Does it list any? If not, that might be why cpufreq-info can't >> find anything. > > Thanks again Craig! > > cat: /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors: > No such file or directory > > But I distinctly remember several "power performance " options being > listed in the BIOS. It is funny that the governors aren't listed. I am > not sure how I can fix this. > > On my older AMD Barcelonas I am used to setting the governor to > "performance" and that way it then does not drop its frequency ever. > We have ours set to performance and then copy the setting from scaling_max_freq to scaling_min_freq to make sure the speed is set right. When everything works we will play with modifying that in the prolog/epilog of the batch system to try and reduce power consumption during idle time. Craig -- Craig Tierney (craig.tierney at noaa.gov) From Craig.Tierney at noaa.gov Wed Aug 12 11:22:15 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Wed, 12 Aug 2009 12:22:15 -0600 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A82EE88.20908@noaa.gov> <4A8303A7.50801@noaa.gov> Message-ID: <4A830857.3020104@noaa.gov> Rahul Nabar wrote: > On Wed, Aug 12, 2009 at 1:02 PM, Craig Tierney wrote: >> When you run uname -a you don't get something like: > >> [ctierney at wfe7 serial]$ uname -a >> Linux wfe7 2.6.18-128.2.1.el5 #1 SMP Thu Aug 6 02:00:18 GMT 2009 x86_64 x86_64 x86_64 GNU/Linux > > uname -a > Linux node25 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 > x86_64 x86_64 GNU/Linux > > So it does seem that yours might be a little newer than mine (the 2.1 suffix?) > > >> We did build our kernel from source, only because we ripped out >> the IB so we could build from the latest OFED stack. > > We didn't. We used the latest from the CentOS website. > >> Try: >> >> # rpm -qa | grep kernel >> > > rpm -qa | grep kernel > kernel-devel-2.6.18-128.el5 > kernel-headers-2.6.18-128.el5 > kernel-2.6.18-128.el5 Weird. You might try and see if there are later packages to install. Craig -- Craig Tierney (craig.tierney at noaa.gov) From kus at free.net Wed Aug 12 11:50:16 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Wed, 12 Aug 2009 22:50:16 +0400 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A830540.1070101@ldeo.columbia.edu> Message-ID: In message from Gus Correa (Wed, 12 Aug 2009 14:09:04 -0400): >Hi Bill, list > >Bill: This is very interesting indeed. Thanks for sharing! > >Bill's graph seem to show that Shanghai and Barcelona scale >(almost) linearly with the number of cores, whereas Nehalem stops >scaling and flattens out at 4 cores. >The Nehalem 8 cores and 4 cores curves are virtually >indistinguishable, >and for very large arrays 4 cores is ahead. >Only for huge arrays (>16M) Nehalem gets ahead >of Shanghai and Barcelona. IMHO, if arrays are not "huge", they will fit in cache L3 (8MB !). Or on X axe are presented Mwords ? 
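A quick back-of-envelope on Mikhail's question, with the caveat that I have not checked which unit the plot's x-axis uses: a stream-style run keeps three arrays live, i.e. 24 bytes per element, so an 8 MB L3 covers only about 0.35M doubles in total (roughly 0.1M per array). If the axis is millions of doubles per array, 16M elements is 3 x 128 MB = 384 MB of working set, far beyond any cache; if it is MB per array, three 16 MB arrays are still 48 MB, several times the L3.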
Mikhail > >Did I interpret the graph right? >Wasn't this type of scaling problem that plagued >the Clovertown and Harpertown? >Any possibility that kernels, BIOS, etc, are not yet ready for >Nehalem? > >Thanks, >Gus Correa >--------------------------------------------------------------------- >Gustavo Correa >Lamont-Doherty Earth Observatory - Columbia University >Palisades, NY, 10964-8000 - USA >--------------------------------------------------------------------- > >Bill Broadley wrote: >> I've been working on a pthread memory benchmark that is loosely >>modeled on >> McCalpin's stream. It's been quite a challenge to remove all the >>noise/lost >> performance from the benchmark to get close to performance I >>expected. Some >> of the obstacles: >> * For the compilers that tend to be better at stream (open64 and >>pathscale), >> you lose the performance if you just replace double a[],b[],c[] >>with >> double *a,*b,*c. Patch[1] available. I don't have a work around >>for >> this, suggestions welcome. Is it really necessary for dynamic >>arrays >> to be substantially slower than static? >> * You have to be very careful with pointer alignment both with cache >>lines, >> and each other >> * cpu_affinity (by CPU id) >> * numa (by socket id) >> >> The results are relatively smooth graphs, here's an example, it's >>uselessly >> busy until you toggle off a few graphs (by clicking on the key): >> >> http://cse.ucdavis.edu/bill/pstream.svg >> >> The biggest puzzle I have now is what the previous generation intel >>quads, the >> current generation AMD quads, and numerous other CPUs show a big >>benefit in >> L1, while the nehalem shows no benefit. >> >> [1] http://cse.ucdavis.edu/bill/stream-malloc.patch >> >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >>Computing >> To change your subscription (digest mode or unsubscribe) visit >>http://www.beowulf.org/mailman/listinfo/beowulf > >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >Computing >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf > >-- >??? ????????? ???? ????????? ?? ??????? ? ??? ??????? >? ????? ???????? ??????????? ??????????? >MailScanner, ? ?? ???????? >??? ??? ?? ???????? ???????????? ????. > From rpnabar at gmail.com Wed Aug 12 11:55:45 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 12 Aug 2009 13:55:45 -0500 Subject: [Beowulf] confused about high values of "used" memory under "top" even without running jobs Message-ID: I am a bit confused about the high "used" memory that top is showing on one of my machines? Is this "leaky" memory caused by codes that did not return all their memory? Can I identify who is hogging the memory? Any other ways to "release" this memory? I can see no user processes really (even the load average is close to zero), but yet 7 GB out of our total of 16GB seems to be used. 
################################################ top - 13:45:00 up 4 days, 20:07, 2 users, load average: 0.00, 0.00, 0.00 Tasks: 146 total, 1 running, 145 sleeping, 0 stopped, 0 zombie Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 16508824k total, 7148804k used, 9360020k free, 307040k buffers Swap: 8385920k total, 0k used, 8385920k free, 6380236k cached ########################################################### On the other hand, I recall reading somewhere before that due to the paging mechanism Linux is also supposed to start using as much memory as you give it? Just confused if this is something I need to worry about or not. Incidentally the way I discovered this was because users reported that their codes were running ~30% faster right after a machine reboot as opposed to after a few days running. Do people do anything special to make sure that in a scheduler based environment (say PBS) the last job releases all its memory resources before the new one starts running? [apologize for the multi-posting ; I first posted this on a generic linux list but then thought that HPC guys might be more sensitive and concerned about such memory issues] -- Rahul From mdidomenico4 at gmail.com Wed Aug 12 12:06:08 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed, 12 Aug 2009 15:06:08 -0400 Subject: [Beowulf] confused about high values of "used" memory under "top" even without running jobs In-Reply-To: References: Message-ID: My first guess is that this is likely files being cached. We have NFS mounts and this happens alot... I don't regularly, but on occasion i've sync'd the filesystems and dumped the caches with a prolog script. Linux doesn't seem to empty the page cache faster enough, i'm sure theres a more eloquent way to do this On Wed, Aug 12, 2009 at 2:55 PM, Rahul Nabar wrote: > I am a bit confused about the high "used" memory that top is showing on one > of my machines? Is this "leaky" memory caused by codes that did not return > all their memory? Can I identify who is hogging the memory? Any other ways > to "release" this memory? > > I can see no user processes really (even the load average is close to > zero), but yet 7 GB out of our total of 16GB seems to be used. > > ################################################ > top - 13:45:00 up 4 days, 20:07, 2 users, load average: 0.00, 0.00, 0.00 > Tasks: 146 total, 1 running, 145 sleeping, 0 stopped, 0 zombie > Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, > 0.0%st > Mem: 16508824k total, 7148804k used, 9360020k free, 307040k buffers > Swap: 8385920k total, 0k used, 8385920k free, 6380236k cached > ########################################################### > > On the other hand, I recall reading somewhere before that due to the > paging mechanism > Linux is also supposed to start using as much memory as you give it? Just > confused if this is something I need to worry about or not. > > Incidentally the way I discovered this was because users reported that their > codes were running ~30% faster right after a machine reboot as opposed to > after a few days running. Do people do anything special to make sure > that in a scheduler based environment (say PBS) the last job releases > all its memory resources before the new one starts running? 
> > [apologize for the multi-posting ; I first posted this on a generic > linux list but then thought that HPC guys might be more sensitive and > concerned about such memory issues] > -- > Rahul > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From jlb17 at duke.edu Wed Aug 12 12:07:56 2009 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Wed, 12 Aug 2009 15:07:56 -0400 (EDT) Subject: [Beowulf] confused about high values of "used" memory under "top" even without running jobs In-Reply-To: References: Message-ID: On Wed, 12 Aug 2009 at 1:55pm, Rahul Nabar wrote > Mem: 16508824k total, 7148804k used, 9360020k free, 307040k buffers > Swap: 8385920k total, 0k used, 8385920k free, 6380236k cached > ########################################################### > > On the other hand, I recall reading somewhere before that due to the > paging mechanism > Linux is also supposed to start using as much memory as you give it? Just > confused if this is something I need to worry about or not. Yes, Linux caches as much in memory as it can, and this is a good thing. But that memory gets released when it's needed by an active process. So, above, you have 7148804k used, but the vast majority (6380236k), is cache. If you use 'free', it'll show you the amount not counting buffers+cache. In short -- situation normal, nothing to see here, move along. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From hahn at mcmaster.ca Wed Aug 12 12:07:57 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 12 Aug 2009 15:07:57 -0400 (EDT) Subject: [Beowulf] confused about high values of "used" memory under "top" even without running jobs In-Reply-To: References: Message-ID: > I am a bit confused about the high "used" memory that top is showing on one > of my machines? Is this "leaky" memory caused by codes that did not return > all their memory? Can I identify who is hogging the memory? Any other ways > to "release" this memory? free memory is WASTED memory. linux tries hard to keep only a smallish, limited amount of memory wasted. if you add up rss of all processes, the difference between that and 'used' is normally dominated by kernel page-cache. see /proc/sys/vm/drop_caches on how to force the kernel to throw away FS-related caches. also, I often do this: awk '{print $3*$4,$0}' /proc/slabinfo|sort -rn|head to get a quick snapshot of kinds of memory use. > Linux is also supposed to start using as much memory as you give it? Just > confused if this is something I need to worry about or not. you should never worry about paging (swapping, thrashing) until you see nontrivial swapin (NOT out) traffic. (ie, the 'si' column in "vmstat 1"). > Incidentally the way I discovered this was because users reported that their > codes were running ~30% faster right after a machine reboot as opposed to > after a few days running. isn't this one of the anomalous nehalem machines we've been talking about? if so, it's become clear that the kernel isn't managing the memory numa-aware, so the problem is probably just poor numa-layout/balance of allocations. > that in a scheduler based environment (say PBS) the last job releases > all its memory resources before the new one starts running? you could drop_caches, but this would also hurt you sometimes. 
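To put a number on the "free memory is wasted memory" point for the machine quoted above: what matters is free plus reclaimable page cache, which is what free(1) reports on its "-/+ buffers/cache" line. A small sketch of my own that computes the same thing from the standard /proc/meminfo fields (minimal error handling):

/* meminfo.c -- report "effectively available" memory the way
 * free(1)'s "-/+ buffers/cache" line does.
 */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[256];
    unsigned long total = 0, freemem = 0, buffers = 0, cached = 0;

    if (!f) { perror("/proc/meminfo"); return 1; }
    while (fgets(line, sizeof(line), f)) {
        sscanf(line, "MemTotal: %lu", &total);
        sscanf(line, "MemFree: %lu", &freemem);
        sscanf(line, "Buffers: %lu", &buffers);
        sscanf(line, "Cached: %lu", &cached);
    }
    fclose(f);

    printf("total      %lu kB\n", total);
    printf("page cache %lu kB (reclaimed automatically under pressure)\n",
           buffers + cached);
    printf("available  %lu kB\n", freemem + buffers + cached);
    return 0;
}

Against the top output quoted earlier in the thread it would report roughly 16.0 GB of the 16.5 GB as available, which matches Joshua's "nothing to see here".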
From rpnabar at gmail.com Wed Aug 12 12:12:39 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 12 Aug 2009 14:12:39 -0500 Subject: [Beowulf] confused about high values of "used" memory under "top" even without running jobs In-Reply-To: References: Message-ID: On Wed, Aug 12, 2009 at 2:07 PM, Mark Hahn wrote: > isn't this one of the anomalous nehalem machines we've been talking about? > if so, it's become clear that the kernel isn't managing the memory > numa-aware, so the problem is probably just poor numa-layout/balance > of allocations. Thanks Mark. It is one of those Nehalems. -- Rahul From gus at ldeo.columbia.edu Wed Aug 12 12:27:06 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed, 12 Aug 2009 15:27:06 -0400 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A8307CF.7090700@cse.ucdavis.edu> References: <4A81ACF3.60802@noaa.gov> <4A81CEA7.3030802@noaa.gov> <4A82DC41.2060805@cse.ucdavis.edu> <4A830540.1070101@ldeo.columbia.edu> <4A8307CF.7090700@cse.ucdavis.edu> Message-ID: <4A83178A.7020503@ldeo.columbia.edu> Hi Bill, list Bill: Many thanks for all the answers. Thanks also for the important clarification. So, the graphs you sent before compare dual socket Shanghai and Barcelona, to single socket Nehalem, right? This changes the perception a lot, as one should at most compare the 4-thread Shanghai and Barcelona curves (assuming the threads were running on a single socket) to the 4-thread Nehalem curves, right? The 8-thread curves are different animals. Would you have the (full) comparison to dual socket Nehalem, perhaps using the SMT feature also, and up to 16 threads? The benefit of SMT in HPC codes you mention matches what I saw with SMT on PPC IBM machines running climate models. (I don't have access to Nehalems to try the same codes for now.) Thank you, Gus Correa Bill Broadley wrote: > Gus Correa wrote: >> Hi Bill, list >> >> Bill: This is very interesting indeed. Thanks for sharing! >> >> Bill's graph seem to show that Shanghai and Barcelona scale >> (almost) linearly with the number of cores, whereas Nehalem stops >> scaling and flattens out at 4 cores. > > Right. That's not really surprising since the core i7 has only 4 cores. I > wasn't testing a dual socket nehalem. So on a single socket core i7 that I > tested the hyperthreading provided no additional performance. None to > surprising since hyperthreading is about sharing idle functional units, but > doesn't do much when the cache or memory system is saturated. > >> The Nehalem 8 cores and 4 cores curves are virtually indistinguishable, > > Yes, but it was 8 threads on 4 cores, vs 4 threads on 4 cores. I'd expect > something less memory intensive and more cpu intensive would show a big > difference. In fact many of the HPC codes I've tried see a benefit. > >> and for very large arrays 4 cores is ahead. >> Only for huge arrays (>16M) Nehalem gets ahead >> of Shanghai and Barcelona. > > Yes, impressive that a single socket intel has more main memory bandwidth then > a dual socket shanghai. > >> Did I interpret the graph right? >> Wasn't this type of scaling problem that plagued >> the Clovertown and Harpertown? > > Heh, the mention single socket core i7 has substantially more (2-4x) memory > bandwidth of the previous generation intels. > >> Any possibility that kernels, BIOS, etc, are not yet ready for Nehalem? > > They look good for me, still trying to find out why I don't see better > performance inside L1 though. 
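Related to Mark's remark above about the kernel not laying allocations out NUMA-aware: one way to take the decision away from the kernel is to place memory explicitly with libnuma. A rough sketch of my own, not taken from any benchmark in this thread; the size and the node loop are arbitrary illustration values, and it needs -lnuma.

/* numa_touch.c -- allocate and touch an array on each memory node in
 * turn, instead of letting first-touch put everything on node 0.
 * Build: gcc -O2 numa_touch.c -o numa_touch -lnuma
 */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    size_t n = 32UL << 20;              /* 32M doubles, about 256 MB */
    size_t i;
    double *a;
    int node;

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this kernel/library\n");
        return 1;
    }

    for (node = 0; node <= numa_max_node(); node++) {
        a = numa_alloc_onnode(n * sizeof(double), node);
        if (!a) { perror("numa_alloc_onnode"); return 1; }
        for (i = 0; i < n; i++)         /* touch so pages really land there */
            a[i] = 0.0;
        printf("allocated and touched %zu MB on node %d\n",
               (n * sizeof(double)) >> 20, node);
        numa_free(a, n * sizeof(double));
    }
    return 0;
}

The numactl utility gives similar placement and CPU-binding control from the command line without touching the code.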
From tjrc at sanger.ac.uk Wed Aug 12 14:10:47 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Wed, 12 Aug 2009 22:10:47 +0100 Subject: [Beowulf] confused about high values of "used" memory under "top" even without running jobs In-Reply-To: References: Message-ID: On 12 Aug 2009, at 8:07 pm, Mark Hahn wrote: > also, I often do this: > awk '{print $3*$4,$0}' /proc/slabinfo|sort -rn|head > to get a quick snapshot of kinds of memory use. That's a little gem! Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From bill at cse.ucdavis.edu Wed Aug 12 19:42:30 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 12 Aug 2009 19:42:30 -0700 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A81A518.2030805@cse.ucdavis.edu> Message-ID: <4A837D96.8000506@cse.ucdavis.edu> Rahul Nabar wrote: > On Tue, Aug 11, 2009 at 12:06 PM, Bill Broadley wrote: >> Looks to me like you fit in the barcelona 512KB L2 cache (and get good >> scaling) and do not fit in the nehalem 256KB L2 cache (and get poor scaling). > > Thanks Bill! I never realized that the L2 cache of the Nehalem is > actually smaller than that of the Barcelona! Indeed. Usually a doubling of cache size doesn't make a huge difference, but of course there are the occasional times when it makes a big difference. > I have an E5520 and a X5550. Both have the 8 MB L3 cache I believe. > The size of the L2 cache is fixed across the steppings of the Nehalem, > isn't it? I believe so, at least so far. >> Were the binaries compiled specifically to target both architectures? As a >> first guess I suggest trying pathscale (RIP) or open64 for amd, and intel's >> compiler for intel. But portland group does a good job at both in most cases. > > We used the intel compilers. One of my fellow grad students did the > actual compilation for VASP but I believe he used the "correct" [sic] > flags to the best of our knowledge. I could post them on the list > perhaps. There was no cross-compilation. We compiled a fresh binary > for the Nehalem. I'd make sure the compiler is fairly current. I believe both the barcelona/shanghai and the core i7/nehalem have some significant tweaks; if the compiler isn't aware of the new functionality you leave significant performance on the table. In particular the newest SSE features won't be of any benefit without direct compiler support. >> A doubling of the cache can have that effect. The Intel L3 cannot come anywhere >> close to feeding 4 cores running flat out. > > Could you explain this more? I am a little lost with the processor > dynamics. In general each step through the memory hierarchy (registers, l1, l2, l3, and main memory) approximately doubles latency and halves the available bandwidth. So for instance if you fit in L1 caches you might well be able to enjoy 160GB/sec, but if you use more than 1MB on a nehalem chip you will be in L3 with only 48GB/sec or so. Check out: (the slightly updated) http://cse.ucdavis.edu/bill/pstream.svg So if you compare the 2MB lines the core i7 with 4 threads running can handle 47GB/sec. The dual socket barcelona or shanghai system can handle 128GB/sec. So even a dual socket Nehalem, even with one of the faster clocks (I tested 2.6 GHz) and perfect scaling, would only get 95GB/sec, still well below the amd score.
Of course there are many other things going on and it might well be other differences in the architecture responsible for the difference. Even if it was memory bandwidth there was many other parts of the graph where the single socket intel does substantially better than half the AMD, and in the case of accessing main memory the single socket intel is faster than the dual socket AMD. So basically it comes down to fun handwaving about the architecture, but if you are making a price/performance decision collect a bunch of production runs and get out a stop watch. Your vasp difference in performance and scaling might well disappear with different inputs. > Does this mean using a quad core for HPC on the Nehlem is > not likely to work well for scaling? Or do you imply a solution so > that I could fix this somehow? I didn't test a dual socket nehalem because I didn't have access, I hope to have numbers soonish. In the mean time contact me off list if you want the code to try it yourself. From eugen at leitl.org Thu Aug 13 10:03:07 2009 From: eugen at leitl.org (Eugen Leitl) Date: Thu, 13 Aug 2009 19:03:07 +0200 Subject: [Beowulf] HPC as service Message-ID: <20090813170307.GM25322@leitl.org> (cloud, huh) http://www.penguincomputing.com/POD/HPC_as_a_service HPC as a Service? - A new offering from Penguin Computing Penguin Computing is proud to provide a new HPC as a Service offering for its high performance computing customers. For more details, see Penguin on Demand. Penguin POD Datasheet What is High Performance Computing as a Service? HPC as a Service is a computing model where users have on-demand access to dynamically scalable, high-performance clusters optimized for parallel computing including the expertise needed to set up, optimize and run their applications over the Internet. The traditional barriers associated with high-performance computing such as the initial capital outlay, time to procure the system, effort to optimize the software environment, engineering their system for peak demand and continuing operating costs have been removed. Instead, HPC as a Service users have a virtualized, scalable cluster available on demand that operates and has the same performance characteristics as a physical HPC cluster located in their data room. HPC as a Service inside the Cloud There are different definitions of cloud computing, but at the core ?Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.?1 HPC as a Service extends this model by making concentrated, non-virtualized high-performance computing resources available in the cloud. HPC As A Service HPC as a Service provides users with a number of key benefits: 1. HPC resources scale with demand and are available with no capital outlay ? only the resources used are actually paid for. 2. Experts in high-performance computing help setup and optimize the software environment and can help trouble-shoot issues that might occur. 3. Faster time-to-results especially for computational requirements that greatly exceed the existing computing capacity. 4. Accounts are provided on an individual user basis, and users are billed for the time they use POD. 5. Computing costs are reduced, particularly where the workflow has spikes in demand. 6. Access from anywhere in the worlds with high-speed data transfer in and out. 7. Exchanging hot-swappable 2TB disk drives overnight through Federal Express provides a secure and convenient way to transfer 25GB+ data sets. 
From ellis at runnersroll.com Thu Aug 13 10:30:08 2009 From: ellis at runnersroll.com (Ellis Wilson III) Date: Thu, 13 Aug 2009 17:30:08 +0000 Subject: [Beowulf] HPC as service In-Reply-To: <20090813170307.GM25322@leitl.org> References: <20090813170307.GM25322@leitl.org> Message-ID: <352377585-1250184566-cardhu_decombobulator_blackberry.rim.net-1068022804-@bxe1087.bisx.prod.on.blackberry> Wow. Well so much for my dream of completing my doctorate, buying an abandoned factory, building a nice cluster and offerring this exact service :(. Nice job penguin. Ellis -----Original Message----- From: Eugen Leitl Date: Thu, 13 Aug 2009 19:03:07 To: Subject: [Beowulf] HPC as service (cloud, huh) http://www.penguincomputing.com/POD/HPC_as_a_service HPC as a Service? - A new offering from Penguin Computing Penguin Computing is proud to provide a new HPC as a Service offering for its high performance computing customers. For more details, see Penguin on Demand. Penguin POD Datasheet What is High Performance Computing as a Service? HPC as a Service is a computing model where users have on-demand access to dynamically scalable, high-performance clusters optimized for parallel computing including the expertise needed to set up, optimize and run their applications over the Internet. The traditional barriers associated with high-performance computing such as the initial capital outlay, time to procure the system, effort to optimize the software environment, engineering their system for peak demand and continuing operating costs have been removed. Instead, HPC as a Service users have a virtualized, scalable cluster available on demand that operates and has the same performance characteristics as a physical HPC cluster located in their data room. HPC as a Service inside the Cloud There are different definitions of cloud computing, but at the core ?Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.?1 HPC as a Service extends this model by making concentrated, non-virtualized high-performance computing resources available in the cloud. HPC As A Service HPC as a Service provides users with a number of key benefits: 1. HPC resources scale with demand and are available with no capital outlay ? only the resources used are actually paid for. 2. Experts in high-performance computing help setup and optimize the software environment and can help trouble-shoot issues that might occur. 3. Faster time-to-results especially for computational requirements that greatly exceed the existing computing capacity. 4. Accounts are provided on an individual user basis, and users are billed for the time they use POD. 5. Computing costs are reduced, particularly where the workflow has spikes in demand. 6. Access from anywhere in the worlds with high-speed data transfer in and out. 7. Exchanging hot-swappable 2TB disk drives overnight through Federal Express provides a secure and convenient way to transfer 25GB+ data sets. 
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From amjad11 at gmail.com Thu Aug 13 12:15:04 2009 From: amjad11 at gmail.com (amjad ali) Date: Fri, 14 Aug 2009 00:15:04 +0500 Subject: [Beowulf] parallelization problem Message-ID: <428810f20908131215l6f9d6364w54657ea03c01501d@mail.gmail.com> Hi, all, I am parallelizing a CFD 2D code in FORTRAN+OPENMPI. Suppose that the grid (all triangles) is partitioned among 8 processes using METIS. Each process has different number of neighboring processes. Suppose each process has n elements/faces whose data it needs to sends to corresponding neighboring processes, and it has m number of elements/faces on which it needs to get data from corresponding neighboring processes. Values of n and m are different for each process. Another aim is to hide the communication behind computation. For this I do the following for each process: DO j = 1 to n CALL MPI_ISEND (send_data, num, type, dest(j), tag, MPI_COMM_WORLD, ireq(j), ierr) ENDDO DO k = 1 to m CALL MPI_RECV(recv_data, num, type, source(k), tag, MPI_COMM_WORLD, status, ierr) ENDDO This solves my problem. But it gives memory leakage; RAM gets filled after few thousands of iteration. What is the solution/remedy? How should I tackle this? In another CFD code I removed this problem of memory-filling by following (in that code n=m) : DO j = 1 to n CALL MPI_ISEND (send_data, num, type, dest(j), tag, MPI_COMM_WORLD, ireq(j), ierr) ENDDO CALL MPI_WAITALL(n,ireq,status,ierr) DO k = 1 to n CALL MPI_RECV(recv_data, num, type, source(k), tag, MPI_COMM_WORLD, status, ierr) ENDDO But this is not working in current code; and the previous code was not giving correct results with large number of processes. Please suggest solution. THANKS A LOT FOR YOUR KIND ATTENTION. With best regards, Amjad Ali. -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.janssens at opencfd.co.uk Thu Aug 13 13:45:01 2009 From: m.janssens at opencfd.co.uk (Mattijs Janssens) Date: Thu, 13 Aug 2009 21:45:01 +0100 Subject: [Beowulf] parallelization problem In-Reply-To: <428810f20908131215l6f9d6364w54657ea03c01501d@mail.gmail.com> References: <428810f20908131215l6f9d6364w54657ea03c01501d@mail.gmail.com> Message-ID: <200908132145.01517.m.janssens@opencfd.co.uk> Do a non-blocking receive as well so MPI_IRECV instead of MPI_RECV and make sure you have an MPI_WAITALL for all the requests (both from sending and from receiving). Kind regards, Mattijs On Thursday 13 August 2009 20:15:04 amjad ali wrote: > Hi, all, > > > > I am parallelizing a CFD 2D code in FORTRAN+OPENMPI. Suppose that the grid > (all triangles) is partitioned among 8 processes using METIS. Each process > has different number of neighboring processes. Suppose each process has n > elements/faces whose data it needs to sends to corresponding neighboring > processes, and it has m number of elements/faces on which it needs to get > data from corresponding neighboring processes. Values of n and m are > different for each process. Another aim is to hide the communication behind > computation. 
For this I do the following for each process: > > > > DO j = 1 to n > > CALL MPI_ISEND (send_data, num, type, dest(j), tag, MPI_COMM_WORLD, > ireq(j), ierr) > > ENDDO > > > > DO k = 1 to m > > CALL MPI_RECV(recv_data, num, type, source(k), tag, MPI_COMM_WORLD, status, > ierr) > > ENDDO > > > > > > This solves my problem. But it gives memory leakage; RAM gets filled after > few thousands of iteration. What is the solution/remedy? How should I > tackle this? > > > > In another CFD code I removed this problem of memory-filling by following > (in that code n=m) : > > > > DO j = 1 to n > > CALL MPI_ISEND (send_data, num, type, dest(j), tag, MPI_COMM_WORLD, > ireq(j), ierr) > > ENDDO > > > > CALL MPI_WAITALL(n,ireq,status,ierr) > > > > DO k = 1 to n > > CALL MPI_RECV(recv_data, num, type, source(k), tag, MPI_COMM_WORLD, status, > ierr) > > ENDDO > > > > But this is not working in current code; and the previous code was not > giving correct results with large number of processes. > > > > Please suggest solution. > > > > THANKS A LOT FOR YOUR KIND ATTENTION. > > > > With best regards, > > Amjad Ali. -- Mattijs Janssens OpenCFD Ltd. 9 Albert Road, Caversham, Reading RG4 7AN. Tel: +44 (0)118 9471030 Email: M.Janssens at OpenCFD.co.uk URL: http://www.OpenCFD.co.uk From christian at myri.com Wed Aug 12 08:36:35 2009 From: christian at myri.com (Christian Bell) Date: Wed, 12 Aug 2009 11:36:35 -0400 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A82DC41.2060805@cse.ucdavis.edu> References: <4A81ACF3.60802@noaa.gov> <4A81CEA7.3030802@noaa.gov> <4A82DC41.2060805@cse.ucdavis.edu> Message-ID: On Aug 12, 2009, at 11:14 AM, Bill Broadley wrote: > * For the compilers that tend to be better at stream (open64 and > pathscale), > you lose the performance if you just replace double a[],b[],c[] with > double *a,*b,*c. Patch[1] available. I don't have a work around for > this, suggestions welcome. Is it really necessary for dynamic arrays > to be substantially slower than static? Yes -- when pointers, the compiler assumes (by default) that the pointers can alias each other, which can prevent aggressive optimizations that are otherwise possible with arrays. C99 has introduced the 'restrict' keyword to allow programmers to assert that pointers of the same type cannot alias each other. However, restrict is just a hint and some compilers may or may not take advantage of it. You can also consult your compiler's documentation to see if there are other compiler-specific hints (asserting no loop-carried dependencies, loop fusion/fission). I remember stacking half a dozen pragmas over a 3-line loop on a Cray C compiler years ago to ensure that accesses where suitably optimized (or in this case, vectorized). . . christian From tom.elken at qlogic.com Thu Aug 13 16:37:07 2009 From: tom.elken at qlogic.com (Tom Elken) Date: Thu, 13 Aug 2009 16:37:07 -0700 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A81ACF3.60802@noaa.gov> <4A81CEA7.3030802@noaa.gov> <4A82DC41.2060805@cse.ucdavis.edu> Message-ID: <35AAF1E4A771E142979F27B51793A48885F52B6CC9@AVEXMB1.qlogic.org> > On Behalf Of Christian Bell > On Aug 12, 2009, at 11:14 AM, Bill Broadley wrote: > > > Is it really necessary for dynamic arrays > > to be substantially slower than static? > > Yes -- when pointers, the compiler assumes (by default) that the > pointers can alias each other, which can prevent aggressive > optimizations that are otherwise possible with arrays. ... 
> I remember stacking half a dozen pragmas over a > 3-line loop on a Cray C compiler years ago to ensure that accesses > where suitably optimized (or in this case, vectorized). To add some details to what Christian says, the HPC Challenge version of STREAM uses dynamic arrays and is hard to optimize. I don't know what's best with current compiler versions, but you could try some of these that were used in past HPCC submissions with your program, Bill: PathScale 2.2.1 on Opteron: Base OPT flags: -O3 -OPT:Ofast:fold_reassociate=0 STREAMFLAGS=-O3 -OPT:Ofast:fold_reassociate=0 -OPT:alias=restrict:align_unsafe=on -CG:movnti=1 Intel C/C++ Compiler 10.1 on Harpertown CPUs: Base OPT flags: -O2 -xT -ansi-alias -ip -i-static Intel recently used Intel C/C++ Compiler 11.0.081 on Nehalem CPUs: -O2 -xSSE4.2 -ansi-alias -ip and got good STREAM results in their HPCC submission on their ENdeavor cluster. -Tom > > > . . christian > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From richard.walsh at comcast.net Thu Aug 13 16:56:33 2009 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Thu, 13 Aug 2009 23:56:33 +0000 (UTC) Subject: [Beowulf] Wake on LAN supported on both built-in interfaces ... ?? Message-ID: <1236706861.11245011250207793299.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> All, I have a head node that am trying to get WOL set up on. It is a SuperMicro motherboard (X8DTi-F) with two built in interfaces (eth0, eth1). I am told by SuperMicro support that both interfaces support WOL fully, but when I probe them with ethtool only eth0 indicates that it supports WOL with: ... Supports Wake-on: umbg Wake-on: g ... $ethtool eth1 ... yields: ... Supports Wake-on: d Wake-on: d ... Attempting: ethtool -s eth1 wol g fails indicating (as expected) indicating: "Cannot set new wake-on-lan settings: Operation not supported not setting wol" The same command on eth0 works fine. I have set up eth0 to be my internal (private interface) and eth1 to be the internet facing interface. I would like avoid reworking everything ... ;-) ... Any thoughts? Go arounds? Is SuperMicro telling me the truth about both interfaces providing full support. Perhaps I am missing some configuration option in the BIOS. SuperMicro says I am not, but ... Thanks, rbw -------------- next part -------------- An HTML attachment was scrubbed... URL: From bill at cse.ucdavis.edu Thu Aug 13 17:09:24 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Thu, 13 Aug 2009 17:09:24 -0700 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <35AAF1E4A771E142979F27B51793A48885F52B6CC9@AVEXMB1.qlogic.org> References: <4A81ACF3.60802@noaa.gov> <4A81CEA7.3030802@noaa.gov> <4A82DC41.2060805@cse.ucdavis.edu> <35AAF1E4A771E142979F27B51793A48885F52B6CC9@AVEXMB1.qlogic.org> Message-ID: <4A84AB34.2050209@cse.ucdavis.edu> Tom Elken wrote: > To add some details to what Christian says, the HPC Challenge version of > STREAM uses dynamic arrays and is hard to optimize. I don't know what's > best with current compiler versions, but you could try some of these that > were used in past HPCC submissions with your program, Bill: Thanks for the heads up, I've checked the specbench.org compiler options for hints on where to start with optimization flags, but I didn't know about the dynamic stream. 
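For anyone who wants to see Christian's aliasing point in isolation rather than buried inside stream, here is a minimal sketch of my own: the same triad-style loop with and without the C99 restrict qualifier. Whether a given compiler actually exploits the hint is, as he says, not guaranteed.

/* triad.c -- aliasing vs restrict.  Build: gcc -std=c99 -O3 triad.c
 * Comparing the generated code of the two functions (e.g. with -S)
 * shows whether a particular compiler takes the hint.
 */
#include <stdio.h>
#include <stdlib.h>

/* a, b and c may alias, so a store to c[i] could change a[] or b[];
 * the compiler has to be conservative. */
void triad(double *a, double *b, double *c, double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + s * b[i];
}

/* restrict promises each array is reached only through its own
 * pointer -- the guarantee static arrays give the compiler for free. */
void triad_r(double *restrict a, double *restrict b,
             double *restrict c, double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + s * b[i];
}

int main(void)
{
    size_t n = 1000000, i;
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);

    for (i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }
    triad(a, b, c, 3.0, n);
    triad_r(a, b, c, 3.0, n);
    printf("c[0] = %g\n", c[0]);
    free(a); free(b); free(c);
    return 0;
}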
Is the HPC challenge code open source? > PathScale 2.2.1 on Opteron: > Base OPT flags: -O3 -OPT:Ofast:fold_reassociate=0 > STREAMFLAGS=-O3 -OPT:Ofast:fold_reassociate=0 -OPT:alias=restrict:align_unsafe=on -CG:movnti=1 Alas my pathscale license expired and I believe with sci-cortex's death (RIP) I can't renew it. I tried open64-4.2.2 with those flags and on a nehalem single socket: $ opencc -O4 -fopenmp stream.c -o stream-open64 -static $ opencc -O4 -fopenmp stream-malloc.c -o stream-open64-malloc -static $ ./stream-open64 Total memory required = 457.8 MB. Function Rate (MB/s) Avg time Min time Max time Copy: 22061.4958 0.0145 0.0145 0.0146 Scale: 22228.4705 0.0144 0.0144 0.0145 Add: 20659.2638 0.0233 0.0232 0.0233 Triad: 20511.0888 0.0235 0.0234 0.0235 Dynamic: $ ./stream-open64-malloc Function Rate (MB/s) Avg time Min time Max time Copy: 14436.5155 0.0222 0.0222 0.0222 Scale: 14667.4821 0.0218 0.0218 0.0219 Add: 15739.7070 0.0305 0.0305 0.0305 Triad: 15770.7775 0.0305 0.0304 0.0305 > Intel C/C++ Compiler 10.1 on Harpertown CPUs: > Base OPT flags: -O2 -xT -ansi-alias -ip -i-static > Intel recently used > Intel C/C++ Compiler 11.0.081 on Nehalem CPUs: > -O2 -xSSE4.2 -ansi-alias -ip > and got good STREAM results in their HPCC submission on their ENdeavor cluster. $ icc -O2 -xSSE4.2 -ansi-alias -ip -openmp stream.c -o stream-icc $ icc -O2 -xSSE4.2 -ansi-alias -ip -openmp stream-malloc.c -o stream-icc-malloc $ ./stream-icc | grep ":" STREAM version $Revision: 5.9 $ Copy: 14767.0512 0.0022 0.0022 0.0022 Scale: 14304.3513 0.0022 0.0022 0.0023 Add: 15503.3568 0.0031 0.0031 0.0031 Triad: 15613.9749 0.0031 0.0031 0.0031 $ ./stream-icc-malloc | grep ":" STREAM version $Revision: 5.9 $ Copy: 14604.7582 0.0022 0.0022 0.0022 Scale: 14480.2814 0.0022 0.0022 0.0022 Add: 15414.3321 0.0031 0.0031 0.0031 Triad: 15738.4765 0.0031 0.0030 0.0031 So ICC does manage zero penalty, alas no faster than open64 with the penalty. I'll attempt to track down the HPCC stream source code to see if their dynamic arrays are any friendlier than mine (I just use malloc). In any case many thanks for the pointer. Oh, my dynamic tweak: $ diff stream.c stream-malloc.c 43a44 > # include 97c98 < static double a[N+OFFSET], --- > /* static double a[N+OFFSET], 99c100,102 < c[N+OFFSET]; --- > c[N+OFFSET]; */ > > double *a, *b, *c; 134a138,142 > > a=(double *)malloc(sizeof(double)*(N+OFFSET)); > b=(double *)malloc(sizeof(double)*(N+OFFSET)); > c=(double *)malloc(sizeof(double)*(N+OFFSET)); > 283c291,293 < --- > free(a); > free(b); > free(c); From kus at free.net Fri Aug 14 08:08:32 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Fri, 14 Aug 2009 19:08:32 +0400 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A84AB34.2050209@cse.ucdavis.edu> Message-ID: In message from Bill Broadley (Thu, 13 Aug 2009 17:09:24 -0700): >Tom Elken wrote: >> To add some details to what Christian says, the HPC Challenge >>version of >> STREAM uses dynamic arrays and is hard to optimize. I don't know >>what's >> best with current compiler versions, but you could try some of these >>that >> were used in past HPCC submissions with your program, Bill: > >Thanks for the heads up, I've checked the specbench.org compiler >options for >hints on where to start with optimization flags, but I didn't know >about the >dynamic stream. > >Is the HPC challenge code open source? Yes, they are open. 
> >> PathScale 2.2.1 on Opteron: >> Base OPT flags: -O3 -OPT:Ofast:fold_reassociate=0 >> STREAMFLAGS=-O3 -OPT:Ofast:fold_reassociate=0 >>-OPT:alias=restrict:align_unsafe=on -CG:movnti=1 > >Alas my pathscale license expired and I believe with sci-cortex's >death (RIP) >I can't renew it. Now I understand that I was sage :-) (we purchased perpetual acafemic license). ВТW, do somebody know about Pathscale compilers future (if it will be) ? Mikhail > >I tried open64-4.2.2 with those flags and on a nehalem single socket: > >$ opencc -O4 -fopenmp stream.c -o stream-open64 -static >$ opencc -O4 -fopenmp stream-malloc.c -o stream-open64-malloc -static > >$ ./stream-open64 >Total memory required = 457.8 MB. >Function Rate (MB/s) Avg time Min time Max time >Copy: 22061.4958 0.0145 0.0145 0.0146 >Scale: 22228.4705 0.0144 0.0144 0.0145 >Add: 20659.2638 0.0233 0.0232 0.0233 >Triad: 20511.0888 0.0235 0.0234 0.0235 > >Dynamic: >$ ./stream-open64-malloc > >Function Rate (MB/s) Avg time Min time Max time >Copy: 14436.5155 0.0222 0.0222 0.0222 >Scale: 14667.4821 0.0218 0.0218 0.0219 >Add: 15739.7070 0.0305 0.0305 0.0305 >Triad: 15770.7775 0.0305 0.0304 0.0305 > >> Intel C/C++ Compiler 10.1 on Harpertown CPUs: >> Base OPT flags: -O2 -xT -ansi-alias -ip -i-static >> Intel recently used >> Intel C/C++ Compiler 11.0.081 on Nehalem CPUs: >> -O2 -xSSE4.2 -ansi-alias -ip >> and got good STREAM results in their HPCC submission on their >>ENdeavor cluster. > >$ icc -O2 -xSSE4.2 -ansi-alias -ip -openmp stream.c -o stream-icc >$ icc -O2 -xSSE4.2 -ansi-alias -ip -openmp stream-malloc.c -o >stream-icc-malloc > >$ ./stream-icc | grep ":" >STREAM version $Revision: 5.9 $ >Copy: 14767.0512 0.0022 0.0022 0.0022 >Scale: 14304.3513 0.0022 0.0022 0.0023 >Add: 15503.3568 0.0031 0.0031 0.0031 >Triad: 15613.9749 0.0031 0.0031 0.0031 >$ ./stream-icc-malloc | grep ":" >STREAM version $Revision: 5.9 $ >Copy: 14604.7582 0.0022 0.0022 0.0022 >Scale: 14480.2814 0.0022 0.0022 0.0022 >Add: 15414.3321 0.0031 0.0031 0.0031 >Triad: 15738.4765 0.0031 0.0030 0.0031 > >So ICC does manage zero penalty, alas no faster than open64 with the >penalty. > >I'll attempt to track down the HPCC stream source code to see if >their dynamic >arrays are any friendlier than mine (I just use malloc). > >In any case many thanks for the pointer. > >Oh, my dynamic tweak: >$ diff stream.c stream-malloc.c >43a44 >> # include >97c98 >< static double a[N+OFFSET], >--- >> /* static double a[N+OFFSET], >99c100,102 >< c[N+OFFSET]; >--- >> c[N+OFFSET]; */ >> >> double *a, *b, *c; >134a138,142 >> >> a=(double *)malloc(sizeof(double)*(N+OFFSET)); >> b=(double *)malloc(sizeof(double)*(N+OFFSET)); >> c=(double *)malloc(sizeof(double)*(N+OFFSET)); >> >283c291,293 >< >--- >> free(a); >> free(b); >> free(c); > > > > >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >Computing >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf > >-- >??? ????????? ???? ????????? ?? ??????? ? ??? ??????? >? ????? ???????? ??????????? ??????????? >MailScanner, ? ?? ???????? >??? ??? ?? ???????? ???????????? ????. 
> From kus at free.net Fri Aug 14 11:47:01 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Fri, 14 Aug 2009 22:47:01 +0400 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A84AB34.2050209@cse.ucdavis.edu> Message-ID: In message from Bill Broadley (Thu, 13 Aug 2009 17:09:24 -0700): Do I unerstand correctly that this results are for 4 cores& 4 openmp threads ? And what is DDR3 RAM: DDR3/1066 ? Mikhail >I tried open64-4.2.2 with those flags and on a nehalem single socket: >$ opencc -O4 -fopenmp stream.c -o stream-open64 -static >$ opencc -O4 -fopenmp stream-malloc.c -o stream-open64-malloc -static >$ ./stream-open64 >Total memory required = 457.8 MB. >Function Rate (MB/s) Avg time Min time Max time >Copy: 22061.4958 0.0145 0.0145 0.0146 >Scale: 22228.4705 0.0144 0.0144 0.0145 >Add: 20659.2638 0.0233 0.0232 0.0233 >Triad: 20511.0888 0.0235 0.0234 0.0235 >Dynamic: >$ ./stream-open64-malloc > >Function Rate (MB/s) Avg time Min time Max time >Copy: 14436.5155 0.0222 0.0222 0.0222 >Scale: 14667.4821 0.0218 0.0218 0.0219 >Add: 15739.7070 0.0305 0.0305 0.0305 >Triad: 15770.7775 0.0305 0.0304 0.0305 > >> Intel C/C++ Compiler 10.1 on Harpertown CPUs: >> Base OPT flags: -O2 -xT -ansi-alias -ip -i-static >> Intel recently used >> Intel C/C++ Compiler 11.0.081 on Nehalem CPUs: >> -O2 -xSSE4.2 -ansi-alias -ip >> and got good STREAM results in their HPCC submission on their >>ENdeavor cluster. > >$ icc -O2 -xSSE4.2 -ansi-alias -ip -openmp stream.c -o stream-icc >$ icc -O2 -xSSE4.2 -ansi-alias -ip -openmp stream-malloc.c -o >stream-icc-malloc > >$ ./stream-icc | grep ":" >STREAM version $Revision: 5.9 $ >Copy: 14767.0512 0.0022 0.0022 0.0022 >Scale: 14304.3513 0.0022 0.0022 0.0023 >Add: 15503.3568 0.0031 0.0031 0.0031 >Triad: 15613.9749 0.0031 0.0031 0.0031 >$ ./stream-icc-malloc | grep ":" >STREAM version $Revision: 5.9 $ >Copy: 14604.7582 0.0022 0.0022 0.0022 >Scale: 14480.2814 0.0022 0.0022 0.0022 >Add: 15414.3321 0.0031 0.0031 0.0031 >Triad: 15738.4765 0.0031 0.0030 0.0031 > >So ICC does manage zero penalty, alas no faster than open64 with the >penalty. > >I'll attempt to track down the HPCC stream source code to see if >their dynamic >arrays are any friendlier than mine (I just use malloc). > >In any case many thanks for the pointer. > >Oh, my dynamic tweak: >$ diff stream.c stream-malloc.c >43a44 >> # include >97c98 >< static double a[N+OFFSET], >--- >> /* static double a[N+OFFSET], >99c100,102 >< c[N+OFFSET]; >--- >> c[N+OFFSET]; */ >> >> double *a, *b, *c; >134a138,142 >> >> a=(double *)malloc(sizeof(double)*(N+OFFSET)); >> b=(double *)malloc(sizeof(double)*(N+OFFSET)); >> c=(double *)malloc(sizeof(double)*(N+OFFSET)); >> >283c291,293 >< >--- >> free(a); >> free(b); >> free(c); > > > > >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >Computing >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf > >-- >??? ????????? ???? ????????? ?? ??????? ? ??? ??????? >? ????? ???????? ??????????? ??????????? >MailScanner, ? ?? ???????? >??? ??? ?? ???????? ???????????? ????. 
> From bill at cse.ucdavis.edu Fri Aug 14 12:47:31 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Fri, 14 Aug 2009 12:47:31 -0700 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: Message-ID: <4A85BF53.3010106@cse.ucdavis.edu> Mikhail Kuzminsky wrote: > In message from Bill Broadley (Thu, 13 Aug 2009 > 17:09:24 -0700): > > Do I unerstand correctly that this results are for 4 cores& 4 openmp > threads ? And what is DDR3 RAM: DDR3/1066 ? 4 cores and 8 openmp threads. 4 threads is slightly faster: Function Rate (MB/s) Avg time Min time Max time Copy: 23670.3046 0.0135 0.0135 0.0136 Scale: 23304.9257 0.0138 0.0137 0.0139 Add: 21951.8053 0.0219 0.0219 0.0219 Triad: 21538.2451 0.0223 0.0223 0.0224 I put DDR3-1333 in the machine, but the bios seems to want to run them at 1066, I'm not sure exactly what speed they are running at. From mathog at caltech.edu Fri Aug 14 13:14:18 2009 From: mathog at caltech.edu (David Mathog) Date: Fri, 14 Aug 2009 13:14:18 -0700 Subject: [Beowulf] Wake on LAN supported on both built-in interfaces ... ?? Message-ID: richard.walsh at comcast.net wrote: > I have a head node that am trying to get WOL set up on. > > It is a SuperMicro motherboard (X8DTi-F) with two built > in interfaces (eth0, eth1). I am told by SuperMicro support > that both interfaces support WOL fully, but when I probe them > with ethtool only eth0 indicates that it supports WOL with: > That board has "Intel? 82576 Dual-Port Gigabit Ethernet" and Intel provides some information on that here: http://edc.intel.com/Link.aspx?id=2372 where it says: Wake-on-LAN support Packet recognition and wake-up for LAN on motherboard applications without software configuration and nothing more. That is ambiguous, it requires that at least one interface support WOL, but it does not say explicitly that both do. Most likely the hardware does support on both ports but the driver is confused somehow by the dual chip. Try contacting the author of the linux driver and/or Intel directly. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From tom.elken at qlogic.com Fri Aug 14 13:57:53 2009 From: tom.elken at qlogic.com (Tom Elken) Date: Fri, 14 Aug 2009 13:57:53 -0700 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A85BF53.3010106@cse.ucdavis.edu> References: <4A85BF53.3010106@cse.ucdavis.edu> Message-ID: <35AAF1E4A771E142979F27B51793A48885F52B6D52@AVEXMB1.qlogic.org> > On Behalf Of Bill Broadley > I put DDR3-1333 in the machine, but the bios seems to want to run them > at > 1066, How many dimms per memory channel do you have? My understanding (which may be a few months old) is that if you have more than one dimm per memory channel, DDR3-1333 dimms will run at 1066 speed; i.e. on your 1-CPU system, if you have 6 dimms, you have 2 per memory channel. > I'm not sure exactly what speed they are running at. Your results look excellent, so I wouldn't be surprised if they are running at 1333. 
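For the record, the arithmetic behind Tom's point, assuming standard JEDEC rates: one DDR3 channel peaks at about 10.7 GB/s at 1333 and 8.5 GB/s at 1066, so a 3-channel Nehalem socket tops out around 32 GB/s versus 25.6 GB/s. And with 6 DIMMs on a single socket you necessarily have 2 per channel, which is exactly the configuration that commonly drops 1333 parts back to 1066.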
-Tom > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From kus at free.net Fri Aug 14 16:10:42 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Sat, 15 Aug 2009 03:10:42 +0400 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <35AAF1E4A771E142979F27B51793A48885F52B6D52@AVEXMB1.qlogic.org> Message-ID: In message from Tom Elken (Fri, 14 Aug 2009 13:57:53 -0700): >> On Behalf Of Bill Broadley > >> I put DDR3-1333 in the machine, but the bios seems to want to run >>them >> at >> 1066, > >How many dimms per memory channel do you have? > >My understanding (which may be a few months old) is that if you have >more than one dimm per memory channel, DDR3-1333 dimms will run at >1066 speed; >i.e. on your 1-CPU system, if you have 6 dimms, you have 2 per memory >channel. > >> I'm not sure exactly what speed they are running at. > >Your results look excellent, so I wouldn't be surprised if they are >running at 1333. I have 12-18 GB/s on 4 threads of stream/ifort w/DDR3-1066 on a dual E5520 server. But it works under a "numa-bad" kernel w/o control of numa-efficient allocation. Mikhail > >-Tom > >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >> Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf From kus at free.net Fri Aug 14 16:24:25 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Sat, 15 Aug 2009 03:24:25 +0400 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <4A85EF91.2050705@cse.ucdavis.edu> Message-ID: In message from Bill Broadley (Fri, 14 Aug 2009 16:13:21 -0700): >Mikhail Kuzminsky wrote: >>> Your results look excellent, so I wouldn't be surprised if they are >>> running at 1333. >> >> I have 12-18 GB/s on 4 threads of stream/ifort w/DDR3-1066 on dual >>E5520 >> server. But it works under "numa-bad" kernel w/o control of >> numa-efficient allocation. > >Sounds pretty bad. > >Why 4 threads? You need 8 cores to keep all 6 memory busses busy. For comparison w/your tests: you have only 4 cores. On 8 threads I have 20-26 GB/s. > >Which compiler? The ifort mentioned above means intel fortran 11.0.38. Mikhail > open64 does substantially better than gcc. From amjad11 at gmail.com Fri Aug 14 16:44:44 2009 From: amjad11 at gmail.com (amjad ali) Date: Sat, 15 Aug 2009 04:44:44 +0500 Subject: [Beowulf] METIS Partitioning within program Message-ID: <428810f20908141644o745f6af8m6c91d19639aaec5f@mail.gmail.com> Hi all, For my parallel code to run, I first do the grid partitioning on the command line, and then for running the parallel code I hard-code the paths of the METIS partition files. It is very cumbersome if I need to run the code with different grids and for different -np values. Please tell me how to call the METIS partitioning routine from within the program, so that it works for whatever -np value is used. THANKS A LOT FOR YOUR ATTENTION. Regards, Amjad Ali.
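As the replies below suggest, METIS can also be linked into the solver and called directly, so the number of parts simply follows the communicator size. A rough sketch of my own in C against the METIS 4.0 interface; the prototype should be checked against the installed metis.h, the mesh arrays are placeholders for whatever the code already reads in, and idxtype is assumed to be a plain int here.

/* partition.c -- rank 0 partitions the dual graph of a triangular mesh
 * into "size" parts, where size comes from MPI_Comm_size, so the
 * partitioning follows -np automatically.
 * Build (roughly): mpicc partition.c -lmetis
 */
#include <metis.h>
#include <mpi.h>

void partition_mesh(int ne, int nn,
                    idxtype *elmnts,   /* ne*3 connectivity, triangles  */
                    idxtype *epart,    /* out: part number per element  */
                    idxtype *npart,    /* out: part number per node     */
                    MPI_Comm comm)
{
    int size, rank, edgecut;
    int etype = 1;      /* 1 = triangles in METIS 4                     */
    int numflag = 0;    /* 0 = C-style numbering (1 for Fortran arrays) */

    MPI_Comm_size(comm, &size);
    MPI_Comm_rank(comm, &rank);

    if (rank == 0)
        METIS_PartMeshDual(&ne, &nn, elmnts, &etype, &numflag,
                           &size, &edgecut, epart, npart);

    /* every rank needs to know which elements/nodes it owns */
    MPI_Bcast(epart, ne, MPI_INT, 0, comm);
    MPI_Bcast(npart, nn, MPI_INT, 0, comm);
}

The alternative Richard describes below, building a command string and handing it to system(), also works, but calling the library keeps everything inside one MPI job.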
-------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.walsh at comcast.net Sat Aug 15 05:57:51 2009 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Sat, 15 Aug 2009 12:57:51 +0000 (UTC) Subject: [Beowulf] METIS Partitioning within program In-Reply-To: <428810f20908141644o745f6af8m6c91d19639aaec5f@mail.gmail.com> Message-ID: <1942209729.9031250341071497.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Amjad, Have you thought of using the system call: "system(const char *string);" Type "man system" for a description. You can pass any string to the shell to be run with this call. For instance: system("date > date.out"); would instruct the shell to place the current date and time in the file date.out. If the command you wish to run changes cyclically y ou would have to manage the changes from inside the program. I am assuming a C program here. Regards, rbw ----- Original Message ----- From: "amjad ali" To: "Beowulf Mailing List" Sent: Friday, August 14, 2009 6:44:44 PM GMT -06:00 US/Canada Central Subject: [Beowulf] METIS Partitioning within program Hi all, For my parallel code to run, I first make grid partitioning on command line then for running the parallel code I give hard-code the path of METIS-partition files. It is very cumbersome if I need to run code with different grids and for different -np value. Please tell me how to call METIS partitioning routine from within the program run so that whatever -np value would be we are at ease. THANKS A LOT FOR YOUR ATTENTION. Regards, Amjad Ali. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.janssens at opencfd.co.uk Mon Aug 17 03:28:13 2009 From: m.janssens at opencfd.co.uk (Mattijs Janssens) Date: Mon, 17 Aug 2009 11:28:13 +0100 Subject: [Beowulf] METIS Partitioning within program In-Reply-To: <428810f20908141644o745f6af8m6c91d19639aaec5f@mail.gmail.com> References: <428810f20908141644o745f6af8m6c91d19639aaec5f@mail.gmail.com> Message-ID: <200908171128.13601.m.janssens@opencfd.co.uk> > For my parallel code to run, I first make grid partitioning on command line > then for running the parallel code I give hard-code the path of > METIS-partition files. It is very cumbersome if I need to run code with > different grids and for different -np value. Please tell me how to call > METIS partitioning routine from within the program run so that whatever -np > value would be we are at ease. You could call the Metis routines directly from your code. See http://glaros.dtc.umn.edu/gkhome/fetch/sw/metis/manual.pdf Regards, Mattijs From hahn at mcmaster.ca Mon Aug 17 09:05:59 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 17 Aug 2009 12:05:59 -0400 (EDT) Subject: [Beowulf] METIS Partitioning within program In-Reply-To: <200908171128.13601.m.janssens@opencfd.co.uk> References: <428810f20908141644o745f6af8m6c91d19639aaec5f@mail.gmail.com> <200908171128.13601.m.janssens@opencfd.co.uk> Message-ID: >> For my parallel code to run, I first make grid partitioning on command line >> then for running the parallel code I give hard-code the path of >> METIS-partition files. It is very cumbersome if I need to run code with >> different grids and for different -np value. 
Please tell me how to call this sort of thing is not uncommon - I normally recommend that the user submit a script which does the setup step, then mpirun's the parallel code. this wastes some cycles (the setup is normally serial, so wastes n-1 cpus for that duration.) alternatively, submitting a serial setup job followed by a (dependent) parallel job makes sense but incurs more queue time. From mmuratet at hudsonalpha.org Fri Aug 14 15:22:00 2009 From: mmuratet at hudsonalpha.org (Michael Muratet) Date: Fri, 14 Aug 2009 17:22:00 -0500 Subject: [Beowulf] newbie beorun question Message-ID: Greetings I thought I understood the operation of beorun --no-local but now I'm not so sure. I started a task with beorun --no-local command and it started on node 0 which is reasonable because the cluster was empty. I started a second task the same way and the second appears to have started on node 0 as well, leaving all the other nodes empty. How can this happen? I've re-read the man page a couple of times now and realized I don't see where it has a guarantee of load balancing. I don't see anything in the beomap man page, either, that balances loads (except manually). We are working towards submitting everything through torque, but in the meantime, is there any load balancing available via beorun? Thanks Mike From tru at pasteur.fr Sat Aug 15 07:28:23 2009 From: tru at pasteur.fr (Tru Huynh) Date: Sat, 15 Aug 2009 16:28:23 +0200 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A82EE88.20908@noaa.gov> <4A8303A7.50801@noaa.gov> Message-ID: <20090815142823.GR18295@sillage.bis.pasteur.fr> On Wed, Aug 12, 2009 at 01:16:02PM -0500, Rahul Nabar wrote: > We didn't. We used the latest from the CentOS website. Then something is not configured properly, the latest CentOS-5 kernel is kernel-2.6.18-128.4.1.el5.x86_64. Linux darwin.localdomain 2.6.18-128.4.1.el5 #1 SMP Tue Aug 4 20:19:25 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux 2.6.18-128.el5 is the GA released version for 5.3, not the current version see http://mirror.centos.org/centos/5/os/x86_64/CentOS http://mirror.centos.org/centos/5/updates/x86_64/RPMS/ Tru -- Dr Tru Huynh | http://www.pasteur.fr/recherche/unites/Binfs/ mailto:tru at pasteur.fr | tel/fax +33 1 45 68 87 37/19 Institut Pasteur, 25-28 rue du Docteur Roux, 75724 Paris CEDEX 15 France From peterffaber at web.de Sat Aug 15 08:31:17 2009 From: peterffaber at web.de (Peter Faber) Date: Sat, 15 Aug 2009 17:31:17 +0200 Subject: [Beowulf] parallelization problem Message-ID: <4A86D4C5.7060800@web.de> amjad ali wrote: > I am parallelizing a CFD 2D code in FORTRAN+OPENMPI. Suppose that the grid > (all triangles) is partitioned among 8 processes using METIS. Each process > has different number of neighboring processes. Suppose each process has n > elements/faces whose data it needs to sends to corresponding neighboring > processes, and it has m number of elements/faces on which it needs to get > data from corresponding neighboring processes. Values of n and m are > different for each process. Another aim is to hide the communication behind > computation. For this I do the following for each process: > DO j = 1 to n > > CALL MPI_ISEND (send_data, num, type, dest(j), tag, MPI_COMM_WORLD, ireq(j), > ierr) > > ENDDO > > DO k = 1 to m > > CALL MPI_RECV(recv_data, num, type, source(k), tag, MPI_COMM_WORLD, status, > ierr) > > ENDDO You may want to place the MPI_WAIT somewhere below the MPI_RECV in order to ensure that all receives can be executed and thus all sends be completed. 
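A minimal sketch of that pattern in C (the Fortran calls map one-to-one; the buffer, count and rank arrays are placeholders, and the tag and datatype are illustrative):

    #include <mpi.h>

    /* Post all sends and receives non-blocking, overlap useful work,
     * then complete every request before the buffers are reused. */
    void exchange(double **sendbuf, double **recvbuf, int num,
                  int n, const int *dest, int m, const int *source)
    {
        MPI_Request reqs[n + m];          /* C99 VLA: n sends + m receives */
        int j, k;

        for (j = 0; j < n; j++)
            MPI_Isend(sendbuf[j], num, MPI_DOUBLE, dest[j], 0,
                      MPI_COMM_WORLD, &reqs[j]);

        for (k = 0; k < m; k++)
            MPI_Irecv(recvbuf[k], num, MPI_DOUBLE, source[k], 0,
                      MPI_COMM_WORLD, &reqs[n + k]);

        /* ... interior computation that needs no halo data goes here ... */

        /* The MPI_WAIT step: no send or receive buffer is safe to touch
         * until the corresponding request has completed. */
        MPI_Waitall(n + m, reqs, MPI_STATUSES_IGNORE);
    }

Posting the receives non-blocking as well, and completing everything with one wait-all after the overlapped work, is what actually lets the communication hide behind computation.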
If your program does not work with an MPI_WAIT in place, there may be something wrong with your values for n, m, dest(j) and/or source(k), which may also explain the memory "leak". Perhaps you can check these values with a smaller number of processes? Just my 2 cents... PFF From codestr0m at osunix.org Sat Aug 15 19:45:54 2009 From: codestr0m at osunix.org (=?ISO-8859-1?Q?=22C=2E_Bergstr=F6m=22?=) Date: Sat, 15 Aug 2009 19:45:54 -0700 Subject: [Beowulf] PathScale (RIP) WAS: bizarre scaling behavior on a Nehalem In-Reply-To: <4A81A518.2030805@cse.ucdavis.edu> References: <4A81A518.2030805@cse.ucdavis.edu> Message-ID: <4A8772E2.70704@netsyncro.com> Bill Broadley wrote: > ... > Were the binaries compiled specifically to target both architectures? As a > first guess I suggest trying pathscale (RIP) or open64 for amd, and intel's > compiler for intel. But portland group does a good job at both in most cases. > I'm just now catching up on my email and really cringed when I read this... Keep an eye out for PathScale related news in the near future. If you were an old PathScale customer please feel free to contact me offline. If by phone after Wednesday of next week is best. preferred email : codestr0m at osunix.org direct : +1 415-269-8386 ./Christopher From codestr0m at osunix.org Sat Aug 15 20:13:42 2009 From: codestr0m at osunix.org (=?ISO-8859-1?Q?=22C=2E_Bergstr=F6m=22?=) Date: Sat, 15 Aug 2009 20:13:42 -0700 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: <35AAF1E4A771E142979F27B51793A48885F52B6CC9@AVEXMB1.qlogic.org> References: <4A81ACF3.60802@noaa.gov> <4A81CEA7.3030802@noaa.gov> <4A82DC41.2060805@cse.ucdavis.edu> <35AAF1E4A771E142979F27B51793A48885F52B6CC9@AVEXMB1.qlogic.org> Message-ID: <4A877966.3030307@netsyncro.com> Tom Elken wrote: >> On Behalf Of Christian Bell >> On Aug 12, 2009, at 11:14 AM, Bill Broadley wrote: >> >> >>> Is it really necessary for dynamic arrays >>> to be substantially slower than static? >>> >> Yes -- when pointers, the compiler assumes (by default) that the >> pointers can alias each other, which can prevent aggressive >> optimizations that are otherwise possible with arrays. >> > ... > >> I remember stacking half a dozen pragmas over a >> 3-line loop on a Cray C compiler years ago to ensure that accesses >> where suitably optimized (or in this case, vectorized). >> > > To add some details to what Christian says, the HPC Challenge version of STREAM uses dynamic arrays and is hard to optimize. I don't know what's best with current compiler versions, but you could try some of these that were used in past HPCC submissions with your program, Bill: > > PathScale 2.2.1 on Opteron: > Base OPT flags: -O3 -OPT:Ofast:fold_reassociate=0 > STREAMFLAGS=-O3 -OPT:Ofast:fold_reassociate=0 -OPT:alias=restrict:align_unsafe=on -CG:movnti=1 > > I hope people don't mind me replying specifically to this PathScale related stuff, but last publicly released version was 3.2 (and a follow-up 3.3-beta) If you're a PathScale customer and interested in this update please feel free to contact me off list. ./Christopher From philippe.blaise at cea.fr Mon Aug 17 02:15:31 2009 From: philippe.blaise at cea.fr (Philippe Blaise) Date: Mon, 17 Aug 2009 11:15:31 +0200 Subject: [Beowulf] MAGMA feedback Message-ID: <4A891FB3.2060700@cea.fr> Does anyone heard about the Magma project ? http://icl.cs.utk.edu/magma/people/index.html If somebody has tried using a linux cluster or whatever machine, associated to GPUS, any feedback is welcome. Thanx Phil. 
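Returning to the pointer-aliasing point raised in the STREAM discussion above: a small C illustration of why dynamically allocated, pointer-based arrays can defeat optimization, and how C99 restrict (like the alias-related compiler flags quoted earlier) promises the compiler that the operands do not overlap. The function names are illustrative only.

    /* Without restrict the compiler must assume a, b and c may overlap,
     * so it cannot freely keep values in registers or vectorize. */
    void triad_may_alias(double *a, const double *b, const double *c,
                         double s, long n)
    {
        for (long i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }

    /* With restrict the loop can be optimized much like one over
     * statically declared arrays. */
    void triad_no_alias(double *restrict a, const double *restrict b,
                        const double *restrict c, double s, long n)
    {
        for (long i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }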
From kus at free.net Wed Aug 19 10:12:41 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Wed, 19 Aug 2009 21:12:41 +0400 Subject: [Beowulf] moving of Linux HDD to other node: udev problem at boot Message-ID: As it was discussed here, there are NUMA problems w/Nehalem on a set of Linux distributions/kernels. I was informed that may be old OpenSuSE 10.3 default kernel (2.6.22) works w/Nehalem OK in the sense of NUMA, i.e. gives right /sys/devices/system/node content. I moved Western Digital SATA HDD w/SuSE 10.3 installed (on dual Barcelona server) to dual Nehalem server (master HDD on Nehalem server) with Supermicro X8DTi mobo. But loading of SuSE 10.3 on Nehalem server was not successful. Grub loader (which menu.lst configuration uses "by-id" identification of disk partitions) works OK. But linux kernel booting didn't finish successfully: /boot/04-udev.sh script (which task is udev initialization) - I think, it's from initrd content - do not see root partition (1st partition on HDD) ! At the boot I see the messages .... SCSI subsystem initialized ACPI Exception (processor_core_0787): Processor device isn't present .... ... Trying manual resume from /dev/sda2 /* it's swap partition*/ resume device /dev/sda2 not found (ignoring) ... Waiting for device /dev/disk/by-id/scsi-SATA-WDC_WD-part1 ... /* echo from udev.sh */ and then the proposal to try again. After finish of this script I don't see any HDDs in /dev. BIOS setting for this SATA device is "enhanced". "compatible" mode gives the same result. What may be the source of the problem ? May be HDD driver used by initrd ? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow PS. If I see (after finish of udev.sh script) the content of /sys - it's right in NUMA sense, i.e. /sys/devices/system/node contains normal node0 and node1. From reuti at staff.uni-marburg.de Wed Aug 19 12:07:19 2009 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed, 19 Aug 2009 21:07:19 +0200 Subject: [Beowulf] moving of Linux HDD to other node: udev problem at boot In-Reply-To: References: Message-ID: Am 19.08.2009 um 19:12 schrieb Mikhail Kuzminsky: > As it was discussed here, there are NUMA problems w/Nehalem on a > set of Linux distributions/kernels. I was informed that may be old > OpenSuSE 10.3 default kernel (2.6.22) works w/Nehalem OK in the > sense of NUMA, i.e. gives right /sys/devices/system/node content. > > I moved Western Digital SATA HDD w/SuSE 10.3 installed (on dual > Barcelona server) to dual Nehalem server (master HDD on Nehalem > server) with Supermicro X8DTi mobo. > > But loading of SuSE 10.3 on Nehalem server was not successful. Grub > loader (which menu.lst configuration uses "by-id" identification of > disk partitions) works OK. But linux kernel booting didn't finish > successfully: /boot/04-udev.sh script (which task is udev > initialization) - I think, it's from initrd content - do not see > root partition (1st partition on HDD) ! > At the boot I see the messages > .... > SCSI subsystem initialized > ACPI Exception (processor_core_0787): Processor device isn't present > .... > > ... > Trying manual resume from /dev/sda2 /* it's > swap partition*/ > resume device /dev/sda2 not found (ignoring) There might be an entry "append resume=..." either in lilo.conf or grub's menu.lst > ... > Waiting for device /dev/disk/by-id/scsi-SATA-WDC_WD- > part1 ... /* echo from udev.sh */ Maybe the disk id is different form the one recored in /etc/fstab. 
What about using plain /dev/sda1 or alike, or mounting by volume label? -- Reuti > > and then the proposal to try again. After finish of this script I > don't see any HDDs in /dev. > > > BIOS setting for this SATA device is "enhanced". "compatible" mode > gives the same result. > > What may be the source of the problem ? May be HDD driver used by > initrd ? > Mikhail Kuzminsky > Computer Assistance to Chemical Research Center > Zelinsky Institute of Organic Chemistry RAS > Moscow > PS. If I see (after finish of udev.sh script) the content of /sys > - it's right in NUMA sense, i.e. > /sys/devices/system/node contains normal node0 and node1. > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From kus at free.net Thu Aug 20 07:29:55 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Thu, 20 Aug 2009 18:29:55 +0400 Subject: [Beowulf] moving of Linux HDD to other node: udev problem at boot In-Reply-To: Message-ID: In message from Reuti (Wed, 19 Aug 2009 21:07:19 +0200): >Maybe the disk id is different form the one recored in /etc/fstab. > What about using plain /dev/sda1 or alike, or mounting by volume >label? At the moment of problem /etc/fstab, as I understand, isn't used. And /dev/sda* files are not created by udev :-( Mikhail > >-- Reuti > > >> >> and then the proposal to try again. After finish of this script I >> don't see any HDDs in /dev. >> >> >> BIOS setting for this SATA device is "enhanced". "compatible" mode >> gives the same result. >> >> What may be the source of the problem ? May be HDD driver used by >> initrd ? >> Mikhail Kuzminsky >> Computer Assistance to Chemical Research Center >> Zelinsky Institute of Organic Chemistry RAS >> Moscow >> PS. If I see (after finish of udev.sh script) the content of /sys >> - it's right in NUMA sense, i.e. >> /sys/devices/system/node contains normal node0 and node1. >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >> Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >Computing >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf > From cousins at umit.maine.edu Thu Aug 20 09:01:46 2009 From: cousins at umit.maine.edu (Steve Cousins) Date: Thu, 20 Aug 2009 12:01:46 -0400 (EDT) Subject: [Beowulf] Anybody going with AMD Istanbul 6-core CPU's? Message-ID: I haven't seen anybody here talking about the 6-core AMD CPU's yet. Is anybody trying these out? Anybody have real-world comparisons (say WRF) of scalability of a 12-core system vs. a 16 thread Nehalem system? Thanks, Steve From eagles051387 at gmail.com Thu Aug 20 09:10:43 2009 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Thu, 20 Aug 2009 18:10:43 +0200 Subject: [Beowulf] amd 3 and 6 core processors Message-ID: someone brought this up in another post and instead of hyjacking that post i started a new one.
what are the advantages of having processors that defy the normal development and progressions of cores. 2 4 8 etc like intel follows but amd seems to have done 2 3 4 6 ? what advantage will these kinds of processors have over the intel 4 vs amd 3 and intel 8 vs amd 6? -- Jonathan Aquilina -------------- next part -------------- An HTML attachment was scrubbed... URL: From reuti at staff.uni-marburg.de Thu Aug 20 10:02:49 2009 From: reuti at staff.uni-marburg.de (Reuti) Date: Thu, 20 Aug 2009 19:02:49 +0200 Subject: [Beowulf] moving of Linux HDD to other node: udev problem at boot In-Reply-To: References: Message-ID: <17C4B67F-303C-4C6B-BF47-614AD6D2F84E@staff.uni-marburg.de> Am 20.08.2009 um 16:29 schrieb Mikhail Kuzminsky: > In message from Reuti (Wed, 19 Aug > 2009 21:07:19 +0200): > >> Maybe the disk id is different form the one recored in /etc/fstab. >> What about using plain /dev/sda1 or alike, or mounting by volume >> label? > > At the moment of problem /etc/fstab, as I understand, isn't used. > And /dev/sda* files are not created by udev :-( Is it the same motherboard/chipset in both machines? You might need a different initrd otherwise. -- Reuti > > Mikhail >> >> -- Reuti >> >> >>> >>> and then the proposal to try again. After finish of this script >>> I don't see any HDDs in /dev. >>> >>> >>> BIOS setting for this SATA device is "enhanced". "compatible" >>> mode gives the same result. >>> >>> What may be the source of the problem ? May be HDD driver used >>> by initrd ? >>> Mikhail Kuzminsky >>> Computer Assistance to Chemical Research Center >>> Zelinsky Institute of Organic Chemistry RAS >>> Moscow >>> PS. If I see (after finish of udev.sh script) the content of / >>> sys - it's right in NUMA sense, i.e. >>> /sys/devices/system/node contains normal node0 and node1. >>> >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >>> Computing >>> To change your subscription (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >> Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> >> -- >> ??? ????????? ???? ????????? ?? ??????? ? ??? ??????? >> ? ????? ???????? ??????????? ??????????? >> MailScanner, ? ?? ???????? >> ??? ??? ?? ???????? ???????????? ????. >> > From reuti at staff.uni-marburg.de Thu Aug 20 11:06:07 2009 From: reuti at staff.uni-marburg.de (Reuti) Date: Thu, 20 Aug 2009 20:06:07 +0200 Subject: [Beowulf] moving of Linux HDD to other node: udev problem at boot In-Reply-To: References: Message-ID: <485D2643-4A7C-4D9D-94D8-E80355994908@staff.uni-marburg.de> Am 20.08.2009 um 19:33 schrieb Mikhail Kuzminsky: > In message from Reuti (Thu, 20 Aug > 2009 19:02:49 +0200): >> Am 20.08.2009 um 16:29 schrieb Mikhail Kuzminsky: >> >>> In message from Reuti (Wed, 19 Aug >>> 2009 21:07:19 +0200): >>> >>>> Maybe the disk id is different form the one recored in /etc/ >>>> fstab. What about using plain /dev/sda1 or alike, or mounting >>>> by volume label? >>> >>> At the moment of problem /etc/fstab, as I understand, isn't >>> used. And /dev/sda* files are not created by udev :-( >> >> Is it the same motherboard/chipset in both machines? 
> > Of course, no - taking into account that one ("source" w/10.3) is w/ > Opteron 2350, and 2nd is based on Nehalem :-) > >> You might need a different initrd otherwise. > > AFAIK, initrd (as the kernel itself) is "universal" for EM64T/x86-64, The problem is not the type of CPU, but the chipset (i.e. the necessary kernel module) with which the HDD is accessed. -- Reuti From lindahl at pbm.com Thu Aug 20 11:23:25 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 20 Aug 2009 11:23:25 -0700 Subject: [Beowulf] moving of Linux HDD to other node: udev problem at boot In-Reply-To: <485D2643-4A7C-4D9D-94D8-E80355994908@staff.uni-marburg.de> References: <485D2643-4A7C-4D9D-94D8-E80355994908@staff.uni-marburg.de> Message-ID: <20090820182325.GC22812@bx9.net> On Thu, Aug 20, 2009 at 08:06:07PM +0200, Reuti wrote: >> AFAIK, initrd (as the kernel itself) is "universal" for EM64T/x86-64, > > The problem is not the type of CPU, but the chipset (i.e. the necessary > kernel module) with which the HDD is accessed. There are 2 aspects to this: 1: /etc/modprobe.conf or equivalent 2: the initrd on a non-rescue disk is generally specialized to only include modules for devices in (1). Solution? Boot in a rescue disk, chroot to your system disk, modify /etc/modprobe.conf appropriately, run mkinitrd. -- g From lindahl at pbm.com Thu Aug 20 11:26:21 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 20 Aug 2009 11:26:21 -0700 Subject: [Beowulf] amd 3 and 6 core processors In-Reply-To: References: Message-ID: <20090820182621.GD22812@bx9.net> On Thu, Aug 20, 2009 at 06:10:43PM +0200, Jonathan Aquilina wrote: > someone brought this up in another post and instead of hyjacking that post i > started a new one. what are the advantages of having processors that defy > the normal development and progressions of cores. > > 2 4 8 etc like intel follows but amd seems to have done 2 3 4 6 ? Intel has done 6 core processors in the recent past, and there are currently 3 memory controllers on Nehalem cpus. The point of a 6 core processor is that's all that fits, plus there isn't enough memory bandwidth to support 8 cores with many workloads. The point of 3 core processors was selling at a lower price point, and being able to sell defective chips as good. -- greg From mathog at caltech.edu Thu Aug 20 11:29:17 2009 From: mathog at caltech.edu (David Mathog) Date: Thu, 20 Aug 2009 11:29:17 -0700 Subject: [Beowulf] Re:moving of Linux HDD to other node: udev problem at boot Message-ID: "Mikhail Kuzminsky" wrote: > I moved Western Digital SATA HDD w/SuSE 10.3 installed (on dual > Barcelona server) to dual Nehalem server (master HDD on Nehalem > server) with Supermicro X8DTi mobo. Which means any number of drivers will have to change. The boot could only succeed if all of these new drivers are present in the distro AND the installation isn't hardwired to use information from the previous system. The first may be true, the second is almost certainly false. On Mandriva, and probably Red Hat, and maybe Suse, even cloning between "identical" systems requires that that the file: /etc/udev/rules.d/61-net_config.rules be removed before reboot as it holds a copy of the MAC from the previous system, and no two machines (should) have the same MAC even if they are otherwise identical. There are a lot of other files in the same directory which I believe hold similar machine specific information. Similarly, your /etc/modprobe.conf will almost certainly load modules which are not appropriate for the new system. 
If there is an /etc/sysconfig directory there may be files there that also hold machine specific information. The /etc/sensors.conf configuration will also certainly also be incorrect. And remember, this assumes the distro even has all the right pieces somewhere on the disk - which it may not. Perhaps you can successfully boot the system in safe mode and then run whatever configuration tool Suse provides to reset all of these hardware specific files? Otherwise, it might be easier to wipe the disk and do a clean install. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From eagles051387 at gmail.com Thu Aug 20 11:45:07 2009 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Thu, 20 Aug 2009 20:45:07 +0200 Subject: [Beowulf] amd 3 and 6 core processors In-Reply-To: <20090820182621.GD22812@bx9.net> References: <20090820182621.GD22812@bx9.net> Message-ID: a friend of mine told me that the amd tri cores were quads with one core disbaled? On Thu, Aug 20, 2009 at 8:26 PM, Greg Lindahl wrote: > On Thu, Aug 20, 2009 at 06:10:43PM +0200, Jonathan Aquilina wrote: > > > someone brought this up in another post and instead of hyjacking that > post i > > started a new one. what are the advantages of having processors that defy > > the normal development and progressions of cores. > > > > 2 4 8 etc like intel follows but amd seems to have done 2 3 4 6 ? > > Intel has done 6 core processors in the recent past, and there are > currently 3 memory controllers on Nehalem cpus. > > The point of a 6 core processor is that's all that fits, plus there isn't > enough memory bandwidth to support 8 cores with many workloads. > > The point of 3 core processors was selling at a lower price point, and > being able to sell defective chips as good. > > -- greg > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Jonathan Aquilina -------------- next part -------------- An HTML attachment was scrubbed... URL: From mathog at caltech.edu Thu Aug 20 12:33:38 2009 From: mathog at caltech.edu (David Mathog) Date: Thu, 20 Aug 2009 12:33:38 -0700 Subject: [Beowulf] Re: amd 3 and 6 core processors Message-ID: Jonathan Aquilina wrote: > a friend of mine told me that the amd tri cores were quads with one core > disabled? Probably. It will often be the case that the disabled core is defective, maybe not fully dead, but it did not pass all of its tests. It is common practice to recycle multicore CPUs with one bad CPU and sell it as a lower performance part. Similarly, chips that won't run at full speed, but will pass all tests at a lower speed, may be stamped as a lower performance part and shipped as that. It makes good business sense to do this since it lets them recover the otherwise wasted production costs on these partially defective devices. They may also disable the 4th core even if works perfectly, and sell it as a 3 core device, when they have an order for the tricore that needs to be shipped and not enough quadcore chips on hand with one bad core to fill it. 
Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From coutinho at dcc.ufmg.br Thu Aug 20 12:34:41 2009 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Thu, 20 Aug 2009 16:34:41 -0300 Subject: [Beowulf] amd 3 and 6 core processors In-Reply-To: References: <20090820182621.GD22812@bx9.net> Message-ID: Yes. Amd tri-cores were quad cores with a defective core, so they disable the defective core And amd isn't alone. Nvidia Geforce GTX 260s were GTX 280s with disabled stream processors. 2009/8/20 Jonathan Aquilina > a friend of mine told me that the amd tri cores were quads with one core > disbaled? > > > On Thu, Aug 20, 2009 at 8:26 PM, Greg Lindahl wrote: > >> On Thu, Aug 20, 2009 at 06:10:43PM +0200, Jonathan Aquilina wrote: >> >> > someone brought this up in another post and instead of hyjacking that >> post i >> > started a new one. what are the advantages of having processors that >> defy >> > the normal development and progressions of cores. >> > >> > 2 4 8 etc like intel follows but amd seems to have done 2 3 4 6 ? >> >> Intel has done 6 core processors in the recent past, and there are >> currently 3 memory controllers on Nehalem cpus. >> >> The point of a 6 core processor is that's all that fits, plus there isn't >> enough memory bandwidth to support 8 cores with many workloads. >> >> The point of 3 core processors was selling at a lower price point, and >> being able to sell defective chips as good. >> >> -- greg >> >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > > > > -- > Jonathan Aquilina > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kus at free.net Thu Aug 20 15:23:13 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Fri, 21 Aug 2009 02:23:13 +0400 Subject: [Beowulf] Re: moving of Linux HDD to other node: udev problem at boot In-Reply-To: Message-ID: In message from "David Mathog" (Thu, 20 Aug 2009 11:29:17 -0700): >"Mikhail Kuzminsky" wrote: >> I moved Western Digital SATA HDD w/SuSE 10.3 installed (on dual >> Barcelona server) to dual Nehalem server (master HDD on Nehalem >> server) with Supermicro X8DTi mobo. > >Which means any number of drivers will have to change. The boot >could >only succeed if all of these new drivers are present in the distro >AND >the installation isn't hardwired to use information from the previous >system. The first may be true, It was, of course, the main hope > the second is almost certainly false. ... and the second can be resolved IMHO w/o difficult problems. >On Mandriva, and probably Red Hat, and maybe Suse, even cloning >between >"identical" systems requires that that the file: > > /etc/udev/rules.d/61-net_config.rules > >be removed before reboot as it holds a copy of the MAC from the >previous >system, and no two machines (should) have the same MAC even if they >are >otherwise identical. SuSE have this "problem", but at least 11.1 have special setting to avoid such udev behaviour. And updating of network settings isn't a problem. 
> There are a lot of other files in the same >directory which I believe hold similar machine specific information. >Similarly, your /etc/modprobe.conf will almost certainly load modules >which are not appropriate for the new system. Is there some modules which depends from processors ? The NIC drivers isn't a problem. > If there is an /etc/sysconfig directory there may be files there that also hold machine specific information. The /etc/sensors.conf configuration will also certainly also be incorrect. Of course, lm_sensors and NICs settings have to be changed. But HDDs for example was the same (excluding size). >Perhaps you can successfully boot the system in safe mode and then >run >whatever configuration tool Suse provides to reset all of these >hardware >specific files? The problem don't depends from kind of load (safemode or usual). >David Mathog >mathog at caltech.edu >Manager, Sequence Analysis Facility, Biology Division, Caltech From richard.walsh at comcast.net Thu Aug 20 15:35:34 2009 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Thu, 20 Aug 2009 22:35:34 +0000 (UTC) Subject: [Beowulf] Re: amd 3 and 6 core processors In-Reply-To: Message-ID: <778088694.1715931250807734460.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> >----- Original Message ----- >From: "David Mathog" >To: beowulf at beowulf.org >Sent: Thursday, August 20, 2009 2:33:38 PM GMT -06:00 US/Canada Central >Subject: [Beowulf] Re: amd 3 and 6 core processors > >Jonathan Aquilina wrote: > >> a friend of mine told me that the amd tri cores were quads with one core >> disabled? > >Probably. It will often be the case that the disabled core is >defective, maybe not fully dead, but it did not pass all of its tests. >It is common practice to recycle multicore CPUs with one bad CPU and >sell it as a lower performance part. Similarly, chips that won't run at >full speed, but will pass all tests at a lower speed, may be stamped as >a lower performance part and shipped as that. It makes good business >sense to do this since it lets them recover the otherwise wasted >production costs on these partially defective devices. They may also >disable the 4th core even if works perfectly, and sell it as a 3 core >device, when they have an order for the tricore that needs to be shipped >and not enough quadcore chips on hand with one bad core to fill it. Many good points above and in Greg's earlier note. Its all about yield and what you can fit on the chip at a given line width. In the past, binning by clock was the primary (only?) choice to bring up yields. As chips have grown in size and evolved toward multi-core, degrading cores has been a economic side-benefit. IBM was one of the first to use this approach (first with dual-core too), when they sold dual-core Power series chips with one core disable to give the remaining core maximum bandwidth. There is little benefit in developing processing for real 2, 3, 4, 5, 6, 7, ... etc. core chips. Better to start with a standard process and core-count, and degrade it to fill lower power and performance bins. The Nehalem micro-architecture is available as a dual core offering. It is not clear to me (someone here may know), whether this is not just a degraded quad-core, or a true dual core. This pinout is different, so perhaps it is a true dual-core. I would also like to know how Intel and AMD are disabling/degrading the cores. They very like have built in circuits that they can "burn out" to ensure physical incapacity. Still, perhaps it is done another way. 
With Nehalem and its on-chip power management unit, dynamic "soft" disabling may be all that is needed. As folks here are I am sure aware, Intel will have a true 8-core offering in the next 3 to 6 months which puts them in a position to offer 5 and 7 core degraded processors as well. rbw -------------- next part -------------- An HTML attachment was scrubbed... URL: From kus at free.net Thu Aug 20 15:41:35 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Fri, 21 Aug 2009 02:41:35 +0400 Subject: [Beowulf] moving of Linux HDD to other node: udev problem at boot In-Reply-To: <20090820182325.GC22812@bx9.net> Message-ID: In message from Greg Lindahl (Thu, 20 Aug 2009 11:23:25 -0700): >On Thu, Aug 20, 2009 at 08:06:07PM +0200, Reuti wrote: > >>> AFAIK, initrd (as the kernel itself) is "universal" for >>>EM64T/x86-64, >> >> The problem is not the type of CPU, but the chipset (i.e. the >>necessary >> kernel module) with which the HDD is accessed. > >There are 2 aspects to this: > >1: /etc/modprobe.conf or equivalent >2: the initrd on a non-rescue disk is generally specialized to only >include modules for devices in (1). > >Solution? Boot in a rescue disk, chroot to your system disk, modify >/etc/modprobe.conf appropriately, run mkinitrd. Thanks, it's good idea ! The problem is (I think) just in 10.3 initrd image. Unfortunately it's in some inconsistence w/my source hope - move HDD ASAP (As Simple As Possible :-)) ). Mikhail From eagles051387 at gmail.com Thu Aug 20 16:24:35 2009 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Fri, 21 Aug 2009 01:24:35 +0200 Subject: [Beowulf] Re: amd 3 and 6 core processors In-Reply-To: <778088694.1715931250807734460.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> References: <778088694.1715931250807734460.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: rbw that is actually true 4th quater of this year intel is release its first 8 core with hyperthreading processor to the xenon market. amd currently already has their 6 core out. i understand the reasoning you made about recycling them David, which saves the company money as a whole on manufacturing especially since they wont need another plant to prossibly produce the lower end product. On Fri, Aug 21, 2009 at 12:35 AM, wrote: > > > > > >----- Original Message ----- > >From: "David Mathog" > >To: beowulf at beowulf.org > >Sent: Thursday, August 20, 2009 2:33:38 PM GMT -06:00 US/Canada Central > >Subject: [Beowulf] Re: amd 3 and 6 core processors > > > >Jonathan Aquilina wrote: > > > >> a friend of mine told me that the amd tri cores were quads with one core > >> disabled? > > > >Probably. It will often be the case that the disabled core is > >defective, maybe not fully dead, but it did not pass all of its tests. > >It is common practice to recycle multicore CPUs with one bad CPU and > >sell it as a lower performance part. Similarly, chips that won't run at > >full speed, but will pass all tests at a lower speed, may be stamped as > >a lower performance part and shipped as that. It makes good business > >sense to do this since it lets them recover the otherwise wasted > >production costs on these partially defective devices. They may also > >disable the 4th core even if works perfectly, and sell it as a 3 core > >device, when they have an order for the tricore that needs to be shipped > >and not enough quadcore chips on hand with one bad core to fill it. > > Many good points above and in Greg's earlier note. 
Its all about yield > and what you can fit on the chip at a given line width. > > In the past, binning by clock was the primary (only?) choice to bring up > yields. As chips have grown in size and evolved toward multi-core, > degrading cores has been a economic side-benefit. IBM was one of > the first to use this approach (first with dual-core too), when they sold > dual-core > Power series chips with one core disable to give the remaining core > maximum bandwidth. There is little benefit in developing processing > for real 2, 3, 4, 5, 6, 7, ... etc. core chips. Better to start with a > standard > process and core-count, and degrade it to fill lower power and performance > bins. The Nehalem micro-architecture is available as a dual core offering. > It > is not clear to me (someone here may know), whether this is not just a > degraded quad-core, or a true dual core. This pinout is different, so > perhaps it is a true dual-core. I would also like to know how Intel and > AMD are disabling/degrading the cores. They very like have built > in circuits that they can "burn out" to ensure physical incapacity. Still, > perhaps it is done another way. With Nehalem and its on-chip power > management unit, dynamic "soft" disabling may be all that is needed. > > As folks here are I am sure aware, Intel will have a true 8-core offering > in the next 3 to 6 months which puts them in a position to offer 5 and > 7 core degraded processors as well. > > rbw > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- Jonathan Aquilina -------------- next part -------------- An HTML attachment was scrubbed... URL: From henning.fehrmann at aei.mpg.de Fri Aug 21 06:25:32 2009 From: henning.fehrmann at aei.mpg.de (Henning Fehrmann) Date: Fri, 21 Aug 2009 15:25:32 +0200 Subject: [Beowulf] HD undetectable errors Message-ID: <20090821132532.GA16945@gretchen.aei.mpg.de> Hello, a typical rate for data not recovered in a read operation on a HD is 1 per 10^15 bit reads. If one fills a 100TByte file server the probability of loosing data is of the order of 1. Off course, one could circumvent this problem by using RAID5 or RAID6. Most of the controller do not check the parity if they read data and here the trouble begins. I can't recall the rate for undetectable errors but this might be few orders of magnitude smaller than 1 per 10^15 bit reads. However, given the fact that one deals nowadays with few hundred TBytes of data this might happen from time to time without being realized. One could lower the rate by forcing the RAID controller to check the parity information in a read process. Are there RAID controller which are able to perform this? Another solution might be the useage of file systems which have additional checksums for the blocks like zfs or qfs. This even prevents data corruption due to undetected bit flips on the bus or the RAID controller. Does somebody know the size of the checksum and the rate of undetected errors for qfs? For zfs it is 256 bit per 512Byte data. One option is the fletcher2 algorithm to compute the checksum. Does somebody know the rate of undetectable bit flips for such a setting? Are there any other file systems doing block-wise checksumming? 
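To make the block-checksum idea concrete, here is a generic Fletcher-style checksum sketch in C. It is in the spirit of, but not identical to, ZFS's fletcher2; the word width, struct layout and function name are illustrative assumptions rather than the actual ZFS implementation or on-disk format, and the buffer length is assumed to be a multiple of 8 bytes.

    #include <stdint.h>
    #include <stddef.h>

    struct blk_cksum { uint64_t a, b; };

    /* Two running sums over 64-bit words: the first-order sum catches
     * flipped bits, the second-order sum also catches reordered data. */
    static struct blk_cksum fletcher_like(const void *buf, size_t size)
    {
        const uint64_t *w   = buf;
        const uint64_t *end = w + size / sizeof(uint64_t);
        struct blk_cksum c  = { 0, 0 };

        for (; w < end; w++) {
            c.a += *w;
            c.b += c.a;
        }
        return c;   /* stored separately from the data block itself */
    }

On read the filesystem recomputes this over each block and compares it with the stored copy (ZFS keeps it in the parent block pointer), so corruption introduced by the disk, the controller or the bus is caught even if the RAID layer never verifies parity.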
Thank you, Henning Fehrmann From cap at nsc.liu.se Fri Aug 21 09:33:17 2009 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Fri, 21 Aug 2009 18:33:17 +0200 Subject: [Beowulf] HD undetectable errors In-Reply-To: <20090821132532.GA16945@gretchen.aei.mpg.de> References: <20090821132532.GA16945@gretchen.aei.mpg.de> Message-ID: <200908211833.21485.cap@nsc.liu.se> On Friday 21 August 2009, Henning Fehrmann wrote: > Hello, > > a typical rate for data not recovered in a read operation on a HD is > 1 per 10^15 bit reads. I think Seagate claims 10^15 _sectors_ but I may have mis-read it. > If one fills a 100TByte file server the probability of loosing data > is of the order of 1. > Off course, one could circumvent this problem by using RAID5 or RAID6. > Most of the controller do not check the parity if they read data and > here the trouble begins. Peter Kelemen at CERN has done some interesting things, like: http://cern.ch/Peter.Kelemen/talk/2007/kelemen-2007-C5-Silent_Corruptions.pdf > I can't recall the rate for undetectable errors but this might be few > orders of magnitude smaller than 1 per 10^15 bit reads. However, given > the fact that one deals nowadays with few hundred TBytes of data this > might happen from time to time without being realized. > > One could lower the rate by forcing the RAID controller to check the > parity information in a read process. Are there RAID controller which > are able to perform this? Yes. But most wont and it will hur quite a lot performance wise. I know, for example, that our IBM DS4700 with updated firmware can enable "verify-on-read". > Another solution might be the useage of file systems which have additional > checksums for the blocks like zfs or qfs. This even prevents data > corruption due to undetected bit flips on the bus or the RAID > controller. This is, IMHO, probably a better approach. As you can read in the article I referenced above the controller layer (and really any layer) adds yet another source of silent corruptions so the higher up the better. Also, the file system has information about the data layout that it can use to do this more efficiently than lower layers. > Does somebody know the size of the checksum and the rate of undetected > errors for qfs? Remember that you have to calculate this against the amount of corrupt data, not the total amount of data. /Peter > For zfs it is 256 bit per 512Byte data. > One option is the fletcher2 algorithm to compute the checksum. > Does somebody know the rate of undetectable bit flips for such a > setting? > > Are there any other file systems doing block-wise checksumming? > > > Thank you, > Henning Fehrmann -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. URL: From orion at cora.nwra.com Fri Aug 21 10:22:30 2009 From: orion at cora.nwra.com (Orion Poplawski) Date: Fri, 21 Aug 2009 11:22:30 -0600 Subject: [Beowulf] Help for terrible NFS write performance Message-ID: <4A8ED7D6.8080601@cora.nwra.com> I'm trying to improve the terrible NFS (write in particular) performance I'm seeing. Pure network performance does not appear to be an issue as I can hit 120MB/s reading which should be about the limit for gigE. Perhaps the local disk performance is not what it should be. Any help would be greatly appreciated. Using bonnie++ for benchmarks. 
Server: Dual proc dual core opteron 2GHz 8GB RAM CentOS 4.7 kernel 2.6.9-78.0.22.plus.c4smp 3 8-port Marvell MV88SX6081 SATAII controllers sata_mv 3.6.2 driver Ethernet controller: nVidia Corporation MCP55 Ethernet (rev a3) MTU 8982 Arrays are linux md arrays of 6 disks with 2 on each controller. 64k cunks. ext3 filesystem. "working" - raid0 ST31000340AS 1TB drives local perf: 224-240MB/s write, 135MB/s rewrite, 390-400MB/s read "cora6" - raid5 ST31500341AS 1.5TB drives local perf: 84MB/s write, 42MB/s rewrite, 161-166MB/s read /etc/exports: /export *.cora.nwra.com(rw,sync,fsid=0) /export/cora6 *.cora.nwra.com(rw,sync,nohide) /export/working *.cora.nwra.com(rw,sync,nohide) /etc/sysconfig/nfs: # This entry should be "yes" if you are using RPCSEC_GSS_KRB5 (auth=krb5,krb5i, or krb5p) SECURE_NFS="no" RPCNFSDCOUNT=32 MOUNTD_NFS_V2=no client: Dual Opteron 246 3GB RAM CentOS 5.3 kernel 2.6.18-144.el5 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03) mounts: NFSv4 rsize=32768,wsize=32768 "working" - 60MB/s write, 12MB/s rewrite, 120MB/s read "cora6" - 21MB/s write, 7.7MB/s rewrite, 120MB/s read Thanks again! -- Orion Poplawski Technical Manager 303-415-9701 x222 NWRA/CoRA Division FAX: 303-415-9702 3380 Mitchell Lane orion at cora.nwra.com Boulder, CO 80301 http://www.cora.nwra.com From peter.st.john at gmail.com Fri Aug 21 10:22:54 2009 From: peter.st.john at gmail.com (Peter St. John) Date: Fri, 21 Aug 2009 13:22:54 -0400 Subject: [Beowulf] modular motherboard prototype Message-ID: Slashdot pointed out a prototype modular motherboard; a big mobo is made from snapping together a bunch of little ones: http://www.wired.com/gadgetlab/2009/08/modular-motherboard/ 'David Ackley , associate professor of computer science at the University of New Mexico and one of the contributors to the project [says] ?We have a CPU, RAM, data storage and serial ports for connectivity on every two square inches.? ' ... Each X Machina module has a 72 MHz processor (currently an ARM chip), a solid state drive of 16KB and 128KB of storage in an EEPROM ... chip. There?s also an LED for display output and a button for user interaction.' The modules share power & data through a single connector, recognize the presence of neighbors and can load each others programs automatically. If the connectors are the usual male/female (not certain from the picture) then snapping them together would be a bear, since the vertical pair of pins wouldn't be aligned when the horizontal pair hasn't been snapped in yet, and vice versa. Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: From landman at scalableinformatics.com Fri Aug 21 10:38:17 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 21 Aug 2009 13:38:17 -0400 Subject: [Beowulf] Help for terrible NFS write performance In-Reply-To: <4A8ED7D6.8080601@cora.nwra.com> References: <4A8ED7D6.8080601@cora.nwra.com> Message-ID: <4A8EDB89.7090409@scalableinformatics.com> Orion Poplawski wrote: > I'm trying to improve the terrible NFS (write in particular) performance > I'm seeing. Pure network performance does not appear to be an issue as > I can hit 120MB/s reading which should be about the limit for gigE. > Perhaps the local disk performance is not what it should be. Any help > would be greatly appreciated. Using bonnie++ for benchmarks. 
> > Server: > > Dual proc dual core opteron 2GHz > 8GB RAM > CentOS 4.7 > kernel 2.6.9-78.0.22.plus.c4smp > 3 8-port Marvell MV88SX6081 SATAII controllers > sata_mv 3.6.2 driver > Ethernet controller: nVidia Corporation MCP55 Ethernet (rev a3) > MTU 8982 > > Arrays are linux md arrays of 6 disks with 2 on each controller. 64k > cunks. ext3 filesystem. If I had to bet, ext3 would have much to do with this ... though, honestly, md RAID write performance over NFS is nothing to write home about. We can get ~350MB/s on our DeltaV's, but this takes lots of work. > > "working" - raid0 ST31000340AS 1TB drives > local perf: 224-240MB/s write, 135MB/s rewrite, 390-400MB/s read > "cora6" - raid5 ST31500341AS 1.5TB drives > local perf: 84MB/s write, 42MB/s rewrite, 161-166MB/s read > > /etc/exports: > /export *.cora.nwra.com(rw,sync,fsid=0) > /export/cora6 *.cora.nwra.com(rw,sync,nohide) > /export/working *.cora.nwra.com(rw,sync,nohide) Ok. There it is... Sync. Don't need to see anything else. That is it. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From kus at free.net Fri Aug 21 10:53:16 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Fri, 21 Aug 2009 21:53:16 +0400 Subject: [Beowulf] nearly future of Larrabee Message-ID: AFAIK Larrabee-based product(s) will appear soon - at begin of 2010. Unfortunatley I didn't see enough appropriate technical information. What new is known from SIGGRAPH 2009 ? There was 2 ideas of Larrabee-based hardware a) Whole computers on Larrabee CPU(s) b) GPGPU card. Recently I didn't see any words about Larrabee-based servers - only about graphical cards. If Larrabee will work as CPU - then I beleive that linux kernel developers will work in this direction. But I didn't find anything about Larrabee in 2.6. So Q1. Is there the plans to build Larrabee-based motherboards (in particular in 2010) ? If Larrabee will be in the form of graphical card (the most probable case) - Q2. What will be the interface - one slot PCI-E v.2 x16 ? It's known now, that DP will be hardware supported and (AFAIK) that 512-bit operands (i.e. 8 DP words) will be supported in ISA. Q3. Does it means that Larrabee will give essential speedup also on relative short vectors ? And is there some preliminary articles w/estimation of Larrabee DP performance ? One of declared potential advantages of Larrabee is support by compilers. There is now PGI Fortran w/NVidia GPGPU extensions. PGI Accelerator-2010 will include support of CUDA on the base of OpenMP-like comments to compiler. So Q4. Is there some rumours about direct Larrabee support w/Intel ifort or PGI compilers in 2010 ? (By "direct" I mean automatic compiler vectorization of "pure" Fortran/C source, maximim w/additional commemts). Q5. How much may costs Larrabee-based hardware in 2010 ? I hope it'll be lower $10000. Any more exact predictions ? 
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow From orion at cora.nwra.com Fri Aug 21 10:56:37 2009 From: orion at cora.nwra.com (Orion Poplawski) Date: Fri, 21 Aug 2009 11:56:37 -0600 Subject: [Beowulf] Help for terrible NFS write performance In-Reply-To: <4A8EDB89.7090409@scalableinformatics.com> References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> Message-ID: <4A8EDFD5.9080909@cora.nwra.com> On 08/21/2009 11:38 AM, Joe Landman wrote: > Orion Poplawski wrote: >> /etc/exports: >> /export *.cora.nwra.com(rw,sync,fsid=0) >> /export/cora6 *.cora.nwra.com(rw,sync,nohide) >> /export/working *.cora.nwra.com(rw,sync,nohide) > > Ok. There it is... Sync. > > Don't need to see anything else. > > That is it. Okay, how afraid should I be of using async? man exports states: async This option allows the NFS server to violate the NFS protocol and reply to requests before any changes made by that request have been committed to stable storage (e.g. disc drive). Using this option might improve performance with version 2 only, but at the cost that an unclean server restart (i.e. a crash) can cause data to be lost or corrupted. I tend to like to avoid data loss or corruption. Thanks! -- Orion Poplawski Technical Manager 303-415-9701 x222 NWRA/CoRA Division FAX: 303-415-9702 3380 Mitchell Lane orion at cora.nwra.com Boulder, CO 80301 http://www.cora.nwra.com From lindahl at pbm.com Fri Aug 21 10:59:48 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 21 Aug 2009 10:59:48 -0700 Subject: [Beowulf] HD undetectable errors In-Reply-To: <20090821132532.GA16945@gretchen.aei.mpg.de> References: <20090821132532.GA16945@gretchen.aei.mpg.de> Message-ID: <20090821175948.GA18621@bx9.net> On Fri, Aug 21, 2009 at 03:25:32PM +0200, Henning Fehrmann wrote: > Are there any other file systems doing block-wise checksumming? Cloud filesystems (GFS, HDFS) and Lustre all keep checksums, which end up being user-level checksums in files stored in traditional filesystems. I have a petabyte of disk and wouldn't dream of storing data without checksums. -- greg From skylar at cs.earlham.edu Fri Aug 21 11:03:00 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Fri, 21 Aug 2009 11:03:00 -0700 Subject: [Beowulf] Help for terrible NFS write performance In-Reply-To: <4A8EDFD5.9080909@cora.nwra.com> References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> <4A8EDFD5.9080909@cora.nwra.com> Message-ID: <4A8EE154.2060308@cs.earlham.edu> Orion Poplawski wrote: > Okay, how afraid should I be of using async? man exports states: > > async This option allows the NFS server to violate the NFS > protocol > and reply to requests before any changes made by that > request > have been committed to stable storage (e.g. disc drive). > > Using this option might improve performance with > version 2 > only, but at the cost that an unclean server restart > (i.e. a > crash) can cause data to be lost or corrupted. > > > I tend to like to avoid data loss or corruption. > > Thanks! > It depends on your applications. async doesn't remove the capability of writing synchronously to an NFS export, but it removes the implicit sync after every write. If your applications know to fsync() when they absolutely need to have stuff written to disk (and of course check errno after fsync()), the server will respect that and flush the data for that file and only reply if the data was flushed properly. 
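A minimal sketch of that application-side pattern in C (the path and payload are placeholders; whether the flush really reaches stable storage on the NFS server still depends on the server and export settings discussed in this thread):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Write a buffer and explicitly force it out, checking every step. */
    int write_checkpoint(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return -1; }

        if (write(fd, buf, len) != (ssize_t)len) {   /* short write = trouble */
            perror("write");
            close(fd);
            return -1;
        }
        if (fsync(fd) != 0) {      /* data is not "safe" until this succeeds */
            perror("fsync");
            close(fd);
            return -1;
        }
        return close(fd);          /* close() can also report deferred errors */
    }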
NB: This likely depends on both your NFS server and client implementations. Use with caution. YMMV. -- -- Skylar Thompson (skylar at cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 253 bytes Desc: OpenPGP digital signature URL: From landman at scalableinformatics.com Fri Aug 21 11:09:18 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 21 Aug 2009 14:09:18 -0400 Subject: [Beowulf] Help for terrible NFS write performance In-Reply-To: References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> Message-ID: <4A8EE2CE.7030405@scalableinformatics.com> Cunningham, Dave wrote: > Are you suggesting that running nosync in an hpc environment might be > a good idea ? We had that discussion with our integrator and were > assured that running nosync was the path to damnation and would > probably grow hair on our palms. ... that may be, but the flip side is that your write performance will be terrible with sync. Its a question of which risk is larger. Sync forces the write to wait to return to the user until the data is committed to disk. Which is technically not true, as it is committed to cache on the drive unless you turned off all write caching. Even then, drivers don't necessarily wait and verify that the data got to disk. They verify that the data was flushed to disk, but not that the data on the disk is what you thought it was. This said, we haven't seen this problem (corrupted fs data) as an issue for a crashed NFS server in a while (years). YMMV. > What are your thoughts on the tradeoff ? Performance or "safety". Sync doesn't give you guarantees that your data is on disk, it merely guarantees that the relevant semantics have been honored. We have found that with a good journaling file system (not ext3), that this is not usually an issue. Then again, you are using md raid, which means you don't have a nice battery backed raid behind you to cache IO ops, such as writes that didn't finish making it to disk. So unless you have turned off write caching on the drives themselves, the sync is superfluous. Bug me offline if you want to talk more about this. Joe > > Dave Cunningham > > -----Original Message----- From: beowulf-bounces at beowulf.org > [mailto:beowulf-bounces at beowulf.org] On Behalf Of Joe Landman Sent: > Friday, August 21, 2009 10:38 AM To: Orion Poplawski Cc: Beowulf List > Subject: Re: [Beowulf] Help for terrible NFS write performance > > Orion Poplawski wrote: >> I'm trying to improve the terrible NFS (write in particular) >> performance I'm seeing. Pure network performance does not appear >> to be an issue as I can hit 120MB/s reading which should be about >> the limit for gigE. Perhaps the local disk performance is not what >> it should be. Any help would be greatly appreciated. Using >> bonnie++ for benchmarks. >> >> Server: >> >> Dual proc dual core opteron 2GHz 8GB RAM CentOS 4.7 kernel >> 2.6.9-78.0.22.plus.c4smp 3 8-port Marvell MV88SX6081 SATAII >> controllers sata_mv 3.6.2 driver Ethernet controller: nVidia >> Corporation MCP55 Ethernet (rev a3) MTU 8982 >> >> Arrays are linux md arrays of 6 disks with 2 on each controller. >> 64k cunks. ext3 filesystem. > > If I had to bet, ext3 would have much to do with this ... though, > honestly, md RAID write performance over NFS is nothing to write home > about. We can get ~350MB/s on our DeltaV's, but this takes lots of > work. 
> >> "working" - raid0 ST31000340AS 1TB drives local perf: 224-240MB/s >> write, 135MB/s rewrite, 390-400MB/s read "cora6" - raid5 >> ST31500341AS 1.5TB drives local perf: 84MB/s write, 42MB/s rewrite, >> 161-166MB/s read >> >> /etc/exports: /export *.cora.nwra.com(rw,sync,fsid=0) >> /export/cora6 *.cora.nwra.com(rw,sync,nohide) /export/working >> *.cora.nwra.com(rw,sync,nohide) > > Ok. There it is... Sync. > > Don't need to see anything else. > > That is it. > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From landman at scalableinformatics.com Fri Aug 21 11:14:51 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 21 Aug 2009 14:14:51 -0400 Subject: [Beowulf] Help for terrible NFS write performance In-Reply-To: <4A8EDFD5.9080909@cora.nwra.com> References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> <4A8EDFD5.9080909@cora.nwra.com> Message-ID: <4A8EE41B.7000605@scalableinformatics.com> Orion Poplawski wrote: > On 08/21/2009 11:38 AM, Joe Landman wrote: >> Orion Poplawski wrote: >>> /etc/exports: >>> /export *.cora.nwra.com(rw,sync,fsid=0) >>> /export/cora6 *.cora.nwra.com(rw,sync,nohide) >>> /export/working *.cora.nwra.com(rw,sync,nohide) >> >> Ok. There it is... Sync. >> >> Don't need to see anything else. >> >> That is it. > > Okay, how afraid should I be of using async? man exports states: > > async This option allows the NFS server to violate the NFS protocol > and reply to requests before any changes made by that > request > have been committed to stable storage (e.g. disc drive). > > Using this option might improve performance with version 2 > only, but at the cost that an unclean server restart > (i.e. a > crash) can cause data to be lost or corrupted. If your NFS server crashes, you *could* lose data. Not you *will* lose data. But since you are using md raid without battery backed cache, chances of data loss could be higher (all the RAID calculations happen in RAM). The file system could be properly resilient. What would you do if the server crashed? Would you have users restart their runs? Its all a question of risk, real and imagined (or more correctly ... over-emphasized). > I tend to like to avoid data loss or corruption. So do most people. If your node crashes, do you get data loss? If your server crashes, will you get data loss? I am guessing that if you do an hdparm -W /dev/sd* you will find your write caching on, on each drive. If so, sync is less of an issue, and you have bigger worries. This is BTW an area where having a big fast RAID card with a large cache is a definite advantage over md raid. Data in a battery backed RAID cache won't go away under a reboot/crash. > > Thanks! > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From james.p.lux at jpl.nasa.gov Fri Aug 21 12:04:09 2009 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 21 Aug 2009 12:04:09 -0700 Subject: [Beowulf] modular motherboard prototype In-Reply-To: References: Message-ID: From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Peter St. 
John Sent: Friday, August 21, 2009 10:23 AM To: Beowulf List Subject: [Beowulf] modular motherboard prototype Slashdot pointed out a prototype modular motherboard; a big mobo is made from snapping together a bunch of little ones: http://www.wired.com/gadgetlab/2009/08/modular-motherboard/ 'David Ackley, associate professor of computer science at the University of New Mexico and one of the contributors to the project [says] "We have a CPU, RAM, data storage and serial ports for connectivity on every two square inches." ' ... Each X Machina module has a 72 MHz processor (currently an ARM chip), a solid state drive of 16KB and 128KB of storage in an EEPROM ... chip. There's also an LED for display output and a button for user interaction.' The modules share power & data through a single connector, recognize the presence of neighbors and can load each others programs automatically. If the connectors are the usual male/female (not certain from the picture) then snapping them together would be a bear, since the vertical pair of pins wouldn't be aligned when the horizontal pair hasn't been snapped in yet, and vice versa. Peter ---- There are lots of hermaphroditic connectors around, or, for that matter, you arrange both male and female contacts in the connector. There is probably also a relative orientation requirement for these boards.. e.g. they are all in a plane, and they all have to have the same 2D rotational orientation (North up on all boards, or something similar), so then you could have gendered connectors, and still flexible interconnects.. The left side has female, the right male, for instance. In the case of the Illuminato boards, it looks like they're simple female 0.1" spaced connectors, and you'd put a 7x1 pin header in between to connect them. (it might be a 2x7 array for 14 pins.. the picture isn't totally clear on the website. These days, with high speed serial interconnects, especially with wireless (optical or RF), one should be able to build nodes that can be literally thrown together any old way, except for the power. And there are solutions for power distribution as well. The folks doing swarms and emergent behavior are big into this kind of thing. AS far as the architecture goes, I wonder if these guys ever looked at transputers and their related architectures. And anyone on this list is well aware that the real challenge isn't the hardware, but the software, and making use of that amorphous blob of processing with random interconnects for anything other than "toy" applications. As the Wired article says: "..there are many details that need to be worked out..." "..haven't benchmarked .. power consumption and speed.." We can guess, though.. It's an ARM running at 72 MHz, for which they claim 64 Dhrystone MIPS. Some of those are going to be burned in interprocess communication. The LPC2368 (which is what they use) datasheet says it draws 125mA at 3.3V running at 72MHz. (I'm pretty sure that doesn't include power to drive I/O pins, which is separately supplied). That's about half a watt. So they're getting about 128 DMIPS/Watt. Let's compare to, say, a Core2 Duo at 2.4GHz, with 7000 DMIPS, burning about 35W, or 200 DMIPS/Watt. They're in the same ballpark, but I suspect that in a "real" system doing "real" work, the overhead moving data from one tiny processor to another will consume easily half of the overall resources. SO right now, this is a cute toy. I wouldn't mind having a few dozen modules to fool with. 
(they're about $60 each) (compare also, what if you went out and got a box of Gumstix) Jim From lindahl at pbm.com Fri Aug 21 13:15:54 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 21 Aug 2009 13:15:54 -0700 Subject: [Beowulf] modular motherboard prototype In-Reply-To: References: Message-ID: <20090821201554.GD314@bx9.net> On Fri, Aug 21, 2009 at 12:04:09PM -0700, Lux, Jim (337C) wrote: > We can guess, though.. It's an ARM running at 72 MHz, for which they > claim 64 Dhrystone MIPS. Some of those are going to be burned in > interprocess communication. The LPC2368 (which is what they use) > datasheet says it draws 125mA at 3.3V running at 72MHz. (I'm pretty > sure that doesn't include power to drive I/O pins, which is > separately supplied). That's about half a watt. So they're getting > about 128 DMIPS/Watt In comparison, a Sheevaplug is $100, 1.2 Ghz ARM, 15 watts max for the whole thing, and you can use gigE networking and program it with MPI. I suppose it depends on how weird you like your hobbies to be :-) -- g From gerry.creager at tamu.edu Fri Aug 21 13:28:37 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Fri, 21 Aug 2009 15:28:37 -0500 Subject: [Beowulf] modular motherboard prototype In-Reply-To: <20090821201554.GD314@bx9.net> References: <20090821201554.GD314@bx9.net> Message-ID: <4A8F0375.8010902@tamu.edu> Greg Lindahl wrote: > On Fri, Aug 21, 2009 at 12:04:09PM -0700, Lux, Jim (337C) wrote: > >> We can guess, though.. It's an ARM running at 72 MHz, for which they >> claim 64 Dhrystone MIPS. Some of those are going to be burned in >> interprocess communication. The LPC2368 (which is what they use) >> datasheet says it draws 125mA at 3.3V running at 72MHz. (I'm pretty >> sure that doesn't include power to drive I/O pins, which is >> separately supplied). That's about half a watt. So they're getting >> about 128 DMIPS/Watt > > In comparison, a Sheevaplug is $100, 1.2 Ghz ARM, 15 watts max for the > whole thing, and you can use gigE networking and program it with MPI. > I suppose it depends on how weird you like your hobbies to be :-) For Jim's projects to work, however, he's gotta also have the spacecraft fly information to keep the PowerSquid in formation, so the SheevaPlugs can all get power. Then there's the real long extension cord back to those AC-procuding solar cells.... From prentice at ias.edu Fri Aug 21 14:11:00 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 21 Aug 2009 17:11:00 -0400 Subject: [Beowulf] Help for terrible NFS write performance In-Reply-To: <4A8EE41B.7000605@scalableinformatics.com> References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> <4A8EDFD5.9080909@cora.nwra.com> <4A8EE41B.7000605@scalableinformatics.com> Message-ID: <4A8F0D64.7050702@ias.edu> Joe Landman wrote: > > So do most people. If your node crashes, do you get data loss? If your > server crashes, will you get data loss? I am guessing that if you do an > > hdparm -W /dev/sd* Just following this thread. 
When I try that command, I get an error: hdparm -W /dev/sd* -W: missing value (0/1) /dev/sda: /dev/sda1: /dev/sda2: /dev/sdb: /dev/sdc: /dev/sdd: /dev/sde: /dev/sdf: /dev/sdg: -- Prentice From james.p.lux at jpl.nasa.gov Fri Aug 21 14:24:29 2009 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 21 Aug 2009 14:24:29 -0700 Subject: [Beowulf] modular motherboard prototype In-Reply-To: <4A8F0375.8010902@tamu.edu> References: <20090821201554.GD314@bx9.net> <4A8F0375.8010902@tamu.edu> Message-ID: > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] > On Behalf Of Gerry Creager > Sent: Friday, August 21, 2009 1:29 PM > To: Greg Lindahl > Cc: beowulf at beowulf.org > Subject: Re: [Beowulf] modular motherboard prototype > > Greg Lindahl wrote: > > On Fri, Aug 21, 2009 at 12:04:09PM -0700, Lux, Jim (337C) wrote: > > > >> We can guess, though.. It's an ARM running at 72 MHz, for which they > >> claim 64 Dhrystone MIPS. Some of those are going to be burned in > >> interprocess communication. The LPC2368 (which is what they use) > >> datasheet says it draws 125mA at 3.3V running at 72MHz. (I'm pretty > >> sure that doesn't include power to drive I/O pins, which is > >> separately supplied). That's about half a watt. So they're getting > >> about 128 DMIPS/Watt > > > > In comparison, a Sheevaplug is $100, 1.2 Ghz ARM, 15 watts max for > the > > whole thing, and you can use gigE networking and program it with MPI. > > I suppose it depends on how weird you like your hobbies to be :-) > > > For Jim's projects to work, however, he's gotta also have the spacecraft > fly information to keep the PowerSquid in formation, so the SheevaPlugs > can all get power. Then there's the real long extension cord back to > those AC-procuding solar cells.... But that's work.. For hobbies, it's different. Hmm. 3.3V power.. I wonder if you could string em in series and run them off a rectified and filtered wall socket, just like Christmas tree lights. One could do all kinds of interesting things with individual processors on each bulb.. display patterns, messages. If they had photosensors, you could build your own fireflies, and have them blink in synchronism, etc. From gus at ldeo.columbia.edu Fri Aug 21 14:46:42 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 21 Aug 2009 17:46:42 -0400 Subject: hdparm error: Was Re: [Beowulf] Help for terrible NFS write performance In-Reply-To: <4A8F0D64.7050702@ias.edu> References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> <4A8EDFD5.9080909@cora.nwra.com> <4A8EE41B.7000605@scalableinformatics.com> <4A8F0D64.7050702@ias.edu> Message-ID: <4A8F15C2.7040009@ldeo.columbia.edu> Prentice Bisbal wrote: > Joe Landman wrote: >> So do most people. If your node crashes, do you get data loss? If your >> server crashes, will you get data loss? I am guessing that if you do an >> >> hdparm -W /dev/sd* > > Just following this thread. When I try that command, I get an error: > > hdparm -W /dev/sd* > -W: missing value (0/1) > > /dev/sda: > > > /dev/sdg: > > -- > Prentice > Somehow it works on Fedora 10, but not on CentOS 4 and 5. # uname -r 2.6.27.25-170.2.72.fc10.i686 # hdparm -W /dev/sda /dev/sda: write-caching = 1 (on) ** # uname -r 2.6.18-92.1.22.el5 # hdparm -W /dev/sda -W: missing value (0/1) /dev/sda: ** The hdparm man page says: "Some options may work correctly only with the latest kernels." 
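A small sketch of how to work with the older hdparm builds (the device name is a placeholder, and this assumes the drive answers an ATA identify request):

# hdparm 6.x and earlier only accept -W with an explicit value:
hdparm -W0 /dev/sda      # turn drive write caching off
hdparm -W1 /dev/sda      # turn it back on
# to merely query the current state on those versions, the identify data helps:
hdparm -I /dev/sda | grep -i 'write cache'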
Gus Correa From coutinho at dcc.ufmg.br Fri Aug 21 15:17:34 2009 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Fri, 21 Aug 2009 19:17:34 -0300 Subject: [Beowulf] modular motherboard prototype In-Reply-To: <20090821201554.GD314@bx9.net> References: <20090821201554.GD314@bx9.net> Message-ID: 2009/8/21 Greg Lindahl > On Fri, Aug 21, 2009 at 12:04:09PM -0700, Lux, Jim (337C) wrote: > > > We can guess, though.. It's an ARM running at 72 MHz, for which they > > claim 64 Dhrystone MIPS. Some of those are going to be burned in > > interprocess communication. The LPC2368 (which is what they use) > > datasheet says it draws 125mA at 3.3V running at 72MHz. (I'm pretty > > sure that doesn't include power to drive I/O pins, which is > > separately supplied). That's about half a watt. So they're getting > > about 128 DMIPS/Watt > > In comparison, a Sheevaplug is $100, 1.2 Ghz ARM, 15 watts max for the > whole thing, and you can use gigE networking and program it with MPI. > I suppose it depends on how weird you like your hobbies to be :-) They extended the fast array of wimpy nodes to a matrix of wimpy nodes! :-) -------------- next part -------------- An HTML attachment was scrubbed... URL: From dave.cunningham at lmco.com Fri Aug 21 10:54:18 2009 From: dave.cunningham at lmco.com (Cunningham, Dave) Date: Fri, 21 Aug 2009 11:54:18 -0600 Subject: [Beowulf] Help for terrible NFS write performance In-Reply-To: <4A8EDB89.7090409@scalableinformatics.com> References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> Message-ID: Are you suggesting that running nosync in an hpc environment might be a good idea ? We had that discussion with our integrator and were assured that running nosync was the path to damnation and would probably grow hair on our palms. What are your thoughts on the tradeoff ? Dave Cunningham -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Joe Landman Sent: Friday, August 21, 2009 10:38 AM To: Orion Poplawski Cc: Beowulf List Subject: Re: [Beowulf] Help for terrible NFS write performance Orion Poplawski wrote: > I'm trying to improve the terrible NFS (write in particular) performance > I'm seeing. Pure network performance does not appear to be an issue as > I can hit 120MB/s reading which should be about the limit for gigE. > Perhaps the local disk performance is not what it should be. Any help > would be greatly appreciated. Using bonnie++ for benchmarks. > > Server: > > Dual proc dual core opteron 2GHz > 8GB RAM > CentOS 4.7 > kernel 2.6.9-78.0.22.plus.c4smp > 3 8-port Marvell MV88SX6081 SATAII controllers > sata_mv 3.6.2 driver > Ethernet controller: nVidia Corporation MCP55 Ethernet (rev a3) > MTU 8982 > > Arrays are linux md arrays of 6 disks with 2 on each controller. 64k > cunks. ext3 filesystem. If I had to bet, ext3 would have much to do with this ... though, honestly, md RAID write performance over NFS is nothing to write home about. We can get ~350MB/s on our DeltaV's, but this takes lots of work. > > "working" - raid0 ST31000340AS 1TB drives > local perf: 224-240MB/s write, 135MB/s rewrite, 390-400MB/s read > "cora6" - raid5 ST31500341AS 1.5TB drives > local perf: 84MB/s write, 42MB/s rewrite, 161-166MB/s read > > /etc/exports: > /export *.cora.nwra.com(rw,sync,fsid=0) > /export/cora6 *.cora.nwra.com(rw,sync,nohide) > /export/working *.cora.nwra.com(rw,sync,nohide) Ok. There it is... Sync. Don't need to see anything else. That is it. 
-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at pbm.com Fri Aug 21 22:44:44 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 21 Aug 2009 22:44:44 -0700 Subject: [Beowulf] Help for terrible NFS write performance In-Reply-To: References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> Message-ID: <20090822054444.GA8950@bx9.net> > Are you suggesting that running nosync in an hpc environment might > be a good idea ? We had that discussion with our integrator and > were assured that running nosync was the path to damnation and would > probably grow hair on our palms. Ever since "nosync" was invented, almost all people have used it almost all of the time. It provides a huge benefit with a very modest chance of damnation and hair growth. "sync" still has a chance of damnation; anyone who tells you that it's perfect is lying. I would suggest getting a new integrator. Really. This is basic stuff. -- greg From bcostescu at gmail.com Sat Aug 22 18:17:08 2009 From: bcostescu at gmail.com (Bogdan Costescu) Date: Sun, 23 Aug 2009 03:17:08 +0200 Subject: [Beowulf] nearly future of Larrabee In-Reply-To: References: Message-ID: 2009/8/21 Mikhail Kuzminsky : > Recently I didn't see any words about Larrabee-based servers - only about > graphical cards. I have attended a talk by someone from Intel, but not someone working on Larrabee, so the information might not be current or accurate. He mentioned that the initial launch will only be with graphics cards. Intel sees Larrabee as an addition to the system and not as the main component of the system, so it's unlikely to have it as the main CPU; it's still possible to have it in some other form than a graphics card though. > Q1. Is there the plans to build Larrabee-based motherboards (in particular > in 2010) ? >From what I have seen, it's unclear what a hypothetical Larrabee motherboard should contain. The schematics that I've seen mentioned the ways the cores connect to the shared cache, but only mentioned a (shared) memory controller, no I/O controller. > If Larrabee will be in the form of graphical card (the most probable case) - > Q2. What will be the interface - one slot PCI-E v.2 x16 ? In the presentation that I've seen, it was not clearly specified, but implied, as this is the current way of interfacing with a graphics card. > Q3. Does it means that Larrabee will give essential speedup also on relative > short vectors ? I don't quite understand your question... > And is there some preliminary articles w/estimation of Larrabee DP > performance ? This question was asked, the answer was that there are no figures yet as the only official way to play with a Larrabee core now is through a simulator which makes performance figures irrelevant. > Q4. Is there some rumours about direct Larrabee support w/Intel ifort or PGI > compilers in 2010 ? The core is mainly a P5 with added vector instructions, so most of the code generation should be done already for several years by Intel, PGI and other compilers, even gcc. 
Only the new vector instructions need to be added and the compiler be taught to do vectorization using them; this might not be as easy as it sounds because the new instructions don't only deal with a larger set of bits but also with f.e. scatter/gather/masking array elements. One other detail that was mentioned was a difference with respect with current nVidia & AMD GPUs: these are good at doing the same thing to lots of data in parallel (SIMD), while the Larrabee cores will also be good at doing sequences of operations - workflows - where core 1 always does the same operation on data from memory, core 2 always does the same operation, different from core 1, to data coming from core 1, etc. I haven't kept up to date with compiler technology, so I don't know how fit are current compilers to detect these workflows and generate such code. > Q5. How much may costs Larrabee-based hardware in 2010 ? I hope it'll be > lower $10000. Any more exact predictions ? The launch as graphics cards suggests to me that they will compete in price with similar offerings from nVidia and AMD. Bogdan From trainor at divination.biz Sat Aug 22 02:37:05 2009 From: trainor at divination.biz (Douglas J. Trainor) Date: Sat, 22 Aug 2009 05:37:05 -0400 Subject: [Beowulf] First Workshop on High-Performance Computing in India Message-ID: <418F3128-31A8-40B2-807A-346B2B857759@divination.biz> From: asriniva at cs.fsu.edu Subject: [SIAM-SC] Re: CFP: Student Research Symposium HiPC 2009 CALL FOR PARTICIPATION First Workshop on High-Performance Computing in India Nov. 20, 2009 Portland, OR, USA (to be held in Conjuction with Supercomputing 2009) ATIP's First Workshop on HPC in India will be held on Nov. 20, 2009 at Portland, OR, USA in conjunction with the Supercompting 2009 Conference. The main goal of this workshop is to showcase Indian research on HPC at SC-2009. The workshop also serves several other purposes, including bringing together leading researchers in various disciplines who use HPC extensively, to discuss new developments and needs related to HPC. It would also enable networking with HPC researchers from all over the world which would lead to potential collaboration. Thus the overall objective is to stimulate discussions on the use of HPC, to define the grand challenge problems in these areas, and how one could derive benefits from knowing HPC work in related disciplines. The workshop would include a significant set of presentations and panels from a delegation of researchers from Indian Institutions, Research Laboratories, Industry and Government Agencies. Student posters from graduate students would also be presented at the workshop. An initial list of confirmed speakers include: Indian Government Plans and Programs * N. Balakrishnan (Indian Institute of Science, Bangalore) * Shailesh Nayak (Secretary, Min. of Earth Sciences, Govt. of India) Indian HPC Systems Research and Data Centres * Subrata Chattopadhyay(CDAC, Bangalore) * R. Govindarajan (Indian Institute of Science, Bangalore) * P.K. Sinha (CDAC, Pune) * Sathish Vadhiyar (Indian Institute of Science, Bangalore) Science and Engineering Applications in India * B. Jayaram (Indian Institute of Technology, New Delhi) * Amalendu Chandra (Indian Institute of Technology, Kanpur) * E.D. Jemmis (Indian Institute of Science Education Research, Thiruvananthapuram) * S. K. Mittal (Indian Institute of Technology, Kanpur) * Ravi S. Nanjundiah (Indian Institute of Science, Bangalore) * N. Balakrishnan (Indian Institute of Science, Bangalore) * S. 
Balasubramanian (JN Centre for Advanced Scientific Research, Bangalore) * Saraswathi Vishveshwara (Indian Institute of Science, Bangalore) HPC Vendors * Cray * IBM * Netweb Panel on opportunities for Indo-US collaborations * TBA The Workshop will be open to all SC09 participants. For more information, please visit: http://www.serc.iisc.ernet.in/hpit or http://atip.org/index.php?option=com_content&view=article&id=7069 Please reply to: schpcws at serc.iisc.ernet.in _______________________________________________ SIAM-SC mailing list To post messages to the list please send them to: SIAM-SC at siam.org http://lists.siam.org/mailman/listinfo/siam-sc From kilian.cavalotti.work at gmail.com Sun Aug 23 23:29:47 2009 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Mon, 24 Aug 2009 08:29:47 +0200 Subject: [Beowulf] Re: amd 3 and 6 core processors In-Reply-To: <778088694.1715931250807734460.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> References: <778088694.1715931250807734460.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: On Fri, Aug 21, 2009 at 12:35 AM, wrote: > I would also like to know how Intel and > AMD are disabling/degrading the cores. They very likely have built > in circuits that they can "burn out" to ensure physical incapacity. Still, > perhaps it is done another way. At least for AMD's Phenom II X3, re-enabling a disabled core is a simple matter of changing a BIOS setting. See http://www.tomshardware.com/news/amd-phenom-cpu,7080.html Cheers, -- Kilian From eagles051387 at gmail.com Mon Aug 24 00:41:12 2009 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Mon, 24 Aug 2009 09:41:12 +0200 Subject: [Beowulf] Re: amd 3 and 6 core processors In-Reply-To: References: <778088694.1715931250807734460.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: Kilian, but how would you know that the 4th core isn't faulty? -------------- next part -------------- An HTML attachment was scrubbed... URL: From kilian.cavalotti.work at gmail.com Mon Aug 24 03:10:34 2009 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Mon, 24 Aug 2009 12:10:34 +0200 Subject: [Beowulf] Re: amd 3 and 6 core processors In-Reply-To: References: <778088694.1715931250807734460.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: On Mon, Aug 24, 2009 at 9:41 AM, Jonathan Aquilina wrote: > Kilian, but how would you know that the 4th core isn't faulty? Ah, but you don't! That's the catch. No such thing as a free lunch, remember? :) You may get lucky and find out that your disabled core has been crippled only to supply cheaper 3-core demand, as David Mathog suggested. Or you may even get a slightly disabled core, which won't cause any trouble in your email writing or Crysis gaming. But chances are, if your needs are more HPC-centric, that you won't really be able to fully take advantage of that fourth core after all. Maybe it's worth trying, maybe it's not; it depends on what you do.
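One cheap way to vet a re-enabled core before trusting it with real work, sketched below (the core number is arbitrary and the 'stress' utility is assumed to be installed; this catches crashes and machine checks, not silent arithmetic errors):

# pin a burn-in load to the suspect core, here core 3, for an hour
taskset -c 3 stress --cpu 1 --timeout 3600
# any machine-check events it provokes land in the kernel log
dmesg | grep -i -e 'machine check' -e mce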
Cheers, -- Kilian From gus at ldeo.columbia.edu Mon Aug 24 08:06:17 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 24 Aug 2009 11:06:17 -0400 Subject: hdparm error: Was Re: [Beowulf] Help for terrible NFS write performance In-Reply-To: References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> <4A8EDFD5.9080909@cora.nwra.com> <4A8EE41B.7000605@scalableinformatics.com> <4A8F0D64.7050702@ias.edu> <4A8F15C2.7040009@ldeo.columbia.edu> Message-ID: <4A92AC69.1080906@ldeo.columbia.edu> Mark Hahn wrote: >> Somehow it works on Fedora 10, but not on CentOS 4 and 5. > > what version does hdparm -v mention on each system? Hi Mark, list Better late than never: 8.6 on Fedora 10, 6.6 on CentOS 5, 5.7 on CentOS 4. I'd guess 6.6 and earlier are too old, even the man pages are different. Gus Correa From gus at ldeo.columbia.edu Mon Aug 24 08:10:46 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 24 Aug 2009 11:10:46 -0400 Subject: hdparm error: Was Re: [Beowulf] Help for terrible NFS write performance In-Reply-To: References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> <4A8EDFD5.9080909@cora.nwra.com> <4A8EE41B.7000605@scalableinformatics.com> <4A8F0D64.7050702@ias.edu> <4A8F15C2.7040009@ldeo.columbia.edu> Message-ID: <4A92AD76.3030403@ldeo.columbia.edu> Mark Hahn wrote: >> Somehow it works on Fedora 10, but not on CentOS 4 and 5. > > what version does hdparm -v mention on each system? Hi Mark, list Not sure you want version (-V, sent on previous email) or defaults (-v). hdparm -v doesn't work on either system, returns usage/help, not the defaults, which it was supposed to do. Gus Correa From kus at free.net Mon Aug 24 09:21:40 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Mon, 24 Aug 2009 20:21:40 +0400 Subject: [Beowulf] nearly future of Larrabee In-Reply-To: Message-ID: In message from Bogdan Costescu (Sun, 23 Aug 2009 03:17:08 +0200): >2009/8/21 Mikhail Kuzminsky : >> Q3. Does it means that Larrabee will give essential speedup also on >>relative >> short vectors ? > >I don't quite understand your question... > For example, will DAXPY give essential speedup (percent of peak performance) for N=10 or 100 for example (for matrix and vector), and will DGEMM give high performance for medium sizes of matrices - or will we need large N values - for example, 1000, 10000 etc ? As for gather/scatter etc. for vector processing, the compilers for Cray T90/C90 ... Cray 1, NEC SX-6/5/4... perform, I believe, all the necessary things. Mikhail Mikhail From henning.fehrmann at aei.mpg.de Mon Aug 24 01:58:56 2009 From: henning.fehrmann at aei.mpg.de (Henning Fehrmann) Date: Mon, 24 Aug 2009 10:58:56 +0200 Subject: [Beowulf] HD undetectable errors In-Reply-To: <200908211833.21485.cap@nsc.liu.se> References: <20090821132532.GA16945@gretchen.aei.mpg.de> <200908211833.21485.cap@nsc.liu.se> Message-ID: <20090824085856.GA20141@gretchen.aei.mpg.de> Hello Peter, Thank you for the answer. > > Yes. But most won't, and it will hurt quite a lot performance-wise. I know, for > example, that our IBM DS4700 with updated firmware can > enable "verify-on-read". Is it easily possible to switch this feature on and off? E.g., one does a test from time to time, but for everyday usage one avoids "verify-on-read"? > Remember that you have to calculate this against the amount of corrupt data, > not the total amount of data. My hope is that corrupted data will be repaired by the RAID controller.
The chance to loose the data on a RAID6 system is of the third order should be very small. My guess it that the silent corruption rate is higher if the RAID system does not have the "verify-on-read" feature. I try to get the numbers. I assume in this calculation that no bit flips occur on the buses or the controller which is already sort of naive. Thank you. Cheers, Henning From mmuratet at hudsonalpha.org Mon Aug 24 02:40:22 2009 From: mmuratet at hudsonalpha.org (Michael Muratet) Date: Mon, 24 Aug 2009 04:40:22 -0500 Subject: [Beowulf] Configuring nodes on a scyld cluster Message-ID: <93AC2CF8-3096-487E-BC08-FBC644C5C62C@hudsonalpha.org> Greetings I'm not sure if this is more appropriate for the beowulf or ganglia list, please forgive a cross-post. I have been trying to get ganglia (v 3.0.7) to record info from the nodes of my scyld cluster. gmond was not installed on any of the compute nodes nor was gmond.conf in /etc of any of the compute nodes when we got it from the vendor. I didn't see much in the documentation about configuring nodes but I did find a 'howto' at http://www.krazyworks.com/installing-and-configuring- ganglia/. I have been testing on one of the nodes as follows. I copied gmond from /usr/sbin on the head node to the subject compute node /usr/ sbin. I ran gmond --default_config and saved the output and changed it thus: scyld:etc root$ bpsh 5 cat /etc/gmond.conf /* This configuration is as close to 2.5.x default behavior as possible The values closely match ./gmond/metric.h definitions in 2.5.x */ globals { daemonize = yes setuid = yes user = nobody debug_level = 0 max_udp_msg_len = 1472 mute = no deaf = no host_dmax = 0 /*secs */ cleanup_threshold = 300 /*secs */ gexec = no } /* If a cluster attribute is specified, then all gmond hosts are wrapped inside * of a tag. If you do not specify a cluster tag, then all will * NOT be wrapped inside of a tag. */ cluster { name = "mendel" owner = "unspecified" latlong = "unspecified" url = "unspecified" } /* The host section describes attributes of the host, like the location */ host { location = "unspecified" } /* Feel free to specify as many udp_send_channels as you like. Gmond used to only support having a single channel */ udp_send_channel { port = 8649 host = 10.54.50.150 /* head node's IP */ } /* You can specify as many udp_recv_channels as you like as well. */ /* You can specify as many tcp_accept_channels as you like to share an xml description of the state of the cluster */ tcp_accept_channel { port = 8649 } I modified gmond on the head node thus: /* This configuration is as close to 2.5.x default behavior as possible The values closely match ./gmond/metric.h definitions in 2.5.x */ globals { daemonize = yes setuid = yes user = nobody debug_level = 0 max_udp_msg_len = 1472 mute = no deaf = no host_dmax = 0 /*secs */ cleanup_threshold = 300 /*secs */ gexec = no } /* If a cluster attribute is specified, then all gmond hosts are wrapped inside * of a tag. If you do not specify a cluster tag, then all will * NOT be wrapped inside of a tag. */ cluster { name = "mendel" owner = "unspecified" latlong = "unspecified" url = "unspecified" } /* The host section describes attributes of the host, like the location */ host { location = "unspecified" } /* Feel free to specify as many udp_send_channels as you like. Gmond used to only support having a single channel */ /* You can specify as many udp_recv_channels as you like as well. 
*/ udp_recv_channel { port = 8649 } /* You can specify as many tcp_accept_channels as you like to share an xml description of the state of the cluster */ tcp_accept_channel { port = 8649 } I started gmond on the compute node bpsh 5 gmond and restarted gmond and gmetad. I don't see my node running gmond. ps -elf | grep gmond on the compute node returns nothing. I tried to add gmond as a service on the compute node with the script at the krazy site but I get: scyld:~ root$ bpsh 5 chkconfig --add gmond service gmond does not support chkconfig and scyld:~ root$ bpsh 5 service gmond start /sbin/service: line 3: /etc/init.d/functions: No such file or directory I am at a loss over what to try next, it seems this should work. Any and all suggestions will be appreciated. Thanks Mike Michael Muratet, Ph.D. Senior Scientist HudsonAlpha Institute for Biotechnology mmuratet at hudsonalpha.org (256) 327-0473 (p) (256) 327-0966 (f) Room 4005 601 Genome Way Huntsville, Alabama 35806 From hawson at gmail.com Mon Aug 24 04:53:10 2009 From: hawson at gmail.com (Jesse Becker) Date: Mon, 24 Aug 2009 07:53:10 -0400 Subject: [Beowulf] Re: [Ganglia-general] Configuring nodes on a scyld cluster In-Reply-To: <93AC2CF8-3096-487E-BC08-FBC644C5C62C@hudsonalpha.org> References: <93AC2CF8-3096-487E-BC08-FBC644C5C62C@hudsonalpha.org> Message-ID: On Mon, Aug 24, 2009 at 05:40, Michael Muratet wrote: > Greetings > > I'm not sure if this is more appropriate for the beowulf or ganglia > list, please forgive a cross-post. I have been trying to get ganglia > (v 3.0.7) to record info from the nodes of my scyld cluster. gmond was If I recall, Scyld clusters (and the successor, ClusterWare), run a modified version of Ganglia, mostly for the data display, but not collection. For data collection, they run their own program called 'bproc', which does some of the same things as gmond. There was a short discussion about bproc/gmond in the Beowulf mailing list about a year or year and half ago. Also, the hacked-up version of ganglia that they ship is based off 2.5.7 I think, so there is a good reason to upgrade. However, it should work, but with some tweaking. > not installed on any of the compute nodes nor was gmond.conf in /etc > of any of the compute nodes when we got it from the vendor. I didn't > see much in the documentation about configuring nodes but I did find a > 'howto' at http://www.krazyworks.com/installing-and-configuring- > ganglia/. I have been testing on one of the nodes as follows. I copied > gmond from /usr/sbin on the head node to the subject compute node /usr/ > sbin. I ran gmond --default_config and saved the output and changed it > thus: > > scyld:etc root$ bpsh 5 cat /etc/gmond.conf > /* This configuration is as close to 2.5.x default behavior as possible > ? ?The values closely match ./gmond/metric.h definitions in 2.5.x */ > globals { > ? daemonize = yes > ? setuid = yes > ? user = nobody > ? debug_level = 0 > ? max_udp_msg_len = 1472 > ? mute = no > ? deaf = no > ? host_dmax = 0 /*secs */ > ? cleanup_threshold = 300 /*secs */ > ? gexec = no > } > > /* If a cluster attribute is specified, then all gmond hosts are > wrapped inside > ?* of a tag. ?If you do not specify a cluster tag, then all > will > ?* NOT be wrapped inside of a tag. */ > cluster { > ? name = "mendel" > ? owner = "unspecified" > ? latlong = "unspecified" > ? url = "unspecified" > } > > /* The host section describes attributes of the host, like the > location */ > host { > ? 
location = "unspecified" > } > > /* Feel free to specify as many udp_send_channels as you like. ?Gmond > ? ?used to only support having a single channel */ > udp_send_channel { > ? port = 8649 > ? host = 10.54.50.150 /* head node's IP */ > } > > /* You can specify as many udp_recv_channels as you like as well. */ > > /* You can specify as many tcp_accept_channels as you like to share > ? ?an xml description of the state of the cluster */ > tcp_accept_channel { > ? port = 8649 > } > > I modified gmond on the head node thus: > > /* This configuration is as close to 2.5.x default behavior as possible > ? ?The values closely match ./gmond/metric.h definitions in 2.5.x */ > globals { > ? daemonize = yes > ? setuid = yes > ? user = nobody > ? debug_level = 0 > ? max_udp_msg_len = 1472 > ? mute = no > ? deaf = no > ? host_dmax = 0 /*secs */ > ? cleanup_threshold = 300 /*secs */ > ? gexec = no > } > > /* If a cluster attribute is specified, then all gmond hosts are > wrapped inside > ?* of a tag. ?If you do not specify a cluster tag, then all > will > ?* NOT be wrapped inside of a tag. */ > cluster { > ? name = "mendel" > ? owner = "unspecified" > ? latlong = "unspecified" > ? url = "unspecified" > } > > /* The host section describes attributes of the host, like the > location */ > host { > ? location = "unspecified" > } > > /* Feel free to specify as many udp_send_channels as you like. ?Gmond > ? ?used to only support having a single channel */ > > /* You can specify as many udp_recv_channels as you like as well. */ > udp_recv_channel { > ? port = 8649 > } > > /* You can specify as many tcp_accept_channels as you like to share > ? ?an xml description of the state of the cluster */ > tcp_accept_channel { > ? port = 8649 > } > > I started gmond on the compute node bpsh 5 gmond and restarted gmond > and gmetad. I don't see my node running gmond. ps -elf | grep gmond on > the compute node returns nothing. I tried to add gmond as a service on > the compute node with the script at the krazy site ?but I get: > > scyld:~ root$ bpsh 5 chkconfig --add gmond > service gmond does not support chkconfig Looks like the startup script for gmond doesn't natively support chkconfig. This isn't a huge problem. You will, however, have to manually create symlinks in /etc/rc3.d that point into /etc/init.d. Basically, you want two links that look something like this: /etc/rc3.d/S99gmond -> /etc/init.d/gmond /etc/rc3.d/K01gmond -> /etc/init.d/gmond I'd do this on the head node *only*. Scyld clusters are a bit funny if you have never used them before. There's almost *nothing* on the compute nodes except local data partitions, and they don't run a 'normal' userspace either. > and > > scyld:~ root$ bpsh 5 service gmond start > /sbin/service: line 3: /etc/init.d/functions: No such file or directory Unsurprising... Run 'bpsh 5 ls -l /etc' and you will see why this error occurs: there's probably almost nothing in /etc at all. > I am at a loss over what to try next, it seems this should work. Any > and all suggestions will be appreciated. Try running gmond directly, with debugging turned on (as a test): bpsh 5 /usr/sbin/gmond -c /etc/gmond.conf -d 2 and see what it complains about. 
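If gmond does stay up after that, a quick end-to-end check is to pull the XML its tcp_accept_channel serves, e.g. from the head node (NODE_IP is a placeholder for the compute node's address, and nc is assumed to be available; port 8649 is the one used in the config above):

# gmond answers on its tcp_accept_channel with an XML dump of its metrics
nc NODE_IP 8649 | head -20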
-- Jesse Becker GPG Fingerprint -- BD00 7AA4 4483 AFCC 82D0 2720 0083 0931 9A2B 06A2 From rezamirani at yahoo.com Fri Aug 21 23:38:27 2009 From: rezamirani at yahoo.com (Reza Mirani) Date: Fri, 21 Aug 2009 23:38:27 -0700 (PDT) Subject: [Beowulf] realtime network Message-ID: <525965.78444.qm@web56605.mail.re3.yahoo.com> Dear sir, I want to know more about Realtime networks and architectures. Did have any one who has some information about it ? ************** Reza Mirani HPC Technology Co.Ltd United Arab emirates Dubai Fax : +1- 413-473-1716 ************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From Craig.Tierney at noaa.gov Tue Aug 25 10:09:16 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Tue, 25 Aug 2009 11:09:16 -0600 Subject: [Beowulf] Anybody going with AMD Istanbul 6-core CPU's? In-Reply-To: References: Message-ID: <4A941ABC.6070604@noaa.gov> Steve Cousins wrote: > > I haven't seen anybody here talking about the 6-core AMD CPU's yet. Is > anybody trying these out? Anybody have real-world comparisons (say WRF) > of scalability of a 12-core system vs. a 16 thread Nehalem system? > > Thanks, > > Steve > We looked at them and the processor may have been available when we needed to take delivery. However, the platform that would take advantage of the chip, Fiorano, isn't out yet (or just out). When that gets released, then there will be something worth comparing. As far as using threading, I doubt that threading is going to buy you much for WRF. Minimal testing showed no benefit and it is more likely to cause confusion to the users than a small bump in speed. We would like to test it more in the future, but right now the users need cycles. Craig > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Craig Tierney (craig.tierney at noaa.gov) From peter.st.john at gmail.com Tue Aug 25 11:00:04 2009 From: peter.st.john at gmail.com (Peter St. John) Date: Tue, 25 Aug 2009 14:00:04 -0400 Subject: [Beowulf] realtime network In-Reply-To: <525965.78444.qm@web56605.mail.re3.yahoo.com> References: <525965.78444.qm@web56605.mail.re3.yahoo.com> Message-ID: Reza, I don't know what all this list would have to say about real-time; may be interesting. Almost 20 years ago I used OS 9 (Microware, not to be confused with Mac OS 9 predecessor of OS X) on motorola VME bus, which was purpose-built for real-time; see http://en.wikipedia.org/wiki/OS9 I think that's still alive and you can get a port to intel processors. More recently, I've used VMS (see http://en.wikipedia.org/wiki/OpenVMS) on DEC Alpha, was popular for industrial transaction processing. It used to be that unix wasn't considered reliable for real-time but I'm sure that's changed, there must be a flavor out there somewhere. Maybe look at AIX, but see below. I see Wiki has lists of extant real-time OSes: http://en.wikipedia.org/wiki/Real-time_operating_system#Examples Peter On Sat, Aug 22, 2009 at 2:38 AM, Reza Mirani wrote: > Dear sir, > > I want to know more about Realtime networks and architectures. Did have any > one who has some information about it ? 
> > > ************** > Reza Mirani > HPC Technology Co.Ltd > United Arab emirates > Dubai > Fax : +1- 413-473-1716 > ************** > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From niftyompi at niftyegg.com Tue Aug 25 13:37:57 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Tue, 25 Aug 2009 13:37:57 -0700 Subject: [Beowulf] HD undetectable errors In-Reply-To: <20090821132532.GA16945@gretchen.aei.mpg.de> References: <20090821132532.GA16945@gretchen.aei.mpg.de> Message-ID: <20090825203757.GA2917@compegg> Not an expert on this.... some thoughts below. On Fri, Aug 21, 2009 at 03:25:32PM +0200, Henning Fehrmann wrote: > Hello, > > a typical rate for data not recovered in a read operation on a HD is > 1 per 10^15 bit reads. > > If one fills a 100TByte file server the probability of loosing data > is of the order of 1. > Off course, one could circumvent this problem by using RAID5 or RAID6. > Most of the controller do not check the parity if they read data and > here the trouble begins. > I can't recall the rate for undetectable errors but this might be few > orders of magnitude smaller than 1 per 10^15 bit reads. However, given > the fact that one deals nowadays with few hundred TBytes of data this > might happen from time to time without being realized. > > One could lower the rate by forcing the RAID controller to check the > parity information in a read process. Are there RAID controller which > are able to perform this? > Another solution might be the useage of file systems which have additional > checksums for the blocks like zfs or qfs. This even prevents data > corruption due to undetected bit flips on the bus or the RAID > controller. > Does somebody know the size of the checksum and the rate of undetected > errors for qfs? > For zfs it is 256 bit per 512Byte data. > One option is the fletcher2 algorithm to compute the checksum. > Does somebody know the rate of undetectable bit flips for such a > setting? > > Are there any other file systems doing block-wise checksumming? I do not think you have the statistics correct but the issue is very real. There are many archival and site policies that add their own check sum and error recovery codes to their archives because of the value or sensitivity of the data. All disks I know of have a CRC/ECC code on the media that is checked at read time by hardware, Seagate says one 512 byte sector in 10^16 reads error rate. The RAID however cannot recheck its parity without re-reading all the spindles and recomputing+check of the parity, which is slow, but it could. However, adding the extra read does not solve the issue at two levels * Most RAID devices are designed to react to the disk's reported error the 10^16 number is a value for undetected and unreported errors thus the a RAID will not have it's redundancy mechanism triggered. * Most RAID designs would not be able to recover from an all spindle read and parity recompute+check that detected an error. i.e. the redundancy in common RAIDs cannot discover which of the devices presented bogus data. And it is unknowable if the error is a single bit or many bits. In the simple mirror case when the data does not match -- which is correct, A or B? In most more complex RAID designs the same problem exists. 
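Linux md makes this point concrete: you can ask an array to re-read and compare all members, but the result is only a count of mismatches, with no indication of which member held the bad data (the array name is a placeholder):

# start a scrub of /dev/md0
echo check > /sys/block/md0/md/sync_action
# after it completes, a non-zero count means copies/parity disagreed somewhere
cat /sys/block/md0/md/mismatch_cnt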
In a triple redundant mirror case a majority could rule. At single disk read speeds of 15MB/s one sector in 10^16 reads one error in +100year? With a failure in time on the order of 100 years other issues would seem (to me) to dominate the reliability of a storage system. But statistics do generate unexpected results. I do know of at least one site that has detected a single bit data storage error in a multiple TB RAID that went undetected by hardware and the OS. Compressed data makes this problem even more interesting because many of the stream tools (encryption or compression) fail "badly" and depending on where the bits flip a little or a LOT of data can be lost. More to the point are the number of times the dice are rolled with data. Network link, PCIe, Processor data paths, memory data paths, disk controller data paths, device links, read data paths, write data paths.... Disks are the strong link in this data chain in way too many cases. This question from above is interesting. + Does somebody know the size of the checksum and the rate of undetected + errors for qfs? The error rate is not a simple function of qfs it is most likely a function of the underlying error rate in the hardware involved in qfs. Since QFS can extend its reach from disk to tape, to/from disk cache, to optical to other... each media needs to be understood as well as the statistics associated with all the hidden transfers. With basic error rate info for all the hardware that touches the data some swag on the file system error rate and undetected error rates might begin. I think the Seagate 10^16 number is simply the hash statistics for their ReedSolomon ECC/CRC length and 2^512 permutations of data not the error rate. i.e. the quality of the code not the error rate of the device. However, It does make sense to me to generate and maintain site specific meta data for all valuable data files to include both detection (yes tamper detection too) and recovery codes. I would extend this to all data with the hope that any problems might be seen first on inconsequential files. Tripwire might be a good model for starting out on this. I should note that the three 'big' error rate problems I have worked on in the past 25 years had their root cause in an issue not understood or considered at design time so empirical data from the customer was critical. Data sheets and design document conclusions just missed the issue. These experiences taught me to be cautious with storage statistics. Looming in the dark clouds is a need for owning your own data integrity. It seems obvious to me in the growing business of cloud computing and cloud storage that you need to "trust but verify" the integrity of your data. My thought on this is that external integrity methods are critical in the future. And do remember that "parity is for farmers." -- T o m M i t c h e l l Found me a new hat, now what? a From reuti at staff.uni-marburg.de Tue Aug 25 15:18:17 2009 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed, 26 Aug 2009 00:18:17 +0200 Subject: [Beowulf] HD undetectable errors In-Reply-To: <20090825203757.GA2917@compegg> References: <20090821132532.GA16945@gretchen.aei.mpg.de> <20090825203757.GA2917@compegg> Message-ID: Am 25.08.2009 um 22:37 schrieb Nifty Tom Mitchell: > > However, It does make sense to me to generate and maintain site > specific meta > data for all valuable data files to include both detection (yes tamper > detection too) and recovery codes. 
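The detection half of that can be bolted on today with ordinary tools, e.g. a checksum manifest kept next to each data tree (the paths below are placeholders):

# record checksums once, alongside the data
find /data/archive -type f -print0 | xargs -0 sha256sum > /data/archive.sha256
# re-verify later; silently flipped bits show up as FAILED lines
sha256sum -c /data/archive.sha256 | grep -v ': OK$'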
Sounds like using local par or par2 files along with their hash information about the original files. Maybe this could be implemented as FUSE filesystem for easy handling (which will automatically split the files, create the hashes and any number of par files you like). -- Reuti > I would extend this to all data with > the hope that any problems might be seen first on inconsequential > files. > Tripwire might be a good model for starting out on this. > > I should note that the three 'big' error rate problems I have > worked on > in the past 25 years had their root cause in an issue not understood > or considered at design time so empirical data from the customer was > critical. Data sheets and design document conclusions just missed > the issue. > These experiences taught me to be cautious with storage statistics. > > Looming in the dark clouds is a need for owning your own data > integrity. > It seems obvious to me in the growing business of cloud computing > and cloud storage > that you need to "trust but verify" the integrity of your data. > My thought > on this is that external integrity methods are critical in the future. > > And do remember that "parity is for farmers." > > > > -- > T o m M i t c h e l l > Found me a new hat, now what? > a > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From niftyompi at niftyegg.com Tue Aug 25 23:44:24 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Tue, 25 Aug 2009 23:44:24 -0700 Subject: [Beowulf] HD undetectable errors In-Reply-To: References: <20090821132532.GA16945@gretchen.aei.mpg.de> <20090825203757.GA2917@compegg> Message-ID: <20090826064424.GA2942@compegg> On Wed, Aug 26, 2009 at 12:18:17AM +0200, Reuti wrote: > Am 25.08.2009 um 22:37 schrieb Nifty Tom Mitchell: > >> >> However, It does make sense to me to generate and maintain site >> specific meta >> data for all valuable data files to include both detection (yes tamper >> detection too) and recovery codes. > > Sounds like using local par or par2 files along with their hash > information about the original files. Maybe this could be implemented as > FUSE filesystem for easy handling (which will automatically split the > files, create the hashes and any number of par files you like). > > -- Reuti Interesting... par2 looks close. It may solve my worries with Cloud storage. The differences between par and par2 are interesting. The various issues involving damaged files (even a single bit error) in the initial design of par were a limitation. http://www.par2.net/pardif.php I can see that there has been a lot of work done already making it is a good place to start. And yes slipping this under a FUSE filesystem might hide the pain for local storage. >> I would extend this to all data with >> the hope that any problems might be seen first on inconsequential >> files. >> Tripwire might be a good model for starting out on this. >> >> I should note that the three 'big' error rate problems I have worked on >> in the past 25 years had their root cause in an issue not understood >> or considered at design time so empirical data from the customer was >> critical. Data sheets and design document conclusions just missed the >> issue. >> These experiences taught me to be cautious with storage statistics. >> >> Looming in the dark clouds is a need for owning your own data >> integrity. 
>> It seems obvious to me in the growing business of cloud computing and >> cloud storage >> that you need to "trust but verify" the integrity of your data. My >> thought >> on this is that external integrity methods are critical in the future. >> >> And do remember that "parity is for farmers." >> >> >> >> -- >> T o m M i t c h e l l >> Found me a new hat, now what? >> a >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >> Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > -- T o m M i t c h e l l Found me a new hat, now what? From h-bugge at online.no Wed Aug 26 00:21:56 2009 From: h-bugge at online.no (=?ISO-8859-1?Q?H=E5kon_Bugge?=) Date: Wed, 26 Aug 2009 09:21:56 +0200 Subject: [Beowulf] Anybody going with AMD Istanbul 6-core CPU's? In-Reply-To: <4A941ABC.6070604@noaa.gov> References: <4A941ABC.6070604@noaa.gov> Message-ID: Craig, On Aug 25, 2009, at 19:09 , Craig Tierney wrote: > As far as using threading, I doubt that threading is going to buy you > much for WRF. Minimal testing showed no benefit and it is more likely > to cause confusion to the users than a small bump in speed. We > would like > to test it more in the future, but right now the users need cycles. This is consistent with my findings in "An Evaluation of Intel's Core i7 Architecture using a Comparative Approach". WRF, as embodied into the SPEC MPI2007 suite, ran 2% slower using threading on a single-node, dual-core Nehalem system. Out of the 13 apps constituting the suite, three ran slower (2, 3, and 4%), five ran +10% faster, with 122.tachyon excelling at a 35% speedup from threading. Thanks, Håkon From niftyompi at niftyegg.com Thu Aug 27 07:12:33 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Thu, 27 Aug 2009 07:12:33 -0700 Subject: [Beowulf] Re: amd 3 and 6 core processors In-Reply-To: References: <778088694.1715931250807734460.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: <20090827141233.GA3272@tosh2egg.ca.sanfran.comcast.net> On Mon, Aug 24, 2009 at 08:29:47AM +0200, Kilian CAVALOTTI wrote: > > On Fri, Aug 21, 2009 at 12:35 AM, wrote: > > I would also like to know how Intel and > > AMD are disabling/degrading the cores. They very likely have built > > in circuits that they can "burn out" to ensure physical incapacity. Still, > > perhaps it is done another way. > > At least for AMD's Phenom II X3, re-enabling a disabled core is a > simple matter of changing a BIOS setting. See > http://www.tomshardware.com/news/amd-phenom-cpu,7080.html Has anyone tinkered with disabling one core at a time and benchmarking the remaining set of cores with various parallel tests? I suspect that there is some asymmetry that might color processor affinity if the locality of the disabled core can be exposed. Things like interrupt servicing, cache line interactions, TLB state and IO channel latency come to mind. I guess AMD could just tell us.... -- T o m M i t c h e l l Found me a new hat, now what? From mdidomenico4 at gmail.com Thu Aug 27 13:27:02 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Thu, 27 Aug 2009 16:27:02 -0400 Subject: [Beowulf] lustre 'lctl dl' weirdness? Message-ID: I posted this to the lustre-discuss mailing list, but I have not heard anything all day; just wondering if anyone here might have any ideas?
We had a problem in the datacenter this morning where a bunch of servers went down hard, this included my lustre filesystem and just about every other machine in the building When i try to bring the MDS/MGS back online, it does mount, but an 'lctl dl' shows that everything is there and UP, but i know its not, because i have not mounted the OSS's. Is this some junk left over? Can/Should it be cleared out? If i go through and mount all the OSS's they mount and i can mount the filesystem on the client, but no ls or df of the mountpoint works I've tried various methods of recovery that i know of, but i can't seem to get the MDS/MGS to come up in what appears to be a clean state is there some magic command or file that needs to be deleted to abort everything and restart? Thanks From akerstens at penguincomputing.com Tue Aug 25 11:40:30 2009 From: akerstens at penguincomputing.com (Andre Kerstens) Date: Tue, 25 Aug 2009 11:40:30 -0700 Subject: [Beowulf] Configuring nodes on a scyld cluster In-Reply-To: <200908251801.n7PI1Bln024816@bluewest.scyld.com> References: <200908251801.n7PI1Bln024816@bluewest.scyld.com> Message-ID: <17E468C34A1ACD4FB9DEBBA406BB69C8013AA13E@orca.penguincomputing.com> Michael, On a cluster running Scyld Clusterware (are you running 4 or 5?) there is no need to install any Ganglia components on the compute nodes: the compute nodes communicate cluster information incl. ganglia info to the head node via the beostatus sendstats mechanism. If ganglia is not enabled yet on your cluster, you can do it as follows: Edit /etc/xinetd.d/beostat and change 'disable=yes' to 'disable=no' followed by: /sbin/chkconfig xinetd on /sbin/chkconfig httpd on /sbin/chkconfig gmetad on and service xinetd restart service httpd start service gemetad start Then point your web browser to http://localhost/ganglia and off you go. This information can be found in the release notes document of your Scyld cluster or in the Scyld admin guide. Cheers Andre ------------------------------ Message: 2 Date: Mon, 24 Aug 2009 04:40:22 -0500 From: Michael Muratet Subject: [Beowulf] Configuring nodes on a scyld cluster To: ganglia-general at lists.sourceforge.net Cc: beowulf at beowulf.org Message-ID: <93AC2CF8-3096-487E-BC08-FBC644C5C62C at hudsonalpha.org> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes Greetings I'm not sure if this is more appropriate for the beowulf or ganglia list, please forgive a cross-post. I have been trying to get ganglia (v 3.0.7) to record info from the nodes of my scyld cluster. gmond was not installed on any of the compute nodes nor was gmond.conf in /etc of any of the compute nodes when we got it from the vendor. I didn't see much in the documentation about configuring nodes but I did find a 'howto' at http://www.krazyworks.com/installing-and-configuring- ganglia/. I have been testing on one of the nodes as follows. I copied gmond from /usr/sbin on the head node to the subject compute node /usr/ sbin. I ran gmond --default_config and saved the output and changed it thus: scyld:etc root$ bpsh 5 cat /etc/gmond.conf /* This configuration is as close to 2.5.x default behavior as possible The values closely match ./gmond/metric.h definitions in 2.5.x */ globals { daemonize = yes setuid = yes user = nobody debug_level = 0 max_udp_msg_len = 1472 mute = no deaf = no host_dmax = 0 /*secs */ cleanup_threshold = 300 /*secs */ gexec = no } /* If a cluster attribute is specified, then all gmond hosts are wrapped inside * of a tag. 
If you do not specify a cluster tag, then all will * NOT be wrapped inside of a tag. */ cluster { name = "mendel" owner = "unspecified" latlong = "unspecified" url = "unspecified" } /* The host section describes attributes of the host, like the location */ host { location = "unspecified" } /* Feel free to specify as many udp_send_channels as you like. Gmond used to only support having a single channel */ udp_send_channel { port = 8649 host = 10.54.50.150 /* head node's IP */ } /* You can specify as many udp_recv_channels as you like as well. */ /* You can specify as many tcp_accept_channels as you like to share an xml description of the state of the cluster */ tcp_accept_channel { port = 8649 } I modified gmond on the head node thus: /* This configuration is as close to 2.5.x default behavior as possible The values closely match ./gmond/metric.h definitions in 2.5.x */ globals { daemonize = yes setuid = yes user = nobody debug_level = 0 max_udp_msg_len = 1472 mute = no deaf = no host_dmax = 0 /*secs */ cleanup_threshold = 300 /*secs */ gexec = no } /* If a cluster attribute is specified, then all gmond hosts are wrapped inside * of a tag. If you do not specify a cluster tag, then all will * NOT be wrapped inside of a tag. */ cluster { name = "mendel" owner = "unspecified" latlong = "unspecified" url = "unspecified" } /* The host section describes attributes of the host, like the location */ host { location = "unspecified" } /* Feel free to specify as many udp_send_channels as you like. Gmond used to only support having a single channel */ /* You can specify as many udp_recv_channels as you like as well. */ udp_recv_channel { port = 8649 } /* You can specify as many tcp_accept_channels as you like to share an xml description of the state of the cluster */ tcp_accept_channel { port = 8649 } I started gmond on the compute node bpsh 5 gmond and restarted gmond and gmetad. I don't see my node running gmond. ps -elf | grep gmond on the compute node returns nothing. I tried to add gmond as a service on the compute node with the script at the krazy site but I get: scyld:~ root$ bpsh 5 chkconfig --add gmond service gmond does not support chkconfig and scyld:~ root$ bpsh 5 service gmond start /sbin/service: line 3: /etc/init.d/functions: No such file or directory I am at a loss over what to try next, it seems this should work. Any and all suggestions will be appreciated. Thanks Mike Michael Muratet, Ph.D. Senior Scientist HudsonAlpha Institute for Biotechnology mmuratet at hudsonalpha.org (256) 327-0473 (p) (256) 327-0966 (f) Room 4005 601 Genome Way Huntsville, Alabama 35806 ------------------------------ From jclinton at advancedclustering.com Tue Aug 25 12:49:34 2009 From: jclinton at advancedclustering.com (Jason Clinton) Date: Tue, 25 Aug 2009 14:49:34 -0500 Subject: [Beowulf] Anybody going with AMD Istanbul 6-core CPU's? In-Reply-To: References: Message-ID: <588c11220908251249o120b6b96q4af0512a153d3f7c@mail.gmail.com> On Thu, Aug 20, 2009 at 11:01 AM, Steve Cousins wrote: > I haven't seen anybody here talking about the 6-core AMD CPU's yet. Is > anybody trying these out? Anybody have real-world comparisons (say WRF) of > scalability of a 12-core system vs. a 16 thread Nehalem system? I ran a benchmark awhile ago and published it on our blog: http://www.advancedclustering.com/company-blog/molecular-dynamics-amd-vs-intel.html A co-worker, Shane, is working on a WRF benchmark. In short, it depends on the code, but it's a formidable competitor. -- Jason D. 
Clinton, Advanced Clustering Technologies 913-643-0306, http://twitter.com/HPCClusterTech From mmuratet at hudsonalpha.org Tue Aug 25 15:13:08 2009 From: mmuratet at hudsonalpha.org (Michael Muratet) Date: Tue, 25 Aug 2009 17:13:08 -0500 Subject: [Beowulf] Configuring nodes on a scyld cluster In-Reply-To: <17E468C34A1ACD4FB9DEBBA406BB69C8013AA13E@orca.penguincomputing.com> References: <200908251801.n7PI1Bln024816@bluewest.scyld.com> <17E468C34A1ACD4FB9DEBBA406BB69C8013AA13E@orca.penguincomputing.com> Message-ID: <64985711-A8CD-4643-B0DD-DFD844F84194@hudsonalpha.org> On Aug 25, 2009, at 1:40 PM, Andre Kerstens wrote: > Michael, > > On a cluster running Scyld Clusterware (are you running 4 or 5?) there > is no need to install any Ganglia components on the compute nodes: the > compute nodes communicate cluster information incl. ganglia info to > the > head node via the beostatus sendstats mechanism. If ganglia is not > enabled yet on your cluster, you can do it as follows: > > Edit /etc/xinetd.d/beostat and change 'disable=yes' to 'disable=no' > followed by: > > /sbin/chkconfig xinetd on > /sbin/chkconfig httpd on > /sbin/chkconfig gmetad on > > and > > service xinetd restart > service httpd start > service gemetad start > > Then point your web browser to http://localhost/ganglia and off you > go. Andre Thanks for the info. Yes, I got that far. It is apparently also necessary to reboot the head node, and we're waiting for a slack moment to do that. Cheers Mike > > > This information can be found in the release notes document of your > Scyld cluster or in the Scyld admin guide. > > Cheers > Andre > > ------------------------------ > Message: 2 > Date: Mon, 24 Aug 2009 04:40:22 -0500 > From: Michael Muratet > Subject: [Beowulf] Configuring nodes on a scyld cluster > To: ganglia-general at lists.sourceforge.net > Cc: beowulf at beowulf.org > Message-ID: <93AC2CF8-3096-487E-BC08-FBC644C5C62C at hudsonalpha.org> > Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes > > Greetings > > I'm not sure if this is more appropriate for the beowulf or ganglia > list, please forgive a cross-post. I have been trying to get ganglia > (v > 3.0.7) to record info from the nodes of my scyld cluster. gmond was > not > installed on any of the compute nodes nor was gmond.conf in /etc of > any > of the compute nodes when we got it from the vendor. I didn't see much > in the documentation about configuring nodes but I did find a > 'howto' at > http://www.krazyworks.com/installing-and-configuring- > ganglia/. I have been testing on one of the nodes as follows. I copied > gmond from /usr/sbin on the head node to the subject compute node / > usr/ > sbin. I ran gmond --default_config and saved the output and changed it > thus: > > scyld:etc root$ bpsh 5 cat /etc/gmond.conf > /* This configuration is as close to 2.5.x default behavior as > possible > The values closely match ./gmond/metric.h definitions in 2.5.x */ > globals { > daemonize = yes > setuid = yes > user = nobody > debug_level = 0 > max_udp_msg_len = 1472 > mute = no > deaf = no > host_dmax = 0 /*secs */ > cleanup_threshold = 300 /*secs */ > gexec = no > } > > /* If a cluster attribute is specified, then all gmond hosts are > wrapped > inside > * of a tag. If you do not specify a cluster tag, then all > will > * NOT be wrapped inside of a tag. 
*/ cluster { > name = "mendel" > owner = "unspecified" > latlong = "unspecified" > url = "unspecified" > } > > /* The host section describes attributes of the host, like the > location > */ host { > location = "unspecified" > } > > /* Feel free to specify as many udp_send_channels as you like. Gmond > used to only support having a single channel */ udp_send_channel { > port = 8649 > host = 10.54.50.150 /* head node's IP */ } > > /* You can specify as many udp_recv_channels as you like as well. */ > > /* You can specify as many tcp_accept_channels as you like to share > an xml description of the state of the cluster */ > tcp_accept_channel > { > port = 8649 > } > > I modified gmond on the head node thus: > > /* This configuration is as close to 2.5.x default behavior as > possible > The values closely match ./gmond/metric.h definitions in 2.5.x */ > globals { > daemonize = yes > setuid = yes > user = nobody > debug_level = 0 > max_udp_msg_len = 1472 > mute = no > deaf = no > host_dmax = 0 /*secs */ > cleanup_threshold = 300 /*secs */ > gexec = no > } > > /* If a cluster attribute is specified, then all gmond hosts are > wrapped > inside > * of a tag. If you do not specify a cluster tag, then all > will > * NOT be wrapped inside of a tag. */ cluster { > name = "mendel" > owner = "unspecified" > latlong = "unspecified" > url = "unspecified" > } > > /* The host section describes attributes of the host, like the > location > */ host { > location = "unspecified" > } > > /* Feel free to specify as many udp_send_channels as you like. Gmond > used to only support having a single channel */ > > /* You can specify as many udp_recv_channels as you like as well. */ > udp_recv_channel { > port = 8649 > } > > /* You can specify as many tcp_accept_channels as you like to share > an xml description of the state of the cluster */ > tcp_accept_channel > { > port = 8649 > } > > I started gmond on the compute node bpsh 5 gmond and restarted gmond > and > gmetad. I don't see my node running gmond. ps -elf | grep gmond on the > compute node returns nothing. I tried to add gmond as a service on the > compute node with the script at the krazy site but I get: > > scyld:~ root$ bpsh 5 chkconfig --add gmond service gmond does not > support chkconfig > > and > > scyld:~ root$ bpsh 5 service gmond start > /sbin/service: line 3: /etc/init.d/functions: No such file or > directory > > I am at a loss over what to try next, it seems this should work. Any > and > all suggestions will be appreciated. > > Thanks > > Mike > > Michael Muratet, Ph.D. > Senior Scientist > HudsonAlpha Institute for Biotechnology > mmuratet at hudsonalpha.org > (256) 327-0473 (p) > (256) 327-0966 (f) > > Room 4005 > 601 Genome Way > Huntsville, Alabama 35806 > > > > > > > > ------------------------------ > Michael Muratet, Ph.D. Senior Scientist HudsonAlpha Institute for Biotechnology mmuratet at hudsonalpha.org (256) 327-0473 (p) (256) 327-0966 (f) Room 4005 601 Genome Way Huntsville, Alabama 35806 From madskaddie at gmail.com Tue Aug 25 17:11:36 2009 From: madskaddie at gmail.com (madskaddie at gmail.com) Date: Wed, 26 Aug 2009 01:11:36 +0100 Subject: [Beowulf] Cluster install and admin approach (newbie question) Message-ID: Greetings, I relatively new to cluster environments and I was given a small (7nodes+1head) cluster to admin. So far I only had to maintain what was already installed so few problems to solve (and to think on). 
But new (different: AMD Opteron vs Intel Xeon) machines came and I have to expand the cluster (think and solve problems). The (old) cluster is semi-diskless (all machines do have disks, but they boot from a single image on a central server) with NFS for filesystem sharing. The main problems I had were: * if the /var filesystem is shared, race conditions happen (all nodes want to write to the same files). I had this problem and moved to a local /var filesystem. * if /var is local (which it may be, because the disks do exist), the whole point of a central admin location vanishes, because I would have to create all the /var structure that packages need to work on each node (it would be easier to do "for $node; ssh $install_cmd; done" than to guess which dirs I need to create or which files to copy). * if /var is tmpfs, all forensics are certainly gone after a failure (Murphy told me this one ;). Everything I read on the subject underlines the advantages of diskless approaches but fails to mention this problem and/or how to solve it. On the other side, the distributed-approach tools (where every node is autonomous) seem to be halted (e.g. SystemImager, which is used in the OSCAR project) or discontinued, or truly overblown for my scale (IBM's xCAT); so it really seems that I'm missing something. The question is: what do you do about this? Gil Brandao From jbickhard at gmail.com Wed Aug 26 06:57:09 2009 From: jbickhard at gmail.com (J Bickhard) Date: Wed, 26 Aug 2009 08:57:09 -0500 Subject: [Beowulf] Practicality of a Beowulf Cluster Message-ID: So, I was thinking of making a cluster, but wondered: what are the practical uses of one? I mean, you can't exactly run Windows on these things, and it looks like they're mostly for parallel computing of complex algorithms. Would an average Joe like me have a use for a cluster? From Glen.Beane at jax.org Thu Aug 27 14:45:43 2009 From: Glen.Beane at jax.org (Glen Beane) Date: Thu, 27 Aug 2009 17:45:43 -0400 Subject: [Beowulf] Practicality of a Beowulf Cluster In-Reply-To: References: Message-ID: What use is a screwdriver if you don't have any screws? Sent from my iPhone On Aug 27, 2009, at 5:39 PM, "J Bickhard" wrote: > So, I was thinking of making a cluster, but wondered: what are the > practical uses of one? I mean, you can't exactly run Windows on these > things, and it looks like they're mostly for parallel computing of > complex algorithms. > > Would an average Joe like me have a use for a cluster? > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From landman at scalableinformatics.com Thu Aug 27 14:57:18 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 27 Aug 2009 17:57:18 -0400 Subject: [Beowulf] Practicality of a Beowulf Cluster In-Reply-To: References: Message-ID: <4A97013E.2070700@scalableinformatics.com> J Bickhard wrote: > So, I was thinking of making a cluster, but wondered: what are the > practical uses of one? I mean, you can't exactly run Windows on these > things, and it looks like they're mostly for parallel computing of > complex algorithms. Technically you can run Windows on it, though this raises additional questions that are better answered elsewhere. > > Would an average Joe like me have a use for a cluster? That is the important question, but the answer is a function of what you need to do in a computational sense.
If you are cranking on excel spreadsheets all day long, yeah, chances are, a cluster doesn't make sense. If you are performing very detailed and time sensitive calculations that require hundreds of billions of operations to arrive at an answer, that is more likely the domain of something cluster-like. The real answer is "it depends" and in most cases, the "average Joe" (or :) ) probably doesn't need one. Going forward, many average Joe's workloads will likely be handled on local accelerators such as GPU or similar systems. Clusters provide massively increased processor cycle density per unit time. If this is what your application needs, then by all means, look into clusters. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From dag at sonsorol.org Thu Aug 27 14:58:56 2009 From: dag at sonsorol.org (Chris Dagdigian) Date: Thu, 27 Aug 2009 17:58:56 -0400 Subject: [Beowulf] Practicality of a Beowulf Cluster In-Reply-To: References: Message-ID: <659A31BD-8244-4061-AAC8-0EAA1538F951@sonsorol.org> In a nutshell: Science. Finance. Rendering & Digital content creation. ... if you have interests (work or personal) in any of these areas, you'll find a cluster useful. On Aug 26, 2009, at 9:57 AM, J Bickhard wrote: > So, I was thinking of making a cluster, but wondered: what are the > practical uses of one? I mean, you can't exactly run Windows on these > things, and it looks like they're mostly for parallel computing of > complex algorithms. > > Would an average Joe like me have a use for a cluster? From james.p.lux at jpl.nasa.gov Thu Aug 27 15:26:47 2009 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Thu, 27 Aug 2009 15:26:47 -0700 Subject: [Beowulf] Practicality of a Beowulf Cluster In-Reply-To: <659A31BD-8244-4061-AAC8-0EAA1538F951@sonsorol.org> References: <659A31BD-8244-4061-AAC8-0EAA1538F951@sonsorol.org> Message-ID: > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] > On Behalf Of Chris Dagdigian > Sent: Thursday, August 27, 2009 2:59 PM > To: J Bickhard; Beowulf List > Subject: Re: [Beowulf] Practicality of a Beowulf Cluster > > > In a nutshell: > > Science. > Finance. > Rendering & Digital content creation. > > ... if you have interests (work or personal) in any of these areas, > you'll find a cluster useful. > > Or, if you're interested in developing parallel algorithms, message passing, etc. Anything computationally intensive would be potential grist for a Beowulf. For instance, if you wanted to do video compression, or process lots of video frames to do feature extraction? Wasn't there a news story about somebody trying to tag all photos on the internet with the names of everyone in the photo? There's a computationally large but potentially parallelizable task. From rgb at phy.duke.edu Fri Aug 28 06:05:53 2009 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Fri, 28 Aug 2009 09:05:53 -0400 (EDT) Subject: [Beowulf] Practicality of a Beowulf Cluster In-Reply-To: References: Message-ID: On Wed, 26 Aug 2009, J Bickhard wrote: > So, I was thinking of making a cluster, but wondered: what are the > practical uses of one? I mean, you can't exactly run Windows on these > things, and it looks like they're mostly for parallel computing of > complex algorithms. > > Would an average Joe like me have a use for a cluster? Possibly none. Your statement above is pretty much dead on the money. Why would you need more compute power (which is what a cluster is designed to provide) if you don't need more compute power? Let is imagine that a cluster is anthropomorphic -- one of my favorite metaphors is that it is a room full of monks who are all your servants, ready to do any work for you that you can give them to do. Word processing, for example, is a monk taking dictation according to your (finger and mouse driven) input and transforming your wishes into a lovely illuminated manuscript page. When word processing, though, nearly all of the monks sit idle, because only one monk is needed to listen to you and do all of the work of arranging the letters on the page and filling in all that gold leaf, and THAT monk works much faster than you can type and spends most of his time twiddling his thumbs and picking his teeth. Sure, maybe you have other tasks for your monks that you want done at the same time -- one monk, for example is playing music gently on a violin, but he does have recording equipment and a sound room and it only takes him a second or two to play an hour's worth of music and then he, too, is idle once again. In fact, the first monk can take time out between keystrokes and keep the prerecorded music buffer full and still be making parchment airplanes and sailing them at the other monks while waiting for something to do. There are a TINY HANDFUL of tasks you do in normal quotidian computing -- maybe decoding video streams while handling a complex network and playing video games -- that actually use up a whole monk, or even a monk and a half. Nowadays, however, pretty much all CPUs are dual core, so you always have at least two monks anyway, and quads are increasingly common giving you four (of which two are nearly always idle but available in case you want your house painted and taxes done while you are playing a video game at the same time). But YOU have a hard time organizing more than two or three way multitasking, and interactively you just can't keep those damn monks busy. So what CAN keep a large cluster of monks/CPUs -- tens, hundreds, even thousands -- busy? Big tasks, in particular big tasks that can be split up so that all of the monks stay busy. Computations, metaphorically trying to create an entire illuminated manuscript all at once with every monk working on a single page in parallel, where the abbot hands out an assignment to each of the worker monks that will keep them busy all day, and then handles the results and collates them and arranges for the page monk 32 illuminated to be sent to monk 133 so that the figure he drew can be copied accurately and incorporated into the manuscript page that particular monk is working on and so on. So in order for a cluster to make sense, you need some sort of work that can be done in parallel, work that takes a long time (so that your one monk can't finish it satisfactorily quickly working alone), work that is important enough to justify the expense. Does that make sense to you? 
rgb > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From eagles051387 at gmail.com Fri Aug 28 07:17:23 2009 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Fri, 28 Aug 2009 16:17:23 +0200 Subject: [Beowulf] Cluster install and admin approach (newbie question) In-Reply-To: References: Message-ID: would creating a cron job for each of the nodes to where only one is workign on the files on the head node? On Wed, Aug 26, 2009 at 2:11 AM, wrote: > Greetings, > > > I relatively new to cluster environments and I was given a small > (7nodes+1head) cluster to admin. So far I only had to maintain what > was already installed so few problems to solve (and to think on). But > new (diferent: amd opteron vs intel xeon) machines came and I have to > expand the cluster (think and solve problems). The (old) cluster is > semi-diskless (all machines do have disks but they boot from a single > image on a central server) with nfs for filesystem sharing. The main > problems I had were: > * if the /var filesystem is shared, race conditions happen (all nodes > want to write on the same files). I had this problem and moved to a > local /var filesystem. > * if /var is local (which it may because the disks do exist), the > whole point of central point for easy admin vanishes, because I would > had to create all the /var structure that packages need to work, on > each node (would be easier to do: "for $node; ssh $install_cmd; done", > than guessing which dirs I need to create or files to copy). > * if /var is tmpfs all forensics are certainly gone after failure > (Murphy told me this one ;). > > Everything I read on the subject do underline the advantages of > diskless approaches but miss to alert to this problem and/or to solve > it. On the other side, the distributed approach tools (where every > node is autonomous) seem to be halted (as systemimager - which is used > in the Oscar project) or discontinued, or truly overblown for my > reference scale (IBM's xCat); so it really seems that I'm missing > something. > > The question is what you do about this ? > > Gil Brandao > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Jonathan Aquilina -------------- next part -------------- An HTML attachment was scrubbed... URL: From hahn at mcmaster.ca Fri Aug 28 07:57:26 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 28 Aug 2009 10:57:26 -0400 (EDT) Subject: [Beowulf] Practicality of a Beowulf Cluster In-Reply-To: References: Message-ID: > So, I was thinking of making a cluster, but wondered: what are the > practical uses of one? I don't think you really mean practical here. perhaps "commonplace"? > I mean, you can't exactly run Windows on these > things, but why do you equate practical with "runs windows"? a lot of the wider computer world doesn't run windows. windows is mainly a low-end desktop, low-end server ghetto. large, yes, but not the best (or most "practical") by any measure. 
> and it looks like they're mostly for parallel computing of > complex algorithms. do you mean parallel/complex makes beowulf of limited interest or niche? > Would an average Joe like me have a use for a cluster? do you use any websites? any search engines? beowulf and related technologies are all about scaling, so in a sense, anything big is a beowulf. some may complain that this stretches the term, that "grid" or "cloud" should be used instead, but it's all the same concept. a beowulf is a cluster, normally of x86 commodity parts, normally for compute-intensive research/engineering. a render farm is basically the same but oriented towards making movies (IO intensive, but more "embarassingly" parallel.) a grid means a cluster that's geographically distributed and including multiple administrative domains (hence not suitable for tightly-coupled parallelism.) a cloud is a grid-like facility usually implemented on top of VMs. google/amazon/yahoo/etc are all, in this sense, beowulf-like clusters. windows-based beowulf-like clusters are still beowulf-like - they're just knock-offs using a less appropriate OS. programs on clusters tend not to be all that dependent on the "underware" (vs middleware) of the platform. From joshua_mora at usa.net Thu Aug 27 16:53:29 2009 From: joshua_mora at usa.net (Joshua mora acosta) Date: Fri, 28 Aug 2009 01:53:29 +0200 Subject: [Beowulf] Practicality of a Beowulf Cluster Message-ID: <641NHAX1d8298S08.1251417209@cmsweb08.cms.usa.net> In my personal experience, I developed long time ago CFD software on a single machine with a single core. Once I started complicating my life with more complex problems (eg. from 2D to 3D), the time it was taking to solve the problems was growing exponentially ( from hours to several weeks). Therefore I required at some point more computational infrastructure despite all the sw and numerical things I would do to accelerate the sw. I required more, ie a bunch of computers, or a cluster to get to my performance or productivity goal (eg. reduce a 3 week simulation to a overnight run time). But it took me my time to _realize_ or to get more demanding based on my evolving computational needs. On other cases, you cannot simply fit the data you are crunching on a single node so that is a capacity problem you can solve by distributing the data among more compute nodes and having the aggregated capacity needed. The performance one, can also be seen as the # of arithmetic operations to solve your problem is growing so much that you need the aggregated computing power of multiple computers, again distributing the computation among more processors and nodes. Given this sort of introduction, my advice would be to grow your computational infrastructure along your needs over the time (eg. every 1 or 2 years, and much better if aligned with your trusted HW vendor provider), from a single node which nowadays looks like a cluster 5 years ago. And then if you need more, start adding computational infrastructure, which could be also more storage or more network gear, or more gpus which these days are also used for accelerating the computation of the multicore processors. Making the assumption of very little knowledge on your computational needs and usage it is nearly impossible to guess if a cluster will satisfy you and even harder to size it properly. 
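A toy sketch, not code from this thread, of the "measure a kernel on one node, then multiply up to the target" sizing estimate described in the next paragraph; the 80-hour single-node runtime, 12-hour overnight target and 0.7 parallel efficiency are made-up placeholder numbers, to be replaced by your own measurements:

/* sizing_sketch.c - toy node-count estimate; compile with: cc sizing_sketch.c -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double t_one_node = 80.0;  /* measured kernel runtime on a single node, in hours */
    double t_target   = 12.0;  /* desired "overnight" turnaround, in hours */
    double efficiency = 0.7;   /* assumed parallel efficiency at the target size */

    /* smallest N with t_one_node / (N * efficiency) <= t_target */
    double nodes = ceil(t_one_node / (t_target * efficiency));
    printf("estimated nodes required: %.0f\n", nodes);
    return 0;
}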
The people that use clusters typically have computational needs well understood for many years, so the sizing can be more or less estimated by running "kernels" (the meat) of their applications on a single node and then multiplying the performance achieved on that node by the number of nodes necessary to reach the total performance or productivity or capacity target. Having a cluster without using it for what its been designed/built is a whole waste of money, electric power and time and a bunch of unnecessary headaches on many directions. Finally on your comment on Windows, Microsoft has spent already since 2004 money and people in developing and bringing to the market a HPC solution as well. So yes, there is a windows solution for clusters with same features as you will see on Linux/Unix. You can also run decently on Windows on a single box a compute intensive application.... I hope it helps you clear out whether it makes sense or not for you to build a cluster. This group assumes you are already on it and you need perhaps the analysis/feedback/friendly advice on components or on the way an application stresses that specific component of the cluster (eg. processor, networking, storage, OS, settings), or sw tools for management/debugging of the HW+SW clustered solution, among many other things... Best regards, Joshua Mora. ------ Original Message ------ Received: 11:42 PM CEST, 08/27/2009 From: J Bickhard To: beowulf at beowulf.org Subject: [Beowulf] Practicality of a Beowulf Cluster > So, I was thinking of making a cluster, but wondered: what are the > practical uses of one? I mean, you can't exactly run Windows on these > things, and it looks like they're mostly for parallel computing of > complex algorithms. > > Would an average Joe like me have a use for a cluster? > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From madskaddie at gmail.com Fri Aug 28 03:37:44 2009 From: madskaddie at gmail.com (madskaddie at gmail.com) Date: Fri, 28 Aug 2009 11:37:44 +0100 Subject: [Beowulf] Cluster install and admin approach (newbie question) In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0CF1F9D5@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Fri, Aug 28, 2009 at 10:14 AM, Hearns, John wrote: > Greetings, > > > I relatively new to cluster environments and I was given a small > (7nodes+1head) cluster to admin. So far I only had to maintain what > was already installed so few problems to solve (and to think on). But > new (diferent: amd opteron vs intel xeon) machines came and I have to > expand the cluster (think and solve problems). > > And whose bright idea was this one? > I bet it wasn't yours. > I've seen this before - 'computers' are just assumed to be 'all the > same' by the management - > and the poor techie is the one who has to cope with new hardware for > which drivers don't exist > in the original Linux install, new kernel version are needed, cluster > management and monitoring > won't cope with these ndoes. > I have some real sympathy for you. > > That issue I see it by another point of view: finally I will learn something really new. 
Yes, I will loose time ?but I hope that in the end all players will win: me because I got money and know how and the cluster users because we doubled the capacity (not really because I don't believe that mixing the nodes will possible) so more people can run code. But the question remains unanswered ;) and is not tied to the "heterogeneous cluster" problem: ?If diskless what about "/var" like issues; if not, what do you use to install and manage it Gil Brandao From madskaddie at gmail.com Fri Aug 28 04:53:00 2009 From: madskaddie at gmail.com (madskaddie at gmail.com) Date: Fri, 28 Aug 2009 12:53:00 +0100 Subject: [Beowulf] Cluster install and admin approach (newbie question) In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0CF1FAED@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B0CF1F9D5@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0CF1FAED@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Fri, Aug 28, 2009 at 11:58 AM, Hearns, John wrote: > > Gil, I can't answer your questions because I don't know who supplied > your cluster - > there are many cluster management suites. > The cluster is a Debian based beowulf-like cluster (we have supplied our selves). > > You can set up syslog-ng to copy syslog entries to a central syslog > host, ie. the cluster head node. > You can also use the 'conserver' program to log serial console output to > files on the cluster head node, > this counts for both 'real' serial consoles and IPMI serial over LAN. > From hahn at mcmaster.ca Fri Aug 28 08:07:32 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 28 Aug 2009 11:07:32 -0400 (EDT) Subject: [Beowulf] Cluster install and admin approach (newbie question) In-Reply-To: References: Message-ID: > * if the /var filesystem is shared, race conditions happen (all nodes > want to write on the same files). I had this problem and moved to a > local /var filesystem. indeed, shared /var is simply a bug. non-shared NFS /var is viable, but generally pointless. > * if /var is local (which it may because the disks do exist), the > whole point of central point for easy admin vanishes, because I would eh? > had to create all the /var structure that packages need to work, on > each node (would be easier to do: "for $node; ssh $install_cmd; done", > than guessing which dirs I need to create or files to copy). but if your nodes are nfs-root, you won't be installing anything on them: you'll be installing on the nfs-root. > * if /var is tmpfs all forensics are certainly gone after failure > (Murphy told me this one ;). syslog is very happy to log over the network. > Everything I read on the subject do underline the advantages of > diskless approaches but miss to alert to this problem and/or to solve > it. On the other side, the distributed approach tools (where every > node is autonomous) seem to be halted (as systemimager - which is used > in the Oscar project) or discontinued, or truly overblown for my > reference scale (IBM's xCat); so it really seems that I'm missing there's also OneSIS. > something. > > The question is what you do about this ? setting up your own nfs-root cluster is a simple exercise. if you're not very familiar with *nix booting/daemons/init scripts, it will take a few tries to get the config right, but the end result is pretty simple and robust. remote syslog, preferably with console-over-net (ipmi sol, netconsole) means that there's nothing interesting on the local /var. 
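A minimal sketch of the remote-syslog arrangement suggested above, assuming classic sysklogd on the diskless nodes and rsyslog on a head node named loghost; the host name, log path and per-host template are placeholders, not details from this thread:

# on each compute node, /etc/syslog.conf: forward everything to the head
# node over UDP, so nothing of interest ever lands on the local /var
*.*     @loghost

# on the head node, /etc/rsyslog.conf: listen for UDP syslog and write
# one file per sending host
$ModLoad imudp
$UDPServerRun 514
$template PerHost,"/var/log/nodes/%HOSTNAME%.log"
*.*     ?PerHost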
From gus at ldeo.columbia.edu Fri Aug 28 15:48:06 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 28 Aug 2009 18:48:06 -0400 Subject: hdparm error: Was Re: [Beowulf] Help for terrible NFS write performance In-Reply-To: References: <4A8ED7D6.8080601@cora.nwra.com> <4A8EDB89.7090409@scalableinformatics.com> <4A8EDFD5.9080909@cora.nwra.com> <4A8EE41B.7000605@scalableinformatics.com> <4A8F0D64.7050702@ias.edu> <4A8F15C2.7040009@ldeo.columbia.edu> Message-ID: <4A985EA6.9060100@ldeo.columbia.edu> Mark Hahn wrote: >> Somehow it works on Fedora 10, but not on CentOS 4 and 5. > > what version does hdparm -v mention on each system? Hi Mark, list Thank you Mark. Sorry for my very late answer. Anyway, better late than never. The hdparm versions I have are: 8.6 on Fedora 10, 6.6 on CentOS 5, 5.7 on CentOS 4. I'd guess 6.6 and earlier are too old, even the man pages are different. Gus Correa From amjad11 at gmail.com Sat Aug 29 00:42:36 2009 From: amjad11 at gmail.com (amjad ali) Date: Sat, 29 Aug 2009 12:42:36 +0500 Subject: [Beowulf] GPU question Message-ID: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> Hello All, I perceive following computing setups for GP-GPUs, 1) ONE PC with ONE CPU and ONE GPU, 2) ONE PC with more than one CPUs and ONE GPU 3) ONE PC with one CPU and more than ONE GPUs 4) ONE PC with TWO CPUs (e.g. Xeon Nehalems) and more than ONE GPUs (e.g. Nvidia C1060) 5) Cluster of PCs with each node having ONE CPU and ONE GPU 6) Cluster of PCs with each node having more than one CPUs and ONE GPU 7) Cluster of PCs with each node having ONE CPU and more than ONE GPUs 8) Cluster of PCs with each node having more than one CPUs and more than ONE GPUs. Which of these are good/realistic/practical; which are not? Which are quite ?natural? to use for CUDA based programs? IMPORTANT QUESTION: Will a cuda based program will be equally good for some/all of these setups or we need to write different CUDA based programs for each of these setups to get good efficiency? Comments are welcome also for AMD/ATI FireStream. With best regards, AMJAD ALI. -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsten.aulbert at aei.mpg.de Sat Aug 29 01:53:13 2009 From: carsten.aulbert at aei.mpg.de (Carsten Aulbert) Date: Sat, 29 Aug 2009 10:53:13 +0200 Subject: [Beowulf] GPU question In-Reply-To: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> Message-ID: <200908291053.14044.carsten.aulbert@aei.mpg.de> Hi On Saturday 29 August 2009 09:42:36 amjad ali wrote: > I perceive following computing setups for GP-GPUs, > > 1) ONE PC with ONE CPU and ONE GPU, > > 2) ONE PC with more than one CPUs and ONE GPU > > 3) ONE PC with one CPU and more than ONE GPUs > > 4) ONE PC with TWO CPUs (e.g. Xeon Nehalems) and more than ONE GPUs > (e.g. Nvidia C1060) I think no one will be able to answer your question correctly. It will all boil down how your usage scenario will be, e.g. if your codes run only on the GPUs and you have hardly to do anything except providing the initial data to the GPUs I think you might even go to the extreme with a single box and 7 GPUs in it along with a quad CPU. On the other extreme you have code where only a fraction of the code (say 20%) can be put onto the GPU you will probably have to aim at a different CPU to GPU ratio. 
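Whichever of the single-box ratios above is chosen, the same binary can discover at run time how many GPUs it actually has. A minimal CUDA sketch in the spirit of the SDK's deviceQuery sample, illustrative only and not code from this thread:

/* gpu_probe.cu - enumerate visible CUDA devices; build with: nvcc gpu_probe.cu */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        fprintf(stderr, "no CUDA-capable device found\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        /* name and memory size help tell a display card from a compute card */
        printf("device %d: %s, %.0f MB of global memory\n",
               i, prop.name, prop.totalGlobalMem / (1024.0 * 1024.0));
    }
    return 0;
}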
> > 5) Cluster of PCs with each node having ONE CPU and ONE GPU > > 6) Cluster of PCs with each node having more than one CPUs and ONE GPU > > 7) Cluster of PCs with each node having ONE CPU and more than ONE GPUs > > 8) Cluster of PCs with each node having more than one CPUs and more > than ONE GPUs. > Same as above, it highly depends what will run on the cluster. > > > IMPORTANT QUESTION: Will a cuda based program will be equally good for > some/all of these setups or we need to write different CUDA based programs > for each of these setups to get good efficiency? Not much experience here, sorry Carsten From amjad11 at gmail.com Sat Aug 29 16:35:30 2009 From: amjad11 at gmail.com (amjad ali) Date: Sun, 30 Aug 2009 04:35:30 +0500 Subject: [Beowulf] GPU question In-Reply-To: References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> Message-ID: <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> Hello all, specially Gil Brandao Actually I want to start CUDA programming for my |C.I have 2 options to do: 1) Buy a new PC that will have 1 or 2 CPUs and 2 or 4 GPUs. 2) Add 1 GPUs to each of the Four nodes of my PC-Cluster. Which one is more "natural" and "practical" way? Does a program written for any one of the above will work fine on the other? or we have to re-program for the other? Regards. On Sat, Aug 29, 2009 at 5:48 PM, wrote: > On Sat, Aug 29, 2009 at 8:42 AM, amjad ali wrote: > > Hello All, > > > > > > > > I perceive following computing setups for GP-GPUs, > > > > > > > > 1) ONE PC with ONE CPU and ONE GPU, > > > > 2) ONE PC with more than one CPUs and ONE GPU > > > > 3) ONE PC with one CPU and more than ONE GPUs > > > > 4) ONE PC with TWO CPUs (e.g. Xeon Nehalems) and more than ONE GPUs > > (e.g. Nvidia C1060) > > > > 5) Cluster of PCs with each node having ONE CPU and ONE GPU > > > > 6) Cluster of PCs with each node having more than one CPUs and ONE > GPU > > > > 7) Cluster of PCs with each node having ONE CPU and more than ONE > GPUs > > > > 8) Cluster of PCs with each node having more than one CPUs and more > > than ONE GPUs. > > > > > > > > Which of these are good/realistic/practical; which are not? Which are > quite > > ?natural? to use for CUDA based programs? > > > > CUDA is kind of new technology, so I don't think there is a "natural > use" yet, though I read that there people doing CUDA+MPI and there are > papers on CPU+GPU algorithms. > > > > > IMPORTANT QUESTION: Will a cuda based program will be equally good for > > some/all of these setups or we need to write different CUDA based > programs > > for each of these setups to get good efficiency? > > > > There is no "one size fits all" answer to your question. If you never > developed with CUDA, buy one GPU an try it. If it fits your problems, > scale it with the approach that makes you more comfortable (but > remember that scaling means: making bigger problems or having more > users). If you want a rule of thumb: your code must be > _truly_parallel_. If you are buying for someone else, remember that > this is a niche. The hole thing is starting, I don't thing there isn't > many people that needs much more 1 or 2 GPUs. > > > > > Comments are welcome also for AMD/ATI FireStream. > > > > put it on hold until OpenCL takes of (in the real sense, not in > "standards papers" sense), otherwise you will have to learn another > technology that even fewer people knows. > > > Gil Brandao > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdidomenico4 at gmail.com Sat Aug 29 20:18:37 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Sat, 29 Aug 2009 23:18:37 -0400 Subject: [Beowulf] hpcmp benchmarks Message-ID: Does anyone know if the TI series benchmark packages were ever released in the open? If not, are there any out-of-core solver benchmarks out there? My Google searches didn't turn out as I had expected, so I'm probably searching for the wrong thing. From tjrc at sanger.ac.uk Sun Aug 30 04:11:31 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Sun, 30 Aug 2009 12:11:31 +0100 Subject: [Beowulf] Cluster install and admin approach (newbie question) In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0CF1F9D5@milexchmb1.mil.tagmclarengroup.com> Message-ID: On 28 Aug 2009, at 11:37 am, madskaddie at gmail.com wrote: > That issue I see from another point of view: finally I will learn > something really new. Yes, I will lose time but I hope that in the > end all players will win: me because I got money and know-how, and the > cluster users because we doubled the capacity (not really, because I > don't believe that mixing the nodes will be possible) so more people can > run code. > > But the question remains unanswered ;) and is not tied to the > "heterogeneous cluster" problem: if diskless, what about "/var"-like > issues; if not, what do you use to install and manage it? We also use Debian. We also use a heterogeneous cluster (since our workload is embarrassingly parallel and the individual jobs are mostly single threaded, this doesn't really matter). We use FAI for installation, since our nodes are not diskless, rsyslog for logging to a central log server, which is running Splunk. We use cfengine 2 for configuration management. We don't have diskless nodes, so the /var problem doesn't exist for us. Regards, Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From madskaddie at gmail.com Sun Aug 30 05:41:18 2009 From: madskaddie at gmail.com (madskaddie at gmail.com) Date: Sun, 30 Aug 2009 13:41:18 +0100 Subject: [Beowulf] GPU question In-Reply-To: <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> Message-ID: On Sun, Aug 30, 2009 at 12:35 AM, amjad ali wrote: > Hello all, specially Gil Brandao > > Actually I want to start CUDA programming for my |C. I have 2 options to do it: > 1) Buy a new PC that will have 1 or 2 CPUs and 2 or 4 GPUs. > 2) Add 1 GPU to each of the four nodes of my PC cluster. > > Which one is the more "natural" and "practical" way? If I had to choose, I would go for option 1. It's simpler and I wouldn't have to deal with MPI-like problems. What would be interesting is to study which of the following options is best: - QuadCore / 2 GPUs - DualCore / 4 GPUs - or an N:N relation For now I can not tell you, although I strongly suspect that this is related to the problem being solved. For now the CFD code I'm programming uses only 1:1, but to scale to bigger problems I'll soon be using 1:3 and, depending on the results, I have to check whether I need another CPU to deal with data-logging stuff. And be aware of the motherboards you choose: if your cluster doesn't have PCIe 2.0, buy a new PC with a mobo (and GPU card) that supports it.
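For option 2 (one or more GPUs in every node of the existing cluster), the usual arrangement is one MPI process per GPU, with each process picking a device on its own node. A minimal sketch, illustrative only and not code from this thread, assuming ranks are placed consecutively on each node:

/* mpi_gpu_map.c - map MPI ranks to local GPUs; build with mpicc plus the CUDA runtime */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank = 0, ngpus = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaGetDeviceCount(&ngpus);
    if (ngpus == 0) {
        fprintf(stderr, "rank %d: no GPU visible\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* crude mapping: with consecutive ranks per node, rank % ngpus spreads
       the local processes over the local cards */
    cudaSetDevice(rank % ngpus);

    /* ... allocate device buffers, launch kernels, exchange results via MPI ... */

    MPI_Finalize();
    return 0;
}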
Don't forget that if you put to much GPUs on the same bus and you do not take that in to account, you are likely to have a slow bottleneck. > Does a program written for any one of the above will work fine on the other? > or we have to re-program for the other? Apart from MPI-like programming, I can not see why would you need any other stuff. But it's not fully transparent: you have to explicitly choose what card will run what code. > Regards. > > On Sat, Aug 29, 2009 at 5:48 PM, wrote: >> >> On Sat, Aug 29, 2009 at 8:42 AM, amjad ali wrote: >> > Hello All, >> > >> > >> > >> > I perceive following computing setups for GP-GPUs, >> > >> > >> > >> > 1)????? ONE PC with ONE CPU and ONE GPU, >> > >> > 2)????? ONE PC with more than one CPUs and ONE GPU >> > >> > 3)????? ONE PC with one CPU and more than ONE GPUs >> > >> > 4)????? ONE PC with TWO CPUs (e.g. Xeon Nehalems) and more than ONE GPUs >> > (e.g. Nvidia C1060) >> > >> > 5)????? Cluster of PCs with each node having ONE CPU and ONE GPU >> > >> > 6)????? Cluster of PCs with each node having more than one CPUs and ONE >> > GPU >> > >> > 7)????? Cluster of PCs with each node having ONE CPU and more than ONE >> > GPUs >> > >> > 8)????? Cluster of PCs with each node having more than one CPUs and more >> > than ONE GPUs. >> > >> > >> > >> > Which of these are good/realistic/practical; which are not? Which are >> > quite >> > ?natural? to use for CUDA based programs? >> > >> >> CUDA is kind of new technology, so I don't think there is a "natural >> use" yet, though I read that there people doing CUDA+MPI and there are >> papers on CPU+GPU algorithms. >> >> > >> > IMPORTANT QUESTION: Will a cuda based program will be equally good for >> > some/all of these setups or we need to write different CUDA based >> > programs >> > for each of these setups to get good efficiency? >> > >> >> There is no "one size fits all" answer to your question. If you never >> developed with CUDA, buy one GPU an try it. If it fits your problems, >> scale it with the approach that makes you more comfortable (but >> remember that scaling means: making bigger problems or having more >> users). If you want a rule of thumb: your code must be >> _truly_parallel_. If you are buying for someone else, remember that >> this is a niche. The hole thing is starting, I don't thing there isn't >> many people that needs much more 1 or 2 GPUs. >> >> > >> > Comments are welcome also for AMD/ATI FireStream. >> > >> >> put it on hold until OpenCL takes of ?(in the real sense, not in >> "standards papers" sense), otherwise you will have to learn another >> technology that even fewer people knows. >> >> >> Gil Brandao > > From eagles051387 at gmail.com Sun Aug 30 23:40:49 2009 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Mon, 31 Aug 2009 08:40:49 +0200 Subject: [Beowulf] GPU question In-Reply-To: References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> Message-ID: One thing that has yet to be mentioned is what kind of gpu are we talking about. depending on the problem would tesla gpu's, if you are building the cluster from scratch, be better for a gpu based cluster as they are meant for high performance computing? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From madskaddie at gmail.com Mon Aug 31 06:15:07 2009 From: madskaddie at gmail.com (madskaddie at gmail.com) Date: Mon, 31 Aug 2009 14:15:07 +0100 Subject: [Beowulf] GPU question In-Reply-To: References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> Message-ID: On Mon, Aug 31, 2009 at 7:40 AM, Jonathan Aquilina wrote: > One thing that has yet to be mentioned is what kind of GPU we are talking > about. Depending on the problem, would Tesla GPUs, if you are building the > cluster from scratch, be better for a GPU-based cluster, as they are meant > for high performance computing? Tesla (10-series) solutions have a big bunch of memory (4GB per GPU) and no graphics output. In terms of FLOPS the GeForce 2xx series are also great (but with less memory per GPU). The Tesla C1060 that I work with generates massive heat (they need 200W per card to work), so that is an issue to care about (I have a cold - ~16 Celsius - air flow at the front of the PC). The 1070 is a 4-GPU "all in one" 1U system, so I guess it's probably the optimal solution from a management point of view (don't forget the host PC too). The problem with too much data within the GPUs (16GB total) is the bottlenecks (at the PCIe bus) you may have if you need to download big bunches of data frequently or if the code on GPU A is supposed to interact with GPU B/C/D. One thing that's not mentioned out loud by NVIDIA (I have read it only in the CUDA programming manual) is that if the video system needs more memory than is available (say you change resolution while you're waiting for your process to finish), it will crash your CUDA app, so I advise you to use a second card for display (if you have a Tesla solution, you certainly have a "second" display card). If you are running remotely, this is a non-issue (framebuffers don't need much memory nor change resolution). Gil Brandao From eagles051387 at gmail.com Mon Aug 31 07:13:55 2009 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Mon, 31 Aug 2009 16:13:55 +0200 Subject: [Beowulf] GPU question In-Reply-To: References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> Message-ID: >One thing that's not mentioned out loud by NVIDIA (I have read it only in >the CUDA programming manual) is that if the video system needs more memory >than is available (say you change resolution while you're waiting >for your process to finish), it will crash your CUDA app, so I advise >you to use a second card for display (if you have a Tesla solution, >you certainly have a "second" display card). If you are running >remotely, this is a non-issue (framebuffers don't need much memory >nor change resolution). In this regard, then, why waste a PCIe slot when you can get a board that has graphics integrated, leaving the slots free to use for data processing? Is there any difference in performance between a motherboard that has the graphics integrated and one that does not? -------------- next part -------------- An HTML attachment was scrubbed...
URL: From gus at ldeo.columbia.edu Mon Aug 31 09:28:43 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 31 Aug 2009 12:28:43 -0400 Subject: [Beowulf] GPU question In-Reply-To: <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> Message-ID: <4A9BFA3B.5030809@ldeo.columbia.edu> Hi Amjad 1. Beware of hardware requirements, specially on your existing computers, which may or may not fit a CUDA-ready GPU. Otherwise you may end up with a useless lemon. A) Not all NVidia graphic cards are CUDA-ready. NVidia has lists telling which GPUs are CUDA-ready, which are not. B) Check all the GPU hardware requirements in detail: motherboard, PCIe version and slot, power supply capacity and connectors, etc. See the various GPU models on NVidia site, and the product specs from the specific vendor you choose. C) You need a free PCIe slot, most likely 16x, IIRR. D) Most GPU card models are quite thick, and take up its own PCIe slot and cover the neighbor slot, which cannot be used. Hence, if your motherboard is already crowded, make sure everything will fit. For rackmount a chassis you may need at least 2U height. On a tower PC chassis this shouldn't be a problem. You may need some type of riser card if you plan to mount the GPU parallel to the motherboard. E) If I remember right, you need PCIe version 1.5 (?) or version 2 on your motherboard. F) You also need a power supply with enough extra power to feed the GPU beast. The GPU model specs should tell you how much power you need. Most likely a 600W PS or larger, specially if you have a dual socket server motherboard with lots of memory, disks, etc to feed. G) Depending on the CUDA-ready GPU card, the low end ones require 6-pin PCIe power connectors from the power supply. The higher end models require 8-pin power supply PCIe connectors. You may find and buy molex-to-PCIe connector adapters also, so that you can use the molex (i.e. ATA disk power connectors) if your PS doesn't have the PCIe connectors. However, you need to have enough power to feed the GPU and the system, no matter what. *** 2. Before buying a lot of hardware, I would experiment first with a single GPU on a standalone PC or server (that fits the HW requirements), to check how much programming it takes, and what performance boost you can extract from CUDA/GPU. CUDA requires quite a bit of logistics of shipping data between memory, GPU, CPU, etc. It is perhaps more challenging to program than, say, parallelizing a serial program with MPI, for instance. Codes that are heavy in FFTs or linear algebra operations are probably good candidates, as there are CUDA libraries for both. At some point only 32-bit floating point arrays would take advantage of CUDA/GPU, but not 64-bit arrays. The latter would require additional programming to change between 64/32 bit when going to and coming back from the GPU. Not sure if this still holds true, newer GPU models may have efficient 64-bit capability, but it is worth checking this out, including if performance for 64-bit is as good as for 32-bit. 3. PGI compilers version 9 came out with "GPU directives/pragmas" that are akin to the OpenMPI directives/pragmas, and may simplify the use of CUDA/GPU. At least before the promised OpenCL comes out. Check the PGI web site. Note that this will give you intra-node parallelism exploring the GPU, just like OpenMP does using threads on the CPU/cores. 4. 
CUDA + MPI may be quite a challenge to program. I hope this helps, Gus Correa amjad ali wrote: > Hello all, specially Gil Brandao > > Actually I want to start CUDA programming for my |C.I have 2 options to do: > 1) Buy a new PC that will have 1 or 2 CPUs and 2 or 4 GPUs. > 2) Add 1 GPUs to each of the Four nodes of my PC-Cluster. > > Which one is more "natural" and "practical" way? > Does a program written for any one of the above will work fine on the > other? or we have to re-program for the other? > > Regards. > > On Sat, Aug 29, 2009 at 5:48 PM, > wrote: > > On Sat, Aug 29, 2009 at 8:42 AM, amjad ali > wrote: > > Hello All, > > > > > > > > I perceive following computing setups for GP-GPUs, > > > > > > > > 1) ONE PC with ONE CPU and ONE GPU, > > > > 2) ONE PC with more than one CPUs and ONE GPU > > > > 3) ONE PC with one CPU and more than ONE GPUs > > > > 4) ONE PC with TWO CPUs (e.g. Xeon Nehalems) and more than > ONE GPUs > > (e.g. Nvidia C1060) > > > > 5) Cluster of PCs with each node having ONE CPU and ONE GPU > > > > 6) Cluster of PCs with each node having more than one CPUs > and ONE GPU > > > > 7) Cluster of PCs with each node having ONE CPU and more > than ONE GPUs > > > > 8) Cluster of PCs with each node having more than one CPUs > and more > > than ONE GPUs. > > > > > > > > Which of these are good/realistic/practical; which are not? Which > are quite > > ?natural? to use for CUDA based programs? > > > > CUDA is kind of new technology, so I don't think there is a "natural > use" yet, though I read that there people doing CUDA+MPI and there are > papers on CPU+GPU algorithms. > > > > > IMPORTANT QUESTION: Will a cuda based program will be equally > good for > > some/all of these setups or we need to write different CUDA based > programs > > for each of these setups to get good efficiency? > > > > There is no "one size fits all" answer to your question. If you never > developed with CUDA, buy one GPU an try it. If it fits your problems, > scale it with the approach that makes you more comfortable (but > remember that scaling means: making bigger problems or having more > users). If you want a rule of thumb: your code must be > _truly_parallel_. If you are buying for someone else, remember that > this is a niche. The hole thing is starting, I don't thing there isn't > many people that needs much more 1 or 2 GPUs. > > > > > Comments are welcome also for AMD/ATI FireStream. > > > > put it on hold until OpenCL takes of (in the real sense, not in > "standards papers" sense), otherwise you will have to learn another > technology that even fewer people knows. > > > Gil Brandao > > > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bernard at vanhpc.org Mon Aug 31 11:57:27 2009 From: bernard at vanhpc.org (Bernard Li) Date: Mon, 31 Aug 2009 11:57:27 -0700 Subject: [Beowulf] Cluster install and admin approach (newbie question) In-Reply-To: References: Message-ID: Hi Gil: On Tue, Aug 25, 2009 at 5:11 PM, wrote: > it. 
On the other side, the distributed approach tools (where every > node is autonomous) seem to be halted (as systemimager - which is used > in the Oscar project) or discontinued, or truly overblown for my As far as I know, the OSCAR project is still alive and kicking: http://svn.oscar.openclustergroup.org/trac/oscar There hasn't been a lot of development for SystemImager, but the code should be fairly stable and self-sufficient. If you are looking for deployment systems to try out, you might also want to take a look at Perceus. Out of the box it does stateless provisioning, but I have created patches so that it can perform stateful provisioning as well. Hopefully this will get included in the 1.6 release: http://www.perceus.org Cheers, Bernard From eagles051387 at gmail.com Mon Aug 31 23:58:39 2009 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Tue, 1 Sep 2009 08:58:39 +0200 Subject: [Beowulf] GPU question In-Reply-To: <4A9BFA3B.5030809@ldeo.columbia.edu> References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> <4A9BFA3B.5030809@ldeo.columbia.edu> Message-ID: Gus, wouldn't it be better, with regard to power supplies, to purchase a modular power supply? That way you only plug in the power cables that you will actually use and don't have any extra clutter in your box. -------------- next part -------------- An HTML attachment was scrubbed... URL: