From bob at drzyzgula.org Thu Jul 1 03:10:08 2010 From: bob at drzyzgula.org (Bob Drzyzgula) Date: Thu, 1 Jul 2010 06:10:08 -0400 Subject: [Beowulf] Re: Multiple FlexLM lmgrd services on a single Linux machine? In-Reply-To: References: Message-ID: <20100701101008.GA21330@mx1.drzyzgula.org> One could also, clearly, set up multiple KVM- or Xen-based virtual machine images on which to run lmgrd. But one might then ask why one would want to do this, given that part of the point of multiple lmgrds is to provide physical server redundancy, unless, as Mark appears to be thinking, you simply believe you need one lmgrd for each vendor... On 01/07/10 01:21 -0400, Mark Hahn wrote: >> Linux OS support multiple lmgrd services or not? If its not directly, is >> there a way to do it? > > I don't really understand what you're asking. yes, linux provides fully > functional TCP/IP. yes, flexlm can run either with a merged license file > (single base port, multiple vendor ports), or with multiple completely > separate instances (listening on say, ports 27000+27001 and 28000+28001). > the latter is often more convenient, since it means you can adjust one > instance without affecting the other. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lynesh at Cardiff.ac.uk Thu Jul 1 03:38:31 2010 From: lynesh at Cardiff.ac.uk (Huw Lynes) Date: Thu, 01 Jul 2010 11:38:31 +0100 Subject: [Beowulf] Re: Multiple FlexLM lmgrd services on a single Linux machine? In-Reply-To: <20100701101008.GA21330@mx1.drzyzgula.org> References: <20100701101008.GA21330@mx1.drzyzgula.org> Message-ID: <1277980711.2155.16.camel@w1181.insrv.cf.ac.uk> On Thu, 2010-07-01 at 06:10 -0400, Bob Drzyzgula wrote: > One could also, clearly, set up multiple KVM- or Xen-based > virtual machine images on which to run lmgrd. But one > might then ask why one would want to do this, given > that part of the point of multiple lmgrds is to provide > physical server redundancy, unless, as Mark appears to > be thinking, you simply believe you need one lmgrd for > each vendor... > While it is possible to run licenses from multiple vendors under a single LMGRD, I would advise against it on the basis that 99.9% of vendors don't understand how flexlm works. The worst case I've seen of this is PGI, where we have to run two separate virtual machines for the Windows and Linux licenses. The way they set up their license files makes them impossible to merge. And to add insult to injury, the vendor daemon has paths to lock files hard-coded, so you can't run two PGI license servers on the same box. It's possible that there is some magic environment variable that would fix the lock-file issue. On the issue of VMs to run license servers: the reason we do it is to provide physical server redundancy, since the VMs reside in an ESX HA cluster. Cheers, Huw -- Huw Lynes | Advanced Research Computing HEC Sysadmin | Cardiff University | Redwood Building, Tel: +44 (0) 29208 70626 | King Edward VII Avenue, CF10 3NB From john.hearns at mclaren.com Thu Jul 1 03:46:37 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Thu, 1 Jul 2010 11:46:37 +0100 Subject: [Beowulf] Multiple FlexLM lmgrd services on a single Linux machine?
In-Reply-To: References: Message-ID: <68A57CCFD4005646957BD2D18E60667B10F7567C@milexchmb1.mil.tagmclarengroup.com> We're in a process of implementing a centralized FlexLM license server for multiple commercial applications. Can some one tell us, whether Linux OS support multiple lmgrd services or not? If its not directly, is there a way to do it? For example, can we install FlexLM license servers of both ANSYS and STAR CD on a single linux server? You can run multiple lmgrd on a single machine. As Mark says, use different ports. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ntmoore at gmail.com Thu Jul 1 06:25:37 2010 From: ntmoore at gmail.com (Nathan Moore) Date: Thu, 1 Jul 2010 08:25:37 -0500 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: Message-ID: I spent a summer working at IBM (porting applications) when LLNL's Blue Gene system was being installed/finalized. After spending 10 years in college/grad school with no real outside experience, it was an interesting time. A few observations might be relevant to the discussion. (1) to a certain extent, intellectual/scientific prestige was very important to the culture of the place. Promotions were/are based in part on how many patents you generate (not dissimilar to a publications count), but at least superficially, patents don't seem like a major revenue stream. Another data point, the company has a few internal research journals, http://domino.research.ibm.com/tchjr/journalindex.nsf/Home?OpenForm . (2) about once a week, my supervisor (a very skilled applications programmer) would ask, "So, have you figured out how to sell a million Blue Gene's yet?". Once the design was finalized/produced, the clear goal was to sell lots of them. (Fastest/best/national lab etc only really matter for a short time - people have to be paid...). (3) The local view seemed to be that the interconnect fabric (really fast and high-bandwith, ideal for finite-element calculations, and actually somewhat difficult to implement (well) in Molecular Dynamics) in the BGL was included because of LLNL's application needs, and the machine was accordingly hard to sell to "regular customers." (Something akin to selling a fleet of porsche's to a Taxi Company). (3.5) a little more. 0.8GHz cpus, minimum allocation is 512/1024 CPU's at a time. Not really an architecture that the guys at Citibank are used to writing for... As I recall, this was a result of the design requirements from LLNL. Its an amazing system to look at though - its just one big board with a bunch of chips (CPU+memory) plugged in. The system density and low power consumption was the most impressive thing to me. (4) From my experience, it seems like one of the roots of IBM's success was taking a computer that you have to replace every two years (or can build from parts on NewEgg) and turning it into an industrial appliance (like a hobart mixer or a drill-press) that you service regularly and can get 10 or more years of life out of. This seemed like the essence of the "i-Series, and earlier "System-360" machines. 
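To make the FlexLM sub-thread above concrete (Mark Hahn's merged license file versus completely separate lmgrd instances, and John Hearns's advice to use different ports), here is a minimal sketch of the separate-instance approach using standard FlexLM license-file syntax. The host name, hostid, paths and vendor daemon names below are placeholders rather than values from any vendor's documentation, so substitute whatever your vendors actually ship:

    # /opt/licenses/ansys/license.dat -- first, independent lmgrd instance
    SERVER licserver1 001122334455 27000
    VENDOR ansyslmd PORT=27001
    # FEATURE/INCREMENT lines as supplied by the vendor ...

    # /opt/licenses/starcd/license.dat -- second, completely separate instance
    SERVER licserver1 001122334455 28000
    VENDOR cdlmd PORT=28001
    # FEATURE/INCREMENT lines as supplied by the vendor ...

    # start each instance with its own debug log, e.g. from an init script
    lmgrd -c /opt/licenses/ansys/license.dat -l /var/log/lmgrd-ansys.log
    lmgrd -c /opt/licenses/starcd/license.dat -l /var/log/lmgrd-starcd.log

Clients then select an instance with port@host (27000@licserver1 or 28000@licserver1), via LM_LICENSE_FILE or the vendor-specific equivalent. The merged-file alternative is the same idea with a single SERVER line and one VENDOR line per daemon, each daemon on its own port.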
From hearnsj at googlemail.com Thu Jul 1 06:47:17 2010 From: hearnsj at googlemail.com (John Hearns) Date: Thu, 1 Jul 2010 14:47:17 +0100 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: Message-ID: On 1 July 2010 14:25, Nathan Moore wrote: >> (1) to a certain extent, intellectual/scientific prestige was very > important to the culture of the place. Promotions were/are based in > part on how many patents you generate (not dissimilar to a > publications count), but at least superficially, patents don't seem like > a major revenue stream. You should go to a Richard M Stallman talk on software patents sometime. (They're always on software patents). Software patents are used by companies as leverage in legal disputes between themselves - therefore a company which has many patents can 'defend' itself against others by threatening to counter-sue for infringement of their patents. The more patents you have the better. RMS's argument of course is that software patents (he deals with software - don't extrapolate to other types of patent) should be ended. From bcostescu at gmail.com Thu Jul 1 07:14:21 2010 From: bcostescu at gmail.com (Bogdan Costescu) Date: Thu, 1 Jul 2010 16:14:21 +0200 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2C076A.5010703@scalableinformatics.com> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> <4C2B6C69.60309@scalableinformatics.com> <4C2B9E5F.2010602@ias.edu> <4C2C076A.5010703@scalableinformatics.com> Message-ID: On Thu, Jul 1, 2010 at 5:11 AM, Joe Landman wrote: > At the end of the day, the fundamental question we are debating is, does > the "prestige" of working with a top university/national lab have any > real tangible value that you can ascribe to the bottom line, does it > actually impact sales. > > I posit that the answer to this is a resounding "no". You obviously > disagree. I also disagree, but I have another point of view: the fact of working with a top university/national lab can be important for the development of the product or line of products. A top university/national lab is considered top because it has clever people who are renowned for their way of thinking and/or published results; given a new (type of) parallel machine, they might come up with amazing results and/or might allow them to become even more famous - their publications will mention the (type of) parallel machine on which their results were obtained and other people looking to obtain similar results or looking for even better results (=competitors :-)) will become interested. This doesn't necessarily mean that they will buy the same (type of) parallel machines now but, if the results were amazing enough, the _next_ generation of parallel machines from this or other vendor will be able to achieve the same amazing results because, by then, buyers will ask for it. So it effectively becomes an investment in the future. Bogdan
From landman at scalableinformatics.com Thu Jul 1 07:51:11 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 01 Jul 2010 10:51:11 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> <4C2B6C69.60309@scalableinformatics.com> <4C2B9E5F.2010602@ias.edu> <4C2C076A.5010703@scalableinformatics.com> Message-ID: <4C2CAB5F.208@scalableinformatics.com> Bogdan Costescu wrote: > On Thu, Jul 1, 2010 at 5:11 AM, Joe Landman > wrote: >> At the end of the day, the fundamental question we are debating is, does >> the "prestige" of working with a top university/national lab have any >> real tangible value that you can ascribe to the bottom line, does it >> actually impact sales. >> >> I posit that the answer to this is a resounding "no". You obviously >> disagree. > > I also disagree, but I have another point of view: the fact of working > with a top university/national lab can be important for the > development of the product or line of products. A top This isn't the issue. The issue is, will a discount which amounts to you as a vendor paying your customer to taking your product, in order to garner prestige ... will this prestige translate to the bottom line in the near term ... will it positively impact sales. > university/national lab is considered top because it has clever people > who are renowned for their way of thinking and/or published results; This of course impugns the quality of people at "not-top" sites, places which may produce excellent science, but don't have the name brand of the "top" folks. My experience is that innovation happens where bright people are, whom are motivated to innovate. Excellent science happens many places which are not "top" sites.
> given a new (type of) parallel machine, they might come up with > amazing results and/or might allow them to become even more famous - > their publications will mention the (type of) parallel machine on > which their results were obtained and other people looking to obtain > similar results or looking for even better results (=competitors :-)) "Might" "might allow" "will mention" Which one of these directly impacts against the bottom line? Which one of these actively increases sales and revenues? > will become interested. This doesn't necessarily mean that they will > buy the same (type of) parallel machines now but, if the results were Exactly. They won't necessarily buy. That does impact the bottom line. > amazing enough, the _next_ generation of parallel machines from this > or other vendor will be able to achieve the same amazing results > because, by then, buyers will ask for it. So it effectively becomes an > investment in the future. ... but this doesn't matter, unless you are quantifying this return on investment by classifying the discount as an investment. The danger in doing this, and there is a profound danger in doing this, is that *everyone* will want you as the vendor to "invest" in them. This does happen, and that is why the NDA is such an important tool for vendors to control the distribution of the deal details. Morever, these investments, as I have indicated, and again, I haven't seen a single indication in email (private or public) otherwise, simply don't have a meaningful ROI. They aren't accretive to the bottom line. Which means, if you start giving everyone a discount and couch this as an investment, all you have done is lowered your margins to close to zero or below. You have no real expectation of a return on this "investment". Again, I am not bashing anyone. I do think people need to think these arguments over very carefully before they present this stuff to their vendor of choice. Most I have spoken to over the last year have told me that any discount comes with a significant quid pro quo, something that will help offset other costs elsewhere, e.g. have a measurable impact upon the bottom line. Prestige adds nothing to be bottom line, it gives you talking points. It won't steer a detectable/measurable number of customers your way. Investment in product development happens generally independently of the sales efforts. That is a real cost to the bottom line. If you couch this as product investment, then you have a cost center. Which negatively impacts bottom line. > Bogdan Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From james.p.lux at jpl.nasa.gov Thu Jul 1 08:29:45 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Thu, 1 Jul 2010 08:29:45 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: Message-ID: On 7/1/10 6:47 AM, "John Hearns" wrote: > On 1 July 2010 14:25, Nathan Moore wrote: >>> (1) to a certain extent, intellectual/scientific prestige was very >> important to the culture of the place. ?Promotions were/are based in >> part on how many patents you generate (not dissimilar to a >> publications count), but at least superficially, patents don't seem like >> a major revenue stream. > > You should go to a Richard M Stallman talk on software patents > sometime. 
(They're always on software patents). > Software patents are used by companies as leverage in legal disputes > between themselves - therefore a company > which has many patents can 'defend' itself against others by > threatening to counter-sue for infringment of their patents. > The more patents you have the better. > > RMS's argument of course is that software patents (he deals with > software - don't extrapolate to other types of patent) > should be ended. > > RMS is an interesting guy. He does tend to stake out one extreme on the intellectual property rights spectrum. I've had many a stimulating(?) discussion with folks who advocate a form of Marxism for software: that is, all software should be freely available to all, and magic elves/the state/some entity will ensure that the writers of such software will have a roof over their heads and food to eat, because it's a sharing of a societal good thing. Sadly, even in the halls of academe such a situation doesn't really exist. RMS knows this and has a finely nuanced way to deal with it. The same cannot be said of his philosophical adherents. Such as it is, the IP world we live in is the one we have to work in. We can't rely on the Renaissance patronage approach, or the aristocratic gentleman of independent means approach for support. Government support is sort of the new "patronage", but it is a fickle master (although perhaps, now that I think about it, not any more fickle than Prince Ludivico of Milan, etc.). For the rest of us, the enormous number of technology and science developers in the private world, the value of IP ("goodwill" on the books) is that with which a profit making entity justifies the original investment, and with which they pay your salary, which lets you put that roof over your head etc. With respect to software patents, IBM gets bunches o' patents on hardware too, and has always done so. The modern trend to building a portfolio as a strategic weapon against other portfolios (e.g. So you have something to cross license with) is somewhat worrying, because it tends to favor the big boys. That is, if you have 100 patents in your quiver and someone knocks a few out, you've still got 90+ to beat them over the head with. If you have 1 patent.... It's an interesting topic... From james.p.lux at jpl.nasa.gov Thu Jul 1 08:46:01 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Thu, 1 Jul 2010 08:46:01 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: Message-ID: On 7/1/10 7:14 AM, "Bogdan Costescu" wrote: > On Thu, Jul 1, 2010 at 5:11 AM, Joe Landman > wrote: >> At the end of the day, the fundamental question we are debating is, does >> the "prestige" of working with a top university/national lab have any >> real tangible value that you can ascribe to the bottom line, does it >> actually impact sales. >> >> I posit that the answer to this is a resounding "no". ?You obviously >> disagree. > > I also disagree, but I have another point of view: the fact of working > with a top university/national lab can be important for the > development of the product or line of products. This manifests itself in other ways, too. For instance, during the .com bubble, there were joking comments about being paid in "space dollars", typically in reference to the disparity between wages paid to develop research satellite equipment and in the wireless industry. 
There are people at JPL who maintain that we shouldn't be worrying about paying competitive salaries, because if you don't want to work on "space stuff" as a personal goal, regardless of compensation, you shouldn't be working there. I view this as unrealistic and comparable memes like the "starving in a garrett leading to true art" or the "if you truly cared, you'd do it for free and live in tent with the other homeless people along the arroyo". While the latter might have been acceptable to me in my 20s, now that I've passed the half century mark, I'm a bit more inured to creature comforts and, more to the point, so are my wife and children. I have talked with senior managers at technology companies about why they would contemplate being a vendor for JPL vs the commercial world (JPL tends to be in the category of a "high maintenance" customer who asks a lot of questions, wants to peer into every aspect of your processes, and asks a lot of their vendors, and because we're the government, you're not going to make huge profits).. Their response is often that it allows them to attract top talent, who can then benefit the company in other ways. In the case of NASA work, too, it's public, unlike other high tech work for the defense type markets, which is often classified. If you're recruiting smart people out of school, telling them that they can work on a radio for a Mars probe is much sexier than telling them that they will be the third assistant door latch controller developer on the automotive products team. So, given that someone wants to hire top people, and there are lots of potential employers for top people, you need something to distinguish yourself beyond the hygiene issue of salary. (A hygiene issue is one that has a sort of threshold effect.. Nobody cares much about the details of having it, but not having it is a big negative. ) Working on "the worlds fastest computer" is one of those things that gets you in the door. And, as a "top person" you might find that, though skilled, that's not your thing, and that you really have a talent for something else that IBM does, so IBM benefits, in many ways. From jlforrest at berkeley.edu Thu Jul 1 09:03:29 2010 From: jlforrest at berkeley.edu (Jon Forrest) Date: Thu, 01 Jul 2010 09:03:29 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: Message-ID: <4C2CBC51.6090704@berkeley.edu> Another reason why some vendors are willing to sell stuff at reduced prices to universities is for visibility. The thinking is that when grad students (finally) graduate and go off into industry, they'll want to buy the same stuff they used when they were students. I'm not sure how valid this approach is, but I don't argue with it. As the #1 public university in many (most?) scientific fields, UC Berkeley gets approached with all kinds of deals. During the boom times we sometimes had to turn down such deals because we simply didn't have the space and/or the people to allow us to accept the equipment. Things are different now, but space and people are still more expensive than most equipment. Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From landman at scalableinformatics.com Thu Jul 1 09:24:17 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 01 Jul 2010 12:24:17 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? 
In-Reply-To: <4C2CBC51.6090704@berkeley.edu> References: <4C2CBC51.6090704@berkeley.edu> Message-ID: <4C2CC131.4010403@scalableinformatics.com> Jon Forrest wrote: > Another reason why some vendors are willing to > sell stuff at reduced prices to universities is > for visibility. The thinking is that when grad > students (finally) graduate and go off into > industry, they'll want to buy the same stuff > they used when they were students. I'm not > sure how valid this approach is, but I don't > argue with it. Heh ... I deleted that section of one of my responses. There isn't a lot of data to support the notion that they will go out into the world and buy the same stuff. They will often go for the best deals. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From james.p.lux at jpl.nasa.gov Thu Jul 1 09:31:51 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Thu, 1 Jul 2010 09:31:51 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2CAB5F.208@scalableinformatics.com> Message-ID: On 7/1/10 7:51 AM, "Joe Landman" wrote: > Bogdan Costescu wrote: >> On Thu, Jul 1, 2010 at 5:11 AM, Joe Landman >> wrote: >>> At the end of the day, the fundamental question we are debating is, does >>> the "prestige" of working with a top university/national lab have any >>> real tangible value that you can ascribe to the bottom line, does it >>> actually impact sales. >>> >>> I posit that the answer to this is a resounding "no". You obviously >>> disagree. >> >> I also disagree, but I have another point of view: the fact of working >> with a top university/national lab can be important for the >> development of the product or line of products. A top > > This isn't the issue. The issue is, will a discount which amounts to > you as a vendor paying your customer to taking your product, in order to > garner prestige ... will this prestige translate to the bottom line in > the near term ... will it positively impact sales. I think it depends on the organization deriving the putative benefit from prestige. For some organizations, there is none.. It doesn't directly translate to increased profits in the near term. For others, it might be less tangible, but as real: attracting better job candidates increases the value of "workforce capital". As you've ably described, there's a huge tension at the top of most corporations between things that can be accurately valued in cash terms and those that are intangible. And, while lots of companies may say "our most valuable asset walks out the door each evening" they may not actually act that way in real life, particularly if their board feels that they have to responsible to "shareholder value" concerns. It would be interesting to compare whether more technology companies have gone broke because they over valued "prestige" or under valued it. As Joe has pointed out, more than one company has gotten into trouble for the "discounts for sexy customers" policy. On the other hand, if all your people walk out the door, permanently, because it's no fun working on the latest mid range commodity server, you also die, just slower. This is the classic "let's do away with R&D, because they're just a cost center" problem ( Chainsaw Al and his M&A ilk). This is much like when I worked in the entertainment industry. 
If you have some unique skill or capability, everyone comes to you with offers to work for free/minimal pay, "because it's a great opportunity, and you'll get to work with X, and the next job will pay better", and because without you, their grand idea will never work. (They do this with people who have commodity skills, too, but those have very depressed prices: e.g. Actors basically work for free and hope to get lucky. ) You pretty quickly learn to be pretty hard nosed about this: "Is this an investment opportunity or a work opportunity, and if it's the former, what's my fraction of the gross revenue" (Everybody knows you never, never, never take "net points") Sometimes, though, the "you'll work with X" *is* worth taking the job, either because it's something you'd literally pay for if offered the chance (hey, people bid in charity auctions for a dinner with Y or Z, it's not that different), or because the potential downstream reward has a high enough expected value (Probability(event)*Revenue(event) > reduced income from this one job). It's when P(event) is down in the 0.01 range and the R(event) is in the <1 years income category you say, "gosh that sounds nice, but I'm sort of busy right now, and can I refer you to someone else) The actor model is slightly different.. P(event) where event is "getting a real paying gig" is very small, but the R(event) is fairly high (that gig gets you into the union, for life, which helps improve the P(future event)) AND most important, the "foregone revenue" is basically zero. Actors KNOW that very few make any money, so they get jobs to feed and house themselves that are flexibly scheduled (wait staff, casual day labor, construction) so they don't lose revenue by taking the opportunity presented. I can't see anybody in the HPC business taking the 'actor' approach as a business plan, though. From john.hearns at mclaren.com Thu Jul 1 09:38:38 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Thu, 1 Jul 2010 17:38:38 +0100 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: Message-ID: <68A57CCFD4005646957BD2D18E60667B10F75B4E@milexchmb1.mil.tagmclarengroup.com> > > RMS is an interesting guy. He does tend to stake out one extreme on > the > intellectual property rights spectrum. I've had many a stimulating(?) > discussion with folks who advocate a form of Marxism for software: that > is, > all software should be freely available to all, and magic elves/the > state/some entity will ensure that the writers of such software will > have a > roof over their heads and food to eat, because it's a sharing of a > societal > good thing. Sadly, even in the halls of academe such a situation > doesn't > really exist. RMS knows this and has a finely nuanced way to deal with > it. Jim, RMS makes a distinction between patents and copyright. Remember that the GNU Copyleft is a copyright, used to defend free software. There is nothing wrong with copyrighting your work, and being paid for it, and indeed making money from it. It is the patents which lead to the absurdities of companies scrambling for patents for commonsense techniques - in order to build up that portfolio to brandish at other companies. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. 
From gus at ldeo.columbia.edu Thu Jul 1 09:41:38 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 01 Jul 2010 12:41:38 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2CBC51.6090704@berkeley.edu> References: <4C2CBC51.6090704@berkeley.edu> Message-ID: <4C2CC542.9060007@ldeo.columbia.edu> Hi Jon, Rahul, list Jon Forrest wrote: > Another reason why some vendors are willing to > sell stuff at reduced prices to universities is > for visibility. The thinking is that when grad > students (finally) graduate and go off into > industry, they'll want to buy the same stuff > they used when they were students. I'm not > sure how valid this approach is, but I don't > argue with it. > I agree this may not work with high end HPC. However, this is to some extent what Microsoft does very effectively with kids from pre-kindergarten to graduate school worldwide, including part of their charity donations to schools, etc. It creates a culture, a habit, a dependency, in what is a much bigger market, but yet a market with much less choices than HPC - surprisingly or not. (This is despite the inroads that Ubuntu and others may have created lately.) Hey Rahul: Your original question went a long way, didn't it? :) Somehow you always ask questions that trigger these interesting debates. Cheers, Gus Correa > As the #1 public university in many (most?) > scientific fields, UC Berkeley gets approached > with all kinds of deals. During the boom times > we sometimes had to turn down such deals because > we simply didn't have the space and/or the people > to allow us to accept the equipment. > > Things are different now, but space and people > are still more expensive than most equipment. > > Cordially, From landman at scalableinformatics.com Thu Jul 1 10:00:30 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 01 Jul 2010 13:00:30 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> Message-ID: <4C2CC9AE.7030202@scalableinformatics.com> Greg Rubino wrote: > I have to say I partially agree with Prentice. I don't know if > prestige directly translates into revenue, but if your a huge company Thats the thesis that I am saying I do not believe to be the case, and Prentis is (as far as I understand it) indicating that he believes this to be the case. If a discount doesn't translate into revenue (e.g. no return on "investment"), then what does it translate into? The accountants will tell you. > and your platform is the first one upon which some new innovation in > HPC is implemented (cutthroat or not), you have a huge opportunity on > your hands. I guess it depends upon the terms under which you took > that initial "loss" (s/loss/risk/g). Thats marketing. Few companies list marketing as a profit center. It is an expense. It reduces the bottom line. No one is denying the marketing cache' of a nice PR win. However ... marketing cache' doesn't often turn into cash (or, more accurately, profitable revenue). Which is the basis of what I am arguing. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. 
email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From jlforrest at berkeley.edu Thu Jul 1 10:10:31 2010 From: jlforrest at berkeley.edu (Jon Forrest) Date: Thu, 01 Jul 2010 10:10:31 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: Message-ID: <4C2CCC07.1020004@berkeley.edu> On 7/1/2010 9:47 AM, Lux, Jim (337C) wrote: > Giving it away for free to educational institutions worked for Unix, eh? (at > least in the long run) Maybe so, for some definition of "worked". I don't know how it was at other places, but at Berkeley, especially in the Computer Science Dept. where I worked, the presence of certain brands of equipment was like age rings in trees. By this I mean that 1991-1993 were the DEC years, 1994-1996 were the HP years, 1997-2000 were the Intel years (I'm making up these years and vendors). During these periods the vendors made their equipment available to us at extremely good prices. Then, something would happen that caused the vendors to loose interest, and another vendor would gain interest. In our case, I think part of the reason why this happened is because the vendors wanted access to the professors and grad students involved in the research projects seemed like they would have promising commercial value, such as RAID, RISC, Postgres, and NOW. -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From gus at ldeo.columbia.edu Thu Jul 1 10:29:07 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 01 Jul 2010 13:29:07 -0400 Subject: [Beowulf] guide for pbs/torque and mpi In-Reply-To: References: Message-ID: <4C2CD063.9020606@ldeo.columbia.edu> Hi Akshar akshar bhosale wrote: > hi, > we want to have a good reference guide for torque(pbs),maui and mpi > > akshar Torque and Maui guides are available from ClusterResources: http://www.clusterresources.com/pages/products/torque-resource-manager.php http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php LLNL has good MPI (and other) tutorials: https://computing.llnl.gov/?set=training&page=index https://computing.llnl.gov/tutorials/mpi/ I hope it helps, Gus Correa > > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From james.p.lux at jpl.nasa.gov Thu Jul 1 09:47:53 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Thu, 1 Jul 2010 09:47:53 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2CBC51.6090704@berkeley.edu> Message-ID: On 7/1/10 9:03 AM, "Jon Forrest" wrote: > Another reason why some vendors are willing to > sell stuff at reduced prices to universities is > for visibility. The thinking is that when grad > students (finally) graduate and go off into > industry, they'll want to buy the same stuff > they used when they were students. I'm not > sure how valid this approach is, but I don't > argue with it. > Giving it away for free to educational institutions worked for Unix, eh? 
(at least in the long run) And IBM used to give very attractive lease rates to universities (in the good old days, you couldn't BUY an IBM mainframe, only lease them) An interesting question is whether this strategy would be viable today, given the more short term return orientation of the capital markets. From rpnabar at gmail.com Thu Jul 1 12:39:07 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 1 Jul 2010 14:39:07 -0500 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2CC542.9060007@ldeo.columbia.edu> References: <4C2CBC51.6090704@berkeley.edu> <4C2CC542.9060007@ldeo.columbia.edu> Message-ID: On Thu, Jul 1, 2010 at 11:41 AM, Gus Correa wrote: > Hey Rahul: > > Your original question went a long way, didn't it? :) > Somehow you always ask questions that trigger these > interesting debates. Yes it did. I am not sure if that is good or bad though. :) I hope it doesn't make me look like a "troll". I just had a genuine curiosity to know more about the economic side of HPC, economies of scale etc.; this is the stuff often in the shadows and I don't see much coverage in the typical material I read. Best, -- Rahul From rpnabar at gmail.com Thu Jul 1 12:46:34 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 1 Jul 2010 14:46:34 -0500 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2CC131.4010403@scalableinformatics.com> References: <4C2CBC51.6090704@berkeley.edu> <4C2CC131.4010403@scalableinformatics.com> Message-ID: On Thu, Jul 1, 2010 at 11:24 AM, Joe Landman wrote: > Heh ... I deleted that section of one of my responses. There isn't a lot of > data to support the notion that they will go out into the world and buy the > same stuff. They will often go for the best deals. > People will only have a limited number of vendors on their list when they get quotes and compare specs, etc. And the sort of prestige branding does make you more likely to get on the list of vendors people will consider for a particular project. e.g. I imagine there are at least 20 different HPC vendors (if not more) that would sell the kind of equipment that went into our latest cluster upgrade. How many did I actually talk to? Maybe 7. (Maybe others are more diligent than I am) And this is the sort of subjective decision that will get made on the basis of reputation and familiarity. Word-of-mouth counts a lot too. If there is a Vendor-X that nobody on campus has ever worked with, there is a far more uphill battle to convince the decision makers to give them a chance. The reasons have to be pretty compelling. -- Rahul From gus at ldeo.columbia.edu Thu Jul 1 14:51:32 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 01 Jul 2010 17:51:32 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: <4C2CBC51.6090704@berkeley.edu> <4C2CC542.9060007@ldeo.columbia.edu> Message-ID: <4C2D0DE4.1070908@ldeo.columbia.edu> Hi Rahul, list So far we have three data points, if I counted right: 1) Yours: US$35k/TFlop, 100-node Beowulf cluster. 2) Dmitry Chubarov's US$158k/TFlop, SKIF/Cyberia cluster, Feb/2007, Russia 3) Mark Hahn's $CAD 30k/TFlop (approx US$28k/TFlop), in Canada, a Beowulf cluster, I presume, size not specified. Not sure if all these numbers correspond to nominal TFlops (Rpeak), or to the actual HPL benchmark result (Rmax), or to some estimate, say, (85% Rmax/Rpeak HPL efficiency)*(nominal TFlops), or to some other performance metric. I can add one point to the data table.
19 months ago we paid about US$37k/Tflop (nominal), or US$43k/Tflop (actual HPL). (Small Beowulf cluster w/ IB, IPMI, GigE, not counting UPS, storage, and head node.) I guess if you buy the latest greatest fastest processors they add a quite a bit on the $ side. The prices seem to be significantly higher if you buy outside the USA, Canada, maybe the EU also, as Dmitry's number suggests. Cray recently sold a 244Tflop XT6 to Brazil for weather forecast for US$20M (US$82k/Tflop). A bigger Petaflop version was sold to Los Alamos Natl. Lab for US$45M (US$45k/Tflop): http://www.theregister.co.uk/2010/04/29/brazil_buys_cray_xt6/ http://insidehpc.com/2010/04/21/cray-wins-big-deal-in-brazil/ http://www.hpcwire.com/offthewire/Cray-Wins-20M-Contract-with-Brazils-National-Institute-for-Space-Research-91710269.html http://www.networkworld.com/news/2010/040210-cray-supercomputer.html http://investors.cray.com/phoenix.zhtml?c=98390&p=irol-newsArticle&ID=1415533&highlight= http://investors.cray.com/phoenix.zhtml?c=98390&p=irol-newsArticle&ID=1409130&highlight Interesting that the $/Tflop ratio seems to (still) be significantly lower in the Beowulfs, even though commodity processors, RAM, etc, are used more and more in the brand name machines. Mark got the best deal. Maybe we should go buy in Canada, as people do with medicine. :) Cheers, Gus Correa Rahul Nabar wrote: > On Thu, Jul 1, 2010 at 11:41 AM, Gus Correa wrote: >> Hey Rahul: >> >> Your original question went a long way, didn't it? :) >> Somehow you always ask questions that trigger these >> interesting debates. > > Yes it did. I am not sure if that is good or bad though. :) I hope it > doesn't make me look like a "troll". I just had a genuine curiosity to > know more about the economic side of HPC, economies of scale etc.; > this is the stuff often in the shadows and I don't see much coverage > in the typical material I read. > > Best, > From rpnabar at gmail.com Thu Jul 1 14:58:30 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 1 Jul 2010 16:58:30 -0500 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2D0DE4.1070908@ldeo.columbia.edu> References: <4C2CBC51.6090704@berkeley.edu> <4C2CC542.9060007@ldeo.columbia.edu> <4C2D0DE4.1070908@ldeo.columbia.edu> Message-ID: On Thu, Jul 1, 2010 at 4:51 PM, Gus Correa wrote: > Hi Rahul, list > > Not sure if all these numbers correspond to nominal Tflops (Rmax), > or, to actual HPL benchmark (Rpeak), > or to some estimate, say,(85% Rpeak/Rmax HPL)*(nominal Tflops), > or to some other performance metric. Mine were nominal Tflops. One more data point for Gus' list: $27,000 per Teraflop (nominal) if I read this article about Crays currently. US deployment and seems to be sort of an averaged out number over several Cray systems circa Feb 2010 http://www.theregister.co.uk/2010/02/24/cray_dod_deals/ -- Rahul From mathog at caltech.edu Thu Jul 1 15:54:03 2010 From: mathog at caltech.edu (David Mathog) Date: Thu, 01 Jul 2010 15:54:03 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? Message-ID: This thread brings to mind the punch line from the second of the SNL "Citiwide Change Bank" ads. 
Here is a link to the transcript: http://snltranscripts.jt.org/88/88achangebank2.phtml Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From samuel at unimelb.edu.au Thu Jul 1 21:02:06 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Fri, 02 Jul 2010 14:02:06 +1000 Subject: [Beowulf] Multiple FlexLM lmgrd services on a single Linux machine? In-Reply-To: References: Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 01/07/10 14:51, Sangamesh B wrote: > We're in a process of implementing a > centralized FlexLM license server for multiple > commercial applications. Can some one tell us, > whether Linux OS support multiple lmgrd services > or not? Yes you can, when I was at VPAC we were doing it (and they still are) and now I'm here we're doing it for our cluster (though only for 2 at present). What we do is have a single FlexLM install and then a directory per application under /opt/licenses/. In that directory we keep the license.dat file along with the vendor daemon for the application and the (optional) license.opt file. cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkwtZL4ACgkQO2KABBYQAh+cawCeO5iH3wJaJMHS8iTOwrCVRXt3 Uq8An0DQvevclbyfcETh9WPHNjfQaxdd =qayK -----END PGP SIGNATURE----- -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.hearns at mclaren.com Fri Jul 2 01:59:05 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Fri, 2 Jul 2010 09:59:05 +0100 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: Message-ID: <68A57CCFD4005646957BD2D18E60667B10F75E1F@milexchmb1.mil.tagmclarengroup.com> > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] > On Behalf Of David Mathog > Sent: 01 July 2010 23:54 > To: beowulf at beowulf.org > Subject: Re: [Beowulf] dollars-per-teraflop : any lists like the > Top500? > > This thread brings to mind the punch line from the second of the SNL > "Citiwide Change Bank" ads. Here is a link to the transcript: > > http://snltranscripts.jt.org/88/88achangebank2.phtml Hmmmm. I'd just returned from a business trip to London, and all the cash I had was a five-pound note. Citiwide wasn't able to convert it to dollars, but they did give me four guineas, two crowns, four shillings, and ten pence. http://home.clara.net/brianp/money.html One guinea = 21 shillings = 105p in decimal money Crown = five shillings 25p in decimal money One shilling = 5p in decimal money One pence old money = 0.417p decimal I make that 479.17pence decimal in exchange for five pounds (500p). Yup, that's how Citibank make their money The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From Bill.Rankin at sas.com Fri Jul 2 07:37:23 2010 From: Bill.Rankin at sas.com (Bill Rankin) Date: Fri, 2 Jul 2010 14:37:23 +0000 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? 
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B10F75E1F@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B10F75E1F@milexchmb1.mil.tagmclarengroup.com> Message-ID: <76097BB0C025054786EFAB631C4A2E3C093322A0@MERCMBX04R.na.SAS.com> David: > > This thread brings to mind the punch line from the second of the SNL > > "Citiwide Change Bank" ads. Here is a link to the transcript: > > > > http://snltranscripts.jt.org/88/88achangebank2.phtml Thank you, now my keyboard is covered in coffee. ;-) > Hmmmm. > > I'd just returned from a business trip to London, and all the cash I > had was a five-pound note. > Citiwide wasn't able to convert it to dollars, but they did give me > four guineas, two crowns, four shillings, and ten pence. [...] > I make that 479.17pence decimal in exchange for five pounds (500p). > Yup, that's how Citibank make their money Well John, that's American math for you. I blame our educational system. (You would have to go add that up, wouldn't you ;-). Going completely off topic, most US banks will not convert anything other than paper currency and I have a baggie full of Euro and Pound coinage to show for it. So the fact that "Citiwide" converted the note to coins would be another win for them. Have a good holiday weekend, for those of us on this side of the pond(*). To everyone else, have a nice normal weekend. -b (*) and north of Mexico and south of Canada, unless of course you are in Alaska or Hawaii. From scrusan at UR.Rochester.edu Thu Jul 1 09:53:51 2010 From: scrusan at UR.Rochester.edu (Steve Crusan) Date: Thu, 01 Jul 2010 12:53:51 -0400 Subject: [Beowulf] guide for pbs/torque and mpi In-Reply-To: Message-ID: For Toreuq/Maui, go here: http://www.clusterresources.com/products.php There should be manuals for each product. On 6/30/10 3:32 PM, "akshar bhosale" wrote: > hi, > ?we want to have a good reference guide for torque(pbs),maui and mpi > > akshar > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf ---------------------- Steve Crusan System Administrator Center for Research Computing University of Rochester (585) 276-5599 https://www.crc.rochester.edu/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From Nico.Mittenzwey at informatik.tu-chemnitz.de Fri Jul 2 04:11:03 2010 From: Nico.Mittenzwey at informatik.tu-chemnitz.de (Nico Mittenzwey) Date: Fri, 02 Jul 2010 13:11:03 +0200 Subject: [Beowulf] PBS/Maui delay starting of several jobs for certain user Message-ID: <4C2DC947.8000903@informatik.tu-chemnitz.de> Hi all, we have a user with jobs that create heavy I/O load on our network file system every x minutes. If I block her jobs and start them manually with a delay of about 10 minutes, she can run about 60 jobs simultaneously. However, if I don't do that and the jobs are started at more or less the same time (for example after a big job from another user finished), she can run about 30 jobs without stressing our file system so hard that "ls" takes 30 seconds to display any home directory. So I would like to tell PBS/Maui to wait x seconds before starting another job of that particular user. Do you know of any means to accomplish that (even if I have to change the source)? 
Since I may need this for some other users too I would prefer using PBS/Maui directly rather then blocking all jobs of this users and starting the jobs using a script. Cheers, Nico From Bill.Rankin at sas.com Fri Jul 2 08:06:38 2010 From: Bill.Rankin at sas.com (Bill Rankin) Date: Fri, 2 Jul 2010 15:06:38 +0000 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: <4C2CBC51.6090704@berkeley.edu> Message-ID: <76097BB0C025054786EFAB631C4A2E3C093322CD@MERCMBX04R.na.SAS.com> > > Another reason why some vendors are willing to > > sell stuff at reduced prices to universities is > > for visibility. The thinking is that when grad > > students (finally) graduate and go off into > > industry, they'll want to buy the same stuff > > they used when they were students. I'm not > > sure how valid this approach is, but I don't > > argue with it. > > > > Giving it away for free to educational institutions worked for Unix, > eh? (at least in the long run) Well, SW != HW. Giving away the former is a matter of opportunity cost, whereas you have capital tied up in the latter. But there are also (as I understand it) significant tax write-offs in the discounts given to universities. Sun used to have heavy subsidies for their hardware sold to educational institutions, as did Dell and IBM. I suspect that additional money on the balance sheet made the "visibility" argument a lot more palatable. Microsoft once made a huge software "donation" to Duke Univ. Many copies of Windows, Office, Visio, et al., values at millions of $. But instead of shipping a few copies of the distribution media and a big list of individual software keys (which would have made storage and installation easier), they received pallets upon pallets containing individual boxed sets of all the software. I expect that this was required in order for them to write-off the donation in their accounting books. > And IBM used to give very attractive lease rates to universities (in > the good old days, you couldn't BUY an IBM mainframe, only lease them) Yup. They used to do the same thing on their SP-X systems (do they still do this with the BlueGene's?) I remember when the North Carolina Supercomputing Center defaulted after the third year of their five year SP3 lease due to state budget cuts. True to form, IBM sent the trucks in and repossessed a 3-year old "supercomputer". I think that could have made for great Reality TV. ;-) > An interesting question is whether this strategy would be viable today, > given the more short term return orientation of the capital markets. As well as the fact that (at least for the HPC arena) the useful lifespan of a system is a lot less than that most mainframes. The technology front is just moving too fast. -bill From john.hearns at mclaren.com Fri Jul 2 08:30:29 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Fri, 2 Jul 2010 16:30:29 +0100 Subject: [Beowulf] PBS/Maui delay starting of several jobs for certain user In-Reply-To: <4C2DC947.8000903@informatik.tu-chemnitz.de> References: <4C2DC947.8000903@informatik.tu-chemnitz.de> Message-ID: <68A57CCFD4005646957BD2D18E60667B10FDC232@milexchmb1.mil.tagmclarengroup.com> > Hi all, > we have a user with jobs that create heavy I/O load on our network file > system every x minutes. If I block her jobs and start them manually > with > a delay of about 10 minutes, she can run about 60 jobs simultaneously. 
> However, if I don't do that and the jobs are started at more or less > the > same time (for example after a big job from another user finished), she > can run about 30 jobs without stressing our file system so hard that > "ls" takes 30 seconds to display any home directory. Nico, I have the perfect tool for this job: http://www.bofhcam.org/co-larters/lart-reference/ I did have a quick look in the PBSpro admin manual - I can't see anything at first glance which limits number of jobs per scheduler cycle. I could very well be wrong. You can increase the intervals between scheduler runs - but that will not do exactly what you want. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From james.p.lux at jpl.nasa.gov Fri Jul 2 08:47:18 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 2 Jul 2010 08:47:18 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <76097BB0C025054786EFAB631C4A2E3C093322A0@MERCMBX04R.na.SAS.com> Message-ID: On 7/2/10 7:37 AM, "Bill Rankin" wrote: > > > Have a good holiday weekend, for those of us on this side of the pond(*). To > everyone else, have a nice normal weekend. > > -b > > (*) and north of Mexico and south of Canada, unless of course you are in > Alaska or Hawaii. > Yesterday was Canada Day/F?te du Canada... (and, according to NPR, it was also the national day in Somalia, and given that in Muslim countries the weekend is often Thursday/Friday, it was probably a holiday weekend for them too) From eugen at leitl.org Sun Jul 4 09:05:01 2010 From: eugen at leitl.org (Eugen Leitl) Date: Sun, 4 Jul 2010 18:05:01 +0200 Subject: [Beowulf] 6 TFlops, 450 MFlops/W watercooled IBM @ ETH Message-ID: <20100704160501.GQ31956@leitl.org> http://www.physorg.com/news197295578.html IBM Hot Water-Cooled Supercomputer Goes Live at ETH Zurich July 2, 2010 (PhysOrg.com) -- IBM has delivered a first-of-a-kind hot water-cooled supercomputer to the Swiss Federal Institute of Technology Zurich (ETH Zurich), marking a new era in energy-aware computing. The innovative system, dubbed Aquasar, consumes up to 40 percent less energy than a comparable air-cooled machine. Through the direct use of waste heat to provide warmth to university buildings, Aquasar's carbon footprint is reduced by up to 85 percent. Building energy efficient computing systems and data centers is a staggering undertaking. In fact, up to 50 percent of an average air-cooled data center's energy consumption and carbon footprint today is not caused by computing but by powering the necessary cooling systems to keep the processors from overheating - a situation that is far from optimal when looking at energy efficiency from a holistic perspective. The development of Aquasar began one year ago as part of IBM's First-Of-A-Kind (FOAK) program, which engages IBM scientists with clients to explore and pilot emerging technologies that address business problems. The supercomputer consists of special water-cooled IBM BladeCenter Servers, which were designed and manufactured by IBM scientists in Zurich and Boblingen, Germany. For direct comparison with traditional systems, Aquasar also holds additional air-cooled IBM BladeCenter servers. In total, the system achieves a performance of six Teraflops and has an energy efficiency of about 450 megaflops per watt. 
In addition, nine kilowatts of thermal power are fed into the ETH Zurich's building heating system. With its innovative water-cooling system and direct utilization of waste heat, Aquasar is now fully-operational at the Department of Mechanical and Process Engineering at ETH Zurich. "With Aquasar, we make an important contribution to the development of sustainable high performance computers and computer system. In the future it will be important to measure how efficiently a computer is per watt and per gram of equivalent CO2 production," said Prof. Dimos Poulikakos, head of the Laboratory of Thermodynamics in New Technologies, ETH Zurich. Innovative water-cooling system The processors and numerous other components in the new high performance computer are cooled with up to 60 degrees C warm water. This is made possible by an innovative cooling system that comprises micro-channel liquid coolers which are attached directly to the processors, where most heat is generated. With this chip-level cooling the thermal resistance between the processor and the water is reduced to the extent that even cooling water temperatures of up to 60 degrees C ensure that the operating temperatures of the processors remain well below the maximally allowed 85 degrees C. The high input temperature of the coolant results in an even higher-grade heat at the output, which in this case is up to 65 degrees C. Overall, water removes heat 4,000 times more efficiently than air. "With Aquasar we achieved an important milestone on the way to CO2-neutral data centers," said Dr. Bruno Michel, manager of Advanced Thermal Packaging at IBM Research - Zurich. "The next step in our research is to focus on the performance and characteristics of the cooling system which will be measured with an extensive system of sensors, in order to optimize it further." Aquasar is part of a three-year collaborative research program called "Direct use of waste heat from liquid-cooled supercomputers: the path to energy saving, emission-high performance computers and data centers." In addition to ETH Zurich and IBM Research - Zurich, the project also involves ETH Lausanne. It is supported by the Swiss Centre of Competence of support for Energy and Mobility (CCEM). Source: IBM From cbergstrom at pathscale.com Sun Jul 4 09:30:57 2010 From: cbergstrom at pathscale.com (=?ISO-8859-1?Q?=22C=2E_Bergstr=F6m=22?=) Date: Sun, 04 Jul 2010 23:30:57 +0700 Subject: [Beowulf] 6 TFlops, 450 MFlops/W watercooled IBM @ ETH In-Reply-To: <20100704160501.GQ31956@leitl.org> References: <20100704160501.GQ31956@leitl.org> Message-ID: <4C30B741.2070807@pathscale.com> Eugen Leitl wrote: > http://www.physorg.com/news197295578.html > > IBM Hot Water-Cooled Supercomputer Goes Live at ETH Zurich > > July 2, 2010 > > (PhysOrg.com) -- IBM has delivered a first-of-a-kind hot water-cooled > supercomputer to the Swiss Federal Institute of Technology Zurich (ETH > Zurich), marking a new era in energy-aware computing. The innovative system, > dubbed Aquasar, consumes up to 40 percent less energy than a comparable > air-cooled machine. Through the direct use of waste heat to provide warmth to > university buildings Others have already made the joke that Fermi could double as a space heater, but I wonder if that could end up being reality.. It's also not really breaking news that a water cooled system is more efficient than air, but how real world tested is this? 
From the pictures on the youtube video [1] I wonder how this could be adapted to the current trend moving away from cell and towards gpus.. (The conduit looked pretty well mounted to processors which would be much harder to do with a vertical PCIe card... Not to mention leaks..) [1] http://www.youtube.com/watch?v=FbGyAXsLzIc From samuel at unimelb.edu.au Sun Jul 4 17:11:56 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Mon, 05 Jul 2010 10:11:56 +1000 Subject: [Beowulf] PBS/Maui delay starting of several jobs for certain user In-Reply-To: <4C2DC947.8000903@informatik.tu-chemnitz.de> References: <4C2DC947.8000903@informatik.tu-chemnitz.de> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 02/07/10 21:11, Nico Mittenzwey wrote: > So I would like to tell PBS/Maui to wait x seconds before > starting another job of that particular user. Do you know > of any means to accomplish that (even if I have to change > the source)? Why not just limit the number of jobs they can run by using MAXJOB to a level that your file server can cope with ? Is your fileserver a RHEL box using ext3 by some chance ? cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkwxI0wACgkQO2KABBYQAh9F2ACfchBAZ+qGV1qwKl9vtRtRBTp6 ARoAnjTs5SqcCnz8iNCAfrM56H2H/liq =XPBD -----END PGP SIGNATURE----- -------------- next part -------------- An HTML attachment was scrubbed... URL: From hahn at mcmaster.ca Tue Jul 6 08:12:17 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue, 6 Jul 2010 11:12:17 -0400 (EDT) Subject: [Beowulf] HPL efficiency on Magny-Cours and Westmere? Message-ID: Hi all, can anyone tell me what kind of efficiency you're seeing on Magny-Cours and Westmere systems? by efficiency, I mean actual HPL performance as a fraction of cores * clock * 4 flops/cycle. I realize some of this can be drived from top500 results, but I'd also be be interested in single- socket and single-node scores for comparison. thanks, mark hahn. From joshua_mora at usa.net Tue Jul 6 08:57:48 2010 From: joshua_mora at usa.net (Joshua mora acosta) Date: Tue, 06 Jul 2010 10:57:48 -0500 Subject: [Beowulf] HPL efficiency on Magny-Cours and Westmere? Message-ID: <495ogFP5W5952S02.1278431868@web02.cms.usa.net> MC 12core at 2.2GHz: 91% on die, 86.7% on node 2 socket , above 82% on cluster. Joshua ------ Original Message ------ Received: 10:34 AM CDT, 07/06/2010 From: Mark Hahn To: Beowulf Mailing List Subject: [Beowulf] HPL efficiency on Magny-Cours and Westmere? > Hi all, > can anyone tell me what kind of efficiency you're seeing on Magny-Cours > and Westmere systems? by efficiency, I mean actual HPL performance as a > fraction of cores * clock * 4 flops/cycle. I realize some of this can > be drived from top500 results, but I'd also be be interested in single- > socket and single-node scores for comparison. > > thanks, mark hahn. 
> _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From prentice at ias.edu Tue Jul 6 10:32:23 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 06 Jul 2010 13:32:23 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2CC9AE.7030202@scalableinformatics.com> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> <4C2CC9AE.7030202@scalableinformatics.com> Message-ID: <4C3368A7.2090506@ias.edu> Joe Landman wrote: > Greg Rubino wrote: >> I have to say I partially agree with Prentice. I don't know if >> prestige directly translates into revenue, but if your a huge company > > Thats the thesis that I am saying I do not believe to be the case, and > Prentis is (as far as I understand it) indicating that he believes this > to be the case. > My original point has been quite perverted by this thread, that I'm abstaining from further comments. Except this one. -- Prentice From prentice at ias.edu Tue Jul 6 11:35:58 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 06 Jul 2010 14:35:58 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C336DDB.6050600@scalableinformatics.com> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> <4C2CC9AE.7030202@scalableinformatics.com> <4C3368A7.2090506@ias.edu> <4C336DDB.6050600@scalableinformatics.com> Message-ID: <4C33778E.3010303@ias.edu> >> Joe Landman wrote: >>> Greg Rubino wrote: >>>> I have to say I partially agree with Prentice. I don't know if >>>> prestige directly translates into revenue, but if your a huge company >>> >>> Thats the thesis that I am saying I do not believe to be the case, and >>> Prentis is (as far as I understand it) indicating that he believes this >>> to be the case. >>> I think my original point was misconstrued, and may have been completely forgotten in the subsequent conversation. Here's another attempt at conveying my original point: Using the big systems at the top of the Top500 list to get $/FLOP wouldn't be a useful exercise, because these systems are usually sold under NDA's and (probably) at a loss to the vendor. I posited that the vendors sell these systems Top500-winning systems at a loss (Roadrunner and Jaguar, in particulat) in exchange for other "intangibles": 1. Gain knowledge through the R&D that goes into building these systems. 2. Collaborating with the computer science geniuses at the customer's site (like the computer geniuses at LANL), which could lead to knowledge transfer. 3.Bragging rights (which I referred to as "prestige" in my original post, which may have lead to confusion). I further said that making it to the top of the list provides valuable media coverage, which equates to advertising for the the system vendor. This seems to be where the confusion/furor started. Other have gone on to argue whether or not that leads to a tangible return on investments or pleases shareholders, but that wasn't really my point. My main point was that it would be difficult or impossible to get the price of these systems. 
And since they account for so many of the FLOPS in the Top500, they could skew the results of the average $/FLOP in the Top500, or make such a number meaningless, since your average institution can't buy such a system under the same circumstances. Now some more analogies that could be akin to adding fuel to the fire: You could equate building such systems to making "the world's largest pizza". I'm sure the small pizza place that makes it loses significant money making it, but it will make the local papers, be in the Guinness book of world records forever (or until someone else makes a bigger one) and probably be mentioned on its signs, business cards, and menus. Clearly a publicity/advertising stunt. Can't think of any technology transfer that would make a normal sized pizza any better in this case. Car manufacturers often make exotic supercars for the same reason. Remember the Ford GT, or the Mercedes-Benz McLaren SLR? These exotics don't always make money, but they get a lot of press for the manufacturer, bring prestige to the brand, and if the car is sufficiently advanced technologically, the respect of competitors. Seldom do these cars turn a profit, but since Ford and Mercedes are large companies, their profits elsewhere subsidize these projects. And of course, the high technology in these cars usually trickles down to more proletarian models over the years. While we're on the topic of cars, here's a perfect analogy: how much $$$ did Henry Ford II (aka "The Deuce") spend to develop the GT40s, which he built solely to beat Ferrari at Le Mans? -- Prentice From deadline at eadline.org Tue Jul 6 12:49:54 2010 From: deadline at eadline.org (Douglas Eadline) Date: Tue, 6 Jul 2010 15:49:54 -0400 (EDT) Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C33778E.3010303@ias.edu> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> <4C2CC9AE.7030202@scalableinformatics.com> <4C3368A7.2090506@ias.edu> <4C336DDB.6050600@scalableinformatics.com> <4C33778E.3010303@ias.edu> Message-ID: <41790.76.98.139.137.1278445794.squirrel@mail.eadline.org> I have been indisposed, so I missed the beginning of this thread (seems like a good thing). In any case, years ago there were some people suggesting that both cost and power be added to the Top500 results. The cost issue was just as contentious as it is now. As I recall, the discussion always came down to determining the true cost to purchase/install a production cluster vs the cost to run HPL. Things like actual hardware cost, staging cost (HW and SW), optimization cost, and power/cooling are too variable to really track without a lot of effort (i.e. how do you account for a turn-key cluster vs. a student-built one?). The composite and piecemeal nature of clusters makes these types of numbers difficult to determine. As an aside, the Top500 is a great thing and I believe it is used for things for which it was never intended. It is, after all, one performance data point which has little relevance to the codes most people run or the number of cores most people use. -- Doug >>> Joe Landman wrote: >>>> Greg Rubino wrote: >>>>> I have to say I partially agree with Prentice. I don't know if >>>>> prestige directly translates into revenue, but if your a huge company >>>> >>>> Thats the thesis that I am saying I do not believe to be the case, and >>>> Prentis is (as far as I understand it) indicating that he believes >>>> this >>>> to be the case.
>>>> > > I think my original point was misconstrued, and may have been completely > forgotten in the subsequent conversation. Here's another attempt at > conveying my original point: > > Using the big systems at the top of the Top500 list to get $/FLOP > wouldn't be a useful exercise, because these systems are usually sold > under NDA's and (probably) at a loss to the vendor. > > I posited that the vendors sell these systems Top500-winning systems at > a loss (Roadrunner and Jaguar, in particulat) in exchange for other > "intangibles": > > 1. Gain knowledge through the R&D that goes into building these systems. > > 2. Collaborating with the computer science geniuses at the customer's > site (like the computer geniuses at LANL), which could lead to knowledge > transfer. > > 3.Bragging rights (which I referred to as "prestige" in my original > post, which may have lead to confusion). > > I further said that making it to the top of the list provides valuable > media coverage, which equates to advertising for the the system vendor. > This seems to be where the confusion/furor started. > > Other have gone on to argue whether or not that leads to a tangible > return on investments or pleases shareholders, but that wasn't really my > point. > > My main point was that it would be difficult or impossible to get the > price of these systems. And since they account for so many of the FLOPS > in the Top500, they could skew the results of the average $/FLOP in the > Top500, or make such a number meaningless, since your average > institution can't by such a system under the same circumstance. > > Now some more analogies that could be akin to adding fuel to the fire: > > You could equate building such systems to making "the world's largest > pizza". I'm sure the small pizza place who makes it losses significant > money making it, but it will make the local papers, be in the Guinness > book of world's records forever (or until someone else makes a bigger > one) and probably be mentioned on his signs, business cards, and menus. > Clearly a publicity/advertising stunt. Can't think of any technology > transfer that would make a normal sized pizza any better in this case. > > Car manufacturers often make exotic supercars for the same reason. > Remember the Ford GT, or the Mercedes-Benz McLaren SLR? These exotics > don't always make money, but the get a lot of press for the > manufacturer, bring prestige to the brand, and if the car is > sufficiently advanced enough technologically, the respect of > competitors. Seldom do these cars turn a profit, but since Ford and > Mercedes are large companies, their profits elsewhere subsidize these > projects. And of course, the high technology in these cars usually > trickles down to more proletarian models over the years. > > While we're on the topic of cars, here's a perfect analogy: How much $$$ > Did Henry Ford II (aka "The Deuce") spend to develop the GT40s, which he > built solely to beat Ferrari at Le Mans? > > -- > Prentice > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
From mdidomenico4 at gmail.com Tue Jul 6 19:01:08 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Tue, 6 Jul 2010 22:01:08 -0400 Subject: [Beowulf] hp dl170h g6? Message-ID: Does anyone on the list have HP DL170h G6 blade chassis on their floor? Ours came with the on-board NIC mac addresses programmed in descending order, I'm curious if this is something new, I've never seen this done before. Every machine we have on the floor now has them in ascending order. The downside to the nic enumeration is that in bios eth0 is eth0 and pxe's from eth0, however, when inside anaconda (redhat) eth0 is really eth1, and thus kickstart cant run. The only time i've seen this happen is when the nic drivers load out of order, but that's easy to fix in the initrd HP gave me a bunch of software workarounds, that I'm not overly happy with, but I'd rather not have to put in a bunch of workarounds all over the place for these specific machines. Does anyone know if this is firmware flash fixable? HP refuses to acknowledge the question... From jlb17 at duke.edu Tue Jul 6 19:24:38 2010 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Tue, 6 Jul 2010 22:24:38 -0400 (EDT) Subject: [Beowulf] hp dl170h g6? In-Reply-To: References: Message-ID: On Tue, 6 Jul 2010 at 10:01pm, Michael Di Domenico wrote > Does anyone on the list have HP DL170h G6 blade chassis on their > floor? Ours came with the on-board NIC mac addresses programmed in > descending order, I'm curious if this is something new, I've never > seen this done before. Every machine we have on the floor now has > them in ascending order. > > The downside to the nic enumeration is that in bios eth0 is eth0 and > pxe's from eth0, however, when inside anaconda (redhat) eth0 is really > eth1, and thus kickstart cant run. The only time i've seen this > happen is when the nic drivers load out of order, but that's easy to > fix in the initrd I demoed a chassis full of the HP SL2x170z G6s, and they had the same problem -- BIOS/PXE eth0 became eth1 in anaconda. I worked around it by passing anaconda "ksdevice=bootif" in the pxe config file. But, yeah, it's annoying having to work around oddness like that. I'm looking at that model or the DL160 G6 (standard 1U), and I imagine the 160 will have the same issue. > Does anyone know if this is firmware flash fixable? HP refuses to > acknowledge the question... And, if it is, how many person-years will it take to find said firmware flash file on HP's website (seriously, how broken is that site!?)? -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From beckerjes at mail.nih.gov Tue Jul 6 19:44:07 2010 From: beckerjes at mail.nih.gov (Jesse Becker) Date: Tue, 6 Jul 2010 22:44:07 -0400 Subject: [Beowulf] hp dl170h g6? In-Reply-To: References: Message-ID: <20100707024407.GN1324@mail.nih.gov> I've had all manner of enumeration problems like this with HP hardware, going back to the DL145G2 series. I've neither seen, nor tried, any firmware fixes. It's massively annoying, to be sure. I've a DL385 with 4 on-board NICs labeled as NET1 to NET4. They are correspondingly enumerated as eth2, eth3, eth0, eth1. On Tue, Jul 06, 2010 at 10:01:08PM -0400, Michael Di Domenico wrote: >Does anyone on the list have HP DL170h G6 blade chassis on their >floor? Ours came with the on-board NIC mac addresses programmed in >descending order, I'm curious if this is something new, I've never >seen this done before. Every machine we have on the floor now has >them in ascending order.
> >The downside to the nic enumeration is that in bios eth0 is eth0 and >pxe's from eth0, however, when inside anaconda (redhat) eth0 is really >eth1, and thus kickstart cant run. The only time i've seen this >happen is when the nic drivers load out of order, but that's easy to >fix in the initrd > >HP gave me a bunch of software workarounds, that I'm not overly happy >with, but I'd rather not have to put in a bunch of workarounds all >over the place for these specific machines. > >Does anyone know if this is firmware flash fixable? HP refuses to >acknowledge the question... >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Jesse Becker NHGRI Linux support (Digicon Contractor) From alscheinine at tuffmail.us Tue Jul 6 19:52:53 2010 From: alscheinine at tuffmail.us (Alan Louis Scheinine) Date: Tue, 06 Jul 2010 21:52:53 -0500 Subject: [Beowulf] hp dl170h g6? In-Reply-To: References: Message-ID: <4C33EC05.7070701@tuffmail.us> Just so I understand better, in the file /tftpboot/linux-install/pxelinux.cfg/ changing from eth0 to eth1 the line append ksdevice=eth1 [etc.] does not solve the problem? I've had similar problems but I don't remember how we solved it, we tried everything randomly and in a semi-panic. But since we are calmly discussing it in the mailing list, it would be nice to organize the question. There is ksdevice in the file described above and in addition there is "--device eth0" (or that could be "--device eth1") in the ks.cfg file. Changing neither one nor the other nor both solves the problem? Regards, Alan -- Alan Scheinine 200 Georgann Dr., Apt. E6 Vicksburg, MS 39180 Email: alscheinine at tuffmail.us Mobile phone: 225 288 4176 http://www.flickr.com/photos/ascheinine From alscheinine at tuffmail.us Tue Jul 6 21:22:42 2010 From: alscheinine at tuffmail.us (Alan Louis Scheinine) Date: Tue, 06 Jul 2010 23:22:42 -0500 Subject: [Beowulf] hp dl170h g6? In-Reply-To: References: Message-ID: <4C340112.7080007@tuffmail.us> "ksdevice=bootif" I had not previous heard of that option. Joshua Baker-LePain writes: > And, if it is, how many person-years will it take to find said > firmware flash file on HP's website (seriously, how broken is that site!?)? To find a ppd file for an HP printer, navigating their website does not work for me, whereas, using Google brings me to the right page on the HP web site. With regards to Google, it finds a suggestion from Jay Hilliard > > In your pxelinux config file: > > add ksdevice=bootif > > also add "IPAPPEND 2" to the end of the file > > In your kickstart file, don't specify a device: > "network --bootproto dhcp" There is also > http://fedoraproject.org/wiki/Anaconda/Kickstart#network Is this more complete? Or is it incorrect? I don't know, just asking. -- Alan Scheinine 200 Georgann Dr., Apt. E6 Vicksburg, MS 39180 Email: alscheinine at tuffmail.us Mobile phone: 225 288 4176 http://www.flickr.com/photos/ascheinine From mdidomenico4 at gmail.com Wed Jul 7 06:07:02 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed, 7 Jul 2010 09:07:02 -0400 Subject: [Beowulf] hp dl170h g6? In-Reply-To: <4C340112.7080007@tuffmail.us> References: <4C340112.7080007@tuffmail.us> Message-ID: fyi... The two suggestions i got from HP, might help others 1. 
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&objectID=c01430330&jumpid=reg_R1002_USEN 2. Please try renaming the interfaces using UDEV rules. Before changing the names, please note which MAC address corresponds to which NIC port and the way they are to be ordered. DMIDECODE command may be used to obtain this information. For changing the NIC's name allocation, please change the file: /etc/udev/rules.d/XX-net_persistent_names.rules inserting the correct names in reference to MAC address. Depending on the distro and version the XX could be any number and the name of the file could be persistent-net.rules. root at linux:~# vi /etc/udev/rules.d/30-net_persistent_names.rules [...] SUBSYSTEM=="net", ACTION=="add", SYSFS{address}=="00:00:00::00:01", IMPORT="/lib/udev/rename_netiface %k eth0" SUBSYSTEM=="net", ACTION=="add", SYSFS{address}=="00:00:00:00:00:02", IMPORT="/lib/udev/rename_netiface %k eth1" At the end of each line replace the name ethX with the name you want to use, ie instead of eth0 use eth2 and finally reboot the system. You may refer to the following link for more information on writing udev rules: http://www.reactivated.net/writing_udev_rules.html --- Both of which would work to workaround the problem, as well as the ksdevice options do. However, the trouble comes in that we have post install scripts inside our kickstart files, if the NIC's enumerate wrong they don't work. We expect every machine in our building to have eth0 (mgmt) and eth1 (regular) network connections and up until now 99% of them work this way. since at least one other person has gotten the systems the same way, i guess there's not much i can do. On Wed, Jul 7, 2010 at 12:22 AM, Alan Louis Scheinine wrote: > > ?"ksdevice=bootif" ?I had not previous heard of that option. > > Joshua Baker-LePain writes: >> >> And, if it is, how many person-years will it take to find said > >> firmware flash file on HP's website (seriously, how broken is that >> site!?)? > > To find a ppd file for an HP printer, navigating their website does not work > for me, whereas, using Google brings me to the right page on the HP web > site. > > With regards to Google, it finds a suggestion from Jay Hilliard >> >> In your pxelinux config file: >> >> add ksdevice=bootif >> >> also add "IPAPPEND 2" to the end of the file >> >> In your kickstart file, don't specify a device: >> ? "network --bootproto dhcp" > > There is also >> >> http://fedoraproject.org/wiki/Anaconda/Kickstart#network > > Is this more complete? ?Or is it incorrect? > I don't know, just asking. > > -- > > ?Alan Scheinine > ?200 Georgann Dr., Apt. E6 > ?Vicksburg, MS ?39180 > > ?Email: alscheinine at tuffmail.us > ?Mobile phone: 225 288 4176 > > ?http://www.flickr.com/photos/ascheinine > From holden.dapenor at gmail.com Sun Jul 4 22:36:17 2010 From: holden.dapenor at gmail.com (Holden Dapenor) Date: Mon, 5 Jul 2010 01:36:17 -0400 Subject: [Beowulf] diskless cluster questions Message-ID: How does diskless clustering work for those aspects of the OS that need to be unique for each node? For instance, network configuration and hostfiles need to be specified somewhere, but if all nodes boot the same root, then where is this information stored? -------------- next part -------------- An HTML attachment was scrubbed... URL: From roger at HPC.MsState.Edu Tue Jul 6 08:58:52 2010 From: roger at HPC.MsState.Edu (Roger L. Smith) Date: Tue, 06 Jul 2010 10:58:52 -0500 Subject: [Beowulf] HPL efficiency on Magny-Cours and Westmere? 
In-Reply-To: References: Message-ID: <4C3352BC.60505@HPC.MsState.Edu> We achieved ~87% on HPL across 3072 cores (256 nodes) with QDR IB and 2GB/core. If I remember correctly, we got about 91% on a single node (across 12 cores). We didn't run single-socket tests. This is on an IBM iDataPlex with 2.8GHz X5660 Westmeres. Mark Hahn wrote: > Hi all, > can anyone tell me what kind of efficiency you're seeing on Magny-Cours > and Westmere systems? by efficiency, I mean actual HPL performance as a > fraction of cores * clock * 4 flops/cycle. I realize some of this can > be drived from top500 results, but I'd also be be interested in single- > socket and single-node scores for comparison. > > thanks, mark hahn. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Roger L. Smith Senior Systems Administrator Mississippi State University High Performance Computing Collaboratory From per at computer.org Tue Jul 6 12:22:01 2010 From: per at computer.org (Per Jessen) Date: Tue, 06 Jul 2010 21:22:01 +0200 Subject: [Beowulf] 6 TFlops, 450 MFlops/W watercooled IBM @ ETH References: <20100704160501.GQ31956@leitl.org> <4C30B741.2070807@pathscale.com> Message-ID: "C. Bergström" wrote: > Eugen Leitl wrote: >> http://www.physorg.com/news197295578.html >> >> IBM Hot Water-Cooled Supercomputer Goes Live at ETH Zurich >> >> July 2, 2010 >> >> (PhysOrg.com) -- IBM has delivered a first-of-a-kind hot water-cooled >> supercomputer to the Swiss Federal Institute of Technology Zurich >> (ETH >> Zurich), marking a new era in energy-aware computing. The innovative >> system, dubbed Aquasar, consumes up to 40 percent less energy than a >> comparable air-cooled machine. Through the direct use of waste heat >> to provide warmth to university buildings > > Others have already made the joke that Fermi could double as a space > heater, but I wonder if that could end up being reality.. It's also > not really breaking news that a water cooled system is more efficient > than air, but how real world tested is this? The concept is very real world, but quite old. Back in the 80s I worked for a Danish bank - our IBM 3090s were water-cooled, as were the cooling-towers. Around '89 we installed heat-exchangers for re-using the hot cooling water for warming up the offices (in winter). /Per Jessen, Zürich From eagles051387 at gmail.com Wed Jul 7 07:33:30 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Wed, 7 Jul 2010 16:33:30 +0200 Subject: [Beowulf] diskless cluster questions In-Reply-To: References: Message-ID: It's actually easier if you use one OS that you know how to work with. The OS doesn't need to be unique for each node, but you do want the slave nodes to have as much RAM as possible. You will also need PXE to boot the slaves off the master node, as well as TFTP to transfer the boot image from the master to the slaves. When going diskless, all information on the slaves is stored in RAM; then, once you power them off, if I'm not mistaken, any data is sent back to the master for storage. > For instance, network configuration and hostfiles need to be specified > somewhere, but if all nodes boot the same root, then where is this > information stored?
> > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- Jonathan Aquilina -------------- next part -------------- An HTML attachment was scrubbed... URL: From ashley at pittman.co.uk Wed Jul 7 07:47:23 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Wed, 7 Jul 2010 15:47:23 +0100 Subject: [Beowulf] diskless cluster questions In-Reply-To: References: Message-ID: <83248991-E0D9-4D84-AE97-CCBA76622244@pittman.co.uk> On 5 Jul 2010, at 06:36, Holden Dapenor wrote: > How does diskless clustering work for those aspects of the OS that need to be unique for each node? Differently for each distribution. > For instance, network configuration and hostfiles need to be specified somewhere hostfiles are the same, network configuration including hostname can be done by dhcp. > but if all nodes boot the same root, then where is this information stored? This is not mandated by diskless configuration, you may choose to share / ro between clients and have a rw copy of /var for each client or you may choose to have an entire fs tree for each client. Another option might be to use fuse although I don't have much experience of that myself, it's basically the same but each client would have a copy-on-write version of /var and /etc to allow them to write to files in these directories. Ashley. -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From hearnsj at googlemail.com Wed Jul 7 08:02:19 2010 From: hearnsj at googlemail.com (John Hearns) Date: Wed, 7 Jul 2010 16:02:19 +0100 Subject: [Beowulf] diskless cluster questions In-Reply-To: References: Message-ID: On 5 July 2010 06:36, Holden Dapenor wrote: > How does diskless clustering work for those aspects of the OS that need to > be unique for each node? As Ashley says, you use DHCP for the network configuration. There is very little else you should need to configure differently on each individual host - for instance batch scheduler systems store information on batch nodes in a central place. All the node needs to do is be configured to know its batch master, and to start the batch system daemon, then wait for the jobs to come in. Any changes to (say) /etc/pbs.conf are generally made to all the cluster nodes identically. From eagles051387 at gmail.com Wed Jul 7 09:14:03 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Wed, 7 Jul 2010 18:14:03 +0200 Subject: [Beowulf] diskless cluster questions In-Reply-To: References: Message-ID: i thought any changes made get transmitted back to the master node and stored so that next time the node is powered up it can be restored to that state -------------- next part -------------- An HTML attachment was scrubbed... URL: From Nico.Mittenzwey at informatik.tu-chemnitz.de Wed Jul 7 09:15:39 2010 From: Nico.Mittenzwey at informatik.tu-chemnitz.de (Nico Mittenzwey) Date: Wed, 07 Jul 2010 18:15:39 +0200 Subject: [Beowulf] PBS/Maui delay starting of several jobs for certain user In-Reply-To: <201007051900.o65J09rf004655@bluewest.scyld.com> References: <201007051900.o65J09rf004655@bluewest.scyld.com> Message-ID: <4C34A82B.10000@informatik.tu-chemnitz.de> Christopher Samuel wrote: > Why not just limit the number of jobs they can run by > using MAXJOB to a level that your file server can cope > with ? 
Because I have free nodes and don't want them to idle while jobs are available. ;) And - as always - the users want their results asap... > Is your fileserver a RHEL box using ext3 by some chance ? No a rather big Lustre storage system - but the mentioned user alone creates an I/O load of 2.5GB/s. cheers Nico From Nico.Mittenzwey at informatik.tu-chemnitz.de Wed Jul 7 09:33:29 2010 From: Nico.Mittenzwey at informatik.tu-chemnitz.de (Nico Mittenzwey) Date: Wed, 07 Jul 2010 18:33:29 +0200 Subject: [Beowulf] diskless cluster questions In-Reply-To: References: Message-ID: <4C34AC59.4030405@informatik.tu-chemnitz.de> Holden Dapenor wrote: > How does diskless clustering work for those aspects of the OS that need > to be unique for each node? You may take a look at www.perceus.org which is a nice solution for diskless clusters and handles that aspects. cheers, Nico From rpnabar at gmail.com Wed Jul 7 09:54:32 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 7 Jul 2010 11:54:32 -0500 Subject: [Beowulf] PBS/Maui delay starting of several jobs for certain user In-Reply-To: <4C2DC947.8000903@informatik.tu-chemnitz.de> References: <4C2DC947.8000903@informatik.tu-chemnitz.de> Message-ID: On Fri, Jul 2, 2010 at 6:11 AM, Nico Mittenzwey wrote: > So I would like to tell PBS/Maui to wait x seconds before starting another > job of that particular user. Do you know of any means to accomplish that > (even if I have to change the source)? Would a modified prologue script do the job? You could have a bash script that looks at a username and adds a "wait y" delay there only for this user's jobs. y can be made some function of x and the current number of that users jobs? There could be race conditions but if you add a random delay before the logic, this might not be an issue. Yes, it's a hack but it might work? -- Rahul From mathog at caltech.edu Wed Jul 7 15:34:01 2010 From: mathog at caltech.edu (David Mathog) Date: Wed, 07 Jul 2010 15:34:01 -0700 Subject: [Beowulf] instances where a failed storage block is not all zero? Message-ID: With "modern" hardware are there currently any notable instances where a failed read of a hardware storage area block results in that missing data being filled in with something other than null bytes? For instance, if a disk swapped a bad block out of the inside of a file, or a region of a DVD goes bad. (Assuming that the software reading it can even go on beyond the failure, which is often not possible, for instance on many tapes.) I know for instance that when reading from damaged media dd conv=sync,noerror will fill in with null bytes, but there is a lot of other software out there... Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From lindahl at pbm.com Wed Jul 7 16:04:40 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 7 Jul 2010 16:04:40 -0700 Subject: [Beowulf] instances where a failed storage block is not all zero? In-Reply-To: References: Message-ID: <20100707230440.GB9218@bx9.net> On Wed, Jul 07, 2010 at 03:34:01PM -0700, David Mathog wrote: > With "modern" hardware are there currently any notable instances where a > failed read of a hardware storage area block results in that missing > data being filled in with something other than null bytes? Yes. You might get the wrong block due to a misdirected write or read, or you might get an old block because the previous write experienced "write tearing". If the OS knows it was unable to read a block and replaced it with zeros, it will throw an error. 
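(As an aside, if the goal is to know exactly which regions of a damaged medium were filled in rather than read, GNU ddrescue is probably a better tool than plain dd, since it records the bad areas in a map/log file instead of silently zero-filling. A rough sketch from memory -- the device, image and log names below are only placeholders, so check the man page before trusting the options:

    # first pass: copy what reads cleanly, don't retry failed blocks yet
    ddrescue -n /dev/sdb disk.img rescue.log
    # second pass: go back and retry the bad areas a few times
    ddrescue -r3 /dev/sdb disk.img rescue.log

The log then tells you which byte ranges never came back, so you don't have to guess from the fill pattern.)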
In Linux, the behavior depends on what you chose: panic on error, mount r/o, or continue. If the nulls are part of the filesystem metadata, all hell can break loose. The errors in the first paragraph won't be detected at all. They're rare, but... -- greg From hahn at mcmaster.ca Wed Jul 7 16:14:42 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 7 Jul 2010 19:14:42 -0400 (EDT) Subject: [Beowulf] instances where a failed storage block is not all zero? In-Reply-To: References: Message-ID: > With "modern" hardware are there currently any notable instances where a > failed read of a hardware storage area block results in that missing > data being filled in with something other than null bytes? For I'm surprised at the question: I expect failed reads to result in out-of-band errors, not zeros. a failed read on a disk, for instance, indicates that the ECC failed - considering that the ECC is quite strong, it would be surprising to encounter a failure which wasn't even detected by the ECC. on what kind of medium are you finding errors-returned-as-zero? > a region of a DVD goes bad. (Assuming that the software reading it can > even go on beyond the failure, which is often not possible, for instance > on many tapes.) I know it's sorta possible to read raw (extended) sectors from disks, but it's pretty deep voodoo. I guess I would expect the contents to not fail-to-zero on disks (which I guess use some kind of NRZ-like encoding). I wouldn't be surprised if damaged flash would read all-0 or all-1 (flash erase sets a block to all-1, right, and writing is basically clearing selective zeros?) > will fill in with null bytes, but there is a lot of other software out > there... I think it's a question of whether you're using an exotic interface or not - normal kernel block/char devices aren't ever going to do this. people in the forensics/recovery business would be the ones to ask. -mark hahn From ebiederm at xmission.com Wed Jul 7 20:12:53 2010 From: ebiederm at xmission.com (Eric W. Biederman) Date: Wed, 07 Jul 2010 20:12:53 -0700 Subject: [Beowulf] instances where a failed storage block is not all zero? In-Reply-To: (David Mathog's message of "Wed\, 07 Jul 2010 15\:34\:01 -0700") References: Message-ID: "David Mathog" writes: > With "modern" hardware are there currently any notable instances where a > failed read of a hardware storage area block results in that missing > data being filled in with something other than null bytes? For > instance, if a disk swapped a bad block out of the inside of a file, or > a region of a DVD goes bad. (Assuming that the software reading it can > even go on beyond the failure, which is often not possible, for instance > on many tapes.) I seem to remember people looking at the expected failure rates were saying that were are getting close to the point where a single pass through the disk will have a single undetected bit flip. Which is one of the reasons disk manufacturers have wanted to go to 4K blocks recently so the can get more error checking/correcting per bit. Eric From cbergstrom at pathscale.com Thu Jul 8 17:06:47 2010 From: cbergstrom at pathscale.com (=?ISO-8859-1?Q?=22C=2E_Bergstr=F6m=22?=) Date: Fri, 09 Jul 2010 07:06:47 +0700 Subject: [Beowulf] Slightly OT : GPU Optimized HPL and other benchmarks Message-ID: <4C366817.8040106@pathscale.com> Hi all We recently announced our new ENZO gpu solution for Nvidia hardware and now working on performance for various benchmarks. 
If anyone has a small gpu cluster or is planning to have one in the future we're looking for beta testers who can share kernels and give good feedback. Instead of a CUDA/OpenCL front-end we've opted for HMPP pragma approach which offers a lot of great benefits, but please contact me off list for more details on that. We're specifically interested to work with anyone or group who really wants to push the efficiency/performance of HPL (lapack). Thanks Christopher From mdidomenico4 at gmail.com Thu Jul 8 18:05:26 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Thu, 8 Jul 2010 21:05:26 -0400 Subject: [Beowulf] Slightly OT : GPU Optimized HPL and other benchmarks In-Reply-To: <4C366817.8040106@pathscale.com> References: <4C366817.8040106@pathscale.com> Message-ID: If you can point me/others towards some documentation on the system/api's, perhaps some of my/our researchers might be interested... 2010/7/8 "C. Bergstr?m" : > > Hi all > > We recently announced our new ENZO gpu solution for Nvidia hardware and now > working on performance for various benchmarks. ?If anyone has a small gpu > cluster or is planning to have one in the future we're looking for beta > testers who can share kernels and give good feedback. ?Instead of a > CUDA/OpenCL front-end we've opted for HMPP pragma approach which offers a > lot of great benefits, but please contact me off list for more details on > that. > > We're specifically interested to work with anyone or group who really wants > to push the efficiency/performance of HPL (lapack). > > Thanks > > Christopher > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From lathama at gmail.com Wed Jul 7 07:22:40 2010 From: lathama at gmail.com (Andrew Latham) Date: Wed, 7 Jul 2010 10:22:40 -0400 Subject: [Beowulf] diskless cluster questions In-Reply-To: References: Message-ID: Tools like DHCP can manage information for individual nodes as a server. The nodes can be identified by the network card MAC address. Things like IP, Netmask, Router, DNS, Hostname and other options can be set per MAC identifier. A great example is the mass deployment of VoIP Hardphones that ask a central server for a configuration based on the MAC address. ~ Andrew "lathama" Latham lathama at gmail.com * Learn more about OSS http://en.wikipedia.org/wiki/Open-source_software * Learn more about Linux http://en.wikipedia.org/wiki/Linux * Learn more about Tux http://en.wikipedia.org/wiki/Tux On Mon, Jul 5, 2010 at 1:36 AM, Holden Dapenor wrote: > How does diskless clustering work for those aspects of the OS that need to > be unique for each node? For instance, network configuration and hostfiles > need to be specified somewhere, but if all nodes boot the same root, then > where is this information stored? 
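For example, with ISC dhcpd a per-node entry might look roughly like the block below (an untested sketch -- the host name, MAC and addresses are made up, and the next-server/filename lines only matter if the node also PXE boots from that server):

    host node001 {
        hardware ethernet 00:1a:2b:3c:4d:5e;
        fixed-address 10.1.0.1;
        option host-name "node001";
        next-server 10.1.0.254;
        filename "pxelinux.0";
    }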
> > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > From scrusan at UR.Rochester.edu Wed Jul 7 08:15:46 2010 From: scrusan at UR.Rochester.edu (Steve Crusan) Date: Wed, 07 Jul 2010 11:15:46 -0400 Subject: [Beowulf] diskless cluster questions In-Reply-To: Message-ID: > then once you power them off if im not mistaken any data is sent back to the master for storage You should lose all of your changes if your OS is kept in RAM once you power off the node, reboot it, etc. On 7/7/10 10:33 AM, "Jonathan Aquilina" wrote: > > its actually easier if you use one os you know how to work with that way. the > os doesnt need to be unique for each node, but? you want the slave nodes to > have as much ram as possible. also you will need to use pxe to boot off the > master node as well as tftp to transfer the information from master to slaves. > when goign diskless all information on slaves is stored in ram. then once you > power them off if im not mistaken any data is sent back to the master for > storage > ? >> For instance, network configuration and hostfiles need to be specified >> somewhere, but if all nodes boot the same root, then where is this >> information stored? >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > > ---------------------- Steve Crusan System Administrator Center for Research Computing University of Rochester https://www.crc.rochester.edu/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From vallard at benincosa.com Wed Jul 7 09:09:59 2010 From: vallard at benincosa.com (Vallard Benincosa) Date: Wed, 7 Jul 2010 09:09:59 -0700 Subject: [Beowulf] diskless cluster questions In-Reply-To: References: Message-ID: With diskless clusters you also need to be aware of the many ways to do it: - RAM root - where all of the OS is loaded in memory - NFS root - which is what a lot of people seem to call diskless - RamRoot/NFS root hybrid - where some directories like /root live on RAM and /usr lives on NFS for example. We really like RAM root for HPC because you can make a small image (150-300MB with InfiniBand) that is portable, has great performance and is easy to reproduce and update. The disadvantages are if you run multiple applications where some library may not be in the image. In that sense, using NFS root works better for those environments that run a lot of different applications. However, many large clusters I have worked with are dedicated to one single application and ram root fits the bill perfectly. Like Ashley said all the host name info is configured via DHCP. Many people also put arguments in the PXE boot file to help specify additional parameters. I think the old Red Hat stateless did NFSROOT= for example. I have also seen many other homegrown ones where they throw everything but the kitchen sink in as arguments. In addition for configuring other devices (like InfiniBand IP addresses) instead of just IPADDR=10.3.0.201 in the config file there would be some script: IPADDR=10.3.0.$(`hostname` | sed 's/node//') These seem to be the tricks I see on doing this. You may also want to look into two projects that do stateless/diskless booting: xCAT and Perceus. 
Both of them allow for all three methods described above. There may be others as well. Hope that helps some what. On Wed, Jul 7, 2010 at 8:02 AM, John Hearns wrote: > On 5 July 2010 06:36, Holden Dapenor wrote: > > How does diskless clustering work for those aspects of the OS that need > to > > be unique for each node? > > As Ashley says, you use DHCP for the network configuration. > There is very little else you should need to configure differently on > each individual host - for instance batch scheduler systems store > information on > batch nodes in a central place. All the node needs to do is be > configured to know its batch master, and to start the batch system > daemon, then > wait for the jobs to come in. > Any changes to (say) /etc/pbs.conf are generally made to all the > cluster nodes identically. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Vallard http://sumavi.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From akshar.bhosale at gmail.com Wed Jul 7 12:01:59 2010 From: akshar.bhosale at gmail.com (akshar bhosale) Date: Thu, 8 Jul 2010 00:31:59 +0530 Subject: [Beowulf] shutting down pbs server and maui for half an hour will affect running jobs? Message-ID: hi, we have maintenance of pbs server so it is going down for half an hour ..will it affect running jobs?where is the timeout defined?can it be increased? on pbs mom side or pbs server side we need to change?any other parameter we need to check ?will it hold the already running jobs for half an hour? what care should we take in order to avoid jobs not getting killed?we have torque installed -------------- next part -------------- An HTML attachment was scrubbed... URL: From douglas.guptill at dal.ca Fri Jul 9 09:43:13 2010 From: douglas.guptill at dal.ca (Douglas Guptill) Date: Fri, 9 Jul 2010 13:43:13 -0300 Subject: [Beowulf] first cluster [was [OMPI users] trouble using openmpi under slurm] In-Reply-To: <4C35D614.1090607@ldeo.columbia.edu> References: <5CF41CDB-39F0-477C-B6D6-4F2E50BE6909@open-mpi.org> <2F04DA66-FE62-4131-8F8A-DAFB69668C46@open-mpi.org> <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> Message-ID: <20100709164313.GA25062@sopalepc> On Thu, Jul 08, 2010 at 09:43:48AM -0400, Gus Correa wrote: > Douglas Guptill wrote: >> On Wed, Jul 07, 2010 at 12:37:54PM -0600, Ralph Castain wrote: >> >>> No....afraid not. Things work pretty well, but there are places >>> where things just don't mesh. Sub-node allocation in particular is >>> an issue as it implies binding, and slurm and ompi have conflicting >>> methods. >>> >>> It all can get worked out, but we have limited time and nobody cares >>> enough to put in the effort. Slurm just isn't used enough to make it >>> worthwhile (too small an audience). >> >> I am about to get my first HPC cluster (128 nodes), and was >> considering slurm. We do use MPI. >> >> Should I be looking at Torque instead for a queue manager? >> > Hi Douglas > > Yes, works like a charm along with OpenMPI. > I also have MVAPICH2 and MPICH2, no integration w/ Torque, > but no conflicts either. Thanks, Gus. 
After some lurking and reading, I plan this: Debian (lenny) + fai - for compute-node operating system install + Torque - job scheduler/manager + MPI (Intel MPI) - for the application + MPI (OpenMP) - alternative MPI Does anyone see holes in this plan? Thanks, Douglas -- Douglas Guptill voice: 902-461-9749 Research Assistant, LSC 4640 email: douglas.guptill at dal.ca Oceanography Department fax: 902-494-3877 Dalhousie University Halifax, NS, B3H 4J1, Canada From douglas.guptill at dal.ca Fri Jul 9 13:57:26 2010 From: douglas.guptill at dal.ca (Douglas Guptill) Date: Fri, 9 Jul 2010 17:57:26 -0300 Subject: [Beowulf] first cluster In-Reply-To: References: <2F04DA66-FE62-4131-8F8A-DAFB69668C46@open-mpi.org> <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> Message-ID: <20100709205726.GA7313@sopalepc> On Fri, Jul 09, 2010 at 02:19:53PM -0400, Mark Hahn wrote: >> Debian (lenny) > > why? centos is generally considered the safest choice, > unless you're religiously committed to debian. Almost religiously. I have found it a very stable platform for everything up to clusters. If Debian fails to do the job, CentoOS is my backup plan. >> + fai - for compute-node operating system install > > do you explicitly want a diskful install? I think there's pretty wide > consensus that nfs root (or at least net-loaded ram image) clusters are > better. Thank you for that opinion, which is new to me. I believe fai can do a variety of install-types, including diskful, and nfs root. But then, I am still in the planning stage, and have no practical experience. Thanks, Douglas. -- Douglas Guptill voice: 902-461-9749 Research Assistant, LSC 4640 email: douglas.guptill at dal.ca Oceanography Department fax: 902-494-3877 Dalhousie University Halifax, NS, B3H 4J1, Canada From rpnabar at gmail.com Fri Jul 9 14:00:16 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 9 Jul 2010 16:00:16 -0500 Subject: [Beowulf] Question about maui scheduler and reservations on a node: logical AND or OR Message-ID: If there are twin reservations set for the same timespan on a certain node do they get ANDed or ORed? setres -u userfoo -s '+5' -d '10:00:00' node1 setres -u userbar -s '+5' -d '10:00:00' node1 Will userfoo have access to the node or userbar or neither? Or is it the first reservation that is always active? Actual situation: Due to the way funding and priorities work on our cluster there are weeks in which I am supposed to give exclusive access to a certain user on a certain node. But the downside is that sometimes for debugging or maintainance etc. I might have to use that same node to run system jobs from a "maintainance" user. It would be convenient of there was a way to tell the scheduler "Reserve node foo for use by either foouser or baruser" A related question: showres shows me reservations but doesn't indicate what nodes these have been made for. e.g. stotz.58 User - -7:02:35:04 54:01:21:17 61:03:56:21 1/8 Fri Jul 2 13:03:39 stotz.59 User - -7:02:35:04 54:01:21:17 61:03:56:21 1/8 Fri Jul 2 13:03:39 But is there a way to know what node stotz.58 is active for? PS. I had asked the first question a few weeks ago on the MAUI list but received no replies hence I thought I should check if anyone on this list has a tip. Sorry if someone gets the question twice! 
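(If I am reading the Maui admin docs right, setres seems to accept a colon-delimited user list for a single reservation, which might sidestep the AND/OR question entirely -- something like the line below, though I have not tested it and may be misremembering the syntax:

    setres -u userfoo:userbar -s '+5' -d '10:00:00' node1

If that works, one reservation would grant access to both users instead of stacking two on the same node.)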
-- Rahul From gus at ldeo.columbia.edu Fri Jul 9 16:06:05 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 09 Jul 2010 19:06:05 -0400 Subject: [Beowulf] first cluster [was [OMPI users] trouble using openmpi under slurm] In-Reply-To: <20100709164313.GA25062@sopalepc> References: <5CF41CDB-39F0-477C-B6D6-4F2E50BE6909@open-mpi.org> <2F04DA66-FE62-4131-8F8A-DAFB69668C46@open-mpi.org> <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> Message-ID: <4C37AB5D.9050308@ldeo.columbia.edu> Douglas Guptill wrote: > On Thu, Jul 08, 2010 at 09:43:48AM -0400, Gus Correa wrote: >> Douglas Guptill wrote: >>> On Wed, Jul 07, 2010 at 12:37:54PM -0600, Ralph Castain wrote: >>> >>>> No....afraid not. Things work pretty well, but there are places >>>> where things just don't mesh. Sub-node allocation in particular is >>>> an issue as it implies binding, and slurm and ompi have conflicting >>>> methods. >>>> >>>> It all can get worked out, but we have limited time and nobody cares >>>> enough to put in the effort. Slurm just isn't used enough to make it >>>> worthwhile (too small an audience). >>> I am about to get my first HPC cluster (128 nodes), and was >>> considering slurm. We do use MPI. >>> >>> Should I be looking at Torque instead for a queue manager? >>> >> Hi Douglas >> >> Yes, works like a charm along with OpenMPI. >> I also have MVAPICH2 and MPICH2, no integration w/ Torque, >> but no conflicts either. > > Thanks, Gus. > > After some lurking and reading, I plan this: > Debian (lenny) > + fai - for compute-node operating system install > + Torque - job scheduler/manager > + MPI (Intel MPI) - for the application > + MPI (OpenMP) - alternative MPI > > Does anyone see holes in this plan? > > Thanks, > Douglas Hi Douglas I never used Debian, fai, or Intel MPI. We have two clusters with cluster management software, i.e., mostly the operating system install stuff. I made a toy Rocks cluster out of old computers. Rocks is a minimum-hassle way to deploy and maintain a cluster. Of course you can do the same from scratch, or do more, or do better, which makes some people frown at Rocks. However, Rocks works fine, particularly if your network(s) is (are) Gigabit Ethernet, and if you don't mix different processor architectures (i.e. only i386 or only x86_64, although there is some support for mixed stuff). It is developed/maintained by UCSD under an NSF grant (I think). It's been around for quite a while too. You may want to take a look, perhaps experiment with a subset of your nodes before you commit: http://www.rocksclusters.org/wordpress/ There is a decent user guide: http://www.rocksclusters.org/roll-documentation/base/5.3/ and additional documentation/tutorials: http://www.rocksclusters.org/wordpress/?page_id=4 The basic software comes in what they call "rolls". The (default) OS is actually CentOS. They only support a few "Red-Hat-type" distributions (IIRR, RHEL and Scientific Linux), but CentOS is fine. You could use the mandatory rolls (Kernel/Boot, Core, OS disks 1,2. I would suggest installing all OS disks, so as to have any packages that you may need later on. In addition, there a roll with Torque+Maui that you can get from the Univ. of Tromso, Norway: ftp://ftp.uit.no/pub/linux/rocks/torque-roll/ If you want to install Torque, *don't install the SGE (Sun Grid Engine) roll*. 
It is either one resource manager or the other (they're incompatible). I am a big fan and old user of Torque, so my bias is to recommend Torque, but other people prefer SGE. The basic software takes care of compute node installation, administration of user accounts, etc. It can be customized in several ways (e.g. if you have two networks, one for MPI, another for cluster control and I/O, which I would recommend). It also includes a basic web page for your cluster (via Wordpress), which you can also customize, and very nice web-based monitoring of your nodes through Ganglia. It also has support for upgrades, and they tend to come up with a new release once a year or so. There is also a large user base and an active mailing list: https://lists.sdsc.edu/mailman/listinfo/npaci-rocks-discussion http://marc.info/?l=npaci-rocks-discussion You can build OpenMPI (and MPICH2) from source, with any/all your favorite compilers, and install any compilers and all external software (even Matlab, if you are so inclined, or your users demand) in a NFS mounted directory (typically /share/apps in Rocks), so as to make them accessible by the compute nodes. You could do the same for, say, NetCDF libraries and utilities (NCO, NCL), etc. What is the interconnect/network hardware you have for MPI? Gigabit Ethernet? Infiniband? Myrinet? Other? If Gigabit Ethernet Rocks won't have any problem. If Infiniband you may need to add the OFED packages, but they may come with CentOS now, I am not sure. If Myrinet, I am not sure, Myrinet provided a Rocks roll up to Rocks 5.0, but I am not sure about the current status (Rocks is now 5.3). If you are going to handle a variety of different compilers, MPI flavors, with various versions, etc, I recommend using the "Environment module" package. It is a quite convenient (and consistent) way to allow users to switch from one environment to another, change compilers, MPI, etc, allowing good flexibility. You can install "environment modules" separately (say via yum or RPM) with no compatibility issues whatsoever with Rocks: http://modules.sourceforge.net/ I hope this helps. Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- From hahn at mcmaster.ca Fri Jul 9 16:11:18 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 9 Jul 2010 19:11:18 -0400 (EDT) Subject: [Beowulf] first cluster[B In-Reply-To: <20100709205726.GA7313@sopalepc> References: <2F04DA66-FE62-4131-8F8A-DAFB69668C46@open-mpi.org> <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> Message-ID: >>> Debian (lenny) >> >> why? centos is generally considered the safest choice, >> unless you're religiously committed to debian. > > Almost religiously. I have found it a very stable platform for > everything up to clusters. OK. you should know that the stability comes from linux itself and the underlying user-level packages, which have nothing to do with the distro (any of them). > I believe fai can do a variety of install-types, including diskful, > and nfs root. But then, I am still in the planning stage, and have no > practical experience. well, the thing about nfs root is that there's almost no installation, per se. 
if you wanted, you could boot the nodes off a live master's root filesystem. normally, master and node images are kept mostly separate, though, because it's handy to avoid entangling them (ie, you may not want mysql-server installed on compute nodes, but only on the master, etc. or just different versions.) From samuel at unimelb.edu.au Sun Jul 11 22:53:26 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Mon, 12 Jul 2010 15:53:26 +1000 Subject: [Beowulf] shutting down pbs server and maui for half an hour willaffect running jobs? In-Reply-To: References: Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 08/07/10 05:01, akshar bhosale wrote: > we have maintenance of pbs server so it is going down > for half an hour ..will it affect running jobs? It shouldn't, though if they finish during that time they may not get full information logged about their state in the pbs_server logs. cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw6rdUACgkQO2KABBYQAh/SSACfQYsyhsW51WwgBMwqbk2ILEui xrIAnR/vtUWXioE1g5OgY++sPmjHaXWa =XZw+ -----END PGP SIGNATURE----- -------------- next part -------------- An HTML attachment was scrubbed... URL: From douglas.guptill at dal.ca Mon Jul 12 10:02:34 2010 From: douglas.guptill at dal.ca (Douglas Guptill) Date: Mon, 12 Jul 2010 14:02:34 -0300 Subject: [Beowulf] first cluster In-Reply-To: References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> Message-ID: <20100712170234.GB6134@sopalepc> Ah Ha. I see the point of a non-diskful, or nfs root, install for the compute nodes. One image to update/change, instead of a whole bunch. Thanks, Douglas. On Fri, Jul 09, 2010 at 07:11:18PM -0400, Mark Hahn wrote: > well, the thing about nfs root is that there's almost no installation, > per se. if you wanted, you could boot the nodes off a live master's > root filesystem. normally, master and node images are kept mostly > separate, though, because it's handy to avoid entangling them > (ie, you may not want mysql-server installed on compute nodes, but only > on the master, etc. or just different versions.) And from Steve Crusan: > As for the diskfull install, netbooting and statelite (NFS root) solutions > are very easy to scale and customize. Diskfull installs seem to be less > flexible in IMO. 
-- Douglas Guptill voice: 902-461-9749 Research Assistant, LSC 4640 email: douglas.guptill at dal.ca Oceanography Department fax: 902-494-3877 Dalhousie University Halifax, NS, B3H 4J1, Canada From gus at ldeo.columbia.edu Mon Jul 12 12:02:40 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 12 Jul 2010 15:02:40 -0400 Subject: [Beowulf] first cluster In-Reply-To: <20100712170234.GB6134@sopalepc> References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> Message-ID: <4C3B66D0.2080204@ldeo.columbia.edu> Hi Doug Consider disk for: A) swap space (say, if the user programs are large, or you can't buy a lot of RAM, etc); I wonder if swapping over NFS would be efficient for HPC. Disk may be a simple and cost effective solution. B) input/output data files that your application programs may require (if they already work in stagein-stageout mode, or if they do I/O so often that a NFS mounted file system may get overwhelmed, hence reading/writing on local disk may be preferred). C) Would diskless scaling be a real big advantage for a small/medium size cluster, say up to ~200 nodes? D) Most current node chassis have hot-swappable disks, not hard to replace, in case of failure. E) booting when the NFS root server is not reachable Disks don't prevent one to keep a single image and distribute it consistently across nodes, do they? In any case, I suppose you could both have disks and boot with NFS root. But if you have disks, is there really a point in doing so? I guess there are old threads about this in the list archives. Just some thoughts. Gus Correa Douglas Guptill wrote: > Ah Ha. I see the point of a non-diskful, or nfs root, install for the > compute nodes. One image to update/change, instead of a whole bunch. > > Thanks, > Douglas. > > On Fri, Jul 09, 2010 at 07:11:18PM -0400, Mark Hahn wrote: > >> well, the thing about nfs root is that there's almost no installation, >> per se. if you wanted, you could boot the nodes off a live master's >> root filesystem. normally, master and node images are kept mostly >> separate, though, because it's handy to avoid entangling them >> (ie, you may not want mysql-server installed on compute nodes, but only >> on the master, etc. or just different versions.) > > And from Steve Crusan: > >> As for the diskfull install, netbooting and statelite (NFS root) solutions >> are very easy to scale and customize. Diskfull installs seem to be less >> flexible in IMO. > > From bill at Princeton.EDU Mon Jul 12 13:28:29 2010 From: bill at Princeton.EDU (Bill Wichser) Date: Mon, 12 Jul 2010 16:28:29 -0400 Subject: [Beowulf] IB problem with openmpi 1.2.8 Message-ID: <4C3B7AED.9080606@princeton.edu> Machine is an older Intel Woodcrest cluster with a two tiered IB infrastructure with Topspin/Cisco 7000 switches. The core switch is a SFS-7008P with a single management module which runs the SM manager. The cluster runs RHEL4 and was upgraded last week to kernel 2.6.9-89.0.26.ELsmp. The openib-1.4 remained the same. Pretty much stock. After rebooting, the IB cards in the nodes remained in the INIT state. I rebooted the chassis IB switch as it appeared that no SM was running. No help. I manually started an opensm on a compute node telling it to ignore other masters as initially it would only come up in STANDBY. 
This turned all the nodes' IB ports to active and I thought that I was done. ibdiagnet complained that there were two masters. So I killed the opensm and now it was happy. osmtest -f c/osmtest -f a comes back with OSMTEST: TEST "All Validations" PASS. ibdiagnet -ls 2.5 -lw 4x finds all my switches and nodes with everything coming up roses. The problem is that openmpi 1.2.8 with Intel 11.1.074 fails when the node count goes over 32 (or maybe 40). This worked fine in the past, before the reboot. User apps are failing as well as IMB v3.2. I've increased the timeout using the "mpiexec -mca btl_openib_ib_timeout 20" which helped for 48 nodes but when increasing to 64 and 128 it didn't help at all. Typical error message follow. Right now I am stuck. I'm not sure what or where the problem might be. Nor where to go next. If anyone has a clue, I'd appreciate hearing it! Thanks, Bill typical error messages [0,1,33][btl_openib_component.c:1371:btl_openib_component_progress] from woodhen-050 to: woodhen-036 error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 182937592248 opcode 0 [0,1,36][btl_openib_component.c:1371:btl_openib_component_progress] from woodhen-084 to: woodhen-085 error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 5840952 opcode 0 [0,1,40][btl_openib_component.c:1371:btl_openib_component_progress] from woodhen-098 to: woodhen-096 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 182947573944 opcode 0 -------------------------------------------------------------------------- The InfiniBand retry count between two MPI processes has been exceeded. "Retry count" is defined in the InfiniBand spec 1.2 (section 12.7.38): The total number of times that the sender wishes the receiver to retry timeout, packet sequence, etc. errors before posting a completion error. This error typically means that there is something awry within the InfiniBand fabric itself. You should note the hosts on which this error has occurred; it has been observed that rebooting or removing a particular host from the job can sometimes resolve this issue. Two MCA parameters can be used to control Open MPI's behavior with respect to the retry count: * btl_openib_ib_retry_count - The number of times the sender will attempt to retry (defaulted to 7, the maximum value). * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted to 10). The actual timeout value used is calculated as: 4.096 microseconds * (2^btl_openib_ib_timeout) See the InfiniBand spec 1.2 (section 12.7.34) for more details. -------------------------------------------------------------------------- -------------------------------------------------------------------------- DIFFERENT RUN: [0,1,92][btl_openib_component.c:1371:btl_openib_component_progress] from woodhen-157 to: woodhen-081 error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 183541169080 opcode 0 ... 
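To put the formula quoted in that error text into concrete numbers, here is the spec arithmetic worked through in a few lines of C (it is nothing more than arithmetic - no Open MPI or verbs calls - and the retry count of 7 is the default named above):

#include <stdio.h>

/* Effective InfiniBand local ACK timeout, per the formula quoted in the
   error help text: 4.096 microseconds * 2^btl_openib_ib_timeout.
   btl_openib_ib_retry_count (default 7) multiplies how long a QP keeps
   trying before RETRY EXCEEDED is reported. */
int main(void)
{
    const double base_us = 4.096;   /* microseconds, from the IB spec */
    const int    retries = 7;       /* btl_openib_ib_retry_count default */
    const int    exps[3] = { 10, 14, 20 };

    for (int i = 0; i < 3; i++) {
        double per_try_s = base_us * 1e-6 * (double)(1u << exps[i]);
        printf("timeout=%2d -> %g s per attempt, about %g s total across %d retries\n",
               exps[i], per_try_s, per_try_s * retries, retries);
    }
    return 0;
}

At the default of 10 each attempt waits only about 4 ms, so seven retries are exhausted in well under a second; at 20 each attempt is allowed roughly 4.3 s. If a fabric genuinely needs seconds to deliver an ACK, raising the timeout is masking a problem (subnet manager, a marginal cable or port, a congested core link) rather than curing it - which matches the advice in the quoted help text.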
From samuel at unimelb.edu.au Mon Jul 12 18:21:26 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 13 Jul 2010 11:21:26 +1000 Subject: [Beowulf] first cluster In-Reply-To: <4C3B66D0.2080204@ldeo.columbia.edu> References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 13/07/10 05:02, Gus Correa wrote: > I wonder if swapping over NFS would be efficient for HPC. There are out of tree patches for swap over NFS (and I've seen assertions that SuSE SLES 11 includes it) which has been doing the rounds for a few years now (originally by Peter Zijlstra but now maintained by Suresh Jayaraman) and appears to have last been updated October 2009. http://www.suse.de/~sjayaraman/patches/swap-over-nfs/ The last post (I could find) for it was here, it includes a diffstat to show which parts of the kernel are touched: http://lwn.net/Articles/355350/ This posting of Peter's from 2007 explains a bit more about the patches and why it is a hard problem: http://lwn.net/Articles/256462/ My personal feeling is "here be dragons". ;-) cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw7v5YACgkQO2KABBYQAh9J8wCffmBVB8cbBRTCYSAq6XGqBEdB ngEAnjWKlxtsA9ok7YJvdtX8cTCGl6FL =3XC7 -----END PGP SIGNATURE----- -------------- next part -------------- An HTML attachment was scrubbed... URL: From samuel at unimelb.edu.au Mon Jul 12 18:43:42 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 13 Jul 2010 11:43:42 +1000 Subject: [Beowulf] shutting down pbs server and maui for half an hour will affect running jobs? In-Reply-To: References: Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 13/07/10 01:21, akshar bhosale wrote: > Thanks for your information, but do i need to change > anything for increasing timeout if i dont want to kill > running jobs.. If you have jobs that will hit their walltime whilst the server is down then they will get killed by the pbs_mom unless you extend their walltime first. cheers! Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw7xM4ACgkQO2KABBYQAh9SyACeOfjdCNER7W6GHr53Pm8JN7ks U+EAnjtF9WvQfFxDU8raIMyPDx7nmXkq =ISQy -----END PGP SIGNATURE----- -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpnabar at gmail.com Mon Jul 12 21:04:48 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 12 Jul 2010 23:04:48 -0500 Subject: [Beowulf] Network problem: Why are ARP discovery requests sent to specific addresses instead of a broadcast domain Message-ID: I am puzzled by a bunch of ARP requests on my network that I captured using tcpdump. Shouldn't ARP discovery requests always be sent to a broadcast address? 
I have requests of the type below which seemingly are addressed to a specific mAC address. 00:26:b9:58:d7:2f > 00:26:b9:58:eb:b8, ARP, length 42: arp who-has 10.0.0.36 tell 10.0.3.2 00:26:b9:58:eb:b8 > 00:26:b9:58:d7:2f, ARP, length 60: arp reply 10.0.0.36 is-at 00:26:b9:58:eb:b8 Now if mumble:d7:2f already knew that mumble:eb:b8 was 10.0.0.36 (which it indeed is) then why would it send out an ARP discovery request? I can see that something doesn't make sense here but I cannot figure out what is causing the problem. Any ideas? Has anyone seen this before? -- Rahul From patrick at myri.com Mon Jul 12 21:25:16 2010 From: patrick at myri.com (Patrick Geoffray) Date: Tue, 13 Jul 2010 00:25:16 -0400 Subject: [Beowulf] Network problem: Why are ARP discovery requests sent to specific addresses instead of a broadcast domain In-Reply-To: References: Message-ID: <4C3BEAAC.6000306@myri.com> Rahul, On 7/13/2010 12:04 AM, Rahul Nabar wrote: > I am puzzled by a bunch of ARP requests on my network that I captured > using tcpdump. Shouldn't ARP discovery requests always be sent to a > broadcast address? No, the kernel regularly refreshes the entries in the ARP cache with unicast requests. If that fails, then it sends the expensive broadcasts. Patrick From rpnabar at gmail.com Mon Jul 12 21:29:13 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 12 Jul 2010 23:29:13 -0500 Subject: [Beowulf] first cluster In-Reply-To: <4C3B66D0.2080204@ldeo.columbia.edu> References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> Message-ID: On Mon, Jul 12, 2010 at 2:02 PM, Gus Correa wrote: > Consider disk for: > > A) swap space (say, if the user programs are large, > or you can't buy a lot of RAM, etc); Out of curiosity, is there the possibility of running a "swapless" compute-node? I mean most HPC nodes already have fairly generous RAM and once swapping to disk starts performance is degraded (severely?). Are there non-problem scenarios where one does desire swapping to disks? > D) Most current node chassis have hot-swappable disks, not hard to replace, > in case of failure. Hot-swappable disks are great on head nodes but on compute-nodes whenever I hear "redundant" or "hot swappable", I see it as an inefficiency. Or a excessive feature that could be traded off for a cost saving. (of course, sometimes hands are tied if the server comes with that feature "standard") What do others think? -- Rahul From rpnabar at gmail.com Mon Jul 12 21:48:47 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 12 Jul 2010 23:48:47 -0500 Subject: [Beowulf] Network problem: Why are ARP discovery requests sent to specific addresses instead of a broadcast domain In-Reply-To: <4C3BEAAC.6000306@myri.com> References: <4C3BEAAC.6000306@myri.com> Message-ID: On Mon, Jul 12, 2010 at 11:25 PM, Patrick Geoffray wrote: > Rahul, > > On 7/13/2010 12:04 AM, Rahul Nabar wrote: >> >> I am puzzled by a bunch of ARP requests on my network that I captured >> using tcpdump. Shouldn't ARP discovery requests always be sent to a >> broadcast address? > > No, the kernel regularly refreshes the entries in the ARP cache with unicast > requests. If that fails, then it sends the expensive broadcasts. Thanks Patrick. I wasn't aware of this. 
I guess it makes sense now that I found the correct section of the RFP (http://tools.ietf.org/html/rfc1122#page-22). I see the converse situation too: Some ARP replies are being sent to a broadcast domain instead of a single MAC. Is that normal too? 00:26:b9:58:e5:9f > ff:ff:ff:ff:ff:ff, ARP, length 60: arp reply 172.16.0.29 is-at 00:26:b9:58:e5:9f 00:26:b9:56:38:71 > ff:ff:ff:ff:ff:ff, ARP, length 60: arp reply 172.16.0.14 is-at 00:26:b9:56:38:71 I'd have (naively) expected these replies to go to the specific MAC which had issued an ARP request on 172.16.0.29 or 172.16.0.14. -- Rahul From tom.ammon at utah.edu Mon Jul 12 22:04:30 2010 From: tom.ammon at utah.edu (Tom Ammon) Date: Mon, 12 Jul 2010 23:04:30 -0600 Subject: [Beowulf] Network problem: Why are ARP discovery requests sent to specific addresses instead of a broadcast domain In-Reply-To: References: <4C3BEAAC.6000306@myri.com> Message-ID: <4C3BF3DE.2030002@utah.edu> This is called a gratuitous ARP. Used to update the ARP caches of other nodes. On 07/12/2010 10:48 PM, Rahul Nabar wrote: > On Mon, Jul 12, 2010 at 11:25 PM, Patrick Geoffray wrote: > >> Rahul, >> >> On 7/13/2010 12:04 AM, Rahul Nabar wrote: >> >>> I am puzzled by a bunch of ARP requests on my network that I captured >>> using tcpdump. Shouldn't ARP discovery requests always be sent to a >>> broadcast address? >>> >> No, the kernel regularly refreshes the entries in the ARP cache with unicast >> requests. If that fails, then it sends the expensive broadcasts. >> > Thanks Patrick. I wasn't aware of this. I guess it makes sense now > that I found the correct section of the RFP > (http://tools.ietf.org/html/rfc1122#page-22). > > I see the converse situation too: Some ARP replies are being sent to a > broadcast domain instead of a single MAC. Is that normal too? > > 00:26:b9:58:e5:9f> ff:ff:ff:ff:ff:ff, ARP, length 60: arp reply > 172.16.0.29 is-at 00:26:b9:58:e5:9f > 00:26:b9:56:38:71> ff:ff:ff:ff:ff:ff, ARP, length 60: arp reply > 172.16.0.14 is-at 00:26:b9:56:38:71 > > I'd have (naively) expected these replies to go to the specific MAC > which had issued an ARP request on 172.16.0.29 or 172.16.0.14. > > -- -------------------------------------------------------------------- Tom Ammon Network Engineer Office: 801.587.0976 Mobile: 801.674.9273 Center for High Performance Computing University of Utah http://www.chpc.utah.edu From samuel at unimelb.edu.au Mon Jul 12 22:55:23 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 13 Jul 2010 15:55:23 +1000 Subject: [Beowulf] first cluster In-Reply-To: References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 13/07/10 14:29, Rahul Nabar wrote: > Out of curiosity, is there the possibility of running > a "swapless" compute-node? Yes of course, it just means that the kernel no longer has the option of paging out infrequently accessed dirty pages to free space for active processes. cheers! 
Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw7/8sACgkQO2KABBYQAh+AOQCgjIxb/CaTsElcZi0bTiKfjTns tn8AoISVbA8hwJgwFIs/rADfIJN8FBg/ =NT3M -----END PGP SIGNATURE----- -------------- next part -------------- An HTML attachment was scrubbed... URL: From samuel at unimelb.edu.au Mon Jul 12 22:56:41 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 13 Jul 2010 15:56:41 +1000 Subject: [Beowulf] shutting down pbs server and maui for half an hour will affect running jobs? In-Reply-To: References: Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 13/07/10 13:12, akshar bhosale wrote: > thanks..any other related info ? Not that comes to mind, I'm afraid! - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw8ABkACgkQO2KABBYQAh/GPACfel9pCS4/clqgwCFCDB2Fv+pS Vp4AnR9dxZHjxKpnQMaFjIZxE6t8XaQf =IvbE -----END PGP SIGNATURE----- -------------- next part -------------- An HTML attachment was scrubbed... URL: From reuti at staff.uni-marburg.de Tue Jul 13 05:07:27 2010 From: reuti at staff.uni-marburg.de (Reuti) Date: Tue, 13 Jul 2010 14:07:27 +0200 Subject: [Beowulf] first cluster In-Reply-To: References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> Message-ID: <158EF01F-2DAD-4E44-948F-6A5D4B21236E@staff.uni-marburg.de> Am 13.07.2010 um 06:29 schrieb Rahul Nabar: > On Mon, Jul 12, 2010 at 2:02 PM, Gus Correa wrote: >> Consider disk for: >> >> A) swap space (say, if the user programs are large, >> or you can't buy a lot of RAM, etc); > > Out of curiosity, is there the possibility of running a "swapless" > compute-node? I mean most HPC nodes already have fairly generous RAM > and once swapping to disk starts performance is degraded (severely?). > Are there non-problem scenarios where one does desire swapping to > disks? As already said: yes, it's possible and you can even switch swap on and off during normal operation (`swapon` and `swapoff`). Disadvantage is of course, when the system runs out of memory the oom-killer will look for an eligible process to be killed to free up some space. As you mentioned, the application should fit into the physical installed RAM, and you may just want 2 GB or so as a last resort to swap out parts of the OS which are currently not in use. You may want more swap, when you want to setup some kind of preemption using a job scheduler. E.g. GridEngine can suspend a low priority job once a urgent one comes in, but resources like memory are not freed automatically (the job is still on the node - you would need some kind of checkpointing to free the node completely). 
When you setup the queuing system that all running applications fit into physical memory, the swap of the suspended application is a one time issue and won't affect the ongoing computation. >> D) Most current node chassis have hot-swappable disks, not hard to replace, >> in case of failure. > > Hot-swappable disks are great on head nodes but on compute-nodes > whenever I hear "redundant" or "hot swappable", I see it as an > inefficiency. Or a excessive feature that could be traded off for a > cost saving. (of course, sometimes hands are tied if the server comes > with that feature "standard") What do others think? Correct. Often it's included in chassis as default, although you can't make much use of it when you use a e.g. RAID0 on a node for performance reasons and have to reinstall the node anyway. It will just avoid that you have to switch off the node completely and remove it from the rack to access the inner parts of the node. But there might been chassis, where you can access the drive from the front w/o hot-swap capability but with a big label: don't remove under operation. -- Reuti > > -- > Rahul > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Glen.Beane at jax.org Tue Jul 13 05:09:46 2010 From: Glen.Beane at jax.org (Glen Beane) Date: Tue, 13 Jul 2010 08:09:46 -0400 Subject: [Beowulf] first cluster In-Reply-To: Message-ID: On 7/13/10 12:29 AM, "Rahul Nabar" wrote: > On Mon, Jul 12, 2010 at 2:02 PM, Gus Correa wrote: >> Consider disk for: >> >> A) swap space (say, if the user programs are large, >> or you can't buy a lot of RAM, etc); > > Out of curiosity, is there the possibility of running a "swapless" > compute-node? I mean most HPC nodes already have fairly generous RAM > and once swapping to disk starts performance is degraded (severely?). at a previous job (seems like a million years ago) we had a Top-500 cluster with completely diskless compute nodes - no local disk for swap or /tmp space. My current cluster has a small amount of swap on each node (~1GB) and we avoid swapping. Our attitude still is swapping is not an option - it is an indication that we should be decomposing our problem further and distributing it across more nodes. This cluster happens to have a local OS install. We're deploying a new cluster in the next month or so based on the 8-core magny-cours with 128GB RAM per node. The nodes will network boot but we have local disks for /tmp and a very small amount of swap. -- Glen L. Beane Software Engineer The Jackson Laboratory Phone (207) 288-6153 From douglas.guptill at dal.ca Tue Jul 13 08:57:57 2010 From: douglas.guptill at dal.ca (Douglas Guptill) Date: Tue, 13 Jul 2010 12:57:57 -0300 Subject: [Beowulf] Re: Beowulf Digest, Vol 77, Issue 14 In-Reply-To: References: <201007091648.o69Glf7R016032@bluewest.scyld.com> Message-ID: <20100713155757.GA18339@sopalepc> On Tue, Jul 13, 2010 at 05:11:46PM +0200, Ivan Rossi wrote: > On Fri, 9 Jul 2010, beowulf-request at beowulf.org wrote: > >> After some lurking and reading, I plan this: >> Debian (lenny) >> + fai - for compute-node operating system install >> + Torque - job scheduler/manager > > we did a similar cluster for a client and they wanted torque. > IMHO torque sucks. pbs_moms are no-so-stable and configuration is a pain. > consider sun grid engine which is also available within debian lenny Interesting. 
Thanks for the suggestion. >> + MPI (Intel MPI) - for the application > > only if you use intel compilers, otherwise go openMPI Yes, we will be using Intel compilers. Douglas. -- Douglas Guptill voice: 902-461-9749 Research Assistant, LSC 4640 email: douglas.guptill at dal.ca Oceanography Department fax: 902-494-3877 Dalhousie University Halifax, NS, B3H 4J1, Canada From bill at Princeton.EDU Tue Jul 13 13:09:20 2010 From: bill at Princeton.EDU (Bill Wichser) Date: Tue, 13 Jul 2010 16:09:20 -0400 Subject: [Beowulf] IB problem with openmpi 1.2.8 In-Reply-To: <4C3B7AED.9080606@princeton.edu> References: <4C3B7AED.9080606@princeton.edu> Message-ID: <4C3CC7F0.1090308@princeton.edu> Just some more info. Went back to the prior kernel with no luck. Updated the firmware on the Topspin HBA cards to the latest (final) version (fw-25208-4_8_200-MHEL-CF128-T). Nothing changes. Still not sure where to look. Bill Wichser wrote: > Machine is an older Intel Woodcrest cluster with a two tiered IB > infrastructure with Topspin/Cisco 7000 switches. The core switch is a > SFS-7008P with a single management module which runs the SM manager. > The cluster runs RHEL4 and was upgraded last week to kernel > 2.6.9-89.0.26.ELsmp. The openib-1.4 remained the same. Pretty much > stock. > > After rebooting, the IB cards in the nodes remained in the INIT > state. I rebooted the chassis IB switch as it appeared that no SM was > running. No help. I manually started an opensm on a compute node > telling it to ignore other masters as initially it would only come up > in STANDBY. This turned all the nodes' IB ports to active and I > thought that I was done. > > ibdiagnet complained that there were two masters. So I killed the > opensm and now it was happy. osmtest -f c/osmtest -f a comes back > with OSMTEST: TEST "All Validations" PASS. > ibdiagnet -ls 2.5 -lw 4x finds all my switches and nodes with > everything coming up roses. > > The problem is that openmpi 1.2.8 with Intel 11.1.074 fails when the > node count goes over 32 (or maybe 40). This worked fine in the past, > before the reboot. User apps are failing as well as IMB v3.2. I've > increased the timeout using the "mpiexec -mca btl_openib_ib_timeout > 20" which helped for 48 nodes but when increasing to 64 and 128 it > didn't help at all. Typical error message follow. > > Right now I am stuck. I'm not sure what or where the problem might > be. Nor where to go next. If anyone has a clue, I'd appreciate > hearing it! > > Thanks, > Bill > > > typical error messages > > [0,1,33][btl_openib_component.c:1371:btl_openib_component_progress] > from woodhen-050 to: woodhen-036 error polling HP CQ with status RETRY > EXCEEDED ERROR status number 12 for wr_id 182937592248 opcode 0 > [0,1,36][btl_openib_component.c:1371:btl_openib_component_progress] > from woodhen-084 to: woodhen-085 error polling HP CQ with status RETRY > EXCEEDED ERROR status number 12 for wr_id 5840952 opcode 0 > [0,1,40][btl_openib_component.c:1371:btl_openib_component_progress] > from woodhen-098 to: woodhen-096 error polling LP CQ with status RETRY > EXCEEDED ERROR status number 12 for wr_id 182947573944 opcode 0 > -------------------------------------------------------------------------- > > The InfiniBand retry count between two MPI processes has been > exceeded. "Retry count" is defined in the InfiniBand spec 1.2 > (section 12.7.38): > > The total number of times that the sender wishes the receiver to > retry timeout, packet sequence, etc. errors before posting a > completion error. 
> > This error typically means that there is something awry within the > InfiniBand fabric itself. You should note the hosts on which this > error has occurred; it has been observed that rebooting or removing a > particular host from the job can sometimes resolve this issue. > > Two MCA parameters can be used to control Open MPI's behavior with > respect to the retry count: > > * btl_openib_ib_retry_count - The number of times the sender will > attempt to retry (defaulted to 7, the maximum value). > > * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted > to 10). The actual timeout value used is calculated as: > > 4.096 microseconds * (2^btl_openib_ib_timeout) > > See the InfiniBand spec 1.2 (section 12.7.34) for more details. > -------------------------------------------------------------------------- > > -------------------------------------------------------------------------- > > > DIFFERENT RUN: > > [0,1,92][btl_openib_component.c:1371:btl_openib_component_progress] > from woodhen-157 to: woodhen-081 error polling HP CQ with status RETRY > EXCEEDED ERROR status number 12 for wr_id 183541169080 opcode 0 > ... > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From rpnabar at gmail.com Tue Jul 13 13:32:30 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 13 Jul 2010 15:32:30 -0500 Subject: [Beowulf] Network problem: Why are ARP discovery requests sent to specific addresses instead of a broadcast domain In-Reply-To: <4C3BF3DE.2030002@utah.edu> References: <4C3BEAAC.6000306@myri.com> <4C3BF3DE.2030002@utah.edu> Message-ID: On Tue, Jul 13, 2010 at 12:04 AM, Tom Ammon wrote: > This is called a gratuitous ARP. Used to update the ARP caches of other > nodes. Thanks Tom. It is curious that most of my gratuitous ARP is coming from my IPMI interface and not my main eth stack. Not sure why. Maybe the Dell IPMI just is more aggressive about it. -- Rahul From prentice at ias.edu Tue Jul 13 13:50:06 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 13 Jul 2010 16:50:06 -0400 Subject: [Beowulf] IB problem with openmpi 1.2.8 In-Reply-To: <4C3CC7F0.1090308@princeton.edu> References: <4C3B7AED.9080606@princeton.edu> <4C3CC7F0.1090308@princeton.edu> Message-ID: <4C3CD17E.8060708@ias.edu> Bill, Have you checked the health of the cables themselves? It could just be dumb luck that a hardware failure coincided with a software change, didn't manifest itself until the reboot of the nodes. Did you reboot the switches, too? I would try dividing your cluster into small sections and see if the problem exists across the sections. Can you disconnect the edge switches from the core switch, so that each edge switch is it's own, isolated fabric? If so, you could then start an sm on each fabric and see if the problem is on every smaller IB fabric, or just one. The other option would be to disconnect all the nodes and add them back one by one, but that wouldn't catch a problem with a switch-to-switch connection. How big is the cluster? Would it take hours or days to test each node like this? You say the problem occurs when the node count goes over 32 (or 40) do you mean 32 physical nodes, or 32 processors. How does your scheduler assign nodes? Would those 32 nodes always be in the same rack or on the same IB switch, but not when the count increases? Prentice Bill Wichser wrote: > Just some more info. 
Went back to the prior kernel with no luck. > Updated the firmware on the Topspin HBA cards to the latest (final) > version (fw-25208-4_8_200-MHEL-CF128-T). Nothing changes. Still not > sure where to look. > > Bill Wichser wrote: >> Machine is an older Intel Woodcrest cluster with a two tiered IB >> infrastructure with Topspin/Cisco 7000 switches. The core switch is a >> SFS-7008P with a single management module which runs the SM manager. >> The cluster runs RHEL4 and was upgraded last week to kernel >> 2.6.9-89.0.26.ELsmp. The openib-1.4 remained the same. Pretty much >> stock. >> >> After rebooting, the IB cards in the nodes remained in the INIT >> state. I rebooted the chassis IB switch as it appeared that no SM was >> running. No help. I manually started an opensm on a compute node >> telling it to ignore other masters as initially it would only come up >> in STANDBY. This turned all the nodes' IB ports to active and I >> thought that I was done. >> >> ibdiagnet complained that there were two masters. So I killed the >> opensm and now it was happy. osmtest -f c/osmtest -f a comes back >> with OSMTEST: TEST "All Validations" PASS. >> ibdiagnet -ls 2.5 -lw 4x finds all my switches and nodes with >> everything coming up roses. >> >> The problem is that openmpi 1.2.8 with Intel 11.1.074 fails when the >> node count goes over 32 (or maybe 40). This worked fine in the past, >> before the reboot. User apps are failing as well as IMB v3.2. I've >> increased the timeout using the "mpiexec -mca btl_openib_ib_timeout >> 20" which helped for 48 nodes but when increasing to 64 and 128 it >> didn't help at all. Typical error message follow. >> >> Right now I am stuck. I'm not sure what or where the problem might >> be. Nor where to go next. If anyone has a clue, I'd appreciate >> hearing it! >> >> Thanks, >> Bill >> >> >> typical error messages >> >> [0,1,33][btl_openib_component.c:1371:btl_openib_component_progress] >> from woodhen-050 to: woodhen-036 error polling HP CQ with status RETRY >> EXCEEDED ERROR status number 12 for wr_id 182937592248 opcode 0 >> [0,1,36][btl_openib_component.c:1371:btl_openib_component_progress] >> from woodhen-084 to: woodhen-085 error polling HP CQ with status RETRY >> EXCEEDED ERROR status number 12 for wr_id 5840952 opcode 0 >> [0,1,40][btl_openib_component.c:1371:btl_openib_component_progress] >> from woodhen-098 to: woodhen-096 error polling LP CQ with status RETRY >> EXCEEDED ERROR status number 12 for wr_id 182947573944 opcode 0 >> -------------------------------------------------------------------------- >> >> The InfiniBand retry count between two MPI processes has been >> exceeded. "Retry count" is defined in the InfiniBand spec 1.2 >> (section 12.7.38): >> >> The total number of times that the sender wishes the receiver to >> retry timeout, packet sequence, etc. errors before posting a >> completion error. >> >> This error typically means that there is something awry within the >> InfiniBand fabric itself. You should note the hosts on which this >> error has occurred; it has been observed that rebooting or removing a >> particular host from the job can sometimes resolve this issue. >> >> Two MCA parameters can be used to control Open MPI's behavior with >> respect to the retry count: >> >> * btl_openib_ib_retry_count - The number of times the sender will >> attempt to retry (defaulted to 7, the maximum value). >> >> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted >> to 10). 
The actual timeout value used is calculated as: >> >> 4.096 microseconds * (2^btl_openib_ib_timeout) >> >> See the InfiniBand spec 1.2 (section 12.7.34) for more details. >> -------------------------------------------------------------------------- >> >> -------------------------------------------------------------------------- >> >> >> DIFFERENT RUN: >> >> [0,1,92][btl_openib_component.c:1371:btl_openib_component_progress] >> from woodhen-157 to: woodhen-081 error polling HP CQ with status RETRY >> EXCEEDED ERROR status number 12 for wr_id 183541169080 opcode 0 >> ... >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Prentice Bisbal Linux Software Support Specialist/System Administrator School of Natural Sciences Institute for Advanced Study Princeton, NJ From bill at Princeton.EDU Tue Jul 13 14:50:35 2010 From: bill at Princeton.EDU (Bill Wichser) Date: Tue, 13 Jul 2010 17:50:35 -0400 Subject: [Beowulf] IB problem with openmpi 1.2.8 In-Reply-To: <4C3CD17E.8060708@ias.edu> References: <4C3B7AED.9080606@princeton.edu> <4C3CC7F0.1090308@princeton.edu> <4C3CD17E.8060708@ias.edu> Message-ID: <4C3CDFAB.2000309@princeton.edu> On 7/13/2010 4:50 PM, Prentice Bisbal wrote: > Bill, > > Have you checked the health of the cables themselves? It could just be > dumb luck that a hardware failure coincided with a software change, > didn't manifest itself until the reboot of the nodes. Did you reboot the > switches, too? > Just looked at all the lights and they all seem fine. > I would try dividing your cluster into small sections and see if the > problem exists across the sections. > > Can you disconnect the edge switches from the core switch, so that each > edge switch is it's own, isolated fabric? If so, you could then start an > sm on each fabric and see if the problem is on every smaller IB fabric, > or just one. > I've thought about this one. Non-trivial. I have a core switch connecting 12 leaf switches. Each switch connects to 16 nodes. I need to use that core switch in order to make the problem appear. > The other option would be to disconnect all the nodes and add them back > one by one, but that wouldn't catch a problem with a switch-to-switch > connection. > > How big is the cluster? Would it take hours or days to test each node > like this? > > 192 nodes (8 cores each). > You say the problem occurs when the node count goes over 32 (or 40) do > you mean 32 physical nodes, or 32 processors. How does your scheduler > assign nodes? Would those 32 nodes always be in the same rack or on the > same IB switch, but not when the count increases? > It starts failing at 48 nodes. PBS allocates as least loaded, round robin fashion. But sequentially, minus the PVFS nodes, which are distributed throughout the cluster and allocated last in round robin. The 32 nodes definately go through the core. And it never seems to matter where. I've tried to pinpoint some nodes by keeping lists but this happens everywhere. I was hoping that some tool I'm not aware of exists but apparently not. 
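Failing a ready-made tool, a pairwise MPI exchange along the lines below can stand in for one: run a single rank per node and let every pair of ranks do a round trip while the remaining ranks wait at a barrier, so the last pair reported is the pair - and therefore the leaf/core path - that tripped RETRY EXCEEDED. This is only a sketch under a few assumptions: the 1 MB message size and the file name are arbitrary, the one-rank-per-node launch is left to a hostfile or the batch system, and nothing beyond a working MPI library is assumed.

/* pairtest.c - walk every rank pair and do a round-trip message between them.
   Build with mpicc, launch with one rank per node (e.g. a hostfile that lists
   each node once).  When a pair hangs or aborts with RETRY EXCEEDED, the last
   "ok" line printed names the suspect pair. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_BYTES (1 << 20)   /* 1 MB - arbitrary, big enough to leave the eager path */

int main(int argc, char **argv)
{
    int rank, size, hostlen;
    char host[MPI_MAX_PROCESSOR_NAME];
    char *sbuf = malloc(MSG_BYTES), *rbuf = malloc(MSG_BYTES);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &hostlen);
    memset(sbuf, rank & 0xff, MSG_BYTES);

    for (int i = 0; i < size - 1; i++) {
        for (int j = i + 1; j < size; j++) {
            if (rank == i || rank == j) {
                int peer = (rank == i) ? j : i;
                MPI_Sendrecv(sbuf, MSG_BYTES, MPI_BYTE, peer, 0,
                             rbuf, MSG_BYTES, MPI_BYTE, peer, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (rank == i) {
                    printf("ok %4d <-> %4d (reported from %s)\n", i, j, host);
                    fflush(stdout);
                }
            }
            /* everyone steps together, so the last "ok" marks the failing pair */
            MPI_Barrier(MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    free(sbuf);
    free(rbuf);
    return 0;
}

At 192 nodes that is roughly 18,000 pairs, so expect it to take a few minutes on an otherwise idle machine.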
My next attempt may be to pull the management card from the core and just run opensm on nodes themselves, like we do for other clusters. But I can test with osmtest all day and never get errors. This makes me feel very uncomfortable! Of course, nothing is under warranty anymore. Divide and conquer seems like the only solution. Thanks, Bill > Prentice > > > > Bill Wichser wrote: > >> Just some more info. Went back to the prior kernel with no luck. >> Updated the firmware on the Topspin HBA cards to the latest (final) >> version (fw-25208-4_8_200-MHEL-CF128-T). Nothing changes. Still not >> sure where to look. >> >> Bill Wichser wrote: >> >>> Machine is an older Intel Woodcrest cluster with a two tiered IB >>> infrastructure with Topspin/Cisco 7000 switches. The core switch is a >>> SFS-7008P with a single management module which runs the SM manager. >>> The cluster runs RHEL4 and was upgraded last week to kernel >>> 2.6.9-89.0.26.ELsmp. The openib-1.4 remained the same. Pretty much >>> stock. >>> >>> After rebooting, the IB cards in the nodes remained in the INIT >>> state. I rebooted the chassis IB switch as it appeared that no SM was >>> running. No help. I manually started an opensm on a compute node >>> telling it to ignore other masters as initially it would only come up >>> in STANDBY. This turned all the nodes' IB ports to active and I >>> thought that I was done. >>> >>> ibdiagnet complained that there were two masters. So I killed the >>> opensm and now it was happy. osmtest -f c/osmtest -f a comes back >>> with OSMTEST: TEST "All Validations" PASS. >>> ibdiagnet -ls 2.5 -lw 4x finds all my switches and nodes with >>> everything coming up roses. >>> >>> The problem is that openmpi 1.2.8 with Intel 11.1.074 fails when the >>> node count goes over 32 (or maybe 40). This worked fine in the past, >>> before the reboot. User apps are failing as well as IMB v3.2. I've >>> increased the timeout using the "mpiexec -mca btl_openib_ib_timeout >>> 20" which helped for 48 nodes but when increasing to 64 and 128 it >>> didn't help at all. Typical error message follow. >>> >>> Right now I am stuck. I'm not sure what or where the problem might >>> be. Nor where to go next. If anyone has a clue, I'd appreciate >>> hearing it! >>> >>> Thanks, >>> Bill >>> >>> >>> typical error messages >>> >>> [0,1,33][btl_openib_component.c:1371:btl_openib_component_progress] >>> from woodhen-050 to: woodhen-036 error polling HP CQ with status RETRY >>> EXCEEDED ERROR status number 12 for wr_id 182937592248 opcode 0 >>> [0,1,36][btl_openib_component.c:1371:btl_openib_component_progress] >>> from woodhen-084 to: woodhen-085 error polling HP CQ with status RETRY >>> EXCEEDED ERROR status number 12 for wr_id 5840952 opcode 0 >>> [0,1,40][btl_openib_component.c:1371:btl_openib_component_progress] >>> from woodhen-098 to: woodhen-096 error polling LP CQ with status RETRY >>> EXCEEDED ERROR status number 12 for wr_id 182947573944 opcode 0 >>> -------------------------------------------------------------------------- >>> >>> The InfiniBand retry count between two MPI processes has been >>> exceeded. "Retry count" is defined in the InfiniBand spec 1.2 >>> (section 12.7.38): >>> >>> The total number of times that the sender wishes the receiver to >>> retry timeout, packet sequence, etc. errors before posting a >>> completion error. >>> >>> This error typically means that there is something awry within the >>> InfiniBand fabric itself. 
You should note the hosts on which this >>> error has occurred; it has been observed that rebooting or removing a >>> particular host from the job can sometimes resolve this issue. >>> >>> Two MCA parameters can be used to control Open MPI's behavior with >>> respect to the retry count: >>> >>> * btl_openib_ib_retry_count - The number of times the sender will >>> attempt to retry (defaulted to 7, the maximum value). >>> >>> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted >>> to 10). The actual timeout value used is calculated as: >>> >>> 4.096 microseconds * (2^btl_openib_ib_timeout) >>> >>> See the InfiniBand spec 1.2 (section 12.7.34) for more details. >>> -------------------------------------------------------------------------- >>> >>> -------------------------------------------------------------------------- >>> >>> >>> DIFFERENT RUN: >>> >>> [0,1,92][btl_openib_component.c:1371:btl_openib_component_progress] >>> from woodhen-157 to: woodhen-081 error polling HP CQ with status RETRY >>> EXCEEDED ERROR status number 12 for wr_id 183541169080 opcode 0 >>> ... >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >>> To change your subscription (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >>> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> >> > From douglas.guptill at dal.ca Tue Jul 13 15:05:38 2010 From: douglas.guptill at dal.ca (Douglas Guptill) Date: Tue, 13 Jul 2010 19:05:38 -0300 Subject: [Beowulf] first cluster [was [OMPI users] trouble using openmpi under slurm] In-Reply-To: <4C37AB5D.9050308@ldeo.columbia.edu> References: <2F04DA66-FE62-4131-8F8A-DAFB69668C46@open-mpi.org> <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <4C37AB5D.9050308@ldeo.columbia.edu> Message-ID: <20100713220538.GA15163@sopalepc> Hello Gus, list: On Fri, Jul 09, 2010 at 07:06:05PM -0400, Gus Correa wrote: > Douglas Guptill wrote: >> On Thu, Jul 08, 2010 at 09:43:48AM -0400, Gus Correa wrote: >>> Douglas Guptill wrote: >>>> On Wed, Jul 07, 2010 at 12:37:54PM -0600, Ralph Castain wrote: >>>> >>>>> No....afraid not. Things work pretty well, but there are places >>>>> where things just don't mesh. Sub-node allocation in particular is >>>>> an issue as it implies binding, and slurm and ompi have conflicting >>>>> methods. >>>>> >>>>> It all can get worked out, but we have limited time and nobody cares >>>>> enough to put in the effort. Slurm just isn't used enough to make it >>>>> worthwhile (too small an audience). >>>> I am about to get my first HPC cluster (128 nodes), and was >>>> considering slurm. We do use MPI. >>>> >>>> Should I be looking at Torque instead for a queue manager? >>>> >>> Hi Douglas >>> >>> Yes, works like a charm along with OpenMPI. >>> I also have MVAPICH2 and MPICH2, no integration w/ Torque, >>> but no conflicts either. >> >> Thanks, Gus. 
>> >> After some lurking and reading, I plan this: >> Debian (lenny) >> + fai - for compute-node operating system install >> + Torque - job scheduler/manager >> + MPI (Intel MPI) - for the application >> + MPI (OpenMP) - alternative MPI >> >> Does anyone see holes in this plan? >> >> Thanks, >> Douglas > > > Hi Douglas > > I never used Debian, fai, or Intel MPI. > > We have two clusters with cluster management software, i.e., > mostly the operating system install stuff. > > I made a toy Rocks cluster out of old computers. > Rocks is a minimum-hassle way to deploy and maintain a cluster. > Of course you can do the same from scratch, or do more, or do better, > which makes some people frown at Rocks. > However, Rocks works fine, particularly if your network(s) > is (are) Gigabit Ethernet, > and if you don't mix different processor architectures (i.e. only i386 > or only x86_64, although there is some support for mixed stuff). > It is developed/maintained by UCSD under an NSF grant (I think). > It's been around for quite a while too. > > You may want to take a look, perhaps experiment with a subset of your > nodes before you commit: > > http://www.rocksclusters.org/wordpress/ I am sure Rocks suits many, but not me, at first glance. I am too much of a tinkerer. That comes, partially, from starting this business too earlier; my first computer was a Univac II - vacuum tubes, no operating system. > What is the interconnect/network hardware you have for MPI? > Gigabit Ethernet? Infiniband? Myrinet? Other? Infiniband - QLogic 12300-BS18 > If Infiniband you may need to add the OFED packages, Gotcha. Thanks. > If you are going to handle a variety of different compilers, MPI > flavors, with various versions, etc, I recommend using the > "Environment module" package. My one user has requested that. > I hope this helps. A Big help. Much appreciated. Douglas. -- Douglas Guptill voice: 902-461-9749 Research Assistant, LSC 4640 email: douglas.guptill at dal.ca Oceanography Department fax: 902-494-3877 Dalhousie University Halifax, NS, B3H 4J1, Canada From samuel at unimelb.edu.au Tue Jul 13 23:27:03 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Wed, 14 Jul 2010 16:27:03 +1000 Subject: [Beowulf] first cluster In-Reply-To: <158EF01F-2DAD-4E44-948F-6A5D4B21236E@staff.uni-marburg.de> References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> <158EF01F-2DAD-4E44-948F-6A5D4B21236E@staff.uni-marburg.de> Message-ID: <4C3D58B7.7070506@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 13/07/10 22:07, Reuti wrote: > Disadvantage is of course, when the system runs out of > memory the oom-killer will look for an eligible process > to be killed to free up some space. That assumes that you are permitting your compute nodes to overcommit their memory, if you disable overcommit I believe that you will instead just get malloc()'s failing when there is nothing for them to grab. 
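A tiny C probe makes that difference easy to see on a scratch node - with strict accounting (vm.overcommit_memory=2) malloc() starts returning NULL once the commit limit is reached, while under the default heuristic mode the same loop sails past RAM+swap and it is the OOM killer that ends things when the pages are touched. The 256 MB chunk size is an arbitrary choice, and this is something to run only on a node you do not mind upsetting:

/* overcommit_probe.c - grab memory in 256 MB chunks until malloc() says no.
   Under vm.overcommit_memory=2 malloc returns NULL once Committed_AS would
   pass CommitLimit; under the default heuristic overcommit the loop keeps
   "succeeding" and the OOM killer intervenes when the memset touches the
   pages.  The memory is deliberately never freed - that is the point. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const size_t chunk = 256UL * 1024 * 1024;
    size_t total = 0;

    for (;;) {
        void *p = malloc(chunk);
        if (p == NULL) {
            printf("malloc failed after %zu MB reserved\n", total >> 20);
            return 0;
        }
        memset(p, 1, chunk);   /* touch the pages so they really count */
        total += chunk;
        printf("%zu MB\n", total >> 20);
    }
}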
cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw9WLYACgkQO2KABBYQAh9jPwCfSFVGCEuLc9kDuNnkpeTmcL7e MfQAniejMy14z5xZO2wyiE6QAkQzfH2W =YKCa -----END PGP SIGNATURE----- From samuel at unimelb.edu.au Tue Jul 13 23:31:07 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Wed, 14 Jul 2010 16:31:07 +1000 Subject: [Beowulf] Re: Beowulf Digest, Vol 77, Issue 14 In-Reply-To: <20100713155757.GA18339@sopalepc> References: <201007091648.o69Glf7R016032@bluewest.scyld.com> <20100713155757.GA18339@sopalepc> Message-ID: <4C3D59AB.8020004@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 14/07/10 01:57, Douglas Guptill wrote: > On Tue, Jul 13, 2010 at 05:11:46PM +0200, Ivan Rossi wrote: [...] >>> >> + MPI (Intel MPI) - for the application >> > >> > only if you use intel compilers, otherwise go openMPI > > Yes, we will be using Intel compilers. Even so I'd suggest benchmarking between OMPI and Intel MPI to see which does better by your application. cheers! Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw9WasACgkQO2KABBYQAh9dUACeMesGS4FUk57nU11A8roOSye2 IpYAn2Du/FP8VgoqG6O96cywwTuH8Qk5 =YQKp -----END PGP SIGNATURE----- From beat at 0x1b.ch Tue Jul 13 22:04:48 2010 From: beat at 0x1b.ch (Beat Rubischon) Date: Wed, 14 Jul 2010 07:04:48 +0200 Subject: [Beowulf] Network problem: Why are ARP discovery requests sent to specific addresses instead of a broadcast domain In-Reply-To: Message-ID: Hello! Quoting (13.07.10 22:32): > It is curious that most of my gratuitous ARP is coming > from my IPMI interface and not my main eth stack. Not sure why. Maybe > the Dell IPMI just is more aggressive about it. This behavour could be controlled by some flags in the BMC: # ipmitool lan set 1 arp respond on # ipmitool lan set 1 arp generate on I had BMCs where gratuitous ARP was needed as the standard ARP responses were not working. To minimize the impact of the broadcasts I expanded the delay between those packages up to 127 seconds. Beat -- \|/ Beat Rubischon ( 0^0 ) http://www.0x1b.ch/~beat/ oOO--(_)--OOo--------------------------------------------------- Meine Erlebnisse, Gedanken und Traeume: http://www.0x1b.ch/blog/ From rpnabar at gmail.com Wed Jul 14 00:15:16 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 14 Jul 2010 02:15:16 -0500 Subject: [Beowulf] Network problem: Why are ARP discovery requests sent to specific addresses instead of a broadcast domain In-Reply-To: References: Message-ID: On Wed, Jul 14, 2010 at 12:04 AM, Beat Rubischon wrote: > I had BMCs where gratuitous ARP was needed as the standard ARP responses > were not working. To minimize the impact of the broadcasts I expanded the > delay between those packages up to 127 seconds. Are you with Dell servers too? I can't find any settings that control this delay. How did you do it? 
-- Rahul From beat at 0x1b.ch Wed Jul 14 00:47:17 2010 From: beat at 0x1b.ch (Beat Rubischon) Date: Wed, 14 Jul 2010 09:47:17 +0200 Subject: [Beowulf] Network problem: Why are ARP discovery requests sent to specific addresses instead of a broadcast domain In-Reply-To: Message-ID: Hi Rahul! Quoting (14.07.10 09:15): > Are you with Dell servers too? I can't find any settings that > control this delay. How did you do it? Nope. I'm working for a company selling clusters, servers and workstations. Who do you own? :-) Learn to use "ipmitool". It's generic and works with all BMCs today. Use it remotely with ipmitool -H <host> -U <user> -P <password> or locally after loading the appropriate modules modprobe ipmi_si modprobe ipmi_devintf ipmitool Beat -- \|/ Beat Rubischon ( 0^0 ) http://www.0x1b.ch/~beat/ oOO--(_)--OOo--------------------------------------------------- Meine Erlebnisse, Gedanken und Traeume: http://www.0x1b.ch/blog/ From rpnabar at gmail.com Wed Jul 14 10:05:04 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 14 Jul 2010 12:05:04 -0500 Subject: [Beowulf] Network problem: Why are ARP discovery requests sent to specific addresses instead of a broadcast domain In-Reply-To: References: Message-ID: On Wed, Jul 14, 2010 at 2:47 AM, Beat Rubischon wrote: > Learn to use "ipmitool". It's generic and works with all BMCs today. Use it > remotely with Thanks! I am already using ipmitool. Just wasn't aware this can set the ARP interval (should have read that manpage more closely). :) Found the way to do it: ipmitool -H 172.16.0.1 -U root -f ~/ipmi_pw -I lanplus lan set 1 arp interval -- Rahul From mdidomenico4 at gmail.com Thu Jul 15 17:31:13 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Thu, 15 Jul 2010 20:31:13 -0400 Subject: [Beowulf] intel mkl lapack Message-ID: Does anyone have a specific C-code example of using the zgelss function with complex numbers, which is part of the LAPACK libraries? (I'm using the Intel MKL.) I'm unable to locate a specific example for this particular function call. I followed the (terse) API docs that come with MKL, but I'm unable to figure out what I'm doing wrong in the code. From douglas.guptill at DAL.CA Thu Jul 15 17:46:28 2010 From: douglas.guptill at DAL.CA (Douglas Guptill) Date: Thu, 15 Jul 2010 21:46:28 -0300 Subject: [Beowulf] first cluster In-Reply-To: <4C3B66D0.2080204@ldeo.columbia.edu> References: <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> Message-ID: <20100716004628.GB7810@sopalepc> Hello Gus: On Mon, Jul 12, 2010 at 03:02:40PM -0400, Gus Correa wrote: > Hi Doug > > Consider disk for: > > A) swap space (say, if the user programs are large, > or you can't buy a lot of RAM, etc); > I wonder if swapping over NFS would be efficient for HPC. > Disk may be a simple and cost effective solution. We have bought enough RAM (6 GB/core) that will, I hope, prevent swapping. > B) input/output data files that your application programs may require > (if they already work in stagein-stageout mode, Now there you have me. What is stagein-stageout? > or if they do I/O so often that a NFS mounted file system > may get overwhelmed, hence reading/writing on local disk may be preferred). I am hoping to do that - write to local disk.
How to do that is still an unsolved problem at this point. The head node will have (6) 2 TB disks. > C) Would diskless scaling be a real big advantage for > a small/medium size cluster, say up to ~200 nodes? Good question. The node count is 16 (not 124, as I said previously - brain fart - 124 is the core count), and seems to me just over the border of what can be easily maintained as separate, diskful installs. Our one user has expressed a preference for "refreshing" the nodes before a job runs. By that, he means re-install the operating system. > E) booting when the NFS root server is not reachable > > Disks don't prevent one to keep a single image and distribute > it consistently across nodes, do they? I like that idea. > I guess there are old threads about this in the list archives. I looked in the beowulf archives, and only found very old (+years) articles. Is there another archive I should be looking at? > Just some thoughts. Much appreciated, Douglas. -- Douglas Guptill voice: 902-461-9749 Research Assistant, LSC 4640 email: douglas.guptill at dal.ca Oceanography Department fax: 902-494-3877 Dalhousie University Halifax, NS, B3H 4J1, Canada From hahn at mcmaster.ca Thu Jul 15 18:29:59 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 15 Jul 2010 21:29:59 -0400 (EDT) Subject: [Beowulf] first cluster In-Reply-To: <4C3D58B7.7070506@unimelb.edu.au> References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> <158EF01F-2DAD-4E44-948F-6A5D4B21236E@staff.uni-marburg.de> <4C3D58B7.7070506@unimelb.edu.au> Message-ID: >> Disadvantage is of course, when the system runs out of >> memory the oom-killer will look for an eligible process >> to be killed to free up some space. > > That assumes that you are permitting your compute nodes > to overcommit their memory, if you disable overcommit I > believe that you will instead just get malloc()'s failing > when there is nothing for them to grab. yes. actually, configuring memory and swap is an interesting topic. the feature Chris is referring to is, I think, the vm.overcommit_memory sysctl (and the associated vm.overcommit_ratio.) every distro I've seen leaves these at the default seting: vm.overcommit_memory=0. this is basically the traditional setting that tells the kernel to feel free to allocate way too much memory, and to resolve memory crunches via OOM killing. obviously, this isn't great, since it never tells apps to conserve memory (malloc returning zero), and often kills processes that you're rather not be killed (sshd, other system daemons). on clusters where a node may be shared across users/jobs, OOM can result serious collateral damage... we've used vm.overcommit_memory=2 fairly often. in this mode, the kernel limits its VM allocations to a combination of the size of ram and swap. this is reflected in /proc/meminfo:CommitLimit which will be computed as /proc/meminfo:SwapTotal + vm.overcommit_ratio * /proc/meminfo:MemTotal. /proc/meminfo:Committed_AS is the kernel's idea of total VM usage. IMO, it's essential to also run with RLIMIT_AS on all processes. this is basically a VM limit per process (not totalled across processes, though of course threads by definition share a single VM.) 
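as a concrete illustration (just a sketch; the 4 GB figure and the program name are placeholders, not recommendations):

    # with vm.overcommit_memory=2 in effect, compare commitments to the limit
    grep -E 'MemTotal|SwapTotal|CommitLimit|Committed_AS' /proc/meminfo

    # cap the address space (RLIMIT_AS) of a job from the launching shell;
    # ulimit -v takes kilobytes, so this is a 4 GB cap
    ulimit -v $((4 * 1024 * 1024))
    ./my_mpi_app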
you might be thinking that RLIMIT_RSS would be better - indeed it would, but the kernel doesn't implement it. basically, limiting RSS is a bit tricky because you have to deal with how to count shared pages, and the limiting logic is going to slow down some important hot paths. (unlike AS (vsz), which only needs logic during explicit brk/mmap/munmap ops.) of course, to be useful, this requires users to provide reasonable memory limits at job-submission time. (our user population is pretty diverse, and isn't very good at doing wallclock limits, let alone "wizardly" issues like VM footprint.) batch systems often also provide their own resource management systems. I'm not fond of putting much effort in this direction, since it's usually based on a load-balancing model (which doesn't work if job memory use fluctuates), and upon on-node daemons which are assumed to be able to stay alive long enough to kill over-large job processes. yes, one can harden such system daemons by locking them into ram, but that's not an unalloyed win: they'll probably be nontrivial in size, and such memory usage is unswapable, even if some of the pages are never used... anyway, back to the topic: it's eminently possible to run nodes without swap, and reasonably safe to do so if your user community is not totally random, and if you make smart use of vm.overcommit_memory=2 and RLIMIT_AS. 5 years ago, running swapless was somewhat risky because the kernel was dramatically better tested/tuned in a normal swap-able configuration. my guess is that the huge embedded ecosystem has made swapless more robust, especially if you take the time to configure some basic sanity limits on user processes. regards, mark hahn. From gus at ldeo.columbia.edu Thu Jul 15 20:01:03 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 15 Jul 2010 23:01:03 -0400 Subject: [Beowulf] first cluster In-Reply-To: <20100716004628.GB7810@sopalepc> References: <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> <20100716004628.GB7810@sopalepc> Message-ID: <4C3FCB6F.4040504@ldeo.columbia.edu> Hi Douglas Douglas Guptill wrote: > Hello Gus: > > On Mon, Jul 12, 2010 at 03:02:40PM -0400, Gus Correa wrote: >> Hi Doug >> >> Consider disk for: >> >> A) swap space (say, if the user programs are large, >> or you can't buy a lot of RAM, etc); >> I wonder if swapping over NFS would be efficient for HPC. >> Disk may be a simple and cost effective solution. > > We have bought enough RAM (6 GB /core) that will I hope prevent swapping. > Sure, of course swapping is a disaster for HPC, and for MPI. Your memory configuration sounds great, specially compared to my meager 2GB/core, the most we could afford. :) >> B) input/output data files that your application programs may require >> (if they already work in stagein-stageout mode, > > Now there you have me. What is stagein-stageout? > Old fashioned term for copying input files to the compute nodes before the program starts, then the output files back to the head node after the program ends. You can still find this service in Torque/PBS, maybe other resource managers, but it can also be done through scripts. >> or if they do I/O so often that a NFS mounted file system >> may get overwhelmed, hence reading/writing on local disk may be preferred). > > I am hoping to do that - write to local disk. 
Actually, we seldom do this here. Most programs we run are ocean/atmosphere/climate, with other Earth Science applications also. Since you are in oceanography (am I right?) I would guess you would be running ocean models, and they tend to do a moderate amount of I/O, or better, to have a moderate I/O-to-computation ratio. Hence, they normally don't require local disk for I/O, which can be done in a central NFS mounted directory. We don't have a big cluster, so we use Infinband (in the past it was Myrinet) for MPI and Gigabit Ethernet for control and I/O. We have a separate file server with a RAID array, where the home directories and scratch file systems live, and are NFS mounted on the nodes. I think this setup is more or less standard for small clusters. I mentioned local disk for scratch space because this was common when Ethernet 100Mb/s was the interconnect, and would barely handle MPI, so it was preferred to do I/O locally, and 'stagein/stageout' the files. On the other hand, as per several postings in this and other mailing lists, some computational chemistry and genome sequencing programs apparently do I/O so often that they cannot live without local disk, or a more expensive parallel file system. > Each node has a 1 TB > disk, which I would like to split between the OS and user space. We have much less, 80GB or 250GB disk on compute nodes, which is more than enough for the OS and the scratch space (seldom used). Somebody mentioned that you also need the local disk for /tmp, besides possible (not desirable) swap. And of course you can have local /scratch, if you want. > How > to do that is still an unsolved problem at this point. > The head node > will have (6) 2 TB disks. > Have you considered a separate storage node, NAS, whatever, with RAID, to put home directories, scratch space, and mount them on the nodes via NFS. The head node can also play this role, hosting the storage. Given your total investment, this may not be so expensive. Since you have only a few users, you could even use the head node for this, to avoid extra cost. Buy a decent Gigabit Ethernet switch (or switches), and connect this storage to it via 10Gbit Ethernet card. Most good switches have modules for that. >> C) Would diskless scaling be a real big advantage for >> a small/medium size cluster, say up to ~200 nodes? > > Good question. The node count is 16 (not 124, as I said previously - > brain fart - 124 is the core count), OK, with 16 nodes you could certainly centralize home and scratch directories in a single server (say the head node) with RAID (say, RAID6), for better performance, and mount them on the nodes via NFS, even on a Gigabit Etherenet network. (I would suggest having one network for control & I/O, another for MPI). I would rather put smaller disks on the nodes, save the money to buy a decent RAID controller, a head node chassis with hot-swappable disk bays, enterprise class SATA disks of 2TB, and you would have a central storage in the head node with, say 16-24TB (nominal), with RAID6, xfs file system, for /home, /scratch[1,2,3...], all NFS mounted on the nodes. Easier to administer than separate home directories for each user on the nodes, and probably not noticeably slower (from the user standpoint) than the local disks. I suppose this is a very common setup. You could still create local /scratch on the compute nodes, for those users that like to read/write on local disk, and perhaps have a cleanup cron script to wipe off the excess of old local /scratch files. 
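Something along these lines would do for the cleanup (a sketch only; the path, the 30-day cutoff, and the script name are arbitrary examples):

    #!/bin/sh
    # e.g. /etc/cron.daily/scratch-clean (hypothetical name): purge local
    # scratch files untouched for 30+ days, then drop any empty directories
    find /scratch -xdev -type f -atime +30 -delete
    find /scratch -xdev -mindepth 1 -type d -empty -delete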
> and seems to me just over the > border of what can be easily maintained as separate, diskful installs. > Our one user has expressed a preference for "refreshing" the nodes > before a job runs. By that, he means re-install the operating system. > Why? I reinstall when I detect a problem. Rocks (which you already declined to use :) ) reinstalls on any hard reboot or power failure, assuming that those can lead to inconsistencies across the compute nodes. This is default, but you can change that. I think that even this is too much. However, reinstalling before every new job starts sounds like washing your hands before you strike any new key on the keyboard. You can't write an email this way, and you cant extract useful work from the cluster if you have to reinstall the nodes so often. Even rebooting the node before a job starts is already too much. You can do it periodically, to refresh the system, but before every job, I never heard of anybody that does this. >> E) booting when the NFS root server is not reachable >> >> Disks don't prevent one to keep a single image and distribute >> it consistently across nodes, do they? > > I like that idea. That has been working fine here and in many many places. > >> I guess there are old threads about this in the list archives. > > I looked in the beowulf archives, and only found very old (+years) > articles. Is there another archive I should be looking at? > In general, since many discussions in this list go astray, the subject/title may have very little relation to the actual arguments in the thread. I am not criticizing this, I like it. Some of the best discussions here started with a simple question that was hijacked for a worthy cause, and turned into a completely new dimension. It is going to be hard to find anything searching the subject line. You can try to search the message bodies with keywords like "diskless", "ram disk", etc. Google advanced search may help in this regard. Unlikely that you will find much about diskless clusters in the Rocks archive, as they are diskfull clusters. However, there may have been some discussions there too. >> Just some thoughts. > > Much appreciated, > Douglas. Best of luck with your new cluster! Gus From samuel at unimelb.edu.au Thu Jul 15 21:09:36 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Fri, 16 Jul 2010 14:09:36 +1000 Subject: [Beowulf] first cluster In-Reply-To: References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> <158EF01F-2DAD-4E44-948F-6A5D4B21236E@staff.uni-marburg.de> <4C3D58B7.7070506@unimelb.edu.au> Message-ID: <4C3FDB80.5010706@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 16/07/10 11:29, Mark Hahn wrote: > every distro I've seen leaves these at the default seting: > vm.overcommit_memory=0. this is basically the traditional > setting that tells the kernel to feel free to allocate way > too much memory, and to resolve memory crunches via OOM Looking at the kernel code if you set it vm.overcommit_memory to 0 (OVERCOMMIT_GUESS) then the kernel allows *each process* to allocate up to 97% of the total of RAM+swap (the last 3% is reserved for root, or processes with CAP_SYS_ADMIN). The catch is that (as highlighted) the limit is a per process one, not a system wide one. 
With it set to 1 (OVERCOMMIT_ALWAYS) there are no checks at all, it just returns 0 (OK) so any process can allocate as much as it wants, just that you don't know who or what will get OOM'd when you want to use it.. ;-) With 2 (OVERCOMMIT_NEVER) you can never specify more than your entire RAM+swap and the limit is applied across the system. We enforce RLIMIT_AS for MPI and single CPU processes by setting pvmem limits in Torque in the default queue. That doesn't work for SMP jobs so we have an 'smp' queue for then which sets mem= instead, this means that pbs_mom monitors the children and kills them if they go over their limits. cheers! Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw/238ACgkQO2KABBYQAh88PQCfdmVZjYE2GznidzDNPOJ2zO6U DbIAnjKaviRyxIIsNVmsS3zfgbM0M7uZ =eLad -----END PGP SIGNATURE----- From rpnabar at gmail.com Thu Jul 15 21:10:31 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 15 Jul 2010 23:10:31 -0500 Subject: [Beowulf] first cluster In-Reply-To: <20100716004628.GB7810@sopalepc> References: <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> <20100716004628.GB7810@sopalepc> Message-ID: On Thu, Jul 15, 2010 at 7:46 PM, Douglas Guptill wrote: > Good question. ?The node count is 16 (not 124, as I said previously - > brain fart - 124 is the core count), and seems to me just over the > border of what can be easily maintained as separate, diskful installs. Not really the limit. We have ~300 nodes (300x8 cores) and they are all maintained "diskful" (if I understand the usage correctly). i.e. They each have a local disk for the OS and scratch space. Of course, the OS image are all essentially identical and installation is automated via PXE. -- Rahul From rpnabar at gmail.com Thu Jul 15 21:30:17 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 15 Jul 2010 23:30:17 -0500 Subject: [Beowulf] first cluster In-Reply-To: References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> <158EF01F-2DAD-4E44-948F-6A5D4B21236E@staff.uni-marburg.de> <4C3D58B7.7070506@unimelb.edu.au> Message-ID: On Thu, Jul 15, 2010 at 8:29 PM, Mark Hahn wrote: > yes. ?actually, configuring memory and swap is an interesting topic. > the feature Chris is referring to is, I think, the vm.overcommit_memory > sysctl (and the associated vm.overcommit_ratio.) ?every distro I've seen > leaves these at the default seting: vm.overcommit_memory=0. ?this is Is it possible to know how much over-committed my OS was, say in the last one day. Or at least instantaneously. I want to see how good or bad my user apps have been at requesting memory. Thus, if I were to take the strict approach of memory assignment I can know in advance if or not a lot of malloc calls are going to get a zero returned. 
> basically the traditional setting that tells the kernel to feel free > to allocate way too much memory, and to resolve memory crunches via OOM > killing. ?obviously, this isn't great, since it never tells apps to conserve > memory (malloc returning zero), and often kills processes that > you're rather not be killed (sshd, other system daemons). ?on clusters Ah! this might explain why once in a while I have a node with sshd dead. Is it possible to tell the kernel that certain processes are "privileged" and when it seeks to find random processes to kill it should not select these "privileged" processes? Some candidates that come to my mind are sshd, nagios and pbs_mom -- Rahul From samuel at unimelb.edu.au Thu Jul 15 22:31:25 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Fri, 16 Jul 2010 15:31:25 +1000 Subject: [Beowulf] first cluster In-Reply-To: References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> <158EF01F-2DAD-4E44-948F-6A5D4B21236E@staff.uni-marburg.de> <4C3D58B7.7070506@unimelb.edu.au> Message-ID: <4C3FEEAD.2030909@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 16/07/10 14:30, Rahul Nabar wrote: > Is it possible to know how much over-committed my OS was, > say in the last one day. Or at least instantaneously. I would suggest that you may not want to run your systems overcommitted, I feel that it's much nicer for an application to have a malloc() fail than for the OOM killer to get invoked. On the topic of memory usage, the Linux kernel has been (until fairly recently) rather bad at reporting that reliably (or at least usefully). There were some recent patches that improved its memory accounting and there's a tool called "smem" which gives an interesting way of looking at things (packaged in Debian and Ubuntu): http://www.selenic.com/smem/ Not sure if it'll work on RHEL 5 though, the kernel is likely too ancient for it. > Ah! this might explain why once in a while I have a node with sshd > dead. Is it possible to tell the kernel that certain processes are > "privileged" and when it seeks to find random processes to kill it > should not select these "privileged" processes? Some candidates that > come to my mind are sshd, nagios and pbs_mom You're in luck, there was an LWN article last year which touched on this: http://lwn.net/Articles/317814/ # Users and system administrators have often asked for ways to # control the behavior of the OOM killer. To facilitate control, # the /proc//oom_adj knob was introduced to save important # processes in the system from being killed, and define an order # of processes to be killed. The possible values of oom_adj # range from -17 to +15. The higher the score, more likely the # associated process is to be killed by OOM-killer. If oom_adj # is set to -17, the process is not considered for OOM-killing. cheers! 
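So a crude way to shield the daemons you listed would be something like this, run from cron or an init script (a sketch; oom_adj is the older knob, newer kernels use oom_score_adj, and writing it needs root):

    for daemon in sshd pbs_mom; do
        for pid in $(pgrep -x "$daemon"); do
            echo -17 > /proc/$pid/oom_adj   # -17 = exempt from the OOM killer
        done
    done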
Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw/7q0ACgkQO2KABBYQAh+5dwCdH7FvlO6Fv1XP0f58r1q+0cVC YV4AniFwSLScUnqkgmE/crX+htauzx2P =DnRX -----END PGP SIGNATURE----- From john.hearns at mclaren.com Fri Jul 16 02:02:53 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Fri, 16 Jul 2010 10:02:53 +0100 Subject: [Beowulf] first cluster In-Reply-To: References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org><45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org><20100707191645.GA25781@sopalepc><4C35D614.1090607@ldeo.columbia.edu><20100709164313.GA25062@sopalepc><20100709205726.GA7313@sopalepc><20100712170234.GB6134@sopalepc><4C3B66D0.2080204@ldeo.columbia.edu><158EF01F-2DAD-4E44-948F-6A5D4B21236E@staff.uni-marburg.de><4C3D58B7.7070506@unimelb.edu.au> Message-ID: <68A57CCFD4005646957BD2D18E60667B1126D59D@milexchmb1.mil.tagmclarengroup.com> > > Is it possible to know how much over-committed my OS was, say in the > last one day. Or at least instantaneously. I want to see how good or > bad my user apps have been at requesting memory. Thus, if I were to > take the strict approach of memory assignment I can know in advance if > or not a lot of malloc calls are going to get a zero returned. > I'm a bit busy this morning - Tube line was down, then my Bnew Brompton had a mechanical. Performance Copilot will give you very detailed plots of various types of memory use http://oss.sgi.com/projects/pcp/ Also worth looking at your Ganglia plots - probably easier to install. As an aside, my two pence worth on this thread. To the original poster - you have done your research on what is needed for a first cluster. Take may advice, and that of a lot of people on this list, and contact a cluster vendor in your area. You will be surprised at how competitive the price is versus sourcing the parts yourself. And remember - people who build clusters are specialists in that task, you are a specialist in oceanography. Get on with doing your science, and let the cluster people get on with building you a brilliant cluster and looking after it. John Hearns McLaren Racing The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From douglas.guptill at dal.ca Fri Jul 16 07:59:09 2010 From: douglas.guptill at dal.ca (Douglas Guptill) Date: Fri, 16 Jul 2010 11:59:09 -0300 Subject: [Beowulf] first cluster In-Reply-To: <68A57CCFD4005646957BD2D18E60667B1126D59D@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B1126D59D@milexchmb1.mil.tagmclarengroup.com> Message-ID: <20100716145909.GB9850@sopalepc> On Fri, Jul 16, 2010 at 10:02:53AM +0100, Hearns, John wrote: > As an aside, my two pence worth on this thread. I agree, the topic seems to have shifted... > To the original poster - you have done your research on what is needed > for a first cluster. > Take may advice, and that of a lot of people on this list, and contact a > cluster vendor in your area. > You will be surprised at how competitive the price is versus sourcing > the parts yourself. 
> And remember - people who build clusters are specialists in that task, > you are a specialist in oceanography. We have ordered the cluster from a local builder, and expect delivery in about 4 weeks. I am a servant of the Oceanographers; my specialty is software. Thanks for the advice. I conclude there are no magic bullet solutions. One must do research, make an educated guess, and then hold one's nose and jump in at the deep end. There is one question that perplexes me, to which I have not found an answer. How does the presence of a job scheduler interact with the ability of a user to ssh to <the head node>, ssh to <a compute node>, and then type mpirun -np 64 my_application Intuition tells me there has to be something in a cluster setup, when it has a scheduler, that prevents a user from circumventing the scheduler by doing something like the above. Any hints? > John Hearns McLaren Racing BTW, congratulations on a great season this year. Regards, Douglas. -- Douglas Guptill voice: 902-461-9749 Research Assistant, LSC 4640 email: douglas.guptill at dal.ca Oceanography Department fax: 902-494-3877 Dalhousie University Halifax, NS, B3H 4J1, Canada From dag at sonsorol.org Fri Jul 16 10:01:01 2010 From: dag at sonsorol.org (Chris Dagdigian) Date: Fri, 16 Jul 2010 13:01:01 -0400 Subject: [Beowulf] first cluster In-Reply-To: <20100716145909.GB9850@sopalepc> References: <68A57CCFD4005646957BD2D18E60667B1126D59D@milexchmb1.mil.tagmclarengroup.com> <20100716145909.GB9850@sopalepc> Message-ID: <4C40904D.2010307@sonsorol.org> You want the honest answer? There are technical things you can do to prevent users from bypassing the scheduler and resource allocation policies. One of the cooler things I've seen in Grid Engine environments was a cron job that did a "kill -9" against any user process that was not a child of a sge_shepherd daemon. Very effective. Other people play games with pam settings and the like. The honest truth is that technical countermeasures are mostly a waste of time. A motivated user always has more time and effort to spend trying to game the system than an overworked administrator. My recommendation is to subject users to a cluster acceptable use policy. Any abuses of the policy are treated as a teamwork and human resources issue. The first time you screw up you get a warning; the second time you get caught, I'll send a note to your manager. After that any abuses are treated with a loss of cluster access and a referral to human resources for further action. Simply put -- you don't have enough time in the day to deal with users who want to game/abuse the system. It's far easier for all concerned to have everyone agree on a fair use policy and treat any infractions via management rather than cluster settings. This is another reason why having a cluster governance body helps a lot. A committee of cluster power users and IT staff is a great way to get consensus on queue setup, cluster policies, disk quotas and the like. They can also come down hard with peer pressure on pissy users. my $.02 -Chris Douglas Guptill wrote: > How does the presence of a job scheduler interact with the ability of a user to > ssh to <the head node>, > ssh to <a compute node>, and then type > mpirun -np 64 my_application > > Intuition tells me there has to be something in a cluster setup, when > it has a scheduler, that prevents a user from circumventing the > scheduler by doing something like the above.
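For the record, the cron-based reaper mentioned above was conceptually something like the sketch below -- reconstructed as an illustration only, assuming regular users have UIDs >= 1000 and that all legitimate compute work is parented by sge_shepherd:

    #!/bin/bash
    # illustration only: kill -9 any regular-user process on a compute node
    # that does not have an sge_shepherd ancestor
    for pid in $(ps -eo pid=); do
        uid=$(ps -o uid= -p "$pid" 2>/dev/null | tr -d ' ')
        [ -n "$uid" ] && [ "$uid" -ge 1000 ] || continue
        p=$pid ok=0
        while [ -n "$p" ] && [ "$p" -gt 1 ]; do
            [ "$(ps -o comm= -p "$p" 2>/dev/null)" = sge_shepherd ] && { ok=1; break; }
            p=$(ps -o ppid= -p "$p" 2>/dev/null | tr -d ' ')
        done
        [ "$ok" -eq 0 ] && kill -9 "$pid"
    done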
From douglas.guptill at dal.ca Fri Jul 16 10:11:29 2010 From: douglas.guptill at dal.ca (Douglas Guptill) Date: Fri, 16 Jul 2010 14:11:29 -0300 Subject: [Beowulf] first cluster In-Reply-To: References: <20100716145909.GB9850@sopalepc> Message-ID: <20100716171129.GB10537@sopalepc> On Fri, Jul 16, 2010 at 12:51:49PM -0400, Steve Crusan wrote: > We use a PAM module (pam_torque) to stop this behavior. Basically, if you > your job isn't currently running on a node, you cannot SSH into a node. > > > http://www.rpmfind.net/linux/rpm2html/search.php?query=torque-pam > > That way one is required to use the queuing system for jobs, so the cluster > isn't like the wild wild west... Ah Ha!. The key. Thanks, Douglas. > On 7/16/10 10:59 AM, "Douglas Guptill" wrote: > > > On Fri, Jul 16, 2010 at 10:02:53AM +0100, Hearns, John wrote: > > > >> As an aside, my two pence worth on this thread. > > > > I agree, the topic seems to have shifted... > > > >> To the original poster - you have done your research on what is needed > >> for a first cluster. > >> Take may advice, and that of a lot of people on this list, and contact a > >> cluster vendor in your area. > >> You will be surprised at how competitive the price is versus sourcing > >> the parts yourself. > >> And remember - people who build clusters are specialists in that task, > >> you are a specialist in oceanography. > > > > We have ordered the cluster from a local builder, and expect delivery > > in about 4 weeks. I am a servant of the Oceanographers, my specialty > > is software. > > > > Thanks for the advice. I conclude there are no magic bullet > > solutions. One must do research, make an educated guess, and then > > hold one's nose and jump in at the deep end. > > > > There is one question that perplexes me, to which I have not found an > > answer. > > > > How does the presence of a job scheduler interact with the ability of a user > > to > > ssh to , > > ssh to , and then type > > mpirun -np 64 my_application > > > > Intuition tells me there has to be something in a cluster setup, when > > it has a scheduler, that prevents a user from circumventing the > > scheduler by doing something like the above. > > > > Any hints? > > > >> John Hearns McLaren Racing > > > > BTW, congratulations on a great season this year. > > > > Regards, > > Douglas. > > > > ---------------------- > Steve Crusan > System Administrator > Center for Research Computing > University of Rochester > https://www.crc.rochester.edu/ > > -- Douglas Guptill voice: 902-461-9749 Research Assistant, LSC 4640 email: douglas.guptill at dal.ca Oceanography Department fax: 902-494-3877 Dalhousie University Halifax, NS, B3H 4J1, Canada From prentice at ias.edu Fri Jul 16 10:13:55 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 16 Jul 2010 13:13:55 -0400 Subject: [Beowulf] first cluster In-Reply-To: <20100716145909.GB9850@sopalepc> References: <68A57CCFD4005646957BD2D18E60667B1126D59D@milexchmb1.mil.tagmclarengroup.com> <20100716145909.GB9850@sopalepc> Message-ID: <4C409353.60302@ias.edu> > > There is one question that perplexes me, to which I have not found an > answer. > > How does the presence of a job scheduler interact with the ability of a user to > ssh to , > ssh to , and then type > mpirun -np 64 my_application > > Intuition tells me there has to be something in a cluster setup, when > it has a scheduler, that prevents a user from circumventing the > scheduler by doing something like the above. That is definitely a problem that must be dealt with. 
No point in having a scheduler if everyone bypasses it. There are a few ways you could do it. Here's how I do it: My cluster mounts the same NFS file systems (/home directories, /usr/local, etc.) as all the user workstations, and our more powerful multi-user 64-bit servers with lots of RAM. We call the latter 'compute servers' (just to avoid confusion on the list with compute nodes in the cluster). The compute servers are outside the cluster network, but can communicate with the head node. I use SGE, which separates the roles of compute host, submission host, and administration host (not sure if other resource managers behave the same way). A member of an SGE cluster can be any combination of these 3 things. Our compute servers are set up as submit hosts, so users can use them to compile their programs and submit jobs, and check on the status, without ever actually logging into any cluster node at all. The SSH configuration on the head node prevents anyone other than the administrative staff from logging in, and the rest of the cluster is on a private network, so the only way to run a job is by submitting a batch job through SGE. The only drawback of this system is that users cannot request interactive jobs on my cluster, but I don't see that as a very big problem, since most cluster jobs are batch jobs anyway. Not sure if you can do the same thing with other resource managers; SGE is the only one I've used. Not sure if other resource managers still rely on rsh/ssh to start jobs. If they do, that can add some complexity in configuring the cluster nodes to allow jobs to run but disallow interactive logins. -- Prentice From john.hearns at mclaren.com Fri Jul 16 10:34:48 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Fri, 16 Jul 2010 18:34:48 +0100 Subject: [Beowulf] first cluster References: <68A57CCFD4005646957BD2D18E60667B1126D59D@milexchmb1.mil.tagmclarengroup.com> <20100716145909.GB9850@sopalepc> <4C40904D.2010307@sonsorol.org> Message-ID: <68A57CCFD4005646957BD2D18E60667B09ECFECD@milexchmb1.mil.tagmclarengroup.com> -----Original Message----- From: beowulf-bounces at beowulf.org on behalf of Chris Dagdigian This is another reason why having a cluster governance body helps a lot. A committee of cluster power users and IT staff is a great way to get consensus on queue setup, cluster policies, disk quotas and the like. Talking about disk space, 'agedu' is a great tool. http://www.chiark.greenend.org.uk/~sgtatham/agedu/ I have spent a fun day running agedu and beating users over the head. Apologies if I have already flagged this one up. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From gus at ldeo.columbia.edu Fri Jul 16 10:50:04 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 16 Jul 2010 13:50:04 -0400 Subject: [Beowulf] first cluster In-Reply-To: <4C40904D.2010307@sonsorol.org> References: <68A57CCFD4005646957BD2D18E60667B1126D59D@milexchmb1.mil.tagmclarengroup.com> <20100716145909.GB9850@sopalepc> <4C40904D.2010307@sonsorol.org> Message-ID: <4C409BCC.9040202@ldeo.columbia.edu> Chris Dagdigian wrote: > You want the honest answer? > > There are technical things you can do to prevent users from bypassing > the scheduler and resource allocation policies.
One of the cooler things > I've seen in Grid Engine environments was a cron job that did a "kill > -9" against any user process that was not a child of a sge_shepherd > daemon. Very effective. > > Other people play games with pam settings and the like. > > The honest truth is that technical countermeasures are mostly a waste of > time. A motivated user always has more time and effort to spend trying > to game the system than an overworked administrator. > > My recommendation is to subject users to a cluster acceptable use > policy. Any abuses of the policy are treated as a teamwork and human > resources issue. The first time you screw up you get a warning, the > second time you get caught I'll send a note to your manager. After that > any abuses are treated with a loss of cluster access and a referral to > human resources for further action. > > Simply put -- you don't have enough time in the day to deal with users > who want to game/abuse the system. It's far easier for all concerned to > have everyone agree on a fair use policy and treat any infractions via > management rather than cluster settings. > > This is another reason why having a cluster governance body helps a lot. > A committee of cluster power users and IT staff is a great way to get > consensus on queue setup, cluster policies, disk quotas and the like. > They can also come down hard with peer pressure on pissy users. > > my $.02 > > -Chris > > Hi Chris, Douglas, list Very wise words, and match my experience here, particularly to have a small cluster committee to share the responsibility of policies and their enforcement. As Chris said, this is a not a technical issue, this is about hacking. Resource managers rely on ssh, you can tweak with it, with IP tables, with pam, launch cron jobs to kill recalcitrant behavior, etc, to prevent some stuff, but there will always be a back door to be found by those so inclined. Also, too many restrictions on the technical side may become hurdles to legitimate use of the cluster. Since Douglas is in an university, I would suggest also that when you set up new accounts, have the user agree to the general IT policies of your university. Or, as I do, just send an email to the new user telling the account is up and adding something like this: "By accepting this account you are automatically agreeing with the general IT regulations of our university, which I encourage you to read at http://your.univ.it.regulations , and to abide by any other policies established by the cluster committee and system administrators." It is a bit like that lovely paradigm of Realpolitik: "Speak softly and carry a big stick ..." (Teddy Roosevelt) :) Gus Correa > > Douglas Guptill wrote: >> How does the presence of a job scheduler interact with the ability of >> a user to >> ssh to, >> ssh to, and then type >> mpirun -np 64 my_application >> >> Intuition tells me there has to be something in a cluster setup, when >> it has a scheduler, that prevents a user from circumventing the >> scheduler by doing something like the above. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From peter.st.john at gmail.com Fri Jul 16 11:26:41 2010 From: peter.st.john at gmail.com (Peter St. 
John) Date: Fri, 16 Jul 2010 14:26:41 -0400 Subject: [Beowulf] intel mkl lapack In-Reply-To: References: Message-ID: Michael, Hopefully someone has an example call in C, but the fortran source is pretty well commented and might be helpful: http://www.netlib.org/lapack/explore-html/a01461_source.html Peter On Thu, Jul 15, 2010 at 8:31 PM, Michael Di Domenico wrote: > Does anyone have a specific C-code example of using the zgelss > function with complex numbers, which is part of the lapack libraries. > (i'm using the intel mkl) > > I'm unable to locate a specific example for this particular function call. > > I followed the (terse) API docs that come with MKL, but I'm unable to > figure what I'm doing wrong in the code. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From samuel at unimelb.edu.au Sun Jul 18 20:27:22 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Mon, 19 Jul 2010 13:27:22 +1000 Subject: [Beowulf] first cluster In-Reply-To: <20100716171129.GB10537@sopalepc> References: <20100716145909.GB9850@sopalepc> <20100716171129.GB10537@sopalepc> Message-ID: <4C43C61A.5000800@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 17/07/10 03:11, Douglas Guptill wrote: > On Fri, Jul 16, 2010 at 12:51:49PM -0400, Steve Crusan wrote: > >> > We use a PAM module (pam_torque) to stop this behavior. Basically, if you >> > your job isn't currently running on a node, you cannot SSH into a node. >> > >> > http://www.rpmfind.net/linux/rpm2html/search.php?query=torque-pam > > Ah Ha!. The key. It's worth noting that comes with Torque, you just need to configure it with the --with-pam directive. The code is in src/pam and there's a README there too. cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkxDxhoACgkQO2KABBYQAh+3jQCeJ2VTXrFqiq3RJAShaSjG0n+d mdoAnRMlt56m/6hMdIxXabxqL+prjN5p =FNZ7 -----END PGP SIGNATURE----- From tjrc at sanger.ac.uk Mon Jul 19 01:54:28 2010 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Mon, 19 Jul 2010 09:54:28 +0100 Subject: [Beowulf] first cluster In-Reply-To: <20100716171129.GB10537@sopalepc> References: <20100716145909.GB9850@sopalepc> <20100716171129.GB10537@sopalepc> Message-ID: <553F0871-8E44-4D4C-A788-4254A5E031A5@sanger.ac.uk> On 16 Jul 2010, at 6:11 pm, Douglas Guptill wrote: > On Fri, Jul 16, 2010 at 12:51:49PM -0400, Steve Crusan wrote: >> We use a PAM module (pam_torque) to stop this behavior. Basically, if you >> your job isn't currently running on a node, you cannot SSH into a node. >> >> >> http://www.rpmfind.net/linux/rpm2html/search.php?query=torque-pam >> >> That way one is required to use the queuing system for jobs, so the cluster >> isn't like the wild wild west... > > Ah Ha!. The key. It's a very neat idea, but it has the disadvantage - unless I'm misunderstanding - that if the job fails, and leaves droppings in, say, /tmp on the cluster node, the user can't log in to diagnose things or clean up after themselves. 
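One way to take the sting out of that (a sketch, not something I'm claiming we run; the argument positions follow my reading of the Torque prologue/epilogue convention, so treat it as pseudocode) is to give every job a private scratch directory that the system itself removes:

    # prologue fragment: $1 = job id, $2 = job owner (per Torque's convention)
    mkdir -p /tmp/pbstmp.$1
    chown $2 /tmp/pbstmp.$1

    # epilogue fragment: remove it regardless of how the job ended
    rm -rf /tmp/pbstmp.$1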
Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From reuti at staff.uni-marburg.de Mon Jul 19 03:46:03 2010 From: reuti at staff.uni-marburg.de (Reuti) Date: Mon, 19 Jul 2010 12:46:03 +0200 Subject: [Beowulf] first cluster In-Reply-To: <553F0871-8E44-4D4C-A788-4254A5E031A5@sanger.ac.uk> References: <20100716145909.GB9850@sopalepc> <20100716171129.GB10537@sopalepc> <553F0871-8E44-4D4C-A788-4254A5E031A5@sanger.ac.uk> Message-ID: Am 19.07.2010 um 10:54 schrieb Tim Cutts: > > On 16 Jul 2010, at 6:11 pm, Douglas Guptill wrote: > >> On Fri, Jul 16, 2010 at 12:51:49PM -0400, Steve Crusan wrote: >>> We use a PAM module (pam_torque) to stop this behavior. Basically, if you >>> your job isn't currently running on a node, you cannot SSH into a node. >>> >>> >>> http://www.rpmfind.net/linux/rpm2html/search.php?query=torque-pam >>> >>> That way one is required to use the queuing system for jobs, so the cluster >>> isn't like the wild wild west... >> >> Ah Ha!. The key. > > It's a very neat idea, but it has the disadvantage - unless I'm misunderstanding - that if the job fails, and leaves droppings in, say, /tmp on the cluster node, the user can't log in to diagnose things or clean up after themselves. Yep. With GridEngine the $TMPDIR will be removed automatically, at least when the user honors the variable. I disable ssh and rsh in my clusters except for admin staff. Normal users can use an interactive job in SGE, which is limited to a cpu time of 60 sec., if they really want to peak on the nodes. -- Reuti > > Tim > > -- > The Wellcome Trust Sanger Institute is operated by Genome Research > Limited, a charity registered in England with number 1021457 and a > company registered in England with number 2742969, whose registered > office is 215 Euston Road, London, NW1 2BE. > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hahn at mcmaster.ca Mon Jul 19 06:47:53 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 19 Jul 2010 09:47:53 -0400 (EDT) Subject: [Beowulf] first cluster In-Reply-To: <553F0871-8E44-4D4C-A788-4254A5E031A5@sanger.ac.uk> References: <20100716145909.GB9850@sopalepc> <20100716171129.GB10537@sopalepc> <553F0871-8E44-4D4C-A788-4254A5E031A5@sanger.ac.uk> Message-ID: > It's a very neat idea, but it has the disadvantage - unless I'm >misunderstanding - that if the job fails, and leaves droppings in, say, /tmp >on the cluster node, the user can't log in to diagnose things or clean up >after themselves. my organization has ~4k users (~3-500 active at any time), and does not attempt to prevent access to compute nodes by users. it just doesn't seem like a real, worth-solving problem. heck, we have more trouble with users running jobs on _login_ nodes, rather than compute notes. (many of our systems came with a pam-slurm module which did this; we remove it.) I don't think this is at all surprising. if a user groks clusters at all, they'll know that cheating is not very effective (and not very scalable) and stands a good chance of bringing trouble. those who don't grok wind up running on the login nodes (where we have fairly tight RLIMIT_AS and CPU...) regards, mark hahn. 
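PS: the login-node limits are nothing fancy, just pam_limits, e.g. (values are made-up examples, not a recommendation):

    # /etc/security/limits.conf on a login node
    # address-space cap in KB (here 2 GB) and CPU time cap in minutes
    *    hard    as     2097152
    *    hard    cpu    60
    # wildcard entries are not applied to root; add explicit entries for
    # admin accounts if your pam_limits setup needs them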
From hahn at mcmaster.ca Tue Jul 20 09:07:32 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue, 20 Jul 2010 12:07:32 -0400 (EDT) Subject: [Beowulf] compilers vs mpi? Message-ID: Hi all, I'm interested in hearing about experiences with mixing compilers between the application and MPI. that is, I would like to be able to compile MPI (say, OpenMPI) with gcc, and expect it to work correctly with apps compiled with other compilers. I guess I'm reasoning by analogy to normal distro libs. the OpenMPI FAQ has this comment: NOTE: The Open MPI team recommends using a single compiler suite whenever possible. Unexpeced or undefined behavior can occur when you mix compiler suites in unsupported ways (e.g., mixing Fortran 77 and Fortran 90 compilers between different compiler suites is almost guaranteed not to work). and there are complaints elsewhere in the FAQ about f90 bindings. I'd appreciate it if someone could help a humble C/C++/perl hacker understand the issues here... thanks, mark hahn. PS: we have a large and diverse user base, so tend to have to support gcc, intel, pathscale and pgi. we even have people who want to use intel's damned synthetic 128b FP over MPI :( From siegert at sfu.ca Tue Jul 20 10:46:55 2010 From: siegert at sfu.ca (Martin Siegert) Date: Tue, 20 Jul 2010 10:46:55 -0700 Subject: [Beowulf] compilers vs mpi? In-Reply-To: References: Message-ID: <20100720174655.GG24917@stikine.its.sfu.ca> Hi Mark, we do exactly what you describe: compile OpenMPI with the gcc suite and then use it with gcc, intel and open64 compilers. This works out-of-the-box, almost. The problem is the f90 module mpi.mod. This is (usually) a binary file and specific to the f90 compiler that was used to compile OpenMPI. But there is a way to solve this problem: 1. compile openmpi using the gcc compilers, i.e., gfortran as the Fortran compiler and install it in /usr/local/openmpi 2. move the Fortran module to the directory /usr/local/openmpi/include/gfortran. In that directory create softlinks to the files in /usr/local/openmpi/include. 3. compile openmpi using ifort and install the Fortran module (and only the Fortran module!) in /usr/local/openmpi/include/intel. In that directory create softlinks to the files in /usr/local/openmpi/include. 4. in /usr/local/openmpi/bin create softlinks mpif90.ifort and mpif90.gfortran pointing to opal_wrapper. Remove the mpif90 softlink. 5. Move /usr/local/openmpi/share/openmpi/mpif90-wrapper-data.txt to /usr/local/openmpi/share/openmpi/mpif90.ifort-wrapper-data.txt. Change the line includedir=${includedir} to: includedir=${includedir}/intel Copy the file to /usr/local/openmpi/share/openmpi/mpif90.gfortran-wrapper-data.txt and change the line includedir=${includedir} to includedir=${includedir}/gfortran 6. Create a wrapper script /usr/local/openmpi/bin/mpif90: #!/bin/bash OMPI_WRAPPER_FC=`basename $OMPI_FC 2> /dev/null` if [ "$OMPI_WRAPPER_FC" = 'gfortran' ]; then exec $0.gfortran "$@" else exec $0.ifort "$@" fi The reason we use gfortran in step 1 is that otherwise you get those irritating error messages from the Intel libraries, cf. 
http://www.open-mpi.org/faq/?category=building#intel-compiler-wrapper-compiler-w arnings Cheers, Martin -- Martin Siegert Head, Research Computing WestGrid/ComputeCanada Site Lead IT Services phone: 778 782-4691 Simon Fraser University fax: 778 782-4242 Burnaby, British Columbia email: siegert at sfu.ca Canada V5A 1S6 On Tue, Jul 20, 2010 at 12:07:32PM -0400, Mark Hahn wrote: > Hi all, > I'm interested in hearing about experiences with mixing compilers > between the application and MPI. that is, I would like to be able > to compile MPI (say, OpenMPI) with gcc, and expect it to work correctly > with apps compiled with other compilers. I guess I'm reasoning by analogy > to normal distro libs. > > the OpenMPI FAQ has this comment: > > NOTE: The Open MPI team recommends using a single compiler suite whenever > possible. Unexpeced or undefined behavior can occur when you mix compiler > suites in unsupported ways (e.g., mixing Fortran 77 and Fortran 90 compilers > between different compiler suites is almost guaranteed not to work). > > and there are complaints elsewhere in the FAQ about f90 bindings. I'd > appreciate it if someone could help a humble C/C++/perl hacker understand > the issues here... > > thanks, mark hahn. > PS: we have a large and diverse user base, so tend to have to support gcc, > intel, pathscale and pgi. we even have people who want to use intel's > damned synthetic 128b FP over MPI :( > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From gus at ldeo.columbia.edu Tue Jul 20 10:48:11 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 20 Jul 2010 13:48:11 -0400 Subject: [Beowulf] compilers vs mpi? In-Reply-To: References: Message-ID: <4C45E15B.8020702@ldeo.columbia.edu> HI Mark Mark Hahn wrote: > Hi all, > I'm interested in hearing about experiences with mixing compilers > between the application and MPI. that is, I would like to be able > to compile MPI (say, OpenMPI) with gcc, and expect it to work correctly > with apps compiled with other compilers. I guess I'm reasoning by > analogy to normal distro libs. > I haven't built OpenMPI this way, but you may try to link statically with commercial compiler libraries (say -static-intel, -Bstatic_pgi), to avoid too much mess with the user environment, when they are use a different compiler than the one underlying the MPI wrappers. > the OpenMPI FAQ has this comment: > > NOTE: The Open MPI team recommends using a single compiler suite whenever > possible. Unexpeced or undefined behavior can occur when you mix compiler > suites in unsupported ways (e.g., mixing Fortran 77 and Fortran 90 > compilers > between different compiler suites is almost guaranteed not to work). > Yes, they do recommend compiler homogeneity. However, I have built hybrids gcc+ifort and gcc+pgf90 and both work fine. (I have the homogeneous versions also.) I do not even use a different Fortran77 compiler, it is the same as Fortran90 (F77=FC). In any case, my experience is that many applications come with such messy configure/Makefile scheme (which often times refuses to use the MPI compiler wrappers properly), that no matter what you do to provide a variety of MPI builds, there are always problems to build some applications, and you need to lend a hand to the user. > and there are complaints elsewhere in the FAQ about f90 bindings. 
In my experience they are not perfect but work reasonably well. For instance, they do not check all interfaces (or whatever F90 calls the analog of C function prototypes), but if the program is correctly written, and doesn't go OO-verboard in relying that such checks will be done, things work. Fortran77 never had these features anyway, and I guess mpif77 doesn't check if you are passing an integer where it should be a real, or if your argument list is shorter than the function requires. > I'd > appreciate it if someone could help a humble C/C++/perl hacker > understand the issues here... > > thanks, mark hahn. > PS: we have a large and diverse user base, so tend to have to support > gcc, intel, pathscale and pgi. ... and don't forget Open64! :) we even have people who want to use > intel's damned synthetic 128b FP over MPI :( It's hard to keep the customer satisfied. You give them the sky, they want the universe. I hope this helps, Gus Correa > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at pbm.com Tue Jul 20 10:52:25 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 20 Jul 2010 10:52:25 -0700 Subject: [Beowulf] compilers vs mpi? In-Reply-To: References: Message-ID: <20100720175224.GC22136@bx9.net> On Tue, Jul 20, 2010 at 12:07:32PM -0400, Mark Hahn wrote: > I'm interested in hearing about experiences with mixing compilers > between the application and MPI. that is, I would like to be able > to compile MPI (say, OpenMPI) with gcc, and expect it to work correctly > with apps compiled with other compilers. I guess I'm reasoning by > analogy to normal distro libs. Everyone's C compilers are pretty much compatible. C++ and Fortran, not at all. -- greg From prentice at ias.edu Tue Jul 20 11:05:51 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 20 Jul 2010 14:05:51 -0400 Subject: [Beowulf] compilers vs mpi? In-Reply-To: References: Message-ID: <4C45E57F.2070802@ias.edu> Mark Hahn wrote: > Hi all, > I'm interested in hearing about experiences with mixing compilers > between the application and MPI. that is, I would like to be able > to compile MPI (say, OpenMPI) with gcc, and expect it to work correctly > with apps compiled with other compilers. I guess I'm reasoning by > analogy to normal distro libs. > > the OpenMPI FAQ has this comment: > > NOTE: The Open MPI team recommends using a single compiler suite whenever > possible. Unexpeced or undefined behavior can occur when you mix compiler > suites in unsupported ways (e.g., mixing Fortran 77 and Fortran 90 > compilers > between different compiler suites is almost guaranteed not to work). > > and there are complaints elsewhere in the FAQ about f90 bindings. I'd > appreciate it if someone could help a humble C/C++/perl hacker > understand the issues here... > > thanks, mark hahn. > PS: we have a large and diverse user base, so tend to have to support > gcc, intel, pathscale and pgi. we even have people who want to use > intel's damned synthetic 128b FP over MPI :( Mark, I'm not a developer, but I do spend a lot of my time compiling codes (mostly C and Fortran) for users, and I've dealt with the problem plent of times. 
Here's the library compatibility problem as I understand it: The C programming language standard defines the syntax for symbol names in libraries, so if you have a function named printf, when compiled into a library, the symbol for it will always be printf. From what I've heard, for C++ the standard isn't as strict as C, but usually doesn't cause any problems. I rarely compile C++ code, so I have no real experience with this. For Fortran, the standard doesn't define a naming convention for library symbols, so the compilers have more freedom with symbol naming. Usually, the symbol names will have zero, one, or two underscores before/after the symbol name, and they can be in all caps or all lower case. This is why gfortran has these options: -fno-underscoring -fsecond-underscore -fcase-lower In theory, these options should work, but they only really work if you're using one compiler to link against libraries compiled by only one other compiler (linking with gfortran to ifort-compiled libraries, for example). Once you have libraries from two different compilers, the libraries from one compiler might need -fno-underscoring, while the other needs -fcase-lower and -fsecond-underscore. I highly recommend you read the man pages for g77 and/or gfortran where these switches are explained. There is a much better explanation of why they're needed there. From my experience, it's easier to just compile the libraries again using a different compiler suite, and put them in a separate location to make it clear. For example, I have compiled Open MPI with GNU, Intel, and PGI compilers:

$ pwd
/usr/local/openmpi
$ ls -ld *
lrwxrwxrwx 1 root root 9 Jul 17 2009 gcc -> gcc-4.1.2
drwxr-xr-x 3 root root 4096 Feb 10 2009 gcc-4.1.2
lrwxrwxrwx 1 root root 8 Jul 17 2009 intel -> intel-11
drwxr-xr-x 3 root root 4096 Feb 5 2009 intel-11
lrwxrwxrwx 1 root root 7 Jul 17 2009 pgi -> pgi-8.0
drwxr-xr-x 3 root root 4096 Jan 28 2009 pgi-8.0

As long as users specify the correct paths to include and library files in their compiler commands, they can compile using whatever compiler they want. To save work, I only do this for libraries that I absolutely know that users will be using with Fortran. -- Prentice

From siegert at sfu.ca Tue Jul 20 12:13:03 2010 From: siegert at sfu.ca (Martin Siegert) Date: Tue, 20 Jul 2010 12:13:03 -0700 Subject: [Beowulf] compilers vs mpi? In-Reply-To: References: <20100720174655.GG24917@stikine.its.sfu.ca> Message-ID: <20100720191303.GI24917@stikine.its.sfu.ca>

On Tue, Jul 20, 2010 at 02:25:32PM -0400, Mark Hahn wrote:
>> we do exactly what you describe: compile OpenMPI with the gcc suite
>> and then use it with gcc, intel and open64 compilers.
>
> nice.
>
>> This works out-of-the-box, almost.
>> The problem is the f90 module mpi.mod. This is (usually) a binary
>> file and specific to the f90 compiler that was used to compile
>> OpenMPI. But there is a way to solve this problem:
>
> ah, .mod files. we have for the most part ignored them entirely.
> (what are we losing by doing that?)

For starters, a program with a "use mpi" statement won't compile if you don't have mpi.mod.

>> 1. compile openmpi using the gcc compilers, i.e., gfortran as the Fortran
>> compiler and install it in /usr/local/openmpi
>
> this is perhaps a tangent, but we install everything we support
> under /opt/sharcnet/$packagebasename/$ver. for openmpi, we've had to bodge
> the compiler flavor onto that (/opt/sharcnet/openmpi/1.4.2/intel).
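To make the per-compiler install layout above concrete, here is a minimal sketch of how the two builds might be driven through OpenMPI's configure. The version number, prefixes and symlink names are made up for illustration, and the gcc+ifort "hybrid" is simply the combination mentioned earlier in the thread, not a recommendation from either poster.

#!/bin/bash
# Sketch only: build OpenMPI into versioned, per-compiler prefixes,
# mirroring the /usr/local/openmpi/{gcc,intel,...} layout shown above.
set -e
VER=1.4.2
cd openmpi-$VER

# 1) plain GNU build
./configure --prefix=/usr/local/openmpi-$VER-gcc \
            CC=gcc CXX=g++ F77=gfortran FC=gfortran
make -j4 && make install
make distclean

# 2) "hybrid" build: gcc for C/C++, ifort for Fortran
#    (the gcc+ifort combination mentioned earlier in the thread)
./configure --prefix=/usr/local/openmpi-$VER-intel \
            CC=gcc CXX=g++ F77=ifort FC=ifort
make -j4 && make install

# convenience symlinks, one per compiler flavor
mkdir -p /usr/local/openmpi
ln -sfn /usr/local/openmpi-$VER-gcc   /usr/local/openmpi/gcc
ln -sfn /usr/local/openmpi-$VER-intel /usr/local/openmpi/intel

Users then pick a flavor by putting the matching bin directory first in their PATH (or via an environment module), which is essentially what the wrapper tricks discussed next automate.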
(I simplified the description: we install in /usr/local/openmpi-version and then create a softlink /usr/local/openmpi that points to the current default version) >> 2. move the Fortran module to the directory >> /usr/local/openmpi/include/gfortran. In that directory >> create softlinks to the files in /usr/local/openmpi/include. >> 3. compile openmpi using ifort and install the Fortran module (and only >> the Fortran module!) in /usr/local/openmpi/include/intel. In that >> directory create softlinks to the files in /usr/local/openmpi/include. > > I guess I'm surprised this works - aren't you effectively assuming that the > intel and gfortran interfaces are compatible here? that is, the app > compiles with the compiler-specific module, which basically promises a > particular type-safe interface (signature) for MPI functions, but then the > linker just glues them together without any way to verify the signature > compatibility... > > am I misunderstanding? The (default) name mangling scheme of gfortran, ifort, openf90 is the same: append a single underscore. As soon as a user uses -fno-underscoring or -fsecond-underscore nothing would work anymore. So, don't do that. We have had not a single user who tried to change the default name mangling scheme, thus this is not really a problem. We no longer support g77 (which does have a different default). As far as I understand the type checking with respect to mpi.mod is done at compile time. That's why you need to have the correct fortran module when compiling. At link time the linker only needs to find the library routines like mpi_send_ which is a C wrapper routine anyway that just calls MPI_Send. >> 4. in /usr/local/openmpi/bin create softlinks mpif90.ifort >> and mpif90.gfortran pointing to opal_wrapper. Remove the >> mpif90 softlink. >> 5. Move /usr/local/openmpi/share/openmpi/mpif90-wrapper-data.txt >> to /usr/local/openmpi/share/openmpi/mpif90.ifort-wrapper-data.txt. >> Change the line includedir=${includedir} to: >> includedir=${includedir}/intel >> Copy the file to >> /usr/local/openmpi/share/openmpi/mpif90.gfortran-wrapper-data.txt >> and change the line includedir=${includedir} to >> includedir=${includedir}/gfortran >> 6. Create a wrapper script /usr/local/openmpi/bin/mpif90: >> >> #!/bin/bash >> OMPI_WRAPPER_FC=`basename $OMPI_FC 2> /dev/null` >> if [ "$OMPI_WRAPPER_FC" = 'gfortran' ]; then >> exec $0.gfortran "$@" >> else >> exec $0.ifort "$@" >> fi > > this is a tangent, but perhaps interesting. we don't use the wrappers from > the MPI package, but rather our own single wrapper which has some > built-in intelligence (augmented by info from the compiler's (environment) > module.) Again, I simplified. We use, e.g., compiler env modules to set env. variables like OMPI_FC. >> The reason we use gfortran in step 1 is that otherwise you get those >> irritating error messages from the Intel libraries, cf. >> http://www.open-mpi.org/faq/?category=building#intel-compiler-wrapper-compiler-warnings > > hmm. we work around those by manipulating the link arguments in our wrapper. Does that work when using gfortran to link with libraries compiled with icc/ifort? Anyway, it appeared to be an unnecessary complication as the fortran compiler has no affect on the performance of the MPI distribution; it is only really needed for compiling the f90 module (and for configure to determine the name mangling scheme). 
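Since everything above hinges on the compilers agreeing on that single trailing underscore, it is worth checking directly rather than assuming it. A minimal sketch, assuming gfortran and ifort are on $PATH (extend the list to whatever Fortran compilers you actually support):

#!/bin/bash
# Show the Fortran name-mangling convention each compiler uses by
# inspecting the symbol it actually emits for a test subroutine.
# With the defaults discussed above, expect a single trailing
# underscore (my_test_sub_) from gfortran, ifort, openf90, pgf90, ...
cat > mangle.f90 <<'EOF'
subroutine my_test_sub(x)
  implicit none
  real :: x
  x = 2.0 * x
end subroutine my_test_sub
EOF

for fc in gfortran ifort; do
    echo "== $fc =="
    $fc -c -o mangle_$fc.o mangle.f90 && nm mangle_$fc.o | grep -i my_test_sub
done

If a user sneaks in -fno-underscoring or -fsecond-underscore, this is where it shows up, and the final link will typically fail with unresolved MPI symbols.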
- Martin From hahn at mcmaster.ca Tue Jul 20 11:54:59 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue, 20 Jul 2010 14:54:59 -0400 (EDT) Subject: [Beowulf] compilers vs mpi? In-Reply-To: <4C45E15B.8020702@ldeo.columbia.edu> References: <4C45E15B.8020702@ldeo.columbia.edu> Message-ID: >> between the application and MPI. that is, I would like to be able >> to compile MPI (say, OpenMPI) with gcc, and expect it to work correctly >> with apps compiled with other compilers. I guess I'm reasoning by analogy >> to normal distro libs. >> > > I haven't built OpenMPI this way, > but you may try to link statically with commercial compiler libraries > (say -static-intel, -Bstatic_pgi), I'd rather build with gcc if possible. I guess I'd be surprised if there were compute-intensive-enough parts of MPI to justify using some other compiler. (please, if anyone has any quantitative observations on the quality of current compilers, let me/list know!) > Yes, they do recommend compiler homogeneity. > However, I have built hybrids gcc+ifort > and gcc+pgf90 and both work fine. > (I have the homogeneous versions also.) oh. so the idea here is that the C part of OpenMPI has an ABI which is compatible with basically all the other C compilers, such as would be used to compile app-side code. but that the fortran side has to be matched, library and app sides? if that's the case, then would it make sense to factor out the fortran interface? > Fortran77 never had these features anyway, and I guess > mpif77 doesn't check if you are passing an integer > where it should be a real, or if your argument list is shorter > than the function requires. so if I have f90 code that uses an mpi header (not .mod interface), does that mean there's no function signature checking at all? as far as I know, my organization has never done .mod-based MPI, so maybe this is why we're facing the issue now, after 10 years and 4k users ;) >> PS: we have a large and diverse user base, so tend to have to support gcc, >> intel, pathscale and pgi. > > ... and don't forget Open64! :) well, that's an interesting point. I haven't quite figured out who is doing the canonical release for Open64 nowadays (highest ver number seems to be from AMD). have you done any comparisons? >> we even have people who want to use >> intel's damned synthetic 128b FP over MPI :( > > It's hard to keep the customer satisfied. > You give them the sky, they want the universe. for me, the real problem is knowing whether the user understands that synthetic 128b FP is drastically slower than 64b hardware FP. has anyone tried to do a comparison? thanks, mark. From hahn at mcmaster.ca Tue Jul 20 11:25:32 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue, 20 Jul 2010 14:25:32 -0400 (EDT) Subject: [Beowulf] compilers vs mpi? In-Reply-To: <20100720174655.GG24917@stikine.its.sfu.ca> References: <20100720174655.GG24917@stikine.its.sfu.ca> Message-ID: > we do exactly what you describe: compile OpenMPI with the gcc suite > and then use it with gcc, intel and open64 compilers. nice. > This works out-of-the-box, almost. > The problem is the f90 module mpi.mod. This is (usually) a binary > file and specific to the f90 compiler that was used to compile > OpenMPI. But there is a way to solve this problem: ah, .mod files. we have for the most part ignored them entirely. (what are we losing by doing that?) > 1. 
compile openmpi using the gcc compilers, i.e., gfortran as the Fortran > compiler and install it in /usr/local/openmpi this is perhaps a tangent, but we install everything we support under /opt/sharcnet/$packagebasename/$ver. for openmpi, we've had to bodge the compiler flavor onto that (/opt/sharcnet/openmpi/1.4.2/intel). > 2. move the Fortran module to the directory > /usr/local/openmpi/include/gfortran. In that directory > create softlinks to the files in /usr/local/openmpi/include. > 3. compile openmpi using ifort and install the Fortran module (and only > the Fortran module!) in /usr/local/openmpi/include/intel. In that > directory create softlinks to the files in /usr/local/openmpi/include. I guess I'm surprised this works - aren't you effectively assuming that the intel and gfortran interfaces are compatible here? that is, the app compiles with the compiler-specific module, which basically promises a particular type-safe interface (signature) for MPI functions, but then the linker just glues them together without any way to verify the signature compatibility... am I misunderstanding? > 4. in /usr/local/openmpi/bin create softlinks mpif90.ifort > and mpif90.gfortran pointing to opal_wrapper. Remove the > mpif90 softlink. > 5. Move /usr/local/openmpi/share/openmpi/mpif90-wrapper-data.txt > to /usr/local/openmpi/share/openmpi/mpif90.ifort-wrapper-data.txt. > Change the line includedir=${includedir} to: > includedir=${includedir}/intel > Copy the file to > /usr/local/openmpi/share/openmpi/mpif90.gfortran-wrapper-data.txt > and change the line includedir=${includedir} to > includedir=${includedir}/gfortran > 6. Create a wrapper script /usr/local/openmpi/bin/mpif90: > > #!/bin/bash > OMPI_WRAPPER_FC=`basename $OMPI_FC 2> /dev/null` > if [ "$OMPI_WRAPPER_FC" = 'gfortran' ]; then > exec $0.gfortran "$@" > else > exec $0.ifort "$@" > fi this is a tangent, but perhaps interesting. we don't use the wrappers from the MPI package, but rather our own single wrapper which has some built-in intelligence (augmented by info from the compiler's (environment) module.) > The reason we use gfortran in step 1 is that otherwise you get those > irritating error messages from the Intel libraries, cf. > http://www.open-mpi.org/faq/?category=building#intel-compiler-wrapper-compiler-warnings hmm. we work around those by manipulating the link arguments in our wrapper. thanks! -mark From niftyompi at niftyegg.com Tue Jul 20 13:28:43 2010 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Tue, 20 Jul 2010 13:28:43 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <76097BB0C025054786EFAB631C4A2E3C09331921@MERCMBX04R.na.SAS.com> References: <386oFdNJv5776S01.1277903447@web01.cms.usa.net> <76097BB0C025054786EFAB631C4A2E3C09331921@MERCMBX04R.na.SAS.com> Message-ID: <20100720202843.GA20023@tosh2egg.ca.sanfran.comcast.net> On Wed, Jun 30, 2010 at 03:11:54PM +0000, Bill Rankin wrote: > > > I think the money part will be difficult to get (it is like a > > politically > > incorrect question). > > Joe addressed this pretty well. For the large systems, it's almost always under NDA. > And then there is always the issue of timing. Some groups or departments obtain hardware used from another project often at $1.00 and other consideration prices. Divide by zero rules and NAN specters should scare ya. Then there are the aggressive pricing curves that processors suffer. 
Six months later the same hardware can sometimes be purchased at a substantial discount, such that a cluster bought then could have a price per teraflops that is 40% of that of an identical system bought earlier. And then the question is in dollars: with international exchange rates, other computations come into play, making historic comparisons interesting. -- T o m M i t c h e l l Found me a new hat, now what?

From gus at ldeo.columbia.edu Tue Jul 20 13:43:47 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 20 Jul 2010 16:43:47 -0400 Subject: [Beowulf] compilers vs mpi? In-Reply-To: References: <4C45E15B.8020702@ldeo.columbia.edu> Message-ID: <4C460A83.4010906@ldeo.columbia.edu>

Hi Mark

Mark Hahn wrote:
>>> between the application and MPI. that is, I would like to be able
>>> to compile MPI (say, OpenMPI) with gcc, and expect it to work
>>> correctly with apps compiled with other compilers. I guess I'm
>>> reasoning by analogy to normal distro libs.
>>>
>>
>> I haven't built OpenMPI this way,
>> but you may try to link statically with commercial compiler libraries
>> (say -static-intel, -Bstatic_pgi),
>
> I'd rather build with gcc if possible. I guess I'd be surprised if
> there were compute-intensive-enough parts of MPI to justify using some
> other compiler.

You are probably right, and gcc is so germane to Linux, why bother with other C compilers unless they are significantly faster? I have builds with icc and pgcc to avoid trouble and too much digging into Makefiles, etc., to fix things when the code or the user prefers to use the commercial compiler.

> (please, if anyone has any quantitative observations on
> the quality of current compilers, let me/list know!)
>

I guess this candid question may spark yet another war. I don't have direct comparisons. I remember some discussion some time ago, I don't remember where, on how memcpy is done in gcc vs. icc, and how efficient each one is. This is presumably important for MPI, as memcpy is likely to be at the base of all "intra-node" MPI communication.

>> Yes, they do recommend compiler homogeneity.
>> However, I have built hybrids gcc+ifort
>> and gcc+pgf90 and both work fine.
>> (I have the homogeneous versions also.)
>
> oh. so the idea here is that the C part of OpenMPI has an ABI
> which is compatible with basically all the other C compilers,
> such as would be used to compile app-side code. but that the fortran
> side has to be matched, library and app sides? if that's the case,
> then would it make sense to factor out the fortran interface?

I don't know the guts of OpenMPI, but I believe the Fortran 77 and 90 interfaces build on top of the C interface. I don't really know if the OpenMPI ABI is compatible across all C compilers. Since most of the code here is Fortran, with a few tidbits of C, I try to provide a variety of MPI builds for the commercial compilers around (plus gfortran, and now openf90, which I have yet to test). Some programs just refuse to compile with one commercial compiler, but may compile with the other. Short of modifying the code (which we often have to do), this creates the need for several MPI compiler wrappers, built with different Fortran compilers. I haven't seen this happen with C programs, but as I said, there is not much C code in our area. I didn't mean to factor Fortran out, although your interpretation of it is interesting.
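Returning to the memcpy point above for a moment: one crude way to compare toolchains is simply to time a large memcpy loop built with each compiler. This is a rough sketch, not a rigorous benchmark; results depend heavily on buffer size, alignment, cache effects and the libc in use, and the compiler list is just an example.

#!/bin/bash
# Crude memcpy bandwidth test to compare toolchains; sketch only.
cat > memcpy_bw.c <<'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

int main(void)
{
    const size_t n = 64UL * 1024 * 1024;   /* 64 MB per buffer */
    const int reps = 20;
    char *src = malloc(n), *dst = malloc(n);
    if (!src || !dst) { perror("malloc"); return 1; }
    memset(src, 1, n);                      /* touch the pages first */
    memset(dst, 0, n);

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < reps; i++)
        memcpy(dst, src, n);
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + 1e-6 * (t1.tv_usec - t0.tv_usec);
    /* read dst so the copies cannot be optimized away */
    printf("%.2f GB/s (check byte: %d)\n",
           reps * (double)n / secs / 1e9, dst[n - 1]);
    free(src);
    free(dst);
    return 0;
}
EOF

for cc in gcc icc; do                # use whatever C compilers you have
    $cc -O2 -std=c99 -o memcpy_bw_$cc memcpy_bw.c &&
        { echo -n "$cc: "; ./memcpy_bw_$cc; }
done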
The gcc+some_fortran hybrids I built were mostly because: 1) we didn't have icc for a while (only funds to buy ifort), although more recently we bought the whole compiler suite; 2) pgcc for a while had trouble building OpenMPI, although the problem is now gone. Very mundane reasons, but the hybrids work. > >> Fortran77 never had these features anyway, and I guess >> mpif77 doesn't check if you are passing an integer >> where it should be a real, or if your argument list is shorter >> than the function requires. > > so if I have f90 code that uses an mpi header (not .mod interface), > does that mean there's no function signature checking at all? > as far as I know, my organization has never done .mod-based MPI, > so maybe this is why we're facing the issue now, after 10 years and 4k > users ;) > There is quite a bit of code here written with Fortran90 constructs, but that has "#include mpif.h", instead of "use mpi", and some with "use mpi". I think this is for historic reasons, because the MPI F90 interface may not have been very good in the past. The mpif90 wrapper compiles both cases. If I remember right, the mpif77 (as long as it is built with F77=FC=[ifort,pgf90,gfortran]) also compiles the first case (because the underlying compiler is actually a F90 compiler), but not the second. Of course, you need to build the MPI Fortran90 interface to do "use mpi", and you must use mpif90 in this case. If I remember right (somebody please correct me if I am wrong), the MPI subroutine/function calls are the same in F77 and F90, the main difference is that the MPI F90 bindings add type and interface checking, the OO-universe that is absent in F77. >>> PS: we have a large and diverse user base, so tend to have to support >>> gcc, intel, pathscale and pgi. >> >> ... and don't forget Open64! :) > > well, that's an interesting point. I haven't quite figured out who is > doing > the canonical release for Open64 nowadays (highest ver number seems to > be from AMD). have you done any comparisons? > Well, I built OpenMPI 1.4.2 with Open64, tested the basic functionality, but I have yet to find time to compile and run one or two atmosphere/climate/ocean codes here with the Open64 OpenMPI wrappers to see if it outperforms builds with the other compilers. I am curious about this one because we have Opteron quad-core, and I was wondering if the AMD-sponsored compiler would do better than the Intel compiler (which doesn't let me use anything beyond SSE2, -xW on the Opterons, if I remember right). Unfortunately, this type of comparison can take quite some time, if you try to tweak with optimization, check if the results are OK (in IA64 I had some bad surprises with hidden/bundled optimization flags), test also with MVAPICH2, and so on. I can't possibly test everything, I have production runs to do, I am also one of my users! >>> we even have people who want to use >>> intel's damned synthetic 128b FP over MPI :( >> >> It's hard to keep the customer satisfied. >> You give them the sky, they want the universe. > > for me, the real problem is knowing whether the user understands that > synthetic 128b FP is drastically slower than 64b hardware FP. > has anyone > tried to do a comparison? > > thanks, mark. From what I observe here, the primary level of astonishment and satisfaction for most users is: "It works!". "It runs faster than on my laptop." comes later, if ever. Only a few users try to compare. If you gave them a functional synthetic 128b FP you may have already accomplished a lot. 
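On Mark's question of how much slower synthetic 128-bit FP is than hardware doubles: a quick way to get a ballpark number is below. It uses gcc's __float128, which is software-emulated on x86-64, as a stand-in; whether that behaves like the Intel "128b FP" mentioned in the thread is an assumption, and this is a toy loop, not a real workload.

#!/bin/bash
# Ballpark comparison of hardware double vs software 128-bit FP.
# Assumes a reasonably recent gcc on x86-64 (__float128 support).
cat > fp128.c <<'EOF'
#include <stdio.h>
#include <sys/time.h>

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

#define N 20000000L

int main(void)
{
    double t, d = 1.0;
    __float128 q = 1.0;

    t = now();
    for (long i = 1; i <= N; i++)
        d = d * 1.0000001 + 1.0 / i;
    printf("double     : %6.3f s  (result %g)\n", now() - t, d);

    t = now();
    for (long i = 1; i <= N; i++)
        q = q * (__float128)1.0000001 + (__float128)1.0 / i;
    printf("__float128 : %6.3f s  (result %g)\n", now() - t, (double)q);

    return 0;
}
EOF

gcc -O2 -std=gnu99 -o fp128 fp128.c && ./fp128

Expect the 128-bit loop to be dramatically slower (often an order of magnitude or more), but run something like this on your own hardware before quoting a number to users.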
In the specific case of 128-bit arithmetic I wonder if you can make it run fast on 64-bit machines. There was a discussion, maybe here, a few weeks ago about this, right? Or was it in one of the MPI lists? It was about why one would need 128-bit arithmetic, and whether this would be more of an issue with a possibly poor/noisy algorithm/numerics. Long ago I banned Matlab from our old cluster, because of abusive behavior on the head node, etc. Recently I set it up again to run in batch mode on the compute nodes. I thought that would be a reasonable compromise, and drive some people to run their heavy Matlab calculations in the cluster, instead of on their desktops. (A lot of post-processing of climate data is done in Matlab.) Well, nobody got interested. Trying to please users beyond their strict requests, especially when this requires changing their habits, may not necessarily work. My $0.02 Gus Correa

From niftyompi at niftyegg.com Tue Jul 20 15:29:32 2010 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Tue, 20 Jul 2010 15:29:32 -0700 Subject: [Beowulf] compilers vs mpi? In-Reply-To: References: Message-ID: <20100720222932.GA22712@tosh2egg.ca.sanfran.comcast.net>

On Tue, Jul 20, 2010 at 12:07:32PM -0400, Mark Hahn wrote:
>
> Hi all,
> I'm interested in hearing about experiences with mixing compilers
> between the application and MPI. that is, I would like to be able
> to compile MPI (say, OpenMPI) with gcc, and expect it to work
> correctly with apps compiled with other compilers. I guess I'm
> reasoning by analogy to normal distro libs.
>
> the OpenMPI FAQ has this comment:
>
> NOTE: The Open MPI team recommends using a single compiler suite whenever
> possible. Unexpected or undefined behavior can occur when you mix compiler
> suites in unsupported ways (e.g., mixing Fortran 77 and Fortran 90 compilers
> between different compiler suites is almost guaranteed not to work).
>
> and there are complaints elsewhere in the FAQ about f90 bindings.
> I'd appreciate it if someone could help a humble C/C++/perl hacker
> understand the issues here...
>
> thanks, mark hahn.
> PS: we have a large and diverse user base, so tend to have to
> support gcc, Intel, pathscale and pgi. we even have people who want
> to use intel's damned synthetic 128b FP over MPI :(

Some of this is historic and has been addressed transparently in OpenMPI. OFED pulls from OpenMPI, I believe... MPICH, MPICH2 is unknown to me. Different compilers have the option of representing things differently. One example is Fortran's notion of True/False: there are two conventions, .TRUE. stored as 1 or as -1 (with .FALSE. as 0), i.e., a choice was made in the space -1:0:1; you can test your compiler set with a debugger and a short test code. Depending on some logic reductions, some things might work in code that breaks at a different optimization level. Two other places to double check are: strings and arg() handling. The older Pathscale/QLogic MPI had libs with symbol handling magic that could make most of this transparent via mpicc and friends. The OpenMPI folk did the same thing differently, if I recall. Synthetic 128b is unknown to me. C++ bindings can be more difficult, and each compiler should be used to generate bindings as needed, perhaps based on the OpenMPI source/makefiles. Some caution is justified as the link line gets longer and longer and the users pull in this GCC bit, a PGI-built lib, an Intel lib, goto-BLAS (pick one), etc... Summary: booleans, args(), and strings times a list of compilers can generate a pile of permutations....
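The "short test code" for the .TRUE./.FALSE. conventions can be as small as this. TRANSFER() shows the bit pattern of a default LOGICAL as a default INTEGER (assuming both are four bytes, which is the usual default), and the compiler list is just an example.

#!/bin/bash
# Print the internal representation of .true./.false. for each Fortran
# compiler, per the 0:1 vs 0:-1 conventions mentioned above.
cat > trueval.f90 <<'EOF'
program trueval
  implicit none
  print *, 'bit pattern of .true.  as integer:', transfer(.true.,  0)
  print *, 'bit pattern of .false. as integer:', transfer(.false., 0)
end program trueval
EOF

for fc in gfortran ifort pgf90; do   # whatever compilers you support
    echo "== $fc =="
    $fc -o trueval_$fc trueval.f90 && ./trueval_$fc
done

This matters when a library built by one compiler returns a LOGICAL that code built by another compiler then tests: a value that is "true" under one convention can be misread under the other, and optimization can change how the test is reduced, which is presumably the effect described above.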
-- T o m M i t c h e l l Found me a new hat, now what? From mathog at caltech.edu Wed Jul 21 13:26:17 2010 From: mathog at caltech.edu (David Mathog) Date: Wed, 21 Jul 2010 13:26:17 -0700 Subject: [Beowulf] Re: OT: recoverable optical media archive format? Message-ID: I wasn't thrilled with the limitations of rsbep and eventually wrote a program rsbd (Reed-Solomon for Block Devices) to do this. rsbd reuses the RS encoding/decoding routines from the rsbep distribution (written by Phil Karn) but the rest is new code. rsbd uses message digests (SHA1) that let it skip the RS decode step on data that is not corrupted, which speeds up "decode" a lot. It also keeps track of erasures and so can restore 32 erasures (rsbep is limited to 16) in a block of 255 bytes. rsbd does everything it can to verify data integrity, and either aborts on error or optionally slogs on anyway while noting the locations of bad output. I do not believe there are any cases where it will output bad data and not tell you. (Could be a bug somewhere though.) rsbd can be retrieved from rsbd.sourceforge.net. Here is an example (uses one core on a dual Opteron 280, single SATA disk on the machine) that corresponds roughly to the size of a DVD-R: % cat test | time rsbd -e >test.rsbd 90.98user 14.36system 3:04.17elapsed 57%CPU % cat test.rsbd | time rsbd -d -c >restored Output size: 4056879025 input size: 4648980480 Input blocks total: 9080040 Input blocks erased: 0 Neighborhoods processed: 4451 Sections processed: 35602 Sections Spec. Blk verified: ddgst good: 35602 Sections Spec. Blk verified: ddgst bad: 0 Sections Spec. Blk verified: RS: ddgst good: 0 Sections Spec. Blk verified: RS: ddgst bad: 0 Sections Spec. Blk reverified: ddgst good: 0 Sections Spec. Blk reverified: ddgst bad: 0 Sections Spec. Blk reverified: RS: ddgst good: 0 Sections Spec. Blk reverified: RS: ddgst bad: 0 Sections Spec. Blk corrupt: ddgst good: 0 Sections Spec. Blk corrupt: ddgst bad: 0 Sections Spec. Blk corrupt: RS: ddgst good: 0 Sections Spec. Blk corrupt: RS: ddgst bad: 0 RSblks total: 18192622 RSblks clean: 18192622 RSblks corrected: 0 RSblks excess erasures: 0 RSblks uncorrectable: 0 RSblks avg corr. bytes: 0 RSblks max corr. bytes: 0 50.24user 12.43system 2:28.60elapsed 42%CPU % cat test.rsbd | \ pockmark -bs 512 -maxgap 4000 -maxrun 40 > test.rsbd.pox % cat test.rsbd.pox | time rsbd -d -c >restored Output size: 4056879025 input size: 4648980480 Input blocks total: 9080040 Input blocks erased: 91639 Neighborhoods processed: 4451 Sections processed: 35602 Sections Spec. Blk verified: ddgst good: 12608 Sections Spec. Blk verified: ddgst bad: 17152 Sections Spec. Blk verified: RS: ddgst good: 17152 Sections Spec. Blk verified: RS: ddgst bad: 0 Sections Spec. Blk reverified: ddgst good: 0 Sections Spec. Blk reverified: ddgst bad: 5842 Sections Spec. Blk reverified: RS: ddgst good: 5842 Sections Spec. Blk reverified: RS: ddgst bad: 0 Sections Spec. Blk corrupt: ddgst good: 0 Sections Spec. Blk corrupt: ddgst bad: 0 Sections Spec. Blk corrupt: RS: ddgst good: 0 Sections Spec. Blk corrupt: RS: ddgst bad: 0 RSblks total: 18192622 RSblks clean: 6454984 RSblks corrected: 11737638 RSblks excess erasures: 0 RSblks uncorrectable: 0 RSblks avg corr. bytes: 3.7 RSblks max corr. bytes: 19 246.49user 11.73system 5:10.14elapsed 83%CPU % md5sum test restored b22a361554771045df4424e547eaa558 restored b22a361554771045df4424e547eaa558 test >From the above one can see that it doesn't waste time doing RS decoding unless it needs to. 
Consequently the decode on a file which isn't corrupt runs faster than the encode. Somewhat more on subject for this group, the current version of rsbd is completely single threaded. There is plenty of room here for parallelization. For instance, for each "neighborhood" the sha1 digests are independent and can be done 8 at a time, the RS encode/decode are performed 8*511 times (all of which are completely independent), the XOR step is performed on blocks of 512 consecutive bytes 4096*255/512 times, all independent. However, the [255,4096] <-> [4096,255] transpose of a byte array, once per neighborhood, isn't going to be as trivial to split into threads, and that could be rate limiting. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From madskaddie at gmail.com Sat Jul 24 11:18:15 2010 From: madskaddie at gmail.com (madskaddie at gmail.com) Date: Sat, 24 Jul 2010 19:18:15 +0100 Subject: [Beowulf] first cluster In-Reply-To: References: <20100716145909.GB9850@sopalepc> <20100716171129.GB10537@sopalepc> <553F0871-8E44-4D4C-A788-4254A5E031A5@sanger.ac.uk> Message-ID: I manage a small cluster with a central image for the execution hosts (fully decoupled from the master/ login nodes). To deal with direct access to nodes: - Every user has an "*" on the password field of the /etc/shadow file in the execution hosts images - Access through ssh to the exec hosts is enabled to work only with passwords (no certificate files) - Direct access to nodes: gridengine's (GE) qrsh - MPI via GE parallel environments Things to be solved: - Monitoring of the resources usage; Now is only possible to query by using GE qhost or looking at ganglia. But the latency is quite high :/ (anything above instantaneous is high latency) - Administration can be boring sometimes because I need to input the password. I'll study a bit of PAM rules to bypass or learn the tcl Expect tool (or equivalent libs in other languages) Gil Brand?o -- " It can't continue forever. The nature of exponentials is that you push them out and eventually disaster happens. " Gordon Moore? (Intel co-founder and author of the Moore's law) From john.hearns at mclaren.com Mon Jul 26 07:40:26 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Mon, 26 Jul 2010 15:40:26 +0100 Subject: [Beowulf] Top of the Green 500 Message-ID: <68A57CCFD4005646957BD2D18E60667B114134DA@milexchmb1.mil.tagmclarengroup.com> http://www.engadget.com/2010/07/13/tokyo-universitys-grape-dr-supercompu ter-is-a-tangled-green-pow/ The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From hahn at mcmaster.ca Mon Jul 26 10:24:06 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 26 Jul 2010 13:24:06 -0400 (EDT) Subject: [Beowulf] Top of the Green 500 In-Reply-To: <68A57CCFD4005646957BD2D18E60667B114134DA@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B114134DA@milexchmb1.mil.tagmclarengroup.com> Message-ID: > http://www.engadget.com/2010/07/13/tokyo-universitys-grape-dr-supercompu > ter-is-a-tangled-green-pow/ kind of weird - it would be interesting to get some first-person commentary. I've got nothing against wire racks, not even against somewhat messy cabling. (the goal of cabling is to _work_, while being reasonably maintainable, etc. 
labeling is great, but I've seen some _abominations_ that were very neat but impossible to maintain. not to mention cables so tidily tie-wrapped that they exceeded their min mend radius...) the open-air cooling is also pretty amateur-looking. I guess one benefit of going green is that you can be sloppy about airflow ;) From mathog at caltech.edu Mon Jul 26 12:46:30 2010 From: mathog at caltech.edu (David Mathog) Date: Mon, 26 Jul 2010 12:46:30 -0700 Subject: [Beowulf] Re: Top of the Green 500 Message-ID: John Hearns wrote: > > http://www.engadget.com/2010/07/13/tokyo-universitys-grape-dr-supercompu > ter-is-a-tangled-green-pow/ > Mysterious ventilation in that room. Near as I can tell the floor, ceiling, and back (left) wall are solid. The only other wall looks a bit like a shoji screen, but maybe that is filter material and air moves in or out through that wall? (Arguing against that is the book case which would then be impeding the flow, and of course also that the racks would then be parallel to the flow.) The one visible air duct is a narrow slit above the right most rack, which appears to be about 10cm. high by 1m wide. Maybe the heat is carried away by the rats nest of cables? Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From rpnabar at gmail.com Wed Jul 28 10:42:52 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 28 Jul 2010 12:42:52 -0500 Subject: [Beowulf] Any recommendations for a small (~4 port) KVM-over-IP switch Message-ID: Are there any small (~4 port) KVM-over-IP switches out there? All the one I know have at least 16 ports or so. But I just need KVM over IP ability for my 3 head nodes so that's a lot of wasted ports (and money!). I did see some 2 / 4 / 6 port KVMs but those are all without the "over IP" facility. Just curious if anyone knows of a suitable product...... -- Rahul From crhea at mayo.edu Wed Jul 28 12:12:34 2010 From: crhea at mayo.edu (Cris Rhea) Date: Wed, 28 Jul 2010 14:12:34 -0500 Subject: [Beowulf] Re: Any recommendations for a small (~4 port)... In-Reply-To: <201007281900.o6SJ0DTS028058@bluewest.scyld.com> References: <201007281900.o6SJ0DTS028058@bluewest.scyld.com> Message-ID: <20100728191234.GA1821@kaizen.mayo.edu> > Subject: [Beowulf] Any recommendations for a small (~4 port) > KVM-over-IP switch > > Are there any small (~4 port) KVM-over-IP switches out there? All the > one I know have at least 16 ports or so. But I just need KVM over IP > ability for my 3 head nodes so that's a lot of wasted ports (and > money!). > > I did see some 2 / 4 / 6 port KVMs but those are all without the "over > IP" facility. Just curious if anyone knows of a suitable > product...... > > -- > Rahul You might look at Avocent's MPU104E (4 port) or MPU108E (8 port). -- Cristopher J. Rhea Mayo Clinic - Research Computing Facility 200 First St SW, Rochester, MN 55905 crhea at Mayo.EDU (507) 284-0587 From rpnabar at gmail.com Wed Jul 28 12:25:04 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 28 Jul 2010 14:25:04 -0500 Subject: [Beowulf] Any recommendations for a small (~4 port) KVM-over-IP switch In-Reply-To: References: Message-ID: On Wed, Jul 28, 2010 at 1:51 PM, Andrew Latham wrote: > I have found the Lantronix Spider product a great solution for N+1 > issues. ?It has dropped in price a great deal. ?The benefit of the > added Serial over LAN also helps out. ?Virtual CD / Floppy has become > standard across all new IP KVM solutions... Thanks Andrew! That sounds like exactly what I need. 
But at it's ~$300 / module price tag it means that it's probably just as well to buy a conventional 16 port KVM (approx. $800 *) and waste 12 ports. I guess the small port count KVM (over IP) market just doesn't exist. [*]=http://www.belkin.com/iwcatproductpage.process?product_id=291685 -- Rahul From rpnabar at gmail.com Wed Jul 28 14:28:34 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 28 Jul 2010 16:28:34 -0500 Subject: [Beowulf] Any recommendations for a small (~4 port) KVM-over-IP switch In-Reply-To: References: Message-ID: On Wed, Jul 28, 2010 at 3:02 PM, Andrew Latham wrote: > I keep a Lantronix Spider in my laptop bag. ?I paid $550 USD when they > were new. ?I would say that at ~$270 @ Provantage that they are great > deals. ?They are also zero units (They come with a mounting kit that > can attach to most any cable management arm or rack rail). > > The Spider is not the solution for everything but when you need just > one more node it is an option. > > I was reminded by a peer to mention that the Spider can authenticate > against Radius, LDAP, and Microsoft Active Directory etc... Thanks Andrew for the additional tips! Yup, I'm impressed with it too and might buy one for my laptop bag. Especially the fact that it totally does away with the intermediate KVM-switch that's in most other alternatives is pretty nice. I'm curious, can this actually replace, say a crash cart? Could one have one of these handy and then whenever a compute-node goes bust in a faraway hot-aisle, just plug the Lantronix into a spare eth port and then go back and play with the node from a central console? [I guess the issue is does it have to be preplugged or can it be plugged into the monitor port AFTER a crash] Or will there be situations where a crash cart is still needed? -- Rahul From john.hearns at mclaren.com Fri Jul 30 05:00:01 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Fri, 30 Jul 2010 13:00:01 +0100 Subject: [Beowulf] Scale modl Cray-1 Message-ID: <68A57CCFD4005646957BD2D18E60667B1154A77D@milexchmb1.mil.tagmclarengroup.com> Enjoy. http://www.theregister.co.uk/2010/07/29/cray_1_replica/ The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From gus at ldeo.columbia.edu Fri Jul 30 08:15:55 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 30 Jul 2010 11:15:55 -0400 Subject: [Beowulf] Scale modl Cray-1 In-Reply-To: <68A57CCFD4005646957BD2D18E60667B1154A77D@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B1154A77D@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4C52ECAB.1020502@ldeo.columbia.edu> Hearns, John wrote: > Enjoy. http://www.theregister.co.uk/2010/07/29/cray_1_replica/ > > The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. 
> > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf In this age of virtualization, I was wondering if there are simulators in software (say, for Linux) of famous old computers: PDP-11, VAX, Cray-1, IBM 1130, IBM/360, CDC 6600, even the ENIAC perhaps. From instruction set, to OS, to applications. Any references? Thanks, Gus Correa From douglas.guptill at dal.ca Fri Jul 30 09:02:44 2010 From: douglas.guptill at dal.ca (Douglas Guptill) Date: Fri, 30 Jul 2010 13:02:44 -0300 Subject: [Beowulf] Scale modl Cray-1 In-Reply-To: <4C52ECAB.1020502@ldeo.columbia.edu> References: <68A57CCFD4005646957BD2D18E60667B1154A77D@milexchmb1.mil.tagmclarengroup.com> <4C52ECAB.1020502@ldeo.columbia.edu> Message-ID: <20100730160244.GA18848@sopalepc> On Fri, Jul 30, 2010 at 11:15:55AM -0400, Gus Correa wrote: > Hearns, John wrote: >> Enjoy. http://www.theregister.co.uk/2010/07/29/cray_1_replica/ >> > In this age of virtualization, > I was wondering if there are simulators in software (say, for Linux) > of famous old computers: PDP-11, VAX, Cray-1, IBM 1130, IBM/360, > CDC 6600, even the ENIAC perhaps. > From instruction set, to OS, to applications. > > Any references? I once (early 1990s) wrote an emulator for something like an H316 (Honeywell 16-bit mini-computer). In Lisp. It was a graduate course project. I expect that I still have the source, on floppies...I wonder if they are still readable? Cheers, Douglas. -- Douglas Guptill voice: 902-461-9749 Research Assistant, LSC 4640 email: douglas.guptill at dal.ca Oceanography Department fax: 902-494-3877 Dalhousie University Halifax, NS, B3H 4J1, Canada From james.p.lux at jpl.nasa.gov Fri Jul 30 09:09:15 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 30 Jul 2010 09:09:15 -0700 Subject: [Beowulf] Scale modl Cray-1 In-Reply-To: <4C52ECAB.1020502@ldeo.columbia.edu> Message-ID: Certainly, there are simulators for the 1130, PDP-11, PDP-8... They run in simulation faster on a PC than on the original machine.. But you don't have the thwap of the cards in the reader, or the whine of the chain in the printer, or need to swap disks between passes for the Fortran compiler. On 7/30/10 8:15 AM, "Gus Correa" wrote: Hearns, John wrote: > Enjoy. http://www.theregister.co.uk/2010/07/29/cray_1_replica/ > > The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf In this age of virtualization, I was wondering if there are simulators in software (say, for Linux) of famous old computers: PDP-11, VAX, Cray-1, IBM 1130, IBM/360, CDC 6600, even the ENIAC perhaps. From instruction set, to OS, to applications. Any references? 
Thanks, Gus Correa _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From dnlombar at ichips.intel.com Fri Jul 30 12:38:19 2010 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Fri, 30 Jul 2010 12:38:19 -0700 Subject: [Beowulf] Scale modl Cray-1 In-Reply-To: <4C52ECAB.1020502@ldeo.columbia.edu> References: <68A57CCFD4005646957BD2D18E60667B1154A77D@milexchmb1.mil.tagmclarengroup.com> <4C52ECAB.1020502@ldeo.columbia.edu> Message-ID: <20100730193819.GA19935@nlxcldnl2.cl.intel.com> On Fri, Jul 30, 2010 at 08:15:55AM -0700, Gus Correa wrote: > Hearns, John wrote: > > Enjoy. http://www.theregister.co.uk/2010/07/29/cray_1_replica/ > > > In this age of virtualization, > I was wondering if there are simulators in software (say, for Linux) > of famous old computers: PDP-11, VAX, Cray-1, IBM 1130, IBM/360, > CDC 6600, even the ENIAC perhaps. > From instruction set, to OS, to applications. http://www.ibm1130.org You can run DMS R2 V12 along with ASM, IBM FORTRAN (not EMU FORTRAN), APL, RPG... I play with this at home... It's a hoot! Check out http://ibm1130.org/sim/other where various other sims are listed, all but CDC and Cray from your list above. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own.
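For several of the other machines on that list, the SIMH collection (which, as far as I know, the IBM 1130 simulator above is built on) covers the PDP-11, VAX, PDP-8 and many other historic systems. A very rough sketch of a session follows; the disk image name is hypothetical, and the exact devices and OS media depend on which machine and operating system you want to bring back.

# Rough sketch of a SIMH session, using the PDP-11 simulator as an example.
# The simulator binaries are named after the machine (pdp11, vax, pdp8, ...);
# the disk image below is hypothetical -- you supply your own OS media.
pdp11
# then, at the "sim>" prompt, something along these lines:
#   sim> attach rk0 rt11-system.dsk    (hypothetical RK05 image with an OS on it)
#   sim> boot rk0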