From walid.shaari at gmail.com Sat Dec 2 02:45:00 2006
From: walid.shaari at gmail.com (Walid)
Date: Sat, 2 Dec 2006 13:45:00 +0300
Subject: [Beowulf] non-proprietary IPMI card?
In-Reply-To:
References: <456CE9A3.1040103@cse.ucdavis.edu> <456D32CD.2030000@streamline-computing.com>
Message-ID:

On 11/29/06, Mark Hahn wrote:
> >> The Supermicro IPMI cards can share the eth0 port with the main gigabit
> >> channel. The cards are powered up and on the network whenever power is
> >> applied to the chassis. There is a bridge somewhere on the motherboard
> >> which bridges between eth0 and the IPMI interface.
>
> I would guess that the phy is just stolen by the IPMI, since sharing
> the actual eth controller (registers, etc) would probably be too touchy.
>
> > Wouldn't that sharing introduce a limitation? We have seen that we are
> > not able to manage nodes remotely when the node has a kernel panic, for
> > example.
>
> no - the IPMI wouldn't interact with the host OS, so if the latter is
> panicked (or turned off), it doesn't matter.

It does not interact with the OS, but it shares the same network buffer in the NIC. When this buffer gets full, you cannot manage the node any more, and apparently it does fill up during crashes.

regards

Walid.

From csamuel at vpac.org Sun Dec 3 15:15:06 2006
From: csamuel at vpac.org (Chris Samuel)
Date: Mon, 4 Dec 2006 10:15:06 +1100
Subject: [Beowulf] More technical information and spec of beowulf
In-Reply-To:
References:
Message-ID: <200612041015.06679.csamuel@vpac.org>

On Thursday 30 November 2006 18:19, reza bakhshi wrote:
> How can i find some more detailed technical information on Beowulf
> software infrastructure?

Hopefully these will help..

http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book/beowulf_book/
http://en.wikipedia.org/wiki/Beowulf_(computing)
http://clustermonkey.net/

Good luck!
Chris
--
 Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia

From ballen at gravity.phys.uwm.edu Mon Dec 4 03:48:55 2006
From: ballen at gravity.phys.uwm.edu (Bruce Allen)
Date: Mon, 4 Dec 2006 05:48:55 -0600 (CST)
Subject: [Beowulf] 'liquid cooled' racks
Message-ID:

Dear Beowulf list,

For my next cluster room, I am hoping to use 'liquid cooled' racks made by Knurr (CoolTherm, http://www.thermalmanagement.de/). The scale is 67 racks x 7.5kW heat removal per rack (36 U usable per rack).

Does anyone on the list have experiences with these or similar racks (you can reply to me privately if you don't want to share your experiences publicly)? I am told that there are lots of installations in Europe, especially in Germany, but that the only US sites are at Penn State, somewhere in the Chicago area, and U. of Vermont, and that these US installations are fairly small-scale.

The racks have front and back doors that close, and contain fans in the back which circulate air through a heat exchanger located in the bottom of the rack. The heat exchanger transfers the heat into chilled water.

The advantages of this are that it is quiet, and that you don't need vertical height for underfloor ducting or overhead hot air removal. Disadvantages are cost, potential difficulty of working within the rack, and loss of one rack of cooling capacity if you open both the front and back doors.

(In this case I have access to 'building' funds that cannot be used to buy more cpus, so the cost issue is not important.)

I would be very interested to hear about people's experiences with these racks (or similar ones from another manufacturer) in an HPC environment.
Cheers,
Bruce

From twm at tcg-hsv.com Mon Dec 4 07:15:34 2006
From: twm at tcg-hsv.com (Tim Moore)
Date: Mon, 04 Dec 2006 09:15:34 -0600
Subject: [Beowulf] Node Drop-Off
In-Reply-To: <16aa0e180611121340u1ad13c2ch892808b653e7abb@mail.gmail.com>
References: <45578E91.1080407@tcg-hsv.com> <16aa0e180611121340u1ad13c2ch892808b653e7abb@mail.gmail.com>
Message-ID: <45743B96.10003@tcg-hsv.com>

Update to node drop-off:

I wrote a few weeks ago to ask about node drop-off. A quick note...I had a cluster run for 3 years without failure and I upgraded the Opteron 240 CPUs to 250s. The upgrade required a BIOS upgrade and while I was at it, I upgraded the OS and security. Some readers provided good suggestions for diagnosis. As it turned out, of the 16 CPU batch...two were flawed. Replacing power supplies, HDD and the cooling solution, and reseating memory, did not help. The CPU flaw only manifested itself (at first) after several hours of CPU usage. With each failure, the time before the next failure shortened, and by the time I figured it out it was down to about 2 minutes.

The AMD engineer with whom I talked was amazed that such CPUs made it beyond quality control. He also suggested that the vendor may have inadvertently mixed returned processors (previously determined to be flawed) with the new ones and sent them out (again) as new.

Just for future reference...is there an easy way to determine if a CPU is flawed without 2 weeks of down time and extensive hair extraction?

Tim

From ladd at che.ufl.edu Mon Dec 4 07:35:29 2006
From: ladd at che.ufl.edu (Tony Ladd)
Date: Mon, 4 Dec 2006 10:35:29 -0500
Subject: [Beowulf] Node Drop-Off
In-Reply-To: <45743B96.10003@tcg-hsv.com>
Message-ID: <023501c717b9$d425e3b0$656ce30a@ladd02>

Tim,

Our university HPC cluster had similar problems with dual-core Opteron 275s.
They had about 20 bad ones out of a batch of 400. The nodes would run OK for a while and then die. It took many months to track down the source of the problem-AMD gave the same lame excuse-bad QA. Tony -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Tim Moore Sent: Monday, December 04, 2006 10:16 AM To: beowulf at beowulf.org Subject: Re: [Beowulf] Node Drop-Off Update to node drop-off: I wrote a few weeks ago to ask about node drop-off. A quick note...I had a cluster run for 3 years without failure and I upgraded the Opteron 240 CPUs to 250s. The upgrade required a BIOS upgrade and while I was at it, upgraded the OS and security. Some readers provided good suggestions for diagnosis. As it turned out, of the 16 CPU batch...two were flawed. No success was derived from replacing power supplies, HDD, resetting memory and the cooling solution. The CPU flaw only manifested itself (at first) after several hours of CPU usage. With each failure, the time duration shortened before the next failure and by the time I figured it out was down to about 2 minutes. The AMD engineer with whom I talked was amazed that such CPUs made it beyond quality control. He also suggested that the vendor may have inadvertently mixed returned (previously fetermined to be flawed processors) with the new ones and sent them out (again) as new. Just for future reference...is there an easy way to determine if a CPU is flawed with 2 weeks of down time and extensive hair extraction???? Tim From gerry.creager at tamu.edu Mon Dec 4 06:17:01 2006 From: gerry.creager at tamu.edu (Gerry Creager N5JXS) Date: Mon, 04 Dec 2006 08:17:01 -0600 Subject: [Beowulf] 'liquid cooled' racks In-Reply-To: References: Message-ID: <45742DDD.3090407@tamu.edu> No experience yet but our new IBM P575 installation will use these or something similar. 
We don't have sufficient residual forced air cooling to deal with 640 P5+ CPUs in our campus data center and our plans for dedicated HPC space on campus are still embryonic. I'll relate our experiences. WRT noise load, the "cost" of another Leibert is trivial compared to the case fans we'll still have to endure on the new system. I've heard numbers in excess of 87 dBa for this system, necessitating hearing protection now for anyone entering the data center space. gerry Bruce Allen wrote: > Dear Beowulf list, > > For my next cluster room, I am hoping to use 'liquid cooled' racks make > by Knurr (CoolTherm, http://www.thermalmanagement.de/). The scale is 67 > racks x 7.5kW heat removal per rack (36 U usable per rack). > > Does anyone on the list have experiences with these or similar racks > (you can reply to me privately if you don't want to share your > experiences publicly)? I am told that there are lots of installations > in Europe, especially in Germany, but that the only US sites are at Penn > State, somewhere in the Chicago area, and U. of Vermont, and that these > US installations are fairly small-scale. > > The racks have front and back doors that close, and contains fans in the > back which circulate air through a heat exchanger located in the bottom > of the rack. The heat exchanger transfers the heat into chilled water. > > The advantages of this are that it is quiet, and that you don't need > vertical height for underfloor ducting or overhead hot air removal. > Disadvantages are cost, potential difficulty of working within the rack. > and loss of one rack of cooling capacity if you open both the front and > back doors. > > (In this case I have access to 'building' funds that can not be used to > buy more cpus, so the cost issue is not important.) > > I would be very interested to hear about people's experiences with these > racks (or similar ones from another manufacturer) in an HPC environment. 
> > Cheers, > Bruce > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From diep at xs4all.nl Mon Dec 4 08:57:50 2006 From: diep at xs4all.nl (Vincent Diepeveen) Date: Mon, 4 Dec 2006 17:57:50 +0100 Subject: [Beowulf] Node Drop-Off References: <45578E91.1080407@tcg-hsv.com><16aa0e180611121340u1ad13c2ch892808b653e7abb@mail.gmail.com> <45743B96.10003@tcg-hsv.com> Message-ID: <002901c717c5$578bbbc0$9600000a@gourmandises> To let nodes fail quickly, go find BIG primes with latest GMP nonstop at the opterons. After a few days of nonstop calculation bad nodes should get total hung, when you hit worst case path. most use prime95 for such stuff, but that thing is using exclusively "intel" assembly SIMD code and the cpu's seem quite well tested/fixed for SIMD. Seems to me that a mix of integer multiplication and modulo there is worst case path of those opteron chips. It is of course also possible it hits a bug in the linux kernel, because my own code doesn't get hung at all when running at 4 cores, whereas GMP did do exactly that. My code currently works fastest at windows, as it doesn't have inline assembly for linux yet. I wouldn't rule out that linux kernel simply has bugs there. The testing of those kernels is total amateuristic. Vincent ----- Original Message ----- From: "Tim Moore" To: Sent: Monday, December 04, 2006 4:15 PM Subject: Re: [Beowulf] Node Drop-Off > Update to node drop-off: > > I wrote a few weeks ago to ask about node drop-off. A quick note...I > had a cluster run for 3 years without failure and I upgraded the Opteron > 240 CPUs to 250s. 
The upgrade required a BIOS upgrade and while I was > at it, upgraded the OS and security. Some readers provided good > suggestions for diagnosis. As it turned out, of the 16 CPU batch...two > were flawed. No success was derived from replacing power supplies, HDD, > resetting memory and the cooling solution. The CPU flaw only manifested > itself (at first) after several hours of CPU usage. With each failure, > the time duration shortened before the next failure and by the time I > figured it out was down to about 2 minutes. > > The AMD engineer with whom I talked was amazed that such CPUs made it > beyond quality control. He also suggested that the vendor may have > inadvertently mixed returned (previously fetermined to be flawed > processors) with the new ones and sent them out (again) as new. > > Just for future reference...is there an easy way to determine if a CPU > is flawed with 2 weeks of down time and extensive hair extraction???? > > Tim > > -------------------------------------------------------------------------------- > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From hahn at physics.mcmaster.ca Mon Dec 4 09:57:26 2006 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Mon, 4 Dec 2006 12:57:26 -0500 (EST) Subject: [Beowulf] 'liquid cooled' racks In-Reply-To: References: Message-ID: > For my next cluster room, I am hoping to use 'liquid cooled' racks make by > Knurr (CoolTherm, http://www.thermalmanagement.de/). The scale is 67 racks x > 7.5kW heat removal per rack (36 U usable per rack). 7.5 KW/rack isn't much; are you designing low-power nodes? > The racks have front and back doors that close, and contains fans in the back > which circulate air through a heat exchanger located in the bottom of the > rack. The heat exchanger transfers the heat into chilled water. 
APC has something similar with the cooling on the side. > The advantages of this are that it is quiet, and that you don't need vertical > height for underfloor ducting or overhead hot air removal. Disadvantages are > cost, potential difficulty of working within the rack. and loss of one rack > of cooling capacity if you open both the front and back doors. I'm pretty skeptical of the sealed-pod approach, since it seems to multiply the number of parts, create access issues, doesn't seem to actually save on space, etc. I've also been burned by cold water cooling, so to speak (assuming you don't have your own, well-controlled CW plant.) I would definitely consider a normal big-chillers approach _with_ back-of-rack CW boosters (heat exchangers). and I'd definitely consider creative layouts of racks and chillers (for instance, I'm not crazy about the hot/cold-aisle approach - something W-shaped would be better for my ~50-rack machineroom. or even just lining all the hot racks up against the chillers along one wall.) > (In this case I have access to 'building' funds that can not be used to buy > more cpus, so the cost issue is not important.) are you sure you can't just be clever about laying out normal front-to-back racks? possibly with back-of-rack heat-exchangers? regards, mark hahn. From James.P.Lux at jpl.nasa.gov Mon Dec 4 10:09:09 2006 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Mon, 04 Dec 2006 10:09:09 -0800 Subject: [Beowulf] Node Drop-Off In-Reply-To: <45743B96.10003@tcg-hsv.com> References: <45578E91.1080407@tcg-hsv.com> <16aa0e180611121340u1ad13c2ch892808b653e7abb@mail.gmail.com> <45743B96.10003@tcg-hsv.com> Message-ID: <6.2.3.4.2.20061204100214.032b7c30@mail.jpl.nasa.gov> At 07:15 AM 12/4/2006, Tim Moore wrote: >Update to node drop-off: > > >The AMD engineer with whom I talked was amazed that such CPUs made >it beyond quality control. 
He also suggested that the vendor may
>have inadvertently mixed returned processors (previously determined to be
>flawed) with the new ones and sent them out (again) as new.
>
>Just for future reference...is there an easy way to determine if a
>CPU is flawed without 2 weeks of down time and extensive hair extraction?

This has been a persistent problem in the industry for decades. It's not unheard of for wholesalers to get batches of processors that have been re-marked somewhere along the line. They'll run at the marked speed at room temp, but not over the entire temperature range, or maybe not under some bus loading conditions. Obviously, the mfrs hate it when this occurs, and so they've been doing things like making the speed grade readable from some built-in ROM or as a bond wire option.

And, as much as the mfrs hate it, there's always the "fell off the back of the truck on the way to the disposal company" problem.. parts that don't pass the mfr inspection are supposed to be destroyed, but sometimes aren't.

Processors are a high dollar item for something quite compact, they're sort of commodity (at least as far as the end user is concerned), so they're ripe for all the fiddles that have been used on such items for millennia. Hey, didn't Archimedes get famous for devising some sort of test along those lines?

Jim

From rgb at phy.duke.edu Mon Dec 4 10:59:17 2006
From: rgb at phy.duke.edu (Robert G.
Brown) Date: Mon, 4 Dec 2006 13:59:17 -0500 (EST) Subject: [Beowulf] Node Drop-Off In-Reply-To: <6.2.3.4.2.20061204100214.032b7c30@mail.jpl.nasa.gov> References: <45578E91.1080407@tcg-hsv.com> <16aa0e180611121340u1ad13c2ch892808b653e7abb@mail.gmail.com> <45743B96.10003@tcg-hsv.com> <6.2.3.4.2.20061204100214.032b7c30@mail.jpl.nasa.gov> Message-ID: On Mon, 4 Dec 2006, Jim Lux wrote: > Processors are a high dollar item for something quite compact, they're sort > of commodity (at least as far as the end user is concerned), so they're ripe > for all the fiddles that have been used on such items for millenia. Hey, > didn't Archimedes get famous for devising some sort of test along those > lines? Eureka! So he did! Are you suggesting that he take his processors into the bath with him so they can talk to his rubber duckies:-) or that he run naked through the streets in front of the main AMD office in protest? The latter would probably work better than the former at least if >>I<< were doing the running. Any toplevel AMD exec would do anything to crank up quality control rather than have to endure the sight of an old fat bald guy running around naked out front screaming about defective processors...;-) rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From i.kozin at dl.ac.uk Mon Dec 4 11:26:21 2006 From: i.kozin at dl.ac.uk (Kozin, I (Igor)) Date: Mon, 4 Dec 2006 19:26:21 -0000 Subject: [Beowulf] MEW17 webcast Message-ID: If things go as planned we will be webcasting our 17th Machine Evaluation Workshop tomorrow 5th December and 6th December. As usual the presentations will be made available from our web site but hopefully this time with recorded video stream as well. Igor I. 
Kozin (i.kozin at dl.ac.uk) CCLRC Daresbury Laboratory, WA4 4AD, UK skype: in_kozin tel: +44 (0) 1925 603308 http://www.cse.clrc.ac.uk/disco From ballen at gravity.phys.uwm.edu Mon Dec 4 11:29:04 2006 From: ballen at gravity.phys.uwm.edu (Bruce Allen) Date: Mon, 4 Dec 2006 13:29:04 -0600 (CST) Subject: [Beowulf] 'liquid cooled' racks In-Reply-To: References: Message-ID: Hi Mark, >> For my next cluster room, I am hoping to use 'liquid cooled' racks make by >> Knurr (CoolTherm, http://www.thermalmanagement.de/). The scale is 67 racks >> x 7.5kW heat removal per rack (36 U usable per rack). > > 7.5 KW/rack isn't much; are you designing low-power nodes? Yup -- I guess so. For example our current cluster nodes are dual core Opteron 175s (Supermicro H8SSL-i motherboards). They cost about $1200 for 2 x 2.2 GHz, with 1GB per core, and use 180 W under load (per 1U). I will have enough rack space (67 x 34 U, per 500kW power and cooling) that if I want to use nodes that dissipate 300 W per node then I will simply limit myself to 25 nodes per rack. >> The racks have front and back doors that close, and contains fans in the >> back which circulate air through a heat exchanger located in the bottom of >> the rack. The heat exchanger transfers the heat into chilled water. > > APC has something similar with the cooling on the side. Can you give a positive or negative opinion about either the Knurr or APC racks? Have you used them yourself in a system? >> The advantages of this are that it is quiet, and that you don't need >> vertical height for underfloor ducting or overhead hot air removal. >> Disadvantages are cost, potential difficulty of working within the rack. >> and loss of one rack of cooling capacity if you open both the front and >> back doors. > > I'm pretty skeptical of the sealed-pod approach, since it seems to > multiply the number of parts, create access issues, doesn't seem to > actually save on space, etc. 
It does save on vertical space, and it reduces noise to the level where you can work comfortably in the room.

> I've also been burned by cold water cooling, so to speak (assuming you
> don't have your own, well-controlled CW plant.)

I'll have my own self-contained plant, designed by German engineers. They seem to know their stuff.

> I would definitely consider a normal big-chillers approach _with_
> back-of-rack CW boosters (heat exchangers).

Can you provide a URL or recommendation for these back-of-rack CW boosters?

> and I'd definitely consider creative layouts of racks and chillers (for
> instance, I'm not crazy about the hot/cold-aisle approach - something
> W-shaped would be better for my ~50-rack machineroom. or even just
> lining all the hot racks up against the chillers along one wall.)
>
>> (In this case I have access to 'building' funds that can not be used to buy
>> more cpus, so the cost issue is not important.)
>
> are you sure you can't just be clever about laying out normal front-to-back
> racks? possibly with back-of-rack heat-exchangers?

I'd like a URL for the back-of-rack heat exchangers, please!

(PS: though that approach still sounds noisy!)

Cheers,
Bruce

From jmdavis1 at vcu.edu Mon Dec 4 11:55:41 2006
From: jmdavis1 at vcu.edu (Mike Davis)
Date: Mon, 04 Dec 2006 14:55:41 -0500
Subject: [Beowulf] 'liquid cooled' racks
In-Reply-To:
References:
Message-ID: <45747D3D.4040308@vcu.edu>

Bruce,

I am investigating the APC InfraStruXure and Liebert XDH systems for a new small (1000 sqft) room. The benefits seem to outweigh the potential downsides. They include relatively easy n+1 capacity, tightly coupled cooling for efficiency, and containment to prevent air mixing.

Virginia Tech has been using the Liebert XDV units as supplemental cooling for System X since it was installed. I believe that other top 100 installations are planned that will use the XDH. InfraStruXure seems to be good on a price-for-performance basis because the units are modular and generic.
Mike Davis

From hahn at physics.mcmaster.ca Mon Dec 4 12:28:38 2006
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Mon, 4 Dec 2006 15:28:38 -0500 (EST)
Subject: [Beowulf] 'liquid cooled' racks
In-Reply-To:
References:
Message-ID:

>> 7.5 KW/rack isn't much; are you designing low-power nodes?
>
> Yup -- I guess so. For example our current cluster nodes are dual core
> Opteron 175s (Supermicro H8SSL-i motherboards). They cost about $1200 for 2
> x 2.2 GHz, with 1GB per core, and use 180 W under load (per 1U).

nice. for loosely-coupled workloads, that's a smart design, though I'm curious: did you consider and reject something from the Core2 world? (its excellent in-cache FP throughput would also match loosely-coupled workloads.)

>> APC has something similar with the cooling on the side.
>
> Can you give a positive or negative opinion about either the Knurr or APC
> racks? Have you used them yourself in a system?

I'm afraid not. if your heart is set on this approach, my main suggestion is to carefully consider the capacity and quality of your CW supply. for instance, if your CW loop is run by campus people and is shared with human-space cooling, you're probably in trouble.

>> I'm pretty skeptical of the sealed-pod approach, since it seems to multiply
>> the number of parts, create access issues, doesn't seem to actually save on
>> space, etc.
>
> It does save on vertical space, and it reduces noise to the level where you
> can work comfortably in the room.

well, I can't speak to your room's limitations, but the Knurr stuff does appear to consume some vertical space on its own. I can't tell whether it's also significantly deeper (to allow for hot and cold air-handling internal plenums (plena?)).

my experience with system noise is that it's very dependent on the systems themselves. the chillers are not particularly noisy, but some systems are 80dB all the time; others are 65-70 when cool and 85 when warm.
>> I would definitely consider a normal big-chillers approach _with_ >> back-of-rack CW boosters (heat exchangers). > > Can you provide a URL or recommendation for these back-of-rack CW boosters? um, well, SGI, HP and IBM all seem to source them from somewhere, not sure. I think it makes more sense than the liebert XDR approach (over-the-top boosters). > (PS: though that approach still sounds noisy!) why? the main noise source is the fans in the server, not the chillers. and I think the back-door coolers are normally passive. the "eServer Rear Door Heat Exchanger" appears to be. HP's thing: http://h18004.www1.hp.com/products/servers/proliantstorage/racks/mcs/index.html is like a rack with a half-rack of fans+heat-ex next to it. APC has a similar half-rack slice that can be slotted anywhere (it just cools back-to-front). http://www-03.ibm.com/press/us/en/pressrelease/20343.wss --- if you need a quiet room, well OK, but it sounds more like putting computers in office space, rather than a machineroom. (I don't find there's much reason to be in machinerooms any more.) I hate to sound like an ass, but I'm pretty skeptical of the vertical-space argument as well, since you lose some vertical space inside the rack with Knurr, and normal open-concept rooms _don't_ actually consume much vertical space. for instance, consider a room with three normal liebert downdraft units, a 16" underfloor plenum and all the racks arranged in a row with their hot side facing the chillers. that would work awesomely, and would probably work with a total of 8 ft. I'd worry more about optimizing the layout of the racks, inside and out (for instance, eth and power breakers consume 4u out of my compute racks; further, I have 11 racks of incredibly low-dissipation quadrics switches, which should really be taken out of the airflow, since they're <2KW/rack.) 
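
[Editorial aside on the figures in this thread.] The rack numbers quoted above (67 racks, 7.5 kW of heat removal per rack, 34 usable U, 1U nodes drawing 180 W or 300 W) can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the function name and the assumption that every node is 1U and that cooling capacity and rack space are the only two limits are mine, not anything stated by the posters.

```python
# Rough rack budget check using the figures quoted in the thread.
# nodes_per_rack() is a hypothetical helper; it assumes the only limits
# are per-rack heat removal and usable rack units.

def nodes_per_rack(rack_cooling_w, node_power_w, usable_u, node_height_u=1):
    """Return how many nodes fit, limited by cooling or by space."""
    by_power = rack_cooling_w // node_power_w   # whole nodes the cooling can absorb
    by_space = usable_u // node_height_u        # whole nodes that physically fit
    return min(by_power, by_space)

RACKS = 67
RACK_COOLING_W = 7500   # 7.5 kW heat removal per rack
USABLE_U = 34           # figure used in Bruce's follow-up message

total_kw = RACKS * RACK_COOLING_W / 1000
nodes_180w = nodes_per_rack(RACK_COOLING_W, 180, USABLE_U)
nodes_300w = nodes_per_rack(RACK_COOLING_W, 300, USABLE_U)

print(f"total heat removal: {total_kw} kW")
print(f"180 W nodes per rack: {nodes_180w}")
print(f"300 W nodes per rack: {nodes_300w}")
```

Under those assumptions the total comes to 502.5 kW, the 180 W nodes are limited by space (34 per rack, about 6.1 kW) and the 300 W nodes are limited by cooling (25 per rack), which agrees with the "about 500 kW" and "25 nodes per rack" figures given in the thread.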
From James.P.Lux at jpl.nasa.gov Mon Dec 4 14:16:48 2006
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Mon, 04 Dec 2006 14:16:48 -0800
Subject: [Beowulf] Node Drop-Off
In-Reply-To:
References: <45578E91.1080407@tcg-hsv.com> <16aa0e180611121340u1ad13c2ch892808b653e7abb@mail.gmail.com> <45743B96.10003@tcg-hsv.com> <6.2.3.4.2.20061204100214.032b7c30@mail.jpl.nasa.gov>
Message-ID: <6.2.3.4.2.20061204141006.03296550@mail.jpl.nasa.gov>

At 10:59 AM 12/4/2006, Robert G. Brown wrote:
>On Mon, 4 Dec 2006, Jim Lux wrote:
>
>>Processors are a high dollar item for something quite compact,
>>they're sort of commodity (at least as far as the end user is
>>concerned), so they're ripe for all the fiddles that have been used
>>on such items for millennia. Hey, didn't Archimedes get famous for
>>devising some sort of test along those lines?
>
>Eureka! So he did!
>
>Are you suggesting that he take his processors into the bath with him so
>they can talk to his rubber duckies:-)

No.. you have that confused with the liquid cooling thread.

>or that he run naked through the
>streets in front of the main AMD office in protest?

I suppose so.. I hadn't thought about that aspect of Archimedes's "work". I was more thinking of the detection of fraudulent jewelry.

>The latter would probably work better than the former at least if >>I<<
>were doing the running. Any toplevel AMD exec would do anything to
>crank up quality control rather than have to endure the sight of an old
>fat bald guy running around naked out front screaming about defective
>processors...;-)

They're in Silicon Valley in Northern California, aren't they? While it might not be as "counter culture" as downtown San Francisco a few miles away, or as mellow as Marin county a few miles farther north or Santa Cruz to the west, you might not get as much attention as you might think. They might just think that you're just another one of those eccentric cluster monkeys (the "naked ape" revisited)..
At least the weather is usually temperate enough that you wouldn't die instantly of hypothermia or sun stroke. James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From rgb at phy.duke.edu Mon Dec 4 16:01:52 2006 From: rgb at phy.duke.edu (Robert G. Brown) Date: Mon, 4 Dec 2006 19:01:52 -0500 (EST) Subject: [Beowulf] Node Drop-Off In-Reply-To: <6.2.3.4.2.20061204141006.03296550@mail.jpl.nasa.gov> References: <45578E91.1080407@tcg-hsv.com> <16aa0e180611121340u1ad13c2ch892808b653e7abb@mail.gmail.com> <45743B96.10003@tcg-hsv.com> <6.2.3.4.2.20061204100214.032b7c30@mail.jpl.nasa.gov> <6.2.3.4.2.20061204141006.03296550@mail.jpl.nasa.gov> Message-ID: On Mon, 4 Dec 2006, Jim Lux wrote: > At 10:59 AM 12/4/2006, Robert G. Brown wrote: >> On Mon, 4 Dec 2006, Jim Lux wrote: >> >>> Processors are a high dollar item for something quite compact, they're >>> sort of commodity (at least as far as the end user is concerned), so >>> they're ripe for all the fiddles that have been used on such items for >>> millenia. Hey, didn't Archimedes get famous for devising some sort of >>> test along those lines? >> >> Eureka! So he did! >> >> Are you suggesting that he take his processors into the bath with him so >> they can talk to his rubber duckies:-) > > No.. you have that confused with the liquid cooling thread. Good thinking! Another benefit! He can measure their density and cool them at the same time! > They're in Silcon Valley in Northern California, aren't they. While it might > not be as "counter culture" as downtown San Francisco a few miles away, or as > mellow as Marin county a few miles farther north or Santa Cruz to the west, > you might not get as much attention as you might think. They might just > think that you're just another one of those eccentric cluster monkeys (the > "naked ape" revisited).. 
At least the weather is usually temperate enough > that you wouldn't die instantly of hypothermia or sun stroke. Oh yeah, forgot about that. There are plenty of old bald fat naked guys on the streets and beaches of CA, especially near SV. Well, that does it. Out of ideas. Guess one has to fall back on the old standbys, pulling one's hair out by the roots, cursing, drinking heavily, sacrificing chickens, running 48 hours of large-prime-number generation, or (my personal favorite) buying from a tier 1 vendor who also sells you 3 years of onsite service so you pick up a phone and say in delicate tones "Gee, these three nodes crash as soon as they start to actually work -- fix them. Now." This is really the basic difference between tier 1 and tier 2. You can save short term money with the latter, but have to do things like just plain throw out hardware -- after sweating over it for a long time, nagging your tier 2 vendor, getting angry, losing a lot of productivity and time. For some projects that works -- for others it doesn't. Note that I'm not asserting any sort of TCO bullshit advantage to one or the other -- as long as you spend some of the money you save on throwaway and replacement you can minimize losses as long as you aren't TOO unlucky you can do fine with tier 2, but you do need to recognize up front the trade-offs of the decision and deal with the additional hassles with a sigh if not a smile...;-) rgb > > > James Lux, P.E. > Spacecraft Radio Frequency Subsystems Group > Flight Communications Systems Section > Jet Propulsion Laboratory, Mail Stop 161-213 > 4800 Oak Grove Drive > Pasadena CA 91109 > tel: (818)354-2075 > fax: (818)393-6875 > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From hahn at physics.mcmaster.ca Mon Dec 4 16:59:03 2006 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Mon, 4 Dec 2006 19:59:03 -0500 (EST) Subject: [Beowulf] Node Drop-Off In-Reply-To: References: <45578E91.1080407@tcg-hsv.com> <16aa0e180611121340u1ad13c2ch892808b653e7abb@mail.gmail.com> <45743B96.10003@tcg-hsv.com> <6.2.3.4.2.20061204100214.032b7c30@mail.jpl.nasa.gov> Message-ID: > were doing the running. Any toplevel AMD exec would do anything to > crank up quality control rather than have to endure the sight of an old AMD and our system vendor (HP) did a good job replacing >1500 opteron 252's in my cluster this year. this was the result of a "test escape", which did eventually show up publicly on amd.com. it's important to choose your vendor carefully, for reasons like this. regards, mark hahn. (not all of (old,fat,bald), and living in a no-streak-in-december climate) From jlb17 at duke.edu Tue Dec 5 04:03:22 2006 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Tue, 5 Dec 2006 07:03:22 -0500 (EST) Subject: [Beowulf] Node Drop-Off In-Reply-To: References: <45578E91.1080407@tcg-hsv.com> <16aa0e180611121340u1ad13c2ch892808b653e7abb@mail.gmail.com> <45743B96.10003@tcg-hsv.com> <6.2.3.4.2.20061204100214.032b7c30@mail.jpl.nasa.gov> <6.2.3.4.2.20061204141006.03296550@mail.jpl.nasa.gov> Message-ID: On Mon, 4 Dec 2006 at 7:01pm, Robert G. Brown wrote > This is really the basic difference between tier 1 and tier 2. You can > save short term money with the latter, but have to do things like just > plain throw out hardware -- after sweating over it for a long time, > nagging your tier 2 vendor, getting angry, losing a lot of productivity > and time. For some projects that works -- for others it doesn't. I think you're painting with too broad a brush there. I find the Tier 2 I buy from to be *far* more helpful and proactive than the dominant Tier 1 here on campus. 
I can relate a number of stories about them finding spare parts for systems long out of warranty, upgrading components when it's the quicker rather than the cheaper way to fix a problem, replacing shipper damaged systems within days, etc. In short, the sort of personal, helpful service I've never seen from a Tier 1. And it's not like I'm a major customer, either -- they supply some *big* operations. Sure, there are many less than good Tier 1s out there, so caveat beowulfer. But you can some who, IMHO, outperform the big boys considerably, and not just on price. -- Joshua Baker-LePain Department of Biomedical Engineering Duke University From amjad11 at gmail.com Mon Dec 4 02:30:01 2006 From: amjad11 at gmail.com (amjad ali) Date: Mon, 4 Dec 2006 10:30:01 +0000 Subject: [Beowulf] More technical information and spec of beowulf In-Reply-To: <200612041015.06679.csamuel@vpac.org> References: <200612041015.06679.csamuel@vpac.org> Message-ID: <428810f20612040230s16d1f9b4m9284e8c129105bcf@mail.gmail.com> > > How can i find some more detailed technical information on Beowulf > > software infrastructure? > > Hopefully these will help.. > > http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book/beowulf_book/ > > http://en.wikipedia.org/wiki/Beowulf_(computing) > > http://clustermonkey.net/ > > Good luck! > Chris Might start with the following as well:
() http://cs.boisestate.edu/~amit/research/beowulf/beowulf-setup.pdf
() Building a Beowulf System, Jan Lindheim, Caltech
() "A survey of open source cluster management systems" by StoneLion at http://www.linux.com/
() http://oscar.openclustergroup.org
() Docs available at the website of ASPEN SYSTEMS
() "How to Build a Beowulf Linux Cluster" at the website of The Mississippi Center for Supercomputing Research
() "Configuring a Beowulf Cluster" by Forrest Hoffman, Extreme Linux
regards, AMJAD ALI, BZU, Multan.
From merc4krugger at gmail.com Mon Dec 4 03:54:13 2006 From: merc4krugger at gmail.com (Krugger) Date: Mon, 4 Dec 2006 11:54:13 +0000 Subject: [Beowulf] A suitable motherboard for newbie In-Reply-To: <428810f20611280325j44e500cdm7c9e91c6abd57899@mail.gmail.com> References: <1bef2ce30611260352r33e8d1aia809f75280e1ca07@mail.gmail.com> <428810f20611280325j44e500cdm7c9e91c6abd57899@mail.gmail.com> Message-ID: I would like to contribute that going quad core is a really bad idea at the moment, as Linux support for the new boards is not very good yet. Even some Core 2 Duo ready boards have trouble, either with the SATA controller or with some other new component on the board. So before you buy, check your hardware against known hardware problems on Google: for example, some Tyan models that have erratic behavior and need BIOS upgrades, SATA controllers that don't work, or RAID boards that break under heavy load. This is especially true if you are buying in small quantities and don't have a maintenance contract with a supplier. Now, to get the most out of your money you should really consider how much you get for it. For example, going rack mounted is only an option if you intend to expand and have a cooled room. Another thing is not to buy the latest technology available but to go for reliable and cheaper computers. Why? Because you get more computers, which in the end will get you more overall computing power. For example, for 20,000 dollars you can get almost twice as many Opterons if you drop the dual core option, which means twice the total memory, which allows you to run bigger simulations. Basically you get the same number of tasks, 5(machines) * 2(cpus) * 2(cores) = 10(machines) * 2(cpus), but you get more memory: 40 GB of RAM compared to 20 GB. This assumes you would otherwise have chosen the top of the line dual core Opteron, which cost three times as much as the top of the line CPU without dual core, which really isn't considered top of the line.
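The task and memory arithmetic in the message above can be checked with a short Python sketch. The 4 GB per machine figure is an assumption inferred from the totals quoted (20 GB across five machines), and the class and field names are purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class ClusterOption:
    machines: int
    cpus_per_machine: int
    cores_per_cpu: int
    ram_gb_per_machine: int  # assumed: 4 GB, inferred from 20 GB / 5 machines

    @property
    def tasks(self) -> int:
        # One task per core across the whole cluster.
        return self.machines * self.cpus_per_machine * self.cores_per_cpu

    @property
    def total_ram_gb(self) -> int:
        return self.machines * self.ram_gb_per_machine


# The two configurations compared in the message.
dual_core = ClusterOption(machines=5, cpus_per_machine=2, cores_per_cpu=2,
                          ram_gb_per_machine=4)
single_core = ClusterOption(machines=10, cpus_per_machine=2, cores_per_cpu=1,
                            ram_gb_per_machine=4)

for name, opt in (("dual-core", dual_core), ("single-core", single_core)):
    per_task = opt.total_ram_gb / opt.tasks
    print(f"{name}: {opt.tasks} tasks, {opt.total_ram_gb} GB RAM total, "
          f"{per_task:.0f} GB per task")
```

Under these assumptions both options run the same number of tasks, but the single-core option has twice the memory available per task, which is the point being made about bigger simulations.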
From eric-shook at uiowa.edu Mon Dec 4 07:13:38 2006 From: eric-shook at uiowa.edu (Eric Shook) Date: Mon, 04 Dec 2006 09:13:38 -0600 Subject: [Beowulf] 'liquid cooled' racks In-Reply-To: References: Message-ID: <45743B22.5090908@uiowa.edu> Hi Bruce, Our University is also looking into these racks. We have also looked at other vendors with similar liquid cooling and something called "Spraycool" technology (limited in deployment) among others. I would also be interested in the information you collect on or off the list and would be willing to share some information. Thanks, Eric -- Eric Shook (319) 335-6714 Technical Lead, Systems and Operations - GROW http://grow.uiowa.edu Bruce Allen wrote: > Dear Beowulf list, > > For my next cluster room, I am hoping to use 'liquid cooled' racks make > by Knurr (CoolTherm, http://www.thermalmanagement.de/). The scale is 67 > racks x 7.5kW heat removal per rack (36 U usable per rack). > > Does anyone on the list have experiences with these or similar racks > (you can reply to me privately if you don't want to share your > experiences publicly)? I am told that there are lots of installations > in Europe, especially in Germany, but that the only US sites are at Penn > State, somewhere in the Chicago area, and U. of Vermont, and that these > US installations are fairly small-scale. > > The racks have front and back doors that close, and contains fans in the > back which circulate air through a heat exchanger located in the bottom > of the rack. The heat exchanger transfers the heat into chilled water. > > The advantages of this are that it is quiet, and that you don't need > vertical height for underfloor ducting or overhead hot air removal. > Disadvantages are cost, potential difficulty of working within the rack. > and loss of one rack of cooling capacity if you open both the front and > back doors. > > (In this case I have access to 'building' funds that can not be used to > buy more cpus, so the cost issue is not important.) 
> > I would be very interested to hear about people's experiences with these > racks (or similar ones from another manufacturer) in an HPC environment. > > Cheers, > Bruce > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From ruhollah.mb at gmail.com Mon Dec 4 23:34:14 2006 From: ruhollah.mb at gmail.com (Ruhollah Moussavi Baygi ) Date: Tue, 5 Dec 2006 11:04:14 +0330 Subject: [Beowulf] SATA II Message-ID: <1bef2ce30612042334j3cfa6f00ta0a7c3f90afb1842@mail.gmail.com> Hi All at Beowulf ! There are some questions about implementation of a Beowulf cluster: 1-Regarding OS, is "Fedora Core 64bit" a good option for AMD Athlon 64 X2 4200+? 2- Is SATA II HDD compatible with Fedora Core 64bit? 3- Concerning RAM, is "2 GB 800 MHz DDR2" sufficient? Any other suggestion? Thanks in Advance -- Best, Ruhollah Moussavi B. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ruhollah.mb at gmail.com Sun Dec 3 08:13:57 2006 From: ruhollah.mb at gmail.com (Ruhollah Moussavi Baygi ) Date: Sun, 3 Dec 2006 19:43:57 +0330 Subject: [Beowulf] More technical information and spec of beowulf In-Reply-To: References: Message-ID: <1bef2ce30612030813j1d8c2cb6r9c9c49f8db617683@mail.gmail.com> Hi Reza! What do u mean exactly by "Beowulf software infrastructure" ? u know Beowulf is just a manner of connecting commonplace computers (nodes) together to make a robust system. Then your computational job will be divided among nodes. As far as I know, there is no especial software for this framework. All you need is a message passing interface (MPI) which makes it possible for nodes to pass messages together while running job. There are a lot of scientific packages (and mainly open source) which have been developed in such a way that you can run in parallel, like GROMACS, DLPOLY, AMBER, WIEN2K, and many others. 
All these packages (at least their computational parts) are written in Fortran or C/C++. This is because MPI provides bindings for these languages, so code written in them can run in parallel across nodes. However, there is also some commercial engineering and finite element software that can run in parallel, such as FLUENT and CFX. I hope this is helpful for you, Ruhollah Moussavi B. On 11/30/06, reza bakhshi wrote: > > Hi there, > How can i find some more detailed technical information on Beowulf > software infrastructure? > Thank you :) > -- Best, Ruhollah Moussavi B. From jlb17 at duke.edu Tue Dec 5 06:01:53 2006 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Tue, 5 Dec 2006 09:01:53 -0500 (EST) Subject: [Beowulf] Node Drop-Off In-Reply-To: References: <45578E91.1080407@tcg-hsv.com> <16aa0e180611121340u1ad13c2ch892808b653e7abb@mail.gmail.com> <45743B96.10003@tcg-hsv.com> <6.2.3.4.2.20061204100214.032b7c30@mail.jpl.nasa.gov> <6.2.3.4.2.20061204141006.03296550@mail.jpl.nasa.gov> Message-ID: On Tue, 5 Dec 2006 at 7:03am, Joshua Baker-LePain wrote > Sure, there are many less than good Tier 1s out there, so caveat ^^ *sigh* That should read 'Tier 2s', of course. That'll teach me to post before coffee. > beowulfer. But you can some who, IMHO, outperform the big boys > considerably, and not just on price.
> > -- Joshua Baker-LePain Department of Biomedical Engineering Duke University From cap at nsc.liu.se Tue Dec 5 08:48:42 2006 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Tue, 5 Dec 2006 17:48:42 +0100 Subject: [Beowulf] SATA II In-Reply-To: <1bef2ce30612042334j3cfa6f00ta0a7c3f90afb1842@mail.gmail.com> References: <1bef2ce30612042334j3cfa6f00ta0a7c3f90afb1842@mail.gmail.com> Message-ID: <200612051748.46469.cap@nsc.liu.se> On Tuesday 05 December 2006 08:34, Ruhollah Moussavi Baygi wrote: > Hi All at Beowulf ! > > There are some questions about implementation of a Beowulf cluster: > > 1-Regarding OS, is "Fedora Core 64bit" a good option for AMD Athlon 64 X2 > 4200+? You'll have to upgrade to a later fedora core after a while when support runs out, but pretty much a matter of taste. (technically it'll be as good as any) > 2- Is SATA II HDD compatible with Fedora Core 64bit? The sata and/or raid controller will be a lot more critical to your linux experience than any choice of HDD. > 3- Concerning RAM, is "2 GB 800 MHz DDR2" sufficient? That depends close to 100% on what you'll run on it. /Peter -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From amacater at galactic.demon.co.uk Tue Dec 5 08:49:43 2006 From: amacater at galactic.demon.co.uk (Andrew M.A. Cater) Date: Tue, 5 Dec 2006 16:49:43 +0000 Subject: [Beowulf] 'liquid cooled' racks In-Reply-To: <45743B22.5090908@uiowa.edu> References: <45743B22.5090908@uiowa.edu> Message-ID: <20061205164943.GA2722@galactic.demon.co.uk> On Mon, Dec 04, 2006 at 09:13:38AM -0600, Eric Shook wrote: > Hi Bruce, > > Our University is also looking into these racks. We have also looked at > other vendors with similar liquid cooling and something called > "Spraycool" technology (limited in deployment) among others. 
I would > also be interested in the information you collect on or off the list and > would be willing to share some information. > The only things I know that used spraycool technology were big machines like Crays and ?? Thinking Machines ?? which dunked circuit boards in freon and sprayed the liquid to keep it moving. Surely they can't have revived that :) AndyC From hahn at physics.mcmaster.ca Tue Dec 5 09:07:36 2006 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Tue, 5 Dec 2006 12:07:36 -0500 (EST) Subject: [Beowulf] SATA II In-Reply-To: <1bef2ce30612042334j3cfa6f00ta0a7c3f90afb1842@mail.gmail.com> References: <1bef2ce30612042334j3cfa6f00ta0a7c3f90afb1842@mail.gmail.com> Message-ID: > 1-Regarding OS, is "Fedora Core 64bit" a good option for AMD Athlon 64 X2 > 4200+? sure. distros are just desktop decoration, and anything recent will perform equally well. you do probably want 64b, but that's not rare. > 2- Is SATA II HDD compatible with Fedora Core 64bit? disks don't have compatibility - controllers do. so it depends on your motherboard choice. but I haven't seen any builtin controllers that don't work well. > 3- Concerning RAM, is "2 GB 800 MHz DDR2" sufficient? any general answer is wrong. 2G is a huge waste of money for some applications, and not nearly enough for others. you'll pay a noticeable premium for ddr2/800 as opposed to ddr2/667, though, and might not notice the difference (again, depending on your workload). there is only one general correlation I'll draw: many, many loosely-coupled and/or serial jobs have tiny memory footprints (so 2GB is overkill, and since the cache is effective, higher bandwidth is wasted). it's hard to say anything useful about tighter-than-loose parallel jobs, since memory size/intensiveness varies a lot - more so than for loose/serial, at least in my experience... regards, mark hahn.
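Since the right amount of RAM depends on the workload, one way to ground the decision is to measure the peak memory footprint of a representative job before buying. Below is a minimal Python sketch, assuming a Linux node (where ru_maxrss is reported in kilobytes). The function name and the demo command are hypothetical stand-ins, not part of any cluster tool.

```python
import resource
import subprocess
import sys


def peak_rss_mb(cmd):
    """Run cmd to completion and return the peak resident set size in MB.

    RUSAGE_CHILDREN reports the maximum over all child processes waited
    for so far, so call this once per measurement process for a clean figure.
    """
    subprocess.run(cmd, check=True)
    # Assumption: Linux, where ru_maxrss is in kilobytes (macOS uses bytes).
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return peak_kb / 1024.0


# Stand-in workload: a child Python process that fills about 50 MB of memory.
demo_cmd = [sys.executable, "-c", "x = b'x' * (50 * 2**20)"]
print(f"peak RSS: {peak_rss_mb(demo_cmd):.1f} MB")
```

Replacing demo_cmd with the command line of a real serial job gives a rough per-task footprint, which can then be multiplied by the number of tasks per node to estimate how much RAM a node actually needs.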
From rbw at ahpcrc.org Tue Dec 5 10:11:07 2006 From: rbw at ahpcrc.org (Richard Walsh) Date: Tue, 05 Dec 2006 12:11:07 -0600 Subject: [Beowulf] 'liquid cooled' racks In-Reply-To: <20061205164943.GA2722@galactic.demon.co.uk> References: <45743B22.5090908@uiowa.edu> <20061205164943.GA2722@galactic.demon.co.uk> Message-ID: <4575B63B.5080709@ahpcrc.org> Andrew M.A. Cater wrote: > On Mon, Dec 04, 2006 at 09:13:38AM -0600, Eric Shook wrote: > >> Hi Bruce, >> >> Our University is also looking into these racks. We have also looked at >> other vendors with similar liquid cooling and something called >> "Spraycool" technology (limited in deployment) among others. I would >> also be interested in the information you collect on or off the list and >> would be willing to share some information. >> >> > The only things I know that used spraycool technology were big machines > like Crays and ?? Thinking Machines ?? which dunked circuit boards in > freon and sprayed the liquid to keep it moving. Surely they can't have > revived that :) > > AndyC > All, In case there is interest ... A "spray-cool" technology is the approach taken currently in the Cray X1. The liquid is a high-heat capacity fluoro-carbon, fairly inert, very expensive, but nasty if it starts to burn. One of the key innovations on the Cray X1 is that the circuits are "on the ceiling" so to speak and sprayed from below. The fluid is gravity collected and cycled up again. A "spray-cooled" rack would seem to require hermetic enclosure at some level if the same liquid were used. rbw -- Richard B. Walsh "The world is given to me only once, not one existing and one perceived. The subject and object are but one." Erwin Schroedinger Project Manager Network Computing Services, Inc. 
Army High Performance Computing Research Center (AHPCRC) rbw at ahpcrc.org | 612.337.3467 ----------------------------------------------------------------------- This message (including any attachments) may contain proprietary or privileged information, the use and disclosure of which is legally restricted. If you have received this message in error please notify the sender by reply message, do not otherwise distribute it, and delete this message, with all of its contents, from your files. ----------------------------------------------------------------------- From amjad11 at gmail.com Tue Dec 5 06:27:41 2006 From: amjad11 at gmail.com (amjad ali) Date: Tue, 5 Dec 2006 14:27:41 +0000 Subject: [Beowulf] More technical information and spec of beowulf In-Reply-To: <1bef2ce30612030813j1d8c2cb6r9c9c49f8db617683@mail.gmail.com> References: <1bef2ce30612030813j1d8c2cb6r9c9c49f8db617683@mail.gmail.com> Message-ID: <428810f20612050627v629d284fy84a86eeb459ce331@mail.gmail.com> Moussavi, you should also have a look at the links given in the emails from Chris Samuel and me (for better understanding). Especially the following: "A survey of open source cluster management systems" by StoneLion at http://www.linux.com/ http://oscar.openclustergroup.org CBeST from PSSC labs. MOAB Scali-Manage Scyld Software. regards, AMJAD ALI. From ajitm at cdac.in Tue Dec 5 08:46:18 2006 From: ajitm at cdac.in (ajit mote) Date: Tue, 05 Dec 2006 22:16:18 +0530 Subject: [Beowulf] mpi cleanup of tasks Message-ID: <1165337178.27963.10.camel@sysadm.stp.cdac.ernet.in> Hi, I was trying to write a script that cleans up orphaned MPI tasks. If you can mail me an mpi-cleanup script (and install instructions) that cleans up all orphaned MPI processes, it will save me a lot of trouble writing the script. And I will be indebted to you for life. Thanking you. From amacater at galactic.demon.co.uk Tue Dec 5 12:24:31 2006 From: amacater at galactic.demon.co.uk (Andrew M.A.
Cater) Date: Tue, 5 Dec 2006 20:24:31 +0000 Subject: [Beowulf] SATA II In-Reply-To: References: <1bef2ce30612042334j3cfa6f00ta0a7c3f90afb1842@mail.gmail.com> Message-ID: <20061205202431.GA3520@galactic.demon.co.uk> On Tue, Dec 05, 2006 at 12:07:36PM -0500, Mark Hahn wrote: > >1-Regarding OS, is "Fedora Core 64bit" a good option for AMD Athlon 64 X2 > >4200+? > > sure. distros are just desktop decoration, and anything recent will > perform equally well. you do probably want 64b, but that's not rare. > Think carefully about what it is you are looking to do. Fedora is relatively "bleeding edge" and has a very short lifetime - typically a year or 18 months of support. If you are looking to build a large cluster, you may want to evaluate something with a longer lifecycle / more stability. I will always recommend Debian for this: there is a huge amount of software ready packaged but, more importantly, the minimal install really is minimal if that is what you want. Debian Etch for AMD64 is stable enough for me: it is the current testing release but should be released as stable Real Soon Now (it was expected yesterday - it may be released as stable by the end of December). > >2- Is SATA II HDD compatible with Fedora Core 64bit? > > disks don't have compatibility - controllers do. so it depends on your > motherboard choice. but I haven't seen any builtin controllers that > don't work well. > How long is a piece of string? As Mark says, it depends on the motherboard controller. Brand new boards may need brand new kernels to support the chipset. > >3- Concerning RAM, is "2 GB 800 MHz DDR2" sufficient? > > any general answer is wrong. 2G is a huge waste of money for some > applications, and not nearly enough for others. you'll pay a noticable > premium for ddr2/800 as opposed to ddr2/667, though, and might not > notice the difference (again, depending on your workload). > Is interconnect speed a limiting factor or is memory access speed most important? 
Build a toy Beowulf for evaluation from four machines and run a subset of your work on it. Where are you sourcing your memory from and what premium are you paying if you want to take the cluster down to improve memory on the nodes later? Power,environment, cooling and heat wise, memory is cheaper and more reliable than spinning hard disks. > there is only one general correlations I'll draw: many, many > loosely-coupled and/or serial jobs have tiny memory footprints (so 2GB is > overkill, and since > the cache is effective, higher bandwidth is wasted). > > it's hard to say anything useful about tighter-than-loose parallel jobs, > since memory size/intensiveness varies a lot - moreso than for loose/serial, > at least in my experience... > > regards, mark hahn. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From john.hearns at streamline-computing.com Tue Dec 5 14:45:13 2006 From: john.hearns at streamline-computing.com (John Hearns) Date: Tue, 05 Dec 2006 22:45:13 +0000 Subject: [Beowulf] SATA II In-Reply-To: <20061205202431.GA3520@galactic.demon.co.uk> References: <1bef2ce30612042334j3cfa6f00ta0a7c3f90afb1842@mail.gmail.com> <20061205202431.GA3520@galactic.demon.co.uk> Message-ID: <4575F679.8040908@streamline-computing.com> Andrew M.A. Cater wrote: > On Tue, Dec 05, 2006 at 12:07:36PM -0500, Mark Hahn wrote: >>> 1-Regarding OS, is "Fedora Core 64bit" a good option for AMD Athlon 64 X2 >>> 4200+? >> sure. distros are just desktop decoration, and anything recent will >> perform equally well. you do probably want 64b, but that's not rare. >> > Think carefully about what it is you are looking to do. Fedora is > relatively "bleeding edge" and has a very short lifetime - typically a > year or 18 months of support. If you are looking to build a large I agree with what Andrew says. 
I would give some very serious consideration to SuSE Linux - which has a reputation for better support for 64-bit processors and the latest SATA/SAS controllers. There is the free professional version or SLES. On the Redhat side, give consideration to Redhat Enterprise. Or one of the Redhat-alike recompiles. I'm hinting at Scientific Linux here. Yes, SL 4 may not have the very latest leading edge kernel, but it does have a long support lifetime. From greg.lindahl at qlogic.com Tue Dec 5 15:34:49 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Tue, 5 Dec 2006 15:34:49 -0800 Subject: [Beowulf] 'liquid cooled' racks In-Reply-To: <4575B63B.5080709@ahpcrc.org> References: <45743B22.5090908@uiowa.edu> <20061205164943.GA2722@galactic.demon.co.uk> <4575B63B.5080709@ahpcrc.org> Message-ID: <20061205233449.GA5628@greglaptop> On Tue, Dec 05, 2006 at 12:11:07PM -0600, Richard Walsh wrote: > One of the key innovations on the Cray X1 is that the circuits > are "on the ceiling" > so to speak and sprayed from below. The fluid is gravity collected and > cycled up again. This technology predates the X1 by a while. -- g From rbw at ahpcrc.org Tue Dec 5 15:45:27 2006 From: rbw at ahpcrc.org (Richard Walsh) Date: Tue, 05 Dec 2006 17:45:27 -0600 Subject: [Beowulf] 'liquid cooled' racks In-Reply-To: <20061205233449.GA5628@greglaptop> References: <45743B22.5090908@uiowa.edu> <20061205164943.GA2722@galactic.demon.co.uk> <4575B63B.5080709@ahpcrc.org> <20061205233449.GA5628@greglaptop> Message-ID: <45760497.3030101@ahpcrc.org> Greg Lindahl wrote: > On Tue, Dec 05, 2006 at 12:11:07PM -0600, Richard Walsh wrote: > > >> One of the key innovations on the Cray X1 is that the circuits >> are "on the ceiling" >> so to speak and sprayed from below. The fluid is gravity collected and >> cycled up again. >> > > This technology predates the X1 by a while. > Mmm ... I did not know that ... in a reasonably successful commercial product (i.e. an innovation, rather than a mere invention)?
What was/is/were/are the product(s)? Perhaps outside of HPC ... rbw > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > -- Richard B. Walsh "The world is given to me only once, not one existing and one perceived. The subject and object are but one." Erwin Schroedinger Project Manager Network Computing Services, Inc. Army High Performance Computing Research Center (AHPCRC) rbw at ahpcrc.org | 612.337.3467 ----------------------------------------------------------------------- This message (including any attachments) may contain proprietary or privileged information, the use and disclosure of which is legally restricted. If you have received this message in error please notify the sender by reply message, do not otherwise distribute it, and delete this message, with all of its contents, from your files. ----------------------------------------------------------------------- From eric-shook at uiowa.edu Tue Dec 5 13:09:51 2006 From: eric-shook at uiowa.edu (Eric Shook) Date: Tue, 05 Dec 2006 15:09:51 -0600 Subject: [Beowulf] 'liquid cooled' racks In-Reply-To: <20061205164943.GA2722@galactic.demon.co.uk> References: <45743B22.5090908@uiowa.edu> <20061205164943.GA2722@galactic.demon.co.uk> Message-ID: <4575E01F.8020309@uiowa.edu> It has been revived. I was at their booth at SC06 asking about their technology. The website is http://www.spraycool.com. They offer it for a small set of Tier1 systems (or others for a cost I'm sure) If anyone knows anything I would be interested in feedback as I am interested. Thanks, Eric Andrew M.A. Cater wrote: > On Mon, Dec 04, 2006 at 09:13:38AM -0600, Eric Shook wrote: >> Hi Bruce, >> >> Our University is also looking into these racks. 
We have also looked at >> other vendors with similar liquid cooling and something called >> "Spraycool" technology (limited in deployment) among others. I would >> also be interested in the information you collect on or off the list and >> would be willing to share some information. >> > The only things I know that used spraycool technology were big machines > like Crays and ?? Thinking Machines ?? which dunked circuit boards in > freon and sprayed the liquid to keep it moving. Surely they can't have > revived that :) > > AndyC > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Eric Shook (319) 335-6714 Technical Lead, Systems and Operations - GROW http://grow.uiowa.edu From James.P.Lux at jpl.nasa.gov Tue Dec 5 19:23:05 2006 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Tue, 05 Dec 2006 19:23:05 -0800 Subject: [Beowulf] 'liquid cooled' racks In-Reply-To: <45760497.3030101@ahpcrc.org> References: <45743B22.5090908@uiowa.edu> <20061205164943.GA2722@galactic.demon.co.uk> <4575B63B.5080709@ahpcrc.org> <20061205233449.GA5628@greglaptop> <45760497.3030101@ahpcrc.org> Message-ID: <6.2.3.4.2.20061205191702.032bbe30@mail.jpl.nasa.gov> At 03:45 PM 12/5/2006, Richard Walsh wrote: >Greg Lindahl wrote: >>On Tue, Dec 05, 2006 at 12:11:07PM -0600, Richard Walsh wrote: >> >> >>>One of the key innovations on the Cray X1 is that the circuits are >>>"on the ceiling" >>>so to speak and sprayed from below. The fluid is gravity >>>collected and cycled up again. >>> >> >>This technology predates the X1 by a while. >> > Mmm ... I did not know that ... in a reasonably successful > commercial product (i.e. an innovation, rather > than a mere invention)? What was/is/were/are the product(s)? >Perhaps outside of HPC ... This has been around for quite a while (decades at least). 
It was in a Fluorinert brochure back in the mid 80s that I recall. There are also versions with ebullient (boiling) cooling. (There might even be high power vacuum tubes cooled this way, although I think they tend to use either a cooling jacket or a boiler, as opposed to spraying.) And, of course, it's been mentioned on this list, several years ago at least, as a potential solution for cooling a sealed box portable cluster that uses off the shelf mobos designed for convection cooling. As far as spray cooling of things goes, that's been around since Newcomen invented his steam engine, predating James Watt. Watt's big advance was realizing that the temperature changes in the engine cycle didn't all have to occur in the same place (i.e. you could keep the cylinder hot, and condense the steam somewhere else). Spray type heat exchangers are also fairly standard devices (a notable example being the "swamp cooler" familiar to those in dry desert climates). So, a decent idea, been around for a while; all it takes to make it a good *business* idea is the convergence of a need and a decent execution that is reliable and works. > James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From daniel.kidger at clearspeed.com Wed Dec 6 04:20:35 2006 From: daniel.kidger at clearspeed.com (Daniel Kidger) Date: Wed, 6 Dec 2006 12:20:35 -0000 Subject: [Beowulf] 'liquid cooled' racks In-Reply-To: <4575E01F.8020309@uiowa.edu> Message-ID: <041601c71930$eeeea3d0$7680a8c0@win.clearspeed.com> In spite of many of the slides saying "ISR Proprietary and Confidential" I did find this presentation on the web: http://www.vita.com/cool/pres-2004/1430-tilton.pdf see slide 11 for a photo of what they are doing. Slide 18 implies their early market is for Defence systems - I guess getting rid of hot air on a submarine is a bit tricky?
I am at a UK HPC Conference today (so is Greg L for that matter). One of the speakers said he was evaluating Spraycool to retrofit to his existing cluster. So if these guys spray *downwards* on the chip - what is the risk of a blocked tube causing the Fluorinert to catch fire? Daniel Dr. Daniel Kidger, Technical Consultant, ClearSpeed Technology plc, Bristol, UK E: daniel.kidger at clearspeed.com T: +44 117 317 2030 M: +44 7738 458742 "Write a wise saying and your name will live forever." - Anonymous. -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Eric Shook Sent: 05 December 2006 21:10 To: Andrew M.A. Cater Cc: beowulf at beowulf.org Subject: Re: [Beowulf] 'liquid cooled' racks It has been revived. I was at their booth at SC06 asking about their technology. The website is http://www.spraycool.com. They offer it for a small set of Tier1 systems (or others for a cost I'm sure) If anyone knows anything I would be interested in feedback as I am interested. Thanks, Eric Andrew M.A. Cater wrote: > On Mon, Dec 04, 2006 at 09:13:38AM -0600, Eric Shook wrote: >> Hi Bruce, >> >> Our University is also looking into these racks. We have also looked at >> other vendors with similar liquid cooling and something called >> "Spraycool" technology (limited in deployment) among others. I would >> also be interested in the information you collect on or off the list and >> would be willing to share some information. >> > The only things I know that used spraycool technology were big machines > like Crays and ?? Thinking Machines ?? which dunked circuit boards in > freon and sprayed the liquid to keep it moving.
Surely they can't have > revived that :) > > AndyC > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Eric Shook (319) 335-6714 Technical Lead, Systems and Operations - GROW http://grow.uiowa.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From cap at nsc.liu.se Wed Dec 6 04:39:41 2006 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Wed, 6 Dec 2006 13:39:41 +0100 Subject: [Beowulf] SATA II In-Reply-To: <1bef2ce30612052211v7f16033bh61f53b11fee26100@mail.gmail.com> References: <1bef2ce30612042334j3cfa6f00ta0a7c3f90afb1842@mail.gmail.com> <200612051748.46469.cap@nsc.liu.se> <1bef2ce30612052211v7f16033bh61f53b11fee26100@mail.gmail.com> Message-ID: <200612061339.45032.cap@nsc.liu.se> On Wednesday 06 December 2006 07:11, Ruhollah Moussavi Baygi wrote: > Thanks Peter, > But do you mean that SATA is not a suitable choice for a beowulf cluster? huh? Both me and Mark relied almost identically. Pointing out that the controller (motherboard SATA or add-on raid-controller) is what you have to be careful about, SATA or SATAII drives makes very little difference from a compatibility point of view. That said, SATA is very suitable for a beowulf cluster. /Peter ... > > > 2- Is SATA II HDD compatible with Fedora Core 64bit? > > > > The sata and/or raid controller will be a lot more critical to your linux > > experience than any choice of HDD. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From landman at scalableinformatics.com Wed Dec 6 04:59:19 2006 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 06 Dec 2006 07:59:19 -0500 Subject: [Beowulf] 'liquid cooled' racks In-Reply-To: <041601c71930$eeeea3d0$7680a8c0@win.clearspeed.com> References: <041601c71930$eeeea3d0$7680a8c0@win.clearspeed.com> Message-ID: <4576BEA7.5080008@scalableinformatics.com> Daniel Kidger wrote: > Slide 18 implies there early market is for Defence systems - I guess getting > rid of hot air on a submarine is a bit tricky? Hmmm... I would suggest that the important aspect is the lower noise level associated with heat removal on a sub. A bunch of 40mm fans whining away would make for rather easy tracking of subs... Considering that they are immersed in a huge heat bath of reasonable temperature, the important part of this would be how to couple the heat source effectively to this heat bath, also without lots of noise. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From ashley at quadrics.com Wed Dec 6 05:05:18 2006 From: ashley at quadrics.com (Ashley Pittman) Date: Wed, 06 Dec 2006 13:05:18 +0000 Subject: [Beowulf] 'liquid cooled' racks In-Reply-To: <041601c71930$eeeea3d0$7680a8c0@win.clearspeed.com> References: <041601c71930$eeeea3d0$7680a8c0@win.clearspeed.com> Message-ID: <1165410318.2246.20.camel@localhost.localdomain> On Wed, 2006-12-06 at 12:20 +0000, Daniel Kidger wrote: > Slide 18 implies there early market is for Defence systems - I guess getting > rid of hot air on a submarine is a bit tricky? I shouldn't imagine it's to tricky although it probably depends on where you park it, I can't imagine hot liquid would be any different however. 
Ashley, From greg.lindahl at qlogic.com Tue Dec 5 05:49:46 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Tue, 5 Dec 2006 05:49:46 -0800 Subject: [Beowulf] 'liquid cooled' racks In-Reply-To: <041601c71930$eeeea3d0$7680a8c0@win.clearspeed.com> References: <4575E01F.8020309@uiowa.edu> <041601c71930$eeeea3d0$7680a8c0@win.clearspeed.com> Message-ID: <20061205134946.GA1871@greglaptop.dl.ac.uk> On Wed, Dec 06, 2006 at 12:20:35PM -0000, Daniel Kidger wrote: > So if these guys spray *downwards* on the chip - what is the risk of a blocked > tube causing the Flurinert to catch fire? They don't use the same Flurinert that does nasty things when it catches on fire. -- greg From rbw at ahpcrc.org Wed Dec 6 06:07:41 2006 From: rbw at ahpcrc.org (Richard Walsh) Date: Wed, 06 Dec 2006 08:07:41 -0600 Subject: [Beowulf] 'liquid cooled' racks In-Reply-To: <6.2.3.4.2.20061205191702.032bbe30@mail.jpl.nasa.gov> References: <45743B22.5090908@uiowa.edu> <20061205164943.GA2722@galactic.demon.co.uk> <4575B63B.5080709@ahpcrc.org> <20061205233449.GA5628@greglaptop> <45760497.3030101@ahpcrc.org> <6.2.3.4.2.20061205191702.032bbe30@mail.jpl.nasa.gov> Message-ID: <4576CEAD.2000007@ahpcrc.org> Jim Lux wrote: > At 03:45 PM 12/5/2006, Richard Walsh wrote: >> Greg Lindahl wrote: >>> On Tue, Dec 05, 2006 at 12:11:07PM -0600, Richard Walsh wrote: >>>> One of the key innovations on the Cray X1 is that the circuits are >>>> "on the ceiling" >>>> so to speak and sprayed from below. The fluid is gravity collected >>>> and cycled up again. >>> This technology predates the X1 by a while. >>> >> Mmm ... I did not know that ... in a reasonably successful >> commercial product (i.e. an innovation, rather >> than a mere invention)? What was/is/were/are the product(s)? >> Perhaps outside of HPC ... > > This has been around for quite a while (decades at least). It was in > a Fluorinert brochure back in the mid 80s that I recall. There's also > versions with ebullient (boiling) cooling. 
There might even be high > power vacuum tubes cooled this way, although I think they either tend > to use a cooling jacket or a boiler, as opposed to spraying). > This is drifting away from the useful, but I was asking what other successful HPC (or computing generally) product has used a spray-cool, gravity collection system the like Cray X1. I am not saying there isn't one, just asking someone to tell what it is. And as a side note, to me innovation implies successful commercial application ... I understand that evaporation is a cooling process ... ;-) ... rbw -- Richard B. Walsh "The world is given to me only once, not one existing and one perceived. The subject and object are but one." Erwin Schroedinger Project Manager Network Computing Services, Inc. Army High Performance Computing Research Center (AHPCRC) rbw at ahpcrc.org | 612.337.3467 ----------------------------------------------------------------------- This message (including any attachments) may contain proprietary or privileged information, the use and disclosure of which is legally restricted. If you have received this message in error please notify the sender by reply message, do not otherwise distribute it, and delete this message, with all of its contents, from your files. ----------------------------------------------------------------------- From James.P.Lux at jpl.nasa.gov Wed Dec 6 06:23:42 2006 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed, 06 Dec 2006 06:23:42 -0800 Subject: [Beowulf] 'liquid cooled' racks In-Reply-To: <041601c71930$eeeea3d0$7680a8c0@win.clearspeed.com> References: <4575E01F.8020309@uiowa.edu> <041601c71930$eeeea3d0$7680a8c0@win.clearspeed.com> Message-ID: <6.2.3.4.2.20061206061702.032cd340@mail.jpl.nasa.gov> At 04:20 AM 12/6/2006, Daniel Kidger wrote: >In spite of many of the slides saying "ISR Propiatory and Confidential" Hah.. they've published it on the web, so it's not proprietary and confidential any more. 
Furthermore, if they really want trade secret protection, they've got to be a bit more careful about how they mark their stuff, or someone who DID steal their stuff could use the following defense: "How was I to know that this really was proprietary? They mark their stuff any old way, and lots of published information is marked as proprietary, even when it isn't, so the markings have no meaning."

Been there, done that, sat in the depositions.

>I did find this presentation on the web:
>
>http://www.vita.com/cool/pres-2004/1430-tilton.pdf
>
>see slide 11 for a photo of what they are doing.
>
>Slide 18 implies there early market is for Defence systems - I guess getting
>rid of hot air on a submarine is a bit tricky?

And for land and portable apps. Spray cooling in one form or another has been around for quite a while. It's really only useful when you can't afford to design a conduction cooled system (i.e. you absolutely, positively have to use some COTS widget, available in no other form, in a sealed box). It's a lot of complexity (pumps, sprayers, fluids, orientation sensitivity) compared to simple things like blowing cold air over the device.

>I am at a UK HPC Conference today (so is Greg L for that matter)
>One of the speakers said he was evaluating Spraycool to retrofit to his
>existing cluster.
>
>So if these guys spray *downwards* on the chip - what is the risk of a blocked
>tube causing the Flurinert to catch fire?

Fluorinert is, as the name implies, inert. It doesn't burn. Actually, it's pretty amazing stuff. If you remember the photo from decades ago of the mouse breathing under the surface of a liquid.. that was Fluorinert.

It IS quite pricey. Back in the 80s it was in the hundred dollars/gallon range.

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From rgb at phy.duke.edu Wed Dec 6 06:46:42 2006 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed, 6 Dec 2006 09:46:42 -0500 (EST) Subject: [Beowulf] 'liquid cooled' racks In-Reply-To: <1165410318.2246.20.camel@localhost.localdomain> References: <041601c71930$eeeea3d0$7680a8c0@win.clearspeed.com> <1165410318.2246.20.camel@localhost.localdomain> Message-ID: On Wed, 6 Dec 2006, Ashley Pittman wrote: > On Wed, 2006-12-06 at 12:20 +0000, Daniel Kidger wrote: >> Slide 18 implies there early market is for Defence systems - I guess getting >> rid of hot air on a submarine is a bit tricky? > > I shouldn't imagine it's to tricky although it probably depends on where > you park it, I can't imagine hot liquid would be any different however. Right, the waste heat from cluster nodes is a tiny bit of redirected energy and waste heat from the nuclear reactor. Submarines have always had to deal with thermal regulation in both directions as they generate 2nd-law based waste heat in many locations (including human bodies) and sail through waters that range from tepid to warm in the tropics to quite cold in the arctic (and generally quite cool at cruising depths wherever). It's not like all the OTHER electronics on a submarine don't generate heat. rgb > > Ashley, > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From rbw at ahpcrc.org Wed Dec 6 07:50:55 2006 From: rbw at ahpcrc.org (Richard Walsh) Date: Wed, 06 Dec 2006 09:50:55 -0600 Subject: [Beowulf] 'liquid cooled' racks In-Reply-To: <6.2.3.4.2.20061206061702.032cd340@mail.jpl.nasa.gov> References: <4575E01F.8020309@uiowa.edu> <041601c71930$eeeea3d0$7680a8c0@win.clearspeed.com> <6.2.3.4.2.20061206061702.032cd340@mail.jpl.nasa.gov> Message-ID: <4576E6DF.9020701@ahpcrc.org> Jim Lux wrote: > At 04:20 AM 12/6/2006, Daniel Kidger wrote: >> I am at a UK HPC Conference today (so is Greg L for that matter) >> One of the speakers said he was evaluating Spraycool to retrofit to his >> existing cluster. >> >> So if these guys spray *downwards* on the chip - what is the risk of >> a blocked >> tube causing the Flurinert to catch fire? > > Fluorinert is, as the name implies, inert. It doesn't burn. > Actually, it's pretty amazing stuff. If you remember the photo from > decades ago of the mouse breathing under the surface of a liquid.. > that was Flourinert. Non-combustable, but not exactly inert. At temperatures greater than 200 degrees C it begins to decompose yielding an HF aerosol among other things. I believe there was an incident with the old Cray T90 that required a computer room evacuation, but perhaps that was an urban myth. rbw -- Richard B. Walsh "The world is given to me only once, not one existing and one perceived. The subject and object are but one." Erwin Schroedinger Project Manager Network Computing Services, Inc. Army High Performance Computing Research Center (AHPCRC) rbw at ahpcrc.org | 612.337.3467 ----------------------------------------------------------------------- This message (including any attachments) may contain proprietary or privileged information, the use and disclosure of which is legally restricted. 
If you have received this message in error please notify the sender by reply message, do not otherwise distribute it, and delete this message, with all of its contents, from your files. ----------------------------------------------------------------------- From gdjacobs at gmail.com Wed Dec 6 08:30:56 2006 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Wed, 06 Dec 2006 10:30:56 -0600 Subject: [Beowulf] A suitable motherboard for newbie In-Reply-To: References: <1bef2ce30611260352r33e8d1aia809f75280e1ca07@mail.gmail.com> <428810f20611280325j44e500cdm7c9e91c6abd57899@mail.gmail.com> Message-ID: <4576F040.90803@gmail.com> Krugger wrote: > I would like to contribute that going quad core is a really bad idea > at the moment, as linux support for the new boards is not very good at > the moment. Even some Core Duo 2 ready boards have trouble, either it > is the SATA controller or some other new stuff that they put in it. So > before you buy check your hardware against known hardware problems in > google: like some Tyan models that happen to have erratic behavior and > need BIOS upgrades or SATA controller that don't work or RAID boards > that break under heavy load. This is especially true if your are > buying in a small amount and don't have a maintainance contract with a > supplier. > > Now to get the most out of your money you should be really considering > how much you get for your money. For example going rack mounted is > only an option if you intent to expand and have a cooled room. > > Another thing is not to buy the latest tecnology available but go for > the reliable and cheaper computers. Why? Because you get more > computers which in the end will get you more overall computing power. > For example for 20.000 dolares you can get almost twice as many > opterons if you had dropped the dual core option, which means twice > the total memory, which allows you to run bigger simulations. 
> Basically you get the same number of tasks, 5(machines) * 2(cpus) *
> 2(cores) = 10(machines) * 2(cpus), but you get more memory: 40 GB of RAM
> instead of 20 GB. This assumes you had chosen the top of the
> line dual core opteron, which cost three times as much as the top of
> the line cpu without dual core, which really isn't considered top of
> the line.

For such a low density application, you might consider dual core Athlon X2 cpus. Couple these with an entry level enthusiast motherboard (possibly an MSI k9nu?), ECC ram, and possibly a high performance GigE nic depending on what's on the motherboard. Intel and Broadcom based nics are nice because they're fast, well supported by standard Linux kernel drivers, and work with low-latency stacks like gamma.

Opteron carries a hefty price premium, and is only necessary when you want to run 2+ socket motherboards.

--
Geoffrey D. Jacobs

Go to the Chinese Restaurant,
Order the Special

From bill at cse.ucdavis.edu Wed Dec 6 12:01:36 2006
From: bill at cse.ucdavis.edu (Bill Broadley)
Date: Wed, 06 Dec 2006 12:01:36 -0800
Subject: [Beowulf] A suitable motherboard for newbie
In-Reply-To: <4576F040.90803@gmail.com>
References: <1bef2ce30611260352r33e8d1aia809f75280e1ca07@mail.gmail.com> <428810f20611280325j44e500cdm7c9e91c6abd57899@mail.gmail.com> <4576F040.90803@gmail.com>
Message-ID: <457721A0.7000207@cse.ucdavis.edu>

> For such a low density application, you might consider dual core Athlon
> X2 cpus. Couple these with an entry level enthusiast motherboard
> (possibly an MSI k9nu?), ECC ram, and possibly a high performance GigE
> nic depending on what's on the motherboard. Intel and Broadcom based
> nics are nice because they're fast, well supported by standard Linux
> kernel drivers, and work with low-latency stacks like gamma.
>
> Opteron carries a hefty price premium, and is only necessary when you
> want to run 2+ socket motherboards.
Speaking of which, Mark recently pointed out to me that the new AMD Quad FX chips (a pair of dual cores) are basically half price opterons. From what limited info I have it looks like they don't support registered memory, but otherwise use the same socket, same cache, etc. Seems ideal for many HPC uses (except those which require a large number of dimms per cpu). Has anyone played with a pair? Is the "Quad FX" socket actually different? Or does it just depend on non-registered dimms being plugged into the dimm slots? Pricing at: http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_609,00.html?redir=CPT301 From fant at pobox.com Wed Dec 6 15:01:07 2006 From: fant at pobox.com (Andrew D. Fant) Date: Wed, 06 Dec 2006 18:01:07 -0500 Subject: [Beowulf] LSF and Fluent in a beowulf environment Message-ID: <45774BB3.8060102@pobox.com> If I may interrupt for a minute with an application-oriented question... If there is anyone out there who is running Fluent (CFD) in parallel on a shared-access Linux cluster who would be willing to answer a couple questions for me in email, I would really appreciate hearing from you. Part of the problem involves LSF interactions, so it would be nice if you have seen this, but at this point, I would take any advice I am offered. 
Thanks, Andy -- Andrew Fant | And when the night is cloudy | This space to let Molecular Geek | There is still a light |---------------------- fant at pobox.com | That shines on me | Disclaimer: I don't Boston, MA | Shine until tomorrow, Let it be | even speak for myself From James.P.Lux at jpl.nasa.gov Wed Dec 6 19:47:17 2006 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed, 06 Dec 2006 19:47:17 -0800 Subject: [Beowulf] 'liquid cooled' racks In-Reply-To: <4576CEAD.2000007@ahpcrc.org> References: <45743B22.5090908@uiowa.edu> <20061205164943.GA2722@galactic.demon.co.uk> <4575B63B.5080709@ahpcrc.org> <20061205233449.GA5628@greglaptop> <45760497.3030101@ahpcrc.org> <6.2.3.4.2.20061205191702.032bbe30@mail.jpl.nasa.gov> <4576CEAD.2000007@ahpcrc.org> Message-ID: <6.2.3.4.2.20061206194125.02f79b58@mail.jpl.nasa.gov> At 06:07 AM 12/6/2006, Richard Walsh wrote: >Jim Lux wrote: >>At 03:45 PM 12/5/2006, Richard Walsh wrote: >>>Greg Lindahl wrote: >>>>On Tue, Dec 05, 2006 at 12:11:07PM -0600, Richard Walsh wrote: >>>>>One of the key innovations on the Cray X1 is that the circuits >>>>>are "on the ceiling" >>>>>so to speak and sprayed from below. The fluid is gravity >>>>>collected and cycled up again. >>>>This technology predates the X1 by a while. >>> Mmm ... I did not know that ... in a reasonably successful >>> commercial product (i.e. an innovation, rather >>> than a mere invention)? What was/is/were/are the product(s)? >>> Perhaps outside of HPC ... >> >>This has been around for quite a while (decades at least). It was >>in a Fluorinert brochure back in the mid 80s that I >>recall. There's also versions with ebullient (boiling) >>cooling. There might even be high power vacuum tubes cooled this >>way, although I think they either tend to use a cooling jacket or a >>boiler, as opposed to spraying). 
>This is drifting away from the useful, but I was asking what other >successful HPC (or computing generally) >product has used a spray-cool, gravity collection system the like >Cray X1. I am not saying there isn't one, just >asking someone to tell what it is. How do you define "successful"? Paying the salaries of the developers? Making a profit for the company using it? Working without breaking? >And as a side note, to me innovation implies successful commercial >application ... I understand that evaporation >is a cooling process ... ;-) ... Lots of successful innovations aren't very commercially viable (in the sense of providing a return to the shareholders). The Mars Rovers are fairly successful, contain lots of innovation, but there's not much commercial value in the science data returned from them. There are "wick/evaporation" cooling systems around too.. some "heat pipes" work by this principle. In fact, most passive heat pipes work by some sort of phase change (evaporate where it's hot, condense where it's cool) scheme. >rbw > >-- James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From ruhollah.mb at gmail.com Tue Dec 5 22:11:32 2006 From: ruhollah.mb at gmail.com (Ruhollah Moussavi Baygi ) Date: Wed, 6 Dec 2006 09:41:32 +0330 Subject: [Beowulf] SATA II In-Reply-To: <200612051748.46469.cap@nsc.liu.se> References: <1bef2ce30612042334j3cfa6f00ta0a7c3f90afb1842@mail.gmail.com> <200612051748.46469.cap@nsc.liu.se> Message-ID: <1bef2ce30612052211v7f16033bh61f53b11fee26100@mail.gmail.com> Thanks Peter, But do you mean that SATA is not a suitable choice for a beowulf cluster? On 12/5/06, Peter Kjellstrom wrote: > > On Tuesday 05 December 2006 08:34, Ruhollah Moussavi Baygi wrote: > > Hi All at Beowulf ! 
> >
> > There are some questions about implementation of a Beowulf cluster:
> >
> > 1-Regarding OS, is "Fedora Core 64bit" a good option for AMD Athlon 64 X2
> > 4200+?
>
> You'll have to upgrade to a later fedora core after a while when support runs
> out, but pretty much a matter of taste. (technically it'll be as good as any)
>
> > 2- Is SATA II HDD compatible with Fedora Core 64bit?
>
> The sata and/or raid controller will be a lot more critical to your linux
> experience than any choice of HDD.
>
> > 3- Concerning RAM, is "2 GB 800 MHz DDR2" sufficient?
>
> That depends close to 100% on what you'll run on it.
>
> /Peter
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>

--
Best,
Ruhollah Moussavi B.
Computational Physical Sciences Research Laboratory,
Department of NanoScience, IPM
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ruhollah.mb at gmail.com Wed Dec 6 00:58:49 2006
From: ruhollah.mb at gmail.com (Ruhollah Moussavi Baygi )
Date: Wed, 6 Dec 2006 12:28:49 +0330
Subject: [Beowulf] More technical information and spec of beowulf
In-Reply-To:
References: <1bef2ce30612030813j1d8c2cb6r9c9c49f8db617683@mail.gmail.com>
Message-ID: <1bef2ce30612060058x21c80f91qd8722465ce4edc6e@mail.gmail.com>

Hi Reza,

The following may be helpful. Anyone at Beowulf who thinks his/her notes can help, please join in on this subject:

MPI is nothing but a library of functions that can be called by a code (written in FORTRAN or C) to pass information between CPUs. A Beowulf, on the other hand, is nothing but a network of commonplace computers with this library installed on all of them. Truthfully speaking, I do not have thorough experience in adjusting MPI configurations but, as far as I know, there are no special configurations, and just installing MPI is enough.
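
To make "pass information between CPUs" a little more concrete, here is a minimal sketch of the send/receive pattern. It is not MPI: it uses Python's standard multiprocessing module as a stand-in so that it runs on a single machine without any MPI installation, and the names worker and exchange are made up for this illustration. It also assumes a Unix-like host, since it asks for the "fork" start method. A real MPI code in FORTRAN or C follows the same idea, with the library carrying the messages between processes on different nodes.

```python
import multiprocessing as mp


def worker(conn):
    """Receive a list of numbers, reply with their sum, then exit."""
    numbers = conn.recv()      # blocks until the parent sends a message
    conn.send(sum(numbers))    # pass the result back as a message
    conn.close()


def exchange(numbers):
    """Send `numbers` to a second process and return the sum it computes."""
    ctx = mp.get_context("fork")          # assumption: Unix-like host
    parent_end, child_end = ctx.Pipe()    # two connected endpoints
    proc = ctx.Process(target=worker, args=(child_end,))
    proc.start()
    parent_end.send(numbers)    # message from parent to worker
    result = parent_end.recv()  # message from worker back to parent
    proc.join()
    return result


if __name__ == "__main__":
    print(exchange([1, 2, 3, 4]))
```

The two processes share no variables; everything the worker knows arrives as an explicit message, which is the property that lets the same style of program spread across separate machines.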
As you told, Beowulf is just this. However, after implementation you can use it depending on your applications. There are 2 possibilities: 1-you (or your staff) yourself write parallel program (FORTRAN/C) and then compile it on Beowulf cluster, or, 2- you use prepared packages like GROMACS, DLPOLY, ESPRESSO, ... which mainly are ready with their source codes. On 12/5/06, reza bakhshi wrote: > > Hello Mr. Mousavi, > Thank you for your notes, but what I have read about Beowulf told me that > we need for example some job scheduling mechanism, distributed programming > (like PVM, MPI, ...). by software infra, i meant in an implemented Beowulf > cluster one just needs pvm, mpich2, condor, ... and that's all? > or we need more configurations applied on pvm, condor, and any other > utility to gain a full beowulf. > I hope these cleared my problem :) > Can you (or anyone) help more? > Thank you again, > > --Reza Bakhshi > > -----Original Message----- > From: "Ruhollah Moussavi Baygi " > To: "reza bakhshi" > Cc: beowulf at beowulf.org > Date: Sun, 3 Dec 2006 19:43:57 +0330 > Subject: Re: [Beowulf] More technical information and spec of beowulf > > > Hi Reza! > > What do u mean exactly by "Beowulf software infrastructure" ? > > > > u know Beowulf is just a manner of connecting commonplace computers > > (nodes) > > together to make a robust system. Then your computational job will be > > divided among nodes. As far as I know, there is no especial software > > for > > this framework. All you need is a message passing interface (MPI) which > > makes it possible for nodes to pass messages together while running > > job. > > > > There are a lot of scientific packages (and mainly open source) which > > have > > been developed in such a way that you can run in parallel, like > > GROMACS, > > DLPOLY, AMBER, WIEN2K, and many others. 
All these packages (at least > > their > > computational parts), have been compiled by FORTRAN or C/C++ .This is > > due to > > the capability of these two compilers which can be run in parallel and > > work > > with MPI. > > > > However, there are some engineering and finite element software which > > are > > absolutely commercial and capable of being run in parallel, among them > > are > > FLUENT, CFX, and so on. > > > > I hope this could be helpful for you, > > > > Ruhollah Moussavi B. > > On 11/30/06, reza bakhshi wrote: > > > > > > Hi there, > > > How can i find some more detailed technical information on Beowulf > > > software infrastructure? > > > Thank you :) > > > > > > > > > _______________________________________________ > > > Beowulf mailing list, Beowulf at beowulf.org > > > To change your subscription (digest mode or unsubscribe) visit > > > http://www.beowulf.org/mailman/listinfo/beowulf > > > > > > > > > > > -- > > Best, > > Ruhollah Moussavi B. > > > > > -- Best, Ruhollah Moussavi Baygi Computational Physical Sciences Research Laboratory, Department of NanoScience, IPM -------------- next part -------------- An HTML attachment was scrubbed... URL: From ruhollah.mb at gmail.com Wed Dec 6 01:03:03 2006 From: ruhollah.mb at gmail.com (Ruhollah Moussavi Baygi ) Date: Wed, 6 Dec 2006 12:33:03 +0330 Subject: [Beowulf] OSCAR? Message-ID: <1bef2ce30612060103x7460435dqbe1cc145e6179617@mail.gmail.com> Hi all at Boewulf, Does anyone have any experience in using OSCAR? Does anyone recommend/oppose using this suite for a small or large Beowulf cluster (AMD dual core 64bit)? -- Best, Ruhollah Moussavi Baygi Computational Physical Sciences Research Laboratory, Department of NanoScience, IPM -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From turuncu at be.itu.edu.tr Thu Dec 7 12:18:52 2006
From: turuncu at be.itu.edu.tr (turuncu at be.itu.edu.tr)
Date: Thu, 7 Dec 2006 22:18:52 +0200 (EET)
Subject: [Beowulf] Re: LSF and Fluent in a beowulf environment
Message-ID: <49255.85.102.189.28.1165522732.squirrel@www.be.itu.edu.tr>

Hi,

I wrote a simple LSF script that fixes the LSF and Fluent integration. You can use the following script to submit a job to LSF:

>> Begin script (don't add this line to script)
#!/usr/bin/ksh
#BSUB -a fluent     # LSF Fluent integration parameter
#BSUB -J FLUENT     # job name
#BSUB -o %J.out     # LSF output file
#BSUB -e %J.err     # LSF error file
#BSUB -n 12         # number of processes (must match the fluent -t parameter)
#BSUB -q trccsq     # queue name
# -------------------------
# erase the old machine file
rm -rf host.file
# get the names of the nodes allocated by LSF
np=`echo $LSB_HOSTS`
# generate a new machine file, one host name per line
for i in $np
do
  echo $i >> host.file
done
# -------------------------
fluent -lsf -pnmpi -cnf=./host.file 2ddp -i fluent.jou -t12 -p -g > fluent.out
>> End script (don't add this line)

To use the script you have to create a Fluent journal (fluent.jou) and data files. I hope this helps.

PS: the -t12 option and the LSF -n parameter must specify the same number of CPUs.

Best regards,

Ufuk Utku Turuncoglu
Istanbul Technical University
HPC Lab

From gdjacobs at gmail.com Thu Dec 7 22:30:47 2006
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Fri, 08 Dec 2006 00:30:47 -0600
Subject: [Beowulf] More technical information and spec of beowulf
In-Reply-To: <1bef2ce30612060058x21c80f91qd8722465ce4edc6e@mail.gmail.com>
References: <1bef2ce30612030813j1d8c2cb6r9c9c49f8db617683@mail.gmail.com> <1bef2ce30612060058x21c80f91qd8722465ce4edc6e@mail.gmail.com>
Message-ID: <45790697.8040609@gmail.com>

Nothing says you must use MPI. PVM was around on clusters years and years ago, and many people still prefer it. Some people are using OpenMosix and shared mem on small clusters. A software maniac might code a parallel application in raw tcp/ip.
I knew a guy who built a parallel Mandelbrot appliance using raw IPX on DOS in high school. Beowulf is nothing more than using networked commodity-based hardware to collectively solve large, parallel computational problems. -- Geoffrey D. Jacobs From gdjacobs at gmail.com Thu Dec 7 22:34:28 2006 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Fri, 08 Dec 2006 00:34:28 -0600 Subject: [Beowulf] SATA II In-Reply-To: <1bef2ce30612052211v7f16033bh61f53b11fee26100@mail.gmail.com> References: <1bef2ce30612042334j3cfa6f00ta0a7c3f90afb1842@mail.gmail.com> <200612051748.46469.cap@nsc.liu.se> <1bef2ce30612052211v7f16033bh61f53b11fee26100@mail.gmail.com> Message-ID: <45790774.8040400@gmail.com> Ruhollah Moussavi Baygi wrote: > Thanks Peter, > But do you mean that SATA is not a suitable choice for a beowulf cluster? SATA is fine. You just have to be choosy about the SATA/SAS controller, and be mindful of reliability issues with desktop drives. -- Geoffrey D. Jacobs From hahn at physics.mcmaster.ca Fri Dec 8 06:29:06 2006 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Fri, 8 Dec 2006 09:29:06 -0500 (EST) Subject: [Beowulf] SATA II In-Reply-To: <45790774.8040400@gmail.com> References: <1bef2ce30612042334j3cfa6f00ta0a7c3f90afb1842@mail.gmail.com> <200612051748.46469.cap@nsc.liu.se> <1bef2ce30612052211v7f16033bh61f53b11fee26100@mail.gmail.com> <45790774.8040400@gmail.com> Message-ID: >> Thanks Peter, >> But do you mean that SATA is not a suitable choice for a beowulf cluster? > SATA is fine. You just have to be choosy about the SATA/SAS controller, it's interesting that SAS advertising has obscured the fact that SAS is just a further development of SCSI, and not interchangable with SATA. for instance, no SATA controller will support any SAS disk, and any SAS setup uses a form of encapsulation to communicate with the foreign SATA protocol. 
SAS disks follow the traditional price formula of SCSI disks (at least 4x more than non-boutique disks), and I suspect the rest of SAS infrastructure will be in line with that.

> and be mindful of reliability issues with desktop drives.

I would claim that this is basically irrelevant for beowulf. for small clusters (say, < 100 nodes), you'll be hitting a negligible number of failures per year. for larger clusters, you can't afford any non-ephemeral install on the disks anyway - reboot-with-reimage should only take a couple minutes more than a "normal" reboot. and if you take the no-install (NFS root) approach (which I strongly recommend) the status of node-local disks can be just a minor node property to be handled by the scheduler.

by all means, buy only 5-year warranty mass-market drives, since there's no longer any premium vs 3 or even 1-year drives. the failure rate I've seen over the past couple years has been quite low - probably around .1-.5% AFR (failures/disk-year).

(that's ignoring infant mortality, of course, and assuming a reasonably cooled operating environment; expect higher rates if your supply chain involves piles of un-padded disks sitting on some shop's counter/shelf/display-case ;)

From gdjacobs at gmail.com Fri Dec 8 12:01:51 2006
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Fri, 08 Dec 2006 14:01:51 -0600
Subject: [Beowulf] SATA II
In-Reply-To:
References: <1bef2ce30612042334j3cfa6f00ta0a7c3f90afb1842@mail.gmail.com> <200612051748.46469.cap@nsc.liu.se> <1bef2ce30612052211v7f16033bh61f53b11fee26100@mail.gmail.com> <45790774.8040400@gmail.com>
Message-ID: <4579C4AF.5080607@gmail.com>

Mark Hahn wrote:
>>> Thanks Peter,
>>> But do you mean that SATA is not a suitable choice for a beowulf
>>> cluster?
>> SATA is fine. You just have to be choosy about the SATA/SAS controller,
>
> it's interesting that SAS advertising has obscured the fact that SAS is
> just a further development of SCSI, and not interchangable
> with SATA.
for instance, no SATA controller will support any SAS disk, > and any SAS setup uses a form of encapsulation to communicate with > the foreign SATA protocol. SAS disks follow the traditional price > formula of SCSI disks (at least 4x more than non-boutique disks), > and I suspect the rest of SAS infrastructure will be in line with that. Yes, SAS encapsulates SATA, but not vice-versa. The ability to use a hardware raid SAS controller with large numbers of inexpensive SATA drives is very attractive. I was also trying to be thorough. >> and be mindful of reliability issues with desktop drives. > > I would claim that this is basically irrelevant for beowulf. > for small clusters (say, < 100 nodes), you'll be hitting a negligable > number of failures per year. for larger clusters, you can't afford > any non-ephemeral install on the disks anyway - reboot-with-reimage > should only take a couple minutes more than a "normal" reboot. > and if you take the no-install (NFS root) approach (which I strongly > recommend) the status of a node-local disks can be just a minor node > property to be handled by the scheduler. PXE/NFS is absolutely the slickest way to go, but any service nodes should have some guarantee of reliability. In my experience, disks (along with power supplies) are two of the most common points of failure. -- Geoffrey D. 
Jacobs Go to the Chinese Restaurant, Order the Special From mwill at penguincomputing.com Fri Dec 8 13:14:31 2006 From: mwill at penguincomputing.com (Michael Will) Date: Fri, 08 Dec 2006 13:14:31 -0800 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <4579C4AF.5080607@gmail.com> References: <1bef2ce30612042334j3cfa6f00ta0a7c3f90afb1842@mail.gmail.com> <200612051748.46469.cap@nsc.liu.se> <1bef2ce30612052211v7f16033bh61f53b11fee26100@mail.gmail.com> <45790774.8040400@gmail.com> <4579C4AF.5080607@gmail.com> Message-ID: <4579D5B7.7090608@jellyfish.highlyscyld.com> Geoff Jacobs wrote: > Mark Hahn wrote: > >> it's interesting that SAS advertising has obscured the fact that SAS is >> just a further development of SCSI, and not interchangable >> with SATA. for instance, no SATA controller will support any SAS disk, >> and any SAS setup uses a form of encapsulation to communicate with >> the foreign SATA protocol. SAS disks follow the traditional price >> formula of SCSI disks (at least 4x more than non-boutique disks), >> and I suspect the rest of SAS infrastructure will be in line with that. >> > Yes, SAS encapsulates SATA, but not vice-versa. The ability to use a > hardware raid SAS controller with large numbers of inexpensive SATA > drives is very attractive. I was also trying to be thorough. > > >>> and be mindful of reliability issues with desktop drives. >>> >> I would claim that this is basically irrelevant for beowulf. >> for small clusters (say, < 100 nodes), you'll be hitting a negligable >> number of failures per year. for larger clusters, you can't afford >> any non-ephemeral install on the disks anyway - reboot-with-reimage >> should only take a couple minutes more than a "normal" reboot. >> and if you take the no-install (NFS root) approach (which I strongly >> recommend) the status of a node-local disks can be just a minor node >> property to be handled by the scheduler. 
>> > PXE/NFS is absolutely the slickest way to go, but any service nodes > should have some guarantee of reliability. In my experience, disks > (along with power supplies) are two of the most common points of failure Most of the clusters we configure for our customers use diskless compute nodes to minimize compute node failure for precisely the reason you mentioned, unless either the application can benefit from additional local scratch space (e.g. software raid0 over four sata drives allows reading/writing large datastreams at 280MB/s in a 1U server with 3TB of disk space on each compute node), or because they sometimes need to run jobs that require more virtual memory than they can afford to put in physically -> local swap space. We find that customers don't typically want to pay the premium for redundant power supplies+pdus+cabling for the compute nodes, though; that's something that is typically requested for head nodes and NFS servers. Also we find that NFS-offloading on the NFS server with the rapidfile card helps avoid scalability issues where the NFS server bogs down under massively parallel requests from, say, 128 cores in a 32 compute node dual cpu dual core cluster. The rapidfile card is a pci-x card with two fibre channel ports + two gige ports + nfs/cifs offloading processor on the same card. Since most bulk data transfer is redirected from fibre channel to gige nfs clients without passing through the NFS server cpu+ram itself, the nfs server's cpu load does not become the bottleneck; we find the limit is rather the number of spindles, until the two gige ports saturate. We configure clusters for our customers with Scyld Beowulf which does not nfs-mount root but rather just nfs-mounts the home directories because of its particular lightweight compute node model (PXE booting into RAM), and so does not run into the typical nfs-root scalability issues.
Michael Michael Will SE Technical Lead / Penguin Computing / www.penguincomputing.com From harsha at zresearch.com Thu Dec 7 21:03:24 2006 From: harsha at zresearch.com (Harshavardhana) Date: Fri, 8 Dec 2006 10:33:24 +0530 (IST) Subject: [Beowulf] Fluent Running on LSF env Message-ID: <4573.220.227.64.170.1165554204.squirrel@zresearch.com> Hi Andrew, i have some experience in doing fluent runs over the LSF env. You can hit me with some questions, may be i can answer some. Regards -Harshavardhana -- Harshavardhana "Software gets slower faster as Hardware gets faster" From eugen at leitl.org Fri Dec 8 01:04:03 2006 From: eugen at leitl.org (Eugen Leitl) Date: Fri, 8 Dec 2006 10:04:03 +0100 Subject: [Beowulf] SATA II In-Reply-To: <45790774.8040400@gmail.com> References: <1bef2ce30612042334j3cfa6f00ta0a7c3f90afb1842@mail.gmail.com> <200612051748.46469.cap@nsc.liu.se> <1bef2ce30612052211v7f16033bh61f53b11fee26100@mail.gmail.com> <45790774.8040400@gmail.com> Message-ID: <20061208090403.GR6974@leitl.org> On Fri, Dec 08, 2006 at 12:34:28AM -0600, Geoff Jacobs wrote: > SATA is fine. You just have to be choosy about the SATA/SAS controller, > and be mindful of reliability issues with desktop drives. Some SATA is more reliable than others. Caviar RE2 drives claim enterprise-level reliability, only at a slight price premium in comparison to e.g. WD Raptor series. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From steve_heaton at iinet.net.au Fri Dec 8 18:05:04 2006 From: steve_heaton at iinet.net.au (Steve Heaton) Date: Sat, 09 Dec 2006 13:05:04 +1100 Subject: [Beowulf] Beowulf analogy for a classroom Message-ID: <457A19D0.3070605@iinet.net.au> G'day all I'll skip the background as to 'why' but I've recently been working on a way to explain the Beowulf concept to a classroom of school kids. No computers required. 
I think I've come up with a useful analogy/experiment that might work. I've posted it here on the off chance that someone else might want to give it a try if they get pressed into such a situation. First, some 'newsgroup preempters'. This is designed for school kids, not your typical Beowulf list reader. Sorry but no cars or harnessing of chickens involved ;) No mentions of heat dissipation or switching fabrics. This is a non-technical post that someone might find handy at some point. So, here's the idea. Generate a list of, say, 100 random numbers (int). Make two copies, so you have three identical lists. Break the third list into 10 lists of 10 numbers each. You need 12 students. They're going to be your 'nodes' :) Ask the teacher which student is the best at mental arithmetic. They're going to be the 'standalone' node. You then pick 11 other students and they'll be the Beowulf. Of those select 1 as the 'master' node. Give the two long (100 number) lists to the 'standalone node' student and the teacher. Give the 10 number lists to the master node. Then you say 'Go!' and start the stopwatch. The 'standalone node' and the teacher start summing their lists. The 'master' node student hands out their lists to the other 'nodes' at their desks and they start summing. As each 'node' student finishes, they walk their summed result back to the 'master' node. The master node student sums the returning results. As the 'standalone' node, 'master' node and teacher each finish their sums, record their times on the board. Now I'm sure you can see there's a huge number of variables and things that could go wrong (as in any classroom demo). However, hopefully the overall result goes something like this: the teacher (representing a single powerful node) finishes first, then the Beowulf group of students, then the 'standalone' node. I think this structure could be quite useful and there are some good 'learning outcomes'. eg: The Beowulf group was 10 times bigger but why wasn't it ten times faster?
(Moving the results around = comms time, coordination etc) Why was the teacher faster? ((Hopefully) sharper math skills and no comms overhead) ... Imagine a Beowulf cluster of 'teachers'! ... Imagine the results if the cluster were (lower grade) students!? What if there were 20 of them instead of 10? (Leads to a discussion of Amdahl's Law etc etc) Obviously you can change the number of 'nodes' and difficulty of the addition to match the student abilities. Anyway, that's the idea. I'd love to hear if anyone ever has cause to use it :) Cheers Stevo From gmpc at sanger.ac.uk Sat Dec 9 01:11:07 2006 From: gmpc at sanger.ac.uk (Guy Coates) Date: Sat, 09 Dec 2006 09:11:07 +0000 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <4579D5B7.7090608@jellyfish.highlyscyld.com> References: <1bef2ce30612042334j3cfa6f00ta0a7c3f90afb1842@mail.gmail.com> <200612051748.46469.cap@nsc.liu.se> <1bef2ce30612052211v7f16033bh61f53b11fee26100@mail.gmail.com> <45790774.8040400@gmail.com> <4579C4AF.5080607@gmail.com> <4579D5B7.7090608@jellyfish.highlyscyld.com> Message-ID: <457A7DAB.8070403@sanger.ac.uk> > We configure clusters for our customers with Scyld Beowulf which does > not nfs-mount root but rather just nfs-mounts the home directories > because of its particular lightweight compute node model, (PXE booting > into RAM) and so does not run into the typical nfs-root scalability > issues. > > Michael > > Michael Will At what node count does the nfs-root model start to break down? Does anyone have any rough numbers for the number of clients you can support with a generic linux NFS server vs a dedicated NAS filer? Cheers, Guy -- Dr.
Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 From landman at scalableinformatics.com Sat Dec 9 05:58:55 2006 From: landman at scalableinformatics.com (Joe Landman) Date: Sat, 09 Dec 2006 08:58:55 -0500 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <457A7DAB.8070403@sanger.ac.uk> References: <1bef2ce30612042334j3cfa6f00ta0a7c3f90afb1842@mail.gmail.com> <200612051748.46469.cap@nsc.liu.se> <1bef2ce30612052211v7f16033bh61f53b11fee26100@mail.gmail.com> <45790774.8040400@gmail.com> <4579C4AF.5080607@gmail.com> <4579D5B7.7090608@jellyfish.highlyscyld.com> <457A7DAB.8070403@sanger.ac.uk> Message-ID: <457AC11F.9060609@scalableinformatics.com> Guy Coates wrote: > > At what node count does the nfs-root model start to break down? Does anyone > have any rough numbers with the number of clients you can support with a generic > linux NFS server vs a dedicated NAS filer? If you use warewulf or the new perceus variant, it creates a ram disk which is populated upon boot. Thats one of the larger transients. Then you nfs mount applications, and home directories. I haven't looked at Scyld for a while, but I seem to remember them doing something like this. If you have this operational in this manner, apart from home/scratch directory file service, you should be able to handle a few hundred nodes without too much pain, though you will want to beef up the design a bit. Scyld requires a meatier head node as I remember due to its launch model. Even better if you can load balance your NFS servers and set them up to mirror each other, or use a hardware unit like Panasas, and so on. 
Joe > > Cheers, > > Guy > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 or +1 866 888 3112 cell : +1 734 612 4615 From john.hearns at streamline-computing.com Sat Dec 9 07:42:58 2006 From: john.hearns at streamline-computing.com (John Hearns) Date: Sat, 09 Dec 2006 15:42:58 +0000 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <457AC11F.9060609@scalableinformatics.com> References: <1bef2ce30612042334j3cfa6f00ta0a7c3f90afb1842@mail.gmail.com> <200612051748.46469.cap@nsc.liu.se> <1bef2ce30612052211v7f16033bh61f53b11fee26100@mail.gmail.com> <45790774.8040400@gmail.com> <4579C4AF.5080607@gmail.com> <4579D5B7.7090608@jellyfish.highlyscyld.com> <457A7DAB.8070403@sanger.ac.uk> <457AC11F.9060609@scalableinformatics.com> Message-ID: <457AD982.9090405@streamline-computing.com> Joe Landman wrote: > > Guy Coates wrote: > >> At what node count does the nfs-root model start to break down? Does anyone >> have any rough numbers with the number of clients you can support with a generic >> linux NFS server vs a dedicated NAS filer? > > If you use warewulf or the new perceus variant, it creates a ram disk > which is populated upon boot. Thats one of the larger transients. Then > you nfs mount applications, and home directories. I haven't looked at > Scyld for a while, but I seem to remember them doing something like this. > I've booted up and run a 130 node cluster in ramdisk mode, with applications shared over NFS. I ran HPL benchmarks for the client across all nodes in the cluster in this configuration. That's 130 Sun galaxy 4200s, the head node being a Galaxy 4200 also in this case. I agree with what Joe says about a few hundred nodes being the time you would start to look closer at this approach. 
> > Even better if you can load balance your NFS servers and set them up to > mirror each other, or use a hardware unit like Panasas, and so on. Second that. -- John Hearns Senior HPC Engineer Streamline Computing, The Innovation Centre, Warwick Technology Park, Gallows Hill, Warwick CV34 6UW Office: 01926 623130 Mobile: 07841 231235 From buccaneer at rocketmail.com Sat Dec 9 08:38:52 2006 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Sat, 9 Dec 2006 08:38:52 -0800 (PST) Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes Message-ID: <20061209163852.31426.qmail@web30604.mail.mud.yahoo.com> [snip] > I agree with what Joe says about a few hundred nodes being the time you > would start to look closer at this approach. I have started to explore the possibility of using this technology because I would really like to see us with the ability to change OSs and OS Personalities as needed. The question I have is with 2000+ compute nodes what kind of infrastructure do I need to support this? From eric-shook at uiowa.edu Sat Dec 9 11:27:45 2006 From: eric-shook at uiowa.edu (Eric Shook) Date: Sat, 09 Dec 2006 13:27:45 -0600 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <20061209163852.31426.qmail@web30604.mail.mud.yahoo.com> References: <20061209163852.31426.qmail@web30604.mail.mud.yahoo.com> Message-ID: <457B0E31.8090205@uiowa.edu> Not to divert this conversation, but has anyone had any experience using this pxe boot / nfs model with a rhel variant? I have been wanting to do an nfs root or ramdisk model for some time but our software stack requires a rhel base so Scyld and Perceus most likely will not work (although I am still looking into both of them to make sure) Thanks for any help, Eric Shook Buccaneer for Hire.
wrote: > [snip] > > >> I agree with what Joe says about a few hundred nodes being the time you >> would start to look closer at this approach. >> > > I have started to explore the possibility of using this technology because I would really like to see us with the ability to change OSs and OS Personalities as needed. The question I have is with 2000+ compute nodes what kind of infrastructure do I need to support this? > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From john.hearns at streamline-computing.com Sat Dec 9 11:35:59 2006 From: john.hearns at streamline-computing.com (John Hearns) Date: Sat, 09 Dec 2006 19:35:59 +0000 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <20061209163852.31426.qmail@web30604.mail.mud.yahoo.com> References: <20061209163852.31426.qmail@web30604.mail.mud.yahoo.com> Message-ID: <457B101F.7030402@streamline-computing.com> Buccaneer for Hire. wrote: > [snip] > >> I agree with what Joe says about a few hundred nodes being the time you >> would start to look closer at this approach. > > I have started to explore the possibility of using this technology because I would really like to see us with the ability to change OSs and OS Personalities as needed. The question I have is with 2000+ compute nodes what kind of infrastructure do I need to support this? > With 2000+ nodes you should definitely look at remote power control, and remote serial console access. Also you might think of separate install servers for each (say) 500 machines. Mirror them up to each other of course.
It's unlikely that you would ever reboot 2000 machines at once, but think ahead to (say) quick power on following a power cut. I would hazard that any DHCP/PXE type install server would struggle with 2000 requests (yes - you arrange the power switching and/or reboots to stagger at N second intervals). From buccaneer at rocketmail.com Sat Dec 9 12:03:27 2006 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Sat, 9 Dec 2006 12:03:27 -0800 (PST) Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes Message-ID: <20061209200327.65078.qmail@web30610.mail.mud.yahoo.com> Thank you for writing... > With 2000+ nodes you should definitely look at remote power control, and > remote serial console access. Have it already in place with remote monitoring as well. > Also you might think of separate install servers for each (say) 500 > machines. Mirror them up to each other of course. We currently have 5 kickstart servers (one web server); the kickstart file is dynamically altered to reflect the assigned server. > It's unlikely that you would ever reboot 2000 machines at once, but think > ahead to (say) quick power on following a power cut. We have had to do that during a couple of hurricanes last year and power outages. We actually have complete startup and shutdown procedures that are well tested now. > I would hazard that any DHCP/PXE type install server would struggle with > 2000 requests (yes - you arrange the power switching and/or reboots to > stagger at N second intervals). There are a few modifications you have to make to increase the number of bootps before it fails. So now to figure out my next step. I will need local space for logs and data/temp data files. From laytonjb at charter.net Sat Dec 9 13:04:34 2006 From: laytonjb at charter.net (Jeffrey B.
Layton) Date: Sat, 09 Dec 2006 16:04:34 -0500 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <457B0E31.8090205@uiowa.edu> References: <20061209163852.31426.qmail@web30604.mail.mud.yahoo.com> <457B0E31.8090205@uiowa.edu> Message-ID: <457B24E2.20208@charter.net> Eric Shook wrote: > Not to diverge this conversation, but has anyone had any experience > using this pxe boot / nfs model with a rhel variant? I have been > wanting to do a nfs root or ramdisk model for some-time but our > software stack requires a rhel base so Scyld and Perceus most likely > will not work (although I am still looking into both of them to make sure) Warewulf and Perceus will work withe something like CentOS. Jeff From landman at scalableinformatics.com Sat Dec 9 13:56:29 2006 From: landman at scalableinformatics.com (Joe Landman) Date: Sat, 09 Dec 2006 16:56:29 -0500 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <20061209200327.65078.qmail@web30610.mail.mud.yahoo.com> References: <20061209200327.65078.qmail@web30610.mail.mud.yahoo.com> Message-ID: <457B310D.7060201@scalableinformatics.com> Buccaneer for Hire. wrote: >> I would hazard that any DHCP/PXE type install server would struggle >> with 2000 requests (yes- you arrange the power switching and/or >> reboots to stagger at N second intervals). fwiw: we use dnsmasq to serve dhcp and handle pxe booting. It does a marvelous job of both, and is far easier to configure (e.g. it is less fussy) than dhcpd. > There are a few modifications you have to make to increase the number > of bootps before it fails. Likely with dhcpd, not sure how many dnsmasq can handle, but we have done 36 at a time to do system checking. No problems with it. 
-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 or +1 866 888 3112 cell : +1 734 612 4615 From hahn at physics.mcmaster.ca Sat Dec 9 14:39:25 2006 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Sat, 9 Dec 2006 17:39:25 -0500 (EST) Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <457A7DAB.8070403@sanger.ac.uk> References: <1bef2ce30612042334j3cfa6f00ta0a7c3f90afb1842@mail.gmail.com> <200612051748.46469.cap@nsc.liu.se> <1bef2ce30612052211v7f16033bh61f53b11fee26100@mail.gmail.com> <45790774.8040400@gmail.com> <4579C4AF.5080607@gmail.com> <4579D5B7.7090608@jellyfish.highlyscyld.com> <457A7DAB.8070403@sanger.ac.uk> Message-ID: >> particular lightweight >> compute node model, (PXE booting into RAM) and so does not run into the >> typical >> nfs-root scalability issues. I'm not sure I know what those would be. do you mean that the kernel code for nfs-root has inappropriate timeouts or lacked effective retries? > At what node count does the nfs-root model start to break down? Does anyone > have any rough numbers with the number of clients you can support with a generic > linux NFS server vs a dedicated NAS filer? I think the answer depends mostly on your config. for instance, if you have a typical distro's incredibly baroque /etc/rc.d tree, then you'll be generating tons of traffic even though NFS caches quite well. but for HPC clustering, most of that is completely spurious - often a clusters nodes are all identical, so no extensive configurability is necessary in modules, daemons, etc. if there are scalability issues, they depend on saturating your NFS server with traffic, but you control that amount. on a somewhat neglected cluster I have, kernel+initrd amount to 3779277 bytes, which seems quite high. 
but probably limits the cluster to 10-ish nodes/second booting (it has 100 nodes, but I've never timed the boot). once a node has the kernel+initrd, it reads some other files via NFS, but nothing much (syslog binary+config, same for portmap, and sshd). to me, the tradeoff is transmitting via TFTP vs NFS. I would strongly suspect that the latter is more efficient and robust, so would prefer to minimize the kernel+initrd size. for what it's worth, I tcpdumped a node booting just now: 11428451 bytes in 14236 packets (40.8 seconds). that's a 2.6 kernel, myrinet support, syslog, ssh, queuing system written in perl, and home and scratch mounts. with some effort, that could probably be 5MB or so. it's also clear that separate servers could handle subsets of the traffic in a large cluster (separate dhcp/tftp/syslog, separate servers for nfs root vs other) regards, mark hahn. From hahn at physics.mcmaster.ca Sat Dec 9 15:44:25 2006 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Sat, 9 Dec 2006 18:44:25 -0500 (EST) Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <20061209200327.65078.qmail@web30610.mail.mud.yahoo.com> References: <20061209200327.65078.qmail@web30610.mail.mud.yahoo.com> Message-ID: >> I would hazard that any DHCP/PXE type install server would struggle with >> 2000 requests a single server (implying 1 gb nic?) might have trouble with the tftp part, but I don't see why you couldn't scale up by splitting the tftp part off to multiple servers. I'd expect a single DHCP (no TFTP) would be plenty in all cases. 100 tftp clients per server would probably be pretty safe. I personally like the idea of putting one admin server in each rack. they don't have to be fancy servers, by any means. > There are a few modifications you have to make to increase the number of bootps before > it fails. do you mean you'd expect load problems even with a single sever dedicated only to dhcp? > So now to figure out my next step. 
I will need local space for logs and data/temp data files. why would you want logs local? From James.P.Lux at jpl.nasa.gov Sat Dec 9 16:48:41 2006 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Sat, 09 Dec 2006 16:48:41 -0800 Subject: [Beowulf] A quote attributed to Grace Hopper Message-ID: <6.2.3.4.2.20061209164521.02f7a618@mail.jpl.nasa.gov> On the centennial of Adm. Hopper's birth: She said in respect of the building of bigger computers: "In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers." which I find particularly relevant to Beowulfery... (I don't know if should would have advocated programming a cluster in COBOL, though...) James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From buccaneer at rocketmail.com Sat Dec 9 17:11:36 2006 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Sat, 9 Dec 2006 17:11:36 -0800 (PST) Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes Message-ID: <20061210011136.97786.qmail@web30604.mail.mud.yahoo.com> > I personally like the idea of putting one admin server in each rack. >they don't have to be fancy servers, by any means. *LOLOL* At first I was guilty of the one things I am always getting on the other guys for-thinking too literally. I was going to say there is no room in the rack. Of course, the server would not have to even be on the same side of the room. :) >> There are a few modifications you have to make to increase the number of bootps before >> it fails. > > do you mean you'd expect load problems even with a single sever > dedicated only to dhcp? dhcp is not the problem (it is only critical during kickstart and for laptops moved in on a temporary basis. 
tftp was a problem because of xinetd. We bought 1024 dual opt nodes in 16 racks. When we received the first 6 racks we triggered them all for install; it did not work as expected. >> So now to figure out my next step. I will need local space for logs and data/temp data files. > > why would you want logs local? We have huge data sets, huge scratch data, huge library data (travel time sets) and I worry about network traffic. From hahn at physics.mcmaster.ca Sat Dec 9 18:00:26 2006 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Sat, 9 Dec 2006 21:00:26 -0500 (EST) Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <20061210011136.97786.qmail@web30604.mail.mud.yahoo.com> References: <20061210011136.97786.qmail@web30604.mail.mud.yahoo.com> Message-ID: > > I personally like the idea of putting one admin server in each rack. > >they don't have to be fancy servers, by any means. > > *LOLOL* At first I was guilty of the one things I am always getting on the > other guys for-thinking too literally. I was going to say there is no room in > the rack. Of course, the server would not have to even be on the same > side of the room. :) no, I really meant to put one admin server (1u is fine) in each rack. I'd already have a Gb switch and possibly a high-speed interconnect leaf in the rack if possible. a modular approach like this cuts down on cabling and out-of-rack traffic. > in on a temporary basis. tftp was a problem because of xinetd. We bought 1024 why would you run tftp from xinetd? generally I don't see the point to inetd anymore - it was a cool hack from days when you wanted lots of daemons reachable, but didn't want them in memory. I usually remove it. > >> So now to figure out my next step. I will need local space for logs and data/temp data files.
> > > > why would you want logs local? > > We have huge data sets, huge scratch data, huge library data (travel time sets) > and I worry about network traffic. if your logs are enough to interfere with other traffic, something's wrong. but perhaps you don't look at your logs as much as I do (which is why I want them coalesced.) From buccaneer at rocketmail.com Sat Dec 9 19:44:01 2006 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Sat, 9 Dec 2006 19:44:01 -0800 (PST) Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes Message-ID: <20061210034401.64781.qmail@web30611.mail.mud.yahoo.com> > no, I really meant to put one admin server (1u is fine) in each rack. > I'd already have a Gb switch and possibly a high-speed interconnect > leaf in the rack if possible. a modular approach like this cuts > down on cabling and out-of-rack traffic. No place to put it in the rack. these are blades. But they do have 3 Cisco bricks with their own private pathway between them so I can connect each server to a local switch. > if your logs are enough to interfere with other traffic, something's wrong. > but perhaps you don't look at your logs as much as I do (which is why > I want them coalesced.) Too many nodes and not enough time, and I mostly manage by exception. If it is running well I don't want to have to keep an eye on it. We have a daemon that collects information on each node and keeps an eye on the local log file. Then we have capture the serial console into a log file. ____________________________________________________________________________________ Want to start your own business? Learn how on Yahoo! Small Business. 
http://smallbusiness.yahoo.com/r-index From john.hearns at streamline-computing.com Sun Dec 10 00:17:14 2006 From: john.hearns at streamline-computing.com (John Hearns) Date: Sun, 10 Dec 2006 08:17:14 +0000 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <20061210011136.97786.qmail@web30604.mail.mud.yahoo.com> References: <20061210011136.97786.qmail@web30604.mail.mud.yahoo.com> Message-ID: <457BC28A.5000201@streamline-computing.com> Buccaneer for Hire. wrote: >> I personally like the idea of putting one admin server in each rack. >> they don't have to be fancy servers, by any means. > > *LOLOL* At first I was guilty of the one things I am always getting on the > other guys for-thinking too literally. I was going to say there is no room in > the rack. Of course, the server would not have to even be on the same > side of the room. :) Depends on your network layout of course - but we Beowulf types like nice flat networks anyway. > > dhcp is not the problem (it is only critical during kickstart and for laptops moved > in on a temporary basis. You'll have to go the dhcp route for booting with an NFS route. You won't regret it anyway. tftp was a problem because of xinetd. We bought 1024 > dual opt nodes in 16 racks. When we received the first 6 racks we triggered them > all for install-it did not work as expected. Always stagger booting by a few seconds between nodes. Stops power surges (OK, unlikely) but more importantly gives all those little daemons time to shift their packets out to the interface. >>> So now to figure out my next step. I will need local space for logs and data/temp data files. >> why would you want logs local? > We have huge data sets, huge scratch data, huge library data (travel time sets) > and I worry about network traffic. Errrrrrrrr.... I would be thinking about a your data storage and transport module. Give thought to a parallel filesystem, Panasas would be good, or Lustre. 
Or maybe iSCSI servers for the huge library data (if it is read only, then each of these admin nodes per rack could double up as an iSCSI server. Mirror the data between admin nodes, and rejig the fstab on a per-rack basis???) Also motherboards have two gig E ports - if not using the second for MPI it could be a storage network. For huge scratch data - you have local disks. Either write a script to format the disk when you boot the node in NFS-root, the disk has a swap, a /tmp for scratch space and a local /var if you don't want to use a network syslog server. Or leave the install as-is and mount the swap, tmp and var paritions. -- John Hearns Senior HPC Engineer Streamline Computing, The Innovation Centre, Warwick Technology Park, Gallows Hill, Warwick CV34 6UW Office: 01926 623130 Mobile: 07841 231235 From buccaneer at rocketmail.com Sun Dec 10 09:03:56 2006 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Sun, 10 Dec 2006 09:03:56 -0800 (PST) Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes Message-ID: <20061210170356.84880.qmail@web30615.mail.mud.yahoo.com> >> *LOLOL* At first I was guilty of the one things I am always getting on the >> other guys for-thinking too literally. I was going to say there is no room in >> the rack. Of course, the server would not have to even be on the same >> side of the room. :) > > Depends on your network layout of course - but we Beowulf types like > nice flat networks anyway. Certainly much simpler if there is no technical reason not to. In our case we have a /20 of non-RFC1918 space which is its own seperate VLAN. With NIS/DNS/NTP/etc traffic the amount of bandwidth used for broadcasting is low. >> dhcp is not the problem (it is only critical during kickstart and for laptops moved >> in on a temporary basis. > > You'll have to go the dhcp route for booting with an NFS route. > You won't regret it anyway. I am researching it. 
One of the guys has been wanting to change all machines to boot using DHCP while I have been resistant - believing one always stacks the cards in favor of the least amount of impact when an "issue" occurs.

>> tftp was a problem because of xinetd. We bought 1024
>> dual opt nodes in 16 racks. When we received the first 6 racks we triggered them
>> all for install - it did not work as expected.
>
> Always stagger booting by a few seconds between nodes.
> Stops power surges (OK, unlikely) but more importantly gives all those
> little daemons time to shift their packets out to the interface.

Will look into that. I believe the power system should allow for that.

[snip]

> Errrrrrrrr....
> I would be thinking about your data storage and transport module.
> Give thought to a parallel filesystem, Panasas would be good, or Lustre.
> Or maybe iSCSI servers for the huge library data (if it is read only,
> then each of these admin nodes per rack could double up as an iSCSI
> server. Mirror the data between admin nodes, and rejig the fstab on a
> per-rack basis???)

I have been pushing for a long time for us to focus on a standard inside the cluster, and right now it is EMC. I have already tried others, but I have almost 300TB of data and 400TB of space (the new EMC came in) and I can just start moving things round. But I always have a plan B (and then C.)

[snip]

> For huge scratch data - you have local disks.
> Either write a script to format the disk when you boot the node in
> NFS-root, the disk has a swap, a /tmp for scratch space and a local /var
> if you don't want to use a network syslog server.
> Or leave the install as-is and mount the swap, tmp and var partitions.

That's the direction I am thinking. Over the holidays I will start working on a plan with my grid documentation.
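
The advice above to stagger booting by a few seconds per node can be sketched roughly as follows. This is only an illustration, not tooling anyone in the thread describes: the node names, the default 3-second delay, and the `power_on` callback are all hypothetical, and a real site would plug in its own IPMI or power-strip command there.

```python
import time
from typing import Callable, Iterable, List, Tuple


def stagger_schedule(nodes: Iterable[str], delay_s: float = 3.0) -> List[Tuple[str, float]]:
    """Return (node, offset_in_seconds) pairs, spacing nodes delay_s apart."""
    return [(node, index * delay_s) for index, node in enumerate(nodes)]


def staggered_power_on(
    nodes: Iterable[str],
    power_on: Callable[[str], None],  # hypothetical hook: site-specific power control
    delay_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call power_on(node) for each node, pausing delay_s between nodes."""
    started: List[str] = []
    for index, node in enumerate(nodes):
        if index > 0:
            sleep(delay_s)  # give DHCP/TFTP servers time to drain requests
        power_on(node)
        started.append(node)
    return started
```

Passing `sleep` as a parameter is just a convenience so the schedule can be dry-run without actually waiting; at a few seconds per node, a rack-at-a-time rollout is probably more practical than one long serial pass over a thousand nodes.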
From csamuel at vpac.org Sun Dec 10 15:09:56 2006 From: csamuel at vpac.org (Chris Samuel) Date: Mon, 11 Dec 2006 10:09:56 +1100 Subject: [Beowulf] More technical information and spec of beowulf In-Reply-To: <45790697.8040609@gmail.com> References: <1bef2ce30612060058x21c80f91qd8722465ce4edc6e@mail.gmail.com> <45790697.8040609@gmail.com> Message-ID: <200612111009.56902.csamuel@vpac.org> On Friday 08 December 2006 17:30, Geoff Jacobs wrote: > I knew a guy who built a parallel Mandelbrot appliance using raw IPX on DOS > in high school. I don't know if I should be amazed or appalled. :-) -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From gdjacobs at gmail.com Sun Dec 10 22:03:22 2006 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Mon, 11 Dec 2006 00:03:22 -0600 Subject: [Beowulf] More technical information and spec of beowulf In-Reply-To: <200612111009.56902.csamuel@vpac.org> References: <1bef2ce30612060058x21c80f91qd8722465ce4edc6e@mail.gmail.com> <45790697.8040609@gmail.com> <200612111009.56902.csamuel@vpac.org> Message-ID: <457CF4AA.7070104@gmail.com> Chris Samuel wrote: > On Friday 08 December 2006 17:30, Geoff Jacobs wrote: > >> I knew a guy who built a parallel Mandelbrot appliance using raw IPX on DOS >> in high school. > > I don't know if I should be amazed or appalled. :-) Big Netware fan. Go figure. -- Geoffrey D. Jacobs From bex061 at gmail.com Fri Dec 8 18:36:48 2006 From: bex061 at gmail.com (samit) Date: Sat, 9 Dec 2006 08:21:48 +0545 Subject: [Beowulf] distributed file storage solution? 
Message-ID: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com> hello list, i am looking for a application (most preferably, but not necessarily OSS or free software) and that can create distributed, reliable, fault tolerant, decentralized, high preformance file server. I've looked at few P2P file storing solutions that store multiple copies of file in different server for data reliability but i havent found any solution that would help me access a file as flexibly like the local file with high preformance. Cross platform solution would be even better as i want to aovoid the SMB overhead if its just a linux based solution! PLEASE RECOMENT ME SOMETHING!? feel free to suggest any solutions that even qualifies half of this criteria! -bipin -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwill at penguincomputing.com Mon Dec 11 16:14:43 2006 From: mwill at penguincomputing.com (Michael Will) Date: Mon, 11 Dec 2006 16:14:43 -0800 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes Message-ID: <433093DF7AD7444DA65EFAFE3987879C3035D6@jellyfish.highlyscyld.com> Scyld CW4 is based on RHEL4 and also supported on Centos 4. That does not give you different operating systems though, just flexible deployment of RHEL4 based HPC compute nodes. Note that we had to reimplement the PXE boot part to allow reasonable scaling. Michael -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Eric Shook Sent: Saturday, December 09, 2006 11:28 AM To: Buccaneer for Hire. Cc: beowulf at beowulf.org Subject: Re: [Beowulf] SATA II - PXE+NFS - diskless compute nodes Not to diverge this conversation, but has anyone had any experience using this pxe boot / nfs model with a rhel variant? 
I have been wanting to do a nfs root or ramdisk model for some-time but our software stack requires a rhel base so Scyld and Perceus most likely will not work (although I am still looking into both of them to make sure) Thanks for any help, Eric Shook Buccaneer for Hire. wrote: > [snip] > > >> I agree with what Joe says about a few hundred nodes being the time >> you would start to look closer at this approach. >> > > I have started to explore the possibility of using this technology because I would really like to see us with the ability to change OSs and OS Personalities as needed. The question I have is with 2000+ compute nodes what kind of infrastructure do I need to support this? > > > > > > > > > > ______________________________________________________________________ > ______________ > Do you Yahoo!? > Everyone is raving about the all-new Yahoo! Mail beta. > http://new.mail.yahoo.com > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org To change your subscription > (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From kyron at neuralbs.com Mon Dec 11 16:32:43 2006 From: kyron at neuralbs.com (Eric Thibodeau) Date: Mon, 11 Dec 2006 19:32:43 -0500 Subject: [Beowulf] distributed file storage solution? In-Reply-To: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com> References: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com> Message-ID: <200612111932.43384.kyron@neuralbs.com> You can look into OpenAFS but be warned that you have to know infrastructure software quite well (LDAP+kerberos). It's cross-platform, can be distributed but don't think it's up to multiple writes on different mirrors though. 
On Friday 8 December 2006 21:36, samit wrote:
> hello list,
>
> i am looking for an application (most preferably, but not necessarily OSS or
> free software) and that can create distributed, reliable, fault tolerant,
> decentralized, high performance file server. I've looked at few P2P file
> storing solutions that store multiple copies of file in different server for
> data reliability but i haven't found any solution that would help me access a
> file as flexibly like the local file with high performance. Cross platform
> solution would be even better as i want to avoid the SMB overhead if its
> just a linux based solution!
>
> PLEASE RECOMMEND ME SOMETHING!? feel free to suggest any solutions that even
> qualifies half of this criteria!
>
> -bipin
>

--
Eric Thibodeau
Neural Bucket Solutions Inc.
T. (514) 736-1436
C. (514) 710-0517

From hahn at physics.mcmaster.ca Mon Dec 11 16:50:15 2006
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Mon, 11 Dec 2006 19:50:15 -0500 (EST)
Subject: [Beowulf] distributed file storage solution?
In-Reply-To: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com>
References: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com>
Message-ID:

> i am looking for an application (most preferably, but not necessarily OSS or
> free software) and that can create distributed, reliable, fault tolerant,
> decentralized, high performance file server.

how reliable/fault-tolerant? how distributed/decentralized (no servers)? how high-performing?

> I've looked at few P2P file
> storing solutions that store multiple copies of file in different server for
> data reliability but i haven't found any solution that would help me access a
> file as flexibly like the local file with high performance.

do you mean that you want to mount it as a normal filesystem?

> solution would be even better as i want to avoid the SMB overhead if its
> just a linux based solution!

what is "SMB overhead"? do you mean the overhead of running Samba?
From hahn at physics.mcmaster.ca Mon Dec 11 16:52:36 2006 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Mon, 11 Dec 2006 19:52:36 -0500 (EST) Subject: [Beowulf] distributed file storage solution? In-Reply-To: <200612111932.43384.kyron@neuralbs.com> References: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com> <200612111932.43384.kyron@neuralbs.com> Message-ID: > You can look into OpenAFS but be warned that you have to know >infrastructure software quite well (LDAP+kerberos). It's cross-platform, can >be distributed but don't think it's up to multiple writes on different >mirrors though. and Lustre is parallel, high-performing and somewhat reliable/FT, but is not decent/distributed (it's client-server). From mwill at penguincomputing.com Mon Dec 11 17:03:06 2006 From: mwill at penguincomputing.com (Michael Will) Date: Mon, 11 Dec 2006 17:03:06 -0800 Subject: [Beowulf] distributed file storage solution? Message-ID: <433093DF7AD7444DA65EFAFE3987879C24553C@jellyfish.highlyscyld.com> Look at dcache -----Original Message----- From: Mark Hahn [mailto:hahn at physics.mcmaster.ca] Sent: Mon Dec 11 17:01:15 2006 To: beowulf at beowulf.org Subject: Re: [Beowulf] distributed file storage solution? > You can look into OpenAFS but be warned that you have to know >infrastructure software quite well (LDAP+kerberos). It's cross-platform, can >be distributed but don't think it's up to multiple writes on different >mirrors though. and Lustre is parallel, high-performing and somewhat reliable/FT, but is not decent/distributed (it's client-server). _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From bill at cse.ucdavis.edu Mon Dec 11 17:53:58 2006
From: bill at cse.ucdavis.edu (Bill Broadley)
Date: Mon, 11 Dec 2006 17:53:58 -0800
Subject: [Beowulf] distributed file storage solution?
In-Reply-To: <200612111932.43384.kyron@neuralbs.com>
References: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com> <200612111932.43384.kyron@neuralbs.com>
Message-ID: <457E0BB6.3040905@cse.ucdavis.edu>

Eric Thibodeau wrote:
> You can look into OpenAFS but be warned that you have to know infrastructure software quite well (LDAP+kerberos). It's cross-platform, can be distributed but don't think it's up to multiple writes on different mirrors though.
>

Indeed. There are many tough compromises in distributed filesystems. Alas there are many conflicting goals. Coherency vs performance is a big one, you pretty much get one or the other. Locking is another ugly one: databases and some applications assume byte-range locking, which is sometimes available, sometimes not. Many Unix programs assume POSIX locking, again sometimes available.

So, unfortunately it's easy to ask for a distributed filesystem which does not exist. I'll provide my current brain dump on the various pieces I've been tracking. I'm sure there are some inaccuracies included, but hopefully they are small ones. As always comments and corrections welcome.

A high level overview of OpenAFS:
* OpenAFS is distributed, but not p2p.
* performs well (assuming cache friendliness, and a single peer accessing the same files/directories)
* scales well (for reads, because RO volumes can be replicated)
* has a universal namespace
* places little trust in a peer (getting root on a client != ability to read all files)
* allows for transparent volume migration (the client doesn't complain when a volume is migrated)
* perfect coherency (via a subscription model)
* It also supports Linux, OSX, and Windows (among others).
* relatively complex.

NFS in contrast:
* Isn't distributed (unless you count automount)
* has loose coherency (poll based)
* No replication (corrections?)
* Doesn't scale easily
* Volume migration isn't easy (nfs4 claims to enable this, I've yet to see it demonstrated in the real world).
* Is mostly unix specific (Microsoft had an NFS client but MS EoL'd it?)
* relatively simple

Lustre:
* client server
* scales extremely well, seems popular on the largest of clusters.
* Can survive hardware failures assuming more than 1 block server is connected to each set of disks
* unix only.
* relatively complex.

PVFS2:
* Client server
* scales well
* can not survive a block server death.
* unix only
* relatively simple.
* designed for use within a cluster.

Oceanstore:
* p2p
* claims scalability to billions of users
* Highly available/byzantine fault tolerant
* complex
* slow
* in prototype stage
* Requires use of an API (AFAIK it is not available as a transparently mounted filesystem)

So the end result (from my skewed perspective) is:
* NFS is hugely popular, easy, not very secure (at least by default), poor coherency, but for things like sharing /home within a cluster it works reasonably well. Seems most appropriate for LAN usage. Diskless to most implies NFS (and works well within a cluster or LAN).
* Lustre and PVFS2 are popular in clusters for sharing files in larger clusters where more than a single file server's worth of bandwidth is required. Both I believe scale well with bandwidth but only allow for a single metadata server, so will ultimately scale only as far as a single machine for metadata intensive workloads (such as lock intensive, directory intensive, or file creation/deletion intensive workloads). Granted this also allows for exotic hardware solutions (like solid state storage) if you really need the performance.
* AFS is popular for internet wide file service, researchers love the ability to run an application that requires 100 different libraries anywhere in the world. Sysadmins love it because they can migrate volumes without having to notify users or schedule downtime. I believe performance is usually somewhat less than NFS within a cluster (because of higher overhead), and usually significantly better outside a cluster (better caching and coherency).

I'm less familiar with the various commercial filesystems like ibrix.

Hopefully others will expand and correct the above.

From ctierney at hypermall.net Mon Dec 11 19:26:58 2006
From: ctierney at hypermall.net (Craig Tierney)
Date: Mon, 11 Dec 2006 20:26:58 -0700
Subject: [Beowulf] distributed file storage solution?
In-Reply-To: <457E2182.2050504@hypermall.net> References: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com> <200612111932.43384.kyron@neuralbs.com> <457E2182.2050504@hypermall.net> Message-ID: On 12/11/06, Craig Tierney wrote: > Lustre supports redundant meta-data servers (MDS) and failover for the > object-storage servers (OSS). However, "high-performing" is relative. > Great at streaming data, not at meta-data. Which is really the bane of all cluster file systems, isn't it? Meta data accesses kill performance. -- Brian D. Ropers-Huilman From ctierney at hypermall.net Mon Dec 11 20:20:27 2006 From: ctierney at hypermall.net (Craig Tierney) Date: Mon, 11 Dec 2006 21:20:27 -0700 Subject: [Beowulf] distributed file storage solution? In-Reply-To: References: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com> <200612111932.43384.kyron@neuralbs.com> <457E2182.2050504@hypermall.net> Message-ID: <457E2E0B.2070000@hypermall.net> Brian D. Ropers-Huilman wrote: > On 12/11/06, Craig Tierney wrote: >> Lustre supports redundant meta-data servers (MDS) and failover for the >> object-storage servers (OSS). However, "high-performing" is relative. >> Great at streaming data, not at meta-data. > > Which is really the bane of all cluster file systems, isn't it? Meta > data accesses kill performance. > Some are better than others. Lustre could have been designed from day one so that every OSS was also a metadata server (MDS), but they didn't. It is on their roadmap to distribute meta-data, but it doesn't look like there will be a 1 to 1 relationship between MDS and OSS. If I wanted a more general purpose distributed filesystem, those with distributed meta-data can provide better performance. If I wanted to provide a filesystem to my users where compiles wouldn't be horribly painful, tests with Ibrix showed it was adequate. I would be interested in testing some of the others (Panasas, Isilon) to see how they compare. 
Craig From hunting at ix.netcom.com Mon Dec 11 22:32:09 2006 From: hunting at ix.netcom.com (Michael Huntingdon) Date: Mon, 11 Dec 2006 22:32:09 -0800 Subject: [Beowulf] distributed file storage solution? In-Reply-To: References: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com> <200612111932.43384.kyron@neuralbs.com> <457E2182.2050504@hypermall.net> Message-ID: <7.0.1.0.2.20061211222458.024feb88@ix.netcom.com> I just sat in on a presentation where both streaming and meta-data were discussed. Meta-data is the silver bullet in a Luster environment. The conversation turned to setting up NFS for home/user, with the nearly all other data on SFS with enough OSS and MDS servers to provide both the throughput and failover needed. regards michael At 07:57 PM 12/11/2006, Brian D. Ropers-Huilman wrote: >On 12/11/06, Craig Tierney wrote: >>Lustre supports redundant meta-data servers (MDS) and failover for the >>object-storage servers (OSS). However, "high-performing" is relative. >>Great at streaming data, not at meta-data. > >Which is really the bane of all cluster file systems, isn't it? Meta >data accesses kill performance. > >-- >Brian D. Ropers-Huilman >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf From ajt at rri.sari.ac.uk Tue Dec 12 01:53:21 2006 From: ajt at rri.sari.ac.uk (Tony Travis) Date: Tue, 12 Dec 2006 09:53:21 +0000 Subject: [Beowulf] distributed file storage solution? In-Reply-To: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com> References: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com> Message-ID: <457E7C11.1020907@rri.sari.ac.uk> samit wrote: > > hello list, > > i am looking for a application (most preferably, but not necessarily OSS > or free software) and that can create distributed, reliable, fault > tolerant, decentralized, high preformance file server. 
I've looked at > few P2P file storing solutions that store multiple copies of file in > different server for data reliability but i havent found any solution > that would help me access a file as flexibly like the local file with > high preformance. Cross platform solution would be even better as i want > to aovoid the SMB overhead if its just a linux based solution! Hello, Bipin. I'm interested in p2p filesystems. Which ones did you look at? Tony. -- Dr. A.J.Travis, | mailto:ajt at rri.sari.ac.uk Rowett Research Institute, | http://www.rri.sari.ac.uk/~ajt Greenburn Road, Bucksburn, | phone:+44 (0)1224 712751 Aberdeen AB21 9SB, Scotland, UK. | fax:+44 (0)1224 716687 From bill at platform.com Tue Dec 12 06:27:15 2006 From: bill at platform.com (Bill Bryce) Date: Tue, 12 Dec 2006 09:27:15 -0500 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes Message-ID: Hi Eric, You may want to send the Perceus guys an email and ask them how hard it is to replace cAos Linux with RHEL or CentOS. I don't believe it should be that hard for them to do....we modified Warewulf to install on top of a stock Rocks cluster effectively turning a Rocks cluster into a Warewulf cluster - and the cluster was running RHEL....so it is possible. Regards, Bill. -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Michael Will Sent: Monday, December 11, 2006 7:15 PM To: Eric Shook; Buccaneer for Hire. Cc: beowulf at beowulf.org Subject: RE: [Beowulf] SATA II - PXE+NFS - diskless compute nodes Scyld CW4 is based on RHEL4 and also supported on Centos 4. That does not give you different operating systems though, just flexible deployment of RHEL4 based HPC compute nodes. Note that we had to reimplement the PXE boot part to allow reasonable scaling. 
Michael

-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Eric Shook
Sent: Saturday, December 09, 2006 11:28 AM
To: Buccaneer for Hire.
Cc: beowulf at beowulf.org
Subject: Re: [Beowulf] SATA II - PXE+NFS - diskless compute nodes

Not to diverge this conversation, but has anyone had any experience using this pxe boot / nfs model with a rhel variant? I have been wanting to do an nfs root or ramdisk model for some time but our software stack requires a rhel base so Scyld and Perceus most likely will not work (although I am still looking into both of them to make sure)

Thanks for any help,
Eric Shook

Buccaneer for Hire. wrote:
> [snip]
>
>
>> I agree with what Joe says about a few hundred nodes being the time
>> you would start to look closer at this approach.
>>
>
> I have started to explore the possibility of using this technology because I would really like to see us with the ability to change OSs and OS Personalities as needed. The question I have is with 2000+ compute nodes what kind of infrastructure do I need to support this?
>

From ctierney at hypermall.net Tue Dec 12 06:48:30 2006
From: ctierney at hypermall.net (Craig Tierney)
Date: Tue, 12 Dec 2006 07:48:30 -0700
Subject: [Beowulf] distributed file storage solution?
In-Reply-To: <7.0.1.0.2.20061211222458.024feb88@ix.netcom.com>
References: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com> <200612111932.43384.kyron@neuralbs.com> <457E2182.2050504@hypermall.net> <7.0.1.0.2.20061211222458.024feb88@ix.netcom.com>
Message-ID: <457EC13E.8020008@hypermall.net>

Michael Huntingdon wrote:
> I just sat in on a presentation where both streaming and meta-data were
> discussed. Meta-data is the silver bullet in a Lustre environment. The
> conversation turned to setting up NFS for home/user, with nearly all
> other data on SFS with enough OSS and MDS servers to provide both the
> throughput and failover needed.
>

I think this is how I would design most systems, whether Lustre is the high-performance filesystem or not.

Craig

> regards
> michael
>
>
> At 07:57 PM 12/11/2006, Brian D. Ropers-Huilman wrote:
>> On 12/11/06, Craig Tierney wrote:
>>> Lustre supports redundant meta-data servers (MDS) and failover for the
>>> object-storage servers (OSS). However, "high-performing" is relative.
>>> Great at streaming data, not at meta-data.
>> >> Which is really the bane of all cluster file systems, isn't it? Meta >> data accesses kill performance. >> >> -- >> Brian D. Ropers-Huilman >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > > > From ctierney at hypermall.net Tue Dec 12 06:51:22 2006 From: ctierney at hypermall.net (Craig Tierney) Date: Tue, 12 Dec 2006 07:51:22 -0700 Subject: [Beowulf] distributed file storage solution? In-Reply-To: <1084.86.136.175.49.1165913728.squirrel@webmail.hostme.co.uk> References: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com> <200612111932.43384.kyron@neuralbs.com> <457E2182.2050504@hypermall.net> <1084.86.136.175.49.1165913728.squirrel@webmail.hostme.co.uk> Message-ID: <457EC1EA.9090809@hypermall.net> Robin Harker wrote: > TerraGrid is high performance, low CPU overhead and with the XFS version, > (TG2) excellent metadata performance. e.g. gets use with real time > databases, and can be designed with no single point of failure. > Do you have it running? If so, how many nodes (targets) do you have connected? If you have it running in redundant mode, have you measured the performance difference between RAID0 and RAID5? It's now called Rapidscale. I heard that when they were bought out by Rackable, they had a name conflict in the US and had to change it. Craig > Robin > > >> Mark Hahn wrote: >>>> You can look into OpenAFS but be warned that you have to know >>>> infrastructure software quite well (LDAP+kerberos). It's >>>> cross-platform, can >>>> be distributed but don't think it's up to multiple writes on different >>>> mirrors though. >>> and Lustre is parallel, high-performing and somewhat reliable/FT, but is >>> not decent/distributed (it's client-server). >> Lustre supports redundant meta-data servers (MDS) and failover for the >> object-storage servers (OSS). 
However, "high-performing" is relative. >> Great at streaming data, not at meta-data. >> >> Craig >> >> >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org >>> To change your subscription (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >>> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > > > Robin Harker > Workstations UK Ltd > DDI: 01494 787710 > Tel: 01494 724498 > > From buccaneer at rocketmail.com Tue Dec 12 08:30:15 2006 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Tue, 12 Dec 2006 08:30:15 -0800 (PST) Subject: [Beowulf] distributed file storage solution? Message-ID: <20061212163015.99150.qmail@web30615.mail.mud.yahoo.com> > Do you have it running? If so, how many nodes (targets) do you > have connected? If you have it running in redundant mode, have you > measured the performance difference between RAID0 and RAID5? Just as an FYI, I would recommend that whatever you decide on, you develop a "show me" attitude and test it at your location using your job mix BEFORE you spend a dime on it. They you can decide how much time and money it will take to make it work. We have tried many solutions that have not worked. Some of it, AFTER we spent good money and time on it. ____________________________________________________________________________________ Do you Yahoo!? Everyone is raving about the all-new Yahoo! Mail beta. http://new.mail.yahoo.com From ctierney at hypermall.net Tue Dec 12 08:37:27 2006 From: ctierney at hypermall.net (Craig Tierney) Date: Tue, 12 Dec 2006 09:37:27 -0700 Subject: [Beowulf] distributed file storage solution? 
In-Reply-To: <20061212163015.99150.qmail@web30615.mail.mud.yahoo.com>
References: <20061212163015.99150.qmail@web30615.mail.mud.yahoo.com>
Message-ID: <457EDAC7.5060203@hypermall.net>

Buccaneer for Hire. wrote:
>> Do you have it running? If so, how many nodes (targets) do you
>> have connected? If you have it running in redundant mode, have you
>> measured the performance difference between RAID0 and RAID5?
>
> Just as an FYI, I would recommend that whatever you decide on, you
> develop a "show me" attitude and test it at your location using your
> job mix BEFORE you spend a dime on it. Then you can decide how much
> time and money it will take to make it work.
>
> We have tried many solutions that have not worked. Some of it, AFTER
> we spent good money and time on it.
>
>

Filesystems are getting better, but they are still (all of them) quite temperamental and may not necessarily work well with your applications. At my old site, we brought in any vendor willing to let us test their stuff. We got burned on another solution earlier in the contract. It passed acceptance, but it still didn't work. I have become much more diligent at "breaking" filesystems. It has helped me understand what filesystems can and cannot do.

Craig

From robl at mcs.anl.gov Tue Dec 12 09:01:52 2006
From: robl at mcs.anl.gov (Robert Latham)
Date: Tue, 12 Dec 2006 11:01:52 -0600
Subject: [Beowulf] distributed file storage solution?
In-Reply-To: <457E0BB6.3040905@cse.ucdavis.edu>
References: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com> <200612111932.43384.kyron@neuralbs.com> <457E0BB6.3040905@cse.ucdavis.edu>
Message-ID: <20061212170151.GJ24143@mcs.anl.gov>

On Mon, Dec 11, 2006 at 05:53:58PM -0800, Bill Broadley wrote:
> Lustre:
> * client server
> * scales extremely well, seems popular on the largest of clusters.
> * Can survive hardware failures assuming more than 1 block server is
>   connected to each set of disks
> * unix only.
> * relatively complex.
>
> PVFS2:
> * Client server
> * scales well
> * can not survive a block server death.
> * unix only
> * relatively simple.
> * designed for use within a cluster.

Hi Bill

As a member of the PVFS project I just wanted to comment on your description of our file system. I would say that PVFS is every bit as fault tolerant as Lustre. The redundancy models of the two file systems are pretty similar: both rely on shared storage and high availability software to continue operating in the face of disk failure.

What Lustre has done a much better job of than we have is documenting the HA process. This is one of our (PVFS) areas of focus in the near term. We may not have documented the process in enough detail, but one can definitely set up PVFS servers with links to shared storage and make use of things like IP takeover to deliver resiliency in the face of disk failure. PVFS has had this ability for several years now (PVFS users can check out 'pvfs2-ha.pdf' in our source for a starting point).

> So the end result (from my skewed perspective) is:
> * Lustre and PVFS2 are popular in clusters for sharing files in larger
>   clusters where more than single file server worth of bandwidth is
>   required. Both I believe scale well with bandwidth but only allow
>   for a single metadata server so will ultimately scale only as far
>   as single machine for metadata intensive workloads (such as lock
>   intensive, directory intensive, or file creation/deletion
>   intensive workloads). Granted this also allows for exotic
>   hardware solutions (like solid state storage) if you really need
>   the performance.

PVFS v2 has offered multiple metadata servers for some time now. Our metadata operations scale well with the number of metadata servers. You are absolutely correct that PVFS metadata performance is dependent on hardware, but you need not get so exotic as solid state to see high metadata rates. The OSC PVFS deployment has servers with RAID and fast disks, and can deliver quite high metadata rates.

Another point I'd like to make about PVFS is how well-suited it is for MPI-IO applications. The ROMIO MPI-IO implementation (the basis for many MPI-IO implementations) contains a highly efficient PVFS driver. This driver speaks directly to PVFS servers, bypassing the kernel. It also contains optimizations for collective metadata operations and noncontiguous I/O. Applications making use of MPI-IO, or higher-level libraries built on top of MPI-IO such as parallel-netcdf or (when configured correctly) HDF5, are likely to see quite good performance when running on PVFS.

> Hopefully others will expand and correct the above.

Happy to do so!
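For anyone who wants a rough feel for the "file creation/deletion intensive" case mentioned above, here is a minimal sketch of a single-client metadata micro-benchmark. It is only an illustration: the function name `metadata_rates` and the file-name pattern are made up for this example, it uses plain POSIX calls through Python rather than MPI-IO or any PVFS or Lustre interface, and one client on one directory says nothing about how a file system behaves under many concurrent clients.

```python
import os
import tempfile
import time


def metadata_rates(directory, n_files=1000):
    """Return (creates/s, stats/s, unlinks/s) for n_files empty files.

    Hypothetical helper for illustration only; not part of any file
    system's tooling.
    """
    paths = [os.path.join(directory, f"meta_{i:06d}.tmp") for i in range(n_files)]

    def timed(op):
        # Apply op to every path and convert elapsed time to ops/second.
        start = time.perf_counter()
        for p in paths:
            op(p)
        elapsed = time.perf_counter() - start
        return n_files / max(elapsed, 1e-9)

    def create(p):
        # Open and close an empty file: no data is written, so the cost
        # is dominated by metadata work.
        with open(p, "w"):
            pass

    # Order matters: files must exist before stat and unlink.
    return timed(create), timed(os.stat), timed(os.unlink)


if __name__ == "__main__":
    # Point this at a directory on the file system under test instead of
    # a local temporary directory to get a meaningful number.
    with tempfile.TemporaryDirectory() as d:
        c, s, u = metadata_rates(d, 1000)
        print(f"create {c:.0f}/s  stat {s:.0f}/s  unlink {u:.0f}/s")
```

Running the same script against a local disk and against the parallel file system gives a crude first comparison; a real evaluation would run many clients at once.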
==rob -- Rob Latham Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF Argonne National Lab, IL USA B29D F333 664A 4280 315B From eric-shook at uiowa.edu Tue Dec 12 09:27:06 2006 From: eric-shook at uiowa.edu (Eric Shook) Date: Tue, 12 Dec 2006 11:27:06 -0600 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <433093DF7AD7444DA65EFAFE3987879C3035D6@jellyfish.highlyscyld.com> References: <433093DF7AD7444DA65EFAFE3987879C3035D6@jellyfish.highlyscyld.com> Message-ID: <457EE66A.8050709@uiowa.edu> Michael, This should be sufficient enough to run our software stack as they are tested on rhel 4 variants. I will most definitely look closer at scyld CW4 to see if it fits our needs. What does reimplementing pxe boot mean specifically? Thanks, Eric Michael Will wrote: > Scyld CW4 is based on RHEL4 and also supported on Centos 4. That does > not give you different > operating systems though, just flexible deployment of RHEL4 based HPC > compute nodes. > > Note that we had to reimplement the PXE boot part to allow reasonable > scaling. > > Michael > > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] > On Behalf Of Eric Shook > Sent: Saturday, December 09, 2006 11:28 AM > To: Buccaneer for Hire. > Cc: beowulf at beowulf.org > Subject: Re: [Beowulf] SATA II - PXE+NFS - diskless compute nodes > > Not to diverge this conversation, but has anyone had any experience > using this pxe boot / nfs model with a rhel variant? I have been > wanting to do a nfs root or ramdisk model for some-time but our software > stack requires a rhel base so Scyld and Perceus most likely will not > work (although I am still looking into both of them to make sure) > > Thanks for any help, > Eric Shook > > Buccaneer for Hire. wrote: >> [snip] >> >> >>> I agree with what Joe says about a few hundred nodes being the time >>> you would start to look closer at this approach. 
>>> >> I have started to explore the possibility of using this technology > because I would really like to see us with the ability to change OSs and > OS Personalities as needed. The question I have is with 2000+ compute > nodes what kind of infrastructure do I need to support this? >> >> >> >> >> >> >> >> >> ______________________________________________________________________ >> ______________ >> Do you Yahoo!? >> Everyone is raving about the all-new Yahoo! Mail beta. >> http://new.mail.yahoo.com >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org To change your subscription >> (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org To change your subscription > (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Eric Shook (319) 335-6714 Technical Lead, Systems and Operations - GROW http://grow.uiowa.edu From eric-shook at uiowa.edu Tue Dec 12 09:28:27 2006 From: eric-shook at uiowa.edu (Eric Shook) Date: Tue, 12 Dec 2006 11:28:27 -0600 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: References: Message-ID: <457EE6BB.6040500@uiowa.edu> Hi Bill, I will try to email them and let everyone know what they have to say. How did the Rocks -> warewulf conversion go? Thanks, Eric Bill Bryce wrote: > Hi Eric, > > You may want to send the Perceus guys an email and ask them how hard it > is to replace cAos Linux with RHEL or CentOS. I don't believe it should > be that hard for them to do....we modified Warewulf to install on top of > a stock Rocks cluster effectively turning a Rocks cluster into a > Warewulf cluster - and the cluster was running RHEL....so it is > possible. > > Regards, > > Bill. 

--
Eric Shook (319) 335-6714
Technical Lead, Systems and Operations - GROW
http://grow.uiowa.edu

From mwill at penguincomputing.com Tue Dec 12 09:33:00 2006
From: mwill at penguincomputing.com (Michael Will)
Date: Tue, 12 Dec 2006 09:33:00 -0800
Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes
Message-ID: <433093DF7AD7444DA65EFAFE3987879C3035E7@jellyfish.highlyscyld.com>

The standard pxe server code had issues with larger numbers of clients wanting to boot at the same time. Some of them would time out and not make it. Donald Becker reimplemented the server side of it to solve the issues. I don't know anything more specific than that, but if you are interested I am sure he would not mind if you were to contact him directly.

Michael

-----Original Message-----
From: Eric Shook [mailto:eric-shook at uiowa.edu]
Sent: Tuesday, December 12, 2006 9:27 AM
To: Michael Will
Cc: Buccaneer for Hire.; beowulf at beowulf.org
Subject: Re: [Beowulf] SATA II - PXE+NFS - diskless compute nodes

Michael,

This should be sufficient enough to run our software stack as they are tested on rhel 4 variants. I will most definitely look closer at scyld CW4 to see if it fits our needs.

What does reimplementing pxe boot mean specifically?
Thanks, Eric Michael Will wrote: > Scyld CW4 is based on RHEL4 and also supported on Centos 4. That does > not give you different operating systems though, just flexible > deployment of RHEL4 based HPC compute nodes. > > Note that we had to reimplement the PXE boot part to allow reasonable > scaling. > > Michael > > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] > On Behalf Of Eric Shook > Sent: Saturday, December 09, 2006 11:28 AM > To: Buccaneer for Hire. > Cc: beowulf at beowulf.org > Subject: Re: [Beowulf] SATA II - PXE+NFS - diskless compute nodes > > Not to diverge this conversation, but has anyone had any experience > using this pxe boot / nfs model with a rhel variant? I have been > wanting to do a nfs root or ramdisk model for some-time but our > software stack requires a rhel base so Scyld and Perceus most likely > will not work (although I am still looking into both of them to make > sure) > > Thanks for any help, > Eric Shook > > Buccaneer for Hire. wrote: >> [snip] >> >> >>> I agree with what Joe says about a few hundred nodes being the time >>> you would start to look closer at this approach. >>> >> I have started to explore the possibility of using this technology > because I would really like to see us with the ability to change OSs > and OS Personalities as needed. The question I have is with 2000+ > compute nodes what kind of infrastructure do I need to support this? >> >> >> >> >> >> >> >> >> _____________________________________________________________________ >> _ >> ______________ >> Do you Yahoo!? >> Everyone is raving about the all-new Yahoo! Mail beta. 
>> http://new.mail.yahoo.com

--
Eric Shook (319) 335-6714
Technical Lead, Systems and Operations - GROW
http://grow.uiowa.edu

From robl at mcs.anl.gov Tue Dec 12 09:43:26 2006
From: robl at mcs.anl.gov (Robert Latham)
Date: Tue, 12 Dec 2006 11:43:26 -0600
Subject: [Beowulf] distributed file storage solution?
In-Reply-To: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com>
References: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com>
Message-ID: <20061212174326.GK24143@mcs.anl.gov>

On Sat, Dec 09, 2006 at 08:21:48AM +0545, samit wrote:
> hello list,
>
> i am looking for an application (most preferably, but not necessarily OSS or
> free software) that can create a distributed, reliable, fault tolerant,
> decentralized, high performance file server. I

Hi Samit

I think you might be going about the choice of file system from the opposite direction. What you want to do is not pick a file system, but rather start with an application or suite of applications. With the workload in mind, you can then pick the file system that best matches your needs. Questions you should ask:

- What is a typical I/O workload? Large, contiguous regions or small, interleaved data?
- How many files will this workload create? Many files? One file?
- Typical file size? Many gigs, or several kbytes?
- Is the workload parallel or serial? Would two or more processes ever try to simultaneously read or write to the same file? How about the same directory?
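The "how many files" and "typical file size" questions can often be answered from data an existing application has already written. The following is a minimal sketch, assuming the output sits in an ordinary directory tree; the helper names (`size_bucket`, `summarize_tree`) and the bucket boundaries are arbitrary choices for illustration, not anything taken from this thread.

```python
import os
from collections import Counter

# Bucket boundaries are illustrative assumptions; adjust to taste.
BUCKETS = [
    (4 * 1024, "<4 KiB"),
    (1024 ** 2, "4 KiB-1 MiB"),
    (1024 ** 3, "1 MiB-1 GiB"),
]


def size_bucket(size):
    """Map a file size in bytes to a coarse bucket label."""
    for limit, label in BUCKETS:
        if size < limit:
            return label
    return ">=1 GiB"


def summarize_tree(root):
    """Return (Counter of files per size bucket, total bytes) under root."""
    counts = Counter()
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
            counts[size_bucket(size)] += 1
            total_bytes += size
    return counts, total_bytes


if __name__ == "__main__":
    import sys

    counts, total = summarize_tree(sys.argv[1] if len(sys.argv) > 1 else ".")
    for label, n in counts.most_common():
        print(f"{label:>12}: {n} files")
    print(f"total: {total} bytes")
```

A tree dominated by the smallest bucket points toward a metadata-heavy workload, while a handful of very large files points toward streaming bandwidth. Access patterns (contiguous or interleaved, shared file or not) still have to come from the application itself.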
the beowulf mailing list has a catch-all answer of "it depends on your application", and that can be frustrating to hear, but it's never more true than when trying to match file systems to workloads. ==rob -- Rob Latham Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF Argonne National Lab, IL USA B29D F333 664A 4280 315B From eric-shook at uiowa.edu Tue Dec 12 09:48:20 2006 From: eric-shook at uiowa.edu (Eric Shook) Date: Tue, 12 Dec 2006 11:48:20 -0600 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <433093DF7AD7444DA65EFAFE3987879C3035E7@jellyfish.highlyscyld.com> References: <433093DF7AD7444DA65EFAFE3987879C3035E7@jellyfish.highlyscyld.com> Message-ID: <457EEB64.2050904@uiowa.edu> Excellent. Thanks for the info. Eric Michael Will wrote: > The standard pxe server code had issues with larger amounts of clients > wanting to boot at the same time. Some of them would time out and not > make it. Donald Becker reimplemented the server side of it to solve the > issues. I don't know anything more specific than that, but if you are > interested I am sure he would not mind if you where to contact him > directly. > > Michael > -----Original Message----- > From: Eric Shook [mailto:eric-shook at uiowa.edu] > Sent: Tuesday, December 12, 2006 9:27 AM > To: Michael Will > Cc: Buccaneer for Hire.; beowulf at beowulf.org > Subject: Re: [Beowulf] SATA II - PXE+NFS - diskless compute nodes > > Michael, > > This should be sufficient enough to run our software stack as they are > tested on rhel 4 variants. I will most definitely look closer at scyld > CW4 to see if it fits our needs. > > What does reimplementing pxe boot mean specifically? > > Thanks, > Eric > > Michael Will wrote: >> Scyld CW4 is based on RHEL4 and also supported on Centos 4. That does >> not give you different operating systems though, just flexible >> deployment of RHEL4 based HPC compute nodes. 
>> >> Note that we had to reimplement the PXE boot part to allow reasonable >> scaling. >> >> Michael >> >> -----Original Message----- >> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] >> On Behalf Of Eric Shook >> Sent: Saturday, December 09, 2006 11:28 AM >> To: Buccaneer for Hire. >> Cc: beowulf at beowulf.org >> Subject: Re: [Beowulf] SATA II - PXE+NFS - diskless compute nodes >> >> Not to diverge this conversation, but has anyone had any experience >> using this pxe boot / nfs model with a rhel variant? I have been >> wanting to do a nfs root or ramdisk model for some-time but our >> software stack requires a rhel base so Scyld and Perceus most likely >> will not work (although I am still looking into both of them to make >> sure) >> >> Thanks for any help, >> Eric Shook >> >> Buccaneer for Hire. wrote: >>> [snip] >>> >>> >>>> I agree with what Joe says about a few hundred nodes being the time >>>> you would start to look closer at this approach. >>>> >>> I have started to explore the possibility of using this technology >> because I would really like to see us with the ability to change OSs >> and OS Personalities as needed. The question I have is with 2000+ >> compute nodes what kind of infrastructure do I need to support this? >>> >>> >>> >>> >>> >>> >>> >>> _____________________________________________________________________ >>> _ >>> ______________ >>> Do you Yahoo!? >>> Everyone is raving about the all-new Yahoo! Mail beta. 
>>> http://new.mail.yahoo.com >>> >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org To change your subscription > >>> (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >>> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org To change your subscription >> (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > > -- > Eric Shook (319) 335-6714 > Technical Lead, Systems and Operations - GROW http://grow.uiowa.edu -- Eric Shook (319) 335-6714 Technical Lead, Systems and Operations - GROW http://grow.uiowa.edu From hahn at physics.mcmaster.ca Tue Dec 12 09:50:29 2006 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Tue, 12 Dec 2006 12:50:29 -0500 (EST) Subject: [Beowulf] distributed file storage solution? In-Reply-To: <20061212163015.99150.qmail@web30615.mail.mud.yahoo.com> References: <20061212163015.99150.qmail@web30615.mail.mud.yahoo.com> Message-ID: | Just as an FYI, I would recommend that whatever you decide on, you develop a | "show me" attitude and test it at your location using your job mix BEFORE you | spend a dime on it. They you can decide how much time and money it will take | to make it work. have you had success with this approach? I'm pretty sure vendors would have laughed at us if we asked for it, though for something small (few TB, <100 clients, etc), I'm sure it would work. then again, anyone could do that sort of small project on a weekend. if you have hundreds of clients, expensive (non-Gb) interconnect, multiple racks, power,cooling,floor needs, I don't think you'll get a lot of people offering to do full demos for you. in short, if you do this, aren't you effectively pre-selecting high-margin vendors? 
From ctierney at hypermall.net Tue Dec 12 10:07:54 2006 From: ctierney at hypermall.net (Craig Tierney) Date: Tue, 12 Dec 2006 11:07:54 -0700 Subject: [Beowulf] distributed file storage solution? In-Reply-To: References: <20061212163015.99150.qmail@web30615.mail.mud.yahoo.com> Message-ID: <457EEFFA.4070309@hypermall.net> Mark Hahn wrote: > | Just as an FYI, I would recommend that whatever you decide on, you > develop a > | "show me" attitude and test it at your location using your job mix > BEFORE you > | spend a dime on it. They you can decide how much time and money it > will take > | to make it work. > > have you had success with this approach? I'm pretty sure vendors would > have laughed at us if we asked for it, though for something small (few > TB, <100 clients, etc), I'm sure it would work. then again, anyone > could do that sort of small project on a weekend. if you have hundreds > of clients, > expensive (non-Gb) interconnect, multiple racks, power,cooling,floor needs, > I don't think you'll get a lot of people offering to do full demos for you. > > in short, if you do this, aren't you effectively pre-selecting > high-margin vendors? My problem was never getting a vendor to let me demo their software. My problem was being able to gather enough hardware to adequately test the system as compared to where it would be deployed. At times I tested both Ibrix and Terrascale on as much of my system as I could steal from the users (768 nodes). For these two tests, they weren't even 'if it works I will buy it'. It was strictly for evaluation and to help guide future purchase decisions. They were both younger companies at the time, but it is something I would expect even today. This isn't too hard for the vendor, they just have to provide software and support. For hardware solutions (Panasas, Isilon, and now Terrascale) that takes a bit more investment on their part. It would be harder. 
For large systems, I would never enter into an agreement with a filesystem vendor that did not include an acceptance test that must be completed successfully before I paid them anything. The test would be a mixture of my applications, multiple copies of fsx, building the Linux kernel, and many-clients-to-one-file MPI-IO tests. I would be fair, and give them a chance to fix any problems found, but I am tired of buying products that don't work.

Craig

From mwill at penguincomputing.com Tue Dec 12 10:12:49 2006
From: mwill at penguincomputing.com (Michael Will)
Date: Tue, 12 Dec 2006 10:12:49 -0800
Subject: [Beowulf] distributed file storage solution?
Message-ID: <433093DF7AD7444DA65EFAFE3987879C30360B@jellyfish.highlyscyld.com>

There are only high margin vendors as far as I know. Think $10k license per I/O node.

-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Mark Hahn
Sent: Tuesday, December 12, 2006 9:50 AM
To: Buccaneer for Hire.
Cc: beowulf at beowulf.org
Subject: Re: [Beowulf] distributed file storage solution?

| Just as an FYI, I would recommend that whatever you decide on, you
| develop a "show me" attitude and test it at your location using your
| job mix BEFORE you spend a dime on it. They you can decide how much
| time and money it will take to make it work.

have you had success with this approach? I'm pretty sure vendors would have laughed at us if we asked for it, though for something small (few TB, <100 clients, etc), I'm sure it would work. then again, anyone could do that sort of small project on a weekend.
if you have hundreds of clients, expensive (non-Gb) interconnect, multiple racks, power,cooling,floor needs, I don't think you'll get a lot of people offering to do full demos for you. in short, if you do this, aren't you effectively pre-selecting high-margin vendors? _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From buccaneer at rocketmail.com Tue Dec 12 10:36:58 2006 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Tue, 12 Dec 2006 10:36:58 -0800 (PST) Subject: [Beowulf] distributed file storage solution? Message-ID: <20061212183658.34421.qmail@web30604.mail.mud.yahoo.com> ----- Original Message ---- From: Mark Hahn To: Buccaneer for Hire. Cc: beowulf at beowulf.org Sent: Tuesday, December 12, 2006 11:50:29 AM Subject: Re: [Beowulf] distributed file storage solution? > | Just as an FYI, I would recommend that whatever you decide on, you develop a > | "show me" attitude and test it at your location using your job mix BEFORE you > | spend a dime on it. They you can decide how much time and money it will take > | to make it work. > > have you had success with this approach? Absolutely it does. They promise they can do something better than anyone else can, let them prove it. > I'm pretty sure vendors would have > laughed at us if we asked for it, though for something small (few TB, > <100 clients, etc), I'm sure it would work. then again, anyone could do > that sort of small project on a weekend. if you have hundreds of clients, > expensive (non-Gb) interconnect, multiple racks, power,cooling,floor needs, > I don't think you'll get a lot of people offering to do full demos for you. Why would they laugh? We manage our cluster efficiently and get a lot of real work done. 
If you tell me you can improve that (otherwise why change anything), shouldn't you be willing to put your money where your mouth is?

When I tell my masters we should go with something, I make sure I have all the facts stacked in my favor so there are no surprises. When my masters fail to listen and impulse buy, I wind up with a file system with infiniband (which has been talked about here). A year later, when I asked my co-sysadmin here if it was finally running, he smiled and answered, "That depends on what you mean by running... Yes it runs, yes we break it 100% of the time."

>in short, if you do this, aren't you effectively pre-selecting high-margin vendors?

Who cares what the margins are? If they have something that will increase the ability of our cluster to complete more jobs in a shorter amount of time, is that not why we are here?

From hahn at physics.mcmaster.ca Tue Dec 12 10:53:44 2006
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Tue, 12 Dec 2006 13:53:44 -0500 (EST)
Subject: [Beowulf] distributed file storage solution?
In-Reply-To: <20061212183658.34421.qmail@web30604.mail.mud.yahoo.com>
References: <20061212183658.34421.qmail@web30604.mail.mud.yahoo.com>
Message-ID:

>> have you had success with this approach?
>
>Absolutely it does. They promise they can do something better than anyone
>else can, let them prove it.

what I'm skeptical about is the cost of doing such a demo.

>> I'm pretty sure vendors would have
>> laughed at us if we asked for it, though for something small (few TB,
>> <100 clients, etc), I'm sure it would work. then again, anyone could do
>> that sort of small project on a weekend.
if you have hundreds of clients,
>> expensive (non-Gb) interconnect, multiple racks, power,cooling,floor needs,
>> I don't think you'll get a lot of people offering to do full demos for you.
>
>Why would they laugh? We manage our cluster efficiently and get a lot of real
>work done. If you tell me you can improve that (otherwise why change anything),
>shouldn't you be willing to put your money where your mouth is?

how much money? my local SFS cluster is 70 TB, not really that large, and consumes 3.5 racks. (storage doesn't consume much power, so the racks are probably <10KW total, but still require dual 20x220 plugs for each rack.) I'd guess that a vendor would have to spend perhaps $20K to get such a cluster in place for demo purposes, and that's ignoring the interconnect entirely (which is quadrics in this case, so non-negligible in price ;)

>When I tell my masters we should go with something, I make sure I have all the
>facts stacked in my favor so there are no surprises. When my masters fail to listen
>and impulse buy I wind up with a file system with infiniband (which has been talked
>about here) that a year later when I asked my co-sysadmin here if it was finally
>running, he smiles and answers, "That depends on what you mean by running...

interesting; I'd enjoy hearing more about it. obviously, IB _is_ capable of working, so what appear to be the sticking points?

>>in short, if you do this, aren't you effectively pre-selecting high-margin vendors?
>
>Who cares what the margins are? If they have something that will increase the
>ability of our cluster to complete more jobs in a shorter amount of time-is that not
>why we are here?

if my 70TB costs $100/GB, it's quite a different proposition than a lower-margin solution that costs $1/GB. even though the latter might be a better solution, such a vendor would obviously never be able to afford to demo it the way you describe.
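To put rough numbers on that argument, here is a back-of-the-envelope sketch using only the figures quoted above (70 TB, $100/GB versus $1/GB, and the "perhaps $20K" demo guess). The function names are made up for this illustration, decimal units (1 TB = 1000 GB) are an assumption, and the per-GB prices are hypothetical examples from the message, not real quotes.

```python
def system_price(capacity_tb, dollars_per_gb):
    """Price in dollars, assuming decimal units (1 TB = 1000 GB)."""
    return capacity_tb * 1000 * dollars_per_gb


def demo_share(demo_cost, price):
    """Fraction of the sale price that the demo hardware would consume."""
    return demo_cost / price


if __name__ == "__main__":
    DEMO_COST = 20_000  # the "perhaps $20K" guess from the message above
    for rate in (100, 1):  # hypothetical $/GB figures from the message
        price = system_price(70, rate)
        share = demo_share(DEMO_COST, price)
        print(f"${rate}/GB: price ${price:,}, demo is {share:.1%} of it")
```

At $100/GB the 70 TB system would sell for $7,000,000 and the demo would be about 0.3% of the sale; at $1/GB it would sell for $70,000 and the same demo would eat roughly 29% of it, which is the asymmetry being described.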
I think you mentioned EMC in an earlier message, and I suspect that pretty much equates to what I'd call high-margin. this is NOT to criticize your decision or EMC, just that it's probably a lot more than $1/GB. in fact, I'd love to hear your comments about EMC as well. FWIW, the 70 TB cluster I have here is similar to three others we have at our 4 largest sites. 36 shelves of 11x 250G SATA, connected by dual-U320 to 12 servers on the cluster's interconnect (along with metadata-pair). performance is generally good, and I'm not sure whether we have any real issues with metadata slowness (this is lustre/sfs.). From bill at platform.com Tue Dec 12 11:28:15 2006 From: bill at platform.com (Bill Bryce) Date: Tue, 12 Dec 2006 14:28:15 -0500 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes Message-ID: It works but honestly I can't say it is 'production'. If you are interested in the 'roll' it is located at: http://www.osgdc.org/project/kusu/wiki/RocksRoll We tried it with Rocks 4.1 (not Rocks 4.2) and Platform OCS (http://my.platform.com/products/platform-ocs) The roll takes the existing rocks repository on the frontend and builds a vnfs image out of it for the compute nodes. It also turns off the standard rocks DHCP and PXE changes and replaces them with warewulf...it isn't pretty but it does work. Regards, Bill. -----Original Message----- From: Eric Shook [mailto:eric-shook at uiowa.edu] Sent: Tuesday, December 12, 2006 12:28 PM To: Bill Bryce Cc: Michael Will; Buccaneer for Hire.; beowulf at beowulf.org Subject: Re: [Beowulf] SATA II - PXE+NFS - diskless compute nodes Hi Bill, I will try to email them and let everyone know what they have to say. How did the Rocks -> warewulf conversion go? Thanks, Eric Bill Bryce wrote: > Hi Eric, > > You may want to send the Perceus guys an email and ask them how hard it > is to replace cAos Linux with RHEL or CentOS. 
I don't believe it should > be that hard for them to do....we modified Warewulf to install on top of > a stock Rocks cluster effectively turning a Rocks cluster into a > Warewulf cluster - and the cluster was running RHEL....so it is > possible. > > Regards, > > Bill. > > > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] > On Behalf Of Michael Will > Sent: Monday, December 11, 2006 7:15 PM > To: Eric Shook; Buccaneer for Hire. > Cc: beowulf at beowulf.org > Subject: RE: [Beowulf] SATA II - PXE+NFS - diskless compute nodes > > Scyld CW4 is based on RHEL4 and also supported on Centos 4. That does > not give you different > operating systems though, just flexible deployment of RHEL4 based HPC > compute nodes. > > Note that we had to reimplement the PXE boot part to allow reasonable > scaling. > > Michael > > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] > On Behalf Of Eric Shook > Sent: Saturday, December 09, 2006 11:28 AM > To: Buccaneer for Hire. > Cc: beowulf at beowulf.org > Subject: Re: [Beowulf] SATA II - PXE+NFS - diskless compute nodes > > Not to diverge this conversation, but has anyone had any experience > using this pxe boot / nfs model with a rhel variant? I have been > wanting to do a nfs root or ramdisk model for some-time but our software > stack requires a rhel base so Scyld and Perceus most likely will not > work (although I am still looking into both of them to make sure) > > Thanks for any help, > Eric Shook > > Buccaneer for Hire. wrote: >> [snip] >> >> >>> I agree with what Joe says about a few hundred nodes being the time >>> you would start to look closer at this approach. >>> >> I have started to explore the possibility of using this technology > because I would really like to see us with the ability to change OSs and > OS Personalities as needed. 
The question I have is with 2000+ compute > nodes what kind of infrastructure do I need to support this? >> >> ______________________________________________________________________ >> ______________ >> Do you Yahoo!? >> Everyone is raving about the all-new Yahoo! Mail beta. >> http://new.mail.yahoo.com >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org To change your subscription >> (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> -- Eric Shook (319) 335-6714 Technical Lead, Systems and Operations - GROW http://grow.uiowa.edu From buccaneer at rocketmail.com Tue Dec 12 11:35:14 2006 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Tue, 12 Dec 2006 11:35:14 -0800 (PST) Subject: [Beowulf] distributed file storage solution? Message-ID: <20061212193514.50747.qmail@web30601.mail.mud.yahoo.com> >>> have you had success with this approach? >> >>Absolutely it does. They promise they can do something better than anyone >>else can, let them prove it. > > what I'm skeptical about is the cost of doing such a demo. Cost? What cost? You can spend the time testing or spend the time doing fixes and work-arounds. Consider the cost of procuring something you can't use or will have to spend the next few years working around. Or just let it sit around heating up the room. We already use >500KVA and have no real need for a heater in there. We had just that issue with a couple of NAS vendors.
In fact, had we accepted that solution without testing, we would have purchased tons of terabytes of storage with a 100% failure rate. It took 80 nodes and 3.5 days, but it failed 100% of the time. Ooooh, I would not have liked my name attached to that 'cause it would have cast doubt on my abilities. We have found that any reputable vendor will work with you to get an eval unit; he/she wants you to test the product 'cause that means you are really interested and not kicking the tires. > interesting; I'd enjoy hearing more about it. obviously, IB _is_ capable of > working, so what appear to be the sticking points? The fs vendor came in with the IB guys and swore it all worked together, but a year later we don't have it working the way it was represented. I make sure I use that stick when the powers-that-be start thinking about things they should not be. > I think you mentioned EMC in an earlier message, and I suspect that pretty > much equates to what I'd call high-margin. this is NOT to criticize your > decision or EMC, just that it's probably a lot more than $1/GB. > > in fact, I'd love to hear your comments about EMC as well. Rule #1 is that you build your alliances carefully. These are relationships you are building for the long haul. If a vendor can not take care of you, take it as a sign that they will not be able to support you when it all hits the fan. We have had good luck with EMC. EMC produces a solid product. We beat the snot out of it every single day and it keeps ticking. But, it is all about our EMC reseller. If you have built the right kind of relationship with your vendor, when something falls apart (and it will) they will be there to help you pick up the pieces. Some 3 years ago or so, we had an issue where at 5:30 Xmas (Wed) eve we lost an array. Not a large one (~TB) but one that would be needed in a couple of days for a precondition job (everything halts until this one is done.)
I called our vendor (our sales gal was on the road heading to dinner with the folks and her husband.) I told her what my problem was and she told me straight out, "Darlin' I can't get you one until after the 1st!" We talked a little bit more and found they had something we could use to hold us over. She agreed to be sure it was delivered Friday AM (we had to get going too.) At 8AM Friday, it was delivered and the job ran so when the guys came back, nothing wrong was noticed. From josh.kayse at gtri.gatech.edu Mon Dec 11 12:32:43 2006 From: josh.kayse at gtri.gatech.edu (Josh Kayse) Date: Mon, 11 Dec 2006 15:32:43 -0500 Subject: [Beowulf] distributed file storage solution? In-Reply-To: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com> References: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com> Message-ID: <1165869163.13070.6.camel@localhost.localdomain> On Sat, 2006-12-09 at 08:21 +0545, samit wrote: > > hello list, > > i am looking for an application (most preferably, but not necessarily > OSS or free software) that can create a distributed, reliable, fault > tolerant, decentralized, high performance file server. I've looked at > a few P2P file storing solutions that store multiple copies of a file on > different servers for data reliability but i haven't found any solution > that would help me access a file as flexibly as a local file with > high performance. A cross platform solution would be even better as i > want to avoid the SMB overhead if it's just a linux based solution! > > PLEASE RECOMMEND ME SOMETHING!? feel free to suggest any solutions that > even qualify for half of these criteria! > > -bipin In house we use DRBD to build reliable, fault-tolerant NFS servers. It's worked well for us so far and meets some of your requirements.
Version 0.8 supports multi-master but ymmv. Also, it only supports 2 machines that can act as part of the group, so it's not that distributed. Hope this helps. josh From robin at workstationsuk.co.uk Tue Dec 12 00:55:28 2006 From: robin at workstationsuk.co.uk (Robin Harker) Date: Tue, 12 Dec 2006 08:55:28 -0000 (GMT) Subject: [Beowulf] distributed file storage solution? In-Reply-To: <457E2182.2050504@hypermall.net> References: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com> <200612111932.43384.kyron@neuralbs.com> <457E2182.2050504@hypermall.net> Message-ID: <1084.86.136.175.49.1165913728.squirrel@webmail.hostme.co.uk> TerraGrid is high performance, has low CPU overhead and, with the XFS version (TG2), excellent metadata performance. For example, it gets used with real-time databases, and it can be designed with no single point of failure. Robin > Mark Hahn wrote: >>> You can look into OpenAFS but be warned that you have to know >>> infrastructure software quite well (LDAP+kerberos). It's >>> cross-platform, can >>> be distributed but don't think it's up to multiple writes on different >>> mirrors though. >> >> and Lustre is parallel, high-performing and somewhat reliable/FT, but is >> not decent/distributed (it's client-server). > > Lustre supports redundant meta-data servers (MDS) and failover for the > object-storage servers (OSS). However, "high-performing" is relative. > Great at streaming data, not at meta-data.
> > Craig > > Robin Harker Workstations UK Ltd DDI: 01494 787710 Tel: 01494 724498 From robin at workstationsuk.co.uk Tue Dec 12 07:07:24 2006 From: robin at workstationsuk.co.uk (Robin Harker) Date: Tue, 12 Dec 2006 15:07:24 -0000 Subject: [Beowulf] distributed file storage solution? In-Reply-To: <457EC1EA.9090809@hypermall.net> Message-ID: <0a0e01c71dff$3ea5fcc0$6000a8c0@robinxp> Hi Craig, >Do you have it running? If so, how many nodes (targets) do you have connected? If you have it running in redundant mode, have you measured the performance difference between RAID0 and RAID5? The most targets we have tested is 27 (yes, a strange number), and over a single GigE saw 2.7GB/sec using IOZONE. The internal disc is RAID5 using SATA HW RAID controllers, no RAID0 although it would of course work. Using TG-HA (RapidScale-HA perhaps ;-) ) you see about a 30% drop in write performance, but reads are similar to the standard product. The thing we have really noticed is the filesystem/meta-data performance gains when using the XFS filesystem version. Remember it doesn't have a metadata control node as with Lustre, so it scales as you add targets. Best regards Robin It's now called Rapidscale. I heard that when they were bought out by Rackable, they had a name conflict in the US and had to change it.
Craig Robin Harker www.rackable.com Tel: 01494 724498 Cell: 07802 517059 RapidScale - High Performance Storage - One Brick at a Time -----Original Message----- From: Craig Tierney [mailto:ctierney at hypermall.net] Sent: 12 December 2006 14:51 To: robin at workstationsuk.co.uk Cc: Mark Hahn; beowulf at beowulf.org Subject: Re: [Beowulf] distributed file storage solution? Robin Harker wrote: > TerraGrid is high performance, low CPU overhead and with the XFS version, > (TG2) excellent metadata performance. e.g. gets use with real time > databases, and can be designed with no single point of failure. > Do you have it running? If so, how many nodes (targets) do you have connected? If you have it running in redundant mode, have you measured the performance difference between RAID0 and RAID5? It's now called Rapidscale. I heard that when they were bought out by Rackable, they had a name conflict in the US and had to change it. Craig > Robin > > >> Mark Hahn wrote: >>>> You can look into OpenAFS but be warned that you have to know >>>> infrastructure software quite well (LDAP+kerberos). It's >>>> cross-platform, can >>>> be distributed but don't think it's up to multiple writes on different >>>> mirrors though. >>> and Lustre is parallel, high-performing and somewhat reliable/FT, but is >>> not decent/distributed (it's client-server). >> Lustre supports redundant meta-data servers (MDS) and failover for the >> object-storage servers (OSS). However, "high-performing" is relative. >> Great at streaming data, not at meta-data. 
>> >> Craig >> > > > Robin Harker > Workstations UK Ltd > DDI: 01494 787710 > Tel: 01494 724498 From simon at thekelleys.org.uk Tue Dec 12 09:46:44 2006 From: simon at thekelleys.org.uk (Simon Kelley) Date: Tue, 12 Dec 2006 17:46:44 +0000 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes Message-ID: <457EEB04.3070809@thekelleys.org.uk> Joe Landman wrote: >>> I would hazard that any DHCP/PXE type install server would struggle >>> with 2000 requests (yes- you arrange the power switching and/or >>> reboots to stagger at N second intervals). > fwiw: we use dnsmasq to serve dhcp and handle pxe booting. It does a > marvelous job of both, and is far easier to configure (e.g. it is less > fussy) than dhcpd. Joe, you might like to know that the next release of dnsmasq includes a TFTP server so that it can do the whole job. The process model for the TFTP implementation should be well suited to booting many nodes at once because it multiplexes all the connections on the same process. My guess is that will work better than having inetd fork 2000 copies of tftpd, which is what would happen with traditional TFTP servers. If anyone on the list has a suitable test setup, I'd be very happy to do some pre-release load testing. For ultimate scalability, I guess the solution is to use multicast-TFTP. I know that support for that is included in the PXE spec, but I've never tried to implement it.
Based on prior experience of PXE ROMs, the chance of finding a sufficiently bug-free implementation of mtftp there must be fairly low. >> There are a few modifications you have to make to increase the number >> of bootps before it fails. > Likely with dhcpd, not sure how many dnsmasq can handle, but we have > done 36 at a time to do system checking. No problems with it. Dnsmasq will handle DHCP for thousands of clients on reasonably meaty hardware. The only rate-limiting step is a 2-3 second timeout while newly-allocated addresses are "ping"ed to check that they are not in use. That check is optional, and skipped automatically under heavy load, so a large number of clients is no problem. Cheers, Simon. From cousins at umeoce.maine.edu Tue Dec 12 10:58:20 2006 From: cousins at umeoce.maine.edu (Steve Cousins) Date: Tue, 12 Dec 2006 13:58:20 -0500 (EST) Subject: [Beowulf] High performance storage with GbE? In-Reply-To: <200612120635.kBC6Z4cx011660@bluewest.scyld.com> References: <200612120635.kBC6Z4cx011660@bluewest.scyld.com> Message-ID: We are currently looking to upgrade storage on a 256 node cluster for both performance and size (around 100 TB). The current thread on distributed file storage talks about a number of issues we are dealing with. We are trying to spec out storage that would give us somewhere in the neighborhood of 5 to 10 MB/sec at each node which translates to around 1.2 to 2.5 GB/sec when it gets to the storage. Of course, this is only if all nodes are trying to access/write data at once but for some models this is a possibility. We are trying to do this with GbE since we can't afford to add FC type cards to each node. Some vendors use NAS gateways to consolidate a number of FC DAS RAID arrays together and then serve out NFS via multiple GbE pipes. Others (Panasas, Isilon) do distributed storage that scales up as you add storage shelves. 
Then there are distributed file systems like PVFS and Lustre that would allow us to use node disks as scratch space. Since we are on a fairly tight budget I don't think we can afford to do all Panasas/Isilon type storage. We are probably going to need to do a tiered system with faster and slower storage with scratch and permanent space. Although a tape silo is a possibility for part of the storage, for now we want to go with all disk storage. My question is: what are other people doing to be able to provide fast GbE storage to the nodes? Thanks, Steve ______________________________________________________________________ Steve Cousins, Ocean Modeling Group Email: cousins at umit.maine.edu Marine Sciences, 452 Aubert Hall http://rocky.umeoce.maine.edu Univ. of Maine, Orono, ME 04469 Phone: (207) 581-4302 From becker at scyld.com Tue Dec 12 15:49:46 2006 From: becker at scyld.com (Donald Becker) Date: Tue, 12 Dec 2006 15:49:46 -0800 (PST) Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <457EEB04.3070809@thekelleys.org.uk> Message-ID: On Tue, 12 Dec 2006, Simon Kelley wrote: > Joe Landman wrote: > >>> I would hazard that any DHCP/PXE type install server would struggle > >>> with 2000 requests (yes- you arrange the power switching and/or > >>> reboots to stagger at N second intervals). Those that have talked to me about this topic know that it's a hot-button for me. The limit with the "traditional" approach, the ISC DHCP server with one of the three common TFTP servers, is about 40 machines before you risk losing machines during a boot. With 100 machines you are likely to lose 2-5 during a typical power-restore cycle when all machines boot simultaneously. The actual node count limit is strongly dependent on the exact hardware (e.g. the characteristics of the Ethernet switch) and the size of the boot image (larger is much worse than you would expect). Staggering node power-up is a hack to work around the limit. 
You can build a lot of complexity into doing it "right", but still be rolling the dice overall. It's better to build a reliable boot system than to build a complex system around known unreliability. The right solution is to build a smart, integrated PXE server that understands the bugs and characteristics of PXE. I wrote one a few years ago and understand many of the problems. It's clear to me that no matter how you hack up the ISC DHCP server, you won't end up with a good PXE server. (Read that carefully: yes, it's a great DHCP server; no, it's not good for PXE.) > > fwiw: we use dnsmasq to serve dhcp and handle pxe booting. It does a > > marvelous job of both, and is far easier to configure (e.g. it is less > > fussy) than dhcpd. > > Joe, you might like to know that the next release of dnsmasq includes a > TFTP server so that it can do the whole job. The process model for the > TFTP implementation should be well suited to booting many nodes at once > because it multiplexes all the connections on the same process. My guess > is that will work better than having inetd fork 2000 copies of tftpd, > which is what would happen with traditional TFTP servers. Yup, that's a good start. It's one of the many things you have to do. You are already far ahead of the "standard" approach. Don't forget flow and bandwidth control, ARP table stuffing and clean-up, state reporting, etc. Oh, and you'll find out about the PXE bug that results in a zero-length filename.. expect it. > For ultimate scalability, I guess the solution is to use multicast-TFTP. > I know that support for that is included in the PXE spec, but I've never > tried to implement it. Based on prior experience of PXE ROMs, the chance > of finding a sufficiently bug-free implementation of mtftp there must be > fairly low. This is a good example of why PXE is not just DHCP+TFTP. The multicast TFTP in PXE is not multicast TFTP.
The DHCP response specifies the multicast group to join, rather than negotiating it as per RFC2090. That means multicast requires communication between the DHCP and TFTP sections. > > Likely with dhcpd, not sure how many dnsmasq can handle, but we have > > done 36 at a time to do system checking. No problems with it. As part of writing the server I wrote DHCP and TFTP clients to simulate high node count boots. But the harshest test was old RLX systems: each of the 24 blades had three NICs, but could only boot off of the NIC connected to the internal 100base repeater/hub. Plus the blade BIOS had a good selection of PXE bugs. Another good test is booting Itaniums (really DHCP+TFTP, not PXE). They have a 7MB kernel, and a similarly large initial ramdisk. Forget to strip off the kernel symbols and you are looking at 70MB over TFTP. (But they extend the block index from 16 to 64 bits, allowing you to start a transfer that will take until the heat death of the universe to finish! Really, 32 bits is sometimes more than enough. Especially when extending a crude protocol that should have been forgotten long ago.) > Dnsmasq will handle DHCP for thousands of clients on reasonably meaty > hardware. The only rate-limiting step is a 2-3 second timeout while > newly-allocated addresses are "ping"ed to check that they are not in > use. That check is optional, and skipped automatically under heavy load, > so a large number of clients is no problem. > > > Cheers, > > Simon.
-- Donald Becker becker at scyld.com Scyld Software Scyld Beowulf cluster systems 914 Bay Ridge Road, Suite 220 www.scyld.com Annapolis MD 21403 410-990-9993 From becker at scyld.com Tue Dec 12 16:34:32 2006 From: becker at scyld.com (Donald Becker) Date: Tue, 12 Dec 2006 16:34:32 -0800 (PST) Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: Message-ID: On Sat, 9 Dec 2006, Mark Hahn wrote: > >> I would hazard that any DHCP/PXE type install server would struggle with > >> 2000 requests > > a single server (implying 1 gb nic?) might have trouble with the tftp part, > but I don't see why you couldn't scale up by splitting the tftp part > off to multiple servers. I'd expect a single DHCP (no TFTP) would be > plenty in all cases. 100 tftp clients per server would probably > be pretty safe. Some of the limits you encounter aren't solved by multiple machines running TFTP servers, but can be solved by a single clever TFTP server. TFTP is subject to something like the Ethernet "capture effect", where once a machine misses a packet, it's increasingly likely to continue to fail. And with most PXE clients, failure is fatal. So you want to avoid any TFTP retry, even if that means deferring the response to other clients when you detect a retry attempt. Another problem is that PXE clients seem to have some corner cases with ARP. It's best not to re-ARP during a download, even responding to an external request if some other machine is trying to ARP your client. > I personally like the idea of putting one admin server in each rack. > they don't have to be fancy servers, by any means. Any installed machine is added complexity. And these are machines you have to keep consistent with potentially many boot images.
(Imagine cases where you are detecting the hardware and serving the proper image.) > > There are a few modifications you have to make to increase the number of bootps before > > it fails. > > do you mean you'd expect load problems even with a single sever > dedicated only to dhcp? That only seems unlikely until you pair a script interpreter with the DHCP server. A default backlog of only 25 packets seems tiny when you are running scripts that make SQL queries before responding. But NO ONE would do that, right? Right? > > So now to figure out my next step. I will need local space for logs > > and data/temp data files. > > why would you want logs local? You want your kernel messages to be logged to the same machine that is serving your kernels. Which should be the same server that provides the kernel modules and modprobe tables that match the kernel. And you want the logging to happen as the very first thing after booting the kernel. (Boot kernel, load network driver, DHCP for loghost, dump kernel message, only then activate additional hardware and do other risky things.) -- Donald Becker becker at scyld.com Scyld Software Scyld Beowulf cluster systems 914 Bay Ridge Road, Suite 220 www.scyld.com Annapolis MD 21403 410-990-9993 From landman at scalableinformatics.com Tue Dec 12 18:26:19 2006 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 12 Dec 2006 21:26:19 -0500 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <457EEB04.3070809@thekelleys.org.uk> References: <457EEB04.3070809@thekelleys.org.uk> Message-ID: <457F64CB.5060107@scalableinformatics.com> Hi Simon Simon Kelley wrote: > Joe Landman wrote: >>>> I would hazard that any DHCP/PXE type install server would struggle >>>> with 2000 requests (yes- you arrange the power switching and/or >>>> reboots to stagger at N second intervals). > >> fwiw: we use dnsmasq to serve dhcp and handle pxe booting. It does a >> marvelous job of both, and is far easier to configure (e.g. 
it is less >> fussy) than dhcpd. > > Joe, you might like to know that the next release of dnsmasq includes a > TFTP server so that it can do the whole job. The process model for the > TFTP implementation should be well suited to booting many nodes at once > because it multiplexes all the connections on the same process. My guess > is that will work better then having inetd fork 2000 copies of tftpd, > which is what would happen with traditional TFTP servers. I am glad to hear this. I haven't found a case that ISC DHCP does a better job than dnsmasq for our clusters: the former is hard to configure properly; it is quite fussy. Add in that we don't need to configure bind on the cluster (really doesn't make much sense in most cases, unless you are doing some sort of fail-over cluster config) when we use dnsmasq... this is a good tool. We haven't explicitly enabled it in our Rocks roll as a default option, but we typically turn off bind and the local dhcp server there as this does a much better job. For our non-rocks units, we simply use this by default. Great job Simon! Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From anandvaidya.ml at gmail.com Tue Dec 12 20:15:45 2006 From: anandvaidya.ml at gmail.com (Anand Vaidya) Date: Wed, 13 Dec 2006 12:15:45 +0800 Subject: [Beowulf] distributed file storage solution? In-Reply-To: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com> References: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com> Message-ID: <200612131215.46062.anandvaidya.ml@gmail.com> On Saturday 09 December 2006 10:36, samit wrote: > hello list, > > i am looking for a application (most preferably, but not necessarily OSS or > free software) and that can create distributed, reliable, fault tolerant, > decentralized, high preformance file server. 
I've looked at few P2P file > storing solutions that store multiple copies of file in different server > for data reliability but i havent found any solution that would help me > access a file as flexibly like the local file with high preformance. Cross > platform solution would be even better as i want to aovoid the SMB overhead > if its just a linux based solution! > > PLEASE RECOMENT ME SOMETHING!? feel free to suggest any solutions that even > qualifies half of this criteria! Have you considered G-FARM? http://datafarm.apgrid.org OpenSource Grid filesystem by the Japanese researchers. I did a proof-of-concept for my customer a year ago. There was also a talk on G-FARM by Osamu Tatebe at SC2006... GFARM (Grid File System) stores metadata on an LDAP server and is able to replicate files on multiple data/file servers. They recommend you "bring" the compute job to the host containing (probably) large files rather than staging files in and out. Regards Anand > > -bipin -- ------------------------------------------------------------------------------ Regards, Anand Vaidya From simon at thekelleys.org.uk Wed Dec 13 03:40:39 2006 From: simon at thekelleys.org.uk (Simon Kelley) Date: Wed, 13 Dec 2006 11:40:39 +0000 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: References: Message-ID: <457FE6B7.9050909@thekelleys.org.uk> Donald Becker wrote: > On Tue, 12 Dec 2006, Simon Kelley wrote: > >> Joe Landman wrote: >>>>> I would hazard that any DHCP/PXE type install server would struggle >>>>> with 2000 requests (yes- you arrange the power switching and/or >>>>> reboots to stagger at N second intervals). > > Those that have talked to me about this topic know that it's a hot-button > for me. > > The limit with the "traditional" approach, the ISC DHCP server with one of > the three common TFTP servers, is about 40 machines before you risk losing > machines during a boot. 
With 100 machines you are likely to lose 2-5 > during a typical power-restore cycle when all machines boot > simultaneously. > > The actual node count limit is strongly dependent on the exact hardware > (e.g. the characteristics of the Ethernet switch) and the size of the boot > image (larger is much worse than you would expect). > > Staggering node power-up is a hack to work around the limit. You can > build a lot of complexity into doing it "right", but still be rolling the > dice overall. It's better than build a reliable boot system than to build > a complex system around known unreliability. > > The right solution is to build a smart, integrated PXE server that > understands the bugs and characteristics of PXE. I wrote one a few years > ago and understand many of the problems. It's clear to me that no matter > how you hack up the ISC DHCP server, you won't end up with a good PXE > server. (Read that carefully: yes, it's a great DHCP server; no, it's not > good for PXE.) Is that server open-source/free software, or part of Sycld's product? No judgement implied, I'm just interested to know if I can download and learn from it. > >>> fwiw: we use dnsmasq to serve dhcp and handle pxe booting. It does a >>> marvelous job of both, and is far easier to configure (e.g. it is less >>> fussy) than dhcpd. >> Joe, you might like to know that the next release of dnsmasq includes a >> TFTP server so that it can do the whole job. The process model for the >> TFTP implementation should be well suited to booting many nodes at once >> because it multiplexes all the connections on the same process. My guess >> is that will work better then having inetd fork 2000 copies of tftpd, >> which is what would happen with traditional TFTP servers. > > Yup, that's a good start. It's one of the many things you have to do. > You are already far ahead of the "standard" approach. Don't forget flow > and bandwidth control, ARP table stuffing and clean-up, state reporting, > etc. 
Oh, and you'll find out about the PXE bug that results in a > zero-length filename.. expect it. It's maybe worth giving a bit of background here: dnsmasq is a lightweight DNS forwarder and DHCP server. Think of it as being equivalent to BIND and ISC DHCP with BIND mainly in forward-only mode but doing dynamic DNS and a bit of authoritative DNS too. It's really aimed at small networks which need a DNS server and a DHCP server where the names of DHCP-configured hosts appear in the DNS but all other DNS queries get passed to upstream recursive DNS servers (typically at an ISP). Dnsmasq is widely used in the *WRT distributions which run in Linksys WRT-54G-class SOHO routers, and similar "turn your old 486 into a home router" products. It provides all the DNS and DHCP that these need in a ~100K binary that's flexible and easy to configure. Almost coincidentally, it's turned out to be useful for clusters too. I know from the dnsmasq mailing list that Joe Landman has used it in that way for a long time, and RLX used it in their control-tower product which has now been re-incarnated in HP's blade-management system. As Don Becker points out in another message, ISC's dhcpd is way too heavyweight for this sort of stuff. The dnsmasq DHCP implementation pretty much receives a UDP packet, computes a reply as a function of the input packet, the in-memory lease database and the current configuration, and synchronously sends the reply. The only time it even needs to allocate memory is when a new lease is created: everything else manages with a single packet buffer and a few statically-allocated data structures. This makes for great scalability. For the TFTP implementation I've stayed with the same implementation style, so I hope it will scale well too. I've already covered some of Don's checklist, and I'll pay attention to the rest of it, within the constraint that this has to be small and simple, to fit the primary, SOHO router, niche.
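The request/reply style described above (read one UDP packet, compute the answer from the packet plus an in-memory lease table, send it back synchronously from a single process) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not dnsmasq's code and not real DHCP: the one-line text protocol, the names compute_reply and serve, the port number and the address pool are all invented for the example.

```python
import socket


def compute_reply(packet: bytes, leases: dict, pool: list) -> bytes:
    """Build the reply from the request and the in-memory lease table.

    Hypothetical toy protocol: the request is a client id, the reply is
    "LEASE <address>" or "ERR <reason>". The lease table only grows when
    a new client shows up; repeat requests reuse the stored address.
    """
    client_id = packet.strip()
    if not client_id:
        return b"ERR empty request"
    if client_id not in leases:
        if not pool:
            return b"ERR pool exhausted"
        leases[client_id] = pool.pop(0)  # the only point where state grows
    return b"LEASE " + leases[client_id].encode()


def serve(host="127.0.0.1", port=6767, max_packets=None):
    """Single-process loop: one socket, one reply per packet, no forking."""
    leases = {}
    pool = [f"10.0.0.{i}" for i in range(10, 250)]  # made-up address pool
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        handled = 0
        while max_packets is None or handled < max_packets:
            packet, addr = sock.recvfrom(2048)
            sock.sendto(compute_reply(packet, leases, pool), addr)
            handled += 1
```

Calling serve() starts the loop; because every client is handled in the same process, 2000 simultaneous requests cost 2000 packets rather than 2000 forked server processes, which is the contrast with an inetd-spawned daemon drawn in this thread.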
> >> For ultimate scalability, I guess the solution is to use multicast-TFTP. >> I know that support for that is included in the PXE spec, but I've never >> tried to implement it. Based on prior experience of PXE ROMs, the chance >> of finding a sufficiently bug-free implementation of mtftp there must be >> fairly low. > > This is a good example of why PXE is not just DHCP+TFTP. The multicast > TFTP in PXE is not multicast TFTP. The DHCP response specifies the > multicast group to join, rather than negotiating it as per RFC2090. That > means multicast requires communication between the DHCP and TFTP sections. > >>> Likely with dhcpd, not sure how many dnsmasq can handle, but we have >>> done 36 at a time to do system checking. No problems with it. > > As part of writing the server I wrote DHCP and TFTP clients to simulate > high node count boots. But the harshest test was old RLX systems: each of > the 24 blades had three NICs, but could only boot off of the NIC > connected to the internal 100base repeater/hub. Plus the blade BIOS had a > good selection of PXE bugs. By chance, I have a couple of shelves of those available for testing. Would that be enough (48 blades, I guess) to get meaningful results? Cheers, Simon. From gmk at runlevelzero.net Wed Dec 13 10:13:27 2006 From: gmk at runlevelzero.net (Greg Kurtzer) Date: Wed, 13 Dec 2006 10:13:27 -0800 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <457B0E31.8090205@uiowa.edu> References: <20061209163852.31426.qmail@web30604.mail.mud.yahoo.com> <457B0E31.8090205@uiowa.edu> Message-ID: <561C4B00-B262-4055-8D0A-8C7929445174@runlevelzero.net> On Dec 9, 2006, at 11:27 AM, Eric Shook wrote: > Not to diverge this conversation, but has anyone had any experience > using this pxe boot / nfs model with a rhel variant?
I have been > wanting to do a nfs root or ramdisk model for some-time but our > software stack requires a rhel base so Scyld and Perceus most > likely will not work (although I am still looking into both of them > to make sure) I haven't made any announcements on this list about Perceus yet, so just to clarify: Perceus (http://www.perceus.org) works very well with RHEL, and we will soon have some VNFS capsules for the commercial distributions, with the high performance hardware, library and application stacks pre-integrated into the capsule (which we will offer, support and certify for various solutions via Infiscale, http://www.infiscale.com). Note: Perceus capsules bundle the kernel, drivers, provisioning scripts and utilities needed to provision the VNFS into a single file that is importable into Perceus with a single command. The released capsules support stateless provisioning, but work is already underway on capsules that can do stateful, NFS (almost) root, and hybrid systems. We have a user already running Perceus with RHEL capsules in HPC and another prototyping it for a web cluster solution. Also, Warewulf has been known to scale well over 2000 nodes. Perceus limits have yet to be reached, but it can natively handle load balancing and failover across multiple Perceus masters. Theoretically the limits should be well beyond Warewulf's capabilities. Version 1.0 of Perceus has been released (GPL) and now we are in bug fixing and tuning mode. We are in need of testers and documentation so if anyone is interested please let me know. -- Greg Kurtzer gmk at runlevelzero.net From bill at cse.ucdavis.edu Wed Dec 13 11:39:10 2006 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 13 Dec 2006 11:39:10 -0800 Subject: [Beowulf] High performance storage with GbE?
In-Reply-To: References: <200612120635.kBC6Z4cx011660@bluewest.scyld.com> Message-ID: <458056DE.2090707@cse.ucdavis.edu> Steve Cousins wrote: > > We are currently looking to upgrade storage on a 256 node cluster for ... > 1.2 to 2.5 GB/sec when it gets to the storage. Of course, this is only ... What do you expect the I/O to look like? Large file read/writes? Zillions of small reads/writes? To one file or directory or maybe to a file or directory per compute node? My approach so far has been to buy N dual opterons with 16 disks in each (using the areca or 3ware controllers) and use NFS. Higher end 48 port switches come with 2-4 10G uplinks. Numerous disk setups these days can sustain 800MB/sec (Dell MD-1000 external array, Areca 1261ML, and the 3ware 9650SE), all of which can be had in a 15/16 disk configuration for $8-$14k depending on the size of your 16 disks (400-500GB towards the lower end, 750GB towards the higher end). NFS would be easy, but any collection of clients (including all) would be performance limited by a single server. PVFS2 or Lustre would allow you to use N of the above file servers and get not too much less than N times the bandwidth (assuming large sequential reads and writes). In particular the Dell MD-1000 is interesting in that it allows for 2 12Gbit connections (via SAS). The docs I've found show you can access all 15 disks via a single connection, or 7 disks on one and 8 disks on the other. I've yet to find out if you can access all 15 disks via both interfaces to allow failover in case one of your fileservers dies. As previously mentioned both PVFS2 and Lustre can be configured to handle this situation. So you could buy a pair of dual opterons + SAS card (with 2 external connections), then connect each port to each array (both servers to both connections); then if a single server fails the other can take over the other server's disks. A recent quote showed that a config like this (2 servers, 2 arrays) would cost around $24k.
Assuming one spare disk per chassis and a 12+2 RAID6 array, that would provide 12TB usable (not including 5% for filesystem overhead). So 9 of the above = $216k and 108TB usable. Dell claims each of the arrays can manage 800MB/sec; things don't scale perfectly, but I wouldn't be surprised to see 3-4GB/sec using PVFS2 or Lustre. Actual data points appreciated, we are interested in a 1.5-2.0GB/sec setup. Are any of the solutions you are considering cheaper than this? Any of the dual opterons in a 16 disk chassis could manage the same bandwidth (both 3ware and areca claim 800MB/sec or so), but could not survive a file server death. From bill at cse.ucdavis.edu Wed Dec 13 14:45:00 2006 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 13 Dec 2006 14:45:00 -0800 Subject: [Beowulf] High performance storage with GbE? In-Reply-To: <458056DE.2090707@cse.ucdavis.edu> References: <200612120635.kBC6Z4cx011660@bluewest.scyld.com> <458056DE.2090707@cse.ucdavis.edu> Message-ID: <4580826C.4090401@cse.ucdavis.edu> Sorry all, my math was a bit off (thanks Mike and others). To be clearer:
* $24k or so per pair of servers and a pair of 15*750GB arrays
* The pair has 16.4TB usable (without the normal 5% reserved by the filesystem)
* The pair has 20.5TB raw (counting disks used for spare and redundancy)
* Each pair has 22 TB marketing (using TB = 10^12 bytes instead of 2^40)
So 7 pairs of servers could manage to provide the mentioned >= 100TB. Total cost of 7*$24k=$168k. Total storage cost does not include the GigE switches, 10G uplinks, nor 10G nics. Usable capacity of 7*16 = 112TB. The above config assumes that software RAID performance = hardware RAID performance or that the Dell MD-1000 can allow 2 masters access to all 15 disks (not at the same time). I'm not sure that either is true.
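The corrected per-pair figures above can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check: it assumes each 15-disk array is laid out as 12 data + 2 parity (RAID6) + 1 spare, takes a "marketing" TB as 10^12 bytes and a binary TB as 2^40 bytes, and uses hypothetical function names.

```python
TB_DECIMAL = 10**12  # "marketing" terabyte
TB_BINARY = 2**40    # binary terabyte


def pair_capacity(disks_per_array=15, arrays=2,
                  disk_bytes=750 * 10**9, data_disks_per_array=12):
    """Capacity of one server pair with its two arrays, in several units."""
    raw_bytes = disks_per_array * arrays * disk_bytes
    usable_bytes = data_disks_per_array * arrays * disk_bytes
    return {
        "marketing_tb": raw_bytes / TB_DECIMAL,  # all disks, decimal TB
        "raw_tb": raw_bytes / TB_BINARY,         # all disks, binary TB
        "usable_tb": usable_bytes / TB_BINARY,   # data disks only, binary TB
    }


def cluster_totals(pairs=7, cost_per_pair=24_000):
    """Total cost and usable capacity for a number of server pairs."""
    cap = pair_capacity()
    return {
        "cost": pairs * cost_per_pair,
        "usable_tb": pairs * cap["usable_tb"],
    }


if __name__ == "__main__":
    print(pair_capacity())
    print(cluster_totals())
```

Rounded to one decimal place this gives 22.5 (marketing), 20.5 (raw) and 16.4 (usable) TB per pair. Seven pairs come to $168k and a little over 114TB usable before filesystem overhead, slightly above the 7*16 = 112TB quoted.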
From the scaling numbers I've seen published with PVFS2 and Lustre seems like it should still be possible to manage the mentioned goal of 1.5-2.5GB/sec (which assumes 220MB/sec or 440MB/sec per file server pair for 7 pairs.) From becker at scyld.com Wed Dec 13 15:00:53 2006 From: becker at scyld.com (Donald Becker) Date: Wed, 13 Dec 2006 15:00:53 -0800 (PST) Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <457FE6B7.9050909@thekelleys.org.uk> Message-ID: On Wed, 13 Dec 2006, Simon Kelley wrote: > Donald Becker wrote: > > On Tue, 12 Dec 2006, Simon Kelley wrote: > >> Joe Landman wrote: > >>>>> I would hazard that any DHCP/PXE type install server would struggle > >>>>> with 2000 requests (yes- you arrange the power switching and/or > >>>>> reboots to stagger at N second intervals). > > > > The limit with the "traditional" approach, the ISC DHCP server with one of > > the three common TFTP servers, is about 40 machines before you risk losing > > machines during a boot. With 100 machines you are likely to lose 2-5 > > during a typical power-restore cycle when all machines boot > > simultaneously. ... > > The right solution is to build a smart, integrated PXE server that > > understands the bugs and characteristics of PXE. I wrote one a few years > > Is that server open-source/free software, or part of Sycld's product? No > judgement implied, I'm just interested to know if I can download and > learn from it. When I wrote the first implementation I expected that we would be publishing it under the GPL or a similar open source license, as we had with most of our previous software. But the problems we had with Los Alamos removing the Scyld name and copyright from our code (the Scyld PXE server uses our "beoconfig" config file interface, which is common to both BProc and BeoBoot) caused us to not publish the code initially. And as often happens, early decisions stick around far longer than you expect. 
At some point we may revisit that decision, but it's not currently a priority. I have been very willing to talk with people about the implementation, although only people such as Peter Anvin (pxelinux) and Marty Conner (Etherboot) don't quickly find a reason to "freshen their drink" when I start ;->. > >>> fwiw: we use dnsmasq to serve dhcp and handle pxe booting. It does a > >>> marvelous job of both, and is far easier to configure (e.g. it is less > >>> fussy) than dhcpd. The configuration files issue was one of the triggering reasons for investigating writing our own server. Until 2002 we were focused on BeoBoot as the solution for booting nodes, and PXE was a side thought to support a handful of special machines, such as the RLX blades. As PXE became common we went down the path of using our config file to generate ISC DHCP config files. This broke one of my rules: avoid using config files to write other config files. You can't trace updates to their effects, and can't trace problems to their source. This was a test that proved the rule: we had three independent ways to write the config files to have backups if/when we encountered a bug. But that meant three programs were broken each time the ISC DHCP config file changed incompatibly. > >> Joe, you might like to know that the next release of dnsmasq includes a > >> TFTP server so that it can do the whole job. The process model for the > >> TFTP implementation should be well suited to booting many nodes at once > >> because it multiplexes all the connections on the same process. My guess > >> is that will work better then having inetd fork 2000 copies of tftpd, > >> which is what would happen with traditional TFTP servers. > > > > Yup, that's a good start. It's one of the many things you have to do. It should repeat this: forking a dozen processes sounds like a good idea. Thinking about forking a thousand (we plan every element to scale to "at least 1000") makes "1" seem like a much better idea. 
With one continuously running server, the coding task is harder. You can't leak memory. You can't leak file descriptors. You have to check for updated/modified files. You can't block on anything. You have to re-read your config file and re-open your control sockets on SIGHUP rather than just exiting. You should show/checkpoint the current state on SIGUSR1. Once you do have all of that written, it becomes possible, even easy, to count how many bytes and packets were sent in the last timer tick and to check that every client asked for and received a packet in the last half second. Combine the two and you can smoothly switch from bandwidth control to round-robin responses, then to slightly deferring DHCP responses. > It's maybe worth giving a bit of background here: dnsmasq is a > lightweight DNS forwarder and DHCP server. Think of it as being > equivalent to BIND and ISC DHCP with BIND mainly in forward-only mode > but doing dynamic DNS and a bit of authoritative DNS too. One of the things we have been lacking in Scyld has been an external DNS service for compute nodes. For cluster-internal name lookups we developed BeoNSS. BeoNSS uses the linear address assignment of compute nodes to calculate the name or IP address, e.g. "Node23" is the IP address of Node0 + 23. So BeoNSS depends on the assignment policy of the PXE server (1). BeoNSS works great, especially when establishing all-to-all communication. But we failed to consider that external file and license servers might not be running Linux, and therefore couldn't use BeoNSS. We now see that we need DNS and NIS (2) gateways for BeoNSS names. (1) This leads to one of the many details that you have to get right. The PXE server always assigns a temporary IP address to new nodes. Once a node has booted and passed tests, we then assign it a permanent node number and IP address. Assigning short-lease IP addresses then changing them a few seconds later requires tight, race-free integration with the DHCP server and ARP tables.
That's easy with a unified server, difficult with a script around ISC DHCP. (2) We need NIS or NIS+ for netgroups. Netgroups are used to export file systems to the cluster, independent of base IP address and size changes. > Almost coincidentally, it's turned out to be useful for clusters too. I > know from the dnsmasq mailing list that Joe Landman has used it in that > way for a long time, and RLX used it in their control-tower product > which has now been re-incarnated in HP's blade-management system. I didn't know where it was used. It does explain some of the Control Tower functionality. > receives a UDP packet, computes a reply as a function of the input > packet, the in-memory lease database and the current configuration, and > synchronously sends the reply. The only time it even needs to allocate > memory is when a new lease is created: everything else manages with a > single packet buffer and a few statically-allocated data structures. > This makes for great scalability. You might consider breaking the synchronous reply aspect. It's convenient because you can build the reply into the same packet buffer as the inbound request. But it makes it difficult to defer responses. (With DHCP you can take the sleazy approach of "only respond when the elapsed-time is greater than X", at the risk of encountering PXE clients with short timeouts.) > style, so I hope it will scale well too. I've already covered some of > Don's checklist, and I'll pay attention to the rest of it, within the > constraint that this has to be small and simple, to fit the primary, SOHO > router, niche. You probably won't want to go the whole way with the implementation, but hopefully I've given some useful suggestions. > > As part of writing the server I wrote DHCP and TFTP clients to simulate > > high node count boots. But the harshest test was old RLX systems: each of > > the 24 blades had three NICs, but could only boot off of the NIC > > connected to the internal 100base repeater/hub.
Plus the blade BIOS had a > > good selection of PXE bugs. > > By chance, I have a couple a shelves of those available for testing. > Would that be enough (48 blades, I guess) to get meaningful results? Yes. Better, try running the server on one of the blades, serving the other 47. Have the blade do some disk I/O at the same time. Transmeta CPUs were not the fastest chips around, even in their prime. [[ Hmmm, did this posting come up to RGB standards of length+detail? ]] -- Donald Becker becker at scyld.com Scyld Software Scyld Beowulf cluster systems 914 Bay Ridge Road, Suite 220 www.scyld.com Annapolis MD 21403 410-990-9993 From eric-shook at uiowa.edu Wed Dec 13 18:44:04 2006 From: eric-shook at uiowa.edu (Eric Shook) Date: Wed, 13 Dec 2006 20:44:04 -0600 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <561C4B00-B262-4055-8D0A-8C7929445174@runlevelzero.net> References: <20061209163852.31426.qmail@web30604.mail.mud.yahoo.com> <457B0E31.8090205@uiowa.edu> <561C4B00-B262-4055-8D0A-8C7929445174@runlevelzero.net> Message-ID: <4580BA74.9040302@uiowa.edu> Thank you for commenting on this Greg. I might look deeper into perceus as an option if rhel (and particularly variants as in Scientific Linux) work well. Our infrastructure will most likely include nfs-root, possibly hybrid and full-install. So if Perceus can support it with a few simple VNFS capsules then that should simplify administration greatly. Would you declare Perceus as production quality? Or would our production infrastructure be a large-scale test? (Which I'm not sure if I'm comfortable being a test case with our production clusters ;o) Thanks, Eric Greg Kurtzer wrote: > > On Dec 9, 2006, at 11:27 AM, Eric Shook wrote: > >> Not to diverge this conversation, but has anyone had any experience >> using this pxe boot / nfs model with a rhel variant? 
I have been >> wanting to do a nfs root or ramdisk model for some-time but our >> software stack requires a rhel base so Scyld and Perceus most likely >> will not work (although I am still looking into both of them to make >> sure) > > I haven't made any announcements on this list about Perceus yet, so just > to clarify: > > Perceus (http://www.perceus.org) works very well with RHEL and we will > soon have some VNFS capsules for the commercial distributions including > high performance hardware and library stack and application stack > pre-integrated into the capsule (which we will offer, support and > certify for various solutions via Infiscale (http://www.infiscale.com). > > note: Perceus capsules contain the kernel, drivers, provisioning scripts > and utilities to support provisioning the VNFS into a single file that > is importable into Perceus with a single command. The released capsules > support stateless provisioning, but there is already work in creating > capsules that can do statefull, NFS (almost)root, and hybrid systems. > > We have a user already running Perceus with RHEL capsules in HPC and > another prototyping it for a web cluster solution. > > Also, Warewulf has been known to scale well over 2000 nodes. Perceus > limits have yet to be reached, but it can natively handle load balancing > and fail over multiple Perceus masters. Theoretically the limits should > be well beyond Warewulf's capabilities. > > Version 1.0 of Perceus has been released (GPL) and now we are in bug > fixing and tuning mode. We are in need of testers and documentation so > if anyone is interested please let me know. 
> > -- > Greg Kurtzer > gmk at runlevelzero.net > > > -- Eric Shook (319) 335-6714 Technical Lead, Systems and Operations - GROW http://grow.uiowa.edu From eric-shook at uiowa.edu Wed Dec 13 19:14:22 2006 From: eric-shook at uiowa.edu (Eric Shook) Date: Wed, 13 Dec 2006 21:14:22 -0600 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: References: Message-ID: <4580C18E.6020706@uiowa.edu> Thanks for the links Bill. I will check them out. Before I try it, how likely would it work for a system other than yours? :O) Bill Bryce wrote: > It works but honestly I can't say it is 'production'. If you are > interested in the 'roll' it is located at: > > http://www.osgdc.org/project/kusu/wiki/RocksRoll > > We tried it with Rocks 4.1 (not Rocks 4.2) and Platform OCS > (http://my.platform.com/products/platform-ocs) > > The roll takes the existing rocks repository on the frontend and builds > a vnfs image out of it for the compute nodes. It also turns off the > standard rocks DHCP and PXE changes and replaces them with warewulf...it > isn't pretty but it does work. > > Regards, > > Bill. > > -----Original Message----- > From: Eric Shook [mailto:eric-shook at uiowa.edu] > Sent: Tuesday, December 12, 2006 12:28 PM > To: Bill Bryce > Cc: Michael Will; Buccaneer for Hire.; beowulf at beowulf.org > Subject: Re: [Beowulf] SATA II - PXE+NFS - diskless compute nodes > > Hi Bill, > > I will try to email them and let everyone know what they have to say. > > How did the Rocks -> warewulf conversion go? > > Thanks, > Eric > > Bill Bryce wrote: >> Hi Eric, >> >> You may want to send the Perceus guys an email and ask them how hard > it >> is to replace cAos Linux with RHEL or CentOS. I don't believe it > should >> be that hard for them to do....we modified Warewulf to install on top > of >> a stock Rocks cluster effectively turning a Rocks cluster into a >> Warewulf cluster - and the cluster was running RHEL....so it is >> possible. >> >> Regards, >> >> Bill. 
>> >> >> -----Original Message----- >> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] >> On Behalf Of Michael Will >> Sent: Monday, December 11, 2006 7:15 PM >> To: Eric Shook; Buccaneer for Hire. >> Cc: beowulf at beowulf.org >> Subject: RE: [Beowulf] SATA II - PXE+NFS - diskless compute nodes >> >> Scyld CW4 is based on RHEL4 and also supported on Centos 4. That does >> not give you different >> operating systems though, just flexible deployment of RHEL4 based HPC >> compute nodes. >> >> Note that we had to reimplement the PXE boot part to allow reasonable >> scaling. >> >> Michael >> >> -----Original Message----- >> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] >> On Behalf Of Eric Shook >> Sent: Saturday, December 09, 2006 11:28 AM >> To: Buccaneer for Hire. >> Cc: beowulf at beowulf.org >> Subject: Re: [Beowulf] SATA II - PXE+NFS - diskless compute nodes >> >> Not to diverge this conversation, but has anyone had any experience >> using this pxe boot / nfs model with a rhel variant? I have been >> wanting to do a nfs root or ramdisk model for some-time but our > software >> stack requires a rhel base so Scyld and Perceus most likely will not >> work (although I am still looking into both of them to make sure) >> >> Thanks for any help, >> Eric Shook >> >> Buccaneer for Hire. wrote: >>> [snip] >>> >>> >>>> I agree with what Joe says about a few hundred nodes being the time >>>> you would start to look closer at this approach. >>>> >>> I have started to explore the possibility of using this technology >> because I would really like to see us with the ability to change OSs > and >> OS Personalities as needed. The question I have is with 2000+ compute >> nodes what kind of infrastructure do I need to support this? >>> >>> >>> >>> >>> >>> >>> >>> > ______________________________________________________________________ >>> ______________ >>> Do you Yahoo!? >>> Everyone is raving about the all-new Yahoo! 
Mail beta. >>> http://new.mail.yahoo.com >>> >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org To change your subscription > >>> (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >>> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org To change your subscription >> (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > -- Eric Shook (319) 335-6714 Technical Lead, Systems and Operations - GROW http://grow.uiowa.edu From gmk at runlevelzero.net Wed Dec 13 19:04:29 2006 From: gmk at runlevelzero.net (Greg Kurtzer) Date: Wed, 13 Dec 2006 19:04:29 -0800 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <4580BA74.9040302@uiowa.edu> References: <20061209163852.31426.qmail@web30604.mail.mud.yahoo.com> <457B0E31.8090205@uiowa.edu> <561C4B00-B262-4055-8D0A-8C7929445174@runlevelzero.net> <4580BA74.9040302@uiowa.edu> Message-ID: <0C8F2EE7-1BC7-4A67-BF05-C862C6D0E507@runlevelzero.net> On Dec 13, 2006, at 6:44 PM, Eric Shook wrote: > Thank you for commenting on this Greg. I might look deeper into > perceus as an option if rhel (and particularly variants as in > Scientific Linux) work well. Yes, we already have Centos and Caos 2&3 base images that most people are using for testing. > Our infrastructure will most likely include nfs-root, possibly > hybrid and full-install. So if Perceus can support it with a few > simple VNFS capsules then that should simplify administration greatly. These should be coming very soon. :) > > Would you declare Perceus as production quality? Or would our > production infrastructure be a large-scale test? 
(Which I'm not > sure if I'm comfortable being a test case with our production > clusters ;o) It depends on when you are ready to migrate. Our first test system was a 512 node Inifiband cluster and it worked without incident. We also have several other large prospects on the horizon (including vendor partnerships) so production readiness won't be a problem. With that said, I would wait until I do the formal press release (waiting for the 1.0 tree to finish getting hammered out by our testers and initial users). Thanks for inquiring! -- Greg Kurtzer gmk at runlevelzero.net From simon at thekelleys.org.uk Thu Dec 14 08:04:14 2006 From: simon at thekelleys.org.uk (Simon Kelley) Date: Thu, 14 Dec 2006 16:04:14 +0000 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: References: Message-ID: <458175FE.8080600@thekelleys.org.uk> Donald Becker wrote: >> Is that server open-source/free software, or part of Sycld's product? No >> judgement implied, I'm just interested to know if I can download and >> learn from it. > > When I wrote the first implementation I expected that we would be > publishing it under the GPL or a similar open source license, as we had > with most of our previous software. > But the problems we had with Los Alamos removing the Scyld name and > copyright from our code (the Scyld PXE server uses our "beoconfig" > config file interface, which is common to both BProc and BeoBoot) caused > us to not publish the code initially. And as often happens, early > decisions stick around far longer than you expect. > > At some point we may revisit that decision, but it's not currently > a priority. I have been very willing to talk with people about the > implementation, although only people such as Peter Anvin (pxelinux) and > Marty Conner (Etherboot) don't quickly find a reason to "freshen their > drink" when I start ;->. My glass is full; let us continue! > It should repeat this: forking a dozen processes sounds like a good idea. 
> Thinking about forking a thousand (we plan every element to scale to "at > least 1000") makes "1" seem like a much better idea. > > With one continuously running server, the coding task is harder. You > can't leak memory. You can't leak file descriptors. You have to check for > updated/modified files. You can't block on anything. You have to re-read > your config file and re-open your control sockets on SIGHUP rather than > just exiting. You should show/checkpoint the current state on SIGUSR1. All that stuff is there, and has been bedded down over several years. The TFTP code is an additional 500 lines. > > Once you do have all of that written, it's now possible, even easy, to > count have many bytes and packets were sent in the last timer tick and to > check that every client asked for and received packet in the last half > second. Combine the two and you can smoothly switch from bandwidth > control to round-robin responses, then to slightly deferring DHCP > responses. I'm not quite following here: It seems like you might be advocating retransmits every half second. I'm current doing classical exponential backoff, 1 second delay, then two, then four etc. Will that bite me? I'm doing round-robin, but I don't see how to throttle active connections: do I need to do that, or just limit total bandwidth? > >> It's maybe worth giving a bit of background here: dnsmasq is a >> lightweight DNS forwarder and DHCP server. Think of it as being >> equivalent to BIND and ISC DHCP with BIND mainly in forward-only mode >> but doing dynamic DNS and a bit of authoritative DNS too. > > One of the things we have been lacking in Scyld has been an external DNS > service for compute nodes. For cluster-internal name lookups we > developed BeoNSS. Dnsmasq is worth a look. > > BeoNSS uses the linear address assignment of compute nodes to > calculate the name or IP address e.g. "Node23" is the IP address > of Node0 + 23. So BeoNSS depends on the assignment policy of > the PXE server (1). 
To do that with dnsmasq you'll have to nail down the IP address associated with every MAC address. DHCP IP address assignment to anonymous hosts is pseudo-random. (Actually, it's done using a hash of the MAC address. That allows repeated DHCPDISCOVERs to be offered the same IP address without needing any server-side state until a lease is actually allocated. Some DHCP clients depend on getting the same answer to repeated DISCOVERs, without any support whatsoever from the standard.) OTOH if you use dnsmasq to provide your name service you might not need the linear assignment. > > BeoNSS works great, especially when establishing all-to-all communication. > But we failed to consider that external file and license servers might not > be running Linux, and therefore couldn't use BeoNSS. We now see that > we need DNS and NIS (2) gateways for BeoNSS names. > > (1) This leads to one of the many details that you have to get right. > The PXE server always assigns a temporary IP address to new nodes. Once > a node has booted and passed tests, we then assign it a permanent node > number and IP address. Assigning short-lease IP addresses then changing a > few seconds later requires tight, race-free integration with the DHCP > server and ARP tables. That's easy with a unified server, difficult with > a script around ISC DHCP. Is this a manifestation of the with-and-without-client-id problem? PXE sends a client-id, but the OS doesn't, or vice-versa. Dnsmasq has nailed down rules which work in most cases of this, mainly by trial-and-error. > > You probably won't want to go the whole way with the implementation, but > hopefully I've given some useful suggestions. Agreed, and thanks for the pointers. Cheers, Simon.
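The hash-based address selection mentioned in the message above can be sketched roughly as follows. This is an illustration of the idea only, not dnsmasq's actual algorithm: the choice of SHA-256, the linear probing on collision, and the names offer_address and leased are all assumptions made for the example.

```python
import hashlib
import ipaddress


def offer_address(mac, pool_start, pool_size, leased):
    """Pick the address to offer a client, derived from its MAC address.

    `leased` maps already-leased IPv4Address objects to the MAC holding
    them. Because the starting slot depends only on the MAC, repeated
    DISCOVERs from the same client get the same offer without the server
    having to remember earlier offers.
    """
    mac = mac.lower()
    first = ipaddress.IPv4Address(pool_start)
    digest = hashlib.sha256(mac.encode("ascii")).digest()
    start = int.from_bytes(digest[:4], "big") % pool_size
    for step in range(pool_size):
        candidate = first + (start + step) % pool_size
        # Free addresses, and ones already leased to this client, are usable.
        if leased.get(candidate, mac) == mac:
            return candidate
    return None  # pool exhausted
```

Server-side state is only needed once a lease is actually granted, at which point the address would be recorded in leased.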
From becker at scyld.com Thu Dec 14 13:07:14 2006 From: becker at scyld.com (Donald Becker) Date: Thu, 14 Dec 2006 13:07:14 -0800 (PST) Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <458175FE.8080600@thekelleys.org.uk> Message-ID: On Thu, 14 Dec 2006, Simon Kelley wrote: > Donald Becker wrote: > > It should repeat this: forking a dozen processes sounds like a good idea. > > Thinking about forking a thousand (we plan every element to scale to "at > > least 1000") makes "1" seem like a much better idea. > > > > With one continuously running server, the coding task is harder. You > > can't leak memory. You can't leak file descriptors. You have to check for > > updated/modified files. You can't block on anything. You have to re-read > > your config file and re-open your control sockets on SIGHUP rather than > > just exiting. You should show/checkpoint the current state on SIGUSR1. > > All that stuff is there, and has been bedded down over several years. > The TFTP code is an additional 500 lines. It's not difficult to write a TFTP server. (The "trivial" in the name is a hint for those that haven't tried it.) It's difficult to write a reliable, scalable one. But you have a head start. > > Once you do have all of that written, it's now possible, even easy, to > > count how many bytes and packets were sent in the last timer tick and to > > check that every client asked for and received a packet in the last half > > second. Combine the two and you can smoothly switch from bandwidth > > control to round-robin responses, then to slightly deferring DHCP > > responses. > > I'm not quite following here: It seems like you might be advocating > retransmits every half second. I'm currently doing classical exponential > backoff, 1 second delay, then two, then four etc. Will that bite me? Where are you doing exponential back-off? For the TFTP client? The TFTP client will/should/might do a retry every second.
(Background: TFTP uses "ACK" of the previous packet to mean "send the next one". The only way to detect that this is a retry is timing.) The client might do a re-ARP first. In corner cases it might not reply to ARP itself. [[ Step up on the soapbox. ]] What idiot thought that exponential backoff was a good idea? Exponential backoff doesn't make sense where your base time period is a whole second and you can't tell if the reason for no response is failure, busy network or no one listening. My guess is that they were just copying Ethernet, where modified, randomized exponential backoff is what makes it magically good. Exponential backoff makes sense at the microsecond level, where you have a collision domain and potentially 10,000 hosts on a shared ether. Even there the idea of "carrier sense" or 'is the network busy' is what enables Ethernet to work at 98+% utilization rather than the 18% or 37% theoretical of Aloha Net. (Key difference: deaf transmitter.) What usually happens with DHCP and PXE is that the first packet is used getting the NIC to transmit correctly. The second packet is used to get the switch to start passing traffic. The third packet gets through, but we are already well into the exponential backoff. PXE would be much better and more reliable if it started out transmitting a burst of four DHCP packets evenly spaced in the first second, then fell back to once per second. If there is a concern about DHCP being a high percentage of traffic in huge installations running 10baseT, tell them to buy a server. Or, like, you know, a router. Because later the ARP traffic alone will dwarf a few DHCP broadcasts. > I'm doing round-robin, but I don't see how to throttle active > connections: do I need to do that, or just limit total bandwidth? Yes, you need to throttle active TFTP connections. The clients currently winning can turn around a next-packet request really quickly. If a few get in lock step, the server will have the next chunk of the file warm in the cache.
This is the start of locking out the first loser. You can't just let the ACKs queue up in the socket as a substitute for deferring responses either. You have to pull them out ASAP and mark that client as needing a response. This doesn't cost very much. You need to keep the client state structure anyway. This is just one more bit, plus updating the timeval that you should be keeping anyway. > >> It's maybe worth giving a bit of background here: dnsmasq is a > >> lightweight DNS forwarder and DHCP server. Think of it as being > >> equivalent to BIND and ISC DHCP with BIND mainly in forward-only mode > >> but doing dynamic DNS and a bit of authoritative DNS too. > > > > One of the things we have been lacking in Scyld has been an external DNS > > service for compute nodes. For cluster-internal name look-ups we > > developed BeoNSS. > Dnsmasq is worth a look. We likely can't leverage anything there. We already have a name system in BeoNSS. We just need the gateway from this NSS to DNS queries. > > BeoNSS uses the linear address assignment of compute nodes to > > calculate the name or IP address e.g. "Node23" is the IP address > > of Node0 + 23. So BeoNSS depends on the assignment policy of > > the PXE server (1). > To do that with dnsmasq you'll have to nail down the IP address > associated with every MAC address. .. > standard.) OTOH if you use dnsmasq to provide your name service you > might not need the linear assignment. I consider naming and numbering an important detail. The freedom to assign arbitrary names and IP addresses is a useful flexibility in a workstation environment. But for a compute room or cluster you want regular names and automatic-but-persistent IP addresses. We assign compute nodes a small integer node number the first time we accept them into the cluster. This is the node's persistent ID unless the administrator manually changes it. We used to allow node specialization based on MAC address as well as node number. 
The idea was the MAC address identified the specific machine hardware (e.g. extra disks or a frame buffer, server #6 of 16 in a PVFS array), while the node number might be used to specialize for a logical purpose. What we quickly found was that mostly-permanent node number assignment was a useful simplification. We deprecated using MAC specialization in favor of the node number being used for both physical and logical specialization. Just like you don't want your home address to change when a house down the street burns down, you don't want node IP addresses or node numbering to change. But you want automatic numbering when the street is extended or a new house is built on a vacant lot, with a manual override saying this house replaces the one that burnt down. [[ Do I get extra points for not using an automotive analogy? I can throw them away with "You don't care about the cylinder numbering in your car. But it's useful to have them numbered when you replace the spark plug cables." ]] > > (1) This leads to one of the many details that you have to get right. > > The PXE server always assigns a temporary IP address to new nodes. Once > > a node has booted and passed tests, we then assign it a permanent node > > number and IP address. Assigning short-lease IP addresses then changing a > > few seconds later requires tight, race-free integration with the DHCP > > server and ARP tables. That's easy with a unified server, difficult with > > a script around ISC DHCP. > > Is this a manifestation of the with-and-without-client-id problem? PXE > sends a client-id, but the OS doesn't, or vice-versa. Dnsmasq has nailed > down rules which work in most cases of this, mainly by trail-and-error. No, it's a different issue. PXE does have UUIDs, a universally unique ID that is distinct from MAC addresses. If you implement from the spec, you can use the UUID to pass out IP addresses and avoid the messiness of using the MAC address. 
I know I have the first machine built with the feature. It has the UUID with all zeros :-O. Then I have a whole bunch of other machines that must have been built for other universes because they have exactly the same all-zeros ID. Even when the UUID is distinct, it doesn't uniquely ID the machine. Different NICs on the same machine have different UUIDs, meaning you can not detect that it's the same machine you got a request from a few seconds ago. Bottom line: UUIDs are wildly useless. We address the multi-NIC case, along with a few others, by only assigning a persistent node number after the machine boots and runs a test program. The test program is elegantly simple: a Linux-based DHCP client. The request packets have an option field of all MAC addresses. (BTW, this is the same DHCP client code originally written to do PXE scalability tests.) -- Donald Becker becker at scyld.com Scyld Software Scyld Beowulf cluster systems 914 Bay Ridge Road, Suite 220 www.scyld.com Annapolis MD 21403 410-990-9993 From becker at scyld.com Thu Dec 14 14:33:33 2006 From: becker at scyld.com (Donald Becker) Date: Thu, 14 Dec 2006 14:33:33 -0800 (PST) Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: <457AC11F.9060609@scalableinformatics.com> Message-ID: On Sat, 9 Dec 2006, Joe Landman wrote: > Guy Coates wrote: > > At what node count does the nfs-root model start to break down? Does anyone > > have any rough numbers with the number of clients you can support with a generic > > linux NFS server vs a dedicated NAS filer? > > If you use warewulf or the new perceus variant, it creates a ram disk > which is populated upon boot. Thats one of the larger transients. Then > you nfs mount applications, and home directories. I haven't looked at > Scyld for a while, but I seem to remember them doing something like this. I forgot to finish my reply to this message earlier this week. Since I'm in the writing mood today, I've finished it. 
Just when we were getting past "diskless" being misinterpreted as "NFS root"...

Scyld does use "ramdisks" in our systems, but calling it "ramdisk based" misses the point of the system.

Booting: RAMdisks are critical

Ramdisks are a key element of the boot system. Clusters need reliable, stateless node booting. We don't want local misconfiguration, failed storage hardware or corrupted file systems to prevent booting.

The boot ramdisks have to be small, simple, and reliable. Large ramdisks multiply PXE problems and have obvious server scalability issues. Complexity is bad because the actions are happening "blind", with no easy way to see what step went wrong. We try to keep these images stable, with only ID table and driver updates.

Run-time: RAMdisks are just an implementation detail

The run-time system uses ramdisks almost incidentally. The real point of our system is creating a single point of administration and control -- a single virtual system. To that end we have a dynamic caching, consistent execution model. The real root "hypervisor" operates out of a ramdisk to be independent of the hardware and storage that might be used by application environments. The application root and caching system default to using ramdisks, but they can be configured to use local or network storage.

The "real root" ramdisk is pretty small and simple. It's never seen by the applications, and only needs to keep its own housekeeping info.

The largest ramdisk in the system is the "libcache" FS. This cache starts out empty. As part of the node accepting new applications, the execution system (BProc or BeoProc) verifies that the correct versions of executables and libraries are available locally. By the time the node says "yah, I'll accept that job" it has cached the exact version it needs to run. (*)

So really we are not using a "ramdisk install". We are dynamically detecting hardware, and loading the right kernel and device drivers under control of the boot system.
Then we are creating a minimal custom "distribution" on the compute nodes. The effect is the same as creating a minimal custom "distribution" for that specific machine -- an installation that has only the kernel, device drivers and applications to be run on that node.

This approach to dynamically building an installation is feasible and efficient because of another innovation: a sharp distinction between full, standard "master" nodes and lightweight compute "slave" nodes. Only master nodes run the full, several-minute initialization to start standard services and daemons. ("How many copies of crond do you need?") Compute slaves exist only to run the end applications, and have a master with its full reference install to fall back on when they need to extend their limited environment.

* Whole file caching is one element of the reliability model. It means we can continue to run even if that master stops responding, or replaces a file with a newer version. We provide a way for sophisticated sites to replace the file cache with a network file system, but then the file server must be up to continue running and you can run into versioning/consistency issues.

RAMdisk Inventory

We actually have five (!) different types of ramdisks over the system (see the descriptions below). But it's the opposite of the Warewulf approach. Our architecture is a consistent system model, so we dynamically build and update the environment on nodes. Warewulf-like ramdisk systems only catch part of what we are doing:

The Warewulf approach
- Uses a manually selected subset distribution on the compute node ramdisk. While still very large, it's never quite complete. No matter how useless you think some utility is, there is probably some application out there that depends on it.
- The ramdisk image is very large and it has to be completely downloaded at boot time just when the server is extremely busy.
- Supplements the ramdisk with NFS, combining the problems of both.(*1) The administrator and users have to learn and think about how both fail.

(*1) That said, combining a ramdisk root with NFS is still far more scalable and somewhat more robust than using solely NFS. With careful administration most of the executables will be on the ramdisk, allowing the server to support more nodes and reducing the likelihood of failures.

The phrase "careful administration" should be read as "great for demos, and when the system is first configured, but degrades over time". The type of people that leap to configure the ramdisk properly the first time are generally not the same type that will be there for long-term manual tuning. Either they figure out why we designed around dynamic, consistent caching and re-write, or the system will degrade over time.

Ramdisk types

For completeness, here are the five ramdisk types in Scyld:

BeoBoot stage 1: (The "Booster Stage") Used only for non-PXE booting. Now obsolete, this allowed network booting on machines that didn't have it built in. The kernel+ramdisk was small enough to fit on floppy, CD-ROM, hard disk, Disk-on-chip, USB, etc. This ramdisk image contains NIC detection code and tables, along with every NIC driver and a method to substitute kernels. This image must be under 1.44MB, yet include all NIC drivers.

BeoBoot stage 2 ramdisk: The run-time environment set-up, usually downloaded by PXE. Pretty much the same NIC detection code as the stage 1 ramdisk, except potentially optimized for only the NICs known to be installed. The purpose of this ramdisk is to start network logging ASAP and then contact the master to download the "real" run-time environment. When we have the new environment we pivotroot and delete this whole ramdisk. We've used the contents we cared about (tables & NIC drivers), and just emptying ramdisks frequently leaks memory! It's critical that this ramdisk be small to minimize TFTP traffic.
Stage 3, Run-time environment supervisor (You can call this the "hypervisor".) This is the "real" root during operation, although applications never see it. The size isn't critical because we have full TCP from stage 2 to transfer it, but it shouldn't be huge because
- it will compete with other, less robust booting traffic
- the master will usually be busy
- large images will delay node initialization

LibCache ramdisk: This is a special-purpose file system used only for caching executables and libraries. We designed the system with a separate caching FS to optionally switch to caching on a local hard disk partition. That was useful with 32MB memory machines or when doing a rapid large-boot demo, but the added complexity is rarely useful on modern systems.

Environment root: This is the file system the application sees. There is a different environment for each master the node supports, or potentially even one for each application started. By default this is a ramdisk configured as a minimal Unix root by the master. The local administrator can change this to be a local or network file system to have a traditional "full install" environment, although that discards some of the robustness advances in Scyld.

> Scyld requires a meatier head node as I remember due to its launch model.

Not really because of the launch model, or the run-time control. It's to make the system less complex and simpler to use.

Ideally the master does less work than the compute nodes because they are doing the computations. In real life people use the master for editing, compiling, scheduling, etc. It's the obvious place to put home directories and serve them to compute nodes. And it's where the real-life cruft ends up, such as license servers and reporting tools.

Internally each type of service has its own server IP address and port. We could point them to replicated masters or other file servers. They just all point to the single master to keep things simple.
For reliability we can have cold, warm or hot spare masters. But again, it's less complex to administer one machine with redundant power supplies and hot-swap RAID5 arrays. All this makes the master node look like the big guy. -- Donald Becker becker at scyld.com Scyld Software Scyld Beowulf cluster systems 914 Bay Ridge Road, Suite 220 www.scyld.com Annapolis MD 21403 410-990-9993 From cousins at umeoce.maine.edu Thu Dec 14 10:20:10 2006 From: cousins at umeoce.maine.edu (Steve Cousins) Date: Thu, 14 Dec 2006 13:20:10 -0500 (EST) Subject: [Beowulf] High performance storage with GbE? In-Reply-To: References: <200612120635.kBC6Z4cx011660@bluewest.scyld.com> Message-ID: Thanks Bill. This is really helpful. On Wed, 13 Dec 2006, Bill Broadley wrote: > What do you expect the I/O's to look like? Large file read/writes? Zillions > of small reads/writes? To one file or directory or maybe to a file or > directory per compute node? We are basing our specs on large file use. The cluster is used for many things so I'm sure there will be some cases where small file writes will be done. Most of the work I do deals with large file reads and writes and that is what we are basing our desired performance on. I don't think we can afford to try to get this type of bandwidth for multiple small file writes. > My approach so far as been to buy N dual opterons with 16 disks in each > (using the areca or 3ware controllers) and use NFS. Higher end 48 port > switches come with 2-4 10G uplinks. Numerous disk setups these days > can sustain 800MB/sec (Dell MD-1000 external array, Areca 1261ML, and the > 3ware 9650SE) all of which can be had in a 15/16 disk configuration for > $8-$14k depending on the size of your 16 disks (400-500GB towards the lower > end, 750GB towards the higher end). Do you have a system like this in place right now? > NFS would be easy, but any collection of clients (including all) would be > performance limited by a single server. This would be a problem, but... 
> PVFS2 or Lustre would allow you to use N of the above file servers and
> get not too much less than N times the bandwidth (assuming large sequential
> reads and writes).

... this sounds hopeful. How manageable is this? Is it something that would take a FTE to keep going with 9 of these systems? I guess it depends on the systems themselves and how much fault tolerance there is.

> In particular the Dell MD-1000 is interesting in that it allows for 2 12Gbit
> connections (via SAS), the docs I've found show you can access all 15
> disks via a single connection or 7 disks on one, and 8 disks on the other.
> I've yet to find out if you can access all 15 disks via both interfaces
> to allow fallover in case one of your fileservers dies. As previously
> mentioned both PVFS2 and Lustre can be configured to handle this situation.
>
> So you could buy a pair of dual opterons + SAS card (with 2 external
> conenctions) then connect each port to each array (both servers to
> both connections), then if a single server fails the other can take
> over the other servers disks.
>
> A recent quote showed that for a config like this (2 servers 2 arrays) would
> cost around $24k. Assuming one spare disk per chassis, and a 12+2 RAID6 array
> and provide 12TB usable (not including 5% for filesystem overhead).

Are the 1 TB drives out now? With 750 GB drives wouldn't it be 9 TB per array? We have a 13+2 RAID6 + hot spare array with 750 GB drives, and with an XFS file system we get 8.9 TiB.

> So 9 of the above = $216k and 108TB usable, each of the arrays Dell claims
> can manage 800MB/sec, things don't scale perfectly but I wouldn't be surprised
> to see 3-4GB/sec using PVFS2 or Lustre. Actual data points appreciated, we
> are interested in a 1.5-2.0GB/sec setup.

Based on 8.9 TiB above for 16 drives, it looks like 8.2 TiB for 15 drives, so we'd want 12 of these to get about 98 TiB usable storage.
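As a sanity check on the capacity figures above, here is a small Python sketch of the arithmetic. It assumes usable space is simply the number of data drives times the vendor's decimal drive size (750 GB = 750 * 10^9 bytes), converted to TiB, and it ignores file system overhead. The function name is made up for illustration.

```python
# Rough RAID6 capacity arithmetic; ignores file system and metadata overhead.
TIB = 2 ** 40  # bytes in one tebibyte


def usable_tib(data_disks, drive_bytes=750 * 10**9):
    """Usable capacity in TiB for an array with `data_disks` data drives.

    Parity and hot-spare drives are assumed to be excluded from `data_disks`
    already, and drives are sized in decimal (vendor) gigabytes.
    """
    return data_disks * drive_bytes / TIB


# 16-drive chassis: 13 data + 2 parity + 1 hot spare.
print(round(usable_tib(13), 1))    # 8.9
# 15-drive chassis: 12 data + 2 parity + 1 hot spare.
print(round(usable_tib(12), 1))    # 8.2
# Twelve of the 15-drive arrays.
print(round(12 * usable_tib(12)))  # 98
```

Most of the gap between 9 TB and 8.9 TiB (or 8.2 TiB) is just the decimal-versus-binary unit conversion, before any file system overhead is counted.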
I don't know what the overhead is in PVFS2 or Lustre compared to XFS but I'd doubt it would be any less so we might even need 13. So, 13 * $24K = $312K. Ah, what's another $100K. > Are any of the solutions you are considering cheaper than this? Any of the > dual opterons in a 16 disk chassis could manage the same bandwidth (both 3ware > and areca claim 800MB/sec or so), but could not survive a file server death. So far this is the best price for something that can theoretically give the desired performance. I say theoretically here because I'm not sure what parts of this you have in place. I'm trying to find real-world implementations that provide in the ballpark of 5 to 10 MB/sec at the nodes when on the order of a hundred nodes are writing/reading at the same time. Are you using PVFS2 or Lustre with your N Opteron servers? When you run a job with many nodes writing large files at the same time what kind of performance do you get per node? What is your value of N for the number of Opteron server/disk arrays you have implemented? Thanks again for all of this information. I hadn't been thinking seriously of PVFS2 or Lustre because I'd been thinking more in the lines of individual disks in nodes. Using RAID arrays would be much more manageable. Are there others who have this type of system implemented who can provide performance results as well as a view on how manageable it is? Thanks, Steve From cousins at umeoce.maine.edu Thu Dec 14 10:58:53 2006 From: cousins at umeoce.maine.edu (Steve Cousins) Date: Thu, 14 Dec 2006 13:58:53 -0500 (EST) Subject: [Beowulf] Re: High performance storage with GbE? In-Reply-To: <200612141753.kBEHrwaE007602@bluewest.scyld.com> References: <200612141753.kBEHrwaE007602@bluewest.scyld.com> Message-ID: Bill Broadley wrote: > Sorry all, my math was a bit off (thanks Mike and others). 
> > To be clearer: > * $24k or so per pair of servers and a pair of 15*750GB arrays > * The pair has 16.4TB usable (without the normal 5% reserved by the > filesystem) > * The pair has 20.5TB raw (counting disks use for spare and redundancy) > * Each pair has 22 TB marketing (Using TB = 10^9 instead of 2^30) > > So 7 servers could manage to provide the mentioned >= 100TB. Total cost of > 7*$24k=$168k. Total storage cost does not include the GigE switches, 10G > uplinks, nor 10G nics. Usable capacity of 7*16 = 112TB. Ah, I see. That brings it down quite a bit. Thanks for clarifying. > The above config assumes that software RAID performance = hardware RAID > performance or that the Dell MD-1000 can allow 2 masters access to > all 15 disks (not at the same time). I'm not sure that either is true. > > From the scaling numbers I've seen published with PVFS2 and Lustre seems like > it should still be possible to manage the mentioned goal of 1.5-2.5GB/sec > (which assumes 220MB/sec or 440MB/sec per file server pair for 7 pairs.) Again, if anyone can provide a real-world layout of what they are using, along with real-world speeds at the node, that would help out tremendously. Thanks, Steve From auman_alef at hotmail.com Thu Dec 14 11:57:00 2006 From: auman_alef at hotmail.com (ALEF AUMAN) Date: Thu, 14 Dec 2006 20:57:00 +0100 Subject: [Beowulf] Wake On Lan (WOL) doesn't work with 3c905c Message-ID: Hello, Wake On Lan doesn't work for the following on board NIC card in a HP VL400 computer : 01:04.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 7 I'm using Ubuntu(Edgy) I've tried already with the following : 1) Add the following to /etc/modules 3c59x options=0x408 2) Add the following to /etc/modules options 3c59x enable_wol=1 but without success. I've the same machine but with windows xp installed on it, and there it works. 
Wake on Lan is enabled in the BIOS Some outputs alef at FIRSTLOOK-UB:~$ sudo ethtool eth0 Password: Settings for eth0: Supported ports: [ TP MII ] Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full Supports auto-negotiation: Yes Advertised link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full Advertised auto-negotiation: Yes Speed: 100Mb/s Duplex: Full Port: MII PHYAD: 24 Transceiver: internal Auto-negotiation: on Current message level: 0x00000001 (1) Link detected: yes alef at FIRSTLOOK-UB:~$ I don't have the following lines Supports Wake-on: g Wake-on: g and this is not a good sign. Maybe the driver is not good in Ubuntu? Does somebody have some other ideas? With kind regards Alef From simon at thekelleys.org.uk Thu Dec 14 15:01:47 2006 From: simon at thekelleys.org.uk (Simon Kelley) Date: Thu, 14 Dec 2006 23:01:47 +0000 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: References: Message-ID: <4581D7DB.7020903@thekelleys.org.uk> Donald Becker wrote: >> >>I'm not quite following here: It seems like you might be advocating >>retransmits every half second. I'm current doing classical exponential >>backoff, 1 second delay, then two, then four etc. Will that bite me? > > > Where are you you doing exponential back-off? re-transmits in the TFTP server: sent a block and await the corresponding ACK; if it doesn't arrive for timeout, re-send. This is needed to recover from lost data packets, client retries only recover from lost ACKs (at least they do in implementations which have been immunised against sorcerers-apprentice syndrome.) > The TFTP client will/should/might do a retry every second. (Background: > TFTP uses "ACK" of the previous packet to mean "send the next one". The > only way to detect this is a retry is timing.) The client might do a > re-ARP first. In corner cases it might not reply to ARP itself. > > [[ Step up on the soapbox. ]] > > What idiot thought that exponential backoff was a good idea? 
> Exponential backoff doesn't make sense where your base time period is a > whole second and you can't tell if the reason for no response is > failure, busy network or no one listening. > > My guess is that they were just copying Ethernet, where modified, > randomized exponential backoff is what makes it magically good. > Exponential backoff makes sense at the microsecond level, where you have > a collision domain and potentially 10,000 hosts on a shared ether. Even > there the idea of "carrier sense" or 'is the network busy' is what > enables Ethernet to work at 98+% utilization rather than the 18% or 37% > theoretical of Aloha Net. (Key difference: deaf transmitter.) > > What usually happens with DHCP and PXE is that the first packet is used > getting the NIC to transmit correctly. The second packet is used to get > the switch to start passing traffic. The third packet get through but we > are already well into the exponential fallback. > > PXE would be much better and more reliable if it started out > transmitting a burst of four DHCP packets even spaced in the first > second, then falling back to once per second. If there is a concern > about DHCP being a high percentage of traffic in huge installations > running 10baseT, tell them to buy a server. Or, like, you know, a > router. Because later the ARP traffic alone will dwarf a few DHCP > broadcasts. It's probably worth differentiating DHCP and TFTP here. I guess the reason for exponential-backoff of to avoid congestion-collapse as the ratio of bits-on-the-wire to useful work decreases. By the time a host is doing TFTP the network-path should be established, so bursting packets shouldn't be needed. Maybe delaying backoff would make sense. > > >>I'm doing round-robin, but I don't see how to throttle active >>connections: do I need to do that, or just limit total bandwidth? > > > Yes, you need to throttle active TFTP connections. 
The clients > currently winning can turn around a next-packet request really quickly. > If a few get in lock step, the server will have the next chunk of the > file warm in the cache. This is the start of locking out the first > loser. > > You can't just let the ACKs queue up in the socket as a substitute for > deferring responses either. You have to pull them out ASAP and mark > that client as needing a response. This doesn't cost very much. You > need to keep the client state structure anyway. This is just one more > bit, plus updating the timeval that you should be keeping anyway. > All true. I'll experiment with some throttling approaches. Cheers, Simon. > From mathog at caltech.edu Thu Dec 14 16:01:12 2006 From: mathog at caltech.edu (David Mathog) Date: Thu, 14 Dec 2006 16:01:12 -0800 Subject: [Beowulf] RE: Wake On Lan (WOL) doesn't work with 3c905c Message-ID: "ALEF AUMAN" wrote: > Wake On Lan doesn't work for the following on board NIC card in a HP VL400 > computer : > 01:04.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev > 7 > I'm using Ubuntu(Edgy) Isn't WOL fun? I've wasted many days on WOL related problems. At least you know the board hardware really suports it, unlike the infamous Tyan S2466, since you have a positive control from XP. My last foray into this turned up a bug in the nvidia driver where it was writing it's own MAC into the NIC register backwards. The other trick is that you have to be sure that when the system shuts down it goes to the right power state. That's where Ubuntu might be biting you. See if anything in these links helps: http://www.nvnews.net/vbulletin/showthread.php?t=70384 http://ubuntuforums.org/showthread.php?p=1547189 > I don't have the following lines > Supports Wake-on: g > Wake-on: g > and this is not a good sign. > Maybe the driver is not good in Ubuntu? I think that may be normal for this card. 
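When chasing this kind of problem it can help to take the sending side out of the equation. Below is a minimal Python sketch of a magic-packet sender, assuming the standard format (six 0xFF bytes followed by the target MAC repeated 16 times) sent as a UDP broadcast. The function names are mine rather than from any tool mentioned in this thread, and the broadcast address and port 9 are assumed defaults that may need changing for a given network.

```python
import socket


def build_magic_packet(mac):
    """Return the 102-byte Wake-on-LAN payload for a MAC like '00:11:22:33:44:55'."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC address must be 6 bytes")
    # Six 0xFF sync bytes, then the MAC address repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet over UDP (address and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

Calling send_magic_packet() with the MAC of the sleeping machine's NIC from another host on the same subnet should wake it if the NIC and power state are set up correctly. If the XP box wakes from this packet and the Ubuntu box does not, the sender is not the problem.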
Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From csamuel at vpac.org Thu Dec 14 18:20:39 2006 From: csamuel at vpac.org (Chris Samuel) Date: Fri, 15 Dec 2006 13:20:39 +1100 Subject: [Beowulf] distributed file storage solution? In-Reply-To: <200612131215.46062.anandvaidya.ml@gmail.com> References: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com> <200612131215.46062.anandvaidya.ml@gmail.com> Message-ID: <200612151320.42761.csamuel@vpac.org> On Wednesday 13 December 2006 15:15, Anand Vaidya wrote: > Have you considered G-FARM? http://datafarm.apgrid.org Any news on when they're going to do a release as a POSIX filesystem yet ? Last time I'd heard that was planned for 2.0, but I've not heard anything about that release.. -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From agshew at gmail.com Thu Dec 14 21:39:35 2006 From: agshew at gmail.com (Andrew Shewmaker) Date: Thu, 14 Dec 2006 22:39:35 -0700 Subject: [Beowulf] distributed file storage solution? In-Reply-To: References: <3e655ca70612081836q3989485cr567c19909095ea51@mail.gmail.com> <200612111932.43384.kyron@neuralbs.com> <457E2182.2050504@hypermall.net> Message-ID: On 12/11/06, Brian D. Ropers-Huilman wrote: > On 12/11/06, Craig Tierney wrote: > > Lustre supports redundant meta-data servers (MDS) and failover for the > > object-storage servers (OSS). However, "high-performing" is relative. > > Great at streaming data, not at meta-data. > > Which is really the bane of all cluster file systems, isn't it? Meta > data accesses kill performance. Ceph is a new distributed file system with an interesting way of handling metadata. 
"Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs)." They've got a prototype and quite a few research papers on their site. http://ceph.sourceforge.net -- Andrew Shewmaker From gmpc at sanger.ac.uk Fri Dec 15 02:36:29 2006 From: gmpc at sanger.ac.uk (Guy Coates) Date: Fri, 15 Dec 2006 10:36:29 +0000 Subject: [Beowulf] Re: High performance storage with GbE? In-Reply-To: References: <200612141753.kBEHrwaE007602@bluewest.scyld.com> Message-ID: <45827AAD.6080205@sanger.ac.uk> Steve Cousins wrote: > Again, if anyone can provide a real-world layout of what they are using, > along with real-world speeds at the node, that would help out tremendously. I can give you the detail on our current lustre setup (HP SFS v2.1.1), We have 10 lustre servers (2 MDS servers and 8 OSTs). Each server is a Dual 3.20GHz Xeon server with 4 GB RAM, and has a single SFS20 scsi storage array attached to it. The array has 12 SAS disks in a raid6 configuration. (The arrays are actually dual-homed, so if a server fails the storage and OST/MDS service can failed over to another server). We have 560 clients. The interconnect is GigE at the edge (single GigE to the client) and 2x10GigE at the core. Large file read/write from a single client can fill the GigE pipe quite happily. Aggregate performance is also excellent. We have achieved 1.5 Gbytes/s (12 Gbits/s) in production with real code. The limiting factor appears to be the scsi controllers, which max out at ~170 Mbytes/second. As has previously been mentioned, small file / metadata performance is not great. A single client can do ~500 file creates per second, ~1000 deletes per second and ~1000 stats per second. The performance does at least scale when you run on more clients. 
The MDS itself can handle ~60,000 stats per second if you have multiple clients running in parallel. More gory detail here; http://www.sanger.ac.uk/Users/gmpc/presentations/SC06-lustre-BOF.pdf Cheers, Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 From ebiederm at xmission.com Mon Dec 18 07:00:25 2006 From: ebiederm at xmission.com (Eric W. Biederman) Date: Mon, 18 Dec 2006 08:00:25 -0700 Subject: [Beowulf] Node Drop-Off In-Reply-To: <002901c717c5$578bbbc0$9600000a@gourmandises> (Vincent Diepeveen's message of "Mon, 4 Dec 2006 17:57:50 +0100") References: <45578E91.1080407@tcg-hsv.com> <16aa0e180611121340u1ad13c2ch892808b653e7abb@mail.gmail.com> <45743B96.10003@tcg-hsv.com> <002901c717c5$578bbbc0$9600000a@gourmandises> Message-ID: "Vincent Diepeveen" writes: > I wouldn't rule out that linux kernel simply has bugs there. The testing of > those kernels is total amateuristic. No the testing it is totally open. Just because you can't see how a process works doesn't make it better. Eric From gmk at runlevelzero.net Fri Dec 15 11:49:20 2006 From: gmk at runlevelzero.net (Greg Kurtzer) Date: Fri, 15 Dec 2006 11:49:20 -0800 Subject: [Beowulf] SATA II - PXE+NFS - diskless compute nodes In-Reply-To: References: Message-ID: <7B87C80B-50DD-490A-AD19-B3DB9DC3721C@runlevelzero.net> On Dec 14, 2006, at 2:33 PM, Donald Becker wrote: > On Sat, 9 Dec 2006, Joe Landman wrote: >> Guy Coates wrote: >>> At what node count does the nfs-root model start to break down? >>> Does anyone >>> have any rough numbers with the number of clients you can support >>> with a generic >>> linux NFS server vs a dedicated NAS filer? >> >> If you use warewulf or the new perceus variant, it creates a ram disk >> which is populated upon boot. Thats one of the larger >> transients. Then >> you nfs mount applications, and home directories. 
I haven't >> looked at >> Scyld for a while, but I seem to remember them doing something >> like this. > > I forgot to finish my reply to this message earlier this week. > Since I'm > in the writing mood today, I've finished it. > > > Just when were getting past "diskless" being being misinterpreted as > "NFS root"... I prefer the term "stateless" to describe the Warewulf and Perceus provisioning model (stateless installs may have local disks for swap and scratch/data space). > RAMdisk Inventory > > We actually have five (!) different types of ramdisks over the system > (see the descriptions below). But it's the opposite of the Warewulf > approach. Our architecture is a consistent system model, so we > dynamically build and update the environment on nodes. Warewulf-like > ramdisk system only catch part of what we are doing: The stateless provisioning model has a very different goal than Scyld's Bproc implementation, and thus a comparison is misleading. > > The Warewulf approach > - Uses a manually selected subset distribution on the compute node > ramdisk. > While still very large, it's never quite complete. No matter how > useless > you think some utility is, there is probably some application out > there > that depends on it. > - The ramdisk image is very large and it has to be completely > downloaded > at > boot time just when the server is extremely. > - Supplements the ramdisk with NFS, combining the problems of > both.(*) > The > administrator and users to learn and think about how both fail. I suppose that under some circumstances these observations may be applicable, but with that said... I have not heard of any of the Warewulf or stateless Perceus *users* sharing these opinions. Regarding the various cluster implementations: there is no one-size-fits-all solution, and all of the toolkits and implementation methods have tradeoffs.
Rather than point out the problems in the various cluster solutions, I would just like to reiterate that people should evaluate what fits their needs best and utilize what works best for them in their environment. > > (*1) That said, combining a ramdisk root with NFS is still far more > scalable and somewhat more robust than using solely NFS. With careful > administration most of the executables will be on the ramdisk, > allowing > the server to support more nodes and reducing the likelihood of > failures. Well said, I agree. There are also some general policies that work reasonably well and don't require much system-specific tuning (if any). > The phrase "careful administration" should be read as "great for > demos, > and when the system is first configured, but degrades over time". The > type of people that leap to configure the ramdisk properly the first > time are generally not the same type that will be there for long-term > manual tuning. What an odd way of looking at it. Great for demos but not for a long-term solution because it degrades over time??? If you are referring to careless people or admins mucking up the virtual node file systems, I think the physical muckage would be the least of the concerns when these people have root. Not to mention that blaming the cluster toolkit or provisioning model for allowing the users the freedom and flexibility to do what they want is misidentifying the problem. > Either they figure out why we designed around dynamic, > consistent caching and re-write, or the system will degrade over time. Why would a system not built around "dynamic consistent caching and re-write" degrade over time? Many thanks! Greg -- Greg Kurtzer gmk at runlevelzero.net -------------- next part -------------- An HTML attachment was scrubbed...
URL: From win.treese at sicortex.com Mon Dec 18 09:30:51 2006 From: win.treese at sicortex.com (Win Treese) Date: Mon, 18 Dec 2006 12:30:51 -0500 Subject: [Beowulf] Beowulf analogy for a classroom In-Reply-To: <457A19D0.3070605@iinet.net.au> References: <457A19D0.3070605@iinet.net.au> Message-ID: On Dec 8, 2006, at 9:05 PM, Steve Heaton wrote: > G'day all > > I'll skip the background as to 'why' but I've recently been working > on a way to explain the Beowulf concept to a classroom of school > kids. No computers required. > > I think I've come up with a useful analogy/experiment that might > work. I've posted it here on the off chance that someone else might > want to give it a try if they get pressed into such a situation. > > First, some 'newsgroup preempters'. This is designed for school > kids not your typical Beowulf list reader. Sorry but no cars or > harnessing of chickens involved ;) No mentions of heat dissipation > or switching fabrics. This is a non-technical post that someone > might find handy at some point. Steve, That's a good idea. Last year, I did something similar. My daughter asked me to come to her second-grade class to talk about "the big computer" we are building, which is perhaps a younger group than you had. Some colleagues and I came up with the following: The teacher handed out copies of one of the classroom books, so all the kids had the same book. Each child was assigned a page in the book. We did a couple of different problems. First, distributed search: who can find the word "blue"? Can the kids (parallel) do it faster than the teacher (sequential)? Second, distributed counting: how many times is the word "sky" in the book? Again, race the kids against the teacher. I don't know how much they remember about it, but it was fun. Win Treese SiCortex, Inc. 
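The classroom exercise above maps directly onto a small program. The following Python sketch is only an illustration of the idea, not anything from the original post: the page texts are made up, and the "children" are worker threads that each handle one page while the "teacher" walks through the book alone. Threads in CPython will not show a real speedup for work like this; the point is the structure (split the book, work independently, combine the answers).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the classroom book: one string per page.
PAGES = [
    "the sky is blue today",
    "a red kite flew up into the sky",
    "the cat sat on the mat",
    "blue birds sang under a grey sky",
]

def count_word(page, word):
    """One 'child' counts how often the word appears on their own page."""
    return page.split().count(word)

def sequential_count(pages, word):
    """The 'teacher' reads every page alone, one after another."""
    return sum(count_word(page, word) for page in pages)

def parallel_count(pages, word, workers=4):
    """Every child counts on their own page; only the totals are combined."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda page: count_word(page, word), pages))

def parallel_search(pages, word, workers=4):
    """Distributed search: return the 1-based page numbers containing the word."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = list(pool.map(lambda page: word in page.split(), pages))
    return [number + 1 for number, found in enumerate(hits) if found]

if __name__ == "__main__":
    print("pages with 'blue':", parallel_search(PAGES, "blue"))
    print("count of 'sky':", parallel_count(PAGES, "sky"))
```

The combining step (the final sum) is the only part that needs everyone's answer, which is roughly what a reduction does on a real cluster.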
win.treese at sicortex.com From landman at scalableinformatics.com Wed Dec 20 08:51:44 2006 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 20 Dec 2006 11:51:44 -0500 Subject: [Beowulf] OT: some quick Areca rpm packages for interested users Message-ID: <45896A20.8060905@scalableinformatics.com> Hi folks: We have put together some Areca RPMs for x86_64 systems. RPMs for the Areca are available now in binary/source form at http://downloads.scalableinformatics.com/downloads/Scientific_Linux/4.4/x86_64/ We did ask Areca for permission to distribute. The RPMs contain Areca copyrighted information. You may get support from Areca on the drivers, and Scalable Informatics will support the RPMs (basically just packages) for our customers. They should be updated in a few days with some additional features. I simply wanted to point them out to anyone who may be interested. One of our JackRabbit customers wanted to run it under Scientific Linux, so I did the build (source and binary) there. They appear to work nicely under CentOS 4.x as well. I haven't tested them under RHEL4, but I am assuming that there shouldn't be an issue with the src RPM. Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 or +1 866 888 3112 cell : +1 734 612 4615 From deadline at clustermonkey.net Wed Dec 20 09:54:22 2006 From: deadline at clustermonkey.net (Douglas Eadline) Date: Wed, 20 Dec 2006 12:54:22 -0500 (EST) Subject: [Beowulf] ClusterMonkey for the Holidays Message-ID: <59820.192.168.1.1.1166637262.squirrel@mail.eadline.org> All: I finally got Jeff Layton's extensive review of SC06 on ClusterMonkey.net. If you did not attend, or you did attend but could not get around to see everything (like me), have a look. There are also interviews on ClusterCast.net.
Plus a Mellanox white paper and my final in a three part series on dynamic parallel programming (originally from Linux Magazine) Linkage: SC06 Review: http://www.clustermonkey.net//content/view/176/40/ ClusterCast: http://www.clustercast.org/ Mellanox Paper: http://www.clustermonkey.net//content/view/178/33/ Parallel Prog: http://www.clustermonkey.net//content/view/177/32/ Enjoy! -- Doug From ss524 at njit.edu Wed Dec 20 08:45:16 2006 From: ss524 at njit.edu (Mr. Sumit Saxena) Date: Wed, 20 Dec 2006 11:45:16 -0500 (EST) Subject: [Beowulf] LAM -beowulf problems Message-ID: <6068397.1166633116572.JavaMail.sumit.saxena@njit.edu> Hi I am new to linux as well as beowulf, please help me. I tried to hook up two machines and run LAM but I am not able to lamboot. I can lamboot on each machine individually but not from master to master and slave. I have provided the link of the libraries of LAM in my ld.so.conf as wellas .bash_profile, still I see the following error message. Also I am able to ssh into machines without passwords. I followed the following document to setup my machines http://tldp.org/HOWTO/html_single/Beowulf-HOWTO/ ++++++++++++++++++++++++++++++++++++++++++++++ LAM 6.5.9/MPI 2 C++ - Indiana University Executing hboot on n0 (surya01 - 1 CPU)... Executing hboot on n1 (surya02 - 1 CPU)... bash: line 1: hboot: command not found ----------------------------------------------------------------------------- LAM failed to execute a LAM binary on the remote node "surya02". Since LAM was already able to determine your remote shell as "hboot", it is probable that this is not an authentication problem. LAM tried to use the remote agent command "ssh" to invoke the following command: ssh -x surya02 -n hboot -t -c lam-conf.lam -v -s -I "-H 192.168.13.1 -P 33628 -n 1 -o 0 " This can indicate several things. 
You should check the following: - The LAM binaries are in your $PATH - You can run the LAM binaries - The $PATH variable is set properly before your .cshrc/.profile exits Try to invoke the command listed above manually at a Unix prompt. You will need to configure your local setup such that you will *not* be prompted for a password to invoke this command on the remote node. No output should be printed from the remote node before the output of the command is displayed. When you can get this command to execute successfully by hand, LAM will probably be able to function properly. ------------------------------------------------------------------------ ----- ------------------------------------------------------------------------ ----- lamboot encountered some error (see above) during the boot process, and will now attempt to kill all nodes that it was previously able to boot (if any). Please wait for LAM to finish; if you interrupt this process, you may have LAM daemons still running on remote nodes. ------------------------------------------------------------------------ ----- wipe ... LAM 6.5.9/MPI 2 C++ - Indiana University Executing tkill on n0 (surya01)... ++++++++++++++++++++++++++++++++++++++++++++++ please help kind regards Sumit From r.vadivelanrhce at gmail.com Thu Dec 21 01:10:28 2006 From: r.vadivelanrhce at gmail.com (Vadivelan Rathinasabapathy) Date: Thu, 21 Dec 2006 14:40:28 +0530 Subject: [Beowulf] Reg: setting problem size in linpack Message-ID: <9fe360270612210110r39687514te55cd3f9836efc89@mail.gmail.com> hi, I am using a 16 node cluster with RedHat Enterprise Linux AS4.0 on all the servers with Rocks4.2.1 cluster s/w. I have compiled linpack and want to benchmark the server performance. my setup is like this HP proliant DL585 with AMD opteron model 875 2.2 Ghz dual core, 20gb RAM and 600GB SCSI hdd. HP proliant DL 145G2 with AMD Opteron model 275 2.2Ghz Dual Core, 10GB RAM and 80GB Sata HDD = 16 nos. 
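For sizing questions about hardware like that listed above, a back-of-the-envelope estimate is usually the starting point. The Python sketch below computes a theoretical peak and a first-guess HPL problem size N for the 16 compute nodes. Several inputs are assumptions that the post does not state: two sockets per DL145 G2, 2 double-precision flops per cycle per core for this Opteron generation, the common rule of thumb of filling roughly 80% of total memory with the N x N matrix of 8-byte doubles, and a hypothetical block size NB of 192. Measured Linpack numbers will come in below the theoretical peak, particularly over gigabit Ethernet.

```python
import math

def peak_gflops(cores, clock_ghz, flops_per_cycle=2):
    """Theoretical peak in GFLOPS: cores x clock (GHz) x flops per cycle."""
    return cores * clock_ghz * flops_per_cycle

def hpl_problem_size(total_mem_bytes, mem_percent=80, nb=192):
    """First-guess HPL N: an N x N matrix of 8-byte doubles should fill about
    mem_percent of total memory; N is rounded down to a multiple of NB."""
    usable_bytes = total_mem_bytes * mem_percent // 100
    n = math.isqrt(usable_bytes // 8)
    return (n // nb) * nb

# Assumed layout (not confirmed by the post): 16 compute nodes,
# 2 sockets x 2 cores per node, "10GB" read as 10 GiB per node.
NODES = 16
CORES_PER_NODE = 2 * 2
MEM_PER_NODE = 10 * 2**30
CLOCK_GHZ = 2.2

core_peak = peak_gflops(1, CLOCK_GHZ)
node_peak = peak_gflops(CORES_PER_NODE, CLOCK_GHZ)
cluster_peak = NODES * node_peak
start_n = hpl_problem_size(NODES * MEM_PER_NODE)

if __name__ == "__main__":
    print(f"per-core peak : {core_peak:.1f} GFLOPS")
    print(f"per-node peak : {node_peak:.1f} GFLOPS")
    print(f"16-node peak  : {cluster_peak:.1f} GFLOPS")
    print(f"starting N    : {start_n}")
```

The value of N produced this way is only a starting point; N, NB and the process grid are normally tuned by running a few trial sizes and comparing the achieved GFLOPS.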
I am using a Summit 400-24T gigabit Ethernet switch. 1) What is the total GFLOPS of each server, and of the cluster as a whole? 2) What is the FLOPS of each processor? 3) How do I calculate the best problem size for any cluster configuration, and for this setup in particular? -- Thanks and Regards R.Vadivelan CMC Ltd, Bangalore r.vadivelanrhce at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.st.john at gmail.com Fri Dec 22 08:12:12 2006 From: peter.st.john at gmail.com (Peter St. John) Date: Fri, 22 Dec 2006 11:12:12 -0500 Subject: [Beowulf] Activity for array programming language ZPL? Message-ID: Does anybody use the array-processing language ZPL? The last reference I can find to it is about two years ago; for example, the latest nightly "Cutting Edge" build was: *"Last reflected modification: Tue Nov 16 18:55:55 PST 2004" * (from http://www.cs.washington.edu/research/zpl/download/download.html). The comic Doug Eadline mentioned was a nice touch (but Marvel isn't losing sleep over the competition). Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: From reuti at staff.uni-marburg.de Sun Dec 24 13:53:02 2006 From: reuti at staff.uni-marburg.de (Reuti) Date: Sun, 24 Dec 2006 22:53:02 +0100 Subject: [Beowulf] LAM -beowulf problems In-Reply-To: <6068397.1166633116572.JavaMail.sumit.saxena@njit.edu> References: <6068397.1166633116572.JavaMail.sumit.saxena@njit.edu> Message-ID: Hi, first of all I would suggest looking into the most recent version of LAM/MPI, which is 7.1.2, or into Open MPI. Which shell are you using? For bash, you may have to add the PATH to your LAM/MPI binaries in .bashrc, i.e. a file that is sourced during a non-interactive login. -- Reuti On 20.12.2006 at 17:45, Mr. Sumit Saxena wrote: > Hi > I am new to linux as well as beowulf, please help me. > I tried to hook up two machines and run LAM but I am not able to > lamboot.
I can lamboot on each machine individually but not from > master > to master and slave. I have provided the link of the libraries of LAM > in my ld.so.conf as wellas .bash_profile, still I see the following > error message. Also I am able to ssh into machines without > passwords. I > followed the following document to setup my machines > http://tldp.org/HOWTO/html_single/Beowulf-HOWTO/ > ++++++++++++++++++++++++++++++++++++++++++++++ > LAM 6.5.9/MPI 2 C++ - Indiana University > > Executing hboot on n0 (surya01 - 1 CPU)... > Executing hboot on n1 (surya02 - 1 CPU)... > bash: line 1: hboot: command not found > ---------------------------------------------------------------------- > ------- > LAM failed to execute a LAM binary on the remote node "surya02". > Since LAM was already able to determine your remote shell as "hboot", > it is probable that this is not an authentication problem. > > LAM tried to use the remote agent command "ssh" > to invoke the following command: > > ssh -x surya02 -n hboot -t -c lam-conf.lam -v -s -I "-H > 192.168.13.1 -P > 33628 -n 1 -o 0 " > > This can indicate several things. You should check the following: > > - The LAM binaries are in your $PATH > - You can run the LAM binaries > - The $PATH variable is set properly before your > .cshrc/.profile exits > > Try to invoke the command listed above manually at a Unix prompt. > > You will need to configure your local setup such that you will *not* > be prompted for a password to invoke this command on the remote node. > No output should be printed from the remote node before the output of > the command is displayed. > > When you can get this command to execute successfully by hand, LAM > will probably be able to function properly. 
> ---------------------------------------------------------------------- > -- > ----- > ---------------------------------------------------------------------- > -- > ----- > lamboot encountered some error (see above) during the boot process, > and will now attempt to kill all nodes that it was previously able to > boot (if any). > > Please wait for LAM to finish; if you interrupt this process, you may > have LAM daemons still running on remote nodes. > ---------------------------------------------------------------------- > -- > ----- > wipe ... > > LAM 6.5.9/MPI 2 C++ - Indiana University > > Executing tkill on n0 (surya01)... > > ++++++++++++++++++++++++++++++++++++++++++++++ > please help > kind regards > Sumit > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From csamuel at vpac.org Tue Dec 26 16:05:55 2006 From: csamuel at vpac.org (Chris Samuel) Date: Wed, 27 Dec 2006 11:05:55 +1100 Subject: [Beowulf] LAM -beowulf problems In-Reply-To: <6068397.1166633116572.JavaMail.sumit.saxena@njit.edu> References: <6068397.1166633116572.JavaMail.sumit.saxena@njit.edu> Message-ID: <200612271105.55750.csamuel@vpac.org> On Thursday 21 December 2006 03:45, Mr. Sumit Saxena wrote: > I have provided the link of the libraries of LAM in my ld.so.conf as > well as .bash_profile Try putting the PATH configuration for LAM into your .bashrc instead.. good luck! Chris -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia Nadolig llawen a blwyddyn newydd da i pawb (as they say in Wales). -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From sean at duke.edu Wed Dec 27 07:46:29 2006 From: sean at duke.edu (Sean Dilda) Date: Wed, 27 Dec 2006 10:46:29 -0500 Subject: [Beowulf] mpiJava + MPICH Message-ID: <45929555.5090108@duke.edu> I'm working on setting up mpiJava for a cluster user. I'm compiling it against Sun's Java 1.5.0 and MPICH 1.2.5, on a cluster running CentOS 4. I can get it compiled and installed without a problem, and it almost works. The test Java programs run; however, MPICH seems to initialize only with a world size of 1. This behavior is very similar to what you get if you run an MPICH program without mpirun. However, I am using mpirun. I've also noticed that when a normal MPI program runs, the process tree shows mpirun, which has your program as a child. That child of mpirun then has a child that's your program running locally, and a bunch of children that are all the 'rsh' command for launching remote copies. Whenever I run an mpiJava program, the only thing in the process tree is mpirun and a single child of mpirun. Has anyone run across this, or have any ideas of what I could do to fix this problem? Thanks, Sean From chetoo.valux at gmail.com Tue Dec 26 10:29:29 2006 From: chetoo.valux at gmail.com (Chetoo Valux) Date: Tue, 26 Dec 2006 19:29:29 +0100 Subject: [Beowulf] Selling computation time Message-ID: <1d151d3b0612261029r5de03546s7446ac81a0c8c720@mail.gmail.com> Dear all, Maybe these dates I've drunk too much champagne, but I've wondered about the following: as you know building and maintaining a cluster is not a trivial issue, not to mention housing it ... I wonder then if there would be potential buyers for cluster time. I've been browsing, not too deep, the net, and I've not found (yet) any information of someone selling cluster time. And when clusters come into place, I guess this is one of the right places to ask ... Best wishes, Chetoo.
-------------- next part -------------- An HTML attachment was scrubbed... URL: From reuti at staff.uni-marburg.de Wed Dec 27 10:35:21 2006 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed, 27 Dec 2006 19:35:21 +0100 Subject: [Beowulf] Selling computation time In-Reply-To: <1d151d3b0612261029r5de03546s7446ac81a0c8c720@mail.gmail.com> References: <1d151d3b0612261029r5de03546s7446ac81a0c8c720@mail.gmail.com> Message-ID: On 26.12.2006 at 19:29, Chetoo Valux wrote: > Dear all, > > Maybe these dates I've drunk too much champagne, but I've wondered > about the following: as you know building and maintaining a cluster > is not a trivial issue, not to mention housing it ... > > I wonder then if there would be potential buyers for cluster time. > I've been browsing, not too deep, the net, and I've not found > (yet) any information of someone selling cluster time. > I know about Sun selling it: http://www.network.com/ -- Reuti From deadline at eadline.org Wed Dec 27 12:06:14 2006 From: deadline at eadline.org (Douglas Eadline) Date: Wed, 27 Dec 2006 15:06:14 -0500 (EST) Subject: [Beowulf] Selling computation time In-Reply-To: References: <1d151d3b0612261029r5de03546s7446ac81a0c8c720@mail.gmail.com> Message-ID: <47931.192.168.1.1.1167249974.squirrel@mail.eadline.org> > On 26.12.2006 at 19:29, Chetoo Valux wrote: > >> Dear all, >> >> Maybe these dates I've drunk too much champagne, but I've wondered >> about the following: as you know building and maintaining a cluster >> is not a trivial issue, not to mention housing it ... >> >> I wonder then if there would be potential buyers for cluster time. >> I've been browsing, not too deep, the net, and I've not found >> (yet) any information of someone selling cluster time.
>> > > I know about Sun selling it: > > http://www.network.com/ As is IBM, http://www-03.ibm.com/servers/deepcomputing/cod/ -- Doug > > -- Reuti > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- Doug From buccaneer at rocketmail.com Wed Dec 27 12:06:52 2006 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Wed, 27 Dec 2006 12:06:52 -0800 (PST) Subject: [Beowulf] Selling computation time In-Reply-To: <1d151d3b0612261029r5de03546s7446ac81a0c8c720@mail.gmail.com> Message-ID: <655917.80249.qm@web30612.mail.mud.yahoo.com> --- Chetoo Valux wrote: > Dear all, > > Maybe these dates I've drunk too much champagne, but > I've wondered about the > following: as you know building and maintaining a > cluster is not a trivial > issue, not to mention housing it ... > > I wonder then if there would be potential buyers for > cluster time. I've been > browsing, not too deep, the net, and I've not found > (yet) any information > of someone selling cluster time. > > And when clusters come into place, I guess this is > one of the right places > to ask ... > Big investment, with fixed and variable costs. Lots of companies doing it. IBM, for instance, has a number of supercomputing sites.
From list-beowulf at onerussian.com Wed Dec 27 13:15:04 2006 From: list-beowulf at onerussian.com (Yaroslav Halchenko) Date: Wed, 27 Dec 2006 16:15:04 -0500 Subject: [Beowulf] OT: some quick Areca rpm packages for interested users In-Reply-To: <45896A20.8060905@scalableinformatics.com> References: <45896A20.8060905@scalableinformatics.com> Message-ID: <20061227211502.GB10571@washoe.onerussian.com> Thanks Joe I just want to comment on my experience with Areca. That is pity that Areca's drivers got kicked even from -mm devel branch of linux mainstream kernel (unfortunately I don't remember in which exact version it has happened). Recent driver provided by Areca seems to work fine for me (after I upgraded power supply of the box, which was mentioned by Areca as the main possible cause of the instability in operation we had). So problem seems to be resolved, but I disliked that the areca developers kept the same version of the driver (thus file name) while they introduced some changes, which they named (in my inquiry to them) as the simple refactoring of the code without any change in functionality. I saw other people in the mailing lists complaining about the same or similar issue (http://uwsg.iu.edu/hypermail/linux/kernel/0611.3/0590.html). So, things working well, but code was orphaned by linux kernel developers (read more about some resolved and not resolved problems in the mailing lists http://marc.theaimsgroup.com/?l=linux-scsi&m=113597128115672&w=2 http://marc.theaimsgroup.com/?l=linux-scsi&m=115226175822438&w=2), versioning is somewhat inconsistent as I've mentioned. How do you do with your Areca? :-) On Wed, 20 Dec 2006, Joe Landman wrote: > Hi folks: > We have put together some Areca RPMs for x86_64 systems. RPMs for the Areca are available > now in binary/source form at > http://downloads.scalableinformatics.com/downloads/Scientific_Linux/4.4/x86_64/ -- .-.
=------------------------------ /v\ ----------------------------= Keep in touch // \\ (yoh@|www.)onerussian.com Yaroslav Halchenko /( )\ ICQ#: 60653192 Linux User ^^-^^ [175555] From hunting at ix.netcom.com Wed Dec 27 13:15:57 2006 From: hunting at ix.netcom.com (Michael Huntingdon) Date: Wed, 27 Dec 2006 13:15:57 -0800 Subject: [Beowulf] Selling computation time In-Reply-To: <1d151d3b0612261029r5de03546s7446ac81a0c8c720@mail.gmail.com> References: <1d151d3b0612261029r5de03546s7446ac81a0c8c720@mail.gmail.com> Message-ID: <7.0.1.0.2.20061227131321.024cd4b8@ix.netcom.com> HP's Computing on Demand: Happy Holidays Michael At 10:29 AM 12/26/2006, Chetoo Valux wrote: >Dear all, > >Maybe these dates I've drunk too much champagne, but I've wondered >about the following: as you know building and maintaining a cluster >is not a trivial issue, not to mention housing it ... > >I wonder then if there would be potential buyers for cluster time. >I've been browsing, not too deep, the net, and I've not found (yet) >any information of someone selling cluster time. > >And when clusters come into place, I guess this is one of the right >places to ask ... > >Best wishes, >Chetoo. >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From csamuel at vpac.org Wed Dec 27 15:53:25 2006 From: csamuel at vpac.org (Chris Samuel) Date: Thu, 28 Dec 2006 10:53:25 +1100 Subject: [Beowulf] Selling computation time In-Reply-To: <1d151d3b0612261029r5de03546s7446ac81a0c8c720@mail.gmail.com> References: <1d151d3b0612261029r5de03546s7446ac81a0c8c720@mail.gmail.com> Message-ID: <200612281053.28190.csamuel@vpac.org> On Wednesday 27 December 2006 05:29, Chetoo Valux wrote: > I wonder then if there would be potential buyers for cluster time. I've > been browsing,
not too deep, the net, and I've not found (yet) any > information of someone selling cluster time. We occasionally get approached by commercial customers wanting access to our HPC clusters (as well as assistance spec'ing, buying, configuring and maintaining them) and we help out where we can (NB: if they want to use commercial software they have to bring their own licenses). But our core business is supporting our academic users at the institutions that are members (i.e. owners) of VPAC. -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From csamuel at vpac.org Wed Dec 27 16:24:21 2006 From: csamuel at vpac.org (Chris Samuel) Date: Thu, 28 Dec 2006 11:24:21 +1100 Subject: [Beowulf] OT: some quick Areca rpm packages for interested users In-Reply-To: <20061227211502.GB10571@washoe.onerussian.com> References: <45896A20.8060905@scalableinformatics.com> <20061227211502.GB10571@washoe.onerussian.com> Message-ID: <200612281124.25374.csamuel@vpac.org> On Thursday 28 December 2006 08:15, Yaroslav Halchenko wrote: > That is pity that Areca's drivers got kicked even from -mm devel branch > of linux mainstream kernel (unfortunately I don't remember in > which exact version it has happened). The ARECA drivers have just moved from the mm tree into the mainline kernel in 2.6.19, I've just downloaded 2.6.19.1 and verified that their arcmsr driver (version 1.20.00.13) still exists in drivers/scsi/arcmsr . I guess you could say they were kicked upstairs, but certainly not kicked out.. :-) Caveat: there is a known ext3 filesystem corruption bug in 2.6.19 and 2.6.19.1, it's apparently hard to trigger (though someone said they got it through just compiling). 
If you are a Linux Weekly News subscriber (and if you're not you probably should be, it's a great news resource) you can read about it here (should be freely available at the end of the week when the next weekly news comes out): http://lwn.net/Articles/215113/ Bugzilla entry for the compiling case: http://bugzilla.kernel.org/show_bug.cgi?id=7707 Good luck! Chris -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From list-beowulf at onerussian.com Wed Dec 27 19:10:15 2006 From: list-beowulf at onerussian.com (Yaroslav Halchenko) Date: Wed, 27 Dec 2006 22:10:15 -0500 Subject: [Beowulf] OT: some quick Areca rpm packages for interested users In-Reply-To: <200612281124.25374.csamuel@vpac.org> References: <45896A20.8060905@scalableinformatics.com> <20061227211502.GB10571@washoe.onerussian.com> <200612281124.25374.csamuel@vpac.org> Message-ID: <20061228031013.GC10571@washoe.onerussian.com> > The ARECA drivers have just moved from the mm tree into the mainline kernel in > 2.6.19, I've just downloaded 2.6.19.1 and verified that their arcmsr driver > (version 1.20.00.13) still exists in drivers/scsi/arcmsr . > I guess you could say they were kicked upstairs, but certainly not kicked > out.. :-) What a shame... I've checked on areca in mainline kernel a while ago (it must have been pre .19 time) and I even tracked down to -mm release where it was removed from. Today just searched for -iname *areca*, forgetting that arcmsr is the one I need to look for... 
damn shame on me ;-) on the server I am still running 2.6.18.2 and also I am not using ext3 (just reiser and xfs), and I am not sure when I will have a chance to reboot (that beast is the main file server for our cluster at the moment). > If you are a Linux Weekly News subscriber (and if you're not you > probably should be, it's a great news resource) you can read about it > here (should be resource is great indeed. forcing me to pay money for the subscription is not good IMHO... I prefer to donate my spare time and participate in FOSS projects (I am a GNU/Debian DD). > freely available at the end of the week when the next weekly news comes out): I will come back to read it then probably -- .-. =------------------------------ /v\ ----------------------------= Keep in touch // \\ (yoh@|www.)onerussian.com Yaroslav Halchenko /( )\ ICQ#: 60653192 Linux User ^^-^^ [175555] From csamuel at vpac.org Wed Dec 27 19:40:58 2006 From: csamuel at vpac.org (Chris Samuel) Date: Thu, 28 Dec 2006 14:40:58 +1100 Subject: [Beowulf] OT: some quick Areca rpm packages for interested users In-Reply-To: <20061228031013.GC10571@washoe.onerussian.com> References: <45896A20.8060905@scalableinformatics.com> <200612281124.25374.csamuel@vpac.org> <20061228031013.GC10571@washoe.onerussian.com> Message-ID: <200612281440.58783.csamuel@vpac.org> On Thursday 28 December 2006 14:10, Yaroslav Halchenko wrote: > on the server I am still running 2.6.18.2 and also I am not using ext3 (just > reiser and xfs), and I am not sure when I will have a chance to reboot (that > beast is the main file server for our cluster at the moment). Understood, we tend to use XFS whenever we can, both on NFS servers, general service machines and clusters (though our PPC64 SLES9 cluster is running JFS because AutoYAST wouldn't build nodes with XFS). > resource is great indeed. forcing me to pay money for the > subscription is not good IMHO...
They've got to get the money to host the website and put bread on the table for the people who do the work somehow. They compromise by making those weekly issues freely available after a short space of time and almost all of their daily news articles are free from the moment they're published. > I prefer to donate my spare time and participate in FOSS projects (I am a > GNU/Debian DD). Understood, and my thanks for that as a Debian (and Kubuntu) user. cheers, Chris -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From landman at scalableinformatics.com Wed Dec 27 20:09:12 2006 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 27 Dec 2006 23:09:12 -0500 Subject: [Beowulf] OT: some quick Areca rpm packages for interested users In-Reply-To: <20061227211502.GB10571@washoe.onerussian.com> References: <45896A20.8060905@scalableinformatics.com> <20061227211502.GB10571@washoe.onerussian.com> Message-ID: <45934368.6080000@scalableinformatics.com> Greetings Yaroslav: Yaroslav Halchenko wrote: > Thanks Joe > > I just want to comment on my experience with Areca. > > That is pity that Areca's drivers got kicked even from -mm devel branch > of linux mainstream kernel (unfortunately I don't remember in > which exact version it has happened). Recent driver provided by Areca I thought they were folded in at the 2.6.19 level. > seems to work fine for me (after I upgraded power supply of the box, > which was mentioned by Areca as the main possible cause of the > instability in operation we had). 
So problem seems to be resolved, but I > disliked that the areca developers kept the same version of the driver > (thus file name) while he introduced some changes, which they named (in > my inquiry to them) as the simple refactoring of the code without any > change in functionality. I saw other people in the mailing lists > complaining about the same or similar issue > (http://uwsg.iu.edu/hypermail/linux/kernel/0611.3/0590.html). Hmmmm. I simply packaged the drivers, specifically so that some dominant north american distributions and their derivatives, could use them easily. It seems that someone caught an issue where I did not generate a new initrd with this driver within it. Will look into doing it at some point. I am not sure that RHEL5 and its variants will have this support within it. Note also that in the same directories are xfs builds, for these same users. This "fixes" a glaring omission on their part (this has been beaten to death in their fora, not worth wasting electrons attempting to convince them that their mistake is a mistake). > So, things working well, but code was orphaned by linux kernel > developers (read more about some resolved and not resolved > problems in the mailing lists > http://marc.theaimsgroup.com/?l=linux-scsi&m=113597128115672&w=2 > http://marc.theaimsgroup.com/?l=linux-scsi&m=115226175822438&w=2), > versioning is somewhat is inconsistent as I've mentioned > > How do you do with your Areca? :-) The units we have used have behaved well. The major issue we have run into appears to be a mismatch between the driver and CLI tools (we package all of them of the correct versions). We are supporting the RPMs for our customers systems. So far no issues, and the Areca folks have been quite supportive and helpful. Joe > > On Wed, 20 Dec 2006, Joe Landman wrote: > >> Hi folks: > >> We have put together some Areca RPMs for x86_64 systems. 
RPMs for the Areca are available >> now in binary/source form at >> http://downloads.scalableinformatics.com/downloads/Scientific_Linux/4.4/x86_64/ From landman at scalableinformatics.com Wed Dec 27 20:12:17 2006 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 27 Dec 2006 23:12:17 -0500 Subject: [Beowulf] OT: some quick Areca rpm packages for interested users In-Reply-To: <200612281124.25374.csamuel@vpac.org> References: <45896A20.8060905@scalableinformatics.com> <20061227211502.GB10571@washoe.onerussian.com> <200612281124.25374.csamuel@vpac.org> Message-ID: <45934421.6080307@scalableinformatics.com> Chris Samuel wrote: [...] > Caveat: there is a known ext3 filesystem corruption bug in 2.6.19 and > 2.6.19.1, it's apparently hard to trigger (though someone said they got it > through just compiling). Hmmm... hadn't read about that yet. We generally don't recommend ext3 for more than /boot and maybe smaller / and other partitions. Not trying to light off a file system war here. > If you are a Linux Weekly News subscriber (and if you're not you probably > should be, it's a great news resource) you can read about it here (should be Allow me to second the LWN.net recommendation. 
> freely available at the end of the week when the next weekly news comes out): > > http://lwn.net/Articles/215113/ > > Bugzilla entry for the compiling case: > > http://bugzilla.kernel.org/show_bug.cgi?id=7707 Joe From landman at scalableinformatics.com Wed Dec 27 20:24:51 2006 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 27 Dec 2006 23:24:51 -0500 Subject: [Beowulf] OT: some quick Areca rpm packages for interested users In-Reply-To: <20061228031013.GC10571@washoe.onerussian.com> References: <45896A20.8060905@scalableinformatics.com> <20061227211502.GB10571@washoe.onerussian.com> <200612281124.25374.csamuel@vpac.org> <20061228031013.GC10571@washoe.onerussian.com> Message-ID: <45934713.7040509@scalableinformatics.com> Hi Yaroslav: Yaroslav Halchenko wrote: >> If you are a Linux Weekly News subscriber (and if you're not you >> probably should be, it's a great news resource) you can read about it >> here (should be > resource is great indeed. forcing them me to pay money for the > subscription is not good IMHO... I prefer to donate my spare time and > participate in FOSS projects (I am a GNU/Debian DD). The LWN folks (Jon Corbet and others) do a great job. Unfortunately they can't work for free, they have bills to pay. Donating spare time, while honorable, does little to help them pay their salaries, their server hosting/net connection bills, which allow them to continue to do what they do. Considering the value I believe they bring, I have for the past several years, been a paying subscriber. Unlike many of the other advertiser laden puff-zines out there, LWN is IMO a valuable resource. >> freely available at the end of the week when the next weekly news comes out): > I will come back to read it then probably If you have the financial wherewithal to support them, I urge you to do so. If not, try contacting Jon (his email is on the site), and see if you can contribute articles. 
Joe > From csamuel at vpac.org Wed Dec 27 21:27:02 2006 From: csamuel at vpac.org (Chris Samuel) Date: Thu, 28 Dec 2006 16:27:02 +1100 Subject: [Beowulf] OT: some quick Areca rpm packages for interested users In-Reply-To: <45934713.7040509@scalableinformatics.com> References: <45896A20.8060905@scalableinformatics.com> <20061228031013.GC10571@washoe.onerussian.com> <45934713.7040509@scalableinformatics.com> Message-ID: <200612281627.03061.csamuel@vpac.org> On Thursday 28 December 2006 15:24, Joe Landman wrote: > If you have the financial wherewithal to support them, I urge you to do > so. They do have a "starving hacker" rate that's about US$40 a year (it's an honour system that you choose the appropriate amount, there's a "project leader" option for those who can afford a little more too). cheers, Chris -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From amacater at galactic.demon.co.uk Wed Dec 27 23:40:47 2006 From: amacater at galactic.demon.co.uk (Andrew M.A. Cater) Date: Thu, 28 Dec 2006 07:40:47 +0000 Subject: [Beowulf] OT: some quick Areca rpm packages for interested users In-Reply-To: <20061228031013.GC10571@washoe.onerussian.com> References: <45896A20.8060905@scalableinformatics.com> <20061227211502.GB10571@washoe.onerussian.com> <200612281124.25374.csamuel@vpac.org> <20061228031013.GC10571@washoe.onerussian.com> Message-ID: <20061228074047.GA7667@galactic.demon.co.uk> On Wed, Dec 27, 2006 at 10:10:15PM -0500, Yaroslav Halchenko wrote: > > > If you are a Linux Weekly News subscriber (and if you're not you > > probably should be, it's a great news resource) you can read about it > > here (should be > resource is great indeed. 
forcing them me to pay money for the > subscription is not good IMHO... I prefer to donate my spare time and > participate in FOSS projects (I am a GNU/Debian DD). > LWN is freely available to Debian developers because some kind commercial company bought a group subscription for us :) Andy Cater (also a Debian GNU/Linux developer) - two vanishingly small subsets of people suddenly intersect :) From simon at thekelleys.org.uk Thu Dec 28 02:22:01 2006 From: simon at thekelleys.org.uk (Simon Kelley) Date: Thu, 28 Dec 2006 10:22:01 +0000 Subject: [Beowulf] OT: some quick Areca rpm packages for interested users In-Reply-To: <20061228031013.GC10571@washoe.onerussian.com> References: <45896A20.8060905@scalableinformatics.com> <20061227211502.GB10571@washoe.onerussian.com> <200612281124.25374.csamuel@vpac.org> <20061228031013.GC10571@washoe.onerussian.com> Message-ID: <45939AC9.8000100@thekelleys.org.uk> Yaroslav Halchenko wrote: > resource is great indeed. forcing them me to pay money for the > subscription is not good IMHO... I prefer to donate my spare time and > participate in FOSS projects (I am a GNU/Debian DD). LWN provides free subscriptions for Debian developers. (I think, but I'm not sure, that HP provides the money in that case.) Cheers, Simon. From chetoo.valux at gmail.com Wed Dec 27 09:46:25 2006 From: chetoo.valux at gmail.com (Chetoo Valux) Date: Wed, 27 Dec 2006 18:46:25 +0100 Subject: [Beowulf] Which distro for the cluster? Message-ID: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> Dear all, As a Linux user I've worked with several distros as RedHat, SuSE, Debian and derivatives, and recently Gentoo. Now I face the challenge of building a HPC for scientific calculations, and I wonder which distro would suit me best. As a Gentoo user, I've recognised the power of customisation, optimisation and lightweight system, for instance my 4 years old laptop flies like a youngster, and some desktops too. 
So I thought about building the HPC nodes (8+1 master) with Gentoo .... But then it comes the administration and maintenance burden, which for me it should be the less, since my main task here is research ... so browsing the net I found Rocks Linux with plenty of clustering docs and administration tools & guidelines. I feel this should be the choice in my case, even if I sacrifice some computation efficiency. Any advice on this will be appreciated. Chetoo. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeff.johnson at wsm.com Wed Dec 27 12:36:52 2006 From: jeff.johnson at wsm.com (Jeff Johnson) Date: Wed, 27 Dec 2006 12:36:52 -0800 Subject: [Beowulf] Re: Selling computation time In-Reply-To: <200612272000.kBRK0Dd7019145@bluewest.scyld.com> References: <200612272000.kBRK0Dd7019145@bluewest.scyld.com> Message-ID: <4592D964.1050608@wsm.com> beowulf-request at beowulf.org wrote: > Dear all, > ..snip.. > > I wonder then if there would be potential buyers for cluster time. > > I've been browsing, not too deep, the net, and I've not found > > (yet) any information of someone selling cluster time. > > > I know about SUN selling it: > Interesting that Sun is doing that. Having been around different industry segments employing HPC it has been my experience that most would not trust their IP (intellectual property) outside of their exclusive control. Pharma, entertainment and commercial startups are usually too protective of their IP. It is a reasonable assumption that Sun did their homework. I wonder who they are targeting. --Jeff From ruhollah.mb at gmail.com Wed Dec 27 23:22:45 2006 From: ruhollah.mb at gmail.com (Ruhollah Moussavi Baygi ) Date: Thu, 28 Dec 2006 10:52:45 +0330 Subject: [Beowulf] SW Giaga, what kind? Message-ID: <1bef2ce30612272322p4a0d1807m3c6d9ea615f58873@mail.gmail.com> Hi everybody, Please let me know your idea about SW level1 (Giga). Is it a proper choice for a small Beowulf cluster? 
Any suggestion would be highly appreciated.

Thanks,
--
Best,
Ruhollah Moussavi Baygi
Computational Physical Sciences Research Laboratory, Department of NanoScience, IPM

From rgb at phy.duke.edu Thu Dec 28 08:37:57 2006
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 28 Dec 2006 11:37:57 -0500 (EST)
Subject: [Beowulf] Selling computation time
In-Reply-To: <1d151d3b0612261029r5de03546s7446ac81a0c8c720@mail.gmail.com>
References: <1d151d3b0612261029r5de03546s7446ac81a0c8c720@mail.gmail.com>
Message-ID:

On Tue, 26 Dec 2006, Chetoo Valux wrote:

> I wonder then if there would be potential buyers for cluster time. I've been
> browsing, not too deep, the net, and I've not found (yet) any information
> of someone selling cluster time.

This is a perennial topic of discussion on the list, and the general answer is that SO FAR there are two generically distinct marketplaces for this sort of remote clustering. One is the sort that is already being filled by e.g. Google -- remote computations that may well be distributed over a carefully engineered cluster to perform a single task that is of value in some very specific milieu. In many cases the computations at hand aren't properly "HPC" in that they may not be "numerical", but they are certainly cluster apps and may well run on very massive clusters indeed.

The other is numerical HPC applications. Here the marketplace is one where it is difficult to achieve a win. First of all, most people who are doing HPC have very specific, very diverse applications, and often these applications run on clusters that are at least to some extent custom-engineered for the application. A general purpose commercial cluster would face immediate problems providing a "grid-like" interface to the general population.
Even something as simple as compiling for the target platform would become difficult, and is one of those areas where solutions on real grid computers tend to be at least somewhat ugly. Then there is access, accounting, security, storage, and whether the application is EP (embarrassingly parallel) or has actual IPCs, so that it needs allocations of blocks of nodes with some given communications stack and physical network.

By the time one works out the economics, it tends to be a lose-lose proposition. It is just plain difficult to offer computational resources to your potential marketplace:

a) in a way that they can afford -- more specifically, can write into a grant proposal to afford.

b) in a way that is cheaper than they could obtain it by e.g. spending the same budget on hardware they themselves own and operate. Clusters are really amazingly cheap, after all -- as little as a few $100 per node, almost certainly less than $1000 per CPU core even on bleeding edge hardware. Yes, there are administrative costs and so on, but for many projects those costs can be filled out of opportunity cost labor you're paying for anyway.

c) and, if you manage a rate that satisfies b), that still makes YOU money. Dedicated cluster admins are expensive -- suppose you have just one of these (yourself) and are willing to do the entire entrepreneurial thing for a mere $60K/year salary and benefits (which I'd argue is starvation wages for this kind of work). A 100-node (CPU) pro-grade cluster will likely cost you at LEAST $50,000 up front, plus the rent of a physical space with adequate AC and power resources plus roughly $100/node/year -- call it between $10K and $20K/year just to keep the nodes powered up, plus another $5000 or so in spare parts and maintenance expenses.
The amortization of the $50K up front investment is over at most three years (at which point your nodes will be too slow to be worth renting anyway, and you'll likely have to drop rental rates yearly to keep them in a worthwhile zone as it is), so call it $20K of depreciation and interest on the borrowed money per year plus $20K in operating expenses per year plus $60K for your salary -- you have to make about $100K/year, absolute minimum, just to barely, arguably not quite, break even. That's $1000/node/year, and you have to KEEP them rented out in such a way as to make this ALL the time for ALL three years to be able to rollover replace the cluster nodes over that interval and stay in business. In reality, in the closely comparable business of renting space and sysadmin time for network servers, rates are 2-5 times this, so even allowing for better scaling of service delivery for compute nodes compared to webservers, this estimate is still very likely quite optimistic.

Well hell, for $1000 I can buy my OWN compute node -- one with multiple cores at that -- house it in my OWN space if I or my university have anything that will do for this purpose (as is usually the case), feed and cool it, and with FC+PXE+Kickstart and/or warewulf, installing and maintaining it is for me at least a matter of a few hours initial investment for the entire cluster plus a couple of boots per node, as we have an FC repository and a PXE/DHCP server already configured. We can even handle moderate node heterogeneity with some of the tools developed at Duke that can rewrite kickstart scripts on the fly according to xmlish rules. And I can buy just as many newer, faster nodes next year, and the year after that, instead of renting your aging nodes.

The point being that with VERY FEW EXCEPTIONS the economics just don't work out. Yes, some things can be scaled up or down to improve the basic picture I present, but only at a tremendous risk for a general purpose business.
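
The break-even arithmetic above can be checked with a short sketch. The function name and the `utilization` parameter are illustrative assumptions, not part of the post; the dollar figures are the rough ones quoted above ($20K depreciation and interest, $20K operating, $60K salary, 100 nodes).

```python
def break_even(nodes, depreciation_and_interest, operating, salary,
               utilization=1.0):
    """Yearly revenue needed to cover costs.

    Returns (total per year, price per rented node-year).
    `utilization` is the assumed fraction of node-time actually rented
    out; 1.0 means every node is rented all year.
    """
    total = depreciation_and_interest + operating + salary
    price_per_node_year = total / (nodes * utilization)
    return total, price_per_node_year


# Rough figures from the post: 100 nodes, fully rented all year.
total, per_node = break_even(nodes=100,
                             depreciation_and_interest=20_000,
                             operating=20_000,
                             salary=60_000)
print(f"${total:,}/year total, ${per_node:,.0f}/node/year")

# Hypothetical case: the same cluster rented out only half the time.
_, per_node_half = break_even(100, 20_000, 20_000, 60_000, utilization=0.5)
print(f"${per_node_half:,.0f}/node/year at 50% utilization")
```

With full utilization this reproduces the $100K/year and $1000/node/year figures; any idle time pushes the required price per rented node up proportionally, which is the point about having to keep the nodes rented ALL the time.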
The only exceptions I know of personally are where somebody is already operating a cluster consultation service -- something like Scalable Informatics -- that helps customers design and build clusters for specific purposes. In some cases those customers have "no" existing expertise or infrastructure for the cluster they need and can obtain a task-specific win by effectively subcontracting the cluster's purchase AND housing AND operation to SI, where SI already has space and resources and administration and installation support set up and can install and operate their client's cluster with absolutely minimal investment in node-scaled time. Note that doing THIS provides you with all sorts of things that alter the basic equation portrayed above -- you already have an income from the consultative side and don't have to "live" on what you make running clusters, you don't actually buy the cluster and rent it out, the client buys the cluster and pays you to house and run it (so you don't have to deal with depreciation or rollover renewal, they do), you have preexisting but scalable infrastructure support for cluster installation and software maintenance, etc. Even here I imagine that the margins are dicey and somewhat high risk, but maybe Joe will comment. Maybe not -- I doubt that he'd welcome more competition, since there is probably "just enough" business for those fulfilling the need already. I don't view this as a high-growth industry...;-) rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From amacater at galactic.demon.co.uk Thu Dec 28 08:40:10 2006 From: amacater at galactic.demon.co.uk (Andrew M.A. Cater) Date: Thu, 28 Dec 2006 16:40:10 +0000 Subject: [Beowulf] Which distro for the cluster? 
In-Reply-To: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> Message-ID: <20061228164010.GA10884@galactic.demon.co.uk> On Wed, Dec 27, 2006 at 06:46:25PM +0100, Chetoo Valux wrote: > Dear all, > > As a Linux user I've worked with several distros as RedHat, SuSE, Debian and > derivatives, and recently Gentoo. > > Now I face the challenge of building a HPC for scientific calculations, and > I wonder which distro would suit me best. As a Gentoo user, I've recognised > the power of customisation, optimisation and lightweight system, for > instance my 4 years old laptop flies like a youngster, and some desktops > too. So I thought about building the HPC nodes (8+1 master) with Gentoo .... > Don't use Gentoo unless you've a full, fast connection to the internet _AND_ you're prepared for your cluster to be internet connected while you build it. This IMHO. Scientific calculations: Quantian? Debian. Debian for the number of math and other packages and the ease of install. Over 8 nodes, it should be relatively easy to set up. But it depends what you want to do, what other users want to do etc. etc. > But then it comes the administration and maintenance burden, which for me it > should be the less, since my main task here is research ... so browsing the > net I found Rocks Linux with plenty of clustering docs and administration > tools & guidelines. I feel this should be the choice in my case, even if I > sacrifice some computation efficiency. Rocks / Warewulf perhaps. If you just want something you can build/update/maintain in your sleep, I'd still suggest Debian - if only because a _minimal_ install on the nodes is as small as you want it to be - and because it's fairly consistent. Your cluster - your choice but you may have to justify it to your co-workers. Andy > > Any advice on this will be appreciated. > > Chetoo. 
> _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Thu Dec 28 09:24:47 2006 From: rgb at phy.duke.edu (Robert G. Brown) Date: Thu, 28 Dec 2006 12:24:47 -0500 (EST) Subject: [Beowulf] Which distro for the cluster? In-Reply-To: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> Message-ID: On Wed, 27 Dec 2006, Chetoo Valux wrote: > Dear all, > > As a Linux user I've worked with several distros as RedHat, SuSE, Debian and > derivatives, and recently Gentoo. > > Now I face the challenge of building a HPC for scientific calculations, and > I wonder which distro would suit me best. As a Gentoo user, I've recognised > the power of customisation, optimisation and lightweight system, for > instance my 4 years old laptop flies like a youngster, and some desktops > too. So I thought about building the HPC nodes (8+1 master) with Gentoo .... > > But then it comes the administration and maintenance burden, which for me it > should be the less, since my main task here is research ... so browsing the > net I found Rocks Linux with plenty of clustering docs and administration > tools & guidelines. I feel this should be the choice in my case, even if I > sacrifice some computation efficiency. > > Any advice on this will be appreciated. Sigh. I already wrote up a nice offline reply to this once today; let me do a shorter version with commentary for you as well. I'd be interested in comments to the contrary, but I suspect that Gentoo is pretty close to the worst possible choice for a cluster base. Maybe slackware is worse, I don't know. I personally would suggest that you go with one of the mainstream, reasonably well supported, package based distributions. Centos, FC, RH, SuSE, Debian/Ubuntu. 
I myself favor RH derived, rpm-based, yum-supported distros that can be installed by PXE/DHCP, kickstart, yum from a repository server. Installation of such a cluster on diskful systems proceeds as follows: a) Set up a mirror (probably a tertiary mirror) of e.g. FC6 or Centos 4. Choose the former if you want to run really current hardware and like really up to date libraries, choose a more conservative distro like Centos or RHEL if you want longer term support and less volatility. If Centos supports your hardware platforms it is a fine choice, but it often won't run on very new chipsets and CPUs. b) Set up a DHCP/PXE server -- dhcpd, tftpd, etc -- to enable diskless boot of the standard installation images for the distro of your choice. c) Develop a node kickstart file, and hotwire it into your installation image. This can be done per node architecture, or can be done once and for all and then customized per node architecture with smart runtime scripts (that's how Duke tends to do it, but then folks here invented the scripts). Set up dhcp/pxe so that a network boot option triggers the appropriate kickstart install. The %post script should do all end-stage configuration. d) Boot each node once with KV to set the bios to boot from the network, and record the MAC address of the boot interface(s). Enter these into your dhcp file and other /etc places so that your nodes each have a unique boot identity and name (like b01, b02, b03...). e) Boot each node a second time, and either select the kickstart install from the netboot options or boot the node with a toggle script or a BIOS boot order that will network boot the install the first time and thereafter boot it normally from disk unless you REQUEST a reinstall. Once the node is installed initiating a reinstall can be done via grub, for example, without any need for a KV hookup to the node. Voila! Instant cluster. a-c are done once, d and e are the only steps that require per-node action on your part. 
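
Step d above (recording each node's MAC address and entering it into the dhcp file) can be sketched as a small generator for ISC dhcpd `host` stanzas. This is a minimal sketch under stated assumptions: the function name, the 192.168.1.x addressing, the boot server address and the `pxelinux.0` boot file name are all hypothetical placeholders, and the MAC addresses are made up; only the node naming scheme (b01, b02, ...) comes from the text.

```python
def dhcp_host_entries(macs, prefix="b", subnet="192.168.1.", first_host=101,
                      boot_server="192.168.1.1", boot_file="pxelinux.0"):
    """Build dhcpd.conf host stanzas, one per recorded MAC address.

    Nodes are named b01, b02, ... in the order the MACs are listed.
    All addresses and the boot file name are placeholder assumptions.
    """
    entries = []
    for i, mac in enumerate(macs):
        name = f"{prefix}{i + 1:02d}"
        entries.append(
            f"host {name} {{\n"
            f"  hardware ethernet {mac.lower()};\n"
            f"  fixed-address {subnet}{first_host + i};\n"
            f"  next-server {boot_server};\n"
            f'  filename "{boot_file}";\n'
            f"}}\n"
        )
    return "\n".join(entries)


# Made-up MAC addresses standing in for the ones noted at first boot.
macs = ["00:16:3E:00:00:01", "00:16:3E:00:00:02"]
print(dhcp_host_entries(macs))
```

The output would be pasted (or included) into the dhcpd configuration on the install server; keeping the MAC list in a plain file makes it easy to regenerate the stanzas when a node's network card is replaced.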
If you have a friendly vendor you may even be able to skip step d if they are willing to preset the BIOS according to your specs and label the network ports with the connected MAC address. In that case you install the cluster by editing in each node's MAC address (on the server) and turning the node on. Can't get MUCH easier than that, although there are some toolsets out there that will just glomph the MAC address from the initial netboot and do the tablework for you -- which may or may not work for you depending on your network and how much control you want over the node name and persistence of the node identity.

Thereafter yum maintains the nodes automatically from the repo mirror -- you shouldn't have to "touch" the nodes as unique objects to be administered except when they break at the hardware level, ever again.

An alternative for diskless nodes, or nodes that don't actually boot from their local disks but use them instead for scratch space, is to use warewulf with the distro of your choice. You still have to do a lot of the above, but warewulf "helps" you set up PXE and tftpd and so on, and lets you use a single image per node architecture. You still use yum or apt to update those images (with a bit more chroot-y work to accomplish it), but the core warewulf package helps you maintain node identity from the single image, and diskless nodes are arguably more stable -- disk is one of the top two or three sources of hardware failure.

HTH,

rgb

>
> Chetoo.
>

--
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C.
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From landman at scalableinformatics.com Thu Dec 28 12:15:54 2006 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 28 Dec 2006 15:15:54 -0500 Subject: [Beowulf] Selling computation time In-Reply-To: References: <1d151d3b0612261029r5de03546s7446ac81a0c8c720@mail.gmail.com> Message-ID: <459425FA.6030805@scalableinformatics.com> Robert G. Brown wrote: > The other is numerical HPC applications. Here the marketplace is one > where it is difficult to achieve a win. First of all, most people who > are doing HPC have very specific, very diverse, applications and often > these applications run on clusters that are at least to some extent > custom-engineered for the application. Hmmm... I have a recursive joke I like to use in cases like this ... Gross generalizations tend to be incorrect. There are many clusters that end users want their machine to run an app or three. These apps are "fixed" or very slowly changing, requirements are well defined and static. I think it might be a better approximation to say that clusters to a large extent mirror their intended usage pattern. The people who will run commercial or even non-commercial codes and do little to no modification, could be well served by a "fixed" resource. The folks working on more "researchy" type things, where they do code development as well as use the machines for simulation/analysis, are more likely to want finer grained control. > A general purpose commercial cluster would face immediate problems > providing a "grid-like" interface to the general population. Even See above: For the group for which this is a fixed immutable resource, this is not generally correct, "grid-like" or web-access to cluster resources works, quite well in fact. Bug me off line so I can avoid spamming if you want to know more. 
> something as simple as compiling for the target platform would become > difficult, and is one of those areas where solutions on real grid > computers tend to be at least somewhat ugly. For those doing development and more research-like things not using fixed resources (programs, ...), yes, this is an issue. > Then there is access, > accounting, security, storage, whether or not the applications is EP or > has actual IPCs so that it needs allocations of blocks of nodes with > some given communications stack and physical network. Again, lots of this stuff is handled quite well for programs that change infrequently. > By the time one works out the economics, it tends to be a lose-lose > proposition. I disagree, specifically for the groups that need the fixed resources. > It is just plain difficult to offer computational > resources to your potential marketplace: > > a) in a way that they can afford -- more specifically can write into a > grant proposal to afford. Cycles for sale would not fit into a grant model, which wants to minimize the variable and fixed costs. That is, this would not work well for a researcher in most cases. > > b) in a way that is cheaper than they could obtain it by e.g. spending > the same budget on hardware they themselves own and operate. Clusters > are really amazingly cheap, after all -- as little as a few $100 per > node, almost certainly less than $1000 per CPU core even on bleeding > edge hardware. Yes, there are administrative costs and so on, but for > many projects those costs can be filled out of opportunity cost labor > you're paying for anyway. Hmmm.... I would caution anyone reading (or writing.... cough cough) to prefix this with "for a specific configuration, not necessarily detailed here". Lest end users expect Infiniband connected dual/quad core machines with 8 GB ram per core for $100/core (or even $1000/core). 
It is hard to get a real "rule of thumb" for configurations, as a) everyone has a different "base" configuration, b) prices fluctuate, sometimes wildly, c) configuration drastically impacts cost. We strongly advise people considering such machines to survey the market *before* taking action. You may be sorely disappointed what $1000/socket or $1000/core can buy you if your expectations are set way off. Self education is always best. Ask questions, see what people are paying. Don't take much more than the order of magnitude as being correct in rules of thumb. > c) and, if you manage a rate that satisfies b), that still makes YOU > money. Dedicated cluster admins are expensive -- suppose you have just > one of these (yourself) and are willing to do the entire entrepreneurial > thing for a mere $60K/year salary and benefits (which I'd argue is > starvation wages for this kind of work). A 100-node (CPU) pro-grade > cluster will likely cost you at LEAST $50,000 up front, plus the rent of > a physical space with adequate AC and power resources plus roughly > $100/node/year -- call it between $10 and $20K/year just to keep the > nodes powered up, plus another $5000 or so in spare parts and > maintenance expenses. The amortization of the $50K up front investment > is over at most three years (at which point your nodes will be too slow > to be worth renting anyway, and you'll likely have to drop rental rates > yearly to keep them in a worthwhile zone as it is, so call it $20K of > depreciation and interest on the borrowed money per year plus $20K in > operating expenses per year plus $60K for your salary -- you have to > make about $100K/year, absolute minimum, just to break barely arguably > not quite even. [...] > > The point being that with VERY FEW EXCEPTIONS the economics just doesn't > work out. Yes, some things can be scaled up or down to improve the I disagree with the "VERY FEW EXCEPTIONS" portion. Actually it works out very nicely in specific cases. 
Maybe this is "FEW". I dunno. Where the cost of the hardware is relatively high, and the frequency of use is not.

You need to do an occasional run on a 128 node machine to validate some of your tests. Maybe once a quarter. This run will take a day on that machine. You cannot possibly run this on your existing hardware, it is way too large. Yet for under $5k per quarter, you can do this run. For under $20k/year you can do all the runs that you occasionally need to do.

Put another way. Take your machine utilization per year, divide by the number of CPUs in it, and you get your average utilization per CPU per year. Now take your depreciation cost per year, divide that per CPU, and you get your average depreciation cost per CPU. Even multiply that by some appropriate factor to handle overhead (power, cooling, Mark Hahn clones, ...). If you have "high" utilizations (over 70%) over the entire year, then you are "wasting" very little of that depreciation. Call the excess = 100% - utilization, and multiply that by your depreciation cost per year per CPU. This is what you are effectively "throwing away" per CPU per year. If you have many nodes and low utilization, you aren't effectively utilizing your resource and likely could have been using a smaller resource (e.g. a less costly one) more effectively.

Note that this scales both up and down; I picked 128 nodes as an example. There are good models to suggest a lower bound size on this, and there are good models to show where it makes sense for bulk cycle delivery.

Simply put, when you work out a reasonable costing model, and work out the business plan in detail, this model can work (and does work) for specific types of usage.

> basic picture I present, but only at a tremendous risk for a general
> purpose business.
The only exceptions I know of personally are where > somebody is already operating a cluster consultation service -- > something like Scalable Informatics -- that helps customers design and > build clusters for specific purposes. In some cases those customers Heh... We do far more than that these days. I'll avoid a commercial ... Bug me off line if you want it. > have "no" existing expertise or infrastructure for the cluster they need > and can obtain a task-specific win by effectively subcontracting the > cluster's purchase AND housing AND operation to SI, where SI already > has space and resources and administration and installation support set > up and can install and operate their client's cluster with absolutely > minimal investment in node-scaled time. Actually this model, is the important one. Lower the barriers to usage, so that they can call up, get started right away. There's lots more to it, building a viable and sustainable business model for this is *not easy* by any stretch of imagination. What soured most people (and investors) on this in the late 90's and early 00's was the promise that end users would run to this model going forward due to the huge cost of running their own stuff. Turns out that the assumption of the huge cost of running their own stuff was small compared to the cost of the aging of the hardware. As Robert pointed out, as the hardware ages, end users get less value. Call this the Moore's law based value drain. In 2 years time, it is 1/2 the speed of the new stuff. Unless your depreciation model takes this into account, and front-loads these costs into your pricing, you are going to lose, as your customers expect (and in one case, demanded) that we have the top-o-the-line stuff up, always. Of course, this is just like the new-car effect. As soon as you drive your new car off the dealer lot, it loses value. As soon as your machines start to age, they lose value. 
And few customers want to pay a premium to work on such machines, or on any machines for that matter. What they want is to buy cycles in bulk, either delivered from local machines, or remote machines (most prefer local). A cluster is a bulk supply of processing cycles. You want the most cost effective cycles that maximize your productivity. In the majority of cases, these cycles are best procured locally. Get the cycles in house, and use them. In some cases, the costs to host locally would be far higher than remote hosting, and the cycles aren't needed that often. These are the cases where you need this sort of service. The problem is the pricing model. Cycles are not premium services, they are bulk supply. The more you can supply the better. The more that are cost effective to supply the better. Where I am going with this is that the vast majority of the ASP business models of old were garbage (fit with the times). Now saner business models have to a degree emerged, but there are no willing investors in such things. Hence the cycles that are delivered tend to be "extra". > > Note that doing THIS provides you with all sorts of things that alter > the basic equation portrayed above -- you already have an income from > the consultative side and don't have to "live" on what you make running > clusters, you don't actually buy the cluster and rent it out, the client > buys the cluster and pays you to house and run it (so you don't have to > deal with depreciation or rollover renewal, they do), you have > preexisting but scalable infrastructure support for cluster installation > and software maintenance, etc. Even here I imagine that the margins are > dicey and somewhat high risk, but maybe Joe will comment. Maybe not -- > I doubt that he'd welcome more competition, since there is probably > "just enough" business for those fulfilling the need already. I don't > view this as a high-growth industry...;-) Nah, nothing to see here, move along, move along .... 
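The utilization arithmetic described earlier in this message, and the Moore's law value drain, can be put into numbers. What follows is only a rough sketch in Python: the function names are made up for illustration, straight-line depreciation is an assumption, and the dollar figures simply reuse the round numbers quoted in this thread (a $50K, 100-CPU cluster written off over three years, and roughly $100K/year to break even). Treat them as placeholders, not real pricing.

```python
def wasted_depreciation_per_cpu(capital_cost, n_cpus, years, utilization, overhead=1.0):
    """Depreciation effectively 'thrown away' per CPU per year.

    utilization is the average fraction of the year each CPU is busy (0..1).
    overhead is a multiplier for power, cooling, admin, etc.; 1.0 means none.
    Straight-line depreciation is assumed here purely for simplicity.
    """
    depreciation_per_cpu = capital_cost / years / n_cpus * overhead
    excess = 1.0 - utilization
    return excess * depreciation_per_cpu


def relative_speed(age_years, halving_years=2.0):
    """Speed relative to new hardware, if relative speed halves every halving_years."""
    return 0.5 ** (age_years / halving_years)


def break_even_rate(annual_cost, n_cpus, utilization, hours_per_year=8760):
    """Price per CPU-hour needed to cover annual_cost at the given sold utilization."""
    return annual_cost / (n_cpus * hours_per_year * utilization)


if __name__ == "__main__":
    for u in (0.9, 0.7, 0.2):
        waste = wasted_depreciation_per_cpu(50_000, 100, 3, u)
        print(f"utilization {u:.0%}: ${waste:.2f} wasted per CPU per year")
    print(f"relative speed after 2 years: {relative_speed(2):.2f}")
    # The 50% sold utilization below is an arbitrary illustrative choice.
    rate = break_even_rate(100_000, 100, 0.5)
    print(f"break-even at 50% sold: ${rate:.3f} per CPU-hour")
```

Plugging in your own capital cost, overhead factor and realistic sold utilization is the whole point; the shape of the result matters more than these particular numbers.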
The important thing in any viable business is adaptation, ability to adjust the business model to reflect how customers want to work, in such a way that it is a win for everyone. We have adapted and changed over time, and continue to do so. This is in part as our customers needs have changed, and what problems they consider to be major ones have changed. HPC is a great market. No significant investment capital interest to speak of (this is a down side), but good growth year over year (10-20+%), a large size ($10B), and Microsoft just joined the market. Maybe, someday, the downside will be fixed. Competition tends to follow from investment. Joe > > rgb > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 or +1 866 888 3112 cell : +1 734 612 4615 From csamuel at vpac.org Thu Dec 28 14:17:29 2006 From: csamuel at vpac.org (Chris Samuel) Date: Fri, 29 Dec 2006 09:17:29 +1100 Subject: [Beowulf] Re: Selling computation time In-Reply-To: <4592D964.1050608@wsm.com> References: <200612272000.kBRK0Dd7019145@bluewest.scyld.com> <4592D964.1050608@wsm.com> Message-ID: <200612290917.32401.csamuel@vpac.org> On Thursday 28 December 2006 07:36, Jeff Johnson wrote: > It is a reasonable assumption that Sun did their homework. I wonder who > they are targeting. http://www.channelregister.co.uk/2005/10/25/sun_grid_slip/ Sun's grid: lights on, no customers (October 2005) 14 months of utility computing vision Many of you will remember the fanfare and bravado surrounding Sun Microsystems' Sep. 2004 announcement of a $1 per hour per processor utility computing plan. What you won't remember is Sun revealing a single customer using the service. That's because it hasn't. [...] "It has been harder than we anticipated," said Aisling MacRunnels, Sun's senior director of utility computing in an interview. "It has been really hard. 
All of this has been a massive learning experience for us a company. I am not embarrassed to say this because we have been on the leading edge." [...] No idea if things have changed in the 14 months since that was written! -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia From csamuel at vpac.org Thu Dec 28 14:24:28 2006 From: csamuel at vpac.org (Chris Samuel) Date: Fri, 29 Dec 2006 09:24:28 +1100 Subject: [Beowulf] SW Giaga, what kind? In-Reply-To: <1bef2ce30612272322p4a0d1807m3c6d9ea615f58873@mail.gmail.com> References: <1bef2ce30612272322p4a0d1807m3c6d9ea615f58873@mail.gmail.com> Message-ID: <200612290924.29021.csamuel@vpac.org> On Thursday 28 December 2006 18:22, Ruhollah Moussavi Baygi wrote: > Please let me know your idea about SW level1 (Giga). Is it a proper choice > for a small Beowulf cluster? Never heard of it. Care to enlighten us ? -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia From csamuel at vpac.org Thu Dec 28 14:39:59 2006 From: csamuel at vpac.org (Chris Samuel) Date: Fri, 29 Dec 2006 09:39:59 +1100 Subject: [Beowulf] Which distro for the cluster? In-Reply-To: References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> Message-ID: <200612290939.59593.csamuel@vpac.org> On Friday 29 December 2006 04:24, Robert G.
Brown wrote: > I'd be interested in comments to the contrary, but I suspect that Gentoo > is pretty close to the worst possible choice for a cluster base. Maybe > slackware is worse, I don't know. But think of the speed at which you could emerge applications with a large cluster, distcc and ccache! :-) Then add on the hours of fun trying to track down a problem that's unique to your cluster due to combinations of compiler quirks, library versions, kernel bugs and application oddities.. > I personally would suggest that you go with one of the mainstream, > reasonably well supported, package based distributions. Centos, FC, RH, > SuSE, Debian/Ubuntu. I'd have to agree there. > I myself favor RH derived, rpm-based, > yum-supported distros that can be installed by PXE/DHCP, kickstart, yum > from a repository server. Installation of such a cluster on diskful > systems proceeds as follows: What I'd really like is for a kickstart compatible Debian/Ubuntu (but with mixed 64/32 bit support for AMD64 systems). I know the Ubuntu folks started on this [1], but I don't think they managed to get very far. The sad fact of the matter is that often it's the ISVs and cluster management tools that determine what choice of distro you have. :-( Yes, I know all about LSB but there are a grand total of 0 applications certified for the current version (3.x) [2] and a grand total of 1 certified application (though on 3 platforms) [3] over the total life of the LSB standards. [4] To paraphrase the Blues Brothers: ISV Vendor: Oh we got both kinds of Linux here, RHEL and SLES! Bah humbug.
:-) Chris [1] - https://help.ubuntu.com/community/KickstartCompatibility [2] - http://www.freestandards.org/en/Products [3] - Lymeware's IAgent3 Interactive Agent Gateway 3.1.1 [4] - http://www.freestandards.org/cert/certified.php -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia From list-beowulf at onerussian.com Thu Dec 28 14:51:25 2006 From: list-beowulf at onerussian.com (Yaroslav Halchenko) Date: Thu, 28 Dec 2006 17:51:25 -0500 Subject: [Beowulf] OT: some quick Areca rpm packages for interested users In-Reply-To: <20061228074047.GA7667@galactic.demon.co.uk> References: <45896A20.8060905@scalableinformatics.com> <20061227211502.GB10571@washoe.onerussian.com> <200612281124.25374.csamuel@vpac.org> <20061228031013.GC10571@washoe.onerussian.com> <20061228074047.GA7667@galactic.demon.co.uk> Message-ID: <20061228225124.GH10571@washoe.onerussian.com> > LWN is freely available to Debian developers because some kind > commercial company bought a group subscription for us :) Good to know ;-) I am a fresh DD although I have been maintaining packages for a while. ok - I am following http://lwn.net/Articles/13797/ Hopefully the deal is still valid and there is a person behind that email address ;-) Thanks everyone for the information - apparently I was starving for it ;-)! -- .-. =------------------------------ /v\ ----------------------------= Keep in touch // \\ (yoh@|www.)onerussian.com Yaroslav Halchenko /( )\ ICQ#: 60653192 Linux User ^^-^^ [175555] From amacater at galactic.demon.co.uk Thu Dec 28 16:57:49 2006 From: amacater at galactic.demon.co.uk (Andrew M.A. Cater) Date: Fri, 29 Dec 2006 00:57:49 +0000 Subject: [Beowulf] Which distro for the cluster?
In-Reply-To: <200612290939.59593.csamuel@vpac.org> References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> Message-ID: <20061229005749.GA13471@galactic.demon.co.uk> On Fri, Dec 29, 2006 at 09:39:59AM +1100, Chris Samuel wrote: > On Friday 29 December 2006 04:24, Robert G. Brown wrote: > > > I'd be interested in comments to the contrary, but I suspect that Gentoo > > is pretty close to the worst possible choice for a cluster base. Maybe > > slackware is worse, I don't know. > > But think of the speed you could emerge applications with a large cluster, > distcc and ccache! :-) > > Then add on the hours of fun trying to track down a problem that's unique to > your cluster due to combinations of compiler quirks, library versions, kernel > bugs and application oddities.. > This is a valid point. If you are not a professional sysadmin / don't have one of those around, you don't want to spend time needlessly doing hacker/geek/sysadmin type wizardry - you need to get on with your research. Most of the academics and bright people on this list have become Beowulf experts and admins by default - no one else has been there to do it for them - but that's not originally their main area of expertise. Pick a distribution that you know and that provides the maximum ease of maintenance with the maximum number of useful applications already packaged / readily available / easily ported. This will depend on your problem set: are you simulating nuclear explosions/weather storm cells/crashing cars or are you sequencing genomes/calculating pi/drawing ray traced images? > > I personally would suggest that you go with one of the mainstream, > > reasonably well supported, package based distributions. Centos, FC, RH, > > SuSE, Debian/Ubuntu. > > I'd have to agree there. > Red Hat Enterprise based solutions don't cut it on the application front / packaged libraries in my (very limited) experience.
The upgrade/maintenance path is not good - it's easier to start from scratch and reinstall than to move from one major release to another. Fedora Core - you _have_ to be joking :) Lots of applications - but little more than 6 - 12 months of support. SuSE is better than RH in some respects, worse in others. OpenSuSE - you may be on your own. SLED 10.2 may have licence costs? Debian (and to a lesser extent Ubuntu) has the largest set of pre-packaged "stuff" for specialist maths that I know of and has reasonable general purpose tools. > > I myself favor RH derived, rpm-based, > > yum-supported distros that can be installed by PXE/DHCP, kickstart, yum > > from a repository server. Installation of such a cluster on diskful > > systems proceeds as follows: > If I read the original post correctly, you're talking of an initial 8 nodes or so and a head node. Prototype it - grab a couple of desktop machines from somewhere, a switch and some cat 5. Set up three machines: one head and two nodes. Work your way through a toy problem. Do this for Warewulf/Rocks/Oscar or whatever - it will give you a feel for something of the complexity you'll get and the likely issues you'll face. > What I'd really like is for a kickstart compatible Debian/Ubuntu (but with > mixed 64/32 bit support for AMD64 systems). I know the Ubuntu folks started > on this [1], but I don't think they managed to get very far. > dpkg --get-selections >> tempfile ; pxe boot for new node ; scp tempfile root@newnode ; ssh newnode; dpkg --set-selections < /root/tempfile ; apt-get update ; apt-get dselect-upgrade goes a long way :) > The sad fact of the matter is that often it's the ISV's and cluster management > tools that determine what choice of distro you have. :-( > HP and IBM are distro neutral - they'll install / support whatever you ask them to (and pay them for).
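Returning to the dpkg --get-selections recipe above: one way to check that a freshly cloned node really matches the reference node is to compare the two saved selection lists. Below is a minimal sketch in Python, assuming each list is the plain two-column "package state" output of dpkg --get-selections captured to a file or string. The helper names are hypothetical and the sample package lists are made up.

```python
def parse_selections(text):
    """Parse `dpkg --get-selections` output into a {package: state} dict."""
    selections = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:  # "<package> <state>", separated by whitespace
            selections[fields[0]] = fields[1]
    return selections


def diff_selections(reference, node):
    """Return (missing, extra, changed) package names for node vs reference."""
    missing = sorted(p for p in reference if p not in node)
    extra = sorted(p for p in node if p not in reference)
    changed = sorted(p for p in reference
                     if p in node and reference[p] != node[p])
    return missing, extra, changed


if __name__ == "__main__":
    # Made-up sample data; in practice read the two saved selection files.
    head = parse_selections("gcc\t\tinstall\nopenmpi-bin\tinstall\nvim\tinstall\n")
    new = parse_selections("gcc\t\tinstall\nvim\tdeinstall\nnano\tinstall\n")
    missing, extra, changed = diff_selections(head, new)
    print("missing on node:", missing)
    print("extra on node:", extra)
    print("state differs:", changed)
```

This only checks package selections. It says nothing about network, disk or boot configuration, which still has to be handled some other way.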
> Yes, I know all about LSB but there are a grand total of 0 applications > certified for the current version (3.x) [2] and a grand total of 1 certified > application (though on 3 platforms) [3] over the total life of the LSB > standards. [4] > > To paraphrase the Blues Brothers: > > ISV Vendor: Oh we got both kinds of Linux here, RHEL and SLES! > > Bah humbug. :-) > > Chris > From csamuel at vpac.org Thu Dec 28 19:35:20 2006 From: csamuel at vpac.org (Chris Samuel) Date: Fri, 29 Dec 2006 14:35:20 +1100 Subject: [Beowulf] Which distro for the cluster? In-Reply-To: <20061229005749.GA13471@galactic.demon.co.uk> References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk> Message-ID: <200612291435.23461.csamuel@vpac.org> On Friday 29 December 2006 11:57, Andrew M.A. Cater wrote: > Pick a distribution that you know that provides the maximum ease of > maintenance with the maximum number of useful applications already > packaged / readily available / easily ported. This will depend on your > problem set: simulating nuclear explosions/weather storm cells/crashing > cars or are you sequencing genomes/calculating pi/drawing ray traced > images? All of the above, and more.. Makes life interesting sometimes. User 1: Why do let all these single CPU jobs onto the cluster ? 5 minutes later.. User 2: Why do you let one person hog 64 CPUs for one job ? > On Fri, Dec 29, 2006 at 09:39:59AM +1100, Chris Samuel wrote: > > On Friday 29 December 2006 04:24, Robert G. Brown wrote: > > > > I personally would suggest that you go with one of the mainstream, > > > reasonably well supported, package based distributions. ?Centos, FC, > > > RH, SuSE, Debian/Ubuntu. > > > > I'd have to agree there. > > Red Hat Enterprise based solutions don't cut it on the application > front / packaged libraries in my (very limited) experience. 
Our (500+) users fall into one of 2 camps usually, these being: 1) They want to run a commercial / 3rd party code, e.g. LS-Dyna, Abaqus, NAMD, Schrodinger, etc, that we provide for them. 2) They are compiling code they've obtained from the collaborators, colleagues, supervisors, random websites and they need compilers, MPI versions and supporting libraries. There are a couple of people who use toolkits bundled with the OS (R is a good example), but just a few. We avoid RHEL because of their lack of support for useful filesystems. > The upgrade/maintenance path is not good - it's easier to start from scratch > and reinstall than to move from one major release to another. We treat all compute nodes (and, to a lesser degree, head nodes) as disposable, they should be able to be rebuilt on a whim from a kickstart/autoyast and come up looking exactly the same as the rest. Our major clusters tend to last about as long as a major distro release (4 years before they're due for replacement), our users would get a bit upset if they found out all their code suddenly stopped working because someone had upgraded the version of, say, the systems C++ libraries, etc, under them. That said.. > Fedora Core - you _have_ to be joking :) Lots of applications - but > little more than 6 - 12 months of support. ...we do run a tiny Opteron cluster (16 dual CPU nodes) with Fedora quite happily, it started off with FC2 and is now running FC5. It's likely to disappear though when the new 64-bit cluster happens next year. > SuSE is better than RH in some respects, worse in others. We find it's miles better in terms of filesystem support. However, they blotted their copy book early on by releasing an update of lilo for PPC that didn't boot on our Power5 cluster. Fortunately we tried it out on a single test compute node first and they got a fix out. But Novell's deal with MS hasn't done it any favours. > OpenSuSE - you may be on your own. SLED 10.2 may have licence costs? 
Never tried either of those, they're not supported by IBM's cluster management software (CSM). > Debian (and to a lesser extent Ubuntu) has the largest set of > pre-packaged "stuff" for specialist maths that I know of and has > reasonable general purpose tools. Agreed, but our users tend not to use those. > If I read the original post correctly, you're talking of an initial 8 > nodes or so and a head node. Prototype it - grab a couple of desktop > machines from somewhere, a switch and some cat 5. Set up three machines: > one head and two nodes. Work your way through a toy problem. Do this for > Warewulf/Rocks/Oscar or whatever - it will give you a feel for something > of the complexity you'll get and the likely issues you'll face. Amen! > > What I'd really like is for a kickstart compatible Debian/Ubuntu (but > > with mixed 64/32 bit support for AMD64 systems). I know the Ubuntu folks > > started on this [1], but I don't think they managed to get very far. > > dpkg --get-selections >> tempfile ; pxe boot for new node ; scp tempfile > root at newnode ; ssh newnode; dpkg --set-selections < /root/tempfile ; > apt-get update ; apt-get dselect-upgrade > > goes a long way :) Is that all the way to completely unattended ? :-) > > The sad fact of the matter is that often it's the ISV's and cluster > > management tools that determine what choice of distro you have. :-( > > HP and IBM are distro neutral - they'll install / support whatever you > ask them to (and pay them for). Sadly that's not the case in our experience, IBM's CSM supports either RHEL or SLES (and lags the current updates as they go through a huge testing process before releasing it). This is mainly because CSM isn't just for HPC clusters, it also gets used for business HA and OLTP clusters too.. This is why I find Warewulf an interesting concept, and the WareCat (Warewulf+xCat) especially so. 
We don't have an HP cluster, but I know a man who has and he dreads talking to their tech support and having to explain again about why he cannot go to the Start Menu and click on a particular icon followed rapidly by why he cannot install Windows and call them back. :-( This is quite sad as Bdale is such an icon and HP use Debian on some of their firmware and diagnostic CDs.. All the best, Chris -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia From rgb at phy.duke.edu Thu Dec 28 23:48:04 2006 From: rgb at phy.duke.edu (Robert G. Brown) Date: Fri, 29 Dec 2006 02:48:04 -0500 (EST) Subject: [Beowulf] Which distro for the cluster? In-Reply-To: <20061229005749.GA13471@galactic.demon.co.uk> References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk> Message-ID: On Fri, 29 Dec 2006, Andrew M.A. Cater wrote: > On Fri, Dec 29, 2006 at 09:39:59AM +1100, Chris Samuel wrote: >> On Friday 29 December 2006 04:24, Robert G. Brown wrote: >> >>> I'd be interested in comments to the contrary, but I suspect that Gentoo >>> is pretty close to the worst possible choice for a cluster base. Maybe >>> slackware is worse, I don't know. >> >> But think of the speed you could emerge applications with a large cluster, >> distcc and ccache! :-) >> >> Then add on the hours of fun trying to track down a problem that's unique to >> your cluster due to combinations of compiler quirks, library versions, kernel >> bugs and application oddities.. >> > > This is a valid point.
If you are not a professional sysadmin / don't > have one of those around, you don't want to spend time needlessly doing > hacker/geek/sysadmin type wizardry - you need to get on with your > research. Also, how large are those speed advantages? How many of them cannot already be obtained by simply using a good commercial compiler and spending some time tuning the application? Very few tools (ATLAS being a good example) really tune per microarchitecture. The process is not linear, and it is not easy. Even ATLAS tunes "automatically" more from a multidimensional gradient search based on certain assumptions -- I don't think it would be easy to prove that the optimum it reaches is a global optimum. > Red Hat Enterprise based solutions don't cut it on the application > front / packaged libraries in my (very limited) experience. The > upgrade/maintenance path is not good - it's easier to start from scratch > and reinstall than to move from one major release to another. > > Fedora Core - you _have_ to be joking :) Lots of applications - but > little more than 6 - 12 months of support. No, not joking at all. FC is perfectly fine for a cluster, especially one built with very new hardware (hardware likely to need a very recent kernel and libraries to work at all) and actually upgrades-by-one tend to work quite well at this point for systems that haven't been overgooped with user-level crack or homemade stuff overlaid outside of the RPM/repo/yum ritual. Remember, a cluster node is likely to have a really, really boring and very short package list. We're not talking about major overhauls in X or gnome or the almost five thousand packages in extras having much impact -- it is more a matter of the kernel and basic libraries, PVM and/or MPI and/or a few user's choice packages, maybe some specialty libraries. I'm guessing four or five very basic package groups and a dozen individual packages and whatever dependencies they pull in. Or less. 
The good thing about FC >>is<< the relatively rapid renewal of at least some of the libraries -- one could die of old age waiting for the latest version of the GSL, for example, to get into RHEL/Centos. So one possible strategy is to develop a very conservative cluster image and upgrade every other FC release, which is pretty much what Duke does with FC anyway. Also, plenty of folks on this list have done just fine running "frozen" linux distros "as is" for years on cluster nodes. If they aren't broke, and live behind a firewall so security fixes aren't terribly important, why fix them? I've got a server upstairs (at home) that is still running RH 9. I keep meaning to upgrade it, but I never have time to set up and safely solve the bootstrapping problem involved, and it works fine (well inside a firewall and physically secure). Similarly, I had nodes at Duke that ran RH 7.3 for something like four years, until they were finally reinstalled with FC 2 or thereabouts. Why not? 7.3 was stable and just plain "worked" on at least these nodes; the nodes ran just fine without crashing and supported near-continuous computation for that entire time. So one could also easily use FC-whatever by developing and fine tuning a reasonably bulletproof cluster node configuration for YOUR hardware within its supported year+, then just freeze it. Or freeze it until there is a strong REASON to upgrade it -- a miraculously improved libc, a new GSL that has routines and bugfixes you really need, superyum, bproc as a standard option, cernlib in extras (the latter a really good reason for at least SOME people to upgrade to FC6:-). Honestly, with a kickstart-based cluster, reinstalling a thousand nodes is a matter of preparing the (new) repo -- usually by rsync'ing one of the toplevel mirrors -- and debugging the old install on a single node until satisfied. One then has a choice between a yum upgrade or (I'd recommend instead) yum-distributing an "upgrade" package that sets up e.g. 
grub to do a new, clean, kickstart reinstall, and then triggers it. You could package the whole thing to go off automagically overnight and not even be present -- the next day you come in, your nodes are all upgraded. I used to include a "node install" in my standard dog and pony show for people who came to visit our cluster -- I'd walk up to an idle node, reboot it into the PXE kickstart image, and talk about the fact that I was reinstalling it. We had a fast enough network and tight enough node image that usually the reinstall would finish about the same time that my spiel was finished. It was then immediately available for more work. Upgrades are just that easy. That's scalability. Warewulf makes it even easier -- build your new image, change a single pointer on the master/server, reboot the cluster. I wouldn't advise either running upgrades or freezes of FC for all cluster environments, but they certainly are reasonable alternatives for at least some. FC is far from laughable as a cluster distro. > SuSE is better than RH in some respects, worse in others. OpenSuSE - you > may be on your own. SLED 10.2 may have licence costs? Yeah, I dunno about SuSE. I tend to include it in any list because it is a serious player and (as has been pointed out already in this thread e.g. deleted below) only the serious players tend to attract commercial/supported software companies. Still, as long as it and RH maintain ridiculously high prices (IMHO) for non-commercial environments I have a hard time pushing either one native anywhere but in a corporate environment or a non-commercial environment where their line of support or a piece of software that "only" runs on e.g. RHEL or SuSE is a critical issue. Banks need super conservatism and can afford to pay for it. Cluster nodes can afford to be agile and change, or not, as required by their function and environment, and cluster builders in academe tend to be poor and highly cost sensitive. Most of them don't need to pay for either one.
> Debian (and to a lesser extent Ubuntu) has the largest set of > pre-packaged "stuff" for specialist maths that I know of and has > reasonable general purpose tools. Not to argue, but Scientific Linux is (like Centos) recompiled RHEL and also has a large set of these tools including some physics/astronomy related tools that were, at least, hard to find other places. However, FC 6 is pretty insane. There are something like 6500 packages total in the repo list I have selected in yumex on my FC 6 laptop (FC itself, livna, extras, some Duke stuff, no freshrpms. This number seems to have increased by around 500 in the last four weeks IIRC -- I'm guessing people keep adding stuff to extras and maybe livna. At this point FC 6 has e.g. cernlib, ganglia, and much more -- I'm guessing that anything that is in SL is now in FC 6 extras as SL is too slow/conservative for a lot of people (as is the RHEL/Centos that is its base). Debian may well have more stuff, or better stuff for doing numerical work -- I personally haven't done a detailed package-by-package comparison and don't know. I do know that only a tiny fraction of all of the packages available in either one are likely to be relevant to most cluster builders, and that it is VERY likely that anything that is missing from either one can easily be packaged and added to your "local" repo with far less work than what is involved in learning a "new" distro if you're already used to one. The bottom line is that I think that most people will find it easiest to install the linux distro they are most used to and will find that nearly any of them are adequate to the task, EXCEPT (as noted) non-packaged or poorly packaged distros -- gentoo and slackware e.g. Scaling is everything. Scripted installs (ideally FAST scripted installs) and fully automated maintenance from a common and user-modifiable repo base are a necessity. There is no question that Debian has this. 
There is also no question that most of the RPM-based distros have it as well, and at this point with yum they are pretty much AS easy to install and update and upgrade as Debian ever has been. So it ends up being a religious issue, not a substantive one, except where economics or task specific functionality kick in (which can necessitate a very specific distro choice even if it is quite expensive). >>> I myself favor RH derived, rpm-based, >>> yum-supported distros that can be installed by PXE/DHCP, kickstart, yum >>> from a repository server. ?Installation of such a cluster on diskful >>> systems proceeds as follows: >> > > If I read the original post correctly, you're talking of an initial 8 > nodes or so and a head node. Prototype it - grab a couple of desktop > machines from somewhere, a switch and some cat 5. Set up three machines: > one head and two nodes. Work your way through a toy problem. Do this for > Warewulf/Rocks/Oscar or whatever - it will give you a feel for something > of the complexity you'll get and the likely issues you'll face. Excellent advice. Warewulf in particular will help you learn some of the solutions that make a cluster scalable even if you opt for some other paradigm in the end. A "good" solution in all cases is one where you prototype with a server and ONE node initially, and can install the other six or seven by at most network booting them and going off to play with your wii and drink a beer for a while. Possibly a very short while. If, of course, you managed to nab a wii (we hypothesized that wii stands for "where is it?" and not "wireless interactive interface" while shopping before Christmas...;-). And like beer. >> What I'd really like is for a kickstart compatible Debian/Ubuntu (but with >> mixed 64/32 bit support for AMD64 systems). I know the Ubuntu folks started >> on this [1], but I don't think they managed to get very far. Yeah, kickstart is lovely. 
It isn't quite perfect -- I personally wish it were a two-phase install, with a short "uninterruptible" installation of the basic package group and maybe X, followed by a yum-based overlay installation of everything else that is entirely interruptible and restartable. But then, I install over DSL lines from home sometimes and get irritated if the install fails for any reason before finishing, which over a full day of installation isn't that unlikely... Otherwise, though, it is quite decent. > dpkg --get-selections >> tempfile ; pxe boot for new node ; scp tempfile > root at newnode ; ssh newnode; dpkg --set-selections < /root/tempfile ; > apt-get update ; apt-get dselect-upgrade > > goes a long way :) Oooo, that sounds a lot like using yum to do an RPM-based install from a "naked" list of packages and PXE/diskless root. Something that I'd do if my life depended on it, for sure, but way short of what kickstart does and something likely to be a world of fix-me-up-after-the-fact pain. kickstart manages e.g. network configuration, firewall setup, language setup, time setup, KVM setup (or not), disk and raid setup (and properly layered mounting), grub/boot setup, root account setup, more. The actual installation of packages from a list is the easy part, at least at this point, given dpkg and/or yum. Yes, one can (re)invent many wheels to make all this happen -- package up stuff, rsync stuff, use cfengine (in FC6 extras:-), write bash or python scripts. Sheer torture. Been there, done that, long ago and never again. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From john.hearns at streamline-computing.com Fri Dec 29 01:50:45 2006 From: john.hearns at streamline-computing.com (John Hearns) Date: Fri, 29 Dec 2006 09:50:45 +0000 Subject: [Beowulf] Which distro for the cluster?
In-Reply-To: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> Message-ID: <4594E4F5.8010700@streamline-computing.com> Chetoo Valux wrote: > > > But then it comes the administration and maintenance burden, which for > me it > should be the less, since my main task here is research ... so browsing the > net I found Rocks Linux with plenty of clustering docs and administration > tools & guidelines. I feel this should be the choice in my case, even if I > sacrifice some computation efficiency. You are thinking well here. Choose a 'mainstream' distro - Rocks, Redhat based distro (Scientific Linux?) or SuSE. An HPC cluster exists to run applications - you should look at the applications and which distro they will run under, and which are supported. By this I mean - are your applications written in-house? If so, ask your users which compilers and libraries they need. If you run commercial codes, you need to find out which distros are supported. Again, usually SuSE or Redhat. So, being a little harsh at this time of year, Gentoo is unlikely to be your first choice in terms of getting support. Also "sacrificing computational efficiency" is a red herring. In HPC, there is a very unusual work pattern on machines - which I think people who think only in terms of web servers, general use machines etc. are caught out by. IF you get your HPC cluster right - and you should try to as they cost $$$$, then 99% of CPU time is spent in applications. You should then substitute "Gentoo for efficiency" by "Which compiler for efficiency". Get your applications together, and download the one-month trial versions of Pathscale, Portland and Intel. And try them out. 
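As a rough illustration of "try them out", here is a minimal timing sketch, assuming you have already built the same application once per compiler. The labels and commands in the example are placeholders, not real Pathscale/Portland/Intel invocations, and best-of-N wall-clock time is only one of several reasonable ways to compare builds.

```python
import subprocess
import sys
import time


def best_wall_time(command, repeats=3):
    """Run `command` `repeats` times; return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True makes a crashing build fail loudly instead of "winning".
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)


def rank_builds(builds, repeats=3):
    """Return (label, seconds) pairs for each build, fastest first."""
    results = [(label, best_wall_time(cmd, repeats)) for label, cmd in builds.items()]
    return sorted(results, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical stand-ins: replace each command with the binary that one
    # compiler produced, run on the same representative input.
    builds = {
        "build-A": [sys.executable, "-c", "sum(range(200000))"],
        "build-B": [sys.executable, "-c", "sum(range(400000))"],
    }
    for label, seconds in rank_builds(builds):
        print(f"{label}: {seconds:.3f} s")
```

Run it on an otherwise idle node with a realistic input deck; a toy input mostly measures startup cost rather than what the compiler did to your inner loops.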
-- John Hearns Senior HPC Engineer Streamline Computing, The Innovation Centre, Warwick Technology Park, Gallows Hill, Warwick CV34 6UW Office: 01926 623130 Mobile: 07841 231235 From gdjacobs at gmail.com Fri Dec 29 02:05:40 2006 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Fri, 29 Dec 2006 04:05:40 -0600 Subject: [Beowulf] Which distro for the cluster? In-Reply-To: <200612290939.59593.csamuel@vpac.org> References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> Message-ID: <4594E874.9060905@gmail.com> Chris Samuel wrote: > On Friday 29 December 2006 04:24, Robert G. Brown wrote: > >> I'd be interested in comments to the contrary, but I suspect that Gentoo >> is pretty close to the worst possible choice for a cluster base. Maybe >> slackware is worse, I don't know. > > But think of the speed you could emerge applications with a large cluster, > distcc and ccache! :-) > > Then add on the hours of fun trying to track down a problem that's unique to > your cluster due to combinations of compiler quirks, library versions, kernel > bugs and application odditites.. > >> I personally would suggest that you go with one of the mainstream, >> reasonably well supported, package based distributions. Centos, FC, RH, >> SuSE, Debian/Ubuntu. > > I'd have to agree there. > >> I myself favor RH derived, rpm-based, >> yum-supported distros that can be installed by PXE/DHCP, kickstart, yum >> from a repository server. Installation of such a cluster on diskful >> systems proceeds as follows: > > What I'd really like is for a kickstart compatible Debian/Ubuntu (but with > mixed 64/32 bit support for AMD64 systems). I know the Ubuntu folks started > on this [1], but I don't think they managed to get very far. Here's a bare bones kickstart method (not Kickstart[tm] per se): http://linuxmafia.com/faq/Debian/kickstart.html Regarding kickstart, among choices for pre-scripted installers it is one of many. 
I personally favor the likes of SystemImager, even though it's not quite in the same category (FAI is though, IMO). Even dd with netcat is pretty powerful for homogeneous nodes. Once you've chosen your distro based on experience/need, there are usually a few ways to put it on your spindles. > The sad fact of the matter is that often it's the ISV's and cluster management > tools that determine what choice of distro you have. :-( > > Yes, I know all about LSB but there are a grand total of 0 applications > certified for the current version (3.x) [2] and a grand total of 1 certified > application (though on 3 platforms) [3] over the total life of the LSB > standards. [4] > > To paraphrase the Blues Brothers: > > ISV Vendor: Oh we got both kinds of Linux here, RHEL and SLES! Hear, hear. Sometimes I feel like going after these clowns with a clue-by-four. > Bah humbug. :-) > > Chris -- Geoffrey D. Jacobs From john.hearns at streamline-computing.com Fri Dec 29 02:09:34 2006 From: john.hearns at streamline-computing.com (John Hearns) Date: Fri, 29 Dec 2006 10:09:34 +0000 Subject: [Beowulf] SW Giaga, what kind? In-Reply-To: <1bef2ce30612272322p4a0d1807m3c6d9ea615f58873@mail.gmail.com> References: <1bef2ce30612272322p4a0d1807m3c6d9ea615f58873@mail.gmail.com> Message-ID: <4594E95E.6060102@streamline-computing.com> Ruhollah Moussavi Baygi wrote: > Hi everybody, > > > > Please let me know your idea about SW level1 (Giga). Is it a proper choice > for a small Beowulf cluster? > Any suggestion would be highly appreciated. > Yes, a small Beowulf cluster will work very well with Gigabit Ethernet. My advice would be to not use desktop type systems, but to specify 1U chassis based server systems. Server motherboards generally have two on-board gigabit ethernet ports. You can use one port for the general cluster traffic and NFS mounts, and the other one is dedicated to MPI traffic.
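To make the two-port layout concrete, here is a small sketch that generates /etc/hosts-style entries for nodes with one address on a general network and one on an MPI network. The subnets (10.0.0.0/24 and 10.0.1.0/24), the nodeNN names and the "-mpi" suffix are all assumptions made up for illustration; use whatever addressing your site already has.

```python
import ipaddress


def host_entries(node_count, general_net="10.0.0.0/24", mpi_net="10.0.1.0/24"):
    """Return hosts-file lines giving each node one address per network.

    The general network carries cluster management and NFS traffic; the
    second network is reserved for MPI. Subnets and names are hypothetical.
    """
    general = ipaddress.ip_network(general_net)
    mpi = ipaddress.ip_network(mpi_net)
    lines = []
    for index in range(1, node_count + 1):
        name = f"node{index:02d}"
        # Indexing a network yields the address at that offset (.1, .2, ...).
        lines.append(f"{general[index]}\t{name}")
        lines.append(f"{mpi[index]}\t{name}-mpi")
    return lines


if __name__ == "__main__":
    print("\n".join(host_entries(8)))
```

The MPI launcher is then pointed at the "-mpi" names so that message traffic stays off the port carrying NFS.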
For future expansion, you have the capability to fit Infiniband or Myrinet cards for lower latency if your applications need this. But start with gigabit. -- John Hearns Senior HPC Engineer Streamline Computing, The Innovation Centre, Warwick Technology Park, Gallows Hill, Warwick CV34 6UW Office: 01926 623130 Mobile: 07841 231235 From gdjacobs at gmail.com Fri Dec 29 02:54:30 2006 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Fri, 29 Dec 2006 04:54:30 -0600 Subject: [Beowulf] Which distro for the cluster? In-Reply-To: References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk> Message-ID: <4594F3E6.5010803@gmail.com> Forward: I don't actually take any advocacy position on choice of distro. RH, Debian, BSD, I don't care. Any contrary statements are made strictly in the interest of the truth. Robert G. Brown wrote: > On Fri, 29 Dec 2006, Andrew M.A. Cater wrote: > > Also, how large are those speed advantages? How many of them cannot > already be obtained by simply using a good commercial compiler and > spending some time tuning the application? Very few tools (ATLAS > being a good example) really tune per microarchitecture. The process > is not linear, and it is not easy. Even ATLAS tunes "automatically" > more from a multidimensional gradient search based on certain > assumptions -- I don't think it would be easy to prove that the > optimum it reaches is a global optimum. It most definitely isn't. Goto trounces it easily. ATLAS is the first stab at an optimized BLAS library before the hand coders go to work. > No, not joking at all. 
FC is perfectly fine for a cluster, > especially one built with very new hardware (hardware likely to need > a very recent kernel and libraries to work at all) and actually > upgrades-by-one tend to work quite well at this point for systems > that haven't been overgooped with user-level crack or homemade stuff > overlaid outside of the RPM/repo/yum ritual. > > Remember, a cluster node is likely to have a really, really boring > and very short package list. We're not talking about major overhauls > in X or gnome or the almost five thousand packages in extras having > much impact -- it is more a matter of the kernel and basic libraries, > PVM and/or MPI and/or a few user's choice packages, maybe some > specialty libraries. I'm guessing four or five very basic package > groups and a dozen individual packages and whatever dependencies they > pull in. Or less. The good thing about FC >>is<< the relatively > rapid renewal of at least some of the libraries -- one could die of > old age waiting for the latest version of the GSL, for example, to > get into RHEL/Centos. So one possible strategy is to develop a very > conservative cluster image and upgrade every other FC release, which > is pretty much what Duke does with FC anyway. I'd rather have volatile user-level libraries and stable system level software than vice versa. Centos users need to be introduced to the lovely concept of backporting. > Also, plenty of folks on this list have done just fine running > "frozen" linux distros "as is" for years on cluster nodes. If they > aren't broke, and live behind a firewall so security fixes aren't > terribly important, why fix them? I've got a server upstairs (at > home) that is still running RH 9. I keep meaning to upgrade > it, but I never have time to set up and safely solve the > bootstrapping problem involved, and it works fine (well inside a > firewall and physically secure). 
Call me paranoid, but I don't like the idea of a Cadbury Cream Egg security model (hard outer shell, soft gooey center). I won't say more, 'cuz I feel like I've had this discussion before. Upgrade it, man. Once, when I was bored, I installed apt-rpm on a RH8 machine to see what dist-upgrade looked like in the land of the Red Hat. Interesting experience, and it worked just fine. > Similarly, I had nodes at Duke that ran RH 7.3 for something like > four years, until they were finally reinstalled with FC 2 or > thereabouts. Why not? 7.3 was stable and just plain "worked" on at > least these nodes; the nodes ran just fine without crashing and > supported near-continuous computation for that entire time. So one > could also easily use FC-whatever by developing and fine tuning a > reasonably bulletproof cluster node configuration for YOUR hardware > within its supported year+, then just freeze it. Or freeze it until > there is a strong REASON to upgrade it -- a miraculously improved > libc, a new GSL that has routines and bugfixes you really need, > superyum, bproc as a standard option, cernlib in extras (the latter a > really good reason for at least SOME people to upgrade to FC6:-). Or use a distro that backports security fixes into affected packages while maintaining ABI and API stability. Gives you a frozen target for your users and more peace of mind. > Honestly, with a kickstart-based cluster, reinstalling a thousand > nodes is a matter of preparing the (new) repo -- usually by rsync'ing > one of the toplevel mirrors -- and debugging the old install on a > single node until satisfied. One then has a choice between a yum > upgrade or (I'd recommend instead) yum-distributing an "upgrade" > package that sets up e.g. grub to do a new, clean, kickstart > reinstall, and then triggers it. You could package the whole thing > to go off automagically overnight and not even be present -- the next > day you come in, your nodes are all upgraded. 
Isn't automatic package management great. Like crack on gasoline. > I used to include a "node install" in my standard dog and pony show > for people come to visit our cluster -- I'd walk up to an idle node, > reboot it into the PXE kickstart image, and talk about the fact that > I was reinstalling it. We had a fast enough network and tight enough > node image that usually the reinstall would finish about the same > time that my spiel was finished. It was then immediately available > for more work. Upgrades are just that easy. That's scalability. > > Warewulf makes it even easier -- build your new image, change a > single pointer on the master/server, reboot the cluster. > > I wouldn't advise either running upgrades or freezes of FC for all > cluster environments, but they certainly are reasonable alternatives > for at least some. FC is far from laughable as a cluster distro. What I'd like to see is an interested party which would implement a good, long term security management program for FC(2n+b) releases. RH obviously won't do this. > Yeah, I dunno about SuSE. I tend to include it in any list because > it is a serious player and (as has been pointed out already in this > thread e.g. deleted below) only the serious players tend to attract > commercial/supported software companies. Still, as long as it and RH > maintain ridiculously high prices (IMHO) for non-commercial > environments I have a hard time pushing either one native anywhere > but in a corporate environment or a non-commercial environment where > their line of support or a piece of software that "only" runs on e.g. > RHEL or SuSE is a critical issue. Banks need super conservatism and > can afford to pay for it. Cluster nodes can afford to be agile and > change, or not, as required by their function and environment, and > cluster builders in academe tend to be poor and highly cost senstive. > Most of them don't need to pay for either one. 
> Not to argue, but Scientific Linux is (like Centos) recompiled RHEL > and also has a large set of these tools including some > physics/astronomy related tools that were, at least, hard to find > other places. However, FC 6 is pretty insane. There are something > like 6500 packages total in the repo list I have selected in yumex on > my FC 6 laptop (FC itself, livna, extras, some Duke stuff, no > freshrpms. This number seems to have increased by around 500 in the > last four weeks IIRC -- I'm guessing people keep adding stuff to > extras and maybe livna. At this point FC 6 has e.g. cernlib, > ganglia, and much more -- I'm guessing that anything that is in SL is > now in FC 6 extras as SL is too slow/conservative for a lot of > people (as is the RHEL/Centos that is its base). Do _not_ start a contest like this with the Debian people. You _will_ lose. > Debian may well have more stuff, or better stuff for doing numerical > work -- I personally haven't done a detailed package-by-package > comparison and don't know. I do know that only a tiny fraction of > all of the packages available in either one are likely to be relevant > to most cluster builders, and that it is VERY likely that anything > that is missing from either one can easily be packaged and added to > your "local" repo with far less work than what is involved in > learning a "new" distro if you're already used to one. Agreed, and security is not as much of a concern with such user-level programs, so these packages don't necessarily have to follow any security patching regime. > The bottom line is that I think that most people will find it easiest > to install the linux distro they are most used to and will find that > nearly any of them are adequate to the task, EXCEPT (as noted) > non-packaged or poorly packaged distros -- gentoo and slackware e.g. > Scaling is everything. 
Scripted installs (ideally FAST scripted > installs) and fully automated maintenance from a common and > user-modifiable repo base are a necessity. There is no question that > Debian has this. There is also no question that most of the > RPM-based distros have it as well, and at this point with yum they > are pretty much AS easy to install and update and upgrade as Debian > ever has been. So it ends up being a religious issue, not a > substantive one, except where economics or task specific > functionality kick in (which can necessitate a very specific distro > choice even if it is quite expensive). I haven't used a RH based machine which regularly synced against a fast-moving package repository, so I can't really compare. :) > Excellent advice. Warewulf in particular will help you learn some of > the solutions that make a cluster scalable even if you opt for some > other paradigm in the end. > > A "good" solution in all cases is one where you prototype with a > server and ONE node initially, and can install the other six or seven > by at most network booting them and going off to play with your wii > and drink a beer for a while. Possibly a very short while. If, of > course, you managed to nab a wii (we hypothesized that wii stands for > "where is it?" and not "wireless interactive interface" while > shopping before Christmas...;-). And like beer. Prototyping is absolutely necessary for any large-scale roll out. Better to learn how to do it right. > Yeah, kickstart is lovely. It isn't quite perfect -- I personally > wish it were a two-phase install, with a short "uninterruptible" > installation of the basic package group and maybe X, followed by a > yum-based overlay installation of everything else that is entirely > interruptible and restartable. But then, I install over DSL > lines from home sometimes and get irritated if the install fails for > any reason before finishing, which over a full day of installation > isn't that unlikely... 
> > Otherwise, though, it is quite decent. > > Oooo, that sounds a lot like using yum to do a RPM-based install from > a "naked" list of packages and PXE/diskless root. Something that > I'd do if my life depended on it, for sure, but way short of what > kickstart does and something likely to be a world of > fix-me-up-after-the-fact pain. kickstart manages e.g. network > configuration, firewall setup, language setup, time setup, KVM setup > (or not), disk and raid setup (and properly layered mounting), > grup/boot setup, root account setup, more. The actual installation of > packages from a list is the easy part, at least at this point, given > dpkg and/or yum. I personally believe more configuration is done on Debian systems in package configuration than in the installer as compared with RH, but I do agree with you mainly. It's way short of what FAI, replicator, and system imager do too. > Yes, one can (re)invent many wheels to make all this happen -- > package up stuff, rsync stuff, use cfengine (in FC6 extras:-), write > bash or python scripts. Sheer torture. Been there, done that, long > ago and never again. Hey, some people like this. Some people compete in Japanese game shows. > rgb -- Geoffrey D. Jacobs From gdjacobs at gmail.com Fri Dec 29 02:57:55 2006 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Fri, 29 Dec 2006 04:57:55 -0600 Subject: [Beowulf] SW Giaga, what kind? In-Reply-To: <4594E95E.6060102@streamline-computing.com> References: <1bef2ce30612272322p4a0d1807m3c6d9ea615f58873@mail.gmail.com> <4594E95E.6060102@streamline-computing.com> Message-ID: <4594F4B3.4070602@gmail.com> John Hearns wrote: > Yes, a small Beowulf cluster will work very well with Gigabit Ethernet. > My advice would be to not use desktop type systems, > but to specify 1U chassis based server systems. > Server motherboards generally have two on-board gigabit ethernet ports. 
> You can use one port for the general cluster traffic and NFS mounts, > and the other one is dedicated to MPI traffic. Depends how small we're talking about. If we're talking Value Cluster 2K7, the purchasing goals are quite different. -- Geoffrey D. Jacobs From john.hearns at streamline-computing.com Fri Dec 29 03:26:37 2006 From: john.hearns at streamline-computing.com (John Hearns) Date: Fri, 29 Dec 2006 11:26:37 +0000 Subject: [Beowulf] SW Giaga, what kind? In-Reply-To: <4594F4B3.4070602@gmail.com> References: <1bef2ce30612272322p4a0d1807m3c6d9ea615f58873@mail.gmail.com> <4594E95E.6060102@streamline-computing.com> <4594F4B3.4070602@gmail.com> Message-ID: <4594FB6D.6070402@streamline-computing.com> Geoff Jacobs wrote: > Depends how small we're talking about. If we're talking Value Cluster > 2K7, the purchasing goals are quite different. But of course. Indeed, you could start with desktop type systems and upgrade with (say) Intel gigabit NIC cards when you have the need for a dedicated network. Just trying to point out that (in general) you should make a good choice of hardware platform to start with. Server-grade hardware in rackmount cases is almost always the way to go. PSUs are more robust, cabling is easier and as I said you get a second onboard Gigabit NIC. Add in the capability for lights-out management for IPMI and you have a good basis for cluster expansion. -- John Hearns Senior HPC Engineer Streamline Computing, The Innovation Centre, Warwick Technology Park, Gallows Hill, Warwick CV34 6UW Office: 01926 623130 Mobile: 07841 231235 From ed at eh3.com Fri Dec 29 06:24:48 2006 From: ed at eh3.com (Ed Hill) Date: Fri, 29 Dec 2006 09:24:48 -0500 Subject: [Beowulf] Which distro for the cluster? In-Reply-To: References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk> Message-ID: <20061229092448.0c6f7c30@ernie> On Fri, 29 Dec 2006 02:48:04 -0500 (EST) "Robert G. 
Brown" wrote: Hi folks, I wanted to respond earlier to this thread but RGB has already done a better job of covering the points I hoped to make. :-) In my experience FC makes a very usable cluster OS. > Also, plenty of folks on this list have done just fine running > "frozen" linux distros "as is" for years on cluster nodes. If they > aren't broke, and live behind a firewall so security fixes aren't > terribly important, why fix them? I've got a server upstairs (at > home) that is still running RH 9. I keep meaning to upgrade > it, but I never have time to set up and safely solve the > bootstrapping problem involved, and it works fine (well inside a > firewall and physically secure). Yes! If it has sufficient security (FW, private network, etc.) and it's running well for your applications then there's *no* reason to blush here. If it works, it works. There's no shame in making efficient use of your time. Quite the opposite... Ed -- Edward H. Hill III, PhD | ed at eh3.com | http://eh3.com/ From rgb at phy.duke.edu Fri Dec 29 09:24:14 2006 From: rgb at phy.duke.edu (Robert G. Brown) Date: Fri, 29 Dec 2006 12:24:14 -0500 (EST) Subject: [Beowulf] Which distro for the cluster? In-Reply-To: <4594F3E6.5010803@gmail.com> References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk> <4594F3E6.5010803@gmail.com> Message-ID: On Fri, 29 Dec 2006, Geoff Jacobs wrote: > I'd rather have volatile user-level libraries and stable system level > software than vice versa. Centos users need to be introduced to the > lovely concept of backporting. The problem (one of many) is with operations like banks. In order for a bank to use a distro at all, it has to be audited for security at a horrendous level.
If you change a single library, they have to audit the whole thing all over again. Costly and annoying, so RHEL "freezes" except for bugfixes because for companies like banks and other large operations, any change at all costs money. You can see the mentality running wild in lots of other places -- most "big iron" machine rooms were rife with it for a couple of decades, and even though I've been in this business one way or another for most of my professional life I >>still<< underestimate the length of time it will take for really beneficial changes to permeate the computing community. By years if not decades. I fully expected MS to be on the ropes at this point, being truly hammered by linux on all fronts, for example -- but linux keeps "missing" the mass desktop market by ever smaller increments even as it has finally produced systems that do pretty damn well on office desktops. I still view Linus's dream of world domination as a historical inevitability, mind you, I just no longer think that it will happen quite as catastrophically suddenly. Centos, of course, won't alter this pattern because diverging from RHEL also costs money and obviates the point for Centos users, who want the conservatism and update stream without the silly cost scaling or largely useless support. However, Centos "users" are largely sysadmins, not end users per se, and lots of them DO backport really important updates on an as needed basis. Fortunately, in many cases an FC6 src rpm will build just fine on a Centos 4 system, and rpmbuild --rebuild takes a few seconds to execute and drop the result into a yum-driven local "updates" repo. So I'd say most pro-grade shops already do this as needed. My problem with being conservative with a cluster distro is that it requires impeccable timing. If you happen to build your cluster right when the next release of RHEL happens to correspond with the next release of FC, it is auspicious. 
In that case both distros are up to date on the available kernel drivers and patches for your (presumably new and cutting edge) hardware, with the highest probability of a fortunate outcome. However, if you try to build a cluster with e.g. AMD64 nodes and the "wrong motherboard" on top of Centos/RHEL 4, all frozen and everything back when the motherboard and CPU itself really didn't exist, you have an excellent chance of discovering that the distro won't even complete an install, at least not with x86_64 binaries. Or it will install but its built-in graphics adapter won't work. Or its sound card (which may not matter, but the point is clear). Then you've got a DOUBLE problem -- to use Centos you have to somehow backport regularly from a dynamically maintained kernel stream, or else avoid a potentially cost-efficient node architecture altogether, or else -- abandon Centos. The stars just aren't right for the conservative streams for something like the last year of each release if you are interested in running non-conservative hardware. The problem is REALLY evident for laptops -- there are really major changes in the way the kernel, rootspace, and userspace manage devices, changes that are absolutely necessary for us to be able to plug cameras, memory sticks, MP3 players, printers, bluetooth devices, and all of that right into the laptop and have it "just work". NetworkManager simply doesn't work for most laptops and wireless devices before FC5, and it doesn't really work "right" until you get to FC6 AND update to at least 0.6.4. On RHEL/Centos 4 (FC4 frozen, basically), well... One of the major disadvantages linux has had relative to WinXX over the years has been hardware support that lags, often by years, behind the WinXX standard.
Because of the way linux is developed, the ONLY way one can fix this is to ride a horse that is rapidly co-developed as new hardware is released, and pray for ABI and API level standards in the hardware industry in front of your favorite brazen idol every night (something that is unlikely to work but might make you feel better:-). The fundamental "advantage" of FC6 is that its release timing actually matches up pretty well against the frenetic pace of new hardware development -- six to twelve month granularity means that you can "usually" buy an off-the-shelf laptop or computer and have a pretty good chance of it either being fully supported right away (if it is older than six months) or being fully supported within weeks to months -- maybe before you smash it with a sledgehammer out of sheer frustration. From what I've seen, ubuntu/debian has a somewhat similar aspect, user driven to get that new hardware running even more aggressively than with FC (and with a lot of synergy, of course, even though the two communities in some respects resemble Sunnis vs the Shiites in Iraq:-). SINCE they are user driven, they also tend to have lots of nifty userspace apps, and since we have entered the age of the massive, fully compatible, contributed package repo I expect FC7 to provide something on the order of 10K packages, maybe 70% of them square in userspace (and the rest libraries etc). This might even be the "nextgen" revolution -- Windows cannot yet provide fully transparent application installation (for money or not) over the network -- they have security issues, payment issues, installshield/automation issues, permission issues, and compatibility/library issues all to resolve before they get anywhere close to what yum and friends (or debian's older and also highly functional equivalents) can do already for linux.
What the software companies that are stuck in the "RHEL groove" don't realize is that RPMs, yum and the idea of a repo enable them to set up a completely different software distribution paradigm, one that can in fact be built for and run on all the major RPM distros with minimal investment or risk on their part. They don't "get it" yet. When they do, there could be an explosion in commercial grade, web-purchased linux software and something of a revolution in software distribution and maintenance (as this would obviously drive WinXX to clone/copy). Or not. Future cloudy, try again later. > Call me paranoid, but I don't like the idea of a Cadbury Cream Egg > security model (hard outer shell, soft gooey center). I won't say more, > 'cuz I feel like I've had this discussion before. Ooo, then you really don't like pretty much ANY of the traditional "true beowulf" designs. They are all pretty much cream eggs. Hell, lots of them use rsh without passwords, or open sockets with nothing like a serious handshaking layer to do things like distribute binary applications and data between nodes. Grid designs, of course, are another matter -- they tend to use e.g. ssh and so on but they have to because nodes are ultimately exposed to users, probably not in a chroot jail. Even so, has anyone really done a proper security audit of e.g. pvm or mpi? How difficult is it to take over a PVM virtual machine and insert your own binary? I suspect that it isn't that difficult, but I don't really know. Any comments, any experts out there? In the specific case of my house, anybody who gets to where they can actually bounce a packet off of my server is either inside its walls or has cracked e.g. WPA or my DSL firewall or one of my personal accounts elsewhere that hits the single (ssh) passthrough port. In all of these cases the battle is lost already, as I am God on my LAN of course, so a trivial password trap on my personal account would give them root everywhere in zero time.
In fact, being a truly lazy individual who doesn't mind exposing his soft belly to the world, if they get root anywhere they've GOT it everywhere -- I have root set up to permit free ssh between all client/nodes so that I have to type a root password only once and can then run commands as root on any node from an xterm as one-liners.

This security model is backed up by a threat of physical violence against my sons and their friends, who have carefully avoided learning linux at anything like the required level for cracking because they know I'd like them to, and the certain knowledge that my wife is doing very well if she can manage to crank up a web browser and read her mail without forgetting something and making me get up out of bed to help her at 5:30 am. So while I do appreciate your point on a production/professional network level, it really is irrelevant here.

> Upgrade it, man. Once, when I was bored, I installed apt-rpm on a RH8
> machine to see what dist-upgrade looked like in the land of the Red Hat.
> Interesting experience, and it worked just fine.

There are three reasons I haven't upgraded it. One is sheer bandwidth. It takes three days or so to push FCX through my DSL link, and while I'm doing it all of my sons and wife and me myself scream because there ain't no bandwidth leftover for things like WoW and reading mail and working. This can be solved with a backpack disk and my laptop -- I can take my laptop into Duke and rsync mirror a primary mirror, current snapshot, with at worst a 100 Mbps network bottleneck (I actually think that the disk bottleneck might be slower, but it is still way faster than 384 kbps or thereabouts:-).

The second is the bootstrapping problem. The system in question is my internal PXE/install server, a printer server, and an md raid fileserver.
I really don't feel comfortable trying an RH9 -> FC6 "upgrade" in a single jump, and a clean reinstall requires that I preserve all the critical server information and restore it post upgrade. At the same time it would be truly lovely to rebuild the MD partitions from scratch, as I believe that MD has moved along a bit in the meantime. This is the third problem -- I need to construct a full backup of the /home partition, at least, which is around 100 GB and almost full. Hmmm, it might be nice to upgrade the RAID disks from 80 GB to 160's or 250's and get some breathing room at the same time, which requires a small capital investment -- say $300 or thereabouts. Fortunately I do have a SECOND backpack disk with 160 GB of capacity that I use as a backup, so I can do an rsync mirror to that of /home while I do the reinstall shuffle, with a bit of effort. All of this takes time, time, time. And I cannot begin to describe my life to you, but time is what I just don't got to spare unless my life depends on it. That's the level of triage here -- staunch the spurting arteries first and apply CPR as necessary -- the mere compound fractures and contusions have to wait. You might have noticed I've been strangely quiet on-list for the last six months or so... there is a reason:-) At the moment, evidently, I do have some time and am kind of catching up. Next week I might have even more time -- perhaps even the full day and change the upgrade will take. I actually do really want to do it -- both because I do want it to be nice and current and secure and because there are LOTS OF IMPROVEMENTS at the server level in the meantime -- managing e.g. printers with RH9 tools sucks for example, USB support is trans-dubious, md is iffy, and I'd like to be able to test out all sorts of things like the current version of samba, a radius server to be able to drop using PSK in WPA, and so on. So sure, I'll take your advice "any day now", but it isn't that simple a matter. 
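
For a sense of scale, the bandwidth and backup steps described above can be sketched in a few lines of Python. This is only a back-of-envelope illustration, not the actual procedure used here: the two link speeds (384 kbps DSL, 100 Mbps campus LAN) and the three-day figure are taken from the text, the implied repository size is derived from them rather than measured, and the function names and the /mnt/backpack path are hypothetical. The rsync command is only built and printed, never run.

```python
# Back-of-envelope sketch. Link speeds and the three-day figure come from the
# message above; the function names and backup path are assumptions.
import shlex


def bytes_moved(link_bits_per_s, seconds):
    """Data volume a fully saturated link moves in the given time."""
    return link_bits_per_s / 8 * seconds


def transfer_seconds(size_bytes, link_bits_per_s):
    """Ideal transfer time, ignoring protocol overhead and disk speed."""
    return size_bytes * 8 / link_bits_per_s


DSL = 384e3                  # ~384 kbps home DSL link
LAN = 100e6                  # 100 Mbps worst-case campus bottleneck
THREE_DAYS = 3 * 24 * 3600   # seconds

# Size implied by "three days or so" on a saturated DSL link.
repo_bytes = bytes_moved(DSL, THREE_DAYS)


def rsync_mirror_cmd(src, dst, dry_run=True):
    """Build (but do not run) an rsync command that mirrors src into dst."""
    cmd = ["rsync", "-a", "--delete"]
    if dry_run:
        cmd.append("--dry-run")
    # A trailing slash on src copies its contents, not the directory itself.
    return cmd + [src.rstrip("/") + "/", dst]


if __name__ == "__main__":
    print(f"implied repo size: {repo_bytes / 1e9:.1f} GB")
    print(f"same data at 100 Mbps: {transfer_seconds(repo_bytes, LAN) / 60:.0f} min")
    print(shlex.join(rsync_mirror_cmd("/home", "/mnt/backpack/home")))
```

Under these assumptions three days of saturated DSL comes to roughly 12 GB, which the 100 Mbps link would move in about a quarter of an hour if the disk keeps up. Keeping `--dry-run` as the default is a deliberate choice, since `--delete` removes anything on the destination that is absent from the source.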
>> within its supported year+, then just freeze it. Or freeze it until >> there is a strong REASON to upgrade it -- a miraculously improved >> libc, a new GSL that has routines and bugfixes you really need, >> superyum, bproc as a standard option, cernlib in extras (the latter a >> really good reason for at least SOME people to upgrade to FC6:-). > Or use a distro that backports security fixes into affected packages > while maintaining ABI and API stability. Gives you a frozen target for > your users and more peace of mind. No arguments. But remember, you say "users" because you're looking at topdown managed clusters with many users. There are lots of people with self-managed clusters with just a very few. And honestly, straightforward numerical code is generally cosmically portable -- I almost never even have to do a recompile to get it to work perfectly across upgrades. So YMMV as far as how important that stability is to users of any given cluster. There is a whole spectrum here, no simple or universal answers. >> Honestly, with a kickstart-based cluster, reinstalling a thousand >> nodes is a matter of preparing the (new) repo -- usually by rsync'ing >> one of the toplevel mirrors -- and debugging the old install on a >> single node until satisfied. One then has a choice between a yum >> upgrade or (I'd recommend instead) yum-distributing an "upgrade" >> package that sets up e.g. grub to do a new, clean, kickstart >> reinstall, and then triggers it. You could package the whole thing >> to go off automagically overnight and not even be present -- the next >> day you come in, your nodes are all upgraded. > Isn't automatic package management great. Like crack on gasoline. Truthfully, it is trans great. I started doing Unix admin in 1986, and have used just about every clumsy horrible scheme you can imagine to handle add-on open source packages without which Unix (of whatever vendor-supplied flavor) was pretty damn useless even way back then. 
They still don't have things QUITE as simple as they could be -- setting up a diskless boot network for pxe installs or standalone operation is still an expert-friendly sort of thing and not for the faint of heart or tyro -- but it is down to where a single relatively simple HOWTO or set of READMEs can guide a moderately talented sysadmin type through the process.

With these tools, you can administer at the theoretical/practical limit of scalability. One person can take care of literally hundreds of machines, either nodes or LAN clients, limited only by the need to provide USER support and by the rate of hardware failure. I could see a single person taking care of over a thousand nodes for a small and undemanding user community, with onsite service on all node hardware. I think Mark Hahn pushes this limit, as do various others on list. That's just awesome. If EVER corporate america twigs to the cost advantages of this sort of management scalability on TOP of free as in beer software for all standard needs in the office workplace... well, one day it will. Too much money involved for it not to.

>> I used to include a "node install" in my standard dog and pony show
>> for people come to visit our cluster -- I'd walk up to an idle node,
>> reboot it into the PXE kickstart image, and talk about the fact that
>> I was reinstalling it. We had a fast enough network and tight enough
>> node image that usually the reinstall would finish about the same
>> time that my spiel was finished. It was then immediately available
>> for more work. Upgrades are just that easy. That's scalability.
>>
>> Warewulf makes it even easier -- build your new image, change a
>> single pointer on the master/server, reboot the cluster.
>>
>> I wouldn't advise either running upgrades or freezes of FC for all
>> cluster environments, but they certainly are reasonable alternatives
>> for at least some. FC is far from laughable as a cluster distro.
> What I'd like to see is an interested party which would implement a > good, long term security management program for FC(2n+b) releases. RH > obviously won't do this. I thought there was such a party, but I'm too lazy to google for it. I think Seth mentioned it on the yum or dulug list. It's the kind of thing a lot of people would pay for, actually. > Do _not_ start a contest like this with the Debian people. You _will_ lose. And I _won't_ care...;-) It took me two days to wade through extras in FC6, "shopping", and now there are another 500 packages I haven't even looked at a single time. The list of games on my laptop is something like three screenfuls long, and it would take me weeks to just explore the new applications I did install. And truthfully, the only reason I push FC is because (as noted above) it a) meets my needs pretty well; and b) has extremely scalable installation and maintenance; and c) (most important) I know how to install and manage it. I could probably manage debian as well, or mandriva, or SuSE, or Gentoo -- one advantage of being a 20 year administrator is I do know how everything works and where everything lives at the etc level beneath all GUI management tool gorp layers shovelled on top by a given distro -- but I'm lazy. Why learn YALD? One can be a master of one distro, or mediocre at several... > I haven't used a RH based machine which regularly synced against a > fast-moving package repository, so I can't really compare. :) Pretty much all of the current generation do this. Yum yum. Where one is welcome to argue about what constitutes a "fast-moving" repository. yum doesn't care, really. Everything else is up to the conservative versus experimental inclinations of the admin. > I personally believe more configuration is done on Debian systems in > package configuration than in the installer as compared with RH, but I > do agree with you mainly. It's way short of what FAI, replicator, and > system imager do too. 
The last time I looked at FAI it was Not Ready For Prime Time and languishing unloved. Of course this was a long time ago. I'm actually glad that it is loved. The same is true of replicators and system imagers -- I've written them myself (many years ago) and found them to be a royal PITA to maintain as things evolve, but at this point they SHOULD be pretty stable and functional. One day I'll play with them, as I'd really like to keep a standard network bootable image around to manage disk crashes on my personal systems, where I can't quite boot to get to a local disk to recover any data that might be still accessible. Yes there are lots of ways to do this and I do have several handy but a pure PXE boot target is very appealing.

>> Yes, one can (re)invent many wheels to make all this happen --
>> package up stuff, rsync stuff, use cfengine (in FC6 extras:-), write
>> bash or python scripts. Sheer torture. Been there, done that, long
>> ago and never again.
> Hey, some people like this. Some people compete in Japanese game shows.

Yes, but from the point of view of perfect scaling theory, heterogeneity and nonstandard anything is all dark evil. Yes, many people like to lose themselves in customization hell, but there is a certain zen element here and Enlightenment consists of realizing that all of this is Illusion and that there is a great Satori to be gained by following the right path....

OK, enough system admysticstration...;-)

rgb

--
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu

From gdjacobs at gmail.com Fri Dec 29 12:14:11 2006
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Fri, 29 Dec 2006 14:14:11 -0600
Subject: [Beowulf] SW Giaga, what kind?
In-Reply-To: <4594FB6D.6070402@streamline-computing.com> References: <1bef2ce30612272322p4a0d1807m3c6d9ea615f58873@mail.gmail.com> <4594E95E.6060102@streamline-computing.com> <4594F4B3.4070602@gmail.com> <4594FB6D.6070402@streamline-computing.com> Message-ID: <45957713.2080001@gmail.com> John Hearns wrote: > Geoff Jacobs wrote: > >> Depends how small we're talking about. If we're talking Value Cluster >> 2K7, the purchasing goals are quite different. > > But of course. > Indeed, you could start with desktop type systems and upgrade with (say) > Intel gigabit NIC cards when you have the need for a dedicated network. > > Just trying to point out that (in general) you should make a good choice > of hardware platform to start with. This goes without saying. Careful shopping is the name of the game. > Server-grade hardware in rackmount cases is almost always the way to go. > PSUs are more robust, cabling is easier and as I said you get a second > onboard Gigabit NIC. > Add in the capability for lights-out management for IPMI and you have a > good basis for cluster expansion. There are downsides to server hardware. Cost for one. Also, if it's 9 machines sitting in the back of a lab, the noise factor from 1U cases would probably be an issue. By the time desktop cases are impractical and the cluster needs dedicated space, I agree with you 100%. The trouble is we don't know how many nodes the OP is thinking of. -- Geoffrey D. Jacobs From gdjacobs at gmail.com Fri Dec 29 12:48:33 2006 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Fri, 29 Dec 2006 14:48:33 -0600 Subject: [Beowulf] Which distro for the cluster? In-Reply-To: References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk> <4594F3E6.5010803@gmail.com> Message-ID: <45957F21.4010601@gmail.com> Robert G. 
Brown wrote: > The problem is REALLY evident for laptops -- there are really major > changes in the way the kernel, rootspace, and userspace manages devices, > changes that are absolutely necessary for us to be able to plug cameras, > memory sticks, MP3 players, printers, bluetooth devices, and all of that > right into the laptop and have it "just work". NetworkManager simply > doesn't work for most laptops and wireless devices before FC5, and it > doesn't really work "right" until you get to FC6 AND update to at least > 0.6.4. On RHEL/Centos 4 (FC4 frozen, basically), well... Laptops. *shudder* How often do laptop manufacturers change their hardware configurations? Every other week? I've found the optimal way to purchase laptop hardware is with a good live cd for testing. No boot, no buy. > One of the major disadvantages linux has had relative to WinXX over the > years has been hardware support that lags, often by years, behind the > WinXX standard. Because of the way linux is developed, the ONLY way one > can fix this is to ride a horse that is rapidly co-developed as new > hardware is released, and pray for ABI and API level standards in the > hardware industry in front of your favorite brazen idol every night > (something that is unlikely to work but might make you feel better:-). > > The fundamental "advantage" of FC6 is that its release timing actually > matches up pretty well against the frenetic pace of new hardware > development -- six to twelve month granularity means that you can > "usually" by an off-the shelf laptop or computer and have a pretty good > chance of it either being fully supported right away (if it is older > than six months) or being fully supported within weeks to months -- > maybe before you smash it with a sledgehammer out of sheer frustration. 
>> From what I've seen, ubuntu/debian has a somewhat similar aspect, user > driven to get that new hardware running even more aggressively than with > FC (and with a lot of synergy, of course, even though the two > communities in some respects resemble Sunnis vs the Shites in Iraq:-). RGB must now go into hiding due to the fatwa against him. > SINCE they are user driven, they also tend to have lots of nifty > userspace apps, and since we have entered the age of the massive, fully > compatible, contributed package repo I expect FC7 to provide something > on the order of 10K packages, maybe 70% of them square in userspace (and > the rest libraries etc). > > This might even be the "nextgen" revolution -- Windows cannot yet > provide fully transparent application installation (for money or not) > over the network -- they have security issues, payment issues, > installshield/automation issues, permission issues, and > compatibility/library issues all to resolve before they get anywhere > close to what yum and friends (or debian's older and also highly > functional equivalents) can do already for linux. What the software > companies that are stuck in the "RHEL grove" don't realize is that RPMs, > yum and the idea of a repo enable them to set up a completely different > software distribution paradigm, one that can in fact be built for and > run on all the major RPM distros with minimal investment or risk on > their part. Then don't "get it" yet. When they do, there could be an > explosion in commercial grade, web-purchased linux software and > something of a revolution in software distribution and maintenance (as > this would obviously drive WinXX to clone/copy). Or not. > > Future cloudy, try again later. > > > Ooo, then you really don't like pretty much ANY of the traditional "true > beowulf" designs. They are all pretty much cream eggs. 
Hell, lots of > them use rsh without passwords, or open sockets with nothing like a > serious handshaking layer to do things like distribute binary > applications and data between nodes. How things have improved... > Grid designs, of course, are > another matter -- they tend to use e.g. ssh and so on but they have to > because nodes are ultimately exposed to users, probably not in a chroot > jail. Even so, has anyone really done a proper security audit of e.g. > pvm or mpi? How difficult is it to take over a PVM virtual machine and > insert your own binary? I suspect that it isn't that difficult, but I > don't really know. Any comments, any experts out there? Would compromising PVM frag a user or the whole system? > In the specific case of my house, anybody who gets to where they can > actually bounce a packet off of my server is either inside its walls and > hence has e.g. cracked e.g. WPA or my DSL firewall or one of my personal > accounts elsewhere that hits the single (ssh) passthrough port. In all > of these cases the battle is lost already, as I am God on my LAN of > course, so a trivial password trap on my personal account would give > them root everywhere in zero time. In fact, being a truly lazy > individual who doesn't mind exposing his soft belly to the world, if > they get root anywhere they've GOT it everywhere -- I have root set up > to permit free ssh between all client/nodes so that I have to type a > root password only once and can then run commands as root on any node > from an xterms as one-liners. > > This security model is backed up by a threat of physical violence > against my sons and their friends, who have carefully avoided learning > linux at anything like the required level for cracking because they know > I'd like them to, and the certain knowledge that my wife is doing very > well if she can manage to crank up a web browser and read her mail > without forgetting something and making me get up out of bed to help her > at 5:30 am. 
So while I do appreciate your point on a > production/professional network level, it really is irrelevant here. > There are three reasons I haven't upgraded it. One is sheer bandwidth. > It takes three days or so to push FCX through my DSL link, and while I'm > doing it all of my sons and wife and me myself scream because their > ain't no bandwidth leftover for things like WoW and reading mail and > working. This can be solved with a backpack disk and my laptop -- I can > take my laptop into Duke and rsync mirror a primary mirror, current > snapshot, with at worst a 100 Mbps network bottleneck (I actually think > that the disk bottleneck might be slower, but it is still way faster > than 384 kbps or thereabouts:-). > > The second is the bootstrapping problem. The system in question is my > internal PXE/install server, a printer server, and an md raid > fileserver. I really don't feel comfortable trying an RH9 -> FC6 > "upgrade" in a single jump, and a clean reinstall requires that I > preserve all the critical server information and restore it post > upgrade. At the same time it would be truly lovely to rebuild the MD > partitions from scratch, as I believe that MD has moved along a bit in > the meantime. > > This is the third problem -- I need to construct a full backup of the > /home partition, at least, which is around 100 GB and almost full. > Hmmm, it might be nice to upgrade the RAID disks from 80 GB to 160's or > 250's and get some breathing room at the same time, which requires a > small capital investment -- say $300 or thereabouts. Fortunately I do > have a SECOND backpack disk with 160 GB of capacity that I use as a > backup, so I can do an rsync mirror to that of /home while I do the > reinstall shuffle, with a bit of effort. > > All of this takes time, time, time. And I cannot begin to describe my > life to you, but time is what I just don't got to spare unless my life > depends on it. 
That's the level of triage here -- staunch the spurting > arteries first and apply CPR as necessary -- the mere compound fractures > and contusions have to wait. You might have noticed I've been strangely > quiet on-list for the last six months or so... there is a reason:-) Time. The great equalizer. > At the moment, evidently, I do have some time and am kind of catching > up. Next week I might have even more time -- perhaps even the full day > and change the upgrade will take. I actually do really want to do it -- > both because I do want it to be nice and current and secure and because > there are LOTS OF IMPROVEMENTS at the server level in the meantime -- > managing e.g. printers with RH9 tools sucks for example, USB support is > trans-dubious, md is iffy, and I'd like to be able to test out all sorts > of things like the current version of samba, a radius server to be able > to drop using PSK in WPA, and so on. So sure, I'll take your advice > "any day now", but it isn't that simple a matter. Walled gardens and VPNs for wireless access? Sweet. > No arguments. But remember, you say "users" because you're looking at > topdown managed clusters with many users. There are lots of people with > self-managed clusters with just a very few. And honestly, > straightforward numerical code is generally cosmically portable -- I > almost never even have to do a recompile to get it to work perfectly > across upgrades. So YMMV as far as how important that stability is to > users of any given cluster. There is a whole spectrum here, no simple > or universal answers. > Truthfully, it is trans great. I started doing Unix admin in 1986, and > have used just about every clumsy horrible scheme you can imagine to > handle add-on open source packages without which Unix (of whatever > vendor-supplied flavor) was pretty damn useless even way back then. 
> They still don't have things QUITE as simple as they could be -- setting > up a diskless boot network for pxe installs or standalone operation is > still an expert-friendly sort of thing and not for the faint of heart or > tyro -- but it is down to where a single relatively simple HOWTO or set > of READMEs can guide a moderately talented sysadmin type through the > process. > > With these tools, you can adminster at the theoretical/practical limit > of scalability. One person can take care of literally hundreds of > machines, either nodes or LAN clients, limited only by the need to > provide USER support and by the rate of hardware failure. I could see a > single person taking care of over a thousand nodes for a small and > undemanding user community, with onsite service on all node hardware. I > think Mark Hahn pushes this limit, as do various others on list. That's > just awesome. If EVER corporate america twigs to the cost advantages of > this sort of management scalability on TOP of free as in beer software > for all standard needs in the office workplace... well, one day it will. > Too much money involved for it not to. > I thought there was such a party, but I'm too lazy to google for it. I > think Seth mentioned it on the yum or dulug list. It's the kind of > thing a lot of people would pay for, actually. > And I _won't_ care...;-) Come to think of it, the only way you can lose in such a contest is if quality slips. Pretty much plusses across the board. :-D > It took me two days to wade through extras in FC6, "shopping", and now > there are another 500 packages I haven't even looked at a single time. > The list of games on my laptop is something like three screenfuls long, > and it would take me weeks to just explore the new applications I did > install. 
And truthfully, the only reason I push FC is because (as noted > above) it a) meets my needs pretty well; and b) has extremely scalable > installation and maintenance; and c) (most important) I know how to > install and manage it. I could probably manage debian as well, or > mandriva, or SuSE, or Gentoo -- one advantage of being a 20 year > administrator is I do know how everything works and where everything > lives at the etc level beneath all GUI management tool gorp layers > shovelled on top by a given distro -- but I'm lazy. Why learn YALD? > One can be a master of one distro, or mediocre at several... This is absolutely valid. There is no need to move to the latest whiz-bang distro if what you're using works fine. > Pretty much all of the current generation do this. Yum yum. > > Where one is welcome to argue about what constitutes a "fast-moving" > repository. yum doesn't care, really. Everything else is up to the > conservative versus experimental inclinations of the admin. How usable is the FC development repository? > The last time I looked at FAI with was Not Ready For Prime Time and > languishing unloved. Of course this was a long time ago. I'm actually > glad that it is loved. The same is true of replicators and system > imagers -- I've written them myself (many years ago) and found them to > be a royal PITA to maintain as things evolve, but at this point they > SHOULD be pretty stable and functional. One day I'll play with them, as > I'd really like to keep a standard network bootable image around to > manage disk crashes on my personal systems, where I can't quite boot to > get to a local disk to recover any data that might be still accessible. > Yes there are lots of ways to do this and I do have several handy but a > pure PXE boot target is very appealing. > >>> Yes, one can (re)invent many wheels to make all this happen -- >>> package up stuff, rsync stuff, use cfengine (in FC6 extras:-), write >>> bash or python scripts. Sheer torture. 
Been there, done that, long >>> ago and never again. >> Hey, some people like this. Some people compete in Japanese game shows. > > Yes, but from the point of view of perfect scaling theory, heterogeneity > and nonstandard anything is all dark evil. Yes, many people like to > lose themselves in customization hell, but there is a certain zen > element here and Enlightment consists of realizing that all of this is > Illusion and that there is a great Satori to be gained by following the > right path.... > > OK, enough system admysticstration...;-) > > rgb > -- Geoffrey D. Jacobs From rgb at phy.duke.edu Fri Dec 29 15:49:31 2006 From: rgb at phy.duke.edu (Robert G. Brown) Date: Fri, 29 Dec 2006 18:49:31 -0500 (EST) Subject: [Beowulf] Which distro for the cluster? In-Reply-To: <45957F21.4010601@gmail.com> References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk> <4594F3E6.5010803@gmail.com> <45957F21.4010601@gmail.com> Message-ID: On Fri, 29 Dec 2006, Geoff Jacobs wrote: > Robert G. Brown wrote: > >> The problem is REALLY evident for laptops -- there are really major >> changes in the way the kernel, rootspace, and userspace manages devices, >> changes that are absolutely necessary for us to be able to plug cameras, >> memory sticks, MP3 players, printers, bluetooth devices, and all of that >> right into the laptop and have it "just work". NetworkManager simply >> doesn't work for most laptops and wireless devices before FC5, and it >> doesn't really work "right" until you get to FC6 AND update to at least >> 0.6.4. On RHEL/Centos 4 (FC4 frozen, basically), well... > Laptops. *shudder* How often do laptop manufacturers change their > hardware configurations? Every other week? > > I've found the optimal way to purchase laptop hardware is with a good > live cd for testing. No boot, no buy. I just drink heavily, personally, and trust in various deities. 
>> FC (and with a lot of synergy, of course, even though the two >> communities in some respects resemble Sunnis vs the Shites in Iraq:-). > RGB must now go into hiding due to the fatwa against him. I'm just wearing my flameproof suit. That's really what they need to introduce into Iraq -- a fireproof kevlar birka. Bomb goes off in the market? No problem. Just dust yourself off and move on, picking bomb splinters out of your robes. Eventually the bombers get bored and go back to hurling imprecations at one another after drinking large amounts of extremely strong coffee. Sigh. I suppose I really shouldn't joke about it, but it IS so very very very tiresome and evil. >> pvm or mpi? How difficult is it to take over a PVM virtual machine and >> insert your own binary? I suspect that it isn't that difficult, but I >> don't really know. Any comments, any experts out there? > Would compromising PVM frag a user or the whole system? I think this very much depends on the architecture and how things are set up. Some message libraries run as root, others as nobody, others as users. And compromising users isn't necessarily all that great -- it is usually the first step in compromising a network anyway. There are usually small rare carefully guarded holes at the port level as nearly everybody knows by now that leaving lots of open ports on a system is a bad idea, so there are only a very few applications that really have to be secure to keep a system safe from outside intruders. In many cases, only ssh, for example. Once an intruder has user-level access, though, they have the ability to use ALL of the applications on the system to promote to root. Suddenly there are lots of running daemons, lots of root processes, lots of suid root binaries all of which have to be exploit free to prevent promotion. These are almost invariably less well audited than primary network daemons. Again, though, the point is clear. 
Compute farm style grids can be made "semi-hard" -- as hard as any client/server lan with a very large and uncontrolled user base running lots of unaudited code with very nearly arbitrary libraries (in some cases users can even upload their own libraries to the system as part of their "package"). Still, the usual boundaries between users and at the network layer exist and are pretty well defended, and it is not at all easy to take over another user's process or corrupt its data stream or monitor its data stream from an ordinary user account or network PoP (assume a laptop snapped into a hot port where a bad guy has root and promiscuous network access). Inside a "true beowulf" or firewall protected not-quite-so-tight grid cluster that runs VM parallel code this is not true, because the VM libraries do not use encryption for obvious reasons and use nothing beyond the network stacks themselves for validating connections. Which in the case of UDP is virtually nothing -- connectionless and nearly anonymous save for the IP numbers in the packet headers, that are pretty much just trusted. TCP is a bit better, although I was around in the days of the Morris worm and would never bet that an ubercracker couldn't snag a TCP connection given enough time and effort. rsh is a joke as far as security is concerned (and ssh adds overhead to at least some things...). I just have little confidence that a cluster in active use, especially one that uses rsh as a base and/or bproc, is in any way "secure" against cracking. In most cases the security layer is the head node through which it is accessed, which is a de facto firewall. I don't know that people have spent a lot of time trying to figure out how hard PVM or MPI is to exploit, but as network services they are certainly pretty high up there on the list of possibly exploitable things...

>> Where one is welcome to argue about what constitutes a "fast-moving"
>> repository. yum doesn't care, really.
Everything else is up to the >> conservative versus experimental inclinations of the admin. > How usable is the FC development repository? I don't know. My laptop is the only thing that I've got that is running FC6 yet (until post server upgrade, at which time I'll flash everything on up) and I don't screw around on a production system, which my laptop definitely is. As in I cannot afford for it to be down. I no longer use a real desk for much of anything except holding bill receipts until I can file them. My laptop, with internet access to e.g. wikipedia and google, is a large fraction of my brain. I've used it a dozen times to access information I half remember (or half forgot) and validate it in the discussion so far. I've thought about turning it on in yum long enough to nail e.g. the latest upgrade to NetworkManager, as it apparently fixes a nasty bug in the openvpn segment that causes my access to Duke's vpn to work, but only by crashing NM. I really want to be able to toggle it up and down at will, and expect that I could if I grabbed from the devel branch and/or downloaded from CVS and rebuilt. Time time time. It works well enough (and ever better). Better than WinXX's related tools that are just plain evil, especially on a user's laptop that has loaded three layers of things -- microsoft's, the laptop vendor's, and an ISP's -- all trying to provide an "easy" interface to the wireless card, and all colliding... rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From gdjacobs at gmail.com Fri Dec 29 17:11:46 2006 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Fri, 29 Dec 2006 19:11:46 -0600 Subject: [Beowulf] Which distro for the cluster? 
In-Reply-To: References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk> <4594F3E6.5010803@gmail.com> <45957F21.4010601@gmail.com> Message-ID: <4595BCD2.2080304@gmail.com> Robert G. Brown wrote: > I'm just wearing my flameproof suit. > > That's really what they need to introduce into Iraq -- a fireproof > kevlar birka. Bomb goes off in the market? No problem. Just dust > yourself off and move on, picking bomb splinters out of your robes. > Eventually the bombers get bored and go back to hurling imprecations at > one another after drinking large amounts of extremely strong coffee. I was alluding to your potentially inflammatory misspelling of Shi'ite. And burqas are worn specifically by women, usually in Afghanistan, Pakistan, and some parts of India. What Iraqis need is this man's help: http://en.wikipedia.org/wiki/Troy_Hurtubise > > rgb > -- Geoffrey D. Jacobs From greg.lindahl at qlogic.com Fri Dec 29 17:12:18 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Fri, 29 Dec 2006 17:12:18 -0800 Subject: [Beowulf] Which distro for the cluster? In-Reply-To: <4594F3E6.5010803@gmail.com> References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk> <4594F3E6.5010803@gmail.com> Message-ID: <20061230011218.GA6466@greglaptop.hsd1.ca.comcast.net> On Fri, Dec 29, 2006 at 04:54:30AM -0600, Geoff Jacobs wrote: > It most definitely isn't. Goto trounces it easily. ATLAS is the first > stab at an optimized BLAS library before the hand coders go to work. Actually, on most platforms ATLAS uses hand-written routines for the most common cases. It's just that Goto is better at hand-writing for the matrix sizes that people use for HPL. -- greg From rgb at phy.duke.edu Sat Dec 30 14:19:55 2006 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Sat, 30 Dec 2006 17:19:55 -0500 (EST) Subject: [Beowulf] Which distro for the cluster? In-Reply-To: <45961D6E.1040907@nada.kth.se> References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk> <4594F3E6.5010803@gmail.com> <45961D6E.1040907@nada.kth.se> Message-ID: On Sat, 30 Dec 2006, Jon Tegner wrote: > Robert G. Brown wrote: >> >> All of this takes time, time, time. And I cannot begin to describe my >> life to you, but time is what I just don't got to spare unless my life >> depends on it. That's the level of triage here -- staunch the spurting >> arteries first and apply CPR as necessary -- the mere compound fractures >> and contusions have to wait. You might have noticed I've been strangely >> quiet on-list for the last six months or so... there is a reason:-) >> > What has been boring is that all eight of the rgb-clones (by doing a careful > analysis of the texts - as well as the "submit pattern" - I'm convinced there > are eight of them) have been quiet at the same time. I am happy to see that > two of them (rgb_3 and rgb_7) are active again! Ya, well, actually I use procmail to pipe the list mail through a very, very advanced version of the eliza chatbot. It parses and rearranges my actual responses from as many as four or five years ago into plausible replies based on keywords in the message. That's why prices, hardware descriptions, and commentary seem so outdated. The amazing thing is that nobody noticed until you did! -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From dave.cunningham at lmco.com Thu Dec 28 12:12:16 2006 From: dave.cunningham at lmco.com (Cunningham, Dave) Date: Thu, 28 Dec 2006 12:12:16 -0800 Subject: FW: [Beowulf] Which distro for the cluster? 
Message-ID: <3D92CA467E530B4E8295214868F840FE0A317F81@emss01m12.us.lmco.com> I notice that Scyld is notable by its absence from this discussion. Is that due to cost, or bad/no experience, or other factors? There is a lot of interest in it around my company lately. Dave Cunningham -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Andrew M.A. Cater Sent: Thursday, December 28, 2006 8:40 AM To: beowulf at beowulf.org Subject: Re: [Beowulf] Which distro for the cluster? On Wed, Dec 27, 2006 at 06:46:25PM +0100, Chetoo Valux wrote: > Dear all, > > As a Linux user I've worked with several distros as RedHat, SuSE, Debian and > derivatives, and recently Gentoo. > > Now I face the challenge of building a HPC for scientific calculations, and > I wonder which distro would suit me best. As a Gentoo user, I've recognised > the power of customisation, optimisation and lightweight system, for > instance my 4 years old laptop flies like a youngster, and some desktops > too. So I thought about building the HPC nodes (8+1 master) with Gentoo .... > Don't use Gentoo unless you've a full, fast connection to the internet _AND_ you're prepared for your cluster to be internet connected while you build it. This IMHO. Scientific calculations: Quantian? Debian. Debian for the number of math and other packages and the ease of install. Over 8 nodes, it should be relatively easy to set up. But it depends what you want to do, what other users want to do etc. etc. > But then it comes the administration and maintenance burden, which for me it > should be the less, since my main task here is research ... so browsing the > net I found Rocks Linux with plenty of clustering docs and administration > tools & guidelines. I feel this should be the choice in my case, even if I > sacrifice some computation efficiency. Rocks / Warewulf perhaps. 
If you just want something you can build/update/maintain in your sleep, I'd still suggest Debian - if only because a _minimal_ install on the nodes is as small as you want it to be - and because it's fairly consistent. Your cluster - your choice but you may have to justify it to your co-workers. Andy > > Any advice on this will be appreciated. > > Chetoo. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hanzl at noel.feld.cvut.cz Thu Dec 28 12:24:26 2006 From: hanzl at noel.feld.cvut.cz (Vaclav Hanzl) Date: Thu, 28 Dec 2006 21:24:26 +0100 (CET) Subject: [Beowulf] Which distro for the cluster? In-Reply-To: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> Message-ID: <20061228.212426.74735341.hanzl@noel.feld.cvut.cz> > ...So I thought about building the HPC nodes (8+1 master) with Gentoo .... > > But then it comes the administration and maintenance burden, which for me it > should be the less, since my main task here is research ... so browsing the > net I found Rocks... Years ago, I installed a small cluster of ten 200MHz PPro machines. Cluster distributions were a dream of the future at that time. I still keep the cluster at that size, replacing parts of it as funding allows - the size is quite right for the research we do. I've been through most types of installation adventures this list might suggest to you - and in retrospect, I am surprised how little help from those wonderful clustering tools (which emerged exactly as our dreams predicted) I was able to enjoy. 
With this size of cluster and this type of funding, I spent by far the biggest part of cluster maintenance selecting the right new hardware every year and making sure that the Linux kernel has the right drivers for it (network, hd controllers). Finally, the thing which helps me most is - surprise - Knoppix, which is better than me at supporting new hardware :-) (well, I could do it better than Knoppix auto-detection if I devoted more time to it, but no, I do not want to, there were already enough kernel compiles in my life). I can give a CD to my (most collaborative) hardware vendor and he can pre-select hardware for me. I install it to harddisks using my own simple scripts (via USB stick) and I am quite happy with the result. An even closer match to my package selection needs is ParallelKnoppix (which can also be used to create a volatile cluster - I do not use this option but it might be nice to play with). The little rest can be added via apt-get. I just had to mention my experience as it seems to be so different from the experience of people who care about bigger clusters and can devote a bigger share of their human time to cluster maintenance... Best Regards Vaclav Hanzl From r.vadivelanrhce at gmail.com Fri Dec 29 02:26:55 2006 From: r.vadivelanrhce at gmail.com (Vadivelan Rathinasabapathy) Date: Fri, 29 Dec 2006 15:56:55 +0530 Subject: [Beowulf] running MPICH on AMD Opteron Dual Core Processor Cluster( 72 Cpu's) Message-ID: <9fe360270612290226yb9e3ccbua77a1febf4123fc6@mail.gmail.com> Dear all We have a problem running applications that are compiled with MPICH. Our setup is a 16 Node 72 Cpu AMD Opteron cluster which has Rocks-4.1.2 and RHEL 4.0 update 4 installed on it. We are trying to run a benchmark with the MPICH which came along with the ROCKS installation. The run starts and then the following error occurs after some time. 
" p1_8544: p4_error: Timeout in Establishing connection to remote process: 0 " rm_l_1_8667: (359.417969) net_send: could not write to fd=5, errno=104 We have been trying the same for the past two days and we didnt get any solution for the above. Also we downloaded the Latest MPICH 1.2.7p1 and configured the same. now for the same testing with the latest mpich, the code seems to be running in the Master Server no matter, how many number of processors we give. The same testing with LAM/MPI and OPENMPI are working fine. pls provide us a good solution -- Thanks and Regards R.Vadivelan CMC Ltd, Bangalore r.vadivelanrhce at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From ntmoore at gmail.com Fri Dec 29 07:49:38 2006 From: ntmoore at gmail.com (Nathan Moore) Date: Fri, 29 Dec 2006 09:49:38 -0600 Subject: [Beowulf] picking a job scheduler Message-ID: <2CA4B96D-5BAD-4B60-995F-DF50AC94D864@gmail.com> I've presently set up a cluster of 5 AMD dual-core linux boxes for my students (at a small college). I've got MPICH running, shared NIS/NFS home directories etc. After reading the MPICH installation guide and manual, I can't say I understand how to deploy MPICH for my students to use. So far as I can tell, there no load balancing or migration of processes in the library, and so now I'm trying to figure out what piece of software to add to the cluster to (for example) prevent the starting of an MPI job when there's already another job running. (1) Is openPBS or gridengine the appropriate tool to use for a multi- user system where mpich is available? Are there better scheduling options? (1.5) Can mortals install and configure Gridengine? Thus far it seems too wonderful for me to understand. (2) Also, if my cluster is made up of a mix of single and dual processor machines, what's the proper way to tell mpd about that topology? 
(3) It's likely that in the future I'll have part-time access to another cluster of dual-boot (XP/linux) machines. The machines will default to booting to Linux, but will occasionally (5-20 hours a week) be used as windows workstations by a console user (when a user is finished, they'll restart the machine and it will boot back to linux). If cluster nodes are available in this sort of unpredictable and intermittent way, can they be used as compute nodes in some fashion? Will gridengine/PBS /??? take care of this sort of process migration? best regards, Nathan - - - - - - - - - - - - - - - - - - - - - - - Nathan Moore Physics Winona State University AIM:nmoorewsu From ntmoore at gmail.com Fri Dec 29 08:40:11 2006 From: ntmoore at gmail.com (Nathan Moore) Date: Fri, 29 Dec 2006 10:40:11 -0600 Subject: [Beowulf] picking out a job scheduler Message-ID: <58106678-33A8-40D7-BA1E-4DF128F1A7FC@gmail.com> I've presently set up a cluster of 5 AMD dual-core linux boxes for my students (at a small college). I've got MPICH running, shared NIS/NFS home directories etc. After reading the MPICH installation guide and manual, I can't say I understand how to deploy MPICH for my students to use. So far as I can tell, there is no load balancing or migration of processes in the library, and so now I'm trying to figure out what piece of software to add to the cluster to (for example) prevent the starting of an MPI job when there's already another job running. (1) Is openPBS or gridengine the appropriate tool to use for a multi-user system where mpich is available? Are there better scheduling options? (1.5) Can mortals install and configure Gridengine? Thus far it seems too wonderful for me to understand. (2) Also, if my cluster is made up of a mix of single and dual processor machines, what's the proper way to tell mpd about that topology? (3) It's likely that in the future I'll have part-time access to another cluster of dual-boot (XP/linux) machines. 
The machines will default to booting to Linux, but will occasionally (5-20 hours a week) be used as windows workstations by a console user (when a user is finished, they'll restart the machine and it will boot back to linux). If cluster nodes are available in this sort of unpredictable and intermittent way, can they be used as compute nodes in some fashion? Will gridengine/PBS /??? take care of this sort of process migration? best regards, Nathan - - - - - - - - - - - - - - - - - - - - - - - Nathan Moore Physics Winona State University nmoore at winona.edu AIM:nmoorewsu From tegner at nada.kth.se Sat Dec 30 00:03:58 2006 From: tegner at nada.kth.se (Jon Tegner) Date: Sat, 30 Dec 2006 09:03:58 +0100 Subject: [Beowulf] Which distro for the cluster? In-Reply-To: References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk> <4594F3E6.5010803@gmail.com> Message-ID: <45961D6E.1040907@nada.kth.se> Robert G. Brown wrote: > > All of this takes time, time, time. And I cannot begin to describe my > life to you, but time is what I just don't got to spare unless my life > depends on it. That's the level of triage here -- staunch the spurting > arteries first and apply CPR as necessary -- the mere compound fractures > and contusions have to wait. You might have noticed I've been strangely > quiet on-list for the last six months or so... there is a reason:-) > What has been boring is that all eight of the rgb-clones (by doing a careful analysis of the texts - as well as the "submit pattern" - I'm convinced there are eight of them) have been quiet at the same time. I am happy to see that two of them (rgb_3 and rgb_7) are active again! Welcome back! ;-) /jon From ruhollah.mb at gmail.com Sat Dec 30 01:53:41 2006 From: ruhollah.mb at gmail.com (Ruhollah Moussavi Baygi ) Date: Sat, 30 Dec 2006 13:23:41 +0330 Subject: [Beowulf] SW Giaga, what kind? 
In-Reply-To: <45957713.2080001@gmail.com> References: <1bef2ce30612272322p4a0d1807m3c6d9ea615f58873@mail.gmail.com> <4594E95E.6060102@streamline-computing.com> <4594F4B3.4070602@gmail.com> <4594FB6D.6070402@streamline-computing.com> <45957713.2080001@gmail.com> Message-ID: <1bef2ce30612300153o4d1ae055n1374b976e846d258@mail.gmail.com> Hi everybody @ Beowulf!, Thanks for everyone's help in answering my question about "SW GIGA, What kind?". But, originally, my question was about the quality and reliability of the *LevelOne* brand of SW (Unmanaged, Gigabit ports), given its fairly low price, on one hand, and the *3COM* brand of SW (Unmanaged, Gigabit ports) on the other hand. The number of nodes in our initial plan is 6 nodes, AMD DualCore, desktop type systems. Any useful hint, help, or suggestion is appreciated. Thanks in advance, -- Best, Ruhollah Moussavi Baygi Computational Physical Sciences Research Laboratory, Department of NanoScience, IPM -------------- next part -------------- An HTML attachment was scrubbed... URL: From manalorama at gmail.com Sat Dec 30 08:21:18 2006 From: manalorama at gmail.com (Manal Helal) Date: Sun, 31 Dec 2006 03:21:18 +1100 Subject: [Beowulf] mpich mpd ring on a network of 2 pcs In-Reply-To: <200612302000.kBUK07m3028696@bluewest.scyld.com> References: <200612302000.kBUK07m3028696@bluewest.scyld.com> Message-ID: <459691FE.7050700@gmail.com> Hi I am trying to set up a small cluster incrementally, to run mpi programs only. I have 4 PCs with linux fedora core, 2 with core 5, and one with core 6, and I will install the new one with core 6. 
I installed mpich2 on fedora core 6, and I can run mpd and the mpi programs on this machine fine. I can ping and ssh from and to all machines. Then I added an smb share to the install bin path, which I can access from the other machine, and updated the mpd.hosts file (in the user folder on the mpich2 installation machine) with the names of both machines for now. (I copied .mpd.conf to the user folder on both machines, and did the same with mpd.hosts - not sure if this is right or not.) On the second machine, I can read and write the mpich2 bin folder, and I can run the mpd command, but when I try mpdtrace, it says no mpd is running. When I run mpd on the installation machine, mpdtrace it to get the port number, and then run on the other machine: mpd -h hostname -p port & I receive: ********************** [1] 7007 [mhelal at manal mhelal]# manal.localhits_45668: conn error in connect_rhs: Connection refused manal.localhits_45668 (connect_rhs 726): failed to connect to rhs at 127.0.0.1 56317 manal.localhits_45668 (enter_ring 633): rhs connect failed manal.localhits_45668 (run 245): failed to enter ring ********************** and on the installation machine I keep getting: lot rhs; re-entering ring ..... back in ring ********************** In another scenario, I tried on the installation machine: ********************** [mhelal at manallpt ~]$ mpdboot -n 2 mhelal at manal's password: mpdboot_manallpt.localhits (handle_mpd_output 388): from mpd on manal, invalid port info: /home/mhelal: Permission denied. /home/mhelal/mpich2-install/bin/mpd.py: Command not found. ********************** How can I debug this problem? Any help is highly appreciated. I only have the mpich2 README, and it says to refer to the installation guide for more information, which I cannot find. 
It would be really helpful if anyone could point me to a tutorial (detailed, step by step) on how to create a small simple network to run mpi jobs, and the things I need to take care of. Thank you in advance, Kind Regards, Manal
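One thing that may be worth checking, given the "failed to connect to rhs at 127.0.0.1" line in the output above: it could mean that a machine's own hostname resolves to the loopback address, which the other machine cannot connect to. That is only a guess from the error text, not a confirmed diagnosis. The sketch below is a minimal check to run on each machine; the function names is_loopback and check_hostname are made up for illustration and are not part of mpich2.

```python
import ipaddress
import socket


def is_loopback(addr):
    """Return True if addr (an IPv4 or IPv6 address string) is a loopback address."""
    return ipaddress.ip_address(addr).is_loopback


def check_hostname(name=None, resolver=socket.gethostbyname):
    """Resolve a hostname and report whether the result is usable by other machines.

    Returns (name, address, usable). usable is False when the name resolves
    to a loopback address such as 127.0.0.1. The resolver argument exists
    only so the check can be exercised without touching the real network.
    """
    if name is None:
        name = socket.gethostname()
    addr = resolver(name)
    return name, addr, not is_loopback(addr)


if __name__ == "__main__":
    try:
        name, addr, usable = check_hostname()
    except OSError as err:
        print("could not resolve this machine's hostname:", err)
    else:
        verdict = "looks reachable" if usable else "is loopback; check /etc/hosts"
        print(f"{name} -> {addr}: {verdict}")
```

If a machine reports a loopback address for its own name, the hosts file entry for that name would be the first place to look. The separate "Permission denied" and "Command not found" messages from mpdboot look like a path or permissions issue on the remote side and are not covered by this check.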