From eugen at leitl.org Wed Mar 2 02:51:11 2005
From: eugen at leitl.org (Eugen Leitl)
Date: Tue Jan 6 01:03:53 2009
Subject: [Beowulf] Re: [Bioclusters] error while running mpiblast (fwd from landman@scalableinformatics.com)
Message-ID: <20050302105110.GC13336@leitl.org>

----- Forwarded message from Joe Landman -----

From: Joe Landman
Date: Wed, 02 Mar 2005 00:25:05 -0500
To: "Clustering, compute farming & distributed computing in life science informatics"
Cc:
Subject: Re: [Bioclusters] error while running mpiblast
User-Agent: Mozilla Thunderbird 1.0 (X11/20041207)
Reply-To: "Clustering, compute farming & distributed computing in life science informatics"

James Cuff wrote:

>"Iam running this on SGI multiprocessor(numa)",
>
>you are running on a single shared (well near unified, and SGI do this
>very, very well) memory server with, as you said and appear to understand
>shared storage...
>
>*sigh*
>
>What on earth are you going to gain from MPI? Standard NCBI threads
>should do for you just fine, or maybe I've been smoking the funny stuff.

Hi James:

It is quite possible that mpiblast will scale better than NCBI blast
on this system. MPI forces you to pay attention to locality of
reference, so you tend to do a good job partitioning your code (that
is, if it scales). NCBI is built with pthreads, and I haven't seen it
scale much beyond about 10 CPUs on an SMP. The coarser grain of the
mpiblast partitioning (the pthread partitioning is very fine grain)
will very likely result in better scalability on a NUMA.

Not only that, but large multicpu NUMA's have problems with memory
hotspots. I remember in the Origin days we used to play games with
DPLACE directives and whatnot else to control memory layout,
replication of pages, etc. This was under Irix, and there were rich
sets of tools to help. I don't think many of them are available under
Linux right now (possibly in the SGI propack). You don't see much of a
problem in 2/4 way systems.
It becomes serious when you load data into a page, and you start
getting 16 requestors for that page. Page migration is not a win here.
Readonly page replication can be a huge win here. Luckily, with mpi,
all references are local to begin with ...

That said, I don't have ready access to one, so I cannot test this
hypothesis, though I might just throw together a BBS experiment to
test this.

I'd love to play with a nice 9MB cache machine. This would be a sweet
blast engine :) Expensive... yes, but running out of cache is a "good
thing" (TM).

>If you _do_ happen to have multiple NUMA's in a cluster, (1) you are very
>lucky and (2) you should still listen to Joe's advice... Local is
>only local so far, try:
>
> Shared=/home/kalyani/toolkit/ncbi
> Local=/tmp/kalyani_mpiblast/
>
>(or as Joe maybe put better)
>
> Shared=/home/kalyani/toolkit/ncbi
> Local=/mylocalfilesystemthatnoonewillmesswith/kalyani_mpiblast/

Lucas sent me a note indicating that in 1.3.0 they allow for shared
and local to coexist. Aaron/Lucas, if you are about, could you clarify
some of this? I don't want to lead people astray (and I will need to
update the SGE tool).

>
> WFM, YMMV..

Note: We have not built the mpiblast RPM for Itanium (nor for that
matter, any of our RPMs). Is there any interest in this? Curious.

Joe

>
>Best,
>
>J.
>
>--
>James Cuff, D. Phil.
>Group Leader, Applied Production Systems
>Broad Institute of MIT and Harvard. 320 Charles Street,
>Cambridge, MA. 02141.
Tel: 617-252-1925 Fax: 617-258-0903

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615
_______________________________________________
Bioclusters maillist - Bioclusters@bioinformatics.org
https://bioinformatics.org/mailman/listinfo/bioclusters

----- End forwarded message -----
--
Eugen* Leitl leitl
______________________________________________________________
ICBM: 48.07078, 11.61144 http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
http://moleculardevices.org http://nanomachines.net

From eugen at leitl.org Wed Mar 2 09:30:08 2005
From: eugen at leitl.org (Eugen Leitl)
Date: Tue Jan 6 01:03:53 2009
Subject: [Beowulf] 2.6.11 is out; with InfiBand support
Message-ID: <20050302173008.GY13336@leitl.org>

Speaking of InfiniBand, I presume there are still no motherboards with
IB ports onboard?

http://www.internetnews.com/dev-news/article.php/3485401

February 24, 2005
Linux Kernel 2.6.11 Supports InfiniBand
By Sean Michael Kerner

The Linux world is bracing for the final release of the new Linux
2.6.11 kernel, which will include a long list of driver updates and
patches, with InfiniBand support perhaps being one of the most
interesting new additions.

Late last night, Linux creator Linus Torvalds issued the fifth release
candidate for the 2.6.11 kernel. The first 2.6.11 RC was issued on
Jan. 12; the second on Jan 21; the third on Feb. 2; and the fourth on
Feb. 12.
In the RC5 posting, Torvalds indicated that it was likely the last RC
before the final release.

"Hey, I hoped -rc4 was the last one, but we had some laptop resource
conflicts, various ppc TLB flush issues, some possible stack overflows
in networking and a number of other details warranting a quick -rc5
before the final 2.6.11," Torvalds wrote.

"This time it's really supposed to be a quickie, so people who can,
please check it out, and we'll make the real 2.6.11 asap."

The long list of updates in the 2.6.11 kernel includes architecture
updates for x86-64, ia64, ppc, arm and mips, as well as updates to
ACPI, DRI (Direct Rendering Infrastructure, which permits direct
access to graphics hardware for X Window System users), ALSA (Advanced
Linux Sound Architecture, which provides MIDI and audio functionality
to Linux), SCSI and the XFS high-performance journaling filesystem.

The 2.6.11 kernel will also be significant in that it includes driver
support for the InfiniBand interconnect architecture. InfiniBand,
which is derived from its underlying concept of "infinite bandwidth,"
is a switched fabric interconnect technology for high-performance
network devices that is common in a number of supercomputer clusters.

The upcoming inclusion of InfiniBand support in the Linux kernel is a
major step according to the InfiniBand Trade Association.

"The inclusion of InfiniBand drivers in the upstream Linux kernel is a
significant milestone," Ross Schibler, CTO of InfiniBand vendor
Topspin Communications, told internetnews.com.

InfiniBand support was available previously in various Linux
distributions, but it wasn't part of the mainstream kernel.org Linux.

"This now means that anyone that downloads a kernel will have
automatic access to the software," explained Schibler. "It also means
that any upcoming distributions (Red Hat, SUSE, etc.) will have the
software included on their CDs.
Previously SUSE had it on a distribution, but only in the
'unsupported' directory."

Schibler sees the inclusion of InfiniBand as a testament to the
maturation of the technology.

"Now that the technology has matured to such a point that Linus has
accepted it into the kernel, the way is paved for greater distribution
of the code and accelerated deployment of the technology," Schibler
said.

The previous Linux kernel.org release, version 2.6.10, was issued on
Dec. 24 after two release candidates. Linux distributions began
including 2.6.10 thereafter, with Red Hat's Fedora Project being one
of the first.

Fedora Core 3 initially shipped with the 2.6.9 kernel and then
upgraded to the 2.6.10 kernel on Jan 13. Mandrakelinux's 10.2 Beta 3
also includes the 2.6.10 release. SUSE Linux 9.2 currently includes
the 2.6.8 kernel.

Including the most recent kernel into a distribution is not a
particularly easy task. The upcoming Debian, code-named Sarge, will
only ship with the 2.6.8 kernel. In a release update e-mail, Debian
Sarge release manager Andreas Barth related that a meeting was
recently held to review the status of which kernel they would include.

"The team leads involved eventually decided to stay with kernel 2.6.8
and 2.4.27, rather than bumping the 2.6 kernel to 2.6.10," Barth
wrote. "This decision was made upon review of the known bugs in each
of the 2.6 kernel versions; despite some significant bugs in the
Debian 2.6.8 kernel tree, these bugs were weighed against the
additional delays that a kernel version bump would introduce in the
schedule for debian-installer RC3."

"As it happens, preparing 2.4 and 2.6 kernels with the security fixes
for all architectures took roughly two months from start to finish,
during which time preparation of the next debian-installer release
candidate has been entirely stalled," he added.
--
Eugen* Leitl leitl
______________________________________________________________
ICBM: 48.07078, 11.61144 http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
http://moleculardevices.org http://nanomachines.net

From gotero at linuxprophet.com Wed Mar 2 11:08:19 2005
From: gotero at linuxprophet.com (Glen Otero)
Date: Tue Jan 6 01:03:53 2009
Subject: [Beowulf] 2.6.11 is out; with InfiBand support
In-Reply-To: <20050302173008.GY13336@leitl.org>
References: <20050302173008.GY13336@leitl.org>
Message-ID: <5769e684325b11f3146f648fe06d327f@linuxprophet.com>

Arima and Iwill have mobos with IB LOM (Landed on Motherboard).

Glen

On Mar 2, 2005, at 9:30 AM, Eugen Leitl wrote:

>
> Speaking of InfiniBand, I presume there are still no motherboards with
> IB ports onboard?
>
> http://www.internetnews.com/dev-news/article.php/3485401
>
> ...snip...
>
Glen Otero Ph.D.
Linux Prophet

From hahn at physics.mcmaster.ca Wed Mar 2 15:09:09 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Tue Jan 6 01:03:53 2009
Subject: [Beowulf] 2.6.11 is out; with InfiBand support
In-Reply-To: <5769e684325b11f3146f648fe06d327f@linuxprophet.com>
Message-ID:

> Arima and Iwill have mobos with IB LOM (Landed on Motherboard).

given the choice between a $150 pcie IB nic and having it onboard,
I'd choose the separate card. I know the IB salesdroids always
say that getting onto the MB will change everything, but this doesn't
make sense. IB is completely different from onboard gigabit, for
instance, because there is no ubiquitous IB infrastructure ready,
waiting to be exploited.

the problem with "if you build it onboard, they will come" is also
the marginal cost. onboard gigabit is nearly the same cost as onboard
100bT, very low, and you pretty much always want it. onboard IB is
noticeably higher than onboard GBE, noticeable in absolute terms, and
you definitely have no possible use for it on many systems.

remember, most people don't even saturate GBE yet, and GBE ports are
damned cheap.
GBE nics are free, and switch ports are now down to $US 23/port:

http://froogle.google.com/froogle?q=netgear+GS748T&btnG=Search+Froogle

fundamentally, IB is still facing most of the same problems it always has:
- requires fairly expensive, unique infrastructure
- not the greatest physical layer: it's easy to wind up
  with literally tons of IB cables.
- not clearly superior in performance vs alternatives.
- apparently designed by people who disliked existing
  technique or were ignorant of it.
- not a drop-in replacement for alternatives.

From bill at cse.ucdavis.edu Wed Mar 2 15:17:53 2005
From: bill at cse.ucdavis.edu (Bill Broadley)
Date: Tue Jan 6 01:03:53 2009
Subject: [Beowulf] 2.6.11 is out; with InfiBand support
In-Reply-To: <5769e684325b11f3146f648fe06d327f@linuxprophet.com>
References: <20050302173008.GY13336@leitl.org> <5769e684325b11f3146f648fe06d327f@linuxprophet.com>
Message-ID: <20050302231753.GA5857@cse.ucdavis.edu>

On Wed, Mar 02, 2005 at 11:08:19AM -0800, Glen Otero wrote:
> Arima and Iwill have mobos with IB LOM (Landed on Motherboard).
>

Via pci-express?
Or via an HTX[1] slot?

[1] http://www.hypertransport.org/products/productdetail.cfm?RecordID=65

When?

--
Bill Broadley
Computational Science and Engineering
UC Davis

From gotero at linuxprophet.com Wed Mar 2 15:21:19 2005
From: gotero at linuxprophet.com (Glen Otero)
Date: Tue Jan 6 01:03:53 2009
Subject: Fwd: [Beowulf] 2.6.11 is out; with InfiBand support
Message-ID: <0a773778f2243e80397520d276ba56b0@linuxprophet.com>

Begin forwarded message:

> From: Glen Otero
> Date: March 2, 2005 3:20:41 PM PST
> To: Bill Broadley
> Subject: Re: [Beowulf] 2.6.11 is out; with InfiBand support
>
>
> On Mar 2, 2005, at 3:17 PM, Bill Broadley wrote:
>
>> On Wed, Mar 02, 2005 at 11:08:19AM -0800, Glen Otero wrote:
>>> Arima and Iwill have mobos with IB LOM (Landed on Motherboard).
>>>
>>
>> Via pci-express?
>
> PCI-Express
>
>> Or via an HTX[1] slot?
>>
>> [1]
>> http://www.hypertransport.org/products/productdetail.cfm?RecordID=65
>>
>> When?
>
> Available now, according to Mellanox. I've seen pictures of the boards.
>>
>> --
>> Bill Broadley
>> Computational Science and Engineering
>> UC Davis
>>
>>
> Glen Otero Ph.D.
> Linux Prophet
>
>
Glen Otero Ph.D.
Linux Prophet

From jrajiv at hclinsys.com Tue Mar 1 03:07:53 2005
From: jrajiv at hclinsys.com (Rajiv)
Date: Tue Jan 6 01:03:53 2009
Subject: [Beowulf] GRID APPLICATION
Message-ID: <012e01c51e4e$ea4b6860$0f120897@PMORND>

Dear All,

I have set up Globus 3.2 on two machines and I am able to submit a job
from one machine to another. I have a basic question about what
application to run in GRID environments. Shouldn't a GRID application
use the resources of both GRID machines simultaneously? Are there any
applications like this? So far I am only running remote jobs from one
machine to another, e.g. I can submit and run a LINPACK/GROMACS job
from the master of one cluster to the master of another cluster.
Regards,
Rajiv

From jakob at unthought.net Tue Mar 1 07:51:34 2005
From: jakob at unthought.net (Jakob Oestergaard)
Date: Tue Jan 6 01:03:53 2009
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <1109415598.6688.21.camel@ip13.2214.h2.fosdem.lan>
References: <1109319374.6055.17.camel@Vigor45> <1109351864.2883.22.camel@localhost.localdomain> <20050225183146.GA1563@greglaptop.internal.keyresearch.com> <1109415598.6688.21.camel@ip13.2214.h2.fosdem.lan>
Message-ID: <20050301155134.GM347@unthought.net>

On Sat, Feb 26, 2005 at 10:59:57AM +0000, John Hearns wrote:
> On Fri, 2005-02-25 at 10:31 -0800, Greg Lindahl wrote:
>
> >
> > Doesn't make any sense; I have seen people describe such systems where
> > they download a disk image when a batch job wants a different software
> > load. It's certainly doable that way: it does have different tradeoffs
> > from the diskless case, but if it gives you a headache, it's probably
>
> I've always dreamed of using User Mode Linux images for this.
> In a Grid-based world, prepare a UML instance which has all the
> libraries and runtime to run your code. Ship it across the grid with
> your executable.
> The cluster at the receiving end can be running any distribution - it
> runs your UML in a sandbox.

Please see RFC 1925, corollary 6a:
 It is always possible to add another level of indirection.

Coming from truth 6:
 It is easier to move a problem around (for example, by moving the
 problem to a different part of the overall network architecture) than
 it is to solve it.

Your UML is, as its name implies, a user-space application, just like
the real application you were actually trying to run. If your
application cannot be run on a given distro, I pretty much doubt your
UML (which is a very very complex user mode application) will run.

What you want is KISS: Keep It Simple (Stupid)

Don't link to a gazillion libraries if you don't have to.
Link the libraries statically when feasible (gives you a performance
gain in many cases anyway).

A statically linked application, or one with only glibc linked
dynamically, will run on very wide ranges of distributions. Trust me
on this; I make a living from selling an evil capitalistic
closed-source solution which needs to run on a very wide range of
distributions (and no, we do not link glibc statically because we're
not allowed to, but we keep our dependencies minimal and our binaries
do run on a very wide range of distributions).

>
> And before anyone says it, yes performance would be a dog,
> and I don't see how UML could access all those nice Myrinet and
> Infiniband cards. SO I'm definitely blue-skying.

Again; adding layers of indirection is rarely a solution.

--
 / jakob

From peter at cs.usfca.edu Tue Mar 1 09:19:41 2005
From: peter at cs.usfca.edu (Peter Pacheco)
Date: Tue Jan 6 01:03:53 2009
Subject: [Beowulf] Re: Pi calculator
In-Reply-To: <42231589.4080706@scalableinformatics.com>
References: <42231589.4080706@scalableinformatics.com>
Message-ID: <20050301171941.GA5545@cs.usfca.edu>

On Mon, Feb 28, 2005 at 07:58:49AM -0500, Joe Landman wrote:
>
> >>2. Does anybody know of a program that will calculate pi, one digit at a
> >>time, infinitely that will run in parallel?
> >
> >
> >I don't know about one that will compute an infinite number of digits in
> >PI, but the computation of PI via the arctan series is trivially
> >partitionable in a variety of ways. You'll spend more time working to
> >sum and align the digits you get (as they obviously will have to be
> >obtained and manipulated piecewise as strings) than you will doing the
> >computation per se. It actually sounds like a decent exercise, as the
> >carry from small digits may have to propagate iteratively back to larger
> >ones as you extend the computation farther and farther.
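
[A minimal sketch of the "trivially partitionable" arctan idea in the
quoted paragraph above, not anyone's actual program. It assumes
Machin's formula, pi = 16*arctan(1/5) - 4*arctan(1/239), and deals the
series terms out round-robin to a number of workers. The workers are
only simulated in a serial loop here; on a real cluster each partial
sum would be computed on its own rank and combined with a reduction.
Python's arbitrary-precision integers take care of the carry
propagation that the paragraph warns about. The function names and the
guard-digit count are hypothetical choices made for illustration.]

```python
def arctan_inv_partial(x, scale, worker, nworkers):
    """Sum the terms k = worker, worker + nworkers, ... of
    scale * arctan(1/x) = sum_k (-1)^k * scale / ((2k+1) * x^(2k+1)),
    using integer (fixed-point) arithmetic only."""
    total = 0
    k = worker
    # scale / x^(2k+1) for this worker's first term
    power = scale // x ** (2 * k + 1)
    # advancing k by nworkers divides the power by x^(2*nworkers)
    step = (x * x) ** nworkers
    while power:
        term = power // (2 * k + 1)
        total += term if k % 2 == 0 else -term
        k += nworkers
        power //= step
    return total


def pi_digits(ndigits, nworkers=4, guard=10):
    """Return '3' followed by ndigits decimals of pi as a string.
    guard extra digits absorb the truncation error of the integer
    divisions (an assumed, generous safety margin)."""
    scale = 10 ** (ndigits + guard)

    def arctan_inv(x):
        # Each list entry stands in for one worker's partial sum.
        partials = [arctan_inv_partial(x, scale, w, nworkers)
                    for w in range(nworkers)]
        return sum(partials)

    pi_scaled = 16 * arctan_inv(5) - 4 * arctan_inv(239)
    return str(pi_scaled // 10 ** guard)


if __name__ == "__main__":
    print(pi_digits(30))
```

[Because every term is computed the same way regardless of which
worker owns it, the result does not depend on the worker count; only
the final sum needs the full-width numbers.]
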
> >
> >
> http://mathworld.wolfram.com/PiDigits.html
> http://mathworld.wolfram.com/PiFormulas.html
> http://www.andrews.edu/~calkins/physics/Miracle.pdf
>
> and others.
>
> It is possible to calculate the digits individually using the Bailey et
> al algorithm.
>
> Joe

I wrote a short MPI program last summer that uses
the Bailey-Borwein-Plouffe algorithm and the GMP library
(http://www.swox.com/gmp/) to compute arbitrarily many digits of pi.
Jake, send me email (peter@cs.usfca.edu) if you want a copy.

Best wishes,
Peter Pacheco
Department of Computer Science
University of San Francisco
San Francisco, CA 94117
(415) 422-6630
peter@cs.usfca.edu

From steve_heaton at ozemail.com.au Tue Mar 1 16:21:04 2005
From: steve_heaton at ozemail.com.au (steve_heaton@ozemail.com.au)
Date: Tue Jan 6 01:03:53 2009
Subject: [Beowulf] So we will write our own book - next steps...
Message-ID: <20050302002104.EBYF8920.swebmail00.mail.ozemail.net@localhost>

G'day all

I humbly submit my A$0.02 as a novice Beowulf'er.

I don't have a problem with a series of collected articles. I agree
it's a great way to keep the journal/book fresh. The personal styles
of the authors don't present a challenge for me if the content is
good quality. This list is a great example!

I'd like to see "something in front of the punters" rather than aiming
for perfection with little output as a result. Especially as we're
initially looking at a soft format.

Articles need not be long and involved. Some of the gems I've got from
this list are 25 words or less ;) Another advantage of the article
approach.

Once we get to "things", my suggestion for an outline (per topic) is:
-) What is it? (with a bit of background/history etc)
-) How does it work? (Roughly)
-) How do I install/use it?
-) Tricks and tips (solutions to common problems)
-) Where to find more info (net refs, books etc)

Your basic FAQ thang :)

Vendors are in but the editors wield a heavy hand on 'barrow pushing.
The vendors on this list seem good on the education and rarely get
pulled into a p'ing contest. Something rare and beautiful compared to
other lists! They know their kit, I'd like their knowledge and
experience. No doubt they'll be flooded with sales as a result ;)

Ability to download journal/book for offline reading is critical.

Editors are neutral moderators. Eg. They don't side on local HD v's
net boot but will present (all) options without fear nor favour. The
goal is to leave the reader in a position to make an informed
decision :)

I have every intention to contribute. The words are vapour until I
do! =)

Cheers
Stevo

From ddw at dreamscape.com Tue Mar 1 21:02:48 2005
From: ddw at dreamscape.com (Dan Williams)
Date: Tue Jan 6 01:03:53 2009
Subject: [Beowulf] Re: So we will write our own book - next steps...
In-Reply-To: <200503012001.j21K0Yik026268@bluewest.scyld.com>
References: <200503012001.j21K0Yik026268@bluewest.scyld.com>
Message-ID: <422548F8.7010603@dreamscape.com>

The question has been raised as to addressing the needs of beginners,
as well as advanced people. I am about as beginner as you can get. I
have never built or used a cluster, and am a Linux newbie, besides.

If a rank beginner chapter is desired, I volunteer to write it, if
someone can hold my hand while I turn a pair of Pentium 100MHz
motherboards and miscellaneous parts I have in my junk pile into a
working (2 node) cluster. I am pretty good at writing non-fiction if
it's a subject I know or can learn about, but as of now, I only have
the vaguest notions on how to make a functioning cluster.

If there is interest in including a chapter that is detailed enough
but basic enough that someone who knows nothing on the subject can
learn enough to actually build a functioning cluster from junk parts,
then I'm your writer. I'll build a "proof of concept" junkyard cluster
and write about it, if someone can help me figure out how.
DDW

From eno at dorsai.org Wed Mar 2 01:00:43 2005
From: eno at dorsai.org (Alpay Kasal)
Date: Tue Jan 6 01:03:53 2009
Subject: [Beowulf] where can i learn to build a cluster machine?
In-Reply-To:
Message-ID: <0ICP00HHNVJ1JO@mta2.srv.hcvlny.cv.net>

Holy Cow... This message is a keeper. Thanks a million Robert.

Alpay Kasal

-----Original Message-----
From: beowulf-bounces@beowulf.org [mailto:beowulf-bounces@beowulf.org] On Behalf Of Robert G. Brown
Sent: Friday, February 25, 2005 12:54 PM
To: Starship Warrior
Cc: beowulf@beowulf.org
Subject: Re: [Beowulf] where can i learn to build a cluster machine?

...snip...

To give you the direct answer, it goes something like the following:

 a) Hook systems into a common switched LAN e.g. an ethernet switch.

 b) If possible use decent quality PXE-aware NICs

 c) If possible use nodes with a decent amount of installed memory
(>= 192 MB) although it is possible to get by with less, with effort.

 d) Node hard disk is optional for at least some installation methods
(e.g. warewulf) but is useful and enables others.

 e) At least one system NEEDS ample hard disk and will serve as a
"server" or "head node" to your cluster. This node will manage boot
images, the distro you wish to install, NFS or other shared
filesystems, authentication, and gives you a place to "login to the
cluster". Note that this is a sloppy requirement -- there are many
different ways to manage this and I'm just describing one of the
simplest and most straightforward ones.

...snip...

 rgb

--
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu

From atp at piskorski.com Wed Mar 2 18:06:00 2005
From: atp at piskorski.com (Andrew Piskorski)
Date: Tue Jan 6 01:03:53 2009
Subject: [Beowulf] 2.6.11 is out; with InfiBand support
In-Reply-To:
References:
Message-ID: <20050303020600.GA56437@piskorski.com>

On Wed, Mar 02, 2005 at 06:09:09PM -0500, Mark Hahn wrote:
> > Arima and Iwill have mobos with IB LOM (Landed on Motherboard).
>
> given the choice between a $150 pcie IB nic and having it onboard,
> I'd choose the separate card. I know the IB salesdroids always

Except, a single PCI-X Infiniband card currently costs $1000 or so,
right? (That's for a 4x 2 port card, but Froogle does not seem to
know of any cheaper cards.)

 http://h30094.www3.hp.com/product.asp?sku=2603660&jumpid=ex_r2910_frooglesmb/accessories
 http://www.costcentral.com/proddetail/HP_NC570C/376158B21/F35425/froogle/

--
Andrew Piskorski
http://www.piskorski.com/

From rene at renestorm.de Wed Mar 2 19:05:54 2005
From: rene at renestorm.de (rene)
Date: Tue Jan 6 01:03:53 2009
Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv?
In-Reply-To: <1109659454.6544.2.camel@localhost.localdomain>
References: <1109659454.6544.2.camel@localhost.localdomain>
Message-ID: <200503030405.54791.rene@renestorm.de>

Hi Michael,

in my opinion, this would be a gather with length 1 but sent 4 times.
This seems to be the easiest and slowest way.

If I'm not totally wrong your interleaving looks like an Alltoall
followed by a reduce operation, but why don't you sort the recv buffer
afterwards?

Cu
Rene

> Dear List,
>
> I would like to gather the data from several processes.
> Instead of the commonly used stride, I want to interleave
> the data:
>
> Rank 0: AAAAA -> ABCDABCDABCDABCDABCD
> Rank 1: BBBBB ----^---^---^---^---^
> Rank 2: CCCCC -----^---^---^---^---^
> Rank 3: DDDDD ------^---^---^---^---^
>
> Since the stride of the receive type is indicated
> in multiples of its mpi_type, no interleaving is
> possible (the smallest striping factor leads to
> AAAAABBBBBCCCCCDDDDD).
>
> Is there a way to achieve this behaviour in an
> elegant way, as MPI_Gather promises it? Or do
> I need to do Send/Recv with self-aligned offsets?
>
> Thank you for your help!
>
> Michael
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

From felix.rauch.valenti at gmail.com Wed Mar 2 17:00:49 2005
From: felix.rauch.valenti at gmail.com (Felix Rauch Valenti)
Date: Tue Jan 6 01:03:53 2009
Subject: [Beowulf] Re: Pi calculator
In-Reply-To: <20050301171941.GA5545@cs.usfca.edu>
References: <42231589.4080706@scalableinformatics.com> <20050301171941.GA5545@cs.usfca.edu>
Message-ID: <4eafc81b050302170064486203@mail.gmail.com>

On Tue, 1 Mar 2005 09:19:41 -0800, Peter Pacheco wrote:
> I wrote a short MPI program last summer that uses
> the Bailey-Borwein-Plouffe algorithm and the GMP library
> (http://www.swox.com/gmp/) to compute arbitrarily many digits of pi.
> Jake, send me email (peter@cs.usfca.edu) if you want a copy.

If somebody really wants to spend zillions of cycles on calculating Pi
just for fun, you could also look for non-random patterns in Pi on the
way. Maybe you will become famous one day.
(insert reference to Carl Sagan's "Contact" here)

- Felix

From joachim at ccrl-nece.de Thu Mar 3 01:14:40 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Tue Jan 6 01:03:53 2009
Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv?
In-Reply-To: <1109659454.6544.2.camel@localhost.localdomain>
References: <1109659454.6544.2.camel@localhost.localdomain>
Message-ID: <4226D580.2010206@ccrl-nece.de>

Michael Gauckler wrote:
> Dear List,
>
> I would like to gather the data from several processes.
> Instead of the commonly used stride, I want to interleave
> the data:
>
> Rank 0: AAAAA -> ABCDABCDABCDABCDABCD
> Rank 1: BBBBB ----^---^---^---^---^
> Rank 2: CCCCC -----^---^---^---^---^
> Rank 3: DDDDD ------^---^---^---^---^
>
> Since the stride of the receive type is indicated
> in multiples of its mpi_type, no interleaving is
> possible (the smallest striping factor leads to
> AAAAABBBBBCCCCCDDDDD).
>
> Is there a way to achieve this behaviour in an
> elegant way, as MPI_Gather promises it? Or do
> I need to do Send/Recv with self-aligned offsets?

Actually, I don't see an 'elegant' way to do this, either. The decision between multiple MPI_Gatherv() calls and an Irecv/Send/Waitall construct depends on the quality of the MPI implementation you use (MPI_Gatherv can be optimized well for small amounts of data), the characteristics of your interconnect (high latency gives more room for optimization) and the number of processes you use. For small process numbers, you won't see much of a difference anyway.

You could also try to gather all data on the root in separate buffers, and then let this process send/recv to itself using the proper datatypes.

Finally, if this communication is not a significant part of your runtime, you shouldn't spend much time optimizing it anyway.

Joachim

--
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de

From rgb at phy.duke.edu Thu Mar 3 04:54:46 2005
From: rgb at phy.duke.edu (Robert G.
Brown) Date: Tue Jan 6 01:03:53 2009 Subject: [Beowulf] Re: Pi calculator In-Reply-To: <4eafc81b050302170064486203@mail.gmail.com> References: <42231589.4080706@scalableinformatics.com> <20050301171941.GA5545@cs.usfca.edu> <4eafc81b050302170064486203@mail.gmail.com> Message-ID: On Thu, 3 Mar 2005, Felix Rauch Valenti wrote: > On Tue, 1 Mar 2005 09:19:41 -0800, Peter Pacheco wrote: > > I wrote a short MPI program last summer that uses > > the Bailey-Borwein-Plouffe algorithm and the GMP library > > (http://www.swox.com/gmp/) to compute arbitrarily many digits of pi. > > Jake, send me email (peter@cs.usfca.edu) if you want a copy. > > If somebody really wants to spend zillions of cycles on calculating Pi > just for fun, you could also look for non-random patterns in Pi on the > way. Maybe you will become famous one day. > (insert reference to Carl Sagan's "Contact" here) Just be sure that you look with a powerful statistical tool -- remembering those damnable typing monkeys. Pi is well known to have all sorts of non-random-looking patterns in it. Distributed (as far as all studies done to date that I found referenced on the web) completely randomly...;-) Wait! I see a cloud that looks like the Virgin Mary! Gotta go and write the Enquirer...:-) rgb (Still haven't taken my medicine this morning, and Deadline hisself is already pre-emptively hassling me for the column I haven't written yet for May:-) (I just HAVE to quit playing WoW until 2 am before cumulative sleep deprivation slays me like a dragon did last night.) (Hmmm, combine business with pleasure? Maybe I'll try to contact the WoW folks and get some detail about their realm cluster. That would make a nifty article for June...) (Damn, my interior monologue isn't working this morning. 
Must sleep...:-) > > - Felix > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Thu Mar 3 05:20:57 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue Jan 6 01:03:53 2009 Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv? In-Reply-To: <4226D580.2010206@ccrl-nece.de> References: <1109659454.6544.2.camel@localhost.localdomain> <4226D580.2010206@ccrl-nece.de> Message-ID: On Thu, 3 Mar 2005, Joachim Worringen wrote: > Michael Gauckler wrote: > > Dear List, > > > > I would like to gather the data from several processes. > > Instead of the comonly used stride, I want to interleave > > the data: > > > > Rank 0: AAAAA -> ABCDABCDABCDABCDABCD > > Rank 1: BBBBB ----^---^---^---^---^ > > Rank 2: CCCCC -----^---^---^---^---^ > > Rank 3: DDDDD ------^---^---^---^---^ > > > > Since the stride of the receive type is indicated > > in multpiles of its mpi_type, no interleaving is > > possible (the smallest striping factor leads to > > AAAAABBBBBBCCCCCDDDDD). > > > > Is there a way to achieve this behaviour in an > > elegant way, as MPI_Gather promises it? Or do > > I need to do Send/Recv with self-aligned offsets? What about RMA-like commands? MPI_Get in a loop? Since that is controlled by the gatherer, one would presume that it preserves call order (although it is non-blocking). Or of course there are always raw sockets... where you have complete control. Depending on how critical it is that you preserve this strict interleaving order. rgb > > Actually, I don't see an 'elegant' way to do this, either. 
The decision > between multiple MPI_Gatherv() calls and a Irecv/Send/Waitall construct > depends on the quality of the MPI implementation you use (MPI_Gatherv > can be optimized well for small amounts of data), the characteristics of > you interconnect (high latency gives more room for optimization) and the > number of processes you use. For small process numbers, you wont see > much of a difference anyway. > > You could also try to gather all data on the root in separate buffers, > and then let this process send/recv to itself using the proper datatypes. > > Finally, if this communication is not a significant part of your > runtime, you shouldn't spend much time optimizing it anyway. > > Joachim > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From eugen at leitl.org Thu Mar 3 05:42:20 2005 From: eugen at leitl.org (Eugen Leitl) Date: Tue Jan 6 01:03:53 2009 Subject: [Beowulf] purchasing sources for Newisys 2100 (and 4300)? Message-ID: <20050303134219.GB13336@leitl.org> Question to resident hardware purchasers: where do you get your Newisys systems, in the EU (Germany, especially)? Small quantities. The company I work for has good prices for Sun V20z iron, but naturally I'm looking for better deals, especially with large memory configurations. I know this is the wrong place to ask, but I can't find a lead on the web. Thanks, -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: not available
Url : http://www.scyld.com/pipermail/beowulf/attachments/20050303/f67aa6aa/attachment.bin

From gropp at mcs.anl.gov Thu Mar 3 06:23:10 2005
From: gropp at mcs.anl.gov (William Gropp)
Date: Tue Jan 6 01:03:53 2009
Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv?
In-Reply-To: <1109659454.6544.2.camel@localhost.localdomain>
References: <1109659454.6544.2.camel@localhost.localdomain>
Message-ID: <6.2.1.2.2.20050303081639.098a3c10@pop.mcs.anl.gov>

At 12:44 AM 3/1/2005, Michael Gauckler wrote:
>Dear List,
>
>I would like to gather the data from several processes.
>Instead of the commonly used stride, I want to interleave
>the data:
>
>Rank 0: AAAAA -> ABCDABCDABCDABCDABCD
>Rank 1: BBBBB ----^---^---^---^---^
>Rank 2: CCCCC -----^---^---^---^---^
>Rank 3: DDDDD ------^---^---^---^---^
>
>Since the stride of the receive type is indicated
>in multiples of its mpi_type, no interleaving is
>possible (the smallest striping factor leads to
>AAAAABBBBBCCCCCDDDDD).
>
>Is there a way to achieve this behaviour in an
>elegant way, as MPI_Gather promises it? Or do
>I need to do Send/Recv with self-aligned offsets?

You should be able to do this with MPI_Gather by creating a new datatype on the receiving process whose extent is the size of a single item; that will get you the correct offset for the first element. In order to receive the subsequent elements into the desired location, you need to use a vector type containing the number of elements. And for this to be fast, you need an MPI implementation that will handle the "resized" datatype efficiently (use MPI_Type_vector to create the full datatype and MPI_Type_create_resized to change its effective extent). If you are moving large amounts of data, separate send/recvs are probably a better choice.

Bill

>Thank you for your help!
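
As a sketch of the recipe Bill describes (not taken from his message): the placement rule can be modelled in a few lines of plain Python, without MPI. The function names below are hypothetical, and the mapping to MPI_Type_vector, MPI_Type_create_resized and MPI_Gather in the comments is an assumed reading of his description, with offsets counted in elements rather than bytes.

```python
# Pure-Python model of the receive-side placement, NOT real MPI code.
# Assumed mapping to the MPI calls named in the message:
#   MPI_Type_vector(count, 1, nprocs, elem, &vec)     -> vector_offsets()
#   MPI_Type_create_resized(vec, 0, one element, &t)  -> extent = 1
#   MPI_Gather(send, count, elem, recv, 1, t, root, comm)


def vector_offsets(count, stride):
    """Offsets (in elements) touched by a vector of `count` blocks of length 1."""
    return [k * stride for k in range(count)]


def gather_interleaved(send_buffers):
    """Model a gather whose receive type is the vector resized to one element.

    send_buffers[r] is the data contributed by rank r; all have equal length.
    """
    nprocs = len(send_buffers)
    count = len(send_buffers[0])
    extent = 1  # resized extent, in elements (unit chosen for this model)
    offsets = vector_offsets(count, stride=nprocs)
    recv = [None] * (nprocs * count)
    for rank, data in enumerate(send_buffers):
        # A gather with recvcount 1 places rank r's block at r * extent.
        base = rank * extent
        for off, item in zip(offsets, data):
            recv[base + off] = item
    return recv


if __name__ == "__main__":
    print("".join(gather_interleaved(["AAAAA", "BBBBB", "CCCCC", "DDDDD"])))
```

With the four buffers from the question this prints ABCDABCDABCDABCDABCD. Without the resize, the vector's natural extent would be (count-1)*nprocs + 1 elements, so in this model rank 1's block would start well past rank 0's instead of one element in, which is presumably why the resized extent is the key step.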
> > Michael > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf William Gropp http://www.mcs.anl.gov/~gropp From rross at mcs.anl.gov Thu Mar 3 08:35:37 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Tue Jan 6 01:03:53 2009 Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv? In-Reply-To: References: <1109659454.6544.2.camel@localhost.localdomain> <4226D580.2010206@ccrl-nece.de> Message-ID: <42273CD9.5030503@mcs.anl.gov> Robert G. Brown wrote: > On Thu, 3 Mar 2005, Joachim Worringen wrote: > >>Michael Gauckler wrote: >> >>>I would like to gather the data from several processes. >>>Instead of the comonly used stride, I want to interleave >>>the data: >>> >>>Rank 0: AAAAA -> ABCDABCDABCDABCDABCD >>>Rank 1: BBBBB ----^---^---^---^---^ >>>Rank 2: CCCCC -----^---^---^---^---^ >>>Rank 3: DDDDD ------^---^---^---^---^ >>> >>>Since the stride of the receive type is indicated >>>in multpiles of its mpi_type, no interleaving is >>>possible (the smallest striping factor leads to >>>AAAAABBBBBBCCCCCDDDDD). >>> >>>Is there a way to achieve this behaviour in an >>>elegant way, as MPI_Gather promises it? Or do >>>I need to do Send/Recv with self-aligned offsets? > > > What about RMA-like commands? MPI_Get in a loop? Since that is > controlled by the gatherer, one would presume that it preserves call > order (although it is non-blocking). I would hope that one would read the spec instead! MPI_Get()s don't necessarily *do* anything until the corresponding synchronization call. This allows the implementation to aggregate messages. Call order (of the MPI_Get()s in an epoch) is ignored. > Or of course there are always raw sockets... where you have complete > control. Depending on how critical it is that you preserve this strict > interleaving order. > > rgb No you don't! 
You're just letting the kernel buffer things instead of the MPI implementation. Plus, Michael's original concern was doing this in an elegant way, not explicitly controlling the ordering. Joachim had some good options for MPI. Regards, Rob From rross at mcs.anl.gov Thu Mar 3 08:36:48 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Tue Jan 6 01:03:53 2009 Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv? In-Reply-To: <6.2.1.2.2.20050303081639.098a3c10@pop.mcs.anl.gov> References: <1109659454.6544.2.camel@localhost.localdomain> <6.2.1.2.2.20050303081639.098a3c10@pop.mcs.anl.gov> Message-ID: <42273D20.2000702@mcs.anl.gov> William Gropp wrote: > At 12:44 AM 3/1/2005, Michael Gauckler wrote: > >> Dear List, >> >> I would like to gather the data from several processes. >> Instead of the comonly used stride, I want to interleave >> the data: >> >> Rank 0: AAAAA -> ABCDABCDABCDABCDABCD >> Rank 1: BBBBB ----^---^---^---^---^ >> Rank 2: CCCCC -----^---^---^---^---^ >> Rank 3: DDDDD ------^---^---^---^---^ >> >> Since the stride of the receive type is indicated >> in multpiles of its mpi_type, no interleaving is >> possible (the smallest striping factor leads to >> AAAAABBBBBBCCCCCDDDDD). >> >> Is there a way to achieve this behaviour in an >> elegant way, as MPI_Gather promises it? Or do >> I need to do Send/Recv with self-aligned offsets? > > > You should be able to do this with MPI_Gather by creating a new datatype > on the receiving process whose extent is the size of a single item; that > will get you the correct offset for the first element. In order to > receive the subsequent elements into the desired location, you need to > use a vector type containing the number of elements. And for this to be > fast, you need an MPI implementation that will handle the "resized" > datatype efficiently (use MPI_Type_vector to create the full datatype > and MPI_Type_create_resized to change its effective extent). 
If you are > moving large amounts of data, separate send/recvs are probably a better > choice. > > Bill Nice! Rob From joachim at ccrl-nece.de Thu Mar 3 08:47:23 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Tue Jan 6 01:03:53 2009 Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv? In-Reply-To: <6.2.1.2.2.20050303081639.098a3c10@pop.mcs.anl.gov> References: <1109659454.6544.2.camel@localhost.localdomain> <6.2.1.2.2.20050303081639.098a3c10@pop.mcs.anl.gov> Message-ID: <42273F9B.4040506@ccrl-nece.de> William Gropp wrote: > You should be able to do this with MPI_Gather by creating a new datatype > on the receiving process whose extent is the size of a single item; that > will get you the correct offset for the first element. In order to > receive the subsequent elements into the desired location, you need to > use a vector type containing the number of elements. And for this to be > fast, you need an MPI implementation that will handle the "resized" > datatype efficiently (use MPI_Type_vector to create the full datatype > and MPI_Type_create_resized to change its effective extent). If you are > moving large amounts of data, separate send/recvs are probably a better > choice. Oh yes, I forgot, twiddling with LB and UB. I never liked this, esp. as an MPI implementor. Not especially 'elegant', but it should work. Good conformance test, BTW. Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From rsweet at aoes.com Thu Mar 3 00:29:56 2005 From: rsweet at aoes.com (Ryan Sweet) Date: Tue Jan 6 01:03:53 2009 Subject: [Beowulf] GRID APPLICATION In-Reply-To: <012e01c51e4e$ea4b6860$0f120897@PMORND> References: <012e01c51e4e$ea4b6860$0f120897@PMORND> Message-ID: On Tue, 1 Mar 2005, Rajiv wrote: > Dear All, > I have setup Globus 3.2 on two machines and I am able to submit job from > one machine to another. 
I have a basic doubt about what application to run
> in GRID environments. Shouldn't the GRID application use resources of both
> the GRID machines simultaneously? Are there any applications like this? So
> far I am only running remote jobs from one machine to another - e.g. I can
> submit
> and run a LINPACK/GROMACS job from one master of a cluster to a master of
> another cluster.

Dear Rajiv,

There seem to be a lot of people building clusters and grid systems lately without applications to run on them ;-). That's nice to see, I guess, inasmuch as it indicates the broad level of acceptance for the technologies. It is very much the reverse of the "to scratch an itch" way in which things used to be done. ;-)

Grid systems (the term means many things to many people - here I mean roughly "a collection of resources used in collaboration spanning multiple geographic locations and administrative domains") are used in a wide variety of ways.

In your scenario, if you are building a globus system in order to learn about globus, and you can now run jobs on one host from another and vice-versa, you've already got a lot of the hard work done.

If you would like to use multiple grid resources to simultaneously work on a larger problem than any of them can tackle when working alone, then you need a way to take your problem and partition it into chunks that can be submitted to various resources around the grid; in your example, split a larger problem in two, and submit half to each resource. It is not usually practical (though there are exceptions) to run jobs which have a parallel communication component (MPI or PVM) across grid resources (submitting multiple local MPI jobs to multiple resources is ok though, provided that you have a way to verify that the resources can accept your MPI jobs and run them - that's where RSL and a broker, etc. come in).

Some middleware to broker between the requirements of your job and the available resources is usually used.
There are a lot of projects that do that, and any attempt I would make to list them would surely leave out one or more deserving ones. Google is your friend.

For GROMACS there are lots of examples out there. Here is a very friendly one from the UK NGS:
http://www.ngs.ac.uk/sites/ox/software/gromacs.html

regards,
-Ryan

--
Ryan Sweet
Advanced Operations and Engineering Services
AOES Group BV http://www.aoes.com
Phone +31(0)71 5795521 Fax +31(0)71572 1277

From rsweet at aoes.com Thu Mar 3 00:52:32 2005
From: rsweet at aoes.com (Ryan Sweet)
Date: Tue Jan 6 01:03:53 2009
Subject: [Beowulf] So we will write our own book - next steps...
In-Reply-To: <20050302002104.EBYF8920.swebmail00.mail.ozemail.net@localhost>
References: <20050302002104.EBYF8920.swebmail00.mail.ozemail.net@localhost>
Message-ID:

The response to this thread has been great! I am keeping track of all the responses, and will try to present some sort of overview.

I have a system ready for hosting. What I'd like to do is to review a few different wikis/collaboration systems/etc. to check on some issues such as their security risks, ease of installation/maintenance, printing/offline reading support, and so on. If you have a preference please send me suggestions for consideration off-list. Either I'll set up a few of the best ones and then we can choose among them, or if I don't have time I'll just set up the one I think is the best, and if people who are actually contributing don't like it we can discuss changing it at that time.

I think I will have something by Tuesday, though maybe earlier.

If you are thinking of writing something, please go have a look at the (older but weathering well) FAQ on beowulf.org and then at Robert Brown's book first and, to avoid re-inventing the wheel, update, acknowledge, and borrow.
regards, -Ryan -- Ryan Sweet Advanced Operations and Engineering Services AOES Group BV http://www.aoes.com Phone +31(0)71 5795521 Fax +31(0)71572 1277 From djholm at fnal.gov Thu Mar 3 06:58:33 2005 From: djholm at fnal.gov (Don Holmgren) Date: Tue Jan 6 01:03:53 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: <20050303020600.GA56437@piskorski.com> References: <20050303020600.GA56437@piskorski.com> Message-ID: I was just quoted quantity 2 PCI-X HCA's (4x 2 port Mellanox) by a reseller for $655 each. Last Fall we purchased a large quantity of PCI-E HCA's for considerably less than that unit price. Supposedly the "memory free" PCI-E HCA's that use host memory, rather than on-board sram, should move prices towards $100 sometime this year - we'll see (landed on motherboard pricing at ~ $70, see http://www.mellanox.com/news/press/pr_030105.html) Like other high performance network gear, it's tough to get accurate pricing information without going out and getting quotes. Don Holmgren On Wed, 2 Mar 2005, Andrew Piskorski wrote: > On Wed, Mar 02, 2005 at 06:09:09PM -0500, Mark Hahn wrote: > > > Arima and Iwill have mobos with IB LOM (Landed on Motherboard). > > > > given the choice between a $150 pcie IB nic and having it onboard, > > I'd choose the separate card. I know the IB salesdroids always > > Except, a single PCI-X Infiniband card currently costs $1000 or so, > right? (That's for a 4x 2 port card, but Froogle does not seem to > know of any cheaper cards.) 
>
> http://h30094.www3.hp.com/product.asp?sku=2603660&jumpid=ex_r2910_frooglesmb/accessories
> http://www.costcentral.com/proddetail/HP_NC570C/376158B21/F35425/froogle/
>
> --
> Andrew Piskorski
> http://www.piskorski.com/
>

From list-beowulf at onerussian.com Thu Mar 3 08:33:59 2005
From: list-beowulf at onerussian.com (Yaroslav Halchenko)
Date: Tue Jan 6 01:03:53 2009
Subject: [Beowulf] DCC (debian cluster components)
Message-ID: <20050303163359.GJ4482@washoe.rutgers.edu>

Dear Debianized beowulfers or beowulfized Debian users,

Does anyone have experience with http://www.irb.hr/en/cir/projects/dcc/ which was recently released?

Project Goals

We expect to integrate some existing technologies (like LDAP, System Installation Suite, Torque, C3...) and develop a production-grade toolset for easier cluster management, based on Debian GNU/Linux distribution. This involves development of automation mechanisms that provide a flexible platform for high-performance computation tasks, but also give the system administrator a secure, easy to maintain, reliable and well supported cluster administration toolbox, based on Debian GNU/Linux.

--
                                  .-.
=------------------------------   /v\  ----------------------------=
Keep in touch                    // \\     (yoh@|www.)onerussian.com
Yaroslav Halchenko              /(   )\               ICQ#: 60653192
                   Linux User    ^^-^^    [175555]
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : http://www.scyld.com/pipermail/beowulf/attachments/20050303/f24dcef6/attachment.bin

From rgb at phy.duke.edu Thu Mar 3 10:35:08 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Tue Jan 6 01:03:53 2009
Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv?
In-Reply-To: <42273CD9.5030503@mcs.anl.gov>
References: <1109659454.6544.2.camel@localhost.localdomain> <4226D580.2010206@ccrl-nece.de> <42273CD9.5030503@mcs.anl.gov>
Message-ID:

On Thu, 3 Mar 2005, Rob Ross wrote:

> > What about RMA-like commands? MPI_Get in a loop? Since that is
> > controlled by the gatherer, one would presume that it preserves call
> > order (although it is non-blocking).
>
> I would hope that one would read the spec instead! MPI_Get()s don't
> necessarily *do* anything until the corresponding synchronization call.
> This allows the implementation to aggregate messages. Call order (of
> the MPI_Get()s in an epoch) is ignored.

Ouch! I did read the spec, about ten seconds before replying, and note that I SAID it was non-blocking (from the spec) and was thinking about using it with sync's. However, yes, this is a bit oxymoronic on a reread (preserve call order vs non-blocking? Jeeze:-) and I consider myself whomped upside the head:-)

Still, what IS to prevent him from alternating gets and synchronization calls while retrieving? I don't think there is any alternative to doing this in ANY non-blocking, potentially aggregating or parallel communications scenario, although where he puts the barrier might vary. Depending on whether he cares about the (ABCD)(ABCD)... order per se he might try (however he does the "get"ting or receiving, or whatever):

  get A (sync) get B (sync) get C (sync)

(inefficient but absolutely guarantees loop order). If (ABCD)(BADC)(DCAB)... is ok (he doesn't care what order they arrive in but he doesn't want the A communications to be aggregated so that two A's get there before he gets the BCD from the same cycle of computations) then he should be able to do:

  get A get B get C get D (sync) get A get B get C...

If I understand things correctly (where there is a very definite chance that I don't!) these (on top of any library) will in the first case be equivalent to a blocking TCP read from A, B, C...
in order (but probably not as efficient as TCP would be in this particular case because MPI is optimized against the near diametrically opposite assumption) while the second would be equivalent to using select to monitor a group of open sockets for I/O, reading from them in the order that data becomes available, but adding a toggle so you don't read from one twice before reading from all of them.

There is likely more than one way to do it, though, and in low-level programming one might well want to implement handshaking of some sort to trigger the next cycle's send (blocking the remote client's execution as necessary) to avoid overrunning buffers or exhausting memory on the master/aggregator if for any reason one host turns out to be "slow" relative to the others. In MPI one hopes all of that is handled for you, and more.

> > Or of course there are always raw sockets... where you have complete
> > control. Depending on how critical it is that you preserve this strict
> > interleaving order.
> >
> > rgb
>
> No you don't! You're just letting the kernel buffer things instead of
> the MPI implementation. Plus, Michael's original concern was doing this
> in an elegant way, not explicitly controlling the ordering.

Of course one has maximum control with raw sockets (or more generically, raw devices). Somewhere down inside MPI there ARE raw sockets (or equivalent at some level in the ISO/OSI stack). The MPI library wraps them up and hides all sorts of details while making all the device features uniform across the now "invisible" devices and in the process necessarily excluding one from ACCESSING all of those features a.k.a. the details being hidden. I may have misunderstood the recent discussion about the possible need for an MPI ABI but I thought that was what it was all about -- providing a measure of low level control that is BY DESIGN hidden by the API but is there, should one ever try to code for it, in the actual kernel/device interface (e.g.
regulated by ioctls).

Note that at this low (device driver) level I would expect the kernel to handle at least some asynchronous low-level buffering and the primary interrupt processing for the physical device layer FOR the MPI implementation or any other program that uses the device -- you cannot safely escape this. This does not mean that you cannot control just where you stop using the kernel and rely on your own intermediary layers for handling the device above the level of raw read/write plus ioctls. That is, the application layer or higher order kernel networking layers (depending on just where and how you access the device itself) may well manage buffers of their own, reliability, retransmission, RDMA, blocking/non-blocking of the interface, timeouts, and more. Low level networking is not easy, which is WHY people wrote PVM and the MPI network devices.

So ultimately, all I was observing is that it is pretty straightforward (not necessarily easy, but straightforward) to write an application that very definitely and without any question goes and gets a chunk of data (say, contents of a struct) from an open socket on system A and puts it in a memory location (with appropriate pointers and sizeof and so forth for the struct), THEN gets a chunk of data from an open socket on system B and puts it in the next memory location, THEN gets a chunk of data from an open socket on system C and puts it ...

In fact, since TCP generally does block on a read until there is data on the socket, it is relatively difficult to do it any other way in a simple loop over sockets -- you have to use select as noted above to avoid polling and non-blocking I/O, and in all cases one has to be pretty careful not to drop things and to handle cases where a stream runs slow or does other bad things. As I've learned the hard way.

As far as elegance is concerned:

a) That's a bit in the eye of the beholder.
There are tradeoffs between simplicity of code and ease of development work vs performance and control but it is hard to say which is more "elegant". It's fairer to make the value-neutral statement that you have to work much harder to write a parallel application on top of raw sockets (no question at all;-) but have all the control and optimization potential available to userspace (at the ioctl level above the kernel and device driver itself) if you do so.

To cite a metaphorical situation, is coding in assembler, whether one is coding a complete application or an embedded optimized operation, "inelegant"? Perhaps, but that's not the word I would have used. There are times when assembler is very elegant, in the sense that it directly encodes an algorithm with the greatest degree of optimization and control where a compiler might well generate indifferent code or fail to use all the features of the hardware. Once upon a time many many years ago I hand coded e.g. trig functions and arithmetic for the 8087 in assembler because my compiler generated 8088 code that ran about ten times more slowly. Elegant or inelegant?

Compare to just this situation -- if for some reason you require e.g. absolute control over the order of utilization of your network links in a parallel computation (perhaps to avoid collisions or device/line contention, to do something exotic with transmission order and pattern on a hypercubical network) you may well find that MPI or PVM simply do not provide that degree of control, period. They try to "do the right thing" for a generic class of problem and simple assumptions as to the kind and number of interfaces and routes between nodes and load patterns along those routes, BUT there is no >>guarantee<< that the right thing they end up with (often chosen for robustness and optimization for the most common classes of problems) will be right for YOUR problem and network and no way to tweak it if it is not.
In that case, using raw network devices (whatever they might be) might well be the only way to achieve the requisite level of control and yes, might be worth a factor of 10 in speed.

I'll bet money that if you polled the list, you'd find that there exist people who have gone in and hacked MPI at the source level to "break" it (de-optimize it for the most common applications so they run worse) or who have run over time several versions of MPI including "new and improved" ones, who have found empirically that there are applications for which the hacked/older "disimproved" versions perform better.

b) Anyway, this explains why I mentioned raw sockets at the end. Note well the "Depending on how..." Maybe I read the original message incorrectly, but I thought that the issue was that (for reasons unknown) he wanted to guarantee collection in the order A then B then C... Why he wanted to do this wasn't clear, nor was it clear whether (in any given cycle) it would be ok to do A then C then B then... (and just not overlap the next A). If the strict interleaving wasn't an issue, then I would have thought just putting a barrier at the end of an ABC...Z cycle would have forced all communications to complete before starting the next cycle.

So IF this really IS a critical requirement -- he has to read from A, complete the read blocking no fooling, move on and read from B, etc, no data parallelism or asynchronicity permitted that might violate this strict order (or if he's interleaving communications on four different network devices along different routes to different sub-clusters of nodes), then doing it within MPI might or might not be efficient. Raw TCP sockets (or lower level hardware-dependent I/O) would be a PITA to code, but you can pretty much guarantee that the resulting code is as efficient as possible, given the requirement, and it might be the ONLY way to accomplish a complicated interleave of node I/O for some very specific set of reasons.
If you do the considerable work required to make it so, of course, copy of the complete works of Stevens in hand...;-)

> Joachim had some good options for MPI.

I agree. I don't even disagree with what you say above -- I understand what you mean. I just think that we need more data before concluding that those options were enough. He described his design goal but not his motivation. For some design goals there are probably lots of good ways to do it in MPI.

rgb

-- 
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu

From rgb at phy.duke.edu Thu Mar 3 10:40:07 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Tue Jan 6 01:03:53 2009
Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv?
In-Reply-To: <42273D20.2000702@mcs.anl.gov>
References: <1109659454.6544.2.camel@localhost.localdomain> <6.2.1.2.2.20050303081639.098a3c10@pop.mcs.anl.gov> <42273D20.2000702@mcs.anl.gov>
Message-ID:

On Thu, 3 Mar 2005, Rob Ross wrote:

OK, having re-reread everything, I conclude that you were completely right after all. I misunderstood what his question was. I'm still not certain that I understand, but if Bill has answered it, it definitely isn't what I thought.

So double-whomp. I'll go sleep now.

rgb

>
>
> William Gropp wrote:
> > At 12:44 AM 3/1/2005, Michael Gauckler wrote:
> >
> >> Dear List,
> >>
> >> I would like to gather the data from several processes.
> >> Instead of the commonly used stride, I want to interleave
> >> the data:
> >>
> >> Rank 0: AAAAA -> ABCDABCDABCDABCDABCD
> >> Rank 1: BBBBB ----^---^---^---^---^
> >> Rank 2: CCCCC -----^---^---^---^---^
> >> Rank 3: DDDDD ------^---^---^---^---^
> >>
> >> Since the stride of the receive type is indicated
> >> in multiples of its mpi_type, no interleaving is
> >> possible (the smallest striping factor leads to
> >> AAAAABBBBBCCCCCDDDDD).
> >> > >> Is there a way to achieve this behaviour in an > >> elegant way, as MPI_Gather promises it? Or do > >> I need to do Send/Recv with self-aligned offsets? > > > > > > You should be able to do this with MPI_Gather by creating a new datatype > > on the receiving process whose extent is the size of a single item; that > > will get you the correct offset for the first element. In order to > > receive the subsequent elements into the desired location, you need to > > use a vector type containing the number of elements. And for this to be > > fast, you need an MPI implementation that will handle the "resized" > > datatype efficiently (use MPI_Type_vector to create the full datatype > > and MPI_Type_create_resized to change its effective extent). If you are > > moving large amounts of data, separate send/recvs are probably a better > > choice. > > > > Bill > > Nice! > > Rob > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From lindahl at pathscale.com Thu Mar 3 10:55:16 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Tue Jan 6 01:03:53 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: References: <20050303020600.GA56437@piskorski.com> Message-ID: <20050303185516.GB1453@greglaptop.internal.keyresearch.com> On Thu, Mar 03, 2005 at 08:58:33AM -0600, Don Holmgren wrote: > I was just quoted quantity 2 PCI-X HCA's (4x 2 port Mellanox) by a > reseller for $655 each. Last Fall we purchased a large quantity of > PCI-E HCA's for considerably less than that unit price. 
> > Supposedly the "memory free" PCI-E HCA's that use host memory, rather > than on-board sram, should move prices towards $100 sometime this > year - we'll see (landed on motherboard pricing at ~ $70, see > http://www.mellanox.com/news/press/pr_030105.html) You're mixing retail price with wholesale price. The $69 price is apparently quantity 10,000, for just the chip, and the card that is inexpensive will be PCIe 4X, which hurts performance. I hear that today's street price for Mellanox-based cards is ~$600 in cluster-sized quantities, which matches what you report. > Like other high performance network gear, it's tough to get > accurate pricing information without going out and getting quotes. One nice thing about Myricom is that their prices have always been on the web -- all you need to know in addition is your discount. -- greg From rross at mcs.anl.gov Thu Mar 3 11:17:50 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Tue Jan 6 01:03:53 2009 Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv? In-Reply-To: References: <1109659454.6544.2.camel@localhost.localdomain> <6.2.1.2.2.20050303081639.098a3c10@pop.mcs.anl.gov> <42273D20.2000702@mcs.anl.gov> Message-ID: <422762DE.1090209@mcs.anl.gov> Sleep is good :). Robert G. Brown wrote: > On Thu, 3 Mar 2005, Rob Ross wrote: > > OK, having re-reread everything, I conclude that you were completely > right after all. I misunderstood what his question was. I'm still not > certain that I understand, but if Bill has answered it it definitely > isn't what I though. > > So double-whomp. I'll go sleep now. > > rgb > > >> >>William Gropp wrote: >> >>>At 12:44 AM 3/1/2005, Michael Gauckler wrote: >>> >>> >>>>Dear List, >>>> >>>>I would like to gather the data from several processes. 
>>>>Instead of the comonly used stride, I want to interleave >>>>the data: >>>> >>>>Rank 0: AAAAA -> ABCDABCDABCDABCDABCD >>>>Rank 1: BBBBB ----^---^---^---^---^ >>>>Rank 2: CCCCC -----^---^---^---^---^ >>>>Rank 3: DDDDD ------^---^---^---^---^ >>>> >>>>Since the stride of the receive type is indicated >>>>in multpiles of its mpi_type, no interleaving is >>>>possible (the smallest striping factor leads to >>>>AAAAABBBBBBCCCCCDDDDD). >>>> >>>>Is there a way to achieve this behaviour in an >>>>elegant way, as MPI_Gather promises it? Or do >>>>I need to do Send/Recv with self-aligned offsets? >>> >>> >>>You should be able to do this with MPI_Gather by creating a new datatype >>>on the receiving process whose extent is the size of a single item; that >>>will get you the correct offset for the first element. In order to >>>receive the subsequent elements into the desired location, you need to >>>use a vector type containing the number of elements. And for this to be >>>fast, you need an MPI implementation that will handle the "resized" >>>datatype efficiently (use MPI_Type_vector to create the full datatype >>>and MPI_Type_create_resized to change its effective extent). If you are >>>moving large amounts of data, separate send/recvs are probably a better >>>choice. >>> >>>Bill >> >>Nice! >> >>Rob >>_______________________________________________ >>Beowulf mailing list, Beowulf@beowulf.org >>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf >> > > From rross at mcs.anl.gov Thu Mar 3 12:50:24 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Tue Jan 6 01:03:53 2009 Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv? 
In-Reply-To: <42273F9B.4040506@ccrl-nece.de> References: <1109659454.6544.2.camel@localhost.localdomain> <6.2.1.2.2.20050303081639.098a3c10@pop.mcs.anl.gov> <42273F9B.4040506@ccrl-nece.de> Message-ID: <42277890.5070509@mcs.anl.gov> Joachim Worringen wrote: > William Gropp wrote: > >> You should be able to do this with MPI_Gather by creating a new >> datatype on the receiving process whose extent is the size of a single >> item; that will get you the correct offset for the first element. In >> order to receive the subsequent elements into the desired location, >> you need to use a vector type containing the number of elements. And >> for this to be fast, you need an MPI implementation that will handle >> the "resized" datatype efficiently (use MPI_Type_vector to create the >> full datatype and MPI_Type_create_resized to change its effective >> extent). If you are moving large amounts of data, separate send/recvs >> are probably a better choice. > > Oh yes, I forgot, twiddling with LB and UB. I never liked this, esp. as > an MPI implementor. Not especially 'elegant', but it should work. Good > conformance test, BTW. It needs to have a negative extent to really test things. The positive extents are easy! Rob From hahn at physics.mcmaster.ca Thu Mar 3 17:24:06 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Tue Jan 6 01:03:53 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: Message-ID: > Supposedly the "memory free" PCI-E HCA's that use host memory, rather > than on-board sram, should move prices towards $100 sometime this > year - we'll see (landed on motherboard pricing at ~ $70, see > http://www.mellanox.com/news/press/pr_030105.html) the IB world (still consting only of Mellanox chips, right?) has done a good job pushing down adapter prices. can anyone comment on trends in switch pricing? thanks, mark hahn. 
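For the interleaved-gather question quoted in Rob Ross's message above, the target layout can be sketched without MPI. The Python below is only an illustration of the index arithmetic, not MPI code. It assumes every rank contributes the same number of items, and it mimics what the root would see if each rank's data were received through a vector type with a stride of `nranks` items whose extent has been resized to a single item (the MPI_Type_vector plus MPI_Type_create_resized recipe Bill Gropp describes). The name `interleaved_gather` is made up for this sketch.

```python
def interleaved_gather(send_buffers):
    """Simulate the receive-side layout of an interleaved gather.

    send_buffers[r] stands in for rank r's send buffer. All buffers
    are assumed to hold the same number of items.
    """
    nranks = len(send_buffers)
    count = len(send_buffers[0])
    if any(len(buf) != count for buf in send_buffers):
        raise ValueError("all ranks must contribute the same number of items")

    recv = [None] * (nranks * count)
    for rank, buf in enumerate(send_buffers):
        # Extent resized to one item: rank r's data starts at offset r.
        # Vector stride of nranks items: item j lands nranks * j further on.
        for j, item in enumerate(buf):
            recv[rank + j * nranks] = item
    return recv


if __name__ == "__main__":
    ranks = ["AAAAA", "BBBBB", "CCCCC", "DDDDD"]
    print("".join(interleaved_gather(ranks)))  # ABCDABCDABCDABCDABCD
```

In a real MPI program the same placement would come from the receive datatype on the root, with each rank still sending its items contiguously. Whether that is fast depends on how the implementation handles the resized type, as noted in the thread.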
From john.hearns at streamline-computing.com Fri Mar 4 01:33:18 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Tue Jan 6 01:03:53 2009 Subject: [Beowulf] DCC (debian cluster components) In-Reply-To: <20050303163359.GJ4482@washoe.rutgers.edu> References: <20050303163359.GJ4482@washoe.rutgers.edu> Message-ID: <1109928798.5537.22.camel@Vigor45> On Thu, 2005-03-03 at 11:33 -0500, Yaroslav Halchenko wrote: > Dear Debianized beowulfers or beowulfiezed Debian users, > > Does any one has experience with > http://www.irb.hr/en/cir/projects/dcc/ > which recently was released? > > Project Goals > > We expect to integrate some existing technologies (like LDAP, System > Installation Suite, Seems strange that they haven't chosen FAI (Fully Automated Installer). As an aside, there was a poster about FAI up outside the cluster developer's room at FOSDEM last weekend. From rgb at phy.duke.edu Fri Mar 4 05:03:00 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue Jan 6 01:03:53 2009 Subject: [Beowulf] DCC (debian cluster components) In-Reply-To: <1109928798.5537.22.camel@Vigor45> References: <20050303163359.GJ4482@washoe.rutgers.edu> <1109928798.5537.22.camel@Vigor45> Message-ID: On Fri, 4 Mar 2005, John Hearns wrote: > On Thu, 2005-03-03 at 11:33 -0500, Yaroslav Halchenko wrote: > > Dear Debianized beowulfers or beowulfiezed Debian users, > > > > Does any one has experience with > > http://www.irb.hr/en/cir/projects/dcc/ > > which recently was released? > > > > Project Goals > > > > We expect to integrate some existing technologies (like LDAP, System > > Installation Suite, > > Seems strange that they haven't chosen FAI (Fully Automated Installer). > As an aside, there was a poster about FAI up outside the cluster > developer's room at FOSDEM last weekend. Is FAI being loved by somebody at this point? There was a time a few years ago where it seemed to be lying fallow (although as always I could be mistaken about that). 
Toolsets like that usually need a fairly active and energetic human to care for them, if not several... rgb > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From roger at ERC.MsState.Edu Fri Mar 4 06:10:28 2005 From: roger at ERC.MsState.Edu (Roger L. Smith) Date: Tue Jan 6 01:03:53 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: References: Message-ID: On Thu, 3 Mar 2005, Mark Hahn wrote: > the IB world (still consting only of Mellanox chips, right?) > has done a good job pushing down adapter prices. > > can anyone comment on trends in switch pricing? I know at least one vendor has a 24 port model using the newer IB chipset for around $8,000. _\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_ | Roger L. Smith Phone: 662-325-3625 | | Sr. Systems Administrator FAX: 662-325-7692 | | roger@ERC.MsState.Edu http://WWW.ERC.MsState.Edu/~roger | | Mississippi State University | |____________________________________ERC__________________________________| From gmpc at sanger.ac.uk Fri Mar 4 06:40:01 2005 From: gmpc at sanger.ac.uk (Guy Coates) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] DCC (debian cluster components) In-Reply-To: References: <20050303163359.GJ4482@washoe.rutgers.edu> <1109928798.5537.22.camel@Vigor45> Message-ID: > > Is FAI being loved by somebody at this point? There was a time a few > years ago where it seemed to be lying fallow (although as always I could > be mistaken about that). Toolsets like that usually need a fairly > active and energetic human to care for them, if not several... It is still alive. 
We're just in the process of rolling out a new cluster with it at this very moment. Works fine. Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK Tel: +44 (0)1223 834244 ex 7199 From laytonjb at charter.net Fri Mar 4 07:43:27 2005 From: laytonjb at charter.net (Jeffrey B. Layton) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: References: Message-ID: <4228821F.90607@charter.net> Roger L. Smith wrote: >On Thu, 3 Mar 2005, Mark Hahn wrote: > > > >>the IB world (still consting only of Mellanox chips, right?) >>has done a good job pushing down adapter prices. >> >>can anyone comment on trends in switch pricing? >> >> > >I know at least one vendor has a 24 port model using the newer IB chipset >for around $8,000. > > > I just finished an interconnect survey article for Doug and ClusterWorld Magazine. As part of the article I have a nice table with list prices for 8 nodes and 128 nodes for the various interconnects. It should be out in the May issue so be sure to look for it. However, to match what Roger said, one IB vendor gave me a list price for 8-ports of IB for under $8,000. Jeff From landman at scalableinformatics.com Fri Mar 4 08:04:37 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: <4228821F.90607@charter.net> References: <4228821F.90607@charter.net> Message-ID: <42288715.10403@scalableinformatics.com> 8 ports under 8k or 24 ports under 8k? Jeffrey B. Layton wrote: > However, to match what Roger said, one IB vendor gave me > a list price for 8-ports of IB for under $8,000. 
-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From laytonjb at charter.net Fri Mar 4 08:16:03 2005 From: laytonjb at charter.net (Jeffrey B. Layton) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: <42288715.10403@scalableinformatics.com> References: <4228821F.90607@charter.net> <42288715.10403@scalableinformatics.com> Message-ID: <422889C3.5080902@charter.net> 8 ports under 8k, but it was a 24 port switch :) This includes all of the HCA's, switches (only one), cables, and software. Jeff > 8 ports under 8k or 24 ports under 8k? > > Jeffrey B. Layton wrote: > >> However, to match what Roger said, one IB vendor gave me >> a list price for 8-ports of IB for under $8,000. > From roger at ERC.MsState.Edu Fri Mar 4 08:34:32 2005 From: roger at ERC.MsState.Edu (Roger L. Smith) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: <422889C3.5080902@charter.net> References: <4228821F.90607@charter.net> <42288715.10403@scalableinformatics.com> <422889C3.5080902@charter.net> Message-ID: THe price I stated was for a 24 port switch for around $8,000 list. As a matter of fact, I just confirmed this with the vendor. This does not include cables or HCAs. On Fri, 4 Mar 2005, Jeffrey B. Layton wrote: > 8 ports under 8k, but it was a 24 port switch :) > This includes all of the HCA's, switches (only one), > cables, and software. > > Jeff > > > 8 ports under 8k or 24 ports under 8k? > > > > Jeffrey B. Layton wrote: > > > >> However, to match what Roger said, one IB vendor gave me > >> a list price for 8-ports of IB for under $8,000. > > > _\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_ | Roger L. Smith Phone: 662-325-3625 | | Sr. 
Systems Administrator FAX: 662-325-7692 | | roger@ERC.MsState.Edu http://WWW.ERC.MsState.Edu/~roger | | Mississippi State University | |____________________________________ERC__________________________________| From eugen at leitl.org Fri Mar 4 08:49:17 2005 From: eugen at leitl.org (Eugen Leitl) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] Re: OS X in a Classified Environment... (fwd from kstaats@terrasoftsolutions.com) Message-ID: <20050304164917.GR13336@leitl.org> ----- Forwarded message from Kai Staats ----- From: Kai Staats Date: Fri, 4 Mar 2005 09:22:42 -0700 To: scitech@lists.apple.com Subject: Re: OS X in a Classified Environment... Organization: Terra Soft Solutions, Inc. User-Agent: KMail/1.7 Reply-To: kstaats@terrasoftsolutions.com Bryan, [snip] > Army contractor for aerodynamics work. I would be interested to find > out what happened to the Navy sonar cluster compute project that used > G4 servers running Linux... The original 272 G4 Xserves implemented in 2003 continue to be in use on-board the subs (from what I understand). In addition, the TI04 project (this past summer) invoked the use of G5 Xserves running our 64-bit Linux OS, Y-HPC. If PowerPC continues to be used in the sonar imaging environment, Linux will continue to be the preferred OS due to its flexibility and ease of code migration to/from non-PowerPC systems that remain a part of the on-board imaging systems. More info here: http://www.terrasoftsolutions.com/realworld/showcase/dod/ ... with several other DoE/DoD customers: http://www.terrasoftsolutions.com/products/y-hpc/customers.shtml kai _______________________________________________ Do not post admin requests to the list. They will be ignored. 
Scitech mailing list (Scitech@lists.apple.com) Help/Unsubscribe/Update your Subscription: http://lists.apple.com/mailman/options/scitech/eugen%40leitl.org This email sent to eugen@leitl.org ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20050304/8e9e67dc/attachment.bin From canon at nersc.gov Thu Mar 3 10:28:02 2005 From: canon at nersc.gov (Shane Canon) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] RAID storage: Vendor vs. parts In-Reply-To: <421C2DA4.8090608@psc.edu> References: <421C2DA4.8090608@psc.edu> Message-ID: <42275732.2020306@nersc.gov> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 This was my experience as well. Most of the tools worked out of the box, but the partitioning was a real hang up. The easiest way around this is to have a separate boot drive that the installer can partition in its normal manner and then to have the >2TB device be completely separate and configured after first boot. You then use the whole drive (no partitions). This worked for me. - --Shane Paul Nowoczynski wrote: | Alvin Oga wrote: | |> hi ya steve |> |> On Tue, 22 Feb 2005, Steve Cousins wrote: |> |> |> |>> That's what I'm shooting for. Anybody have good luck with volumes |>> greater |>> than 2 TB with Linux? I think LSI SCSI cards are needed (?) and the 2.6 |>> Kernel is needed with CONFIG_LBD=y. Any hints or notes about doing this |>> would be greatly appreciated. Google has not been much of a friend on |>> this unfortunatlely. I'm guessing I'd run into NFS limits too. |>> |> |> |> for files/volumes over 2TB ... 
it's a question of libs, apps and |> kernel everything has to work ... which is not always the case |> |> |> | We've got this working at PSC without too much pain.. even with scsi | block devices >2TB. The LBD is needed but it | doesn't solve all the problems with large disks, especially if you have | a single volume which is larger than | 2TB. The issue we ran into was that many disk related apps like mdadm | and [s]fdisk don't support | the BLKGETSIZE64 ioctl. So even though your kernel is using 64 bits, | some needed apps are not. There are also issues with disklabels for | devices >2TB. The normal dos-style disklabel used by linux | doesn't support them so you'll need a kernel patch for the "plaintext" | partition table made by Andries Brouwer. | If you're interested in running this on 2.6 I can give you the patch. | As far as cards go I think the adaptec u320 cards | are better. I've seen less scsi timeout weirdness with them (this could | be related to our disks). Performance wise | the lsi and adaptec are about the same.. we see ~400MB/sec when using | both channels - even with a sub pci-x bus. For a couple hundred bucks a | card this is really good news. | --paul | |> i don't play much with 2.6 kernels other than on suse-9.x boxes |> |> |> |>> Also, am I being overly cautious about having a spare RAID controller on |>> hand? How frequent do RAID controllers go bad compared to disks, power |>> supplies and fan modules? I'd guess that it would be very infrequent. |>> |> |> |> it's always better to have spare parts ... ( part of my requirement ) if |> they expect the systems to be available 24x7 ... |> - more importantly, how long can they wait, when silly inexpensive |> things die, before it gets replaced |> |> - dead fans is $2.oo - $15 each to keep the disks cool |> |> - power supply is $50 range ... 
but if one bought n+1 powersupply |> than its supposed to not be an issue anymore, but you will need to |> have its replacement handy |> |> - raid controllers should NOT die, nor cpu, mem, mb, nic, etc |> and it's not cheap to have these items floating around as spare |> parts |> |> - ethernet cables will go funky if random people have access |> to the patch panels ... ( keep the fingers away ) |> |> - ups will go bonkers too |> |> - what failure mode can one protect against and what will happen |> if "it" dies |> - best protection against downtime for users is to have an |> warm-swap server which is updated a hourly or daily ... ( my |> preference - 2nd identical or bigger-disk capacity system ) |> |> |> |>> Looking back at my own experience I think I've had to return one out |>> of 15 |>> in the last eight years, and that was bad as soon as I bought it. |>> |> |> |> seems too high of a return rate ?? 1 out of 15 ?? |> |> |> |>> If this is too off-topic let me know and I'll move it elsewhere. 
|>> |> |> |> ditto here |> 24x7x365 uptime compute environment is fun/frustrating stuff on tight |> budgets |> |> c ya |> alvin |> |> _______________________________________________ |> Beowulf mailing list, Beowulf@beowulf.org |> To change your subscription (digest mode or unsubscribe) visit |> http://www.beowulf.org/mailman/listinfo/beowulf |> |> | | _______________________________________________ | Beowulf mailing list, Beowulf@beowulf.org | To change your subscription (digest mode or unsubscribe) visit | http://www.beowulf.org/mailman/listinfo/beowulf -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFCJ1cxZd/2zrI5CioRAnYnAJ98qtd17aPK62aCw4UNt79klUdasQCglLyo kNXL0h7KGaQfFmla33Gxfn4= =osvt -----END PGP SIGNATURE----- From djholm at fnal.gov Thu Mar 3 14:31:56 2005 From: djholm at fnal.gov (Don Holmgren) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: <20050303185516.GB1453@greglaptop.internal.keyresearch.com> References: <20050303020600.GA56437@piskorski.com> <20050303185516.GB1453@greglaptop.internal.keyresearch.com> Message-ID: On Thu, 3 Mar 2005, Greg Lindahl wrote: > On Thu, Mar 03, 2005 at 08:58:33AM -0600, Don Holmgren wrote: > > > I was just quoted quantity 2 PCI-X HCA's (4x 2 port Mellanox) by a > > reseller for $655 each. Last Fall we purchased a large quantity of > > PCI-E HCA's for considerably less than that unit price. > > > > Supposedly the "memory free" PCI-E HCA's that use host memory, rather > > than on-board sram, should move prices towards $100 sometime this > > year - we'll see (landed on motherboard pricing at ~ $70, see > > http://www.mellanox.com/news/press/pr_030105.html) > > You're mixing retail price with wholesale price. The $69 price is > apparently quantity 10,000, for just the chip, and the card that > is inexpensive will be PCIe 4X, which hurts performance. 
Yes, it will be interesting to see what the motherboard vendors charge for an IB option. $69 (assuming they hit the 10K volume) + the price of the I/O connector + engineering cost + margin. I'm hoping that it will be < $150, but I may be too optimistic.

Interesting comment about PCIe 4X hurting performance, thanks! The current PCI-E cards have two ports and use an 8X slot, but I'd guess that most cluster applications use only a single port. What's the performance penalty for using a single 4X port HCA in a 4X PCI-E slot compared with using a single port on a dual port card in an 8X slot? I believe the MemFree cards also incur a few tenths of a microsecond latency hit because of the need to access host memory, at least according to the preliminary benchmarks shown at SC'04.

>
> I hear that today's street price for Mellanox-based cards is ~$600 in
> cluster-sized quantities, which matches what you report.

We did a little better than that - for quantity 260, we paid < $450 for PCI-E HCA's. A couple of other bids were around $500.

>
> > Like other high performance network gear, it's tough to get
> > accurate pricing information without going out and getting quotes.
>
> One nice thing about Myricom is that their prices have always been on
> the web -- all you need to know in addition is your discount.
>
> -- greg

Agreed. Among the many other nice things about Myricom.

Don

From djholm at fnal.gov Fri Mar 4 10:43:59 2005
From: djholm at fnal.gov (Don Holmgren)
Date: Tue Jan 6 01:03:54 2009
Subject: [Beowulf] 2.6.11 is out; with InfiBand support
In-Reply-To:
References: <4228821F.90607@charter.net> <42288715.10403@scalableinformatics.com> <422889C3.5080902@charter.net>
Message-ID:

I've made two purchases in the last 12 months of 24-port switches. Two switches last April came in at ~ $4000 each. 16 switches last Sept came in at ~ $3300 each. These were two different brands of switches, both based on the Mellanox Infiniscale III (24 port crossbar) silicon.
Clearly YMMV on pricing.

Don Holmgren

On Fri, 4 Mar 2005, Roger L. Smith wrote:

>
>
> The price I stated was for a 24 port switch for around $8,000 list. As a
> matter of fact, I just confirmed this with the vendor.
>
> This does not include cables or HCAs.
>
> On Fri, 4 Mar 2005, Jeffrey B. Layton wrote:
>
> > 8 ports under 8k, but it was a 24 port switch :)
> > This includes all of the HCA's, switches (only one),
> > cables, and software.
> >
> > Jeff
> >
> > > 8 ports under 8k or 24 ports under 8k?
> > >
> > > Jeffrey B. Layton wrote:
> > >
> > >> However, to match what Roger said, one IB vendor gave me
> > >> a list price for 8-ports of IB for under $8,000.
> > >
> >
>
> _\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_
> | Roger L. Smith Phone: 662-325-3625 |
> | Sr. Systems Administrator FAX: 662-325-7692 |
> | roger@ERC.MsState.Edu http://WWW.ERC.MsState.Edu/~roger |
> | Mississippi State University |
> |____________________________________ERC__________________________________|
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

From jrajiv at hclinsys.com Thu Mar 3 22:51:09 2005
From: jrajiv at hclinsys.com (Rajiv)
Date: Tue Jan 6 01:03:54 2009
Subject: [Beowulf] HA OSCAR for loadbalancing and failover
Message-ID: <006d01c52086$8c50a8d0$0f120897@PMORND>

Dear Sir,

I am carrying out load balancing using OSCAR-3.0. We are also carrying out failover using the HA-OSCAR 1.0 beta release (High Availability OSCAR). We are required to achieve load balancing and failover for the following services:

1. HTTP.
2. FTP.
3. TELNET.
4. DHCP.
5. SQUID.
Our setup is as follows:

1 primary server, 1 client node (using OSCAR-3.0)
1 standby server (using HA-OSCAR)

We have succeeded in building the cluster but are having problems with load balancing. We are trying to achieve load balancing using the PBS server (Portable Batch System) that comes built in with OSCAR-3.0. We are queuing the services as jobs and trying to distribute these jobs between the server and the client node. The problem we are facing is that we are not able to submit the job to the PBS server.

Firstly, we would like you to confirm whether we are on the right track for achieving load balancing. We would also like to know how you have achieved load balancing.

Regards,
Rajiv

From maurice at harddata.com Thu Mar 3 23:56:46 2005
From: maurice at harddata.com (Maurice Hilarius)
Date: Tue Jan 6 01:03:54 2009
Subject: [Beowulf] Re: Beowulf Digest, Vol 13, Issue 4
In-Reply-To: <200503031629.j23GThuk022748@bluewest.scyld.com>
References: <200503031629.j23GThuk022748@bluewest.scyld.com>
Message-ID: <422814BE.1070203@harddata.com>

Andrew Piskorski wrote:

>From: Andrew Piskorski
>Subject: Re: [Beowulf] 2.6.11 is out; with InfiBand support
>
>
>Except, a single PCI-X Infiniband card currently costs $1000 or so,
>right? (That's for a 4x 2 port card, but Froogle does not seem to
>know of any cheaper cards.)
>
>

New "name brand" (sorry, it's under NDA) "Memory Free" cards will be selling for under $400 for PCI Express 4X, and under $600 for PCI Express 8X. Availability Q2 for production quantities. Still a lot more expensive than Myrinet, and Myri have their own surprises to reveal in that time frame as well.

It's a great time for cluster computing:
Opteron Rev E
nForce4
dual core Opterons
S-ATA2
SAS
Economical 10Gb interconnects.

Wow.

With our best regards,
Maurice W. Hilarius Telephone: 01-780-456-9771
Hard Data Ltd.
FAX: 01-780-456-9772 11060 - 166 Avenue email:maurice@harddata.com Edmonton, AB, Canada http://www.harddata.com/ T5X 1Y3 From lists at subnetz.org Fri Mar 4 05:43:27 2005 From: lists at subnetz.org (Tilman Koschnick) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] DCC (debian cluster components) In-Reply-To: References: <20050303163359.GJ4482@washoe.rutgers.edu> <1109928798.5537.22.camel@Vigor45> Message-ID: <1109943807.2081.51.camel@mother.subnetz.org> On Fri, 2005-03-04 at 08:03 -0500, Robert G. Brown wrote: > Is FAI being loved by somebody at this point? There was a time a few > years ago where it seemed to be lying fallow (although as always I could > be mistaken about that). Toolsets like that usually need a fairly > active and energetic human to care for them, if not several... I think it is. The latest (fairly long) changelog entry - version 2.6.6 - dates from 21 Jan 2005. I went to a talk by the FAI maintainer a couple of months ago, and he didn't give the impression he was going to abandon it any time soon. There was talk about porting FAI to Redhat/rpm, but I don't know what the state of this is. Cheers, Til From list-beowulf at onerussian.com Fri Mar 4 06:22:08 2005 From: list-beowulf at onerussian.com (Yaroslav Halchenko) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] DCC (debian cluster components) In-Reply-To: References: <20050303163359.GJ4482@washoe.rutgers.edu> <1109928798.5537.22.camel@Vigor45> Message-ID: <20050304142208.GC32176@washoe.rutgers.edu> On Fri, Mar 04, 2005 at 08:03:00AM -0500, Robert G. Brown wrote: > Is FAI being loved by somebody at this point? There was a time a few > years ago where it seemed to be lying fallow (although as always I could > be mistaken about that). Toolsets like that usually need a fairly > active and energetic human to care for them, if not several... I like FAI and although it is just a set of scripts, it seems to be stable. 
I first used it 1.5 years ago to install the first 10 nodes of the cluster. I had to tweak it to make it work, but when we got 15 more nodes 5 months ago, my old FAI configuration needed just a few adjustments to do its job. DCC seems to use an "image-based installation model", as opposed to FAI, which is a flexible cloned installation. For uniform clusters, image-based installation is probably better than FAI, which admits different classes of configuration and is thus more flexible. -- .-. =------------------------------ /v\ ----------------------------= Keep in touch // \\ (yoh@|www.)onerussian.com Yaroslav Halchenko /( )\ ICQ#: 60653192 Linux User ^^-^^ [175555] -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: Digital signature Url : http://www.scyld.com/pipermail/beowulf/attachments/20050304/2795d8f4/attachment.bin From agshew at gmail.com Fri Mar 4 07:51:06 2005 From: agshew at gmail.com (Andrew Shewmaker) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] DCC (debian cluster components) In-Reply-To: <1109928798.5537.22.camel@Vigor45> References: <20050303163359.GJ4482@washoe.rutgers.edu> <1109928798.5537.22.camel@Vigor45> Message-ID: On Fri, 04 Mar 2005 09:33:18 +0000, John Hearns wrote: > On Thu, 2005-03-03 at 11:33 -0500, Yaroslav Halchenko wrote: > > Dear Debianized beowulfers or beowulfiezed Debian users, > > > > Does any one has experience with > > http://www.irb.hr/en/cir/projects/dcc/ > > which recently was released? > > > > Project Goals > > > > We expect to integrate some existing technologies (like LDAP, System > > Installation Suite, > > Seems strange that they haven't chosen FAI (Fully Automated Installer). > As an aside, there was a poster about FAI up outside the cluster > developer's room at FOSDEM last weekend.
If you think it is strange that they appear to have chosen System Installation Suite over FAI because you are thinking that SIS is focused on RPM distros (I was under that impression at one time), then you should know that SIS is primarily developed on Debian. -- Andrew Shewmaker From egan at sense.net Fri Mar 4 08:41:30 2005 From: egan at sense.net (Egan Ford) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] Windows Server 2003 Compute Cluster Edition In-Reply-To: <422889C3.5080902@charter.net> Message-ID: <002201c520d9$04f6c380$e2054109@oberon> Unless given away, better price/performance, or killer app, I estimate that the number Windows HPC clusters to be very small. I'd like to say zero, but I have customers today doing HPC on Windows. The apps are not available for other platforms. http://news.com.com/Windows+for+supercomputers+likely+out+by+fall/2100-1012_ 3-5598603.html?part=rss&tag=5598782&subj=news From tallpaul at speakeasy.org Fri Mar 4 12:43:12 2005 From: tallpaul at speakeasy.org (Paul English) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] DCC (debian cluster components) In-Reply-To: References: <20050303163359.GJ4482@washoe.rutgers.edu> <1109928798.5537.22.camel@Vigor45> Message-ID: On Fri, 4 Mar 2005, Robert G. Brown wrote: > On Fri, 4 Mar 2005, John Hearns wrote: > > > On Thu, 2005-03-03 at 11:33 -0500, Yaroslav Halchenko wrote: > > > Dear Debianized beowulfers or beowulfiezed Debian users, > > > > > > Does any one has experience with > > > http://www.irb.hr/en/cir/projects/dcc/ > > > which recently was released? > > > > > > Project Goals > > > > > > We expect to integrate some existing technologies (like LDAP, System > > > Installation Suite, > > > > Seems strange that they haven't chosen FAI (Fully Automated Installer). > > As an aside, there was a poster about FAI up outside the cluster > > developer's room at FOSDEM last weekend. > > Is FAI being loved by somebody at this point? 
There was a time a few > years ago where it seemed to be lying fallow (although as always I could > be mistaken about that). Toolsets like that usually need a fairly > active and energetic human to care for them, if not several... The list is alive, and has posts, etc. I did not have a great deal of luck getting help with my questions and the process is (if anything) more raw than Kickstart. I gave it a good try for several months because many of our machines are Debian, but in the end I gave up. For clustering purposes, ROCKS has been much more useful - quick, useful responses on the mailing list, and a lot more of the lower-level crud is hidden with simpler utilities. For general network installs (workstations, general purpose servers), Kickstart was far easier to use and find answers for than FAI, although it could use some of the abstraction that ROCKS has. In the "modern" era of PXE, on "most networks" adding new machines with specific configurations could be done with a single command or GUI. We're not there yet. :-) Paul From maillists at gauckler.ch Fri Mar 4 14:30:26 2005 From: maillists at gauckler.ch (Michael Gauckler) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv? In-Reply-To: <1109659454.6544.2.camel@localhost.localdomain> References: <1109659454.6544.2.camel@localhost.localdomain> Message-ID: <1109975426.5116.16.camel@localhost.localdomain> Dear List, thank you for all your replies concerning my question about interleaved gathers. ("Interleaved" was meant in terms of memory layout, not the time of arrival of the messages.) Yes, there is a solution to this problem: change the lower and upper bounds of the datatype with the help of MPI_Type_create_resized. Through the lam-mpi mailing list I got a reply from Josh, which I would like to share with you because it even includes the source of a demo application (see below). Thank you very much!
Yours, Michael

___

From: Josh Hursey
Date: Tue, 1 Mar 2005 09:50:43 -0500 (15:50 CET)

Yes, this can be achieved in an elegant way with MPI_Gather, but you need to adjust the receive datatype. You will need to create a new MPI_Datatype that strides as you need it to. The trick is to shift the lower and upper bounds of this new strided datatype so that it interleaves values. Something like:

    /* Create a datatype to receive into. */
    MPI_Type_vector(NUM_LOCAL_ELE, /* # of blocks */
                    1,             /* # of datatypes in a block (one for this array) */
                    gsize,         /* Stride between successive blocks */
                    MPI_CHAR,      /* Type of each block */
                    &old_type);
    MPI_Type_commit(&old_type);

    /* Resize the type to allow interleaving,
     * so make it only one MPI_CHAR wide */
    MPI_Type_create_resized(old_type,
                            0, /* Lower bound */
                            1, /* Extent: changed to one block (one MPI_CHAR) */
                            &new_type);
    MPI_Type_commit(&new_type);

Then use new_type as the receive type argument to MPI_Gather. I attached sample code that does exactly this and produces the following output:

    $ mpirun -np 4 gather_interleave
    Rank 0  A A A A A A A A A A A A
    Rank 1  B B B B B B B B B B B B
    Rank 2  C C C C C C C C C C C C
    Rank 3  D D D D D D D D D D D D
    Final:
            A B C D A B C D A B C D
            A B C D A B C D A B C D
            A B C D A B C D A B C D
            A B C D A B C D A B C D

Hope this helps.

Josh

--------------------

    #include <stdio.h>   /* printf */
    #include <stdlib.h>  /* malloc, free */
    #include <unistd.h>  /* sleep */
    #include <mpi.h>

    #define NUM_LOCAL_ELE 12

    int main(int argc, char *argv[]) {
        int rank, gsize, i, j;
        char local_array[NUM_LOCAL_ELE];
        char *collected_array = NULL; /* only allocated on rank 0 */
        MPI_Datatype new_type, old_type;

        /* Initialize */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &gsize);

        /* Create a datatype to receive into. */
        MPI_Type_vector(NUM_LOCAL_ELE, /* # of blocks */
                        1,             /* # of datatypes in a block (one for this array) */
                        gsize,         /* Stride between successive blocks */
                        MPI_CHAR,      /* Type of each block */
                        &old_type);
        MPI_Type_commit(&old_type);

        /* Resize the type to allow interleaving,
         * so make it only one MPI_CHAR wide */
        MPI_Type_create_resized(old_type,
                                0, /* Lower bound */
                                1, /* Extent: changed to one block (one MPI_CHAR) */
                                &new_type);
        MPI_Type_commit(&new_type);

        /* Initialize local array with characters:
         * Rank 0 = A A A...
         * Rank 1 = B B B...
         * Rank 2 = C C C...
         * ...
         */
        for (i = 0; i < NUM_LOCAL_ELE; ++i) {
            local_array[i] = 'A' + rank;
        }

        /* Print out local array; the sleep staggers output by rank */
        sleep(rank * 1);
        printf("Rank %d", rank);
        for (i = 0; i < NUM_LOCAL_ELE; ++i) {
            printf("\t%c", local_array[i]);
        }
        printf("\n");

        if (rank == 0)
            collected_array = (char *)malloc(gsize * NUM_LOCAL_ELE * sizeof(char));

        /* Each rank sends NUM_LOCAL_ELE chars; the root receives one
         * new_type per rank, which scatters them with stride gsize. */
        MPI_Gather(local_array, NUM_LOCAL_ELE, MPI_CHAR,
                   collected_array, 1, new_type,
                   0, MPI_COMM_WORLD);

        /* Print out gathered array */
        if (rank == 0) {
            printf("Final:\n");
            for (i = 0; i < gsize; ++i) {
                for (j = 0; j < NUM_LOCAL_ELE; ++j) {
                    printf("\t%c", collected_array[i * NUM_LOCAL_ELE + j]);
                }
                printf("\n");
            }
        }

        if (rank == 0)
            free(collected_array);

        MPI_Type_free(&new_type);
        MPI_Type_free(&old_type);
        MPI_Finalize();
        return 0;
    }

On Tuesday, 01.03.2005, at 07:44 +0100, Michael Gauckler wrote:
> Dear List,
>
> I would like to gather the data from several processes.
> Instead of the commonly used stride, I want to interleave
> the data:
>
> Rank 0: AAAAA -> ABCDABCDABCDABCDABCD
> Rank 1: BBBBB ----^---^---^---^---^
> Rank 2: CCCCC -----^---^---^---^---^
> Rank 3: DDDDD ------^---^---^---^---^
>
> Since the stride of the receive type is indicated
> in multiples of its mpi_type, no interleaving is
> possible (the smallest striping factor leads to
> AAAAABBBBBCCCCCDDDDD).
>
> Is there a way to achieve this behaviour in an
> elegant way, as MPI_Gather promises it? Or do
> I need to do Send/Recv with self-aligned offsets?
>
> Thank you for your help!
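The placement rule behind the resized vector type can be checked without MPI: element j of rank r lands at offset j*gsize + r in the root's buffer. What follows is only a minimal sketch of that index arithmetic, assuming the input holds each rank's block contiguously (the layout a plain MPI_Gather with an MPI_CHAR receive type would deliver). The name interleave_ranks and its parameters are hypothetical; they are not part of MPI or of Josh's program.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the thread, not an MPI call).
 * src holds rank-contiguous data: rank r's elements occupy
 * src[r*nlocal] .. src[r*nlocal + nlocal - 1].
 * dst receives the interleaved layout: element j of rank r
 * is written to dst[j*nranks + r].
 * Both buffers must hold at least nranks*nlocal chars and must not overlap. */
void interleave_ranks(const char *src, size_t nranks, size_t nlocal, char *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t r = 0; r < nranks; ++r) {
        for (size_t j = 0; j < nlocal; ++j) {
            dst[j * nranks + r] = src[r * nlocal + j];
        }
    }
}
```

With 4 ranks and 3 elements each, the input "AAABBBCCCDDD" becomes "ABCDABCDABCD". The same loop is the copy one would do by hand on the root after an ordinary gather if derived datatypes were not an option.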
> > Michael From taj at www.linux.org.uk Thu Mar 3 14:39:28 2005 From: taj at www.linux.org.uk (Trent Jarvi) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] http://www.beowulf.org/overview/history.html Message-ID: Just a heads up. This page appears to be corrupted. While not all Beowulf clusters are supercomputers, one can build a Beowulf that is powerful enough to attract the interest of supercomputer users. Beyond the seasoned parallel programmer, Beowulf clusters have been built and used by programmers with little or no parallel programming experience. Beowulf clusters provide universities, often with limited resources, an excellent platform to teach parallel programming cNvq0ZhTgBrP kOLZWGuE0+ZiqlFOd2ml5US6LXQ/8jfnOSP4wydRdXTBOTOpewexZw1KyyFaZYgXTx5zQTNf 5QFWN4fE0H3CCkPYVhNTdPWIDurIhwMLdwxbCTM6fcG3+JA+1TpQX+s5ZlYw5+bvDqkre+1Y [...] -- Trent Jarvi taj@www.linux.org.uk From mwill at penguincomputing.com Fri Mar 4 15:52:18 2005 From: mwill at penguincomputing.com (Michael Will) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] HA OSCAR for loadbalancing and failover In-Reply-To: <006d01c52086$8c50a8d0$0f120897@PMORND> References: <006d01c52086$8c50a8d0$0f120897@PMORND> Message-ID: <200503041552.18491.mwill@penguincomputing.com> Even though it is an interesting idea to use a beowulf cluster for this, in particular when using several nodes to do loadbalancing with and automatic deployment of services, I think it is the wrong tool for the task you have set for yourself. Your requirements would probably be more easily fulfilled with a simple HA failover cluster (no oscar involved). See http://www.ultramonkey.org for details. Especially when you only have two servers, one as primary and one as standby, which is a classical active/passive config, then there is no reason to have the complexity of a beowulf style cluster. 
ultramonkey.org also mentions LVS, which helps with load balancing, and I believe they even have a solution for session synchronisation, which means that even when a failover of the load balancer occurs, a TCP/IP session does not die but gets redirected. You will not need any PBS then, but rather a package called 'heartbeat' that defines the services to be failed over in its own config files. Michael On Thursday 03 March 2005 10:51 pm, Rajiv wrote: > Dear Sir, > i am carrying outloadbalancing using OSCAR-3.0.We are also carrying out > failover using HA-OSCAR 1.0 beta release (High Availability OSCAR). We are > required to acheive loadbalancing > and failover for the following services: > 1. HTTP. > 2. FTP. > 3. TELNET. > 4. DHCP. > 5. SQUID. > Our setup is as follows: > 1 Primary server , 1 client node (using OSCAR-3.0) > 1 standby server (using HA OSCAR) > > We have succeeded in building the cluster but am > having problems regarding loadbalancing.We are trying to achieve > loadbalancing using the PBS Server(Portable Batch System) which comes > inbuilt with OSCAR-3.0.We are queing the services as jobs and trying to > distribute these jobs between the server and client node. But the problem > we are facing is that we are not able to submit the job to the PBS server. > Sir,firstly, we would like you to confirm if we are going on the right > track for achieving loadbalancing.We would like to know how you'll have > achieved load balancing? > > Regards, > Rajiv > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Michael Will, Linux Sales Engineer Tel: 415-954-2822 Toll Free: 888-PENGUIN Fax: 415-954-2899 www.penguincomputing.com Visit us at FOSE 2005!
Washington Convention Center, Washington, DC April 5th-7th, 2005 Linux Pavilion, Booth 2225 From eugen at leitl.org Sat Mar 5 00:00:01 2005 From: eugen at leitl.org (Eugen Leitl) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] HA OSCAR for loadbalancing and failover In-Reply-To: <200503041552.18491.mwill@penguincomputing.com> References: <006d01c52086$8c50a8d0$0f120897@PMORND> <200503041552.18491.mwill@penguincomputing.com> Message-ID: <20050305075952.GH13336@leitl.org> On Fri, Mar 04, 2005 at 03:52:18PM -0800, Michael Will wrote: > Especially when you only have two servers, one as primary and one as standby, > which is a classical active/passive config, then there is no reason to have the > complexity of a beowulf style cluster. http://www.linux-ha.org/ (1.99?) supports up to 8 and beyond, but it needs some testing. -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20050305/e194bdd9/attachment.bin From gmpc at sanger.ac.uk Sat Mar 5 02:45:27 2005 From: gmpc at sanger.ac.uk (Guy Coates) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] DCC (debian cluster components) In-Reply-To: References: <20050303163359.GJ4482@washoe.rutgers.edu> <1109928798.5537.22.camel@Vigor45> Message-ID: > In the "modern" era of PXE, on "most networks" adding new machines > with specific configurations could be done with a single command or GUI. > We're not there yet. :-) The only thing that ever came close was RLX's Control Tower management software. It did "one click" management and provisioning of machines. 
On the blade systems it could even do "zero click configuration", as you could set policy like "Any machines I put into slots 1-10 should automatically get configuration Y put on them." It was generic enough that it could provision any operating system you could think of, and if it didn't do something you wanted it to, it was also easy to dig under the covers and hack the code. The only downside was the price tag, which was extortionate. Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK Tel: +44 (0)1223 834244 ex 7199 From kus at free.net Sat Mar 5 04:59:19 2005 From: kus at free.net (Mikhail Kuzminsky) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: Message-ID: In message from Mark Hahn (Wed, 2 Mar 2005 18:09:09 -0500 (EST)): >> Arima and Iwill have mobos with IB LOM (Landed on Motherboard). > >given the choice between a $150 pcie IB nic and having it onboard, >I'd choose the separate card. I know the IB salesdroids always >say that getting onto the MB will change everything, but this >doesn't make sense. IB is completely different from onboard gigabit, >for instance, because there is no ubiquitous IB infrastructure >ready, waiting to be exploited. > >the problem with "if you build it onboard, they will come" is also >the marginal cost. onboard gigabit is nearly the same cost as >onboard 100bT, very low, and you pretty much always want it. >onboard IB is noticably higher than onboard GBE, noticable in >absolute terms, and you definitely have no possible use for it >on many systems. > >remember, most people don't even saturate GBE yet, Yes, I agree. But we are developing a quantum-chemical application whose parallel speedup is bandwidth-limited, and we find that the speedup on 6 processors with an IB 4x interconnect is about 34% higher than with Myrinet.
Yours Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow > and GBE >ports are damned cheap. GBE nics are free, and switch ports >are now down to $US 23/port: > >http://froogle.google.com/froogle?q=netgear+GS748T&btnG=Search+Froogle > >fundamentally, IB is still facing most of the same problems it always >has: > >- requires fairly expensive, unique infrastructure >- not the greatest physical layer: it's easy to wind up with > literally tons of IB cables. >- not clearly superior in performance vs alternatives. >- apparently designed by people who disliked existing technique > or were ignorant of it. >- not a drop-in replacement for alternatives. > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Sun Mar 6 06:39:06 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] http://www.beowulf.org/overview/history.html In-Reply-To: References: Message-ID: On Thu, 3 Mar 2005, Trent Jarvi wrote: > > Just a heads up. > > This page appears to be corrupted. > > While not all Beowulf clusters are supercomputers, one can build a Beowulf > that is powerful enough to attract the interest of supercomputer users. > Beyond the seasoned parallel programmer, Beowulf clusters have been built > and used by programmers with little or no parallel programming experience. > Beowulf clusters provide universities, often with limited resources, an > excellent platform to teach parallel programming cNvq0ZhTgBrP > kOLZWGuE0+ZiqlFOd2ml5US6LXQ/8jfnOSP4wydRdXTBOTOpewexZw1KyyFaZYgXTx5zQTNf > 5QFWN4fE0H3CCkPYVhNTdPWIDurIhwMLdwxbCTM6fcG3+JA+1TpQX+s5ZlYw5+bvDqkre+1Y > > [...] 
Corrupted and out of date, too:-) Nobody who looks at the top500 list (whatever my opinions about its basis;-) would nowadays say that one can "build a Beowulf that is powerful enough to attract the interest of supercomputer users". It's getting to be much more of a "seasoned parallel programmers (a.k.a. `old guys') can remember a time when parallel programming was carried out on `supercomputers', basically a name for a cluster with proprietary internal processor interconnects". Linux hasn't finished taking over the world, although it continues to make excellent progress with all sorts of economic and historical forces driving it. "Beowulfs" in the generic sense of COTS clusters with network interconnects for IPCs, pretty much have taken over the supercomputing world with only a few exceptions, and even those exceptions are relying less and less on anything like a custom communications bus. Not even the engineering of the dedicated systems scales, while using a "COTS" communication platform such as Myri or Dolphinics, or IB or even gigE lets you leverage all sorts of useful work done by other humans devoted to this one purpose or this purpose among others. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Sun Mar 6 07:04:17 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue Jan 6 01:03:54 2009 Subject: [Beowulf] DCC (debian cluster components) In-Reply-To: References: <20050303163359.GJ4482@washoe.rutgers.edu> <1109928798.5537.22.camel@Vigor45> Message-ID: On Sat, 5 Mar 2005, Guy Coates wrote: > > In the "modern" era of PXE, on "most networks" adding new machines > > with specific configurations could be done with a single command or GUI. > > We're not there yet. :-) > > The only thing that ever came close was RLX's Control Tower management > software. 
It did "one click" management and provisioning of machines. > One the blade systems it could even do "zero click configuration", as you > could set policy like > > "Any machines I put into slots 1-10 should automatically get configuration > Y put on them." > > It was generic enough so that it could provision any operating system you > could think off, and if it didn't do something you wanted it to, it was > also easy to dig under the covers and hack the code. > > The only down side was the price-tag, which was extortionate. > > Guy > > I agree that there is some work involved in building a PXE installable configuration for e.g. kickstart, but it isn't excessive. Take template kickstart file(s). Edit it to select package set(s) for same (different) node config(s). Put it (them) on server. Edit dhcpd.conf to point to bootloader in /tftpboot. Edit boot.msg, pxelinux.cfg/default to point to (list of) node type(s) and point to the associated kickstart file(s) and boot option(s), respectively. Boot. There IS a GUI for building the kickstart file (under RH and FC, at least), although I suspect that most people would use this at most to build the template and then tune it up by hand -- it is actually easier to edit a flatfile with an editor once you see the layout. dhcpd doesn't have a GUI to front it AFAIK, so this remains a place one could do some work. It would be useful to build one that tests the URL paths to e.g. the kickstart files and the tftpboot paths to the initrd images. It would be even lovelier to have the same interface edit boot.msg and pxelinux.cfg/default at the same time so that all three could be consistent. This is the one place I find myself hopping from directory to directory to make matching changes, as I create an image for "fc3 workstation" or "fc3 node" for testing purposes and need to ensure that the matching kickstart file is in the right place and correctly corresponds to the dhcpd entry.
This is even more true if one wants to create an image indexed per IP number (the "right" way, arguably, for tftpboot to function) so that everything becomes totally automagic. I always am of two minds about high-level front ends to low level admin commands. Yes, they are convenient and let newbies start to work with a low buy-in as far as learning curve is concerned. However, they also SHIELD a newbie from learning what they really need to know to be an effective manager, and (to my own experience anyway) they rarely work stably as the various tools upon which they are built evolve. For one thing, the GUI designer almost always omits features and capabilities of the lower level stuff, so eventually you want to do something you "know" can be done but the GUI doesn't. For another, somebody changes something really subtle, such as a default pathway in /tftpboot, and your GUI "breaks" and you have no idea why and won't until you learn, in a panic, all the things the GUI shielded you from. I tend to think GUIs work best when they are a standard part of a single administrative package co-developed by the packages maintainers. A GUI that spans multiple tools and functions simply has more issues. Hence redhat-config-kickstart has a chance to remain useful, but building its functionality into a sweeping redhat-config-px