From john.hearns at streamline-computing.com Fri Aug 1 00:28:45 2008
From: john.hearns at streamline-computing.com (John Hearns)
Date: Fri, 01 Aug 2008 08:28:45 +0100
Subject: [Beowulf] Re: Linux cluster authenticating against multiple Active Directory domains
In-Reply-To: <371991977.6661217569032542.JavaMail.root@mail.vpac.org>
References: <371991977.6661217569032542.JavaMail.root@mail.vpac.org>
Message-ID: <1217575735.4977.1.camel@Vigor13>

On Fri, 2008-08-01 at 15:37 +1000, Chris Samuel wrote:

> We'd prefer to steer clear of Kerberos, it introduces
> arbitrary job limitations through ticket lives that
> are not tolerable for HPC work.
>
Kerberos is heavily used at CERN. They have a solution for that issue -
the job can ask for an extension to the tickets.
Sorry, I don't have a reference handy but it's worth documenting this for
the list.

From hahn at mcmaster.ca Fri Aug 1 07:06:17 2008
From: hahn at mcmaster.ca (Mark Hahn)
Date: Fri, 1 Aug 2008 10:06:17 -0400 (EDT)
Subject: [Beowulf] Re: Building new cluster - estimate (Ivan Oleynik)
In-Reply-To: <488FEF6A.30003@harddata.com>
References: <200807261957.m6QJv6HE031997@bluewest.scyld.com> <488BD53A.2010907@harddata.com> <488EC70F.8090906@harddata.com> <488FEF6A.30003@harddata.com>
Message-ID:

> BTW, where a lot of people are jumping on the "Get IPMI" bandwagon, I
> suggest getting PDUs with remote IP controlled ports is more useful.

the thing I don't like about controlled PDUs is that they're pretty harsh -
don't you expect a higher failure rate of node PSUs if you go yanking
the power this way?

I have only seen a handful of different IPMI interfaces, but they all
were reasonably reliable.

> If you set your machines' BIOS to start on power up, it is trivial to stop and
> start machines with the PDU power, and that is definitely reliable.

huh? we're talking about network-attached IPMI, which is fully independent
of the controlled motherboard's bios.
are you talking about those hybrid systems where the IPMI controller shares
an ethernet port with the host? or IPMI through a kernel driver?

> Plus, with a lot of those PDUs you can add thermal sensors and trigger power
> off on high temperature conditions.

IPMI normally provides all the motherboard's sensors as well. it seems like
those are far more relevant than the temp of the PDU...

using lm_sensors is a poor substitute for IPMI.

From mathog at caltech.edu Fri Aug 1 09:11:25 2008
From: mathog at caltech.edu (David Mathog)
Date: Fri, 01 Aug 2008 09:11:25 -0700
Subject: [Beowulf] reboot without passing through BIOS?
Message-ID:

Kilian CAVALOTTI wrote:

> I may be totally missing the point, but doesn't the memory need to be
> physically (as in electrically) reset in order to clean out those bad
> bits? And doesn't this require a hard reboot, for the machine to be
> power cycled, so that memory cells are reinitialized?

The type of errors I am talking about are random bit flips, for
instance, from ambient radiation. When the OS reboots it will overwrite
memory and so remove those errors. The affected cells were not damaged,
just in the wrong state. This should work so long as none of the
flipped bits prevents kexec from doing its job. Presumably the OS will
also reinitialize all memory structures stored elsewhere in hardware
(as in storage controllers and NICs) since it should not trust the BIOS
to have done this.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From hahn at mcmaster.ca Fri Aug 1 09:12:00 2008
From: hahn at mcmaster.ca (Mark Hahn)
Date: Fri, 1 Aug 2008 12:12:00 -0400 (EDT)
Subject: [Beowulf] Re: Building new cluster - estimate (Ivan Oleynik)
In-Reply-To: <48932EAB.1070500@harddata.com>
References: <200807261957.m6QJv6HE031997@bluewest.scyld.com> <488BD53A.2010907@harddata.com> <488EC70F.8090906@harddata.com> <488FEF6A.30003@harddata.com> <48932EAB.1070500@harddata.com>
Message-ID:

>> the thing I don't like about controlled PDUs is that they're pretty
>> harsh - don't you expect a higher failure rate of node PSUs if you go
>> yanking the power this way?
> Why?
> If nodes shutdown, on commands from the scheduler, that is good.
> And, if they do not, how is cutting power by the PDU socket any different
> than a power switch on the node?

I don't design PSU's, but yanking the cord seems "rude" compared to simply
raising the "please power off" signal. the latter is part of all PSU's
these days, and is what IPMI uses (via I2C, I guess).

perhaps it's superstition - I do always prefer to use the off button,
rather than yanking the cord. but thinking of how a switching PSU works,
perhaps it doesn't really matter - it views the input power as highly
variable anyway (ie, 90-250V, and with that annoying 50-60 Hz flutter ;)

>>> If you set your machines' BIOS to start on power up, it is trivial to stop
>>> and start machines with the PDU power, and that is definitely reliable.
>>
>> huh? we're talking about network-attached IPMI, which is fully independent
>> of the controlled motherboard's bios. are you talking about those hybrid
>> systems where the IPMI controller shares an ethernet port with the host?
>> or IPMI through a kernel driver?
>>
> Either.
> Most share a port, some have dedicated ports on board.

I'm not sure about the "most" part - HP's don't, and it looks like supermicro
offers options both ways.
all the recent tyan boards I've looked at had dedicated IPMI/OPMA onboard.
all HP machines have dedicated ports.

but to me this has all the hallmarks of a religious issue, so...

From kus at free.net Fri Aug 1 09:27:44 2008
From: kus at free.net (Mikhail Kuzminsky)
Date: Fri, 01 Aug 2008 20:27:44 +0400
Subject: [Beowulf] Re: Building new cluster - estimate (Ivan Oleynik)
In-Reply-To:
Message-ID:

In message from Mark Hahn (Fri, 1 Aug 2008 10:06:17 -0400 (EDT)):
>> ... Plus, with a lot of those PDUs you can add thermal sensors and
>> trigger power off on high temperature conditions.
> IPMI normally provides all the motherboard's sensors as well. it
> seems like those are far more relevant than the temp of the PDU...
>
> using lm_sensors is a poor substitute for IPMI.

IMHO the only disadvantage of lm_sensors is the problem of building the
right sensors.conf file.

Mikhail Kuzminsky
Computer Assistance to Chemical Research Center
Zelinsky Institute of Organic Chemistry
Moscow

From jmdavis1 at vcu.edu Fri Aug 1 07:50:11 2008
From: jmdavis1 at vcu.edu (Mike Davis)
Date: Fri, 01 Aug 2008 10:50:11 -0400
Subject: [Beowulf] Re: Building new cluster - estimate (Ivan Oleynik)
In-Reply-To:
References: <200807261957.m6QJv6HE031997@bluewest.scyld.com> <488BD53A.2010907@harddata.com> <488EC70F.8090906@harddata.com> <488FEF6A.30003@harddata.com>
Message-ID: <489322A3.1010600@vcu.edu>

>
> the thing I don't like about controlled PDUs is that they're pretty
> harsh - don't you expect a higher failure rate of node PSUs if you go
> yanking the power this way?
>
> I have only seen a handful of different IPMI interfaces, but they all
> were reasonably reliable.

In using the ethernet interfaced PDU's for the past 8 years on several
clusters, I haven't noticed a high PSU failure rate. In all honesty, we
haven't had a unit connected to them ever lose a power supply. One
benefit of using the PDU solution is one enet for X machines rather than
X additional enets for X machines.
That being said, IPMI offers additional functionality not provided by
the PDU's.

Mike Davis

From hahn at mcmaster.ca Fri Aug 1 10:28:47 2008
From: hahn at mcmaster.ca (Mark Hahn)
Date: Fri, 1 Aug 2008 13:28:47 -0400 (EDT)
Subject: [Beowulf] Re: Building new cluster - estimate (Ivan Oleynik)
In-Reply-To:
References:
Message-ID:

>> using lm_sensors is a poor substitute for IPMI.
>
> IMHO the only disadvantage of lm_sensors is the problem of building the
> right sensors.conf file.

well, there's the little matter of being able to get data when the node
is crashed, offline, busy, etc. I also very much like the ability to
query status, temps, fan speeds out-of-band - that is, without stealing
cycles from the job.

From maurice at harddata.com Fri Aug 1 06:48:49 2008
From: maurice at harddata.com (Maurice Hilarius)
Date: Fri, 01 Aug 2008 07:48:49 -0600
Subject: [Beowulf] Re: Building new cluster - estimate (Ivan, Oleynik)
In-Reply-To: <200808010608.m716894U008585@bluewest.scyld.com>
References: <200808010608.m716894U008585@bluewest.scyld.com>
Message-ID: <48931441.9050603@harddata.com>

Chris Samuel wrote:
..
>> > BTW, where a lot of people are jumping on the "Get IPMI"
>> > bandwagon, I suggest getting PDUs with remote IP controlled
>> > ports is more useful.
>> >
> Well, it depends on what you're trying to do, if it's get
> the system and CPU temperatures then a PDU isn't much cop.. :)
>
>
True, but on most boards lm_sensors will do that for you for free..

--
With our best regards,

Maurice W. Hilarius        Telephone: 01-780-456-9771
Hard Data Ltd.             FAX:       01-780-456-9772
11060 - 166 Avenue         email: maurice at harddata.com
Edmonton, AB, Canada       http://www.harddata.com/
T5X 1Y3
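
The in-band versus out-of-band distinction in the messages above can be
made concrete with a small sketch. This is not from the thread: it assumes
a Linux node whose kernel exposes sensors under /sys/class/hwmon (the
interface lm_sensors reads), with temp*_input files holding millidegrees
Celsius. The function name read_hwmon_temps is made up for illustration.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {sensor: degrees C} read in-band from hwmon sysfs files.

    Assumption: temp*_input files contain an integer in millidegrees
    Celsius, as documented for the Linux hwmon sysfs interface.
    """
    temps = {}
    for sensor in sorted(Path(root).glob("hwmon*/temp*_input")):
        try:
            millideg = int(sensor.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor vanished or returned garbage; skip it
        temps[f"{sensor.parent.name}/{sensor.name}"] = millideg / 1000.0
    return temps


if __name__ == "__main__":
    # Runs on the node itself, so it only answers while the OS is alive
    # and it takes a few cycles away from whatever job is running.
    for name, celsius in read_hwmon_temps().items():
        print(f"{name}: {celsius:.1f} C")
```

Because this code runs on the monitored node, it returns nothing once the
node has crashed or is powered off, which is the case where a BMC queried
over the network still responds.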

From pal at di.fct.unl.pt Fri Aug 1 08:40:42 2008
From: pal at di.fct.unl.pt (Paulo Afonso Lopes)
Date: Fri, 1 Aug 2008 16:40:42 +0100 (WEST)
Subject: [Beowulf] Seeing ECC errors since upgraded from Opteron 246 to 275
Message-ID: <2405.10.170.133.93.1217605242.squirrel@www.di.fct.unl.pt>

Dear all:

Around 2/Apr I removed 2 Opteron 246s and their "companion" 4x 512 MB DIMMs
from two HP DL145-G2s, leaving them empty, to populate two other HPs (which
got 2 CPUs and 4GB per node).

Then, I installed 2 dual-core Opterons per DL145-G2, together with 4
sticks of 1GB (2 sticks per CPU).

So I have 2 DL145-G2 nodes with 2 single-core 246 / 4GB each, and 2
DL145-G2 nodes with 2 dual-core 275 / 4GB each.

On 18th/Apr, one of the dual-core nodes crashed with an ECC error. From
IPMI, for that node,

04/18/2008 | 20:26:26 | Memory #0x02 | Uncorrectable ECC | Asserted
06/18/2008 | 12:00:16 | Memory #0x02 | Uncorrectable ECC | Asserted
06/23/2008 | 11:58:34 | Memory #0x02 | Uncorrectable ECC | Asserted
07/19/2008 | 22:41:12 | Memory #0x02 | Uncorrectable ECC | Asserted
07/22/2008 | 17:18:00 | Memory #0x02 | Uncorrectable ECC | Asserted
07/23/2008 | 22:08:15 | Memory #0x02 | Uncorrectable ECC | Asserted
07/28/2008 | 17:52:23 | Memory #0x02 | Uncorrectable ECC | Asserted

On 07/19 the memory of CPU0 was replaced; on the 27th, the remaining
memory was replaced. ECC crashes do continue, from 1 per day to 1 per week.

07/28: first ECC error on the other Opteron-275 populated node.

07/28/2008 | 18:54:23 | Memory #0x02 | Uncorrectable ECC | Asserted

All nodes have IB boards, and I swapped the boards from the first
crashing and second crashing nodes (that's when, a few days later, the
second node crashed the very first time).

I have observed that not more than 2 minutes away from the ECC error there
are always these events logged:

06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S0/G0: working | Asserted
06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S5/G2: soft-off | Deasserted

(but they are logged also at other times)

I am running Scientific Linux 5, the (lam) MPI application uses almost
100% CPU and does exchange lots of small packets through IPoIB (I have
not used "native" IB yet). "Everything" is 64-bit (kernel, apps).

Any thoughts?

Best Regards,
paulo lopes

--
Paulo Afonso Lopes                  | Tel: +351- 21 294 8536
Departamento de Informática         |           294 8300 ext.10763
Faculdade de Ciências e Tecnologia  | Fax: +351- 21 294 8541
Universidade Nova de Lisboa         | e-mail: pal at di.fct.unl.pt
2829-516 Caparica, PORTUGAL

From maurice at harddata.com Fri Aug 1 08:41:31 2008
From: maurice at harddata.com (Maurice Hilarius)
Date: Fri, 01 Aug 2008 09:41:31 -0600
Subject: [Beowulf] Re: Building new cluster - estimate (Ivan Oleynik)
In-Reply-To:
References: <200807261957.m6QJv6HE031997@bluewest.scyld.com> <488BD53A.2010907@harddata.com> <488EC70F.8090906@harddata.com> <488FEF6A.30003@harddata.com>
Message-ID: <48932EAB.1070500@harddata.com>

Mark Hahn wrote:
>> BTW, where a lot of people are jumping on the "Get IPMI" bandwagon,
>> I suggest getting PDUs with remote IP controlled ports is more useful.
>
> the thing I don't like about controlled PDUs is that they're pretty
> harsh - don't you expect a higher failure rate of node PSUs if you go
> yanking the power this way?
Why?
If nodes shutdown, on commands from the scheduler, that is good.
And, if they do not, how is cutting power by the PDU socket any
different than a power switch on the node?
Obviously we want to avoid "dropping the hammer" on a mounted
filesystem, at least until it has its cache cleared. That is not hard to
accomplish.

>
> I have only seen a handful of different IPMI interfaces, but they all
> were reasonably reliable.
>
I have used the Supermicro, Tyan, ASUS, and Dell, and they all had some
tendency to choke sometimes.

The thing is, at the nominal cost of $50 to $100 per machine for BMC
(IPMI) cards, one can buy a couple of network controlled PDUs, with the
thermal and humidity sensors.
As you are likely to at least buy "dumb" PDUs, this means the typical
cost added by this is usually around $30 per node, resulting in a tidy
savings.
It also means you are "talking" to only one device per 10 to 30 nodes,
versus 10 to 30 BMC devices.

Further, these IPMI cards typically "steal" a GbE port on the nodes.

>> If you set your machines' BIOS to start on power up, it is trivial to
>> stop and start machines with the PDU power, and that is definitely
>> reliable.
>
> huh? we're talking about network-attached IPMI, which is fully
> independent
> of the controlled motherboard's bios. are you talking about those
> hybrid systems where the IPMI controller shares an ethernet port with
> the host?
> or IPMI through a kernel driver?
>
Either.
Most share a port, some have dedicated ports on board.

>> Plus, with a lot of those PDUs you can add thermal sensors and
>> trigger power off on high temperature conditions.
>
> IPMI normally provides all the motherboard's sensors as well. it
> seems like those are far more relevant than the temp of the PDU...
I would rather monitor the room temperature at the racks, and shut the
whole works down in a hurry if something is wrong, such as air
conditioning failure.

> using lm_sensors is a poor substitute for IPMI.
Yes, and no.
For monitoring the temps and fans and such on nodes it is quite sufficient.
For power control it is useless, of course.

--
With our best regards,

Maurice W. Hilarius        Telephone: 01-780-456-9771
Hard Data Ltd.             FAX:       01-780-456-9772
11060 - 166 Avenue         email: maurice at harddata.com
Edmonton, AB, Canada       http://www.harddata.com/
T5X 1Y3

From rreis at aero.ist.utl.pt Fri Aug 1 10:36:47 2008
From: rreis at aero.ist.utl.pt (Ricardo Reis)
Date: Fri, 1 Aug 2008 18:36:47 +0100 (WEST)
Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran
Message-ID:

which means... segfault.

Hi all

I've scoured the net for answers to no avail and the fftw project seems
to have ground to a halt. Maybe someone has had this problem and can
throw some light.

I've coded a small program that reads a vorticity field, uses FFTW2 to
send it from the physical to the spectral space and then computes its
energy spectrum. Everything works on my laptop (32 bit, Linux), serial,
threaded and mpi (using openmpi). On a 64 bit machine the mpi version
goes kaput. Any thoughts?

Ricardo Reis

'Non Serviam'

PhD student @ Lasef
Computational Fluid Dynamics, High Performance Computing, Turbulence
http://www.lasef.ist.utl.pt

&

Cultural Instigator @ Rádio Zero
http://www.radiozero.pt

http://www.flickr.com/photos/rreis/

From hahn at mcmaster.ca Fri Aug 1 11:23:13 2008
From: hahn at mcmaster.ca (Mark Hahn)
Date: Fri, 1 Aug 2008 14:23:13 -0400 (EDT)
Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran
In-Reply-To:
References:
Message-ID:

> I've scoured the net for answers to no avail and the fftw project seems to
> have ground to a halt. Maybe someone has had this problem and can throw some
> light.

I don't know the status of the project, but fftw is definitely still
widely used, and definitely works in 64b.

> I've coded a small program that reads a vorticity field, uses FFTW2 to send
> it from the physical to the spectral space and then computes its energy
> spectrum. Everything works on my laptop (32 bit, Linux), serial, threaded and
> mpi (using openmpi). On a 64 bit machine the mpi version goes kaput. Any
> thoughts?

32-64 problems usually stem from someone conflating ints and pointers.
we have fftw2 installed on all our machines, which are all 64b (for years).
no reports of problems.
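
The int/pointer point can be shown with a few lines of Python. This is a
minimal sketch, not code from the thread: it uses ctypes fixed-width
integers to stand in for a 32-bit Fortran default INTEGER and a 64-bit
INTEGER(8), and the address value is just an example of the kind of
64-bit pointer a library might hand back as an opaque handle.

```python
import ctypes


def survives_int32(address: int) -> bool:
    """True if `address` is unchanged after being stored in a 32-bit int."""
    # ctypes integer types do no overflow checking; the value is truncated.
    return ctypes.c_int32(address).value == address


def survives_int64(address: int) -> bool:
    """True if `address` is unchanged after being stored in a 64-bit int."""
    return ctypes.c_int64(address).value == address


# Example 64-bit user-space address (illustrative value only).
handle = 0x7F13CB3057C2

print("pointer size on this machine:", ctypes.sizeof(ctypes.c_void_p), "bytes")
print("kept by a 32-bit integer:", survives_int32(handle))
print("kept by a 64-bit integer:", survives_int64(handle))
print("what the 32-bit integer holds:",
      hex(ctypes.c_int32(handle).value & 0xFFFFFFFF))
```

On a 32-bit machine every pointer fits in a default integer, so the same
source can run cleanly there and then dereference a truncated address once
it is rebuilt for 64 bits.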

did you compile your own fftw2, and if so, did you run the test cases?

From hahn at mcmaster.ca Fri Aug 1 11:25:54 2008
From: hahn at mcmaster.ca (Mark Hahn)
Date: Fri, 1 Aug 2008 14:25:54 -0400 (EDT)
Subject: [Beowulf] Seeing ECC errors since upgraded from Opteron 246 to 275
In-Reply-To: <2405.10.170.133.93.1217605242.squirrel@www.di.fct.unl.pt>
References: <2405.10.170.133.93.1217605242.squirrel@www.di.fct.unl.pt>
Message-ID:

> So I have 2 DL145-G2 nodes with 2 single-core 246 / 4GB each, and 2
> DL145-G2 nodes with 2 dual-core 275 / 4GB each.

it's worth making sure you have current bios installed.

> 07/28/2008 | 17:52:23 | Memory #0x02 | Uncorrectable ECC | Asserted

it may also be useful to run mcelog, which will tell you about
any ongoing _correctable_ ECC activity.

From glen.beane at jax.org Fri Aug 1 11:34:50 2008
From: glen.beane at jax.org (Glen Beane)
Date: Fri, 01 Aug 2008 14:34:50 -0400
Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran
In-Reply-To:
References:
Message-ID: <4893574A.3010405@jax.org>

Mark Hahn wrote:
>> I've scoured the net for answers to no avail and the fftw project
>> seems to have ground to a halt. Maybe someone has had this problem
>> and can throw some light.
>
> I don't know the status of the project, but fftw is definitely still
> widely used, and definitely works in 64b.
>
>> I've coded a small program that reads a vorticity field, uses FFTW2 to
>> send it from the physical to the spectral space and then computes its
>> energy spectrum. Everything works on my laptop (32 bit, Linux),
>> serial, threaded and mpi (using openmpi). On a 64 bit machine the mpi
>> version goes kaput. Any thoughts?
>
> 32-64 problems usually stem from someone conflating ints and pointers.
> we have fftw2 installed on all our machines, which are all 64b (for years).
> no reports of problems.

I am also using fftw2 on our 64-bit Linux cluster without any issues

--
Glen L. Beane
Software Engineer
The Jackson Laboratory
Phone (207) 288-6153

From jason at acm.org Fri Aug 1 12:29:08 2008
From: jason at acm.org (Jason Riedy)
Date: Fri, 01 Aug 2008 15:29:08 -0400
Subject: [Beowulf] Re: fftw2, mpi, from 32 bit to 64 and fortran
In-Reply-To: (Ricardo Reis's message of "Fri, 1 Aug 2008 18:36:47 +0100 (WEST)")
References:
Message-ID: <87proszg8r.fsf@sparse.dyndns.org>

And Ricardo Reis writes:
> On a 64 bit machine the mpi version goes kaput. Any thoughts?

I'd bet that you're calling MPI routines directly from your
Fortran code somewhere, and fftw is a red herring...

When calling MPI routines directly from your Fortran code, be
very, very careful about the arguments being passed. Many MPI
routines stuff a pointer in a "large enough" integer, but some of
the MPI/Fortran "header" files make too many assumptions about
the particular compiler and flags in use.

You might want to use the ISO_C_BINDING module and the
BIND(C, NAME="...") gizmos to declare the specific routines
you're using.

Jason

From mark.kosmowski at gmail.com Fri Aug 1 12:45:07 2008
From: mark.kosmowski at gmail.com (Mark Kosmowski)
Date: Fri, 1 Aug 2008 15:45:07 -0400
Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran
Message-ID:

> Message: 3
> Date: Fri, 1 Aug 2008 14:23:13 -0400 (EDT)
> From: Mark Hahn
> Subject: Re: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran
> To: Ricardo Reis
> Cc: beowulf at beowulf.org
> Message-ID:
>
> Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
>
> > I've scoured the net for answers to no avail and the fftw project seems to
> > have ground to a halt. Maybe someone has had this problem and can throw some
> > light.
>
> I don't know the status of the project, but fftw is definitely still
> widely used, and definitely works in 64b.
>
> > I've coded a small program that reads a vorticity field, uses FFTW2 to send
> > it from the physical to the spectral space and then computes its energy
> > spectrum.
> > Everything works on my laptop (32 bit, Linux), serial, threaded and
> > mpi (using openmpi). On a 64 bit machine the mpi version goes kaput. Any
> > thoughts?
>
> 32-64 problems usually stem from someone conflating ints and pointers.
> we have fftw2 installed on all our machines, which are all 64b (for years).
> no reports of problems.
>
> did you compile your own fftw2, and if so, did you run the test cases?

What exactly is going wrong? Is your program failing to link or does
it die during execution? Have you built 64-bit versions of fftw and
mpi libraries? Are you positive that you've changed from 32-bit to
64-bit all of the libraries your code links to, even the ones not
related to fftw or mpi?

You imply that 64-bit serial with fftw works - are you able to get a
different code to run with 64-bit mpi?

Good luck,

Mark Kosmowski

From jmdavis1 at vcu.edu Fri Aug 1 13:26:52 2008
From: jmdavis1 at vcu.edu (Mike Davis)
Date: Fri, 01 Aug 2008 16:26:52 -0400
Subject: [Beowulf] Re: fftw2, mpi, from 32 bit to 64 and fortran
In-Reply-To: <87proszg8r.fsf@sparse.dyndns.org>
References: <87proszg8r.fsf@sparse.dyndns.org>
Message-ID: <4893718C.5030109@vcu.edu>

My only issue with fftw is that some of our software will only work with
fftw2 and not fftw3. That being said, running both is relatively trivial.

Mike

From john.hearns at streamline-computing.com Fri Aug 1 15:01:18 2008
From: john.hearns at streamline-computing.com (John Hearns)
Date: Fri, 01 Aug 2008 23:01:18 +0100
Subject: [Beowulf] Re: Building new cluster - estimate (Ivan Oleynik)
In-Reply-To:
References: <200807261957.m6QJv6HE031997@bluewest.scyld.com> <488BD53A.2010907@harddata.com> <488EC70F.8090906@harddata.com> <488FEF6A.30003@harddata.com> <48932EAB.1070500@harddata.com>
Message-ID: <1217628088.4725.3.camel@Vigor13>

On Fri, 2008-08-01 at 12:12 -0400, Mark Hahn wrote:

> I'm not sure about the "most" part - HP's don't, and it looks like supermicro
> offers options both ways.
> all the recent tyan boards I've looked at had
> dedicated IPMI/OPMA onboard. all HP machines have dedicated ports.
>
> but to me this has all the hallmarks of a religious issue, so...

On the contrary Mark, my honest advice - through experience - is to go
for systems with the separate ethernet port if you have a choice in the
matter.
Yes, this involves double the cabling and the installation of a set of
10/100 switches or similar.

From landman at scalableinformatics.com Fri Aug 1 16:23:34 2008
From: landman at scalableinformatics.com (Joe Landman)
Date: Fri, 01 Aug 2008 19:23:34 -0400
Subject: [Beowulf] Re: Building new cluster - estimate (Ivan Oleynik)
In-Reply-To:
References: <200807261957.m6QJv6HE031997@bluewest.scyld.com> <488BD53A.2010907@harddata.com> <488EC70F.8090906@harddata.com> <488FEF6A.30003@harddata.com> <48932EAB.1070500@harddata.com>
Message-ID: <48939AF6.8010100@scalableinformatics.com>

Mark Hahn wrote:

> I'm not sure about the "most" part - HP's don't, and it looks like
> supermicro
> offers options both ways. all the recent tyan boards I've looked at had
> dedicated IPMI/OPMA onboard. all HP machines have dedicated ports.
>
> but to me this has all the hallmarks of a religious issue, so...

Hmmm... we try to take a more pragmatic approach.

IPMI is great. When it doesn't get wedged. And every now and then it
does in fact take an operational excursion. Not often enough to be more
than an annoyance, but often enough that you want to think about redundancy.

Yeah, I know, it's strange, but if your data center is remote, and going
over to it is hard for any reason, redundancy is a *very good idea*.

Switchable PDUs don't cost much more than plain old PDUs. Network
access to them is generally easy to set up. They are a good backup to IPMI.

But switchable PDUs don't give you console access. IPMI 2.0 can give
you SOL (serial over lan, not the other meaning). So we usually suggest
a console server to back that path up.
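
The "IPMI first, switched PDU as backup" idea could be sketched as below.
This is an illustration under stated assumptions, not anyone's production
tooling: the ipmitool invocation uses the standard lanplus interface and
"chassis power cycle" subcommand, while power_cycle and the way a PDU
method would be plugged in are hypothetical, since PDU control interfaces
differ by vendor.

```python
import subprocess
from typing import Callable, Optional, Sequence, Tuple


def ipmi_power_cycle(host: str, user: str, password: str) -> bool:
    """Ask the node's BMC to power-cycle it; True if ipmitool succeeded."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", host, "-U", user,
           "-P", password, "chassis", "power", "cycle"]
    try:
        done = subprocess.run(cmd, capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False  # ipmitool missing, or the BMC is wedged
    return done.returncode == 0


def power_cycle(node: str,
                methods: Sequence[Tuple[str, Callable[[str], bool]]]
                ) -> Optional[str]:
    """Try each control path in order; return the name of the one that worked.

    `methods` is a list of (label, function) pairs, e.g. IPMI first and a
    site-specific PDU routine second. Returns None if every path failed.
    """
    for label, method in methods:
        if method(node):
            return label
    return None
```

A caller would pass something like [("ipmi", lambda n: ipmi_power_cycle(n,
"admin", "secret")), ("pdu", my_pdu_cycle)], where my_pdu_cycle and the
credentials are placeholders for whatever the site actually uses.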

The Supermicro units give you KVM over IP on selected motherboards.

IPMI is great, but when it fails, you lose control. And console access.
If this is important (that you never lose control/console access) then
you need alternative paths. Given the relatively low cost of these
control systems, it's not such a bad idea to do this.

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615

From gus at ldeo.columbia.edu Fri Aug 1 14:25:34 2008
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Fri, 01 Aug 2008 17:25:34 -0400
Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran
In-Reply-To:
References:
Message-ID: <48937F4E.7030400@ldeo.columbia.edu>

Ricardo Reis

Could you have linked a 32-bit fftw library into the program?
I have made this mistake before.
Both the 32- and 64-bit versions are installed on my Fedora Core 8
64-bit machine.

32-bit in /usr/lib/
64-bit in /usr/lib64/

Could this be the reason for the segmentation fault?

As Álvaro de Campos, poor fellow, used to say,
compilation is a rally inside the soul. :)

Gus Correa

--
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

Ricardo Reis wrote:

>
> which means... segfault.
>
> Hi all
>
> I've scoured the net for answers to no avail and the fftw project
> seems to have ground to a halt. Maybe someone has had this problem
> and can throw some light.
>
> I've coded a small program that reads a vorticity field, uses FFTW2
> to send it from the physical to the spectral space and then computes
> its energy spectrum.
> Everything works on my laptop (32 bit, Linux),
> serial, threaded and mpi (using openmpi). On a 64 bit machine the mpi
> version goes kaput. Any thoughts?
>
>
>
> Ricardo Reis
>
> 'Non Serviam'
>
> PhD student @ Lasef
> Computational Fluid Dynamics, High Performance Computing, Turbulence
> http://www.lasef.ist.utl.pt
>
> &
>
> Cultural Instigator @ Rádio Zero
> http://www.radiozero.pt
>
> http://www.flickr.com/photos/rreis/
>
>------------------------------------------------------------------------
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
>

From rreis at aero.ist.utl.pt Sat Aug 2 04:25:27 2008
From: rreis at aero.ist.utl.pt (Ricardo Reis)
Date: Sat, 2 Aug 2008 12:25:27 +0100 (WEST)
Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran
In-Reply-To:
References:
Message-ID:

Hi

Thanks for replying. Answering all the questions:

- This is a debian box, X86_64 native. So all that is compiled is
  naturally 64 bit;

- I've compiled fftw-2.1.5 myself because fftw3 has only experimental
  MPI support, without Fortran bindings. I asked whether the project
  had stopped because the last release (fftw 3.2 alpha) is dated
  Nov. 13, 2007

- I'm using openmpi, from the debian package. I've also compiled
  openmpi by hand and the same problem happens. I've compiled the
  latest LAM (although I had to explicitly select the 4.1 version of
  the gcc suite because I found a problem with 4.3. It says g++ isn't
  boolean capable). I can run other mpi codes on this machine (a
  pseudo-spectral DNS code I've parallelized myself) with this openmpi
  installation;

- Using LAM it works for 1 processor. It blows up for more than 2. I
  can run my DNS code with lam without problem.

- The only 64 bit caveat in the fftw notes relates to the declaration
  of the plan variables, which should be integer(8). I've carefully
  done that.
  I even went to the extreme of placing -fdefault-integer-8 in the
  compilation flags of this code;

- I can run this code as serial or threaded without problems;

- The 32 bit test was on my laptop, a 32 bit machine. The 64 bit test
  was on the 64 bit machine. No libraries are transported (svn co and
  make and so on...)

- Yes, I've managed to run the tests (but they are C programs, alas!).

- The program only blows up when it goes to do the fft r2c (my first
  transform). Before that it is able to do other mpi functions.

- Gus, Ode Triunfal by Alvaro de Campos is one of my favourite poems.
  The early XX century machine emotion fever of electricity. The
  furious hunger to be alive and eating the world full :)

- I've tried it on another debian box, X86_64, with openmpi from debian
  and the same problem happens...

- if I compile with -fdefault-integer-8 this is the error message

5068.0 $ mpirun -np 2 ~/bin/spec2.mpi
Launching MPI program with 2 proc.
[tenorio:21099] *** Process received signal ***
[tenorio:21100] *** Process received signal ***
[tenorio:21099] Signal: Segmentation fault (11)
[tenorio:21099] Signal code: (128)
[tenorio:21099] Failing at address: (nil)
[tenorio:21099] [ 0] /lib/libpthread.so.0 [0x7f13ca893a90]
[tenorio:21099] [ 1] /usr/lib/libopen-pal.so.0(_int_malloc+0x962) [0x7f13cb3057c2]
[tenorio:21099] [ 2] /usr/lib/libopen-pal.so.0(malloc+0x8f) [0x7f13cb3068ef]
[tenorio:21099] [ 3] /home/rreis/bin/spec2.mpi(MAIN__+0x79a) [0x40eb0a]
[tenorio:21099] [ 4] /home/rreis/bin/spec2.mpi(main+0x2c) [0x46d3cc]
[tenorio:21099] [ 5] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f13ca5501a6]
[tenorio:21099] [ 6] /home/rreis/bin/spec2.mpi [0x407d59]
[tenorio:21099] *** End of error message ***
[tenorio:21100] Signal: Segmentation fault (11)
[tenorio:21100] Signal code: (128)
[tenorio:21100] Failing at address: (nil)
[tenorio:21100] [ 0] /lib/libpthread.so.0 [0x7f858af35a90]
[tenorio:21100] [ 1] /usr/lib/libopen-pal.so.0(_int_malloc+0x962) [0x7f858b9a77c2]
[tenorio:21100] [ 2] /usr/lib/libopen-pal.so.0(malloc+0x8f) [0x7f858b9a88ef]
[tenorio:21100] [ 3] /home/rreis/bin/spec2.mpi(MAIN__+0x79a) [0x40eb0a]
[tenorio:21100] [ 4] /home/rreis/bin/spec2.mpi(main+0x2c) [0x46d3cc]
[tenorio:21100] [ 5] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f858abf21a6]
[tenorio:21100] [ 6] /home/rreis/bin/spec2.mpi [0x407d59]
[tenorio:21100] *** End of error message ***
mpirun noticed that job rank 0 with PID 21099 on node tenorio exited on signal 11 (Segmentation fault).
1 additional process aborted (not shown)

- if I take the flag out

5070.0 $ mpirun -np 2 ~/bin/spec2.mpi
Launching MPI program with 2 proc.
Read field (DONE)
[tenorio:21234] *** Process received signal ***
[tenorio:21234] Signal: Segmentation fault (11)
[tenorio:21234] Signal code: Address not mapped (1)
[tenorio:21234] Failing at address: 0x4840
[tenorio:21234] [ 0] /lib/libpthread.so.0 [0x7fd57da65a90]
[tenorio:21234] [ 1] /home/rreis/bin/spec2.mpi(rfftwnd_f77_mpi_+0x16) [0x40f676]
[tenorio:21234] [ 2] /home/rreis/bin/spec2.mpi(MAIN__+0xb69) [0x40f1fe]
[tenorio:21234] [ 3] /home/rreis/bin/spec2.mpi(main+0x2c) [0x46d6bc]
[tenorio:21234] [ 4] /lib/libc.so.6(__libc_start_main+0xe6) [0x7fd57d7221a6]
[tenorio:21234] [ 5] /home/rreis/bin/spec2.mpi [0x407d59]
[tenorio:21234] *** End of error message ***
mpirun noticed that job rank 0 with PID 21234 on node tenorio exited on signal 11 (Segmentation fault).
1 additional process aborted (not shown)

Maybe I should try mpich or compile the openmpi with all bells and
whistles and give it another run...
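
Backtraces like the ones above are easier to compare once the frames are
pulled out. The following is a rough sketch, assuming the line layout seen
in this output ("[host:pid] [ n] object(symbol+offset) [address]"); the
regular expression and the helper names are illustrative, not part of
Open MPI.

```python
import re

# One stack frame: [host:pid] [ depth] object(symbol+0xoff) [0xaddr]
FRAME = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+"    # [tenorio:21234]
    r"\[\s*(?P<depth>\d+)\]\s+"                  # [ 1]
    r"(?P<object>[^\s(]+)"                       # /home/rreis/bin/spec2.mpi
    r"(?:\((?P<symbol>[^+)]+)\+0x[0-9a-f]+\))?"  # (rfftwnd_f77_mpi_+0x16)
    r"\s+\[0x[0-9a-f]+\]"                        # [0x40f676]
)


def frames(trace):
    """Return (depth, object, symbol) tuples; symbol is None if absent."""
    return [(int(m["depth"]), m["object"], m["symbol"])
            for m in FRAME.finditer(trace)]


def first_named_frame(trace):
    """Innermost frame that carries a symbol name, or None."""
    for frame in frames(trace):
        if frame[2] is not None:
            return frame
    return None


# Lines copied from the second run shown in the message.
TRACE = """\
[tenorio:21234] *** Process received signal ***
[tenorio:21234] Signal: Segmentation fault (11)
[tenorio:21234] Failing at address: 0x4840
[tenorio:21234] [ 0] /lib/libpthread.so.0 [0x7fd57da65a90]
[tenorio:21234] [ 1] /home/rreis/bin/spec2.mpi(rfftwnd_f77_mpi_+0x16) [0x40f676]
[tenorio:21234] [ 2] /home/rreis/bin/spec2.mpi(MAIN__+0xb69) [0x40f1fe]
"""

print(first_named_frame(TRACE))
```

For this run the innermost named frame is rfftwnd_f77_mpi_, entered
directly from MAIN__, so the fault occurs a few bytes into the FFTW Fortran
wrapper, which suggests looking at the arguments the program passes to that
call.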
greets, Ricardo Reis 'Non Serviam' PhD student @ Lasef Computational Fluid Dynamics, High Performance Computing, Turbulence http://www.lasef.ist.utl.pt & Cultural Instigator @ Rádio Zero http://www.radiozero.pt http://www.flickr.com/photos/rreis/ From pal at di.fct.unl.pt Sat Aug 2 04:57:37 2008 From: pal at di.fct.unl.pt (Paulo Afonso Lopes) Date: Sat, 2 Aug 2008 12:57:37 +0100 (WEST) Subject: [Beowulf] Seeing ECC errors since upgraded from Opteron 246 to 275 In-Reply-To: References: <2405.10.170.133.93.1217605242.squirrel@www.di.fct.unl.pt> Message-ID: <20670.89.180.225.196.1217678257.squirrel@www.di.fct.unl.pt> Thanks, Mark >> So I have 2 DL145-G2 nodes with 2 single-core 246 / 4GB each, and 2 >> DL145-G2 nodes with 2 dual-core 275 / 4GB each. > > it's worth making sure you have current bios installed. > Not the latest, but the previous; according to "Fixes" just a single, unrelated fix. Anyway I'm upgrading it... > >> 07/28/2008 | 17:52:23 | Memory #0x02 | Uncorrectable ECC | Asserted > > it may also be useful to run mcelog, which will tell you about > any ongoing _correctable_ ECC activity. No output in any of the 4 hosts; tried with/without --k8, --dmi, etc. (Just a side note, as it is being pursued in another thread): I have been quite happy with DL145-G2's IPMI and BMC board: I was able to power it remotely in every occasion, including after crashes. -- Paulo Afonso Lopes | Tel: +351- 21 294 8536 Departamento de Informática | 294 8300 ext.10763 Faculdade de Ciências e Tecnologia | Fax: +351- 21 294 8541 Universidade Nova de Lisboa | e-mail: pal at di.fct.unl.pt 2829-516 Caparica, PORTUGAL From rreis at aero.ist.utl.pt Sat Aug 2 05:49:11 2008 From: rreis at aero.ist.utl.pt (Ricardo Reis) Date: Sat, 2 Aug 2008 13:49:11 +0100 (WEST) Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran In-Reply-To: References: Message-ID: Hi all After backtracing and lots of going around I found out the problem. 
The routine to calculate the fft had a parameter for using or not using a buffer array which I wasn't passing through. Thanks all for your help and sorry to disturb. there should be a way to force the Fortran compiler to check every variable is passed in the interface... damn it. Greets, Ricardo Reis 'Non Serviam' PhD student @ Lasef Computational Fluid Dynamics, High Performance Computing, Turbulence http://www.lasef.ist.utl.pt & Cultural Instigator @ Rádio Zero http://www.radiozero.pt http://www.flickr.com/photos/rreis/ From csamuel at vpac.org Sun Aug 3 16:12:02 2008 From: csamuel at vpac.org (Chris Samuel) Date: Mon, 4 Aug 2008 09:12:02 +1000 (EST) Subject: [Beowulf] Re: Linux cluster authenticating against multiple Active Directory domains In-Reply-To: <1217575735.4977.1.camel@Vigor13> Message-ID: <1669319801.11171217805122430.JavaMail.root@mail.vpac.org> ----- "John Hearns" wrote: > On Fri, 2008-08-01 at 15:37 +1000, Chris Samuel wrote: > > > We'd prefer to steer clear of Kerberos, it introduces > > arbitrary job limitations through ticket lives that > > are not tolerable for HPC work. > > Kerberos is heavily used at CERN. They have a solution for > that issue - the job can ask for an extension to the tickets. That's useful to know, though it doesn't help in this situation due to the fact that there is no GSSAPI support in the mainline Torque at present. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. 
Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From gerry.creager at tamu.edu Mon Aug 4 05:04:15 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Mon, 04 Aug 2008 07:04:15 -0500 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <841759592.334291217464770684.JavaMail.root@zimbra.vpac.org> References: <841759592.334291217464770684.JavaMail.root@zimbra.vpac.org> Message-ID: <4896F03F.3080803@tamu.edu> Chris Samuel wrote: > ----- "Bogdan Costescu" wrote: > >> On Tue, 29 Jul 2008, Chris Samuel wrote: >> >>> 1) Use a mainline kernel, we've found benefit of that >>> over stock CentOS kernels. >> Care to comment on this statement ? > > a) We found that we got better performance out of > the mainline kernels than the CentOS ones; we guess > because they handle newer hardware better (RHEL is > meant to aim for stability over performance) Hadn't thought about this, but it makes a lot of sense. > b) We can use XFS for scratch space rather than being > tied to the RHEL One True Filesystem (ext3) which > (in our experience) can't handle large amounts of disk > I/O. Mirrors our experience, too. -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From landman at scalableinformatics.com Mon Aug 4 05:31:48 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Mon, 04 Aug 2008 08:31:48 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <841759592.334291217464770684.JavaMail.root@zimbra.vpac.org> References: <841759592.334291217464770684.JavaMail.root@zimbra.vpac.org> Message-ID: <4896F6B4.60602@scalableinformatics.com> Chris Samuel wrote: > ----- "Bogdan Costescu" wrote: > >> On Tue, 29 Jul 2008, Chris Samuel wrote: >> >>> 1) Use a mainline kernel, we've found benefit of that >>> over stock CentOS kernels. 
>> Care to comment on this statement ? > > a) We found that we got better performance out of > the mainline kernels than the CentOS ones; we guess > because they handle newer hardware better (RHEL is > meant to aim for stability over performance) This mirrors our experience, though RHEL stability under intense loads is questionable IMO (talking about the kernel BTW). We find that the missing drivers, the omitted drivers, the backported drivers along with some odd and often useless "features" (4k stacks anyone?) render the RHEL default kernels (and by definition the Centos kernels) less useful for HPC and storage tasks than what we build. Our current standard is a 2.6.23.14 kernel which is rock solid under load. Working on a 2.6.26 based version now (even though I am on vacation/holiday, I just updated it to 2.6.26.1 to address an observed crashing issue with the RDMA server) > b) We can use XFS for scratch space rather than being > tied to the RHEL One True Filesystem (ext3) which > (in our experience) can't handle large amounts of disk > I/O. Combine this with the small upper limit of ext3 partition sizes, the file size limits in ext3, the serialization in the journaling code (ext4 is extents based to help deal with this), ext3 just doesn't make much sense in a storage/HPC system (apart from possibly boot/root file system where performance is less critical). Yeah I have seen studies from folks who had done 1E6 removes, file creates, and other things who claim xfs is slower than ext3. Yeah, those are bad benchmarks in that they really don't touch on real end user use cases for the most part (apart from possible large scale mail servers and other things like that). > > YMMV! Always ... and with gas in the ~$4USD region, you need to conserve. Having been in London a few months ago, seeing almost $10USD/gallon (3.75 liters), I am gonna stop complaining about our price over here. 
-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From mark.kosmowski at gmail.com Mon Aug 4 13:37:30 2008 From: mark.kosmowski at gmail.com (Mark Kosmowski) Date: Mon, 4 Aug 2008 16:37:30 -0400 Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran Message-ID: > > Message: 4 > Date: Sat, 2 Aug 2008 13:49:11 +0100 (WEST) > From: Ricardo Reis > Subject: Re: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran > To: beowulf at beowulf.org > Message-ID: > Content-Type: text/plain; charset="iso-8859-15" > > > Hi all > > After backtracing and lots of going around I found out the problem. The > routine to calculate the fft had a parameter for using or not using a > buffer array which I wasn't passing through. > > Thanks all for your help and sorry to disturbe. > > there should be a way to force the fortan compiler to check every > variable is passed in the interface... damn it. > > Greets, > > Ricardo Reis > So, why did the 32-bit test case work? Shouldn't the same problem crash both systems if it is a code issue? In any event, I am glad you got your problem sorted out. Mark E. Kosmowski From matt at technoronin.com Mon Aug 4 13:54:19 2008 From: matt at technoronin.com (Matt Lawrence) Date: Mon, 4 Aug 2008 15:54:19 -0500 (CDT) Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <4896F6B4.60602@scalableinformatics.com> References: <841759592.334291217464770684.JavaMail.root@zimbra.vpac.org> <4896F6B4.60602@scalableinformatics.com> Message-ID: On Mon, 4 Aug 2008, Joe Landman wrote: > This mirrors our experience, though RHEL stability under intense loads is > questionable IMO (talking about the kernel BTW). 
We find that the missing > drivers, the omitted drivers, the backported drivers along with some odd and > often useless "features" (4k stacks anyone?) render the RHEL default kernels > (and by definition the Centos kernels) less useful for HPC and storage tasks > than what we build. Our current standard is a 2.6.23.14 kernel which is rock > solid under load. Working on a 2.6.26 based version now (even though I am on > vacation/holiday, I just updated it to 2.6.26.1 to address an observed > crashing issue with the RDMA server) Since I plan to continue running CentOS, it sounds like building a much later kernel rpm is the way I want to approach the problem. Will going to a much later kernel break any of the utilities? Other problems I can expect to see? What do you recommend for the kernel config? > Combine this with the small upper limit of ext3 partition sizes, the file > size limits in ext3, the serialization in the journaling code (ext4 is > extents based to help deal with this), ext3 just doesn't make much sense in a > storage/HPC system (apart from possibly boot/root file system where > performance is less critical). Yeah I have seen studies from folks whom had > done 1E6 removes, file creates, and other things who claim xfs is slower than > ext3. Yeah, those are bad benchmarks in that they really don't touch on real > end user use cases for the most part (apart from possible large scale mail > servers and other things like that). I have never had any problems with ext3. I had dinner with a friend who is an expert Linux sysadmin who was warning me to stay away from xfs. He cited lots of fragmentation problems that routinely locked up his systems. I am willing to be convinced otherwise, but he is a very sharp fellow. -- Matt It's not what I know that counts. It's what I can remember in time to use. 
From landman at scalableinformatics.com Mon Aug 4 15:02:17 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Mon, 04 Aug 2008 18:02:17 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: References: <841759592.334291217464770684.JavaMail.root@zimbra.vpac.org> <4896F6B4.60602@scalableinformatics.com> Message-ID: <48977C69.1000703@scalableinformatics.com> Matt Lawrence wrote: > On Mon, 4 Aug 2008, Joe Landman wrote: > >> This mirrors our experience, though RHEL stability under intense loads >> is questionable IMO (talking about the kernel BTW). We find that the >> missing drivers, the omitted drivers, the backported drivers along >> with some odd and often useless "features" (4k stacks anyone?) render >> the RHEL default kernels (and by definition the Centos kernels) less >> useful for HPC and storage tasks than what we build. Our current >> standard is a 2.6.23.14 kernel which is rock solid under load. >> Working on a 2.6.26 based version now (even though I am on >> vacation/holiday, I just updated it to 2.6.26.1 to address an observed >> crashing issue with the RDMA server) > > Since I plan to continue running CentOS, it sounds like building a much > later kernel rpm is the way I want to approach the problem. Will going > to a much later kernel break any of the utilities? Other problems I can > expect to see? Doesn't break most things. We usually insert a new RPM and off it goes. > > What do you recommend for the kernel config? > >> Combine this with the small upper limit of ext3 partition sizes, the >> file size limits in ext3, the serialization in the journaling code >> (ext4 is extents based to help deal with this), ext3 just doesn't make >> much sense in a storage/HPC system (apart from possibly boot/root file >> system where performance is less critical). Yeah I have seen studies >> from folks whom had done 1E6 removes, file creates, and other things >> who claim xfs is slower than ext3. 
Yeah, those are bad benchmarks in >> that they really don't touch on real end user use cases for the most >> part (apart from possible large scale mail servers and other things >> like that). > > I have never had any problems with ext3. I had dinner with a friend who > is an expert Linux sysadmin who was warning me to stay away from xfs. > He cited lots of fragmentation problems that routinely locked up his > systems. I am willing to be convinced otherwise, but he is a very sharp > fellow. I haven't seen or heard anyone claim xfs 'routinely locks up their system'. I won't comment on your friends "sharpness". I will point out that several very large data stores/large cluster sites use xfs. By definition, no large data store can be built with ext3 (16 TB limit with patches, 8 TB in practice), so if your sharp friend is advising you to do this ... -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From matt at technoronin.com Mon Aug 4 17:35:47 2008 From: matt at technoronin.com (Matt Lawrence) Date: Mon, 4 Aug 2008 19:35:47 -0500 (CDT) Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <48977C69.1000703@scalableinformatics.com> References: <841759592.334291217464770684.JavaMail.root@zimbra.vpac.org> <4896F6B4.60602@scalableinformatics.com> <48977C69.1000703@scalableinformatics.com> Message-ID: On Mon, 4 Aug 2008, Joe Landman wrote: > I haven't seen or heard anyone claim xfs 'routinely locks up their system'. > I won't comment on your friends "sharpness". I will point out that several > very large data stores/large cluster sites use xfs. By definition, no large > data store can be built with ext3 (16 TB limit with patches, 8 TB in > practice), so if your sharp friend is advising you to do this ... 
He currently works for a phone company, so the amount of data is quite large, but the usage pattern is probably quite different. As far as skill level, I would rate him much higher than any of the folks I work with as far as being a sysadmin. So, any good info on kernel configuration when I go to build a new rpm? There are a huge number of options and you have obviously gone through them much more recently than I have. -- Matt It's not what I know that counts. It's what I can remember in time to use. From landman at scalableinformatics.com Mon Aug 4 17:47:36 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Mon, 04 Aug 2008 20:47:36 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: References: <841759592.334291217464770684.JavaMail.root@zimbra.vpac.org> <4896F6B4.60602@scalableinformatics.com> <48977C69.1000703@scalableinformatics.com> Message-ID: <4897A328.9070004@scalableinformatics.com> Matt Lawrence wrote: > So, any good info on kernel configuration when I go to build a new rpm? Don't start with the distro .src.rpm for the kernel. Build your own, and integrate your patches manually. Best way is to take the barebones kernel from kernel.org, do a 'make rpm-pkg' on it (will generate a source RPM and spec file for you). Then install this source rpm, and voila, you have a working spec file. Integrate your patches into this, and use this to build. Decide what you need to support on your machines to spec your kernel version. Late model 2.6.25.x support NFS over RDMA so if you want that, you need the latest flavor of this. Decide which file system options you want, and make sure to integrate them (as modules). Remove things that you won't use (ISDN, Telephony, ARCNET, ...) > There are a huge number of options and you have obviously gone through > them much more recently than I have. A make xconfig can be quite helpful in changing the .config. > > -- Matt > It's not what I know that counts. > It's what I can remember in time to use. 
> _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From gus at ldeo.columbia.edu Mon Aug 4 21:37:38 2008 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 05 Aug 2008 00:37:38 -0400 Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran In-Reply-To: References: Message-ID: <4897D912.6040304@ldeo.columbia.edu> Salve Ricardo Reis and list Ricardo Reis wrote: > > Hi all > > After backtracing and lots of going around I found out the problem. > The routine to calculate the fft had a parameter for using or not > using a buffer array which I wasn't passing through. > > Thanks all for your help and sorry to disturbe. > > there should be a way to force the fortan compiler to check every > variable is passed in the interface... damn it. > Fortran 90 (and later) has this capability with module interfaces, which resemble C function prototypes. However, I don't think FFTW uses it, although I haven't used FFTW in a while to be sure about it. Smuggling various parameter types across the same subroutine interface seems to have been a desired feature of older Fortran. Are there any Fortran compilers that can actually check the subroutine parameter number, type, array dimensions, etc, through mere compilation flags? I don't remember any, the language features are probably not enough to ensure such checks (except for Fortran 90 as noted above). Glad to know your code now works! Gus Correa -- --------------------------------------------------------------------- Gustavo J. 
Ponce Correa, PhD - Email: gus at ldeo.columbia.edu Lamont-Doherty Earth Observatory - Columbia University P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- > Greets, > > Ricardo Reis > > 'Non Serviam' > > PhD student @ Lasef > Computational Fluid Dynamics, High Performance Computing, Turbulence > http://www.lasef.ist.utl.pt > > & > > Cultural Instigator @ Rádio Zero > http://www.radiozero.pt > > http://www.flickr.com/photos/rreis/ > >------------------------------------------------------------------------ > >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From prentice at ias.edu Tue Aug 5 07:15:57 2008 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 05 Aug 2008 10:15:57 -0400 Subject: [Beowulf] Kerberos + HPC In-Reply-To: <1217575735.4977.1.camel@Vigor13> References: <371991977.6661217569032542.JavaMail.root@mail.vpac.org> <1217575735.4977.1.camel@Vigor13> Message-ID: <4898609D.7060405@ias.edu> John Hearns wrote: > On Fri, 2008-08-01 at 15:37 +1000, Chris Samuel wrote: > >> We'd prefer to steer clear of Kerberos, it introduces >> arbitrary job limitations through ticket lives that >> are not tolerable for HPC work. >> > Kerberos is heavily used at CERN. They have a solution for that issue - > the job can ask for an extension to the tickets. > Sorry, I don't have a reference handy but its worth documenting this for > the list. > If ANYONE has more information on how this is done at CERN, I'd be very interested in hearing about it. I know, I know... GIYF... 
-- Prentice From kus at free.net Tue Aug 5 09:34:22 2008 From: kus at free.net (Mikhail Kuzminsky) Date: Tue, 05 Aug 2008 20:34:22 +0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: Message-ID: In message from Matt Lawrence (Mon, 4 Aug 2008 19:35:47 -0500 (CDT)): >On Mon, 4 Aug 2008, Joe Landman wrote: >> I haven't seen or heard anyone claim xfs 'routinely locks up their >>system'. >> I won't comment on your friends "sharpness". I will point out that >>several >> very large data stores/large cluster sites use xfs. By definition, >>no large >> data store can be built with ext3 (16 TB limit with patches, 8 TB in >> practice), so if your sharp friend is advising you to do this ... > >He currently works for a phone company, so the amount of data is >quite large, but the usage pattern is probably quite different. As >far as skill level, I would rate him much higher than any of the >folks I work with as far as being a sysadmin. I work w/xfs for HPC since 1995: I used xfs w/SGI SMP servers under IRIX, and then on Linux/x86 clusters. I didn't have any hang-ups because of xfs. But xfs is optimal for work w/large files; when you work w/a lot of relative small files, xfs isn't the better choice. The question about fragmentation itself is more interesting. We have in xfs filesystem a set of small files (1st of all, input data) in addition to large (usually temporary) files. So the fragmentation may be present. xfs has a rich set of utilities, but AFAIK no defragmentation tools (I don't know what will be after xfsdump/xfsrestore). But which modern linux filesystems have defragmentation possibilities ? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow From perry at piermont.com Tue Aug 5 09:59:30 2008 From: perry at piermont.com (Perry E. 
Metzger) Date: Tue, 05 Aug 2008 12:59:30 -0400 Subject: [Beowulf] Kerberos + HPC In-Reply-To: <4898609D.7060405@ias.edu> (Prentice Bisbal's message of "Tue\, 05 Aug 2008 10\:15\:57 -0400") References: <371991977.6661217569032542.JavaMail.root@mail.vpac.org> <1217575735.4977.1.camel@Vigor13> <4898609D.7060405@ias.edu> Message-ID: <87hc9zjt3h.fsf@snark.cb.piermont.com> Prentice Bisbal writes: > John Hearns wrote: >> On Fri, 2008-08-01 at 15:37 +1000, Chris Samuel wrote: >> >>> We'd prefer to steer clear of Kerberos, it introduces >>> arbitrary job limitations through ticket lives that >>> are not tolerable for HPC work. >> >> Kerberos is heavily used at CERN. They have a solution for that issue - >> the job can ask for an extension to the tickets. >> Sorry, I don't have a reference handy but its worth documenting this for >> the list. > > If ANYONE has more information on how this is done at CERN, I'd be very > interested in hearing about it. I know, I know... GIYF... I doubt they're doing anything unusual -- this is a completely normal thing any Kerberos setup deals with. You just stash the private key on the server, request a long ticket lifetime and refresh tickets reasonably frequently. Standard documentation can tell you how to do it -- just read the manuals. Perry -- Perry E. Metzger perry at piermont.com From prentice at ias.edu Tue Aug 5 10:07:03 2008 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 05 Aug 2008 13:07:03 -0400 Subject: [Beowulf] Kerberos + HPC In-Reply-To: <87hc9zjt3h.fsf@snark.cb.piermont.com> References: <371991977.6661217569032542.JavaMail.root@mail.vpac.org> <1217575735.4977.1.camel@Vigor13> <4898609D.7060405@ias.edu> <87hc9zjt3h.fsf@snark.cb.piermont.com> Message-ID: <489888B7.10806@ias.edu> Perry E. 
Metzger wrote: > Prentice Bisbal writes: >> John Hearns wrote: >>> On Fri, 2008-08-01 at 15:37 +1000, Chris Samuel wrote: >>> >>>> We'd prefer to steer clear of Kerberos, it introduces >>>> arbitrary job limitations through ticket lives that >>>> are not tolerable for HPC work. >>> Kerberos is heavily used at CERN. They have a solution for that issue - >>> the job can ask for an extension to the tickets. >>> Sorry, I don't have a reference handy but its worth documenting this for >>> the list. >> If ANYONE has more information on how this is done at CERN, I'd be very >> interested in hearing about it. I know, I know... GIYF... > > I doubt they're dong anything unusual -- this is a completely normal > thing any Kerberos setup deals with. You just stash the private key on > the server, request a long ticket lifetime and refresh reasonably > tickets frequently. Standard documentation can tell you how to do > it -- just read the manuals. > > Perry I don't believe it. It sounds way too simple! -- Prentice From jlb17 at duke.edu Tue Aug 5 11:10:33 2008 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Tue, 5 Aug 2008 14:10:33 -0400 (EDT) Subject: [Beowulf] Building new cluster - estimate In-Reply-To: References: Message-ID: On Tue, 5 Aug 2008 at 8:34pm, Mikhail Kuzminsky wrote > xfs has a rich set of utilities, but AFAIK no defragmentation tools (I don't > know what will be after xfsdump/xfsrestore). But which modern linux Not true -- see xfs_fsr(8). Back in the IRIX days, it was recommended to run this regularly. However, ISTR that the current recommendation is "as needed, but it really shouldn't be needed". 
-- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From prentice at ias.edu Tue Aug 5 13:38:08 2008 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 05 Aug 2008 16:38:08 -0400 Subject: [Beowulf] Kerberos + HPC In-Reply-To: <48989873.2090402@tuffmail.us> References: <371991977.6661217569032542.JavaMail.root@mail.vpac.org> <1217575735.4977.1.camel@Vigor13> <4898609D.7060405@ias.edu> <87hc9zjt3h.fsf@snark.cb.piermont.com> <489888B7.10806@ias.edu> <48989873.2090402@tuffmail.us> Message-ID: <4898BA30.1020201@ias.edu> Alan Louis Scheinine wrote: >> I don't believe it. It sounds way to simple! > > Perhaps the tricky part begins with the seemingly innocent > phrase "Standard documentation can tell you how to do > it -- just read the manuals." Are you saying RTFM? I've read the O'Reilly book on Kerberos several times, and I'm well-versed in Kerberos administration. I know how to adjust ticket TTLs. Perry suggests stashing the private key on the server and then refreshing the ticket automatically. How do you refresh the ticket automatically for a user while a job is waiting to run? That would have to be done by the queuing system, so the queuing system would have to be GSSAPI-aware. Someone already pointed out that Torque is NOT GSSAPI-aware so that leaves SGE and commercial applications. -- Prentice From gus at ldeo.columbia.edu Tue Aug 5 14:25:52 2008 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 05 Aug 2008 17:25:52 -0400 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel file system? Message-ID: <4898C560.6040606@ldeo.columbia.edu> Hello Beowulf fans Is anybody using Infiniband to provide both MPI connection and parallel file system services on a Beowulf cluster? I thought to have a storage node that would serve a parallel file system to the beowulf nodes over IB (something like a NFS on steroids). The same IB net would also work as the MPI interconnect. Is this design possible? 
On a small cluster, does it require two separate IB physical networks (cards and switch), or can it be done with a single IB card per node and one switch? Is this design efficient? Are there other practical and cost effective alternatives to this idea? Would this type of design work with GigE instead of IB? I confess I know nothing about parallel file systems and IB. So, please forgive me if my questions are nonsense. I also appreciate any links to readings that would mitigate my ignorance on these subjects. Thank you, Gus Correa -- --------------------------------------------------------------------- Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu Lamont-Doherty Earth Observatory - Columbia University P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- From kus at free.net Tue Aug 5 15:38:23 2008 From: kus at free.net (Mikhail Kuzminsky) Date: Wed, 06 Aug 2008 02:38:23 +0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: Message-ID: In message from Joshua Baker-LePain (Tue, 5 Aug 2008 14:10:33 -0400 (EDT)): >On Tue, 5 Aug 2008 at 8:34pm, Mikhail Kuzminsky wrote > >> xfs has a rich set of utilities, but AFAIK no defragmentation tools >>(I don't >> know what will be after xfsdump/xfsrestore). But which modern linux > >Not true -- see xfs_fsr(8). Thanks !! I didn't look to xfs details many years :-( - it's my mistake. > Back in the IRIX days, it was >recommended to run this regularly. I don't remember that xfs_fsr was included in IRIX 6.1-6.4 we used. Mikhail > However, ISTR that the current >recommendation is "as needed, but it really shouldn't be needed". > >-- >Joshua Baker-LePain >QB3 Shared Cluster Sysadmin >UCSF From hahn at mcmaster.ca Tue Aug 5 16:37:39 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue, 5 Aug 2008 19:37:39 -0400 (EDT) Subject: [Beowulf] Can one Infiniband net support MPI and a parallel file system? 
In-Reply-To: <4898C560.6040606@ldeo.columbia.edu> References: <4898C560.6040606@ldeo.columbia.edu> Message-ID: > Is anybody using Infiniband to provide both > MPI connection and parallel file system services on a Beowulf cluster? of course! many people have a strong opinion that sharing networks with file and mpi traffic is a bad thing, but I haven't seen anyone actually produce numbers. obviously contention increases the chances that a latency-sensitive operation (say, small synchronous mpi message) will be hurt by a stream of large file packets. and moreso when the fabric is not full-bisection - even if it's multiple cores sharing a node's single interface. but consider gigabit - 1500-byte packets consume <20 us of wire time, and most people who are using gb for mpi are expecting zero-byte latency quite a lot higher than that (say 50 us). by contrast, a max-size packet on old-gen SDR IB is about 4 us wire time, about the same as 0B latency. as has been pointed out here recently, the fabric will drop pretty significantly in performance once links become contended; this would make the latency-vs-bandwidth conflict more painful. (it also affects certain networks more than others - depending on their ability to adjust routes dynamically.) IMO, you have to ponder in your heart whether your expected workload will suffer from these issues. there is really no general rule, since workloads vary so widely in latency sensitivity and in bandwidth demands, all convolved with the fabric properties... if you have a well-defined workload, why not measure it? run an mpi app that has some sort of performance feedback while applying an increasing large-transfer NFS load... regards, mark hahn. 
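For anyone wanting to check the wire-time figures above, a rough sketch in Python follows. The numbers fed in are assumptions, not measurements: 1 Gb/s and a 1500-byte packet for gigabit ethernet, and for SDR IB an 8 Gb/s data rate (4x link after 8b/10b encoding) with a 4096-byte maximum packet. Framing and header overhead are ignored, so real wire times will be slightly longer.

```python
def wire_time_us(payload_bytes, data_rate_gbps):
    """Microseconds needed to serialize payload_bytes onto the link.

    1 Gb/s is 1000 bits per microsecond, so the conversion is a
    single division. Framing/header overhead is deliberately ignored.
    """
    bits = payload_bytes * 8
    return bits / (data_rate_gbps * 1000.0)


# Assumed link parameters (not taken from any vendor spec sheet here):
gige_us = wire_time_us(1500, 1.0)   # gigabit ethernet, standard MTU
sdr_us = wire_time_us(4096, 8.0)    # SDR IB, 4x link, 4096-byte packet

print(f"GigE 1500 B: {gige_us:.1f} us")
print(f"SDR IB 4096 B: {sdr_us:.1f} us")
```

Under these assumptions gigabit comes out at 12 us, well under a typical 50 us zero-byte MPI latency, while the SDR IB packet takes about 4 us, comparable to the fabric's own zero-byte latency. That is the reason large file packets are more likely to be felt by small MPI messages on IB than on gigabit.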
From csamuel at vpac.org Tue Aug 5 16:44:15 2008 From: csamuel at vpac.org (Chris Samuel) Date: Wed, 6 Aug 2008 09:44:15 +1000 (EST) Subject: [Beowulf] Building new cluster - estimate In-Reply-To: Message-ID: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> ----- "Matt Lawrence" wrote: > I have never had any problems with ext3. I suspect you're not doing a lot of disk I/O, we found NFS servers using ext3 as a back end would crumble under the weight of lots of writes as ext3 is single threaded through the journal daemon. That means that you end up with all your NFS daemons blocking on that, stalling everything else. :-( > I had dinner with a friend who is an expert Linux > sysadmin who was warning me to stay away from xfs. There have been occasional bugs in XFS in older kernel releases, but then there have been bugs in other filesystems too. > He cited lots of fragmentation problems that routinely > locked up his systems. Never had that problem here. Does he know that he can use xfs_fsr to defragment XFS filesystems online ? Is he sure he's not hitting another kernel bug ? cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From csamuel at vpac.org Tue Aug 5 16:47:57 2008 From: csamuel at vpac.org (Chris Samuel) Date: Wed, 6 Aug 2008 09:47:57 +1000 (EST) Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <4896F03F.3080803@tamu.edu> Message-ID: <373299326.37931217980077289.JavaMail.root@mail.vpac.org> ----- "Gerry Creager" wrote: > Chris Samuel wrote: > > > b) We can use XFS for scratch space rather than being > > tied to the RHEL One True Filesystem (ext3) which > > (in our experience) can't handle large amounts of disk > > I/O. > > Mirrors our experience, too. I should point out that our actual NFS servers run Debian Linux not CentOS. 
Those who want to run the stock CentOS kernel might like to know that the "plus" repository includes an RPM for the XFS kernel module for the mainline kernel. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From csamuel at vpac.org Tue Aug 5 16:56:28 2008 From: csamuel at vpac.org (Chris Samuel) Date: Wed, 6 Aug 2008 09:56:28 +1000 (EST) Subject: [Beowulf] Kerberos + HPC In-Reply-To: <4898BA30.1020201@ias.edu> Message-ID: <188691817.38081217980588864.JavaMail.root@mail.vpac.org> ----- "Prentice Bisbal" wrote: > Someone already pointed out that Torque is NOT > GSSAPI-aware so that leaves SGE and commercial > applications. To be fair to the Torque devs I did say that the release versions don't have GSSAPI support, but there is a GSSAPI branch in SVN. But I don't think it gets much development and probably even less testing which leads to a chicken/egg situation. Nobody is going to deploy Kerberos on a cluster if it'll break the queueing system and nobody will get GSSAPI support into a queueing system if there's not clusters needing it which can tolerate downtime and lost jobs to get it working. cheers! Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From perry at piermont.com Tue Aug 5 17:06:33 2008 From: perry at piermont.com (Perry E. 
Metzger) Date: Tue, 05 Aug 2008 20:06:33 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> (Chris Samuel's message of "Wed\, 6 Aug 2008 09\:44\:15 +1000 $EST$") References: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> Message-ID: <87d4knhura.fsf@snark.cb.piermont.com> Chris Samuel writes: > ----- "Matt Lawrence" wrote: > >> I have never had any problems with ext3. > > I suspect you're not doing a lot of disk I/O, we > found NFS servers using ext3 as a back end would > crumble under the weight of lots of writes as ext3 > is single threaded through the journal daemon. > > That means that you end up with all your NFS daemons > blocking on that, stalling everything else. :-( Put your journal onto a battery backed RAM card or the equivalent on the RAID controller and it significantly speeds up dealing with a journal. Perry -- Perry E. Metzger perry at piermont.com From gerry.creager at tamu.edu Tue Aug 5 17:16:24 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Tue, 05 Aug 2008 19:16:24 -0500 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <373299326.37931217980077289.JavaMail.root@mail.vpac.org> References: <373299326.37931217980077289.JavaMail.root@mail.vpac.org> Message-ID: <4898ED58.8020001@tamu.edu> Chris Samuel wrote: > ----- "Gerry Creager" wrote: > >> Chris Samuel wrote: >> >>> b) We can use XFS for scratch space rather than being >>> tied to the RHEL One True Filesystem (ext3) which >>> (in our experience) can't handle large amounts of disk >>> I/O. >> Mirrors our experience, too. > > I should point out that our actual NFS servers run > Debian Linux not CentOS. > > Those who want to run the stock CentOS kernel might > like to know that the "plus" repository includes an > RPM for the XFS kernel module for the mainline kernel. And, of course, we do. Good point. 
-- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From matt at technoronin.com Tue Aug 5 19:27:24 2008 From: matt at technoronin.com (Matt Lawrence) Date: Tue, 5 Aug 2008 21:27:24 -0500 (CDT) Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <373299326.37931217980077289.JavaMail.root@mail.vpac.org> References: <373299326.37931217980077289.JavaMail.root@mail.vpac.org> Message-ID: On Wed, 6 Aug 2008, Chris Samuel wrote: > Those who want to run the stock CentOS kernel might > like to know that the "plus" repository includes an > RPM for the XFS kernel module for the mainline kernel. It works well as long as you remember to install the xfsprogs package. I spent five minutes today going "where the heck are mkfs.xfs and the man pages?". -- Matt It's not what I know that counts. It's what I can remember in time to use. From matt at technoronin.com Tue Aug 5 19:37:13 2008 From: matt at technoronin.com (Matt Lawrence) Date: Tue, 5 Aug 2008 21:37:13 -0500 (CDT) Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> References: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> Message-ID: On Wed, 6 Aug 2008, Chris Samuel wrote: > I suspect you're not doing a lot of disk I/O, we > found NFS servers using ext3 as a back end would > crumble under the weight of lots of writes as ext3 > is single threaded through the journal daemon. > > That means that you end up with all your NFS daemons > blocking on that, stalling everything else. :-( Could be. Given the long and sordid history of NFS, I prefer to not use it whenever there are practical alternatives. I'm also not a Solaris fanboy. So, different mindset than a lot of unix sysadmins. 
> There have been occasional bugs in XFS in older kernel > releases, but then there have been bugs in other filesystems > too. That could be it, he does spend a fair amount of time cleaning up systems that others have built. > Never had that problem here. > > Does he know that he can use xfs_fsr to defragment > XFS filesystems online ? He certainly does. He was talking about using OpenNMS to determine the best time to run it. He had lots of good things to say about how easy it is to track through performance data with it. > Is he sure he's not hitting another kernel bug ? It wouldn't surprise me. This is someone who I trust enough that if he warns me of something, I make a real effort to doublecheck if it is currently a problem. It doesn't mean he is always right, just that I think the research effort is a really good idea. -- Matt It's not what I know that counts. It's what I can remember in time to use. From landman at scalableinformatics.com Tue Aug 5 19:43:47 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 05 Aug 2008 22:43:47 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: References: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> Message-ID: <48990FE3.3080500@scalableinformatics.com> As a note: I was pointed to a recent lockup (double lock acquisition) in XFS with NFS. I don't think I have seen this one in the wild myself. Right now I am fighting an NFS over RDMA crash in 2.6.26 which seems to have been cured in 2.6.26.1 . .2 is almost out, so will test with that as well. This said, our experience with xfs has been quite good (performance, reliability, etc). Some vendors kernels (2.6.18 ahem!) have some issues with xfs (and a bunch of other things), so we usually update them anyway. 
Joe Matt Lawrence wrote: > On Wed, 6 Aug 2008, Chris Samuel wrote: > >> I suspect you're not doing a lot of disk I/O, we >> found NFS servers using ext3 as a back end would >> crumble under the weight of lots of writes as ext3 >> is single threaded through the journal daemon. >> >> That means that you end up with all your NFS daemons >> blocking on that, stalling everything else. :-( > > Could be. Given the long and sordid history of NFS, I prefer to not use > it whenever there are practical alternatives. I'm also not a Solaris > fanboy. So, different mindset that a lot of unix sysadmins. > >> There have been occasional bugs in XFS in older kernel >> releases, but then there have been bugs in other filesystems >> too. > > That could be it, he does spend a fair amount of time cleaning up > systems that others have built. > >> Never had that problem here. >> >> Does he know that he can use xfs_fsr to defragment >> XFS filesystems online ? > > He certainly does. He was talking about using OpenNMS to determine the > best time to run it. He had lots of good things to say about how easy > it is to track through performance data with it. > >> Is he sure he's not hitting another kernel bug ? > > It wouldn't surprise me. > > This is someone who I trust enough that if he warns me of something, I > make a real effort to doublecheck if it is currently a problem. It > doesn't mean he is always right, just that I think the research effort > is a really good idea. > > -- Matt > It's not what I know that counts. > It's what I can remember in time to use. 
> _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From john.hearns at streamline-computing.com Tue Aug 5 23:57:26 2008 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed, 06 Aug 2008 07:57:26 +0100 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel file system? In-Reply-To: <4898C560.6040606@ldeo.columbia.edu> References: <4898C560.6040606@ldeo.columbia.edu> Message-ID: <1218005856.5116.5.camel@Vigor13> On Tue, 2008-08-05 at 17:25 -0400, Gus Correa wrote: > Hello Beowulf fans > > Is anybody using Infiniband to provide both > MPI connection and parallel file system services on a Beowulf cluster? > > I thought to have a storage node that would > serve a parallel file system to the beowulf nodes over IB > (something like a NFS on steroids). > The same IB net would also work as the MPI interconnect. > > Is this design possible? > Yes - just look at the fastest cluster in the world, Roadrunner. It uses Infiniband to access the Panasas parallel file system. In that architecture there are storage routers between the Infiniband and the Panasas. I'd imagine that TACC Ranger runs Lustre over Infiniband. From jiteshbdundas at gmail.com Wed Aug 6 01:07:37 2008 From: jiteshbdundas at gmail.com (jitesh dundas) Date: Wed, 6 Aug 2008 13:37:37 +0530 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel file system? In-Reply-To: <1218005856.5116.5.camel@Vigor13> References: <4898C560.6040606@ldeo.columbia.edu> <1218005856.5116.5.camel@Vigor13> Message-ID: Dear Sir, I have this query for which I request your reply. 
It is possible to transfer data from one node to another using parallel computing. However, I wish to know if it is possible to migrate the settings of one node to another, assuming we are using multiple computers, not necessarily in the same network. Would not security and performance issues come into the picture here? Please excuse my ignorance; I have just started getting involved in Beowulf and I am trying to get my concepts clear. If you have any articles on Beowulf that could help, please do share them with me. Thanks & Regards, Jitesh Dundas On 8/6/08, John Hearns wrote: > On Tue, 2008-08-05 at 17:25 -0400, Gus Correa wrote: >> Hello Beowulf fans >> >> Is anybody using Infiniband to provide both >> MPI connection and parallel file system services on a Beowulf cluster? >> >> I thought to have a storage node that would >> serve a parallel file system to the beowulf nodes over IB >> (something like a NFS on steroids). >> The same IB net would also work as the MPI interconnect. >> >> Is this design possible? >> > > Yes - just look at the fastest cluster in the world, Roadrunner. > It uses Infiniband to access the Panasas parallel file system. In that > architecture there are storage routers between the Infiniband and the > Panasas. > > I'd imagine that TACC Ranger runs Lustre over Infiniband. > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From andrew at moonet.co.uk Wed Aug 6 02:26:27 2008 From: andrew at moonet.co.uk (andrew holway) Date: Wed, 6 Aug 2008 10:26:27 +0100 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel file system? In-Reply-To: <4898C560.6040606@ldeo.columbia.edu> References: <4898C560.6040606@ldeo.columbia.edu> Message-ID: Gus, It works(ish) and people are doing it but my research has shown that it is not yet stable. 
I have been talking to various companies offering lustre support. They have all told me that they can do it but none have been able to offer a reference site. I bet you the chaps behind roadrunner aren't going to be publishing any downtime figures. as mentioned by mark, If you try and force lots of stuff down the tubes you are going to break something. I guess its a _bit_ like torrents on a naff home router, lots and lots of small torrent connections filling up the nat table which cannot purge itself fast enough which mean that larger downloads time out as they fall off the bottom of the table. Data on how hard IB switches have to work would be interesting. I have a feeling that many people are taking their fabric to the very edge and back again! Perhaps someone can shed more light on the problems? ta Andy On Tue, Aug 5, 2008 at 10:25 PM, Gus Correa wrote: > Hello Beowulf fans > > Is anybody using Infiniband to provide both > MPI connection and parallel file system services on a Beowulf cluster? > > I thought to have a storage node that would > serve a parallel file system to the beowulf nodes over IB > (something like a NFS on steroids). > The same IB net would also work as the MPI interconnect. > > Is this design possible? > > On a small cluster, does it require two separate IB physical networks (cards > and switch), > or can it be done with a single IB card per node and one switch? > > Is this design efficient? > > Are there other practical and cost effective alternatives to this idea? > > Would this type of design work with GigE instead of IB? > > I confess I know nothing about parallel file systems and IB. > So, please forgive me if my questions are nonsense. > > I also appreciate any links to readings that would mitigate my ignorance > on these subjects. > > Thank you, > Gus Correa > > -- > --------------------------------------------------------------------- > Gustavo J. 
Ponce Correa, PhD - Email: gus at ldeo.columbia.edu > Lamont-Doherty Earth Observatory - Columbia University > P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA > --------------------------------------------------------------------- > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From perry at piermont.com Wed Aug 6 06:41:35 2008 From: perry at piermont.com (Perry E. Metzger) Date: Wed, 06 Aug 2008 09:41:35 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: (Matt Lawrence's message of "Tue\, 5 Aug 2008 21\:37\:13 -0500 $CDT$") References: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> Message-ID: <87k5eu45ww.fsf@snark.cb.piermont.com> Matt Lawrence writes: > Could be. Given the long and sordid history of NFS, I prefer to not > use it whenever there are practical alternatives. NFS is a fine protocol and works very well. However, traditionally the Linux implementation of NFS has been of less than perfect quality. You shouldn't confuse NFS with NFS on Linux. Perry -- Perry E. Metzger perry at piermont.com From hahn at mcmaster.ca Wed Aug 6 07:15:22 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 6 Aug 2008 10:15:22 -0400 (EDT) Subject: [Beowulf] Can one Infiniband net support MPI and a parallel file system? In-Reply-To: References: <4898C560.6040606@ldeo.columbia.edu> Message-ID: > It works(ish) and people are doing it but my research has shown that > it is not yet stable. "not stable" sounds like a bit of a smear. file and mpi activity _do_ coexist on a single network - the only issue is possible contention. it's not like NFS somehow ionizes the wires so MPI packets sort out ;) > I have been talking to various companies > offering lustre support. They have all told me that they can do it but > none have been able to offer a reference site. 
my organization has at least 4 production clusters which use the interconnect for both MPI and file (lustre) traffic. ironically, our one IB cluster has no local filestore, but two are quadrics, one is myri 2g and one is plain old gigabit. actually, now that I think of it, we have ~6 other myri 2g clusters that also share the IC between MPI and NFS. > as mentioned by mark, If you try and force lots of stuff down the > tubes you are going to break something. I guess its a _bit_ like contention is possible, but mixing NFS+MPI doesn't change anything. you can still run into fabric contention with pure MPI - after all, it's not as if _every_ MPI program was equally latency-tolerant or only used sparse tinygrams. there is NOTHING wrong with using a single network for NFS and MPI - just consider, preferably measure, your workload's traffic beforehand. if you can handle NFS purely via gigabit (ie, ~80 MB/s), it's probably very cheap to add a decent gigabit switch. of course, you can just as easily see the same contention with the right mix of MPI traffic - no panacea. From gerry.creager at tamu.edu Wed Aug 6 07:59:59 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed, 06 Aug 2008 09:59:59 -0500 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: References: <373299326.37931217980077289.JavaMail.root@mail.vpac.org> <4898ED58.8020001@tamu.edu> Message-ID: <4899BC6F.8030404@tamu.edu> Robert Kubrick wrote: > Or use solid-state data disks? Does anybody here have experience with > SSD disk in HPC? Not on OUR budget! ;-) > On Aug 5, 2008, at 8:16 PM, Gerry Creager wrote: > >> Chris Samuel wrote: >>> ----- "Gerry Creager" wrote: >>>> Chris Samuel wrote: >>>> >>>>> b) We can use XFS for scratch space rather than being >>>>> tied to the RHEL One True Filesystem (ext3) which >>>>> (in our experience) can't handle large amounts of disk >>>>> I/O. >>>> Mirrors our experience, too. 
>>> I should point out that our actual NFS servers run >>> Debian Linux not CentOS. >>> Those who want to run the stock CentOS kernel might >>> like to know that the "plus" repository includes an >>> RPM for the XFS kernel module for the mainline kernel. >> >> And, of course, we do. Good point. -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From kus at free.net Wed Aug 6 08:39:43 2008 From: kus at free.net (Mikhail Kuzminsky) Date: Wed, 06 Aug 2008 19:39:43 +0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <4899BC6F.8030404@tamu.edu> Message-ID: In message from Gerry Creager (Wed, 06 Aug 2008 09:59:59 -0500): >Robert Kubrick wrote: >> Or use solid-state data disks? Does anybody here have experience >>with >> SSD disk in HPC? > >Not on OUR budget! ;-) It was the proposal for journal part only ;-) SSD/flash disks (for increasing of lifetime) attempt not to erase data really - if it's physically possible. But if I use practically whole HDD partition for scratch files (and therefore whole SSD) - IMHO it'll be impossible not to erase flash RAM. What will be w/SSD disk lifetime in that case ? Mikhail Kuzminsky Computer Assistance to Chemical Research Zelinsky Institute of Organic Chemistry Moscow From Craig.Tierney at noaa.gov Wed Aug 6 11:47:03 2008 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Wed, 06 Aug 2008 12:47:03 -0600 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel file system? In-Reply-To: References: <4898C560.6040606@ldeo.columbia.edu> Message-ID: <4899F1A7.805@noaa.gov> andrew holway wrote: > Gus, > > It works(ish) and people are doing it but my research has shown that > it is not yet stable. I have been talking to various companies > offering lustre support. 
They have all told me that they can do it but > none have been able to offer a reference site. > We are running our filesystems and MPI traffic over the same IB network. We are having no problems with this configuration. The system consists of two trees (each with ~70% bisection bandwidth) connected via a top-level tree to share IB communications between the filesystems and the compute nodes. One side of the tree has ~350 woodcrest nodes, the other ~250 harpertown nodes. We don't run jobs between the two systems, but both systems share the same filesystems. Just to complicate matters, we are supporting both Rapidscale and Lustre (v1.6.5.1) on our nodes. The most obvious job contention we have seen on the IB network is at the filesystem, not between the filesystem traffic and the MPI traffic. We had some issues with the subnet manager initially, but we have worked through them. The latest version of Lustre has been quite stable in our environment. As another poster already stated, I suspect that this configuration would be an issue for codes that are very latency sensitive. Our codes are more latency sensitive than bandwidth sensitive, and we haven't seen any significant issues (and the configuration has been stable so far). Craig > I bet you the chaps behind roadrunner aren't going to be publishing > any downtime figures. > > as mentioned by mark, If you try and force lots of stuff down the > tubes you are going to break something. I guess its a _bit_ like > torrents on a naff home router, lots and lots of small torrent > connections filling up the nat table which cannot purge itself fast > enough which mean that larger downloads time out as they fall off the > bottom of the table. > > Data on how hard IB switches have to work would be interesting. I have > a feeling that many people are taking their fabric to the very edge > and back again! > > Perhaps someone can shed more light on the problems? 
> > ta > > Andy > > On Tue, Aug 5, 2008 at 10:25 PM, Gus Correa wrote: >> Hello Beowulf fans >> >> Is anybody using Infiniband to provide both >> MPI connection and parallel file system services on a Beowulf cluster? >> >> I thought to have a storage node that would >> serve a parallel file system to the beowulf nodes over IB >> (something like a NFS on steroids). >> The same IB net would also work as the MPI interconnect. >> >> Is this design possible? >> >> On a small cluster, does it require two separate IB physical networks (cards >> and switch), >> or can it be done with a single IB card per node and one switch? >> >> Is this design efficient? >> >> Are there other practical and cost effective alternatives to this idea? >> >> Would this type of design work with GigE instead of IB? >> >> I confess I know nothing about parallel file systems and IB. >> So, please forgive me if my questions are nonsense. >> >> I also appreciate any links to readings that would mitigate my ignorance >> on these subjects. >> >> Thank you, >> Gus Correa >> >> -- >> --------------------------------------------------------------------- >> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu >> Lamont-Doherty Earth Observatory - Columbia University >> P.O. 
Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA >> --------------------------------------------------------------------- >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Craig Tierney (craig.tierney at noaa.gov) From rreis at aero.ist.utl.pt Tue Aug 5 02:57:42 2008 From: rreis at aero.ist.utl.pt (Ricardo Reis) Date: Tue, 5 Aug 2008 10:57:42 +0100 (WEST) Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran In-Reply-To: References: Message-ID: On Mon, 4 Aug 2008, Mark Kosmowski wrote: > So, why did the 32-bit test case work? Shouldn't the same problem > crash both systems if it is a code issue? I asked the same question myself... The function interface is:

      call rfftwnd_f77_mpi(plan_c2r, &
           1, local_data, work, use_work, FFTW_NORMAL_ORDER)

where use_work is an integer, value 1 if you use the work temporary array, 0 otherwise. This was the variable I wasn't passing. FFTW_NORMAL_ORDER instructs fftw to return a proper ordering of the array and FFTW_TRANSPOSED_ORDER cuts some comm steps making it more efficient (and then you have to work out the array ordering yourself). The wrapper function for this is (from rfftw_f77_mpi.c):

void F77_FUNC_(rfftwnd_f77_mpi,RFFTWND_F77_MPI)
(rfftwnd_mpi_plan *p, int *n_fields, fftw_real *local_data,
 fftw_real *work, int *use_work, int *ioutput_order)
{
     fftwnd_mpi_output_order output_order = *ioutput_order ?
          FFTW_TRANSPOSED_ORDER : FFTW_NORMAL_ORDER;
     rfftwnd_mpi(*p, *n_fields, local_data, *use_work ? work : NULL,
                 output_order);
}

and the code was blocking in the

     fftwnd_mpi_output_order output_order = *ioutput_order ?
          FFTW_TRANSPOSED_ORDER : FFTW_NORMAL_ORDER;

line. So it must be a pointer issue revealed by the 64 bit, no? When I wasn't doing it "properly" the value of *ioutput_order wasn't set. greets, Ricardo Reis 'Non Serviam' PhD student @ Lasef Computational Fluid Dynamics, High Performance Computing, Turbulence http://www.lasef.ist.utl.pt & Cultural Instigator @ Rádio Zero http://www.radiozero.pt http://www.flickr.com/photos/rreis/ From robertkubrick at gmail.com Tue Aug 5 17:34:51 2008 From: robertkubrick at gmail.com (Robert Kubrick) Date: Tue, 5 Aug 2008 20:34:51 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <4898ED58.8020001@tamu.edu> References: <373299326.37931217980077289.JavaMail.root@mail.vpac.org> <4898ED58.8020001@tamu.edu> Message-ID: Or use solid-state data disks? Does anybody here have experience with SSD disk in HPC? On Aug 5, 2008, at 8:16 PM, Gerry Creager wrote: > Chris Samuel wrote: >> ----- "Gerry Creager" wrote: >>> Chris Samuel wrote: >>> >>>> b) We can use XFS for scratch space rather than being >>>> tied to the RHEL One True Filesystem (ext3) which >>>> (in our experience) can't handle large amounts of disk >>>> I/O. >>> Mirrors our experience, too. >> I should point out that our actual NFS servers run >> Debian Linux not CentOS. >> Those who want to run the stock CentOS kernel might >> like to know that the "plus" repository includes an >> RPM for the XFS kernel module for the mainline kernel. > > And, of course, we do. Good point. 
> -- > Gerry Creager -- gerry.creager at tamu.edu > Texas Mesonet -- AATLT, Texas A&M University > Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 > Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From jimqiao at hotmail.com Wed Aug 6 09:29:26 2008 From: jimqiao at hotmail.com (Lei Qiao) Date: Wed, 6 Aug 2008 12:29:26 -0400 Subject: [Beowulf] Torque manager Message-ID: Hello, folks, I am currently building a small Beowulf cluster and using the Torque manager to schedule batch jobs. I have a problem with the Torque command 'qstat'. According to the manual, all users' jobs should be listed when the command is issued. In my case, though, I only see my own jobs when I log in as a regular user, but I can see everyone's jobs when I log in as root. (Before checking the jobs, I logged in remotely as other regular users and submitted some jobs with 'qsub -l nodes=1 ./run.sh'.) I guess the reason is the ssh and permission configuration: root can access any user on any machine without a password, whereas a regular user cannot access the information of other users. Does anyone have similar experience or a solution? Thanks in advance for your help. By the way, my cluster is Fedora Core 6 based, and works well when running C programs with MPI. The firewall is enabled, and ssh, NFS and NIS are allowed. Lei Qiao Department of Electrical and Computer Engineering School of Engineering and Applied Sciences University of Rochester Rochester, NY 14627 
From jclinton at advancedclustering.com Wed Aug 6 11:31:09 2008 From: jclinton at advancedclustering.com (Jason Clinton) Date: Wed, 6 Aug 2008 13:31:09 -0500 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel file system? In-Reply-To: <4898C560.6040606@ldeo.columbia.edu> References: <4898C560.6040606@ldeo.columbia.edu> Message-ID: <588c11220808061131l6c3ecf49hc45b5da7151e64b8@mail.gmail.com> On Tue, Aug 5, 2008 at 4:25 PM, Gus Correa wrote: > Is anybody using Infiniband to provide both > MPI connection and parallel file system services on a Beowulf cluster? > > I thought to have a storage node that would > serve a parallel file system to the beowulf nodes over IB > (something like a NFS on steroids). > The same IB net would also work as the MPI interconnect. > > Is this design possible? We have customers doing Lustre and MPI with IB successfully. They still have a good-old gigabit management network to fall back on: it makes sense to keep this around because gigabit is so low-cost by comparison and it's rock-solid. But, you should know that you need more than a single node to provide disk I/O before you start to see the performance benefit. I/O from a single node can--generally--barely fill a gigabit link. To exceed that gigabit level of performance, you'd need more than one storage node delivering storage to the Lustre network. > On a small cluster, does it require two separate IB physical networks (cards > and switch), > or can it be done with a single IB card per node and one switch? It can be done with a single IB network. > Is this design efficient? Generally speaking, MPI programs will not be fetching/writing data from/to storage at the same time they are doing MPI calls so there tends to not be very much contention to worry about at the node level. > Are there other practical and cost effective alternatives to this idea? 
If the cluster is small enough, using gigabit with a shared filesystem is preferred since IB's low latency has relatively little effect on the big source of latency in any storage system: the physical disks. It's not until you cross the gigabit bandwidth barrier that IB really starts to make sense--and that's a barrier that's not crossed that often in a small cluster. > Would this type of design work with GigE instead of IB? Yes, but you'd still want IB for low latency MPI traffic. > I confess I know nothing about parallel file systems and IB. > So, please forgive me if my questions are nonsense. Lustre and Panasas are certainly both stable options in this area. -- Jason D. Clinton Advanced Clustering Technologies, Inc. From rgb at phy.duke.edu Wed Aug 6 12:34:52 2008 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed, 6 Aug 2008 15:34:52 -0400 (EDT) Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <87k5eu45ww.fsf@snark.cb.piermont.com> References: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> <87k5eu45ww.fsf@snark.cb.piermont.com> Message-ID: On Wed, 6 Aug 2008, Perry E. Metzger wrote: > > Matt Lawrence writes: >> Could be. Given the long and sordid history of NFS, I prefer to not >> use it whenever there are practical alternatives. > > NFS is a fine protocol and works very well. However, traditionally the > Linux implementation of NFS has been of less than perfect quality. You > shouldn't confuse NFS with NFS on Linux. And even on Linux machines, NFS has been, well, "functional" is a good way to describe it. For its primary original purpose, which is serving home directories or remote mount e.g. binaries in midsize and smaller workstation LANs, it is adequate and has worked well for us for almost ten years (not without some pain, mind you, but with no more pain than anything else). 
For the last five or six years even most of the pain has gone away and things like automounting work most of the time with only rare hangs or stale mount problems (on highly reliable server hardware and with a very reliable network). Once upon a time, running NFS in a LAN that wasn't controlled at the port level was basically openly inviting anyone that could plug into a wired port to have open access to all exported files, and I'm not sure that has fundamentally changed as to change it would be very difficult. A host that is permitted to mount a directory is typically known only by IP number (which of course anybody can set to masquerade as any host) and no hard authentication tokens are required. Also, traffic is typically not encrypted IIRC so anybody can snoop the wire if they're on it. I once upon a time had a few lovely cracking tools that let me just mount any user's home directory with no special privileges from userspace -- it didn't even require rootspace. I think things are better now, but still think of it as a tool to use primarily on trusted internal networks for primarily bandwidth-limited (few larger files) and not stat-limited (many smaller files) traffic. rgb > > Perry > -- Robert G. Brown Phone(cell): 1-919-280-8443 Duke University Physics Dept, Box 90305 Durham, N.C. 27708-0305 Web: http://www.phy.duke.edu/~rgb Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From perry at piermont.com Wed Aug 6 12:44:23 2008 From: perry at piermont.com (Perry E. Metzger) Date: Wed, 06 Aug 2008 15:44:23 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: (Robert G. Brown's message of "Wed\, 6 Aug 2008 15\:34\:52 -0400 $EDT$") References: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> <87k5eu45ww.fsf@snark.cb.piermont.com> Message-ID: <87wsiuylm0.fsf@snark.cb.piermont.com> "Robert G.
Brown" writes: > Once upon a time, running NFS in a LAN that wasn't controlled at the > port level was basically openly inviting anyone that could plug into a > wired port to have open access to all exported files, and I'm not sure > that has fundamentally changed as to change it would be very difficult. NFSv4 changes the security situation a bunch, but it is not widely implemented and deployed. One can also use IPSec with NFSv3 -- appropriate IPSec policies will assure that you get reasonable security, at the price of some performance because of the crypto. Perry -- Perry E. Metzger perry at piermont.com From gus at ldeo.columbia.edu Wed Aug 6 13:08:05 2008 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed, 06 Aug 2008 16:08:05 -0400 Subject: [Beowulf] Torque manager In-Reply-To: References: Message-ID: <489A04A5.9060803@ldeo.columbia.edu> Hi Lei Qiao and list It may be just the Torque/PBS configuration. As root, try: qmgr -c 'set server your-pbs-server-name query_other_jobs = True' Then try qstat again as a regular user. I hope this helps. Gus Correa -- --------------------------------------------------------------------- Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu Lamont-Doherty Earth Observatory - Columbia University P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Lei Qiao wrote: >Hello, forks, > >I am currently building a small Beowulf cluster and using Torque manager to schedule batch jobs. Now I have a problems with Torque command 'qstat'. >as the manual said, all users' jobs can be seen when the command is issued. But for my case, I only see the my own jobs when i login with a regular user but i can see it when login with a root user ( before check the jobs, I remotely login with other regular users to run some jobs with 'qsub -l nodes=1 ./run.sh') > >I guess the reason is ssh and permission configuration. 
the root can access any user in any machine without password, whereas the regular user can not access the information of other users. Does anyone has similar experience or have the solution? Thanks for your help in advance. > >by the way, my cluster is Fedora Core 6 based, and good when running C program with MPI. Firewall is enabled and ssh, NFS and NIS is allowed. > > >Lei Qiao >Department of Electrical and Computer Engineering >School of Engineering and Applied Sciences >University of Rochester >Rochester, NY 14627 > >_________________________________________________________________ >Get more from your digital life. Find out how. >http://www.windowslive.com/default.html?ocid=TXT_TAGLM_WL_Home2_082008 >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From jclinton at advancedclustering.com Wed Aug 6 12:56:51 2008 From: jclinton at advancedclustering.com (Jason Clinton) Date: Wed, 6 Aug 2008 14:56:51 -0500 Subject: [Beowulf] Seeing ECC errors since upgraded from Opteron 246 to 275 In-Reply-To: <20670.89.180.225.196.1217678257.squirrel@www.di.fct.unl.pt> References: <2405.10.170.133.93.1217605242.squirrel@www.di.fct.unl.pt> <20670.89.180.225.196.1217678257.squirrel@www.di.fct.unl.pt> Message-ID: <588c11220808061256o7140c9aekdb9b88f7047d4dde@mail.gmail.com> On Sat, Aug 2, 2008 at 6:57 AM, Paulo Afonso Lopes wrote: > Thanks, Mark > >>> So I have 2 DL145-G2 nodes with 2 single-core 246 / 4GB each, and 2 >>> DL145-G2 nodes with 2 dual-core 275 / 4GB each. >> >> it's worth making sure you have current bios installed. >> > Not the latest, but the previous; according to "Fixes" just a single, > unrelated fix. Anyway I'm upgrading it... 
>> >>> 07/28/2008 | 17:52:23 | Memory #0x02 | Uncorrectable ECC | Asserted >> >> it may also be useful to run mcelog, which will tell you about >> any ongoing _correctable_ ECC activity. > > No output in any of the 4 hosts; tried with/without --k8, --dmi, etc. We have a tool on our website called "breakin" that is Linux 2.6.25.9 patched with K8 and K10f Opteron EDAC reporting facilities. It can usually find and identify failed RAM in fifteen minutes (two hours at most). The EDAC patches to the kernel aren't that great about naming the correct memory rank, though. Make sure you have multibit (sometimes says 4-bit) ECC enabled in your BIOS. http://www.advancedclustering.com/software/breakin.html From dnlombar at ichips.intel.com Wed Aug 6 14:56:01 2008 From: dnlombar at ichips.intel.com (Lombard, David N) Date: Wed, 6 Aug 2008 14:56:01 -0700 Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran In-Reply-To: References: Message-ID: <20080806215601.GA2375@nlxdcldnl2.cl.intel.com> On Tue, Aug 05, 2008 at 02:57:42AM -0700, Ricardo Reis wrote: > On Mon, 4 Aug 2008, Mark Kosmowski wrote: > > > So, why did the 32-bit test case work? Shouldn't the same problem > > crash both systems if it is a code issue? Not necessarily given the error described below. > I asked the same question myself... The function interface is: > > call rfftwnd_f77_mpi(plan_c2r, & > 1, local_data, work, use_work, FFTW_NORMAL_ORDER) > > where use_work is an integer, value 1 if you use the work temporary > array, 0 otherwise. This was the variable I wasn't passing. ... > The wrapper function for this is (from rfftw_f77_mpi.c): > > void F77_FUNC_(rfftwnd_f77_mpi,RFFTWND_F77_MPI) > (rfftwnd_mpi_plan *p, int *n_fields, fftw_real *local_data, > fftw_real *work, int *use_work, int *ioutput_order) > .... So it must be a pointer issue revealed by the 64 bit, no? When I > wasn't doing it "properly" the value of *ioutput_order wasn't set. 
The value of the first element of local_data was used for the n_fields scalar. The work array was being laid down starting at the location of the use_work scalar. The FFTW_NORMAL_ORDER value was being interpreted as the use_work scalar. Finally, the ioutput_order scalar was some random value. So, a lot was going wrong there. It's just one of life's little, um, pleasures that it looked like it was working for your 32-bit test case. Don't worry, you'll likely do this again, as likely *every* one of us on this list has, too. BTW, Fortran passes by reference; that's why all args are pointers. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From csamuel at vpac.org Wed Aug 6 16:48:32 2008 From: csamuel at vpac.org (Chris Samuel) Date: Thu, 7 Aug 2008 09:48:32 +1000 (EST) Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <1675460398.47611218066218095.JavaMail.root@mail.vpac.org> Message-ID: <382428251.47851218066512841.JavaMail.root@mail.vpac.org> ----- "Robert G. Brown" wrote: > And even on Linux machines, NFS has been, well, "functional" > is a good way to describe it. It actually seems to work pretty well these days, our general config is: 1) No automounter 2) Hard mounts (so jobs just hang if they lose contact) 3) NFS over TCP (NFS over UDP is sooo 1990's :-)) 4) Jumbo frames (9000 byte MTUs) on the NFS network 5) NFS file server has hardwired fsid's to prevent stale file handles on a reboot 6) Debian, not RHEL on the server 7) XFS for /home on the server cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O.
Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From kyron at neuralbs.com Wed Aug 6 17:44:39 2008 From: kyron at neuralbs.com (Eric Thibodeau) Date: Wed, 06 Aug 2008 20:44:39 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: References: <627174098.322311217331658556.JavaMail.root@zimbra.vpac.org> Message-ID: <489A4577.5040508@neuralbs.com> Bogdan Costescu wrote: > On Tue, 29 Jul 2008, Chris Samuel wrote: > >> 1) Use a mainline kernel, we've found benefit of that >> over stock CentOS kernels. > > Care to comment on this statement ? I do ;) Simply download a kernel from kernel.org and build the kernel yourself and set: CONFIG_HZ_100=y CONFIG_HZ=100 CONFIG_PREEMPT_NONE=y And select the main stuff (HDD drivers) as built in and don't fsck around with the initrd stuff, that's only useful for kernels that need to be generic and adapt to all hardware (i.e., install CDs)...other than that, a monolithic kernel works fine ;) ...and such. I'd tell you to use the Gentoo Clustering LiveCD but that's work in progress...you could still build the cluster using Gentoo...if you're performance savvy...and want things like an OpenMP-capable compiler (gcc-4.3.1, or ICC ;) ) _integrated_ into your system (not a hackish afterthought of an RPM that pulls in a new glibc that breaks the install anyways ;) ...but, then again...no distribution war, seems people want the easy install solution and veil that fact with the "it has to be supported" catch phrase Eh!
Eric Thibodeau From landman at scalableinformatics.com Wed Aug 6 18:05:01 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 06 Aug 2008 21:05:01 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <489A4577.5040508@neuralbs.com> References: <627174098.322311217331658556.JavaMail.root@zimbra.vpac.org> <489A4577.5040508@neuralbs.com> Message-ID: <489A4A3D.1070000@scalableinformatics.com> Eric Thibodeau wrote: > And select the main stuff (HDD drivers) as built in and don't fsck > around with the initrd stuff, that's only usefull for kernels that need > to be generic and adapt to all hardware (ie: install CDs)...other than > that, monolithic a kernel works fine ;) Advantage of modules is you can upgrade them without upgrading the kernel. Go ahead, build in that e1000 driver. I dare yah... :( More to the point it does give some good flexibility for end users with a need to keep the core "separate" from the drivers for maintenance. Initrd is subtle and quick to anger. One must use burnt offerings to placate the spirits of initrd. Well, it would be a heck of a lot nicer if the tools were a little more forgiving ... Oh you don't have this driver in your initrd ... ok ... PANIC (mwahahahaha) > > > ...and such. I'd tell you to use the Gentoo Clustering LiveCD but that's > work in progress...you could still build the cluster using Gentoo...if > you're performance savvy...and want things like OpenMP capable compiler I have been hearing claims like this for a long time. I have not seen any real tests that back these claims up. Do you have any? Most of the arguments I have heard are "oh but its compiled with -O3" or whatever. Any decent HPC code person will tell you that that is most definitely not a guaranteed way to a faster system ... > (gcc-4.3.1, or ICC ;) ) _integrated_ into your system (not a hackish Er... We often use several different compilers in several different trees. Several gccs, pgi, icc, eieio ... you name it. 
All are integrated. > afterthought of an RPM that pulls in a new glibc that breaks the install Er ... not the slightest clue as to what you are talking about. I haven't seen gcc, icc, pgi, ... touch our glibc. Maybe I am missing the fun. Which ICC version is this? Which gcc is this, which glibc is this? -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From kyron at neuralbs.com Wed Aug 6 18:07:01 2008 From: kyron at neuralbs.com (Eric Thibodeau) Date: Wed, 06 Aug 2008 21:07:01 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: References: <841759592.334291217464770684.JavaMail.root@zimbra.vpac.org> <4896F6B4.60602@scalableinformatics.com> Message-ID: <489A4AB5.9090704@neuralbs.com> Matt Lawrence wrote: > On Mon, 4 Aug 2008, Joe Landman wrote: > >> This mirrors our experience, though RHEL stability under intense >> loads is questionable IMO (talking about the kernel BTW). We find >> that the missing drivers, the omitted drivers, the backported drivers >> along with some odd and often useless "features" (4k stacks anyone?) >> render the RHEL default kernels (and by definition the Centos >> kernels) less useful for HPC and storage tasks than what we build. >> Our current standard is a 2.6.23.14 kernel which is rock solid under >> load. Working on a 2.6.26 based version now (even though I am on >> vacation/holiday, I just updated it to 2.6.26.1 to address an >> observed crashing issue with the RDMA server) > > Since I plan to continue running CentOS, it sounds like building a > much later kernel rpm is the way I want to approach the problem. Will > going to a much later kernel break any of the utilities? Other > problems I can expect to see? > > What do you recommend for the kernel config? 
> >> Combine this with the small upper limit of ext3 partition sizes, the >> file size limits in ext3, the serialization in the journaling code >> (ext4 is extents based to help deal with this), ext3 just doesn't >> make much sense in a storage/HPC system (apart from possibly >> boot/root file system where performance is less critical). Yeah I >> have seen studies from folks whom had done 1E6 removes, file creates, >> and other things who claim xfs is slower than ext3. Yeah, those are >> bad benchmarks in that they really don't touch on real end user use >> cases for the most part (apart from possible large scale mail servers >> and other things like that). > > I have never had any problems with ext3. I had dinner with a friend > who is an expert Linux sysadmin who was warning me to stay away from > xfs. He cited lots of fragmentation problems that routinely locked up > his systems. I am willing to be convinced otherwise, but he is a very > sharp fellow. Check the kernel mailing list for XFS problems with RAID5 if you use mdadm...just a gentle suggestion ;) > > -- Matt > It's not what I know that counts. > It's what I can remember in time to use.
From kyron at neuralbs.com Wed Aug 6 18:33:10 2008 From: kyron at neuralbs.com (Eric Thibodeau) Date: Wed, 06 Aug 2008 21:33:10 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <489A4A3D.1070000@scalableinformatics.com> References: <627174098.322311217331658556.JavaMail.root@zimbra.vpac.org> <489A4577.5040508@neuralbs.com> <489A4A3D.1070000@scalableinformatics.com> Message-ID: <489A50D6.4060302@neuralbs.com> Joe Landman wrote: > > > Eric Thibodeau wrote: > >> And select the main stuff (HDD drivers) as built in and don't fsck >> around with the initrd stuff, that's only usefull for kernels that >> need to be generic and adapt to all hardware (ie: install >> CDs)...other than that, monolithic a kernel works fine ;) > > Advantage of modules is you can upgrade them without upgrading the > kernel. Go ahead, build in that e1000 driver. I dare yah... :( Ok...I didn't put enough emphasis on "main" stuff....as in, _all you need to get the system booted, which essentially means HDD chipset drivers, the rest I do build as a module (NIC, video and such). > > More to the point it does give some good flexibility for end users > with a need to keep the core "separate" from the drivers for maintenance. > > Initrd is subtle and quick to anger. One must use burnt offerings to > placate the spirits of initrd. LOL! > > Well, it would be a heck of a lot nicer if the tools were a little > more forgiving ... Oh you don't have this driver in your initrd ... ok > ... PANIC (mwahahahaha) Pahahahahah... Case in point, I am building a CD-only cluster system (based on Gentoo) and I am currently _NOT_ using initrd because all that really needs to be built in is NFSroot support and all NICs I care to put in.
Obviously this is a deprecated approach but it's proven to be the most effective and easy to maintain in my case. >> >> >> ...and such. I'd tell you to use the Gentoo Clustering LiveCD but >> that's work in progress...you could still build the cluster using >> Gentoo...if you're performance savvy...and want things like OpenMP >> capable compiler > > I have been hearing claims like this for a long time. I have not seen > any real tests that back these claims up. Do you have any? I'm actually working on such benchmarks. Did you know that compiling with the default ICC optimization will cause your bridge to crumble due to floating point assumptions?... Ok, so my computations have diverged horribly, mostly because I am computing 47 (vector size) * 5000 (K-Means clusters) * 6,787,955 (learning dataset) * 5 (iterations to convergence) for a total of 7,975,847,125,000 floating-point operations (about 8 tera-operations); as part of an iterative learning process, the error adds up. So performance is very sensitive to what your intended goal is too ;) > Most of the arguments I have heard are "oh but its compiled with > -O3" or whatever. Any decent HPC code person will tell you that that > is most definitely not a guaranteed way to a faster system ... Hey...as I stated above, one would have to be quite silly to claim -O3 as the all well and all good optimization solution. At least you can rest assured your solutions will add up correctly with GCC. To get a "faster" system, you really have to look at your app, use strace, ltrace and gprof, then you can play with that. What I _am_ saying though is that Gentoo _does_ empower the administrator by giving him the ability to customize the OS if a bottleneck is to be identified. > >> (gcc-4.3.1, or ICC ;) ) _integrated_ into your system (not a hackish > > Er... We often use several different compilers in several different > trees. Several gccs, pgi, icc, eieio ... you name it. All are > integrated.
Are you currently able to run GCC-4.3.x versions on your current setup? I'm actually eager to know. I'm still living under the ASSumption of binary distributions not coping too well with multi-library environments. Case in point, one of my colleagues _really_ wanted firefox 3 on his ubuntu system. The installer trickled down to having to uninstall glibc...and he forced it to YES (and this is just a browser, not something that is used to _make_ code and would be tied to glibc) > >> afterthought of an RPM that pulls in a new glibc that breaks the install > > Er ... not the slightest clue as to what you are talking about. I > haven't seen gcc, icc, pgi, ... touch our glibc. > > Maybe I am missing the fun. Which ICC version is this? Which gcc is > this, which glibc is this? > Sorry about that, I might have been misleading: GCC is generally the one most sensitive to glibc, not the other ones, although the latest ICC (10.1.x series) does claim compatibility with the GNU environment so it might get a little more dependency there. Cheers! Eric From landman at scalableinformatics.com Wed Aug 6 19:01:17 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 06 Aug 2008 22:01:17 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <489A50D6.4060302@neuralbs.com> References: <627174098.322311217331658556.JavaMail.root@zimbra.vpac.org> <489A4577.5040508@neuralbs.com> <489A4A3D.1070000@scalableinformatics.com> <489A50D6.4060302@neuralbs.com> Message-ID: <489A576D.4000005@scalableinformatics.com> Eric Thibodeau wrote: >> Advantage of modules is you can upgrade them without upgrading the >> kernel. Go ahead, build in that e1000 driver. I dare yah... :( > Ok...I didn't put enought emphasis on "main" stuff....as in, _all you > need to get the system booted, which essentially means HDD chipset > drivers, the rest I do build as a module (NIC, video and such).
>> >> More to the point it does give some good flexibility for end users >> with a need to keep the core "separate" from the drivers for maintenance. >> >> Initrd is subtle and quick to anger. One must use burnt offerings to >> placate the spirits of initrd. > LOL! ... now I don't mean hardware burnt offerings ... smoke rising from your motherboard may not placate the spirits of initrd, they definitely may impede further operations ... >> >> Well, it would be a heck of a lot nicer if the tools were a little >> more forgiving ... Oh you don't have this driver in your initrd ... ok >> ... PANIC (mwahahahaha) > Pahahahahah... Point in case, I am building a CD-only cluster system > (based on Gentoo) and I am currently _NOT_ using initrd because all that > really needs to be built in is NFSroot support an all NICs I care to put > in. Obviously this is a deprecated approach but it's proven to be the > most effective and easy to maintain in my case. We build an integrated NFSroot and e1000 and a few other things for a customer. Fixed hardware for their cluster. From bare-metal-off to operational infiniband compute node in ~45-60 seconds (I say 45, but a few things took a little longer to start, like SGE). >>> >>> >>> ...and such. I'd tell you to use the Gentoo Clustering LiveCD but >>> that's work in progress...you could still build the cluster using >>> Gentoo...if you're performance savvy...and want things like OpenMP >>> capable compiler >> >> I have been hearing claims like this for a long time. I have not seen >> any real tests that back these claims up. Do you have any? > I'm actually working on such benchmarks. Did you know that compiling > with the default ICC optimization will cause your bridge to crumble due > to floating point assumptions?... 
> > Ok, so my computation have diverged horribly mostly because I am > computing 47(vector size)*5000(K-Means clusters)*6,787,955(learning > dataset)*5(iterations to convergence) for a total of 7,975,847,125,000 > FLOPS (or about 8Tera FLOPS) as part of an iterative learning process, > the error adds up. So performance is very sensitive to what your > intended goal is too ;) Hmmm.... sounds like a fun computation. Error definitely adds up. Renormalization is your friend (well, some times, assuming a linear system). >> Most of the arguments I have heard are "oh but its compiled with >> -O3" or whatever. Any decent HPC code person will tell you that that >> is most definitely not a guaranteed way to a faster system ... > Hey...as I stated above, one would have to be quite silly to claim -O3 > as the all well and all good optimization solution. At least you can > rest assured your solutions will add up correctly with GCC. To get a Well, sometimes. You still need to be careful with it. This said, I am not sure icc/pgi/... are uniformly better than gcc. I did an admittedly tiny study of this http://scalability.org/?p=470 some time ago. What I found was the gcc really held its own. It did a very good job on a very simple test case. Then again, the fortran version was simply faster than the C version, but that can be explained ... by ... er ... ah ... something. > "faster" system, you really have to look at your app, use strace, ltrace > and gprof, then you can play with that. What I _am_ saying though is > that Gentoo _does_ empower the administrator by giving him the ability > to customize the OS if a bottleneck is to be identified. Yup. There is nothing like a profile of an app running the code, to see where it is spending its time to decide between code shifts and algorithmic shifts. >> >>> (gcc-4.3.1, or ICC ;) ) _integrated_ into your system (not a hackish >> >> Er... We often use several different compilers in several different >> trees. 
Several gccs, pgi, icc, eieio ... you name it. All are >> integrated. > Are-you currently able to run GCC-4.3.x versions on your current setup, Currently running 4.2.3-2ubuntu7 on my laptop. Other machines (development box) has something like 4 different gccs there. I haven't tried 4.3.x yet ... had planned to, but work gets in the way. > I'm actually eager to know. I'm still living under the ASSumption od > binary distributions not coping too well with multi-library > environments. Point in case, one of my colleagues _really_ wanted No, our systems (Ubuntu, SuSE, Centos) seem to have no real problems apart from the occasional broken hard wired /usr/lib with the wrong ABI in a configure/make file. Usually easy to fix. > firefox 3 on his ubuntu system. The installer trickled down to having to > uninstall glibc...and he forced it to YES (and this is just a browser, > not something that is used to _make_ code and would be tied to glibc) Hmmm... I have firefox 3 on this system (64 bit) and I run icecat for 32 bit access (java and other things). No glibc changes (apart from security patches). He must have done something horribly wrong. We have multiple mixed ABI ubuntu/centos/suse systems, and haven't had issues. >> >>> afterthought of an RPM that pulls in a new glibc that breaks the install >> >> Er ... not the slightest clue as to what you are talking about. I >> haven't seen gcc, icc, pgi, ... touch our glibc. >> >> Maybe I am missing the fun. Which ICC version is this? Which gcc is >> this, which glibc is this? >> > Sorry about that I might have been misleading, GCC is generally the one > most sensitive to glibc, not the other ones although the latest ICC > (10.1.x series) do claim compatibility with the GNU environment so it > might get a little more dependency there. We have installed the 10.1.015 on customer machines from Centos 5.2 through SuSE 10.x through Ubuntu with nary a problem. Very different glibc's. No issues with code generation. 
Binary distributions aren't evil. They do work, quite well in most cases. > > Cheers! > > Eric -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From matt at technoronin.com Wed Aug 6 19:24:00 2008 From: matt at technoronin.com (Matt Lawrence) Date: Wed, 6 Aug 2008 21:24:00 -0500 (CDT) Subject: [Beowulf] Building new cluster - estimate In-Reply-To: References: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> <87k5eu45ww.fsf@snark.cb.piermont.com> Message-ID: On Wed, 6 Aug 2008, Robert G. Brown wrote: > On Wed, 6 Aug 2008, Perry E. Metzger wrote: > > And even on Linux machines, NFS has been, well, "functional" is a good > way to describe it. For its primary original purpose, which is serving > home directories or remote mount e.g. binaries in midsize and smaller > workstation LANS, it is adequate and has worked well for us for almost > ten years (not without some pain, mind you, but with no more pain than > anythng else). For the last five or six years even most of the pain has > gone away and things like automounting work most of the time with only > rare hangs or stale mount problems (on highly reliable server hardware > and with a very reliable network). Youngsters these days..... I still have painful memories of an environment with too many filesystems cross mounted between workstations and (at the time big) minicomputers. All too often someone would shut down a workstation that was serving a filesystem and everything would crash. Just like dominos. Like I said, a sordid history. -- Matt It's not what I know that counts. It's what I can remember in time to use. 
From gerry.creager at tamu.edu Wed Aug 6 19:33:07 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed, 06 Aug 2008 21:33:07 -0500 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: References: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> <87k5eu45ww.fsf@snark.cb.piermont.com> Message-ID: <489A5EE3.4040100@tamu.edu> Matt Lawrence wrote: > On Wed, 6 Aug 2008, Robert G. Brown wrote: > >> On Wed, 6 Aug 2008, Perry E. Metzger wrote: >> >> And even on Linux machines, NFS has been, well, "functional" is a good >> way to describe it. For its primary original purpose, which is serving >> home directories or remote mount e.g. binaries in midsize and smaller >> workstation LANS, it is adequate and has worked well for us for almost >> ten years (not without some pain, mind you, but with no more pain than >> anythng else). For the last five or six years even most of the pain has >> gone away and things like automounting work most of the time with only >> rare hangs or stale mount problems (on highly reliable server hardware >> and with a very reliable network). > > Youngsters these days..... > > I still have painful memories of an environment with too many > filesystems cross mounted between workstations and (at the time big) > minicomputers. All too often someone would shut down a workstation that > was serving a filesystem and everything would crash. Just like dominos. > > Like I said, a sordid history. Whiling away my misspent youth at Johnson Space Center's Software Technology Branch, while I wasn't an official system administrator (we had few "official" sys-admins) I did back-stop them for some functions, and had root access. We had a rather complicated cross-mount system requiring carefully timed boot/power-up cycling, lest we spend the whole day randomly rebooting things to fix cross-mount dependencies. 
No, that wasn't by design of the incumbents at the time, but it was so pervasive (and big Sun servers capable of taking over the whole fileserver load were so expensive relative to our budget) that we fixed these issues a little bit at a time. This was also the era where a run-away ping sweep took down a few routers in the US... and elsewhere... when router tables filled up. And, yeah, that originated from my piece of the Branch, too. -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From gus at ldeo.columbia.edu Wed Aug 6 19:53:15 2008 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed, 06 Aug 2008 22:53:15 -0400 Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran In-Reply-To: <20080806215601.GA2375@nlxdcldnl2.cl.intel.com> References: <20080806215601.GA2375@nlxdcldnl2.cl.intel.com> Message-ID: <489A639B.8000500@ldeo.columbia.edu> Hi Ricardo, David, Mark, and list If, as Ricardo says, he suppressed the 5th parameter ("use_work") on the call to rfftwnd_f77_mpi, which has 6 parameters, wouldn't it start mismatching pointers on the 5th parameter, instead of on the 2nd parameter ("n_fields")? I.e. "use_work" would take the value of "FFTW_NORMAL_ORDER", and "FFTW_NORMAL_ORDER" would get a random value (OS permitting), but the initial 4 parameters would be correct, right? In any case, there is little difference between this and what David said: the point of failure is different, but the nature is the same. However, it is interesting that somehow at runtime the program segfaults in 64-bits, but doesn't fail in 32-bits, although it most likely computes wrong stuff. Ricardo, have you ever QC'd the 32-bit output before you fixed/inserted "use_work"? If you were in a big lucky strike the random value left on the FFTW_NORMAL_ORDER address matched your needs, and the result may be correct!
:) Anyway, somehow the program seems to behave differently, with the OS superego being more compliant (in a nasty sense) in 32-bits than it is in 64-bits. Does the OS paradoxically give less memory room for the stack in 64-bits, leading to the segfault? Or does it give the same room, but because the pointers are bigger the segfault is more likely? Or does the segfault happen somewhere else, not on the stack? Where? Why in 64-bits? Why not in 32 bits? Yes, as David noted about programming, here I also got and continue to get these bugs, particularly in Fortran programs where no parameter checking is enforced. And the nastier ones are those that don't segfault, then come back to haunt you when somebody looks at the output, if you are not careful enough to look at it before anybody else does. Cheers, Gus Correa Compilar e' preciso, rodar e' impreciso! ... mais uma do vosso alter-ego P'ssoa ... :) -- --------------------------------------------------------------------- Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu Lamont-Doherty Earth Observatory - Columbia University P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Lombard, David N wrote: >On Tue, Aug 05, 2008 at 02:57:42AM -0700, Ricardo Reis wrote: > > >>On Mon, 4 Aug 2008, Mark Kosmowski wrote: >> >> >> >>>So, why did the 32-bit test case work? Shouldn't the same problem >>>crash both systems if it is a code issue? >>> >>> > >Not necessarily given the error described below. > > > >>I asked the same question myself... The function interface is: >> >> call rfftwnd_f77_mpi(plan_c2r, & >> 1, local_data, work, use_work, FFTW_NORMAL_ORDER) >> >>where use_work is an integer, value 1 if you use the work temporary >>array, 0 otherwise. This was the variable I wasn't passing. >> >> >... 
> > >>The wrapper function for this is (from rfftw_f77_mpi.c): >> >>void F77_FUNC_(rfftwnd_f77_mpi,RFFTWND_F77_MPI) >>(rfftwnd_mpi_plan *p, int *n_fields, fftw_real *local_data, >> fftw_real *work, int *use_work, int *ioutput_order) >> >> > > > >> .... So it must be a pointer issue revealed by the 64 bit, no? When I >>wasn't doing it "properly" the value of *ioutput_order wasn't set. >> >> > >The value of the first element of local_data was used for the n_fields scalar. > >The work array was being laid down starting at the location of the use_work scalar. > >The FFTW_NORMAL_ORDER value was being interpreted as use_work scalar. > >Finally, ioutput_order scalar was some random value. > >So, a lot was going wrong there. It's just one of life's little, um, pleasures >that it looked like it was working for your 32-bit test case. Don't worry, you'll >likely do this again, as likely *every* one of us on this list has, too. > >BTW, Fortran passes by reference; that's why all args are pointers. > > > From gerry.creager at tamu.edu Wed Aug 6 21:10:26 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed, 06 Aug 2008 23:10:26 -0500 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <382428251.47851218066512841.JavaMail.root@mail.vpac.org> References: <382428251.47851218066512841.JavaMail.root@mail.vpac.org> Message-ID: <489A75B2.2050009@tamu.edu> Chris Samuel wrote: > ----- "Robert G. Brown" wrote: > >> And even on Linux machines, NFS has been, well, "functional" >> is a good way to describe it. 
> > It actually seems to work pretty well these days, our > general config is: > > 1) No automounter > 2) Hard mounts (so jobs just hang if they loose contact) > 3) NFS over TCP (NFS over UDP is sooo 1990's :-)) > 4) Jumbo frames (9000 byte MTUs) on the NFS network > 5) NFS file server has hardwired fsid's to prevent stale file handles on a reboot > 6) Debian, not RHEL on the server > 7) XFS for /home on the server Speaking of jumbo frames, I'm seeing a problem on a Broadcom 57xx chipset on CentOS 4.3, 2.6.9-67 kernel (yeah, I know) and a tg3 driver. I can't make the thing recognize the ability to use jumbo frames. Anyone got a fix? -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From djholm at fnal.gov Wed Aug 6 15:37:40 2008 From: djholm at fnal.gov (Don Holmgren) Date: Wed, 06 Aug 2008 17:37:40 -0500 (CDT) Subject: [Beowulf] Torque manager In-Reply-To: References: Message-ID: Very likely, the server parameter "query_other_jobs" is set to false in your Torque configuration. Try setting this to true, via for example qmgr -c 'set server query_other_jobs=true' Don Holmgren Fermilab On Wed, 6 Aug 2008, Lei Qiao wrote: > > Hello, forks, > > I am currently building a small Beowulf cluster and using Torque manager to > schedule batch jobs. Now I have a problems with Torque command 'qstat'. as the > manual said, all users' jobs can be seen when the command is issued. But for > my case, I only see the my own jobs when i login with a regular user but i can > see it when login with a root user ( before check the jobs, I remotely login > with other regular users to run some jobs with 'qsub -l nodes=1 ./run.sh') > > I guess the reason is ssh and permission configuration. the root can access > any user in any machine without password, whereas the regular user can not > access the information of other users. 
Does anyone has similar experience or > have the solution? Thanks for your help in advance. > > by the way, my cluster is Fedora Core 6 based, and good when running C program > with MPI. Firewall is enabled and ssh, NFS and NIS is allowed. > > > Lei Qiao > Department of Electrical and Computer Engineering > School of Engineering and Applied Sciences > University of Rochester > Rochester, NY 14627 From landman at scalableinformatics.com Thu Aug 7 04:26:05 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 07 Aug 2008 07:26:05 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <489A75B2.2050009@tamu.edu> References: <382428251.47851218066512841.JavaMail.root@mail.vpac.org> <489A75B2.2050009@tamu.edu> Message-ID: <489ADBCD.9090307@scalableinformatics.com> Gerry Creager wrote: > Speaking of jumbo frames, I'm seeing a problem on a Broadcom 57xx > chipset on CentOS 4.3, 2.6.9-67 kernel (yeah, I know) and a tg3 driver. > I can't make the thing recognize the ability to use jumbo frames. > Anyone got a fix? Had a very similar issue some time ago with tg3 on Broadcom chipset NICs (57xx). The only solution we could find was to not use jumbo frames. The preventative measure we take as a result of this is also to recommend using Intel NICs (and a few others) over Broadcom whenever possible. 
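For anyone chasing the same tg3 symptom, it can help to confirm what MTU the kernel has actually applied, rather than what was requested. A minimal sketch, assuming a Linux host that exposes /sys/class/net/<iface>/mtu; the interface name eth0, the 9000-byte threshold (the jumbo-frame MTU quoted earlier in this thread) and the helper names read_mtu and is_jumbo are illustrative only, not taken from anyone's setup here.

```python
from pathlib import Path


def read_mtu(iface, sysfs_root="/sys/class/net"):
    """Return the MTU the kernel currently reports for `iface`."""
    # sysfs exposes each interface attribute as a small text file;
    # sysfs_root is a parameter only so the helper can be pointed elsewhere.
    return int((Path(sysfs_root) / iface / "mtu").read_text().strip())


def is_jumbo(iface, wanted=9000, sysfs_root="/sys/class/net"):
    """True if the interface's current MTU is at least `wanted` bytes."""
    return read_mtu(iface, sysfs_root) >= wanted
```

Calling is_jumbo("eth0") after asking for a 9000-byte MTU (for example with ifconfig eth0 mtu 9000) shows whether the driver took the setting; it says nothing about whether the switch in between passes frames of that size.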
-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From mark.kosmowski at gmail.com Thu Aug 7 04:36:40 2008 From: mark.kosmowski at gmail.com (Mark Kosmowski) Date: Thu, 7 Aug 2008 07:36:40 -0400 Subject: [Beowulf] Building new cluster - estimate Message-ID: > > Message: 7 > Date: Wed, 06 Aug 2008 22:01:17 -0400 > From: Joe Landman > Subject: Re: [Beowulf] Building new cluster - estimate > To: kyron at neuralbs.com > Cc: Bogdan Costescu , Beowulf > List , Chris Samuel > Message-ID: <489A576D.4000005 at scalableinformatics.com> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > Eric Thibodeau wrote: > > >> Advantage of modules is you can upgrade them without upgrading the > >> kernel. Go ahead, build in that e1000 driver. I dare yah... :( > > Ok...I didn't put enought emphasis on "main" stuff....as in, _all you > > need to get the system booted, which essentially means HDD chipset > > drivers, the rest I do build as a module (NIC, video and such). > >> > >> More to the point it does give some good flexibility for end users > >> with a need to keep the core "separate" from the drivers for maintenance. > >> > >> Initrd is subtle and quick to anger. One must use burnt offerings to > >> placate the spirits of initrd. > > LOL! > > ... now I don't mean hardware burnt offerings ... smoke rising from > your motherboard may not placate the spirits of initrd, they definitely > may impede further operations ... You beat me to it - I was going to ask whether initrd preferred power supplies or motherboards. ;) > > >> > >> Well, it would be a heck of a lot nicer if the tools were a little > >> more forgiving ... Oh you don't have this driver in your initrd ... ok > >> ... PANIC (mwahahahaha) > > Pahahahahah... 
Point in case, I am building a CD-only cluster system > > (based on Gentoo) and I am currently _NOT_ using initrd because all that > > really needs to be built in is NFSroot support an all NICs I care to put > > in. Obviously this is a deprecated approach but it's proven to be the > > most effective and easy to maintain in my case. > > We build an integrated NFSroot and e1000 and a few other things for a > customer. Fixed hardware for their cluster. From bare-metal-off to > operational infiniband compute node in ~45-60 seconds (I say 45, but a > few things took a little longer to start, like SGE). > > >>> > >>> > >>> ...and such. I'd tell you to use the Gentoo Clustering LiveCD but > >>> that's work in progress...you could still build the cluster using > >>> Gentoo...if you're performance savvy...and want things like OpenMP > >>> capable compiler > >> > >> I have been hearing claims like this for a long time. I have not seen > >> any real tests that back these claims up. Do you have any? > > I'm actually working on such benchmarks. Did you know that compiling > > with the default ICC optimization will cause your bridge to crumble due > > to floating point assumptions?... > > > > Ok, so my computation have diverged horribly mostly because I am > > computing 47(vector size)*5000(K-Means clusters)*6,787,955(learning > > dataset)*5(iterations to convergence) for a total of 7,975,847,125,000 > > FLOPS (or about 8Tera FLOPS) as part of an iterative learning process, > > the error adds up. So performance is very sensitive to what your > > intended goal is too ;) > > Hmmm.... sounds like a fun computation. Error definitely adds up. > Renormalization is your friend (well, some times, assuming a linear system). > > >> Most of the arguments I have heard are "oh but its compiled with > >> -O3" or whatever. Any decent HPC code person will tell you that that > >> is most definitely not a guaranteed way to a faster system ... 
> > Hey...as I stated above, one would have to be quite silly to claim -O3 > > as the all well and all good optimization solution. At least you can > > rest assured your solutions will add up correctly with GCC. To get a > > Well, sometimes. You still need to be careful with it. > > This said, I am not sure icc/pgi/... are uniformly better than gcc. I > did an admittedly tiny study of this http://scalability.org/?p=470 some > time ago. What I found was the gcc really held its own. It did a very > good job on a very simple test case. > > Then again, the fortran version was simply faster than the C version, > but that can be explained ... by ... er ... ah ... something. I have heard that in many codes the choice of math library (fftw and atlas for example) makes a far, far greater difference in compiled application speed than choice of compiler. Can anyone comment on this? > > > "faster" system, you really have to look at your app, use strace, ltrace > > and gprof, then you can play with that. What I _am_ saying though is > > that Gentoo _does_ empower the administrator by giving him the ability > > to customize the OS if a bottleneck is to be identified. > > Yup. There is nothing like a profile of an app running the code, to see > where it is spending its time to decide between code shifts and > algorithmic shifts. > > >> > >>> (gcc-4.3.1, or ICC ;) ) _integrated_ into your system (not a hackish > >> > >> Er... We often use several different compilers in several different > >> trees. Several gccs, pgi, icc, eieio ... you name it. All are > >> integrated. > > Are-you currently able to run GCC-4.3.x versions on your current setup, > > Currently running 4.2.3-2ubuntu7 on my laptop. Other machines > (development box) has something like 4 different gccs there. I haven't > tried 4.3.x yet ... had planned to, but work gets in the way. Speaking of gcc 4.3 and math libraries, has anyone had issues with acml 4.1.0? I had some undefined reference errors when I tried. 
I'm not a programmer, and I got my code to compile using a different set of math libraries, so I haven't wrestled with this much. The undefined references are given below and more details can be found in my post at the OpenSUSE forum: http://forums.opensuse.org/programming-scripting/390838-opensuse-11-0-acml-4-1-0-a.html
dgemv.F:(.text+0x3fd): undefined reference to `_gfortran_allocate64'
dgemv.F:(.text+0x451): undefined reference to `_gfortran_internal_free'
dgemv.F:(.text+0x516): undefined reference to `_gfortran_deallocate'
> > > I'm actually eager to know. I'm still living under the ASSumption od > > binary distributions not coping too well with multi-library > > environments. Point in case, one of my colleagues _really_ wanted > > No, our systems (Ubuntu, SuSE, Centos) seem to have no real problems > apart from the occasional broken hard wired /usr/lib with the wrong ABI > in a configure/make file. Usually easy to fix. > > > firefox 3 on his ubuntu system. The installer trickled down to having to > > uninstall glibc...and he forced it to YES (and this is just a browser, > > not something that is used to _make_ code and would be tied to glibc) > > Hmmm... I have firefox 3 on this system (64 bit) and I run icecat for 32 > bit access (java and other things). No glibc changes (apart from > security patches). He must have done something horribly wrong. We have > multiple mixed ABI ubuntu/centos/suse systems, and haven't had issues. I am starting to really appreciate the OpenSUSE community repositories. Lots of stuff is prebuilt and known to work. YaST makes it easy to see what the dependencies are as well. I'm currently running OpenSUSE 11.0 (but with KDE 3.5.9 - the 4.x is a bit too bleeding edge for me). > > >> > >>> afterthought of an RPM that pulls in a new glibc that breaks the install > >> > >> Er ... not the slightest clue as to what you are talking about. I > >> haven't seen gcc, icc, pgi, ... touch our glibc. > >> > >> Maybe I am missing the fun.
Which ICC version is this? Which gcc is > >> this, which glibc is this? > >> > > Sorry about that I might have been misleading, GCC is generally the one > > most sensitive to glibc, not the other ones although the latest ICC > > (10.1.x series) do claim compatibility with the GNU environment so it > > might get a little more dependency there. > > We have installed the 10.1.015 on customer machines from Centos 5.2 > through SuSE 10.x through Ubuntu with nary a problem. Very different > glibc's. No issues with code generation. > > Binary distributions aren't evil. They do work, quite well in most cases. > > > > > > Cheers! > > > > Eric > > > -- > Joseph Landman, Ph.D > Founder and CEO > Scalable Informatics LLC, > email: landman at scalableinformatics.com > web : http://www.scalableinformatics.com > http://jackrabbit.scalableinformatics.com > phone: +1 734 786 8423 > fax : +1 866 888 3112 > cell : +1 734 612 4615 > Mark Kosmowski From gerry.creager at tamu.edu Thu Aug 7 04:57:58 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Thu, 07 Aug 2008 06:57:58 -0500 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <489ADBCD.9090307@scalableinformatics.com> References: <382428251.47851218066512841.JavaMail.root@mail.vpac.org> <489A75B2.2050009@tamu.edu> <489ADBCD.9090307@scalableinformatics.com> Message-ID: <489AE346.9050902@tamu.edu> Joe Landman wrote: > Gerry Creager wrote: > >> Speaking of jumbo frames, I'm seeing a problem on a Broadcom 57xx >> chipset on CentOS 4.3, 2.6.9-67 kernel (yeah, I know) and a tg3 >> driver. I can't make the thing recognize the ability to use jumbo >> frames. Anyone got a fix? > > Had a very similar issue some time ago with tg3 on Broadcom chipset NICs > (57xx). The only solution we could find was to not use jumbo frames. > The preventative measure we take as a result of this is also to > recommend using Intel NICs (and a few others) over Broadcom whenever > possible. I was afraid of that. 
The tg3 driver and the specific chipset both claim it should work. I've not looked at driver source, and I'm thinking of downloading Broadcom's proprietary driver for linux. I can't add Intel to this box: physically outta room (2u chassis, nVidia 8800 for remote graphics eats up the physical riser real estate). -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From kyron at neuralbs.com Thu Aug 7 07:48:53 2008 From: kyron at neuralbs.com (Eric Thibodeau) Date: Thu, 07 Aug 2008 10:48:53 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <489A576D.4000005@scalableinformatics.com> References: <627174098.322311217331658556.JavaMail.root@zimbra.vpac.org> <489A4577.5040508@neuralbs.com> <489A4A3D.1070000@scalableinformatics.com> <489A50D6.4060302@neuralbs.com> <489A576D.4000005@scalableinformatics.com> Message-ID: <489B0B55.7030502@neuralbs.com> Joe Landman wrote: > Eric Thibodeau wrote: > >>> Advantage of modules is you can upgrade them without upgrading the >>> kernel. Go ahead, build in that e1000 driver. I dare yah... :( >> Ok...I didn't put enought emphasis on "main" stuff....as in, _all you >> need to get the system booted, which essentially means HDD chipset >> drivers, the rest I do build as a module (NIC, video and such). >>> >>> More to the point it does give some good flexibility for end users >>> with a need to keep the core "separate" from the drivers for >>> maintenance. >>> >>> Initrd is subtle and quick to anger. One must use burnt offerings >>> to placate the spirits of initrd. >> LOL! > > ... now I don't mean hardware burnt offerings ... smoke rising from > your motherboard may not placate the spirits of initrd, they > definitely may impede further operations ... 
Oh...you mean something like this: http://wiki.neuralbs.com/~kyron/WrongSpecs/dsc00883.jpg > >>> >>> Well, it would be a heck of a lot nicer if the tools were a little >>> more forgiving ... Oh you don't have this driver in your initrd ... >>> ok ... PANIC (mwahahahaha) >> Pahahahahah... Point in case, I am building a CD-only cluster system >> (based on Gentoo) and I am currently _NOT_ using initrd because all >> that really needs to be built in is NFSroot support an all NICs I >> care to put in. Obviously this is a deprecated approach but it's >> proven to be the most effective and easy to maintain in my case. > > We build an integrated NFSroot and e1000 and a few other things for a > customer. Fixed hardware for their cluster. From bare-metal-off to > operational infiniband compute node in ~45-60 seconds (I say 45, but a > few things took a little longer to start, like SGE). Hey, weren't you the one complaining about e1000 "Go ahead, build in that e1000 driver. I dare yah"? I haven't seen "moving hardware"...oh, wait, yes I have, our cluster is on wheels (dig a little and you'll see it)! How many nodes? > [...snip...] >>> Most of the arguments I have heard are "oh but its compiled with >>> -O3" or whatever. Any decent HPC code person will tell you that that >>> is most definitely not a guaranteed way to a faster system ... >> Hey...as I stated above, one would have to be quite silly to claim >> -O3 as the all well and all good optimization solution. At least you >> can rest assured your solutions will add up correctly with GCC. To get a > Well, sometimes. You still need to be careful with it. > > This said, I am not sure icc/pgi/... are uniformly better than gcc. I > did an admittedly tiny study of this http://scalability.org/?p=470 > some time ago. What I found was the gcc really held its own. It did > a very good job on a very simple test case. 
This is worth a new thread ;) > > Then again, the fortran version was simply faster than the C version, > but that can be explained ... by ... er ... ah ... something. > >> "faster" system, you really have to look at your app, use strace, >> ltrace and gprof, then you can play with that. What I _am_ saying >> though is that Gentoo _does_ empower the administrator by giving him >> the ability to customize the OS if a bottleneck is to be identified. > > Yup. There is nothing like a profile of an app running the code, to > see where it is spending its time to decide between code shifts and > algorithmic shifts. > >>> >>>> (gcc-4.3.1, or ICC ;) ) _integrated_ into your system (not a hackish >>> >>> Er... We often use several different compilers in several different >>> trees. Several gccs, pgi, icc, eieio ... you name it. All are >>> integrated. >> Are-you currently able to run GCC-4.3.x versions on your current setup, > > Currently running 4.2.3-2ubuntu7 on my laptop. Other machines > (development box) has something like 4 different gccs there. I > haven't tried 4.3.x yet ... had planned to, but work gets in the way. Tell me when you get it going, it's for 4.3.x that I had to upgrade glibc. As a ref: http://bugs.gentoo.org/show_bug.cgi?id=218603 > >> I'm actually eager to know. I'm still living under the ASSumption od >> binary distributions not coping too well with multi-library >> environments. Point in case, one of my colleagues _really_ wanted > > No, our systems (Ubuntu, SuSE, Centos) seem to have no real problems > apart from the occasional broken hard wired /usr/lib with the wrong > ABI in a configure/make file. Usually easy to fix. Ok, those are the general problems I would hit and I had switched to Gentoo before starting to use SRPMs. > >> firefox 3 on his ubuntu system. 
The installer trickled down to having >> to uninstall glibc...and he forced it to YES (and this is just a >> browser, not something that is used to _make_ code and would be tied >> to glibc) > > Hmmm... I have firefox 3 on this system (64 bit) and I run icecat for > 32 bit access (java and other things). No glibc changes (apart from > security patches). He must have done something horribly wrong. We > have multiple mixed ABI ubuntu/centos/suse systems, and haven't had > issues. Curious...maybe he has an _old_ Ubuntu install...something like 6.0 series. > >>>> afterthought of an RPM that pulls in a new glibc that breaks the >>>> install >>> >>> Er ... not the slightest clue as to what you are talking about. I >>> haven't seen gcc, icc, pgi, ... touch our glibc. >>> >>> Maybe I am missing the fun. Which ICC version is this? Which gcc >>> is this, which glibc is this? >>> >> Sorry about that I might have been misleading, GCC is generally the >> one most sensitive to glibc, not the other ones although the latest >> ICC (10.1.x series) do claim compatibility with the GNU environment >> so it might get a little more dependency there. > > We have installed the 10.1.015 on customer machines from Centos 5.2 > through SuSE 10.x through Ubuntu with nary a problem. Very different > glibc's. No issues with code generation. I am sorry I mixed up glibc with GCC whilst talking about ICC's compatibility; this one is specific to gcc and icc on the same system and the (re)definition of atomic functions which ICC couldn't follow: http://bugs.gentoo.org/show_bug.cgi?id=201596 Never hit that? > > Binary distributions aren't evil. They do work, quite well in most > cases. I switched to Gentoo in 2004 and never looked back, though I should, because 4 years is a long time in the distribution world. I did switch my laptop users to Kubuntu but I still find the distribution annoys me. > >> >> Cheers!
>> >> Eric From peter.st.john at gmail.com Thu Aug 7 07:58:01 2008 From: peter.st.john at gmail.com (Peter St. John) Date: Thu, 7 Aug 2008 10:58:01 -0400 Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran In-Reply-To: <489A639B.8000500@ldeo.columbia.edu> References: <20080806215601.GA2375@nlxdcldnl2.cl.intel.com> <489A639B.8000500@ldeo.columbia.edu> Message-ID: Maybe in the 32-bit compile, a value is stored in a 64-bit register, and when it gets "robbed" (to populate the missing value for an adjacent variable) the 32 bits of backfill are taken, so the remaining value is good; but in a 64-bit compile, all 64 bits are taken so the remainder is rubbish. It would depend on both the compiler and the hardware, and the takeaway is to not do that :-) Peter On 8/6/08, Gus Correa wrote: > > Hi Ricardo, David, Mark, and list > > If as Ricardo says, he suppressed the 5th parameter ("use_work") on the > call > to rfftwnd_f77_mpi, which has 6 parameters, wouldn't it start mismatching > pointers > on the 5th parameter, instead of on the 2nd parameter ("n_fields")? > I.e. "use_work" would take the value of "FFTW_NORMAL_ORDER", > and "FFTW_NORMAL_ORDER" would get a random value (OS permitting), > but the initial 4 parameters would be correct, right? > In any case, there is little difference between this and what David said, > the point of failure is different, the nature is the same. > > However, it is interesting that somehow > at runtime the program segfaults in 64-bits, but doesn't fail in 32-bits, > although it most likely computes wrong stuff. > Ricardo have you ever QCd' the 32-bit output before you fixed/inserted > "use_work"? > If you were in a big lucky strike the random value left on the > FFTW_NORMAL_ORDER > address matched your needs, and the result may be correct! :) > > Anyway, somehow the program seems to behave differently, > with the OS superego being more compliant (in a nasty sense) in 32-bits > than it is in 64-bits.
> Does the OS paradoxically give less memory room for the stack in 64-bits, > leading to the segfault? > Or does it give the same room, but because the pointers are bigger the > segfault is more likely? > Or does the segfault happen somewhere else, not on the stack? > Where? > Why in 64-bits? > Why not in 32 bits? > > Yes, as David noted about programming, here I also got and continue to get > these bugs, > particularly in Fortran programs where no parameter checking is enforced. > And the nastier ones are those that don't segfault, > then come back to haunt you when somebody looks at the output, > if you are not careful enough to look at it before anybody else does. > > Cheers, > Gus Correa > > Compilar e' preciso, > rodar e' impreciso! > > ... mais uma do vosso alter-ego P'ssoa ... :) > > -- > --------------------------------------------------------------------- > Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu > Lamont-Doherty Earth Observatory - Columbia University > P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA > --------------------------------------------------------------------- > > > Lombard, David N wrote: > > On Tue, Aug 05, 2008 at 02:57:42AM -0700, Ricardo Reis wrote: >> >> >>> On Mon, 4 Aug 2008, Mark Kosmowski wrote: >>> >>> >>> >>>> So, why did the 32-bit test case work? Shouldn't the same problem >>>> crash both systems if it is a code issue? >>>> >>>> >>> >> Not necessarily given the error described below. >> >> >> >>> I asked the same question myself... The function interface is: >>> >>> call rfftwnd_f77_mpi(plan_c2r, & >>> 1, local_data, work, use_work, FFTW_NORMAL_ORDER) >>> >>> where use_work is an integer, value 1 if you use the work temporary >>> array, 0 otherwise. This was the variable I wasn't passing. >>> >>> >> ... 
>> >> >>> The wrapper function for this is (from rfftw_f77_mpi.c): >>> >>> void F77_FUNC_(rfftwnd_f77_mpi,RFFTWND_F77_MPI) >>> (rfftwnd_mpi_plan *p, int *n_fields, fftw_real *local_data, >>> fftw_real *work, int *use_work, int *ioutput_order) >>> >>> >> >> >> >>> .... So it must be a pointer issue revealed by the 64 bit, no? When I >>> wasn't doing it "properly" the value of *ioutput_order wasn't set. >>> >>> >> >> The value of the first element of local_data was used for the n_fields >> scalar. >> >> The work array was being laid down starting at the location of the >> use_work scalar. >> >> The FFTW_NORMAL_ORDER value was being interpreted as use_work scalar. >> >> Finally, ioutput_order scalar was some random value. >> >> So, a lot was going wrong there. It's just one of life's little, um, >> pleasures >> that it looked like it was working for your 32-bit test case. Don't >> worry, you'll >> likely do this again, as likely *every* one of us on this list has, too. >> >> BTW, Fortran passes by reference; that's why all args are pointers. >> >> >> > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From eagles051387 at gmail.com Thu Aug 7 08:00:46 2008 From: eagles051387 at gmail.com (Jon Aquilina) Date: Thu, 7 Aug 2008 10:00:46 -0500 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <489B0B55.7030502@neuralbs.com> References: <627174098.322311217331658556.JavaMail.root@zimbra.vpac.org> <489A4577.5040508@neuralbs.com> <489A4A3D.1070000@scalableinformatics.com> <489A50D6.4060302@neuralbs.com> <489A576D.4000005@scalableinformatics.com> <489B0B55.7030502@neuralbs.com> Message-ID: My 2 cents about SSDs, and I bet a lot of you would agree.
They are not worth the money yet for the amount of storage space you get. Yesterday at Fry's Electronics I saw a 1 TB HDD for 200 dollars. Why go for something where you get 32 GB or 64 GB max? On Thu, Aug 7, 2008 at 9:48 AM, Eric Thibodeau wrote: > Joe Landman wrote: > >> Eric Thibodeau wrote: >> >> Advantage of modules is you can upgrade them without upgrading the >>>> kernel. Go ahead, build in that e1000 driver. I dare yah... :( >>>> >>> Ok...I didn't put enought emphasis on "main" stuff....as in, _all you >>> need to get the system booted, which essentially means HDD chipset drivers, >>> the rest I do build as a module (NIC, video and such). >>> >>>> >>>> More to the point it does give some good flexibility for end users with >>>> a need to keep the core "separate" from the drivers for maintenance. >>>> >>>> Initrd is subtle and quick to anger. One must use burnt offerings to >>>> placate the spirits of initrd. >>>> >>> LOL! >>> >> >> ... now I don't mean hardware burnt offerings ... smoke rising from your >> motherboard may not placate the spirits of initrd, they definitely may >> impede further operations ... >> > Oh...you mean something like this: > http://wiki.neuralbs.com/~kyron/WrongSpecs/dsc00883.jpg > >> >> >>>> Well, it would be a heck of a lot nicer if the tools were a little more >>>> forgiving ... Oh you don't have this driver in your initrd ... ok ... PANIC >>>> (mwahahahaha) >>>> >>> Pahahahahah... Point in case, I am building a CD-only cluster system >>> (based on Gentoo) and I am currently _NOT_ using initrd because all that >>> really needs to be built in is NFSroot support an all NICs I care to put in. >>> Obviously this is a deprecated approach but it's proven to be the most >>> effective and easy to maintain in my case. >>> >> >> We build an integrated NFSroot and e1000 and a few other things for a >> customer. Fixed hardware for their cluster.
From bare-metal-off to >> operational infiniband compute node in ~45-60 seconds (I say 45, but a few >> things took a little longer to start, like SGE). >> > Hey, weren't you the one complaining about e1000 "Go ahead, build in that > e1000 driver. I dare yah"? I haven't seen "moving hardware"...oh, wait, yes > I have, our cluster is on wheels (dig a little and you'll see it)! How many > nodes? > >> [...snip...] >> >>> Most of the arguments I have heard are "oh but its compiled with -O3" or >>>> whatever. Any decent HPC code person will tell you that that is most >>>> definitely not a guaranteed way to a faster system ... >>>> >>> Hey...as I stated above, one would have to be quite silly to claim -O3 as >>> the all well and all good optimization solution. At least you can rest >>> assured your solutions will add up correctly with GCC. To get a >>> >> Well, sometimes. You still need to be careful with it. >> >> This said, I am not sure icc/pgi/... are uniformly better than gcc. I did >> an admittedly tiny study of this http://scalability.org/?p=470 some time >> ago. What I found was the gcc really held its own. It did a very good job >> on a very simple test case. >> > This is worth a new thread ;) > >> >> Then again, the fortran version was simply faster than the C version, but >> that can be explained ... by ... er ... ah ... something. >> >> "faster" system, you really have to look at your app, use strace, ltrace >>> and gprof, then you can play with that. What I _am_ saying though is that >>> Gentoo _does_ empower the administrator by giving him the ability to >>> customize the OS if a bottleneck is to be identified. >>> >> >> Yup. There is nothing like a profile of an app running the code, to see >> where it is spending its time to decide between code shifts and algorithmic >> shifts. >> >> >>>> (gcc-4.3.1, or ICC ;) ) _integrated_ into your system (not a hackish >>>>> >>>> >>>> Er... We often use several different compilers in several different >>>> trees. 
Several gccs, pgi, icc, eieio ... you name it. All are integrated. >>>> >>> Are-you currently able to run GCC-4.3.x versions on your current setup, >>> >> >> Currently running 4.2.3-2ubuntu7 on my laptop. Other machines >> (development box) has something like 4 different gccs there. I haven't >> tried 4.3.x yet ... had planned to, but work gets in the way. >> > Tell me when you get it going, it's for 4.3.x that I had to upgrade glibc. > As a ref: http://bugs.gentoo.org/show_bug.cgi?id=218603 > >> >> I'm actually eager to know. I'm still living under the ASSumption od >>> binary distributions not coping too well with multi-library environments. >>> Point in case, one of my colleagues _really_ wanted >>> >> >> No, our systems (Ubuntu, SuSE, Centos) seem to have no real problems apart >> from the occasional broken hard wired /usr/lib with the wrong ABI in a >> configure/make file. Usually easy to fix. >> > Ok, those are the general problems I would hit and I had switched to Gentoo > before starting to use SRPMs. > >> >> firefox 3 on his ubuntu system. The installer trickled down to having to >>> uninstall glibc...and he forced it to YES (and this is just a browser, not >>> something that is used to _make_ code and would be tied to glibc) >>> >> >> Hmmm... I have firefox 3 on this system (64 bit) and I run icecat for 32 >> bit access (java and other things). No glibc changes (apart from security >> patches). He must have done something horribly wrong. We have multiple >> mixed ABI ubuntu/centos/suse systems, and haven't had issues. >> > Curious...maybe he has an _old_ Ubuntu install...something like 6.0 series. > > >> >> afterthought of an RPM that pulls in a new glibc that breaks the install >>>>> >>>>> >>>> >>>> Er ... not the slightest clue as to what you are talking about. I >>>> haven't seen gcc, icc, pgi, ... touch our glibc. >>>> >>>> Maybe I am missing the fun. Which ICC version is this? Which gcc is >>>> this, which glibc is this? 
>>>> >>>> Sorry about that I might have been misleading, GCC is generally the one >>> most sensitive to glibc, not the other ones although the latest ICC (10.1.x >>> series) do claim compatibility with the GNU environment so it might get a >>> little more dependency there. >>> >> >> We have installed the 10.1.015 on customer machines from Centos 5.2 >> through SuSE 10.x through Ubuntu with nary a problem. Very different >> glibc's. No issues with code generation. >> > I am sorry I mixed up glibc with GCC whilst talking about ICC's > compatibility, this one is specific to gcc and icc on the same system and > the (re)definition of atomic functions which ICC couldn't follow > > http://bugs.gentoo.org/show_bug.cgi?id=201596 > > Never hit that? > >> >> Binary distributions aren't evil. They do work, quite well in most cases. >> > I switched to Gentoo in 2004 and never looked back, and I should because > 4years is a long time in the distribution world. I did switch my laptop > users to Kubuntu but I still find the distribution annoys me. > >> >> >>> Cheers! >>> >>> Eric >>> >> > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Jonathan Aquilina -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From landman at scalableinformatics.com Thu Aug 7 08:14:45 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 07 Aug 2008 11:14:45 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <489B0B55.7030502@neuralbs.com> References: <627174098.322311217331658556.JavaMail.root@zimbra.vpac.org> <489A4577.5040508@neuralbs.com> <489A4A3D.1070000@scalableinformatics.com> <489A50D6.4060302@neuralbs.com> <489A576D.4000005@scalableinformatics.com> <489B0B55.7030502@neuralbs.com> Message-ID: <489B1165.3070000@scalableinformatics.com> Eric Thibodeau wrote: > Joe Landman wrote: >> ... now I don't mean hardware burnt offerings ... smoke rising from >> your motherboard may not placate the spirits of initrd, they >> definitely may impede further operations ... > Oh...you mean something like this: > http://wiki.neuralbs.com/~kyron/WrongSpecs/dsc00883.jpg Owie ... [...] >> We build an integrated NFSroot and e1000 and a few other things for a >> customer. Fixed hardware for their cluster. From bare-metal-off to >> operational infiniband compute node in ~45-60 seconds (I say 45, but a >> few things took a little longer to start, like SGE). > Hey, weren't you the one complaining about e1000 "Go ahead, build in > that e1000 driver. I dare yah"? I haven't seen "moving hardware"...oh, Yeah. Thats where it came from. We had to get the internal e1000 up, but then we needed to upgrade ... [pause] D'oh! [...] >> Currently running 4.2.3-2ubuntu7 on my laptop. Other machines >> (development box) has something like 4 different gccs there. I >> haven't tried 4.3.x yet ... had planned to, but work gets in the way. > Tell me when you get it going, it's for 4.3.x that I had to upgrade > glibc. As a ref: http://bugs.gentoo.org/show_bug.cgi?id=218603 Hmmm [...] >> We have installed the 10.1.015 on customer machines from Centos 5.2 >> through SuSE 10.x through Ubuntu with nary a problem. Very different >> glibc's. No issues with code generation. 
> I am sorry I mixed up glibc with GCC whilst talking about ICC's > compatibility, this one is specific to gcc and icc on the same system > and the (re)definition of atomic functions which ICC couldn't follow > > http://bugs.gentoo.org/show_bug.cgi?id=201596 > > Never hit that? Looked and no. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From john.hearns at streamline-computing.com Thu Aug 7 08:35:40 2008 From: john.hearns at streamline-computing.com (John Hearns) Date: Thu, 07 Aug 2008 16:35:40 +0100 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: References: <627174098.322311217331658556.JavaMail.root@zimbra.vpac.org> <489A4577.5040508@neuralbs.com> <489A4A3D.1070000@scalableinformatics.com> <489A50D6.4060302@neuralbs.com> <489A576D.4000005@scalableinformatics.com> <489B0B55.7030502@neuralbs.com> Message-ID: <1218123351.5309.60.camel@Vigor13> On Thu, 2008-08-07 at 10:00 -0500, Jon Aquilina wrote: > my 2 cents bout ssd and i bet alot of you would agree. they are not > worth the money yet for the amount of storage space that you are > getting. i have seen at fry's electronics yesterday 1tb hdd for 200 > dollars? why go for something that u get 32gb or 64gb max Because, in (almost) the words of Soho clip joint doormen: "Naked - and they don't move" From kyron at neuralbs.com Thu Aug 7 10:02:58 2008 From: kyron at neuralbs.com (Eric Thibodeau) Date: Thu, 07 Aug 2008 13:02:58 -0400 Subject: [Beowulf] Fastest way to compute Euclediant distance [spin off from: Building new cluster - estimate] Message-ID: <489B2AC2.60300@neuralbs.com> > >>> Most of the arguments I have heard are "oh but its compiled with >>> -O3" or whatever. Any decent HPC code person will tell you that that >>> is most definitely not a guaranteed way to a faster system ... 
>> Hey...as I stated above, one would have to be quite silly to claim >> -O3 as the all well and all good optimization solution. At least you >> can rest assured your solutions will add up correctly with GCC. To get a > Well, sometimes. You still need to be careful with it. > > This said, I am not sure icc/pgi/... are uniformly better than gcc. I > did an admittedly tiny study of this http://scalability.org/?p=470 > some time ago. What I found was the gcc really held its own. It did > a very good job on a very simple test case. Very nice post, thanks for that. It so happens I am going through the exact same steps, trying to optimize a very simple piece of code computing the Euclidean distance, and I was a little stumped to find that the simple C code outperforms BLAS (both GOTO and MKL). If you have gnuplot, a BLAS library with cblas interface, and icc installed, all you have to do is run `make` with the three attached files in the same dir and you'll get nice plots of what's going on. I'm also attaching an example run with: icc 10.1.017 gcc 4.3.1 GOTO BLAS 1.24 Eric PS: regular disclaimers about crappy code writing apply ;) -------------- next part -------------- A non-text attachment was scrubbed... Name: EuclideanDist.c Type: text/x-csrc Size: 3596 bytes Desc: not available URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: Makefile URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: Plot.gp URL: From perry at piermont.com Thu Aug 7 12:45:20 2008 From: perry at piermont.com (Perry E. 
Metzger) Date: Thu, 07 Aug 2008 15:45:20 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <1218123351.5309.60.camel@Vigor13> (John Hearns's message of "Thu\, 07 Aug 2008 16\:35\:40 +0100") References: <627174098.322311217331658556.JavaMail.root@zimbra.vpac.org> <489A4577.5040508@neuralbs.com> <489A4A3D.1070000@scalableinformatics.com> <489A50D6.4060302@neuralbs.com> <489A576D.4000005@scalableinformatics.com> <489B0B55.7030502@neuralbs.com> <1218123351.5309.60.camel@Vigor13> Message-ID: <87sktgpq27.fsf@snark.cb.piermont.com> John Hearns writes: > On Thu, 2008-08-07 at 10:00 -0500, Jon Aquilina wrote: >> my 2 cents bout ssd and i bet alot of you would agree. they are not >> worth the money yet for the amount of storage space that you are >> getting. i have seen at fry's electronics yesterday 1tb hdd for 200 >> dollars? why go for something that u get 32gb or 64gb max > > Because, in (almost) the words of Soho clip joint doormen: > "Naked - and they don't move" The name of the game is price/performance. If you're paying for power, a cluster is garbage after a few years anyway because the power costs alone justify buying new machines. If you're using boxes only for 2 or 3 years, the MTBF of hard drives is low enough that you're better off with el cheapo $30/80GB hard drives and a few extras to keep on the shelf and throw in when the ones in use break, rather than an expensive flash based SSD. There may be rare exceptions where the low latency seeks of the SSD provide enough extra performance to justify the cost, but they're not common in these sorts of apps. (Not common does not mean "do not exist".) 
Perry From matt at technoronin.com Thu Aug 7 14:03:53 2008 From: matt at technoronin.com (Matt Lawrence) Date: Thu, 7 Aug 2008 16:03:53 -0500 (CDT) Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <489A4A3D.1070000@scalableinformatics.com> References: <627174098.322311217331658556.JavaMail.root@zimbra.vpac.org> <489A4577.5040508@neuralbs.com> <489A4A3D.1070000@scalableinformatics.com> Message-ID: On Wed, 6 Aug 2008, Joe Landman wrote: > Advantage of modules is you can upgrade them without upgrading the kernel. > Go ahead, build in that e1000 driver. I dare yah... :( > > More to the point it does give some good flexibility for end users with a > need to keep the core "separate" from the drivers for maintenance. > > Initrd is subtle and quick to anger. One must use burnt offerings to placate > the spirits of initrd. Ok, I am trying to follow your advice. However, "make rpm" does not generate a package that includes initrd or updates to /etc/grub.conf. I have started looking at how to make that work, but the 9K lines of spec file for the CentOS kernel rpm are rather daunting. Since y'all have obviously been dealing with these sorts of problems longer than I have, I could really use some help here. -- Matt It's not what I know that counts. It's what I can remember in time to use. From csamuel at vpac.org Thu Aug 7 17:13:44 2008 From: csamuel at vpac.org (Chris Samuel) Date: Fri, 8 Aug 2008 10:13:44 +1000 (EST) Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <489B0B55.7030502@neuralbs.com> Message-ID: <34938079.59421218154424268.JavaMail.root@mail.vpac.org> ----- "Eric Thibodeau" wrote: > Tell me when you get it going, it's for 4.3.x that > I had to upgrade glibc. We've got GCC 4.3 builds and a 4.4 snapshot on our AMD64 CentOS 5 cluster, no complaints that they don't work. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. 
Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From landman at scalableinformatics.com Thu Aug 7 18:25:07 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 07 Aug 2008 21:25:07 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <34938079.59421218154424268.JavaMail.root@mail.vpac.org> References: <34938079.59421218154424268.JavaMail.root@mail.vpac.org> Message-ID: <489BA073.30407@scalableinformatics.com> Chris Samuel wrote: > ----- "Eric Thibodeau" wrote: > >> Tell me when you get it going, it's for 4.3.x that >> I had to upgrade glibc. > > We've got GCC 4.3 builds and a 4.4 snapshot on our AMD64 > CentOS 5 cluster, no complaints that they don't work. I just tried building 4.3.1 on Centos 5.2 this afternoon. Hit a problem in the build. Will try it again tomorrow. Joe > > cheers, > Chris -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From landman at scalableinformatics.com Thu Aug 7 18:36:48 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 07 Aug 2008 21:36:48 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: References: <627174098.322311217331658556.JavaMail.root@zimbra.vpac.org> <489A4577.5040508@neuralbs.com> <489A4A3D.1070000@scalableinformatics.com> Message-ID: <489BA330.1020907@scalableinformatics.com> Matt Lawrence wrote: > On Wed, 6 Aug 2008, Joe Landman wrote: > >> Advantage of modules is you can upgrade them without upgrading the >> kernel. Go ahead, build in that e1000 driver. I dare yah... :( >> >> More to the point it does give some good flexibility for end users >> with a need to keep the core "separate" from the drivers for maintenance. >> >> Initrd is subtle and quick to anger. 
One must use burnt offerings to >> placate the spirits of initrd. > > Ok, I am trying to follow your advice. However, "make rpm" does not > generate a package that includes initrd or updates to /etc/grub.conf. I I update these by hand/script. The mkinitrd isn't too painful. I have it on a machine at the lab (which happens to be off now, and I am at home). Will put that out in the morning. The grub update is fairly simple. > have started looking at how to make that work, but the 9K lines of spec > file for the CentOS kernel rpm are rather daunting. Hmmm... I normally recommend avoiding their spec file unless you want to use only their kernel and do minor tweaks from there. This said, I really recommend using make binrpm-pkg to generate the kernel/modules RPM and SRPM. Then the grub update and the mkinitrd can be scripted. If you are daring, you can include those scripts in the %post sections of the generated spec file (the make binrpm-pkg will generate a spec file for you). > > Since y'all have obviously been dealing with these sorts of problems > longer than I have, I could really use some help here. I'll be in the office tomorrow with the machines, I can send you the exact mkinitrd line. Bug me offline if you have a baseline kernel you want to get started with and we can walk our way through this. Joe > > -- Matt > It's not what I know that counts. > It's what I can remember in time to use. 
> _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From matt at technoronin.com Thu Aug 7 20:52:55 2008 From: matt at technoronin.com (Matt Lawrence) Date: Thu, 7 Aug 2008 22:52:55 -0500 (CDT) Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <489BA330.1020907@scalableinformatics.com> References: <627174098.322311217331658556.JavaMail.root@zimbra.vpac.org> <489A4577.5040508@neuralbs.com> <489A4A3D.1070000@scalableinformatics.com> <489BA330.1020907@scalableinformatics.com> Message-ID: On Thu, 7 Aug 2008, Joe Landman wrote: > Hmmm... I normally recommend avoiding their spec file unless you want to use > only their kernel and do minor tweaks from there. > > This said, I really recommend using > > make binrpm-pkg > > to generate the kernel/modules RPM and SRPM. Then the grub update and the > mkinitrd can be scripted. If you are daring, you can include those scripts > in the %post sections of the generated spec file (the make binrpm-pkg will > generate a spec file for you). I really would like to have it all in one package. It's too easy to get things out of sync when doing changes by hand. I was figuring to find the appropriate sections of the CentOS spec file and use them. As always, other suggestions are welcome. -- Matt It's not what I know that counts. It's what I can remember in time to use. From peter.st.john at gmail.com Fri Aug 8 07:15:40 2008 From: peter.st.john at gmail.com (Peter St. 
John) Date: Fri, 8 Aug 2008 10:15:40 -0400 Subject: [Beowulf] computer Go Message-ID: The American Go Association (which has a free e-newsletter) at http://www.usgo.org/ reports that a machine won an exhibition game with a master last night at the US Go Congress. This isn't really historic; the master, Myungwan Kim, is an 8 dan professional, and gave 9 handicap stones to the machine. Very roughly, 8 dan pro would be comparable to 9 dan amateur; and very roughly, Kim would be able to give me 9 stones too (I'm 1 dan amateur and amateur handicaps equate one stone to one rank, and the mathematician Don Weiner 6d beats me easily at 6 stones, although I should be able to cope at 5). 9 stones is very roughly comparable to queen odds at chess, but the statistical distributions of Go and Chess are not the same; a Grandmaster of chess could maybe give me rook odds, not queen odds; knight odds is roughly comparable to two standard deviations, a rating difference of about 400 points, and I'm about 800 below the world champion (and I'm comparable in go and chess). But again speaking very roughly, this result is in the ballpark of achieving amateur 1 dan status, about the level that Ken Thompson achieved with Belle in the mid-80's (the first USCF Expert machine). Odds games in chess do not have the same probabilistic qualities as in Go; we almost never play odds games in chess anymore (it was popular for money in the 19th century) but can't get along without handicapping in Go, where games between quite disparate players can be made interesting. I haven't found specifics for the machine or the team yet, but to quote the article: 800 processors, at 4.7 Ghz, 15 Teraflops on borrowed supercomputers A related article said the machine(s) was sited in Europe. Peter 
From robl at mcs.anl.gov Fri Aug 8 07:41:38 2008 From: robl at mcs.anl.gov (Robert Latham) Date: Fri, 8 Aug 2008 09:41:38 -0500 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <87k5eu45ww.fsf@snark.cb.piermont.com> References: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> <87k5eu45ww.fsf@snark.cb.piermont.com> Message-ID: <20080808144137.GJ22642@mcs.anl.gov> On Wed, Aug 06, 2008 at 09:41:35AM -0400, Perry E. Metzger wrote: > > Matt Lawrence writes: > > Could be. Given the long and sordid history of NFS, I prefer to not > > use it whenever there are practical alternatives. > > NFS is a fine protocol and works very well. However, traditionally the > Linux implementation of NFS has been of less than perfect quality. You > shouldn't confuse NFS with NFS on Linux. It is exceedingly difficult to perform correct parallel I/O on NFS, let alone achieve high performance. NFSv4 promises to make this situation better. ==rob -- Rob Latham Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF Argonne National Lab, IL USA B29D F333 664A 4280 315B From henning.fehrmann at aei.mpg.de Fri Aug 8 08:37:13 2008 From: henning.fehrmann at aei.mpg.de (Henning Fehrmann) Date: Fri, 8 Aug 2008 17:37:13 +0200 Subject: [Beowulf] copying big files Message-ID: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> Hi everybody, Copying a big file onto all nodes in a cluster is a rather common problem. I would have thought that there might be a standard tool for distributing the files in an efficient way. So far, I haven't found one. Assume one has a network design which allows non-blocking full-duplex wire-speed connections between N/2 pairs of nodes, where N is the number of nodes in the cluster. It is basically a non-blocking core switch. In this case the following scheme would be convenient and rather simple: The file is placed on node n1 and one builds a chain of nodes n1, n2, ..., nN.
One splits the file into many packets (p1..pM); let's say a fragment fits into one TCP packet. In the first step n1 transmits the packet p1 to node n2. In the second step n1 transmits the packet p2 to n2 and n2 transmits p1 to node n3. The transmission of a single packet is fast. The time of passing a particular packet through the whole chain of nodes is short compared with the time of the entire copying process. E.g., using jumbo frames a packet can have the size of ca 10kB. In a Gb network the transmission time of a single packet between nodes is of the order of 0.1 ms. Even in a cluster with 1024 nodes it takes in an ideal case just 0.1s to pass a packet from node n1 through all nodes to n1024. On each node the packet is stored and, in the end, one reassembles the file. For big files (size >> 10Mb) the required time is approximately the same as one needs for copying the file between two nodes plus 0.1s. One needs basically a daemon which handles copying requests and establishes the connection to the next node in the chain. Has somebody written such a tool? Cheers, Henning Fehrmann From jan.heichler at gmx.net Fri Aug 8 08:52:40 2008 From: jan.heichler at gmx.net (Jan Heichler) Date: Fri, 8 Aug 2008 17:52:40 +0200 Subject: [Beowulf] copying big files In-Reply-To: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> Message-ID: <12010303044.20080808175240@gmx.net> Hello Henning, on Friday, 8 August 2008, you wrote: HF> Hi everybody, HF> One needs basically a daemon which handles copying requests and establishes HF> the connection to next node in the chain. Why a daemon? Just use MPI, which starts up the processes on the remote nodes during program startup. The advantage is that you can use any high-speed interconnect for which you have an MPI. HF> Has somebody written such a tool? I wrote a tool in Java in 2004 or 2005 during an internship for IBM at the IPK in Gatersleben that implemented your strategy. 
Worked well but had some flaws since I was using some Java-specific remote procedure calls and daemons. Implementing it with C/C++ and MPI can't take more than a couple of hours and should be easy to do. You could use tar and pipe the datastream to copy whole directories without worrying too much about file attributes. I always wanted to implement it myself but I can't find time with my current job :-( Could be a good project for a student to do. Regards, Jan From landman at scalableinformatics.com Fri Aug 8 08:59:24 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 08 Aug 2008 11:59:24 -0400 Subject: [Beowulf] copying big files In-Reply-To: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> Message-ID: <489C6D5C.4020507@scalableinformatics.com> Henning Fehrmann wrote: > Hi everybody, > > Copying a big file onto all nodes in a cluster is a rather common problem. > I would have thought that there might be a standard tool for > distributing the files in an efficient way. So far, I haven't found one. > > Assuming one has a network design which allows non blocking full duplex > wire-speed connections between N/2 pairs of nodes where N is the number > of nodes in the cluster. It is basically a non blocking coreswitch. > > In this case the following scheme would be convenient and rather simple: > > The file is placed on node n1 and one builds a chain of nodes n1 , n2 .... nN. > > One splits the file into many packets (p1..pM), let's say a fragment fits > into one TCP packet. In the first step n1 transmits the packet p1 to node n2. > In the second step n1 transmits the packet p2 to n2 and n2 transmits p1 to node n3. Someone has implemented this bucket brigade model for data transfer. 
It's not the only one available, as each NIC has two neighbors to communicate with, and thus winds up at effectively 1/2 the bandwidth, or a serialization of the packets. Not that this is a bad thing, but for big file distribution, this could be a problem. > > The transmission of a single packet is fast. The time of passing a particular > packet through the whole chain of nodes is short compared with time of the > entire copying process. E.g., using jumbo frames a packet can have the size of ca 10kB. > In Gb network the transmission time of a single packet between nodes is > of the order of 0.1 ms. Even in a cluster with 1024 nodes it takes > in an ideal case just 0.1s to pass a packet from node n1 through all nodes to n1024. > > On each node the packet is stored and, in the end, one reassembles the file. > For big files (size >> 10Mb) the required time is approximately > the same as one needs for copying the file between two nodes plus 0.1s. > > One needs basically a daemon which handles copying requests and establishes > the connection to next node in the chain. > > Has somebody written such a tool? I saw something like this several years ago. We were working on a different type of tool that exploited the fact that you have N/2 pairs, and tried to maximize the flow to these N/2 pairs. It included error correction and a few other nice things (multi-sourcing was on the roadmap). Never could find interested customers/users for it, so it fell off the radar. We called it xcp, and you used it as xcp [set of files] cluster://name/path/to/deposit/files/into and it handled it all for you. Prior to that, we had a system that used multicast, but after seeing what this did to other traffic on the gigabit switches, we went away from that. That was mcp, and was dated around 2000-ish or so. You can use bittorrent to do something approximately like xcp though at lower performance. 
Joe > > Cheers, > Henning Fehrmann -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From mathog at caltech.edu Fri Aug 8 09:11:46 2008 From: mathog at caltech.edu (David Mathog) Date: Fri, 08 Aug 2008 09:11:46 -0700 Subject: [Beowulf] copying big files (Henning Fehrmann) Message-ID: Henning Fehrmann wrote: > Copying a big file onto all nodes in a cluster is a rather > common problem. I would have thought that there might be a > standard tool for distributing the files in an efficient way. > So far, I haven't found one. This is what I use: http://saf.bio.caltech.edu/nettee.html The production version is pretty much what you described. The development version is more flexible, allowing processing on each data chunk, and data flow in either direction along the chain. The biggest problem with chain methods is that it is difficult to recover if something breaks in the middle during the transfer. My cluster is only 20 nodes and it has not been an issue, but on a 2000 node cluster it probably would be. It is of course also important that all of the nodes in the distribution chain have sufficient free network and CPU resources. If there are any slow nodes the whole chain will be slow, since the slowest node is rate limiting. 
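For a rough feel for the numbers in this thread, a chain (bucket brigade) copy can be modelled as a store-and-forward pipeline. The sketch below is only a back-of-the-envelope model, not how nettee or any other tool mentioned here works: the function name `chain_copy_seconds`, the 10 kB chunk size, the 1 GB file and the per-hop rate list are all assumptions made for illustration, and it ignores protocol overhead, disk speed and switch contention.

```python
import math


def chain_copy_seconds(file_bytes, hop_rates_bps, chunk_bytes=10_000):
    """Estimate the time to push a file down a chain of nodes.

    hop_rates_bps holds one link rate (bits per second) per hop, so a
    chain of N nodes has N - 1 entries. Every node stores a chunk and
    forwards it while it is receiving the next one.
    """
    n_chunks = math.ceil(file_bytes / chunk_bytes)
    # Seconds needed to move one chunk across each hop.
    per_hop = [chunk_bytes * 8 / rate for rate in hop_rates_bps]
    # The first chunk has to cross every hop; after that, one chunk
    # reaches the last node per "tick" of the slowest hop.
    return sum(per_hop) + (n_chunks - 1) * max(per_hop)


GBIT = 1e9
file_bytes = 10**9                       # a 1 GB file (illustrative)
uniform = [GBIT] * 1023                  # 1024 nodes, every hop gigabit
one_slow = [GBIT] * 1022 + [GBIT / 10]   # same chain, one 100 Mb/s hop

two_node_time = chain_copy_seconds(file_bytes, [GBIT])
chain_time = chain_copy_seconds(file_bytes, uniform)
slow_time = chain_copy_seconds(file_bytes, one_slow)

print(f"two nodes:       {two_node_time:.2f} s")
print(f"1024-node chain: {chain_time:.2f} s")
print(f"with slow hop:   {slow_time:.2f} s")
```

Under these assumptions the 1024-node chain costs less than a tenth of a second more than a plain two-node copy, while a single 100 Mb/s hop makes the whole transfer roughly ten times slower, which is the rate-limiting effect described above.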
Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From i.kozin at dl.ac.uk Fri Aug 8 09:21:01 2008 From: i.kozin at dl.ac.uk (Kozin, I (Igor)) Date: Fri, 8 Aug 2008 17:21:01 +0100 Subject: [Beowulf] copying big files In-Reply-To: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> Message-ID: > Has somebody written such a tool? eXludus have a product called Replicator http://www.exludus.com/ We have benchmarked an early version of it and it was pretty good at copying large files from an NFS server over gigabit to compute nodes. From mark.kosmowski at gmail.com Fri Aug 8 09:53:43 2008 From: mark.kosmowski at gmail.com (Mark Kosmowski) Date: Fri, 8 Aug 2008 12:53:43 -0400 Subject: [Beowulf] gcc 4.3 and acml 4.1.0 do not play together Message-ID: After a bit of struggling I found a post at the AMD Developer Forum stating that gcc 4.3 and acml 4.1.0 are not compatible. I've been reading a bunch of folks here using gcc 4.3 and just wanted to make everyone aware, hopefully prevent some time lost to futility. Have a great weekend! Mark E. Kosmowski From perry at piermont.com Fri Aug 8 09:54:23 2008 From: perry at piermont.com (Perry E. Metzger) Date: Fri, 08 Aug 2008 12:54:23 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <20080808144137.GJ22642@mcs.anl.gov> (Robert Latham's message of "Fri\, 8 Aug 2008 09\:41\:38 -0500") References: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> <87k5eu45ww.fsf@snark.cb.piermont.com> <20080808144137.GJ22642@mcs.anl.gov> Message-ID: <87abfnphvk.fsf@snark.cb.piermont.com> Robert Latham writes: > On Wed, Aug 06, 2008 at 09:41:35AM -0400, Perry E. Metzger wrote: >> >> Matt Lawrence writes: >> > Could be. Given the long and sordid history of NFS, I prefer to not >> > use it whenever there are practical alternatives. >> >> NFS is a fine protocol and works very well. 
However, traditionally the >> Linux implementation of NFS has been of less than perfect quality. You >> shouldn't confuse NFS with NFS on Linux. > > It is exceedingly difficult to perform correct parallel I/O on NFS, > let alone achieve high performance. Not really true. There are whole firms that base their business on doing it well, like Netapp. > NFSv4 promises to make this situation better. It promises to improve all sorts of things in NFS. -- Perry E. Metzger perry at piermont.com From perry at piermont.com Fri Aug 8 09:57:16 2008 From: perry at piermont.com (Perry E. Metzger) Date: Fri, 08 Aug 2008 12:57:16 -0400 Subject: [Beowulf] copying big files In-Reply-To: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> (Henning Fehrmann's message of "Fri\, 8 Aug 2008 17\:37\:13 +0200") References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> Message-ID: <8763qbphqr.fsf@snark.cb.piermont.com> Henning Fehrmann writes: > Copying a big file onto all nodes in a cluster is a rather common problem. > I would have thought that there might be a standard tool for > distributing the files in an efficient way. So far, I haven't found one. bittorrent works quite well, and is trivial to use. It will use all your IO bandwidth between all nodes (which is what you want I presume.) Perry -- Perry E. Metzger perry at piermont.com From perry at piermont.com Fri Aug 8 09:58:07 2008 From: perry at piermont.com (Perry E. Metzger) Date: Fri, 08 Aug 2008 12:58:07 -0400 Subject: [Beowulf] copying big files (Henning Fehrmann) In-Reply-To: (David Mathog's message of "Fri\, 08 Aug 2008 09\:11\:46 -0700") References: Message-ID: <871w0zphpc.fsf@snark.cb.piermont.com> "David Mathog" writes: > The biggest problem with chain methods is that it is difficult to > recover if something breaks in the middle during the transfer. My > cluster is only 20 nodes and it has not been an issue, but on a 2000 > node cluster it probably would be.
It is of course also important that > all of the nodes in the distribution chain have sufficient free network > and CPU resources. If there are any slow nodes the whole chain will be > slow since the slow nodes will be rate limiting. Is there a reason bittorrent isn't suited to this application? -- Perry E. Metzger perry at piermont.com From peter.st.john at gmail.com Fri Aug 8 10:01:50 2008 From: peter.st.john at gmail.com (Peter St. John) Date: Fri, 8 Aug 2008 13:01:50 -0400 Subject: [Beowulf] Re: computer Go In-Reply-To: References: Message-ID: The program was MoGo, http://www.lri.fr/~gelly/MoGo.htm, but I don't know anything about the "borrowed" hardware. Peter On 8/8/08, Peter St. John wrote: > > The American Go Association (which has a free e-newsletter) at > http://www.usgo.org/ reports that a machine won an exhibition game with a > master last night at the US Go Congress. This isn't really historic; the > master, Myungwan Kim, is an 8 dan professional, and gave 9 handicap stones > to the machine. > > Very roughly, 8 dan pro would be comparable to 9 dan amateur; and very > roughly, Kim would be able to give me 9 stones too (I'm 1 dan amateur and > amateur handicaps equate one stone to one rank, and the mathematician Don > Weiner 6d beats me easily at 6 stones, although I should be able to cope at > 5). > > 9 stones is very roughly comparable to queen odds at chess, but the > statistical distributions of Go and Chess are not the same; a Grandmaster of > chess could maybe give me rook odds, not queen odds; knight odds is roughly > comparable to two standard deviations, a rating difference of about 400 > points, and I'm about 800 below the world champion (and I'm comparable in > go and chess). But again speaking very roughly, this result is in the > ballpark of achieving amateur 1 dan status, about the level that Ken > Thompson achieved with Belle in the mid-80's (the first USCF Expert > machine).
Odds games in chess do not have the same probabilistic qualities > as in Go; we almost never play odds games in chess anymore (it was popular > for money in the 19th century) but can't get along without handicapping in > Go, games between quite disparate players can be made interesting. > > I haven't found specifics for the machine or the team yet, but to quote the > article: > > 800 processors, at 4.7 Ghz, 15 Teraflops on borrowed supercomputers > > A related article said the machine(s) was sited in Europe. > > Peter > -------------- next part -------------- An HTML attachment was scrubbed... URL: From smulcahy at aplpi.com Fri Aug 8 10:13:11 2008 From: smulcahy at aplpi.com (stephen mulcahy) Date: Fri, 08 Aug 2008 18:13:11 +0100 Subject: [Beowulf] copying big files In-Reply-To: <8763qbphqr.fsf@snark.cb.piermont.com> References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> <8763qbphqr.fsf@snark.cb.piermont.com> Message-ID: <489C7EA7.3030602@aplpi.com> Perry E. Metzger wrote: > bittorrent works quite well, and is trivial to use. It will use all > your IO bandwidth between all nodes (which is what you want I > presume.) What is the simplest app for setting up this kind of local torrenting setup? I've used public torrents before but assumed you needed a tracker or something - or is that just for sharing them with the public? -stephen -- Stephen Mulcahy, Applepie Solutions Ltd., Innovation in Business Center, GMIT, Dublin Rd, Galway, Ireland. +353.91.751262 http://www.aplpi.com Registered in Ireland, no. 
289353 (5 Woodlands Avenue, Renmore, Galway) From apittman at concurrent-thinking.com Fri Aug 8 10:13:38 2008 From: apittman at concurrent-thinking.com (Ashley Pittman) Date: Fri, 08 Aug 2008 18:13:38 +0100 Subject: [Beowulf] copying big files In-Reply-To: <489C6D5C.4020507@scalableinformatics.com> References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> <489C6D5C.4020507@scalableinformatics.com> Message-ID: <1218215618.7731.21.camel@bruce.priv.wark.uk.streamline-computing.com> On Fri, 2008-08-08 at 11:59 -0400, Joe Landman wrote: > > The transmission of a single package is fast. The time of passing a particular > > package through the whole chain of nodes is short compared with time of the > > entire copying process. E.g., using jumbo frames a package can have the size of ca 10kB. > > In Gb network the transmission time of a single package between nodes is > > of the order of 0.1 ms. Even in a cluster with 1024 nodes it takes > > in an ideal case just 0.1s to pass a package from node n1 through all nodes to n1024. > > > > On each node the package is stored and, in the end, one reassembles the file. > > For big files (size >> 10Mb) the required time is approximately > > the same as one needs for copying the file between two nodes plus 0.1s. > > > > One needs basically a daemon which handles copying requests and establishes > > the connection to next node in the chain. > > > > Has somebody written such a tool? > > I saw something like this several years ago. I've written a couple of them over the years, one as a way of copying a new OS between nodes and I started one more recently as a general purpose copy tool. I've always thought that something like this should be written in MPI and simply make use of MPI_Bcast() and MPI_Allreduce() to do the hard part, it's very easy to write a version that copies a single file, slightly harder to do multiple files but still not rocket science.
Basically efficient broadcast isn't as easy to make generic as it seems, why waste time even trying when you can get MPI to do all tricky bits like work out topology/starting daemons/security and in all probability do it better to boot? Ashley Pittman. From geoff at galitz.org Fri Aug 8 10:27:41 2008 From: geoff at galitz.org (Geoff Galitz) Date: Fri, 8 Aug 2008 19:27:41 +0200 Subject: [Beowulf] copying big files (Henning Fehrmann) In-Reply-To: References: Message-ID: <52CB0AA0F8DF40C48E60D6079B698751@geoffPC> I use dolly (http://www.cs.inf.ethz.ch/CoPs/patagonia/ and search for dolly) from which nettee is forked and pdsh (http://sourceforge.net/projects/pdsh). Both are great but dolly has certain advantages for my environment. In my case, I wrap it up in a service delivery tool for a 50 node cluster where rapid deployment is key. It is just some perl code that essentially looks like this: foreach list_of_machines if they answer, add them to the list done login to the nodes via pdsh to start the dolly processes perform the transfer done! There is actually a lot more to my code, but it is all environment specific and deals with managing our custom apps. I can push 1.6G (which includes an svn checkout over the wire) to all nodes in approx 15 minutes. The checkout is the longest part, the actual file copy to the nodes is less than 5 minutes with our GigE network. The nodes are busy processing while the data transfer is in progress (this is an HA cluster). When I researched this initially, I found there were actually a lot of environment specific questions I needed answered, hence a lack of real standardization on how these things are done. Scaling is often the hardest part... at least IMO. I will say that my dream would be for something like dolly to get some sort of transfer recovery mechanism, though I realize that would be quite difficult in such a topology. As an aside, I know that the dolly author (Felix) reads this list. I assume dolly itself is now unmaintained?
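For anyone wanting to try the same pattern, the wrapper outlined above (probe the nodes, keep the ones that answer, then start the transfer processes on them through pdsh) might look roughly like this in Python. This is only a sketch under assumptions, not the Perl code described: the function names are made up for illustration, the ping flags are the Linux iputils ones, and the remote command is left as a parameter because the actual dolly invocation is environment specific. The only pdsh feature relied on is "-w" with a comma-separated host list.

```python
import subprocess


def responsive_hosts(hosts, probe):
    """Return the hosts for which probe(host) is true, preserving order."""
    return [h for h in hosts if probe(h)]


def ping_probe(host, timeout_s=1):
    """Assumed probe: one ICMP echo via the system ping (Linux iputils flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def pdsh_command(hosts, remote_command):
    """Build the pdsh argv that runs remote_command on every host in hosts."""
    if not hosts:
        raise ValueError("no responsive hosts")
    # pdsh -w takes a comma-separated list of target hosts.
    return ["pdsh", "-w", ",".join(hosts), remote_command]
```

Usage would be something like pdsh_command(responsive_hosts(nodes, ping_probe), cmd), where cmd is whatever starts the client side of the transfer on each node, with the resulting list handed to subprocess.run(). Keeping the probe as a parameter makes it easy to swap ping for an ssh or TCP port check.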
Geoff Galitz Blankenheim NRW, Deutschland http://www.galitz.org -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of David Mathog Sent: Friday, 8 August 2008 18:12 To: beowulf at beowulf.org Subject: [Beowulf] copying big files (Henning Fehrmann) Henning Fehrmann wrote: > Copying a big file onto all nodes in a cluster is a rather > common problem. I would have thought that there might be a > standard tool for distributing the files in an efficient way. > So far, I haven't found one. From mathog at caltech.edu Fri Aug 8 10:33:41 2008 From: mathog at caltech.edu (David Mathog) Date: Fri, 08 Aug 2008 10:33:41 -0700 Subject: [Beowulf] copying big files (Henning Fehrmann) Message-ID: Perry E. Metzger wrote: > Is there a reason bittorrent isn't suited to this application? That would probably work too. I also recall reading about "broadcast" methods for doing this sort of distribution. I originally wrote nettee, which is a derivative of "dolly", because Ghost (which uses a broadcast method) was dreadfully slow. This was many moons ago, and there is probably much better broadcast software around today, even in Ghost itself. nettee has a few whistles and bells for this "data push" application. bittorrent is more of a "data pull" application. So on the top node, nettee can easily tell you when all of the clients are in the chain (or not), monitor the progress of the download, and indicate if anything went wrong. To do that with the bittorrent clients I assume you would have to wrap them in scripts to report the desired information back to a central point. It looks to me like it would be easiest to have each client report its own completion status, perhaps through a wget with a carefully formed URL to a cgi script on a central web server. Because torrents are "pull", determining from the sending side(s) if a particular torrent client has successfully completed a file download looks like a nightmare.
It would seem to require examining the log information on (potentially all) of the other clients. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From perry at piermont.com Fri Aug 8 10:36:54 2008 From: perry at piermont.com (Perry E. Metzger) Date: Fri, 08 Aug 2008 13:36:54 -0400 Subject: [Beowulf] copying big files In-Reply-To: <489C7EA7.3030602@aplpi.com> (stephen mulcahy's message of "Fri\, 08 Aug 2008 18\:13\:11 +0100") References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> <8763qbphqr.fsf@snark.cb.piermont.com> <489C7EA7.3030602@aplpi.com> Message-ID: <87sktfo1c9.fsf@snark.cb.piermont.com> stephen mulcahy writes: > Perry E. Metzger wrote: >> bittorrent works quite well, and is trivial to use. It will use all >> your IO bandwidth between all nodes (which is what you want I >> presume.) > > What is the simplest app for setting up this kind of local torrenting > setup? Bittorrent itself, or bittornado which is a slightly improved version. They're written in python and trivial to install. > I've used public torrents before but assumed you needed a > tracker or something - or is that just for sharing them with the > public? You do need a tracker, but "bttrack" comes with bittorrent or bittornado and works just fine. You can launch it from the command line in seconds. You also need to create a .torrent file, which you do with "btmakemetafile", which also comes with the bittorrent or bittornado package. -- Perry E. Metzger perry at piermont.com From perry at piermont.com Fri Aug 8 10:53:08 2008 From: perry at piermont.com (Perry E. Metzger) Date: Fri, 08 Aug 2008 13:53:08 -0400 Subject: [Beowulf] copying big files (Henning Fehrmann) In-Reply-To: (David Mathog's message of "Fri\, 08 Aug 2008 10\:33\:41 -0700") References: Message-ID: <87k5ero0l7.fsf@snark.cb.piermont.com> "David Mathog" writes: > Perry E. Metzger wrote: >> Is there a reason bittorrent isn't suited to this application? 
> > That would probably work too. > > I also recall reading about "broadcast" methods for doing this sort > of distribution. John Ioannidis designed a protocol for this years ago called the Coherent File Distribution Protocol, CFDP. http://tools.ietf.org/html/rfc1235 It can be used for things like mass loading hundreds of machines with a single file on a broadcast fabric. It was originally intended for loading huge numbers of clients over a wireless broadcast network, for which it worked quite well. There hasn't been much interest in it over time, perhaps because the sorts of problems it was intended for are rare enough that people just use ad hoc methods. I think JI's original implementation is open source, though it is quite creaky at this point. If there is interest, I can ask him to put it up on the web somewhere. -- Perry E. Metzger perry at piermont.com From mathog at caltech.edu Fri Aug 8 10:55:19 2008 From: mathog at caltech.edu (David Mathog) Date: Fri, 08 Aug 2008 10:55:19 -0700 Subject: [Beowulf] copying big files (Henning Fehrmann) Message-ID: > I will say that my dream would be for something like dolly to get some sort > of transfer recovery mechanism, though I realize that would be quite > difficult in such a topology. nettee has some failover and continuation capabilities at different points - but not what I think you want. The development version has a few extra modes for cases where data is being merged, but that isn't relevant to this discussion. When setting up the initial chain nettee can connect to an alternate node (from a list of failovers) if the target node will not answer. It also has the ability to keep going if the local disk becomes unwritable, and it can continue a download on a chain down to the node above the point of failure. However, nettee cannot at present rewire around a failed node to continue a download to the node(s) below it. 
That would indeed be quite difficult, since one could have a situation like this: A -> B (A knows it has sent 100MB) B -> C (B knows it has sent 98MB, then it blows up) C (C knows it has received 98 MB) A and C will eventually figure out that B has died, and they could conceivably negotiate a new connection, but A may no longer have the missing 2 MB (it might have been sent out a pipe, processed, and not stored in the raw state anywhere.) On the other hand, the development version uses ring buffers, and one could set those to be very large, enabling a certain level of "redo" from A. So if C comes back and says "I only have 98MB" A can see if it has the missing parts and go on if it does. It still might not though. If B has stalled for long enough the ring buffer on A may have completely filled from the previous node, overwriting the data needed to recover. I guess it would be possible to implement a "safety region" in the ring buffer which could not be overwritten. > > As an aside, I know that the dolly author (Felix) reads this list. I assume > dolly itself is now unmaintained? AFAIK Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From kus at free.net Fri Aug 8 11:47:12 2008 From: kus at free.net (Mikhail Kuzminsky) Date: Fri, 08 Aug 2008 22:47:12 +0400 Subject: [Beowulf] LAM w/HPC Challenge: ld.so interface ? Message-ID: For some reasons I need to run HPC Challenge tests w/LAM MPI instead of OpenMPI. LAM was installed, and the classical test w/hello - using gcc - was performed successfully. But when I build hpcc executable, I see linker messages about undefined references to dlsym/dlclose/dlopen/dlerror modules from the functions sys_dl_sym/close/open - which are presented in ltdl.o module (in liblam.a). Sorry, how to find - which library contains this dl* modules ? 
Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow From Daniel.Pfenniger at obs.unige.ch Fri Aug 8 11:57:43 2008 From: Daniel.Pfenniger at obs.unige.ch (Daniel Pfenniger) Date: Fri, 08 Aug 2008 20:57:43 +0200 Subject: [Beowulf] gcc 4.3 and acml 4.1.0 do not play together In-Reply-To: References: Message-ID: <489C9727.3020602@obs.unige.ch> Mark Kosmowski wrote: > After a bit of struggling I found a post at the AMD Developer Forum > stating that gcc 4.3 and acml 4.1.0 are not compatible. > > I've been reading a bunch of folks here using gcc 4.3 and just wanted > to make everyone aware, hopefully prevent some time lost to futility. > > Have a great weekend! > > Mark E. Kosmowski Perhaps related to this there was this announcement (http://lwn.net/Articles/272048/) that the Linux kernel was incompatible with gcc 4.3 because the new gcc series changed some assumptions about x86 processor flag. I think this problem is no longer of concern for the Linux kernel, but may still exist in other software. Dan From hahn at mcmaster.ca Fri Aug 8 12:16:10 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 8 Aug 2008 15:16:10 -0400 (EDT) Subject: [Beowulf] LAM w/HPC Challenge: ld.so interface ? In-Reply-To: References: Message-ID: > Sorry, how to find - which library contains this dl* modules ? I think you just need to link with -ldl From reuti at staff.uni-marburg.de Fri Aug 8 14:04:58 2008 From: reuti at staff.uni-marburg.de (Reuti) Date: Fri, 8 Aug 2008 23:04:58 +0200 Subject: [Beowulf] LAM w/HPC Challenge: ld.so interface ? In-Reply-To: References: Message-ID: <8A616A6D-5E63-4822-B0A5-2222ADE9CF42@staff.uni-marburg.de> Hi, Am 08.08.2008 um 20:47 schrieb Mikhail Kuzminsky: > For some reasons I need to run HPC Challenge tests w/LAM MPI > instead of OpenMPI. > > LAM was installed, and the classical test w/hello - using gcc - was > performed successfully. 
> > But when I build hpcc executable, I see linker messages about > undefined references to dlsym/dlclose/dlopen/dlerror modules from > the functions > sys_dl_sym/close/open - which are presented in ltdl.o module (in > liblam.a). > > Sorry, how to find - which library contains this dl* modules ? sometimes I make a loop using "nm " across a complete directory in such cases looking for T entries for the symbols in question. -- Reuti From coutinho at dcc.ufmg.br Fri Aug 8 16:07:40 2008 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Fri, 8 Aug 2008 20:07:40 -0300 Subject: [Beowulf] Re: computer Go In-Reply-To: References: Message-ID: 2008/8/8 Peter St. John > The program was MoGo, http://www.lri.fr/~gelly/MoGo.htm, > but I don't know anything about the "borrowed" hardware. > Peter > ... > > > I haven't found specifics for the machine or the team yet, but to quote the >> article: >> >> 800 processors, at 4.7 Ghz, 15 Teraflops on borrowed supercomputers >> >> A related article said the machine(s) was sited in Europe. >> >> If the processor clock is 4.7 Ghz, this cluster should use IBM Power 6 processors (http://en.wikipedia.org/wiki/POWER6). -------------- next part -------------- An HTML attachment was scrubbed... URL: From jlforrest at berkeley.edu Fri Aug 8 17:09:54 2008 From: jlforrest at berkeley.edu (Jon Forrest) Date: Fri, 08 Aug 2008 17:09:54 -0700 Subject: [Beowulf] Weird CentOS Install Problem Message-ID: <489CE052.4020908@berkeley.edu> I thought maybe some of you cluster people might have seen this. I have a brand new machine that will be the frontend of a cluster. It has a 3ware 9650 12-port RAID controller with 12 1TB drives attached. I used the "Boot Volume" feature in the RAID controller to make an 80GB boot volume. I install CentOS 5.2 x86_64 on this and everything installed fine, except ... When I boot the newly installed machine it stops in the grub prompt. 
If I type by hand the commands in /etc/grub.conf (which I saw by booting from a rescue CD), the first command "root (hd0,0)" shows that an ext3 partition was recognized, as it should. However, when I enter the "kernel ...." command, I get the following error message: Error 18: Selected cylinder exceeds maximum supported by BIOS What's weird about this is that the root file system starts on cylinder 1, as confirmed by the fdisk command. This is using a brand new SuperMicro X7SBE motherboard with the newest BIOS. What's even weirder is that the integrator that I purchased the system from somehow managed to install CentOS 5. I saw it boot the first time I turned on the system. I deliberately wiped it out. Needless to say, I have a message in to them. Any ideas what could cause this? Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From matt at technoronin.com Fri Aug 8 17:41:13 2008 From: matt at technoronin.com (Matt Lawrence) Date: Fri, 8 Aug 2008 19:41:13 -0500 (CDT) Subject: [Beowulf] Weird CentOS Install Problem In-Reply-To: <489CE052.4020908@berkeley.edu> References: <489CE052.4020908@berkeley.edu> Message-ID: On Fri, 8 Aug 2008, Jon Forrest wrote: > I thought maybe some of you cluster people might have > seen this. > > I have a brand new machine that will be the frontend > of a cluster. It has a 3ware 9650 12-port RAID controller > with 12 1TB drives attached. > > I used the "Boot Volume" feature in the RAID controller > to make an 80GB boot volume. I install CentOS 5.2 x86_64 > on this and everything installed fine, except ... I've never used that controller, so I may be off a bit. > When I boot the newly installed machine it stops in > the grub prompt.
If I type by hand the commands in > /etc/grub.conf (which I saw by booting from a rescue CD), > the first command "root (hd0,0)" shows that an ext3 partition > was recognized, as it should. However, when I enter the > "kernel ...." command, I get the following error message: > > Error 18: Selected cylinder exceeds maximum supported by BIOS > > What's weird about this is that the root file system starts > on cylinder 1, as confirmed by the fdisk command. This is > using a brand new SuperMicro X7SBE motherboard with the > newest BIOS. I suggest you create a /boot partition of about 200MB. Personally, I have gone to using an IDE->CF adapter and putting /boot on the Compact Flash. I picked up several adapters for around $5 each a while back and I use CF cards that are smaller than I feel like using in my camera (nothing smaller than 4GB in my camera case). -- Matt It's not what I know that counts. It's what I can remember in time to use. From landman at scalableinformatics.com Fri Aug 8 19:18:57 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 08 Aug 2008 22:18:57 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: References: <627174098.322311217331658556.JavaMail.root@zimbra.vpac.org> <489A4577.5040508@neuralbs.com> <489A4A3D.1070000@scalableinformatics.com> <489BA330.1020907@scalableinformatics.com> Message-ID: <489CFE91.7060003@scalableinformatics.com> Matt Lawrence wrote: > I really would like to have it all in one package. It's too easy to get > things out of sync when doing changes by hand. I was figuring to find > the appropriate sections of the CentOS spec file and use them. As > always, other suggestions are welcome. You can add this in to the spec file that the make command generates. From my experience in dealing with the spec files from Redhat and SuSE my advice is, again, don't waste your time. mkinitrd is done in a post section of the kernel install. 
Here is what I used for the 2.6.26.1 kernel:

    mkinitrd -v --with=sd_mod --with=libata --with=ata_generic \
        --with=ata_piix --with=pata_acpi \
        /boot/initrd-2.6.26.1.img 2.6.26.1

Then the grub manipulation is done with grubby

    grubby --add-kernel=/vmlinuz-2.6.26.1 \
        --initrd=/initrd-2.6.26.1.img \
        --make-default \
        --copy-default

So, add these two in to the spec file, and

    rpmbuild -bb kernel.spec

and

    rpmbuild -bs kernel.spec

Again, none of this is hard, and you can add it in to the %post section by hand. The generated spec file is a good working spec file and will work on RHEL and variants, SuSE and variants, as well as others. But grubby is specific to RHEL variants last I checked. This is why they don't generate a %post for you which does all this for you. Similar with mkinitrd. You can add it in and go from there. Joe > > -- Matt > It's not what I know that counts. > It's what I can remember in time to use. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From jlforrest at berkeley.edu Fri Aug 8 19:57:07 2008 From: jlforrest at berkeley.edu (Jon Forrest) Date: Fri, 08 Aug 2008 19:57:07 -0700 Subject: [Beowulf] Weird CentOS Install Problem In-Reply-To: References: <489CE052.4020908@berkeley.edu> Message-ID: <489D0783.1060303@berkeley.edu> Matt Lawrence wrote: > On Fri, 8 Aug 2008, Jon Forrest wrote: > >> What's weird about this is that the root file system starts >> on cylinder 1, as confirmed by the fdisk command. This is >> using a brand new SuperMicro X7SBE motherboard with the >> newest BIOS.
> > I suggest you create a /boot partition of about 200MB. On Monday I'm going to try various combinations of file systems and partition sizes. The first one I'm going to try will be the default CentOS arrangement. This is a really strange problem to me. I'm guessing the problem is in the 3ware BIOS. After all, tons of people use 80GB drives (and larger) with no problems. The fact that an 80GB device carved by the 3ware BIOS/firmware doesn't work makes me suspect the 3ware device. I'm also going to call 3ware. Jon From carsten.aulbert at aei.mpg.de Fri Aug 8 23:50:31 2008 From: carsten.aulbert at aei.mpg.de (Carsten Aulbert) Date: Sat, 09 Aug 2008 08:50:31 +0200 Subject: [Beowulf] copying big files (Henning Fehrmann) In-Reply-To: <871w0zphpc.fsf@snark.cb.piermont.com> References: <871w0zphpc.fsf@snark.cb.piermont.com> Message-ID: <489D3E37.3010601@aei.mpg.de> Hi Perry E. Metzger wrote: > Is there a reason bittorrent isn't suited to this application? > Our investigations so far showed that bittorrent is only good if the files to be transferred fit well into main memory. If you exceed about 90-95% of the RAM your disks will be accessed a lot and the performance breaks down a lot (we have seen close to wirespeed in the beginning and in the end we were crawling with mere few 10 kByte/s). Cheers Carsten -- Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics Callinstrasse 38, 30167 Hannover, Germany Phone/Fax: +49 511 762-17185 / -17193 http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: carsten_aulbert.vcf Type: text/x-vcard Size: 414 bytes Desc: not available URL: From jbdundas at gmail.com Sat Aug 9 11:53:14 2008 From: jbdundas at gmail.com (jitesh dundas) Date: Sun, 10 Aug 2008 00:23:14 +0530 Subject: [Beowulf] copying big files (Henning Fehrmann) In-Reply-To: <489D3E37.3010601@aei.mpg.de> References: <871w0zphpc.fsf@snark.cb.piermont.com> <489D3E37.3010601@aei.mpg.de> Message-ID: <326ea8620808091153s53245cd2jc3bd1f81e9a55a08@mail.gmail.com> Hi, We could try and implement this functionality of resuming broken downloads like in some softwares like Download Accelerator and bit-torrent. I hope my views can help, so here goes:- When a file is being downloaded, we can keep a stack of all of these downloads in progress at a centralized repository, preferably where the user has kept his file hosted for download or on the machine where the download is to be done. Next, we can keep the track of the point at which the download stopped and store it in the repository. Next, if the user tries to start the download it again, we can again retrieve it back from the data and get the end point of the previous download. The end point for each file can include the file details in terms of bits and bytes( 0 & 1) or even in percentages or pieces..Next time we can break our file based on pieces or percentages( as needed) and start the download from the nearest point that is best suited for the user. I hope this helps... I request your feedback... Thanks, Jitesh Dundas Mobile- +91-9860925706 http://jiteshbdundas.blogspot.com On 8/9/08, Carsten Aulbert wrote: > Hi > > Perry E. Metzger wrote: > >> Is there a reason bittorrent isn't suited to this application? >> > > Our investigations so far showed that bittorrent is only good if the > files to be transferred fit well into main memory. 
If you exceed about > 90-95% of the RAM your disks will be accessed a lot and the performance > breaks down a lot (we have seen close to wirespeed in the beginning and > in the end we were crawling with mere few 10 kByte/s). > > Cheers > > Carsten > > -- > Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics > Callinstrasse 38, 30167 Hannover, Germany > Phone/Fax: +49 511 762-17185 / -17193 > http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31 > From reuti at staff.uni-marburg.de Sat Aug 9 14:03:57 2008 From: reuti at staff.uni-marburg.de (Reuti) Date: Sat, 9 Aug 2008 23:03:57 +0200 Subject: [Beowulf] copying big files (Henning Fehrmann) In-Reply-To: <326ea8620808091153s53245cd2jc3bd1f81e9a55a08@mail.gmail.com> References: <871w0zphpc.fsf@snark.cb.piermont.com> <489D3E37.3010601@aei.mpg.de> <326ea8620808091153s53245cd2jc3bd1f81e9a55a08@mail.gmail.com> Message-ID: Hi, Am 09.08.2008 um 20:53 schrieb jitesh dundas: > We could try and implement this functionality of resuming broken > downloads like in some softwares like Download Accelerator and > bit-torrent. > > I hope my views can help, so here goes:- > > When a file is being downloaded, we can keep a stack of all of these > downloads in progress at a centralized repository, preferably where > the user has kept his file hosted for download or on the machine where > the download is to be done. > > Next, we can keep the track of the point at which the download stopped > and store it in the repository. Next, if the user tries to start the > download it again, we can again retrieve it back from the data and get > the end point of the previous download. > > The end point for each file can include the file details in terms of > bits and bytes( 0 & 1) or even in percentages or pieces..Next time we > can break our file based on pieces or percentages( as needed) and > start the download from the nearest point that is best suited for the > user. 
regarding user transmission of big files, maybe even between sites, I would look into splitting the files and using a checksum like .par or .par2. http://sourceforge.net/projects/parchive Even if one part doesn't make it to the other node, you can still assemble the complete file due to the added checksum files. But the original question was copying files inside a cluster to thousands of nodes. As 1000 nodes still means some amount of money to spend, what about looking into something like IBM's GPFS and their SAN switch and connect all nodes to this switch? -- Reuti > I hope this helps... > I request your feedback... > > Thanks, > Jitesh Dundas > Mobile- +91-9860925706 > http://jiteshbdundas.blogspot.com > > > On 8/9/08, Carsten Aulbert wrote: >> Hi >> >> Perry E. Metzger wrote: >> >>> Is there a reason bittorrent isn't suited to this application? >>> >> >> Our investigations so far showed that bittorrent is only good if the >> files to be transferred fit well into main memory. If you exceed >> about >> 90-95% of the RAM your disks will be accessed a lot and the >> performance >> breaks down a lot (we have seen close to wirespeed in the >> beginning and >> in the end we were crawling with mere few 10 kByte/s). >> >> Cheers >> >> Carsten >> >> -- >> Dr. 
Carsten Aulbert - Max Planck Institute for Gravitational Physics >> Callinstrasse 38, 30167 Hannover, Germany >> Phone/Fax: +49 511 762-17185 / -17193 >> http://www.top500.org/system/9234 | http://www.top500.org/connfam/ >> 6/list/31 >> > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From atchley at myri.com Sun Aug 10 04:57:23 2008 From: atchley at myri.com (Scott Atchley) Date: Sun, 10 Aug 2008 07:57:23 -0400 Subject: [Beowulf] copying big files (Henning Fehrmann) In-Reply-To: References: <871w0zphpc.fsf@snark.cb.piermont.com> <489D3E37.3010601@aei.mpg.de> <326ea8620808091153s53245cd2jc3bd1f81e9a55a08@mail.gmail.com> Message-ID: <34AC7180-5373-4C35-9972-9CA4B0B94322@myri.com> On Aug 9, 2008, at 5:03 PM, Reuti wrote: > Hi, > > Am 09.08.2008 um 20:53 schrieb jitesh dundas: > >> We could try and implement this functionality of resuming broken >> downloads like in some softwares like Download Accelerator and >> bit-torrent. >> >> I hope my views can help, so here goes:- >> >> When a file is being downloaded, we can keep a stack of all of these >> downloads in progress at a centralized repository, preferably where >> the user has kept his file hosted for download or on the machine >> where >> the download is to be done. >> >> Next, we can keep the track of the point at which the download >> stopped >> and store it in the repository. Next, if the user tries to start the >> download it again, we can again retrieve it back from the data and >> get >> the end point of the previous download. >> >> The end point for each file can include the file details in terms of >> bits and bytes( 0 & 1) or even in percentages or pieces..Next time we >> can break our file based on pieces or percentages( as needed) and >> start the download from the nearest point that is best suited for the >> user. 
> > regarding user transmission of big files, maybe even between sites, > I would look into splitting the files and using a checksum like .par > or .par2. > > http://sourceforge.net/projects/parchive > > Even if one part doesn't make it to the other node, you can still > assemble the complete file due to the added checksum files. > > But the original question was copying files inside a cluster to > thousands of nodes. As 1000 nodes still means some amount of money > to spend, what about looking into something like IBM's GPFS and > their SAN switch and connect all nodes to this switch? > > -- Reuti You may want to look at http://loci.cs.utk.edu. If you need to distribute large files within a cluster or across the WAN, you can use the LoRS tools to stripe the file over multiple servers and the clients then try pulling blocks off of each server in parallel. Using Internet2 and one client at Vanderbilt and a couple servers at Univ of Tennessee, they were able to saturate UT's ~400 Mb/s I2 link (much to the disbelief of the Vandy IT staff). I have seen ~5 Gb/s within a cluster using good 10G NICs. :-) Scott From atchley at myri.com Sun Aug 10 05:02:52 2008 From: atchley at myri.com (Scott Atchley) Date: Sun, 10 Aug 2008 08:02:52 -0400 Subject: [Beowulf] copying big files (Henning Fehrmann) In-Reply-To: <34AC7180-5373-4C35-9972-9CA4B0B94322@myri.com> References: <871w0zphpc.fsf@snark.cb.piermont.com> <489D3E37.3010601@aei.mpg.de> <326ea8620808091153s53245cd2jc3bd1f81e9a55a08@mail.gmail.com> <34AC7180-5373-4C35-9972-9CA4B0B94322@myri.com> Message-ID: On Aug 10, 2008, at 7:57 AM, Scott Atchley wrote: > You may want to look at http://loci.cs.utk.edu. If you need to > distribute large files within a cluster or across the WAN, you can > use the LoRS tools to stripe the file over multiple servers and the > clients then try pulling blocks off of each server in parallel. 
> Using Internet2 and one client at Vanderbilt and a couple servers at > Univ of Tennessee, they were able to saturate UT's ~400 Mb/s I2 link > (much to the disbelief of the Vandy IT staff). I have seen ~5 Gb/s > within a cluster using good 10G NICs. :-) > > Scott I forgot to mention LoRS optionally uses MD5 for checksums and AES-128 for encryption (you can use either, both or neither). The stored file is represented by a XML file called an exNode. If you want to share the data, you can email the exNode to someone and they can then download the data. You control the download offset and length so that you can extract just the parts of the file that you want. I believe there is a NetCDF version that can use exNodes and there may be a HDF5 version as well. Scott From atchley at myri.com Sun Aug 10 05:09:01 2008 From: atchley at myri.com (Scott Atchley) Date: Sun, 10 Aug 2008 08:09:01 -0400 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <48990FE3.3080500@scalableinformatics.com> References: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> <48990FE3.3080500@scalableinformatics.com> Message-ID: <3146A549-EB13-4C21-8D39-42B31073E717@myri.com> On Aug 5, 2008, at 10:43 PM, Joe Landman wrote: > As a note: I was pointed to a recent lockup (double lock > acquisition) in XFS with NFS. I don't think I have seen this one in > the wild myself. Right now I am fighting an NFS over RDMA crash in > 2.6.26 which seems to have been cured in 2.6.26.1 . .2 is almost > out, so will test with that as well. > > This said, our experience with xfs has been quite good (performance, > reliability, etc). Some vendors kernels (2.6.18 ahem!) have some > issues with xfs (and a bunch of other things), so we usually update > them anyway. > > Joe Joe, They have posted a patch to correct this problem on the linux-nfs list if anyone is encountering it. 
Scott From niftyompi at niftyegg.com Sun Aug 10 21:25:27 2008 From: niftyompi at niftyegg.com (Nifty niftyompi Mitch) Date: Sun, 10 Aug 2008 21:25:27 -0700 Subject: [Beowulf] copying big files In-Reply-To: <12010303044.20080808175240@gmx.net> References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> <12010303044.20080808175240@gmx.net> Message-ID: <20080811042527.GA21906@hpegg.niftyegg.com> On Fri, Aug 08, 2008 at 05:52:40PM +0200, Jan Heichler wrote: > Hallo Henning, > HF> Hi everybody, > > HF> One needs basically a daemon which handles copying requests and > establishes > > HF> the connection to next node in the chain. > > Why a daemon? Just MPI that starts up the processes on the remote nodes > during programm startup. Advantage is that you can use any > high-speed-interconnect which you have an MPI for. > > HF> Has somebody written such a tool? -?- Is this an administrative tool or an MPI application need? -?- If MPI, is this the executable file itself or a common data file? Administrative tools could leverage torrent ideas, scp or rsync trees with modest scripting to distribute and check the file. I have seen a handful of solutions good and bad, slow and fast, reliable and fragile... QLogic has a tool "scpall" in their Fast fabric tools to address this, Rocks has additional tools .... MPI is interesting because of the power of MPI and that most MPI clusters have VERY FAST links available to MPI. However it can be unclear where the original file resides, where it will go, how to manage multi core complications, file naming convention, and clean up. Assuming that the file is visible to one rank and only needs to be deposited on the nodes involved in the MPI job a standard MPI library using MPI data transfers could be used to move data. The internals of the library could use trees, rings or tree rings to move the data; who cares once a clean interface is established. 
One classic MPI problem is the user launching "mpirun ./justbuilt.exe" on
his local system when ./ on the compute nodes does not have a copy.
Batch systems could help here...

If the problem is that a data file must be predistributed, say for a dusty
deck that will only open a single fixed path to data, then the batch system
may be ideal for managing the transfer in a %pre launch task. In such a
case the administrative tool could be leveraged, but again it must be
multi-core safe/aware.

Another problem might be that the executable and libraries need to be
predistributed so that execution start-up and paging are improved. On a
1000 node cluster running an 8000 rank MPI program that lives on a taxed
NFS resource, the 8000 startup reads could improve 1000 fold by executing
something like /localscratch/my.exe, IFF the %pre could distribute it in
a x8 deep tree in *8 time. This can be important for start-up time....
The batch system could quickly check a lookup table for the N ranks
to see if ./my.exe is on NFS and should be predistributed and launched
as /local2nodeCache/my-unique-something.exe. Policy on some large clusters
is such that this issue has a forced solution that only permits the
launching of /opt/blessedbymanagement/bin

Another permutation is large (sparse) data files where each rank is
responsible (read and/or write) for a region that is a function(of-rank)....
Such applications might come from developers trained on large SMP
systems with many IO channels, like an old but big SGI Origin system.

Next is the topic of size. With some applications the data sets are vast
and cannot or should not be distributed. In the 8000 rank case, how is
it possible to know what portion of a data file is exclusive input or
output for a rank? Smart parallel file systems can improve things here.

I am sure I missed some topics.... Others should add to the check list.

Summary: one size does not (currently) fit all.
What problem is being addressed?
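A rough way to see why a tree helps with the predistribution case above: the sketch below counts how many forwarding rounds pass before every node holds a local copy. The model (in each round, every node that already has the file forwards it to `fanout` nodes that do not) and the function name `tree_rounds` are assumptions made for illustration; they are not a description of any particular batch system, nor necessarily of the x8 tree mentioned above.

```python
def tree_rounds(n_nodes, fanout):
    """Count forwarding rounds until n_nodes hold the file.

    Assumed model: one node starts with the file, and in every round
    each node that already has it sends a copy to `fanout` new nodes.
    """
    if n_nodes < 1 or fanout < 1:
        raise ValueError("n_nodes and fanout must be at least 1")
    have = 1      # nodes that currently hold the file
    rounds = 0
    while have < n_nodes:
        have += have * fanout   # every holder serves `fanout` new nodes
        rounds += 1
    return rounds


if __name__ == "__main__":
    for fanout in (1, 2, 8):
        print(fanout, tree_rounds(1000, fanout))
```

Under this model a fan-out of 8 reaches 1000 nodes in 4 rounds, while simple doubling (fan-out 1) needs 10. Real timings would also depend on link sharing and disk speed, which the sketch ignores.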
-- T o m M i t c h e l l Got a great hat... now what. From i.kozin at dl.ac.uk Mon Aug 11 04:13:49 2008 From: i.kozin at dl.ac.uk (Kozin, I (Igor)) Date: Mon, 11 Aug 2008 12:13:49 +0100 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: <588c11220808061131l6c3ecf49hc45b5da7151e64b8@mail.gmail.com> Message-ID: > Generally speaking, MPI programs will not be fetching/writing data > from/to storage at the same time they are doing MPI calls so there > tends to not be very much contention to worry about at the node level. I tend to agree with this. > > Are there other practical and cost effective alternatives to this idea? > > If the cluster is small enough, using gigabit with a shared filesystem > is preferred since IB's low latency has relatively little affect on > the big source of latency in any storage system: the physical disks. > It's not until you cross the gigabit bandwidth barrier that IB really > starts to make sense--and that's a barrier that's not crossed that > often in a small cluster. I was thinking whether it is practical to use compute nodes as storage nodes thereby creating distributed fs on the very same cluster. The down side of this is much higher interference between apps running in parallel even if the MPI and storage networks are physically different. But then the aggregate i/o bandwidth is going to be colossal even over gigabit. Besides there are so many cores in modern servers that some can be put aside if necessary with little or no performance penalty. Such approach is certainly not a good idea for large clusters because the jitter will kill all the scaling but perhaps given the right setup and applications it should work well on a small cluster. 
It turned out pivot3 has done something along this already http://www.pivot3.com/ From henning.fehrmann at aei.mpg.de Mon Aug 11 09:45:02 2008 From: henning.fehrmann at aei.mpg.de (Henning Fehrmann) Date: Mon, 11 Aug 2008 18:45:02 +0200 Subject: [Beowulf] copying big files (Henning Fehrmann) In-Reply-To: References: Message-ID: <20080811164502.GA17328@gretchen.aei.uni-hannover.de> Hi, I found some time to play with dolly and nettee. They do what I was looking for. Thank you for the hints. > > I will say that my dream would be for something like dolly to get some > sort > > of transfer recovery mechanism, though I realize that would be quite > > difficult in such a topology. > > nettee has some failover and continuation capabilities at different > points - but not what I think you want. The development version has a > few extra modes for cases where data is being merged, but that isn't > relevant to this discussion. When setting up the initial chain nettee > can connect to an alternate node (from a list of failovers) if the > target node will not answer. It also has the ability to keep going if > the local disk becomes unwritable, and it can continue a download on a > chain down to the node above the point of failure. > > However, nettee cannot at present rewire around a failed node to > continue a download to the node(s) below it. That would indeed be quite > difficult, since one could have a situation like this: > > A -> B (A knows it has sent 100MB) > B -> C (B knows it has sent 98MB, then it blows up) > C (C knows it has received 98 MB) > > A and C will eventually figure out that B has died, and they could > conceivably negotiate a new connection, but A may no longer have the > missing 2 MB (it might have been sent out a pipe, processed, and not > stored in the raw state anywhere.) On the other hand, the development > version uses ring buffers, and one could set those to be very large, > enabling a certain level of "redo" from A. 
So if C comes back and says
> "I only have 98MB" A can see if it has the missing parts and go on if it
> does. It still might not though. If B has stalled for long enough
> the ring buffer on A may have completely filled from the previous node,
> overwriting the data needed to recover. I guess it would be possible to
> implement a "safety region" in the ring buffer which could not be
> overwritten.
>

I successfully spread a 10G file to 50 nodes. The rate was 140Mb/s for
nettee and a bit slower using dolly. I guess it was due to a busy node
somewhere in the chain.

Increasing the number of clients up to 100 failed in both cases.

For nettee I got:
nettee: fatal error writing to child: Connection reset by peer

for dolly:
Sent MB: 40, MB/s: 66.752, Current MB/s: 35.710
movebytes read/write: Connection reset by peer
errno = 104

I will do more systematic tests over the next few days.
David Mathog, are you interested in bug reports?

Cheers,
Henning Fehrmann

From jac67 at georgetown.edu Mon Aug 11 12:40:45 2008
From: jac67 at georgetown.edu (Jess Cannata)
Date: Mon, 11 Aug 2008 15:40:45 -0400
Subject: [Beowulf] Weird CentOS Install Problem
In-Reply-To: <489CE052.4020908@berkeley.edu>
References: <489CE052.4020908@berkeley.edu>
Message-ID: <48A095BD.6080904@georgetown.edu>

Jon,

I have had the same problem. You should double-check that the non-80 GB
volume has a GPT type partition table set. To see your current
partition table setting, run parted /dev/ and then
"print." You should see something like this:

(parted) p
Disk geometry for /dev/sde: 0.000-2626094.625 megabytes
Disk label type: gpt
Minor    Start       End     Filesystem  Name                  Flags
1          0.017 2626093.625  ext3

I've noticed that Red Hat's Anaconda will create an MS-DOS partition
table on the disk even though MS-DOS partition tables cannot support
volumes larger than 2 TB. You can use "parted" to change it to GPT
via the mklabel option. Then you can create the ext3 file system.
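The 2 TB figure follows from simple arithmetic, sketched below. It assumes 512-byte sectors and the commonly quoted reading of the MS-DOS (MBR) format, in which sector addresses and counts are 32-bit values. The function names are made up for this example, and treating parted's "megabytes" as units of 2^20 bytes is also an assumption.

```python
SECTOR_BYTES = 512      # assumed logical sector size
MAX_SECTORS = 2 ** 32   # MS-DOS tables hold 32-bit sector values


def msdos_label_limit_bytes(sector_bytes=SECTOR_BYTES):
    """Largest size an MS-DOS partition table can describe, in bytes."""
    return MAX_SECTORS * sector_bytes


def needs_gpt(volume_bytes, sector_bytes=SECTOR_BYTES):
    """True if the volume is too large for an MS-DOS partition table."""
    return volume_bytes > msdos_label_limit_bytes(sector_bytes)


if __name__ == "__main__":
    limit = msdos_label_limit_bytes()
    print(limit, limit / 2 ** 40)   # bytes, and the same value in TiB

    # The volume from the parted output above, read as 2**20-byte units.
    big_volume = int(2626094.625 * 2 ** 20)
    print(needs_gpt(big_volume))        # the large array
    print(needs_gpt(80 * 10 ** 9))      # the 80 GB boot volume
```

With these assumptions the limit is 2^41 bytes (2 TiB), so the large volume needs a GPT label while the 80 GB boot volume fits within an MS-DOS table.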
-- Jess Cannata Advanced Research Computing & High Performance Computing Training Georgetown University Jon Forrest wrote: > I thought maybe some of you cluster people might have > seen this. > > I have a brand new machine that will be the frontend > of a cluster. It has a 3ware 9650 12-port RAID controller > with 12 1TB drives attached. > > I used the "Boot Volume" feature in the RAID controller > to make an 80GB boot volume. I install CentOS 5.2 x86_64 > on this and everything installed fine, except ... > > When I boot the newly installed machine it stops in > the grub prompt. If I type by hand the commands in > /etc/grub.conf (which I saw by booting from a rescue CD), > the first command "root (hd0,0)" shows that an ext3 partition > was recognized, as it should. However, when I enter the > "kernel ...." command, I get the following error message: > > Error 18: Selected cylinder exceeds maximum supported by BIOS > > What's weird about this is that the root file system starts > on cylinder 1, as confirmed by the fdisk command. This is > using a brand new SuperMicro X7SBE motherboard with the > newest BIOS. > > What's even weirder is that the integrator that I purchased > the system from somehow managed to install CentOS 5. > I saw it boot the first time I turned on the system. > I deliberately wiped it out. Needless to say, I have > a message in to them. > > Any ideas could cause this? > > Cordially, From jac67 at georgetown.edu Mon Aug 11 12:44:52 2008 From: jac67 at georgetown.edu (Jess Cannata) Date: Mon, 11 Aug 2008 15:44:52 -0400 Subject: [Beowulf] Weird CentOS Install Problem In-Reply-To: <48A095BD.6080904@georgetown.edu> References: <489CE052.4020908@berkeley.edu> <48A095BD.6080904@georgetown.edu> Message-ID: <48A096B4.4000701@georgetown.edu> Jon, I just re-read your message more carefully and you may have a different problem if you didn't try to set up the non-80 GB volume in Anaconda. Are you using a /boot partition on the 80 GB volume? 
If not, I would try that first like the others have said. Jess Jess Cannata wrote: > Jon, > > I have had the same problem. You should double-check that the non-80 > GB volume has a GPT type partition table set. To see your current > partition table setting, run parted /dev/ and then > "print." You should see something like this: > > parted) p > Disk geometry for /dev/sde: 0.000-2626094.625 megabytes > Disk label type: gpt > Minor Start End Filesystem Name Flags > 1 0.017 2626093.625 ext3 > > I've noticed that Red Hat's Anaconda will create an MS-DOS partition > table on the disk even though MS-DOS partition tables cannot support > greater than 2 TB volumes. You can use "parted" to change it to GPT > via the mklabel option. Then you can create the ext3 file system. > From pal at di.fct.unl.pt Thu Aug 7 06:09:15 2008 From: pal at di.fct.unl.pt (Paulo Afonso Lopes) Date: Thu, 7 Aug 2008 14:09:15 +0100 (WEST) Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <489A75B2.2050009@tamu.edu> References: <382428251.47851218066512841.JavaMail.root@mail.vpac.org> <489A75B2.2050009@tamu.edu> Message-ID: <1225.10.170.133.93.1218114555.squirrel@www.di.fct.unl.pt> > Chris Samuel wrote: >> ----- "Robert G. Brown" wrote: >>> And even on Linux machines, NFS has been, well, "functional" >>> is a good way to describe it. >> It actually seems to work pretty well these days, our >> general config is: >> 1) No automounter >> 2) Hard mounts (so jobs just hang if they loose contact) >> 3) NFS over TCP (NFS over UDP is sooo 1990's :-)) >> 4) Jumbo frames (9000 byte MTUs) on the NFS network >> 5) NFS file server has hardwired fsid's to prevent stale file handles on >> a reboot >> 6) Debian, not RHEL on the server >> 7) XFS for /home on the server > > Speaking of jumbo frames, I'm seeing a problem on a Broadcom 57xx chipset on CentOS 4.3, 2.6.9-67 kernel (yeah, I know) and a tg3 driver. > I can't make the thing recognize the ability to use jumbo frames. > Anyone got a fix? 
Are you sure the chip does support jumbo? E.g., BCM5721 integrated in HP
DL145-G2 does not, while BCM5703X integrated in IBM x335 does support it.

While we're on that subject, I'm going to wire the HPs to a different
switch; I have the idea that when I started using them both connected to
the same SMC8624T switch, funny things happened with the eth1 jumbo
interfaces on the IBMs... (I'm keeping eth0 on 1500)

Has anyone seen this behaviour? Is mixing (jumbo and 1500-sized) on the
same "segment" standard, or is it a chip/switch/whatever issue?

Regards,
paulo

--
Paulo Afonso Lopes | Tel: +351- 21 294 8536
Departamento de Informática | 294 8300 ext.10763
Faculdade de Ciências e Tecnologia | Fax: +351- 21 294 8541
Universidade Nova de Lisboa | e-mail: pal at di.fct.unl.pt
2829-516 Caparica, PORTUGAL

From ken at kschuster.org Thu Aug 7 07:03:10 2008
From: ken at kschuster.org (Ken Schuster)
Date: Thu, 7 Aug 2008 07:03:10 -0700 (PDT)
Subject: [Beowulf] OTS "computation" stories
Message-ID: <449784.26677.qm@web56407.mail.re3.yahoo.com>

Aug. 7, 1991: Ladies and Gentlemen, the World Wide Web
http://www.wired.com/science/discoveries/news/2007/08/dayintech_0807

Where did the term "computer bug" originate? Now you have the answer (:>)

Aug. 7, 1944: Still a Few Bugs in the System
http://www.wired.com/science/discoveries/news/2008/08/dayintech_0807

From guanhome at gmail.com Fri Aug 8 07:27:23 2008
From: guanhome at gmail.com (guanq)
Date: Fri, 8 Aug 2008 10:27:23 -0400
Subject: [Beowulf] Re: computer Go
In-Reply-To:
References:
Message-ID:

It is only a matter of time before computers beat humans at everything.
If people use some theory not specific to Go, and avoid being trapped into
memorizing thousands and thousands of josekis (for example, some kind of
intensity vector to weight each move), they can probably find a better
and easier way to beat humans at Go.

On Fri, Aug 8, 2008 at 10:15 AM, Peter St.
John wrote:
> The American Go Association (which has a free e-newsletter) at
> http://www.usgo.org/ reports that a machine won an exhibition game with a
> master last night at the US Go Congress. This isn't really historic; the
> master, Myungwan Kim, is an 8 dan professional, and gave 9 handicap stones
> to the machine.
>
> Very roughly, 8 dan pro would be comparable to 9 dan amateur; and very
> roughly, Kim would be able to give me 9 stones too (I'm 1 dan amateur and
> amateur handicaps equate one stone to one rank, and the mathematician Don
> Weiner 6d beats me easily at 6 stones, although I should be able to cope at
> 5).
>
> 9 stones is very roughly comparable to queen odds at chess, but the
> statistical distributions of Go and Chess are not the same; a Grandmaster of
> chess could maybe give me rook odds, not queen odds; knight odds is roughly
> comparable to two standard deviations, a rating difference of about 400
> points, and I'm about 800 below the world champion (and I'm comparable in
> go and chess). But again speaking very roughly, this result is in the
> ballpark of achieving amateur 1 dan status, about the level that Ken
> Thompson achieved with Belle in the mid-80's (the first USCF Expert
> machine). Odds games in chess do not have the same probabilistic qualities
> as in Go; we almost never play odds games in chess anymore (it was popular
> for money in the 19th century) but can't get along without handicapping in
> Go; games between quite disparate players can be made interesting.
>
> I haven't found specifics for the machine or the team yet, but to quote the
> article:
>
> 800 processors, at 4.7 GHz, 15 Teraflops on borrowed supercomputers
>
> A related article said the machine(s) was sited in Europe.
>
> Peter
>
> --~--~---------~--~----~------------~-------~--~----~
> You received this message because you are subscribed to the Google Groups
> "Columbus.Go.Club" group.
> To post to this group, send email to ColumbusGoClub at googlegroups.com
> To unsubscribe from this group, send email to
> ColumbusGoClub+unsubscribe at googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/ColumbusGoClub?hl=en
> -~----------~----~----~----~------~----~------~--~---
>
>

From HuntressGB at Npt.NUWC.Navy.Mil Fri Aug 8 08:51:10 2008
From: HuntressGB at Npt.NUWC.Navy.Mil (Huntress Gary B NPRI)
Date: Fri, 08 Aug 2008 11:51:10 -0400
Subject: [Beowulf] copying big files
Message-ID: <7F93C0D0C6D8454B9B05720F713A09F1168A650B@npri54exc14.npt.nuwc.navy.mil>

UDPCast http://udpcast.linux.lu/ might be useful for this purpose.

Regards,

Gary Huntress
Code 4113
Naval Undersea Warfare Center
Newport, RI 02841
1-800-669-6892 x28990
Blackberry: 401 256-1916

-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Henning Fehrmann
Sent: Friday, August 08, 2008 11:37 AM
To: Beowulf
Subject: [Beowulf] copying big files

Hi everybody,

Copying a big file onto all nodes in a cluster is a rather common problem.
I would have thought that there might be a standard tool for
distributing the files in an efficient way. So far, I haven't found one.

Assume one has a network design which allows non-blocking, full-duplex,
wire-speed connections between N/2 pairs of nodes, where N is the number
of nodes in the cluster. It is basically a non-blocking core switch.

In this case the following scheme would be convenient and rather simple:

The file is placed on node n1 and one builds a chain of nodes n1, n2 ....
nN.

One splits the file into many packages (p1..pM), let's say a fragment fits
into one TCP package. In the first step n1 transmits the package p1 to node
n2.
In the second step n1 transmits the package p2 to n2 and n2 transmits p1 to
node n3.

The transmission of a single package is fast.
The time of passing a particular package through the whole chain of nodes is short compared with time of the entire copying process. E.g., using jumbo frames a package can have the size of ca 10kB. In Gb network the transmission time of a single package between nodes is of the order of 0.1 ms. Even in a cluster with 1024 nodes it takes in an ideal case just 0.1s to pass a package from node n1 through all nodes to n1024. On each node the package is stored and, in the end, one reassembles the file. For big files (size >> 10Mb) the required time is approximately the same as one needs for copying the file between two nodes plus 0.1s. One needs basically a daemon which handles copying requests and establishes the connection to next node in the chain. Has somebody written such a tool? Cheers, Henning Fehrmann _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From abbyzcool at gmail.com Fri Aug 8 12:40:30 2008 From: abbyzcool at gmail.com (Abhishek Kulkarni) Date: Fri, 8 Aug 2008 13:40:30 -0600 Subject: [Beowulf] copying big files In-Reply-To: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> Message-ID: <223eadbc0808081240p633cc464t8027f6aad4687106@mail.gmail.com> Yes, You might want to take a look at XGET, which is a part of the XCPU clustering framework (http://www.xcpu.org). It was primarily designed to transfer boot images (kernel and initrd) across the nodes in a cluster in a very scalable manner, but it can be used to transfer any big files/directories across the network. It creates an ad-hoc tree at runtime wherein a client can also "act as a server" for the other clients. Boot image distribution for over 1024 nodes has been done in less than 10 seconds. 
More recently, Perceus (http://www.perceus.org) has been using XGET as the default mechanism for scalable VNFS transfer across nodes and comes bundled with XCPU modules that makes configuring it a lot easier. -- Abhishek On Fri, Aug 8, 2008 at 9:37 AM, Henning Fehrmann < henning.fehrmann at aei.mpg.de> wrote: > Hi everybody, > > Coping a big file onto all nodes in a cluster is a rather common problem. > I would have thought that there might be a standard tool for > distributing the files in an efficient way. So far, I haven't found one. > > Assuming one has a network design which allows non blocking full duplex > wire-speed connections between N/2 pairs of nodes where N is the number > of nodes in the cluster. It is basically a non blocking coreswitch. > > In this case the following scheme would be convenient and rather simple: > > The file is placed on node n1 and one builds a chain of nodes n1 , n2 .... > nN. > > One splits the file into many packages (p1..pM), lets say a fragment fits > into one TCP package. In the first step n1 transmits the package p1 to node > n2. > In the second step n1 transmits the package p2 to n2 and n2 transmits p1 to > node n3. > > The transmission of a single package is fast. The time of passing a > particular > package through the whole chain of nodes is short compared with time of the > entire copying process. E.g., using jumbo frames a package can have the > size of ca 10kB. > In Gb network the transmission time of a single package between nodes is > of the order of 0.1 ms. Even in a cluster with 1024 nodes it takes > in an ideal case just 0.1s to pass a package from node n1 through all nodes > to n1024. > > On each node the package is stored and, in the end, one reassembles the > file. > For big files (size >> 10Mb) the required time is approximately > the same as one needs for copying the file between two nodes plus 0.1s. 
> > One needs basically a daemon which handles copying requests and establishes > the connection to next node in the chain. > > Has somebody written such a tool? > > Cheers, > Henning Fehrmann > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mm at yuhu.biz Sun Aug 10 06:56:50 2008 From: mm at yuhu.biz (Marian Marinov) Date: Sun, 10 Aug 2008 16:56:50 +0300 Subject: [Beowulf] copying big files (Henning Fehrmann) In-Reply-To: References: <34AC7180-5373-4C35-9972-9CA4B0B94322@myri.com> Message-ID: <200808101656.50732.mm@yuhu.biz> On Sunday 10 August 2008 15:02:52 Scott Atchley wrote: > On Aug 10, 2008, at 7:57 AM, Scott Atchley wrote: > > You may want to look at http://loci.cs.utk.edu. If you need to > > distribute large files within a cluster or across the WAN, you can > > use the LoRS tools to stripe the file over multiple servers and the > > clients then try pulling blocks off of each server in parallel. > > Using Internet2 and one client at Vanderbilt and a couple servers at > > Univ of Tennessee, they were able to saturate UT's ~400 Mb/s I2 link > > (much to the disbelief of the Vandy IT staff). I have seen ~5 Gb/s > > within a cluster using good 10G NICs. :-) > > > > Scott > > I forgot to mention LoRS optionally uses MD5 for checksums and AES-128 > for encryption (you can use either, both or neither). > > The stored file is represented by a XML file called an exNode. If you > want to share the data, you can email the exNode to someone and they > can then download the data. You control the download offset and length > so that you can extract just the parts of the file that you want. I > believe there is a NetCDF version that can use exNodes and there may > be a HDF5 version as well. 
>
> Scott

Hello,
I'm new to the list and I don't know if this was previously discussed, but
when I need to provision a file to all machines within my cluster I use a
cluster file system like
GlusterFS (http://www.gluster.org/docs/index.php/GlusterFS) or
GFarm (http://datafarm.apgrid.org/).

I started with NFS, but when you have more than 50-60 machines your NFS
becomes the problem that all machines see. And the cure for that usually is
an expensive hardware purchase.

Regards
Marian Marinov

From per at computer.org Sun Aug 10 07:20:40 2008
From: per at computer.org (Per Jessen)
Date: Sun, 10 Aug 2008 16:20:40 +0200
Subject: [Beowulf] copying big files
References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de>
Message-ID:

Henning Fehrmann wrote:

> Hi everybody,
>
> Copying a big file onto all nodes in a cluster is a rather common
> problem. I would have thought that there might be a standard tool for
> distributing the files in an efficient way. So far, I haven't found
> one.
>

I haven't seen any mention of csync2 so far?

http://oss.linbit.com/csync2/

/Per Jessen, Zürich

From abbyzcool at gmail.com Sun Aug 10 12:03:18 2008
From: abbyzcool at gmail.com (Abhishek Kulkarni)
Date: Sun, 10 Aug 2008 13:03:18 -0600
Subject: [Beowulf] copying big files
In-Reply-To: <20080808153713.GA15753@gretchen.aei.uni-hannover.de>
References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de>
Message-ID: <223eadbc0808101203y298b2441gcee0914e6b26854f@mail.gmail.com>

Yes,

You might want to take a look at XGET, which is a part of the XCPU
clustering framework (http://www.xcpu.org). It was primarily designed to
transfer boot images (kernel and initrd) across the nodes in a cluster in a
very scalable manner, but it can be used to transfer any big
files/directories across the network. It creates an ad-hoc tree at runtime
wherein a client can also "act as a server" for the other clients. Boot
image distribution for over 1024 nodes has been done in less than 10
seconds.
More recently, Perceus (http://www.perceus.org) has been using XGET as the default mechanism for scalable VNFS transfer across nodes and comes bundled with XCPU modules that makes configuring it a lot easier. -- Abhishek (who wonders how can it take more than 3 days for such posts to get through on the Beowulf ML) On Fri, Aug 8, 2008 at 9:37 AM, Henning Fehrmann < henning.fehrmann at aei.mpg.de> wrote: > Hi everybody, > > Coping a big file onto all nodes in a cluster is a rather common problem. > I would have thought that there might be a standard tool for > distributing the files in an efficient way. So far, I haven't found one. > > Assuming one has a network design which allows non blocking full duplex > wire-speed connections between N/2 pairs of nodes where N is the number > of nodes in the cluster. It is basically a non blocking coreswitch. > > In this case the following scheme would be convenient and rather simple: > > The file is placed on node n1 and one builds a chain of nodes n1 , n2 .... > nN. > > One splits the file into many packages (p1..pM), lets say a fragment fits > into one TCP package. In the first step n1 transmits the package p1 to node > n2. > In the second step n1 transmits the package p2 to n2 and n2 transmits p1 to > node n3. > > The transmission of a single package is fast. The time of passing a > particular > package through the whole chain of nodes is short compared with time of the > entire copying process. E.g., using jumbo frames a package can have the > size of ca 10kB. > In Gb network the transmission time of a single package between nodes is > of the order of 0.1 ms. Even in a cluster with 1024 nodes it takes > in an ideal case just 0.1s to pass a package from node n1 through all nodes > to n1024. > > On each node the package is stored and, in the end, one reassembles the > file. > For big files (size >> 10Mb) the required time is approximately > the same as one needs for copying the file between two nodes plus 0.1s. 
> > One needs basically a daemon which handles copying requests and establishes > the connection to the next node in the chain. > > Has somebody written such a tool? > > Cheers, > Henning Fehrmann > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From jclinton at advancedclustering.com Sun Aug 10 12:13:36 2008 From: jclinton at advancedclustering.com (Jason Clinton) Date: Sun, 10 Aug 2008 14:13:36 -0500 Subject: [Beowulf] copying big files (Henning Fehrmann) In-Reply-To: References: Message-ID: <588c11220808101213n4fa25b1dse86d64cda3857fa7@mail.gmail.com> On Fri, Aug 8, 2008 at 11:11 AM, David Mathog wrote: > Henning Fehrmann wrote: > >> Copying a big file onto all nodes in a cluster is a rather >> common problem. I would have thought that there might be a >> standard tool for distributing the files in an efficient way. >> So far, I haven't found one. We have used udpcast successfully hundreds of times: http://udpcast.linux.lu/ My only complaint would be that it doesn't handle a machine dropping out on the receive side very well. With a little patch, it could be made to be forgiving in that regard. -- Jason D. Clinton Advanced Clustering Technologies, Inc.
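As a rough check on the chained-copy estimate in Henning's quoted message, the following Python sketch works through the arithmetic. It assumes an idealized 1 Gb/s link with no protocol overhead, a 10 kB packet ("package" in Henning's wording) and a 1 GB file; the function name and its parameters are illustrative only and do not describe any of the tools mentioned in this thread.

```python
def chain_copy_time(file_bytes, packet_bytes, nodes, link_bits_per_s):
    """Idealized time to push a file through a chain of nodes, packet by packet.

    Assumes every hop forwards at full wire speed with no overhead
    (an assumption for illustration, not a measurement).
    """
    packet_time = packet_bytes * 8 / link_bits_per_s   # one packet over one hop
    packets = -(-file_bytes // packet_bytes)           # ceiling division
    hops = nodes - 1                                   # n1 -> n2 -> ... -> nN
    # The last packet leaves n1 after `packets` steps and then needs
    # hops - 1 further steps to reach the final node.
    return (packets + hops - 1) * packet_time


packet_time = 10_000 * 8 / 1e9
two_nodes = chain_copy_time(1_000_000_000, 10_000, 2, 1e9)
full_chain = chain_copy_time(1_000_000_000, 10_000, 1024, 1e9)
print(f"per-packet hop time: {packet_time * 1e3:.2f} ms")
print(f"2 nodes: {two_nodes:.2f} s, 1024 nodes: {full_chain:.2f} s")
```

Under these assumptions the 1024-node chain costs less than 0.1 s more than a plain two-node copy, which is in line with the estimate in the quoted message. Real networks will add TCP overhead, disk write time and stragglers on top of this.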
From jclinton at advancedclustering.com Sun Aug 10 12:23:59 2008 From: jclinton at advancedclustering.com (Jason Clinton) Date: Sun, 10 Aug 2008 14:23:59 -0500 Subject: [Beowulf] Weird CentOS Install Problem In-Reply-To: <489D0783.1060303@berkeley.edu> References: <489CE052.4020908@berkeley.edu> <489D0783.1060303@berkeley.edu> Message-ID: <588c11220808101223k3b1b60a3oc588e2c5d946c2d0@mail.gmail.com> On Fri, Aug 8, 2008 at 9:57 PM, Jon Forrest wrote: > Matt Lawrence wrote: >> >> On Fri, 8 Aug 2008, Jon Forrest wrote: >> >>> What's weird about this is that the root file system starts >>> on cylinder 1, as confirmed by the fdisk command. This is >>> using a brand new SuperMicro X7SBE motherboard with the >>> newest BIOS. >> >> I suggest you create a /boot partition of about 200MB. > > On Monday I'm going to try various combinations of file systems > and partition sizes. The first one I'm going to try will > be the default CentOS arrangement. > > This is a really strange problem to me. I'm guessing > the problem is in the 3ware BIOS. After all, tons > of people use 80GB drives (and larger) with no problems. > The fact that an 80GB device carved by the 3ware BIOS/firmware > doesn't work makes me suspect the 3ware device. > > I'm also going to call 3ware. Think of the 80GB slice as the "OS" volume rather than the "boot" volume. You still need a boot partition within this "OS" volume situated at the beginning of the disk. The boot partition doesn't need to be larger than 100-200MB. A good reason to use a "OS" slice is to use MS-DOS style partition tables for that slice and leave the remaining space on your array to be controlled by the GPT format partition table. And this is one of the many reasons that we'll all be happier when EFI replaces BIOS. 
From kilian.cavalotti.work at gmail.com Sun Aug 10 15:44:34 2008 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Sun, 10 Aug 2008 15:44:34 -0700 Subject: [Beowulf] Re: computer Go In-Reply-To: References: Message-ID: On Fri, Aug 8, 2008 at 4:07 PM, Bruno Coutinho wrote: >> The program was MoGo, http://www.lri.fr/~gelly/MoGo.htm, but I don't know >> anything about the "borrowed" hardware. >>> 800 processors, at 4.7 Ghz, 15 Teraflops on borrowed supercomputers >>> A related article said the machine(s) was sited in Europe. > > If the processor clock is 4.7 Ghz, this cluster should use IBM Power 6 > processors (http://en.wikipedia.org/wiki/POWER6). Spot on: http://computer-go.org/pipermail/computer-go/2008-August/015641.html It's "Huyghens", a supercomputer at Sara, Amsterdam, Netherlands. http://www.sara.nl/userinfo/huygens/description/index.html Cheers, -- Kilian From gus at ldeo.columbia.edu Mon Aug 11 13:28:10 2008 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 11 Aug 2008 16:28:10 -0400 Subject: [Beowulf] Gigabit Ethernet and RDMA Message-ID: <48A0A0DA.6010001@ldeo.columbia.edu> Hello Beowulf fans Does anyone know the status of RDMA on Gigabit Ethernet? Is it a stable solution for a cluster interconnect, or still an experimental thing? Is it effective in offloading network tasks from the CPU? (Myrinet and Infiniband seem to use RDMA effectively, right?) What does it take for it to work under typical Linux distributions? A driver? A special kernel? Something else? Just plug the NIC in and play? Does it support standard MPICH2 and/or OpenMPI compiled out of the box, or does it require linking to some type of special low level communication library, or perhaps requires the use of a special flavor or MPI (say, from the NIC vendor)? I poked around on the web, and learned that Ammasso seems to have pioneered RDMA-enabled GigE NICs (Ammasso 1100). Broadcom advertises a NIC with similar characteristics (BCM5706). 
However, it is unclear if RDMA GigE NICs would work with standard Linux distros, if it is effective, how much it costs, and how much hassle is required to make it work. Thank you, Gus Correa -- --------------------------------------------------------------------- Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu Lamont-Doherty Earth Observatory - Columbia University P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- From jlforrest at berkeley.edu Mon Aug 11 14:18:23 2008 From: jlforrest at berkeley.edu (Jon Forrest) Date: Mon, 11 Aug 2008 14:18:23 -0700 Subject: Resolved (Re: [Beowulf] Weird CentOS Install Problem) In-Reply-To: <48A096B4.4000701@georgetown.edu> References: <489CE052.4020908@berkeley.edu> <48A095BD.6080904@georgetown.edu> <48A096B4.4000701@georgetown.edu> Message-ID: <48A0AC9F.1020401@berkeley.edu> As several people suggested, the weird CentOS install problem I described last week was solved by using a separate small /boot partition. I used a 100MB partition. On the other hand, I do not understand why this made a difference. I originally had a root partition that started on cylinder 1, as shown by the fdisk program. I now have a /boot partition that starts on cylinder 1. The only difference is the ending cylinder number. I don't recall what it was originally but it was clearly greater than 1024. Now it's 13. I had no idea that the ending cylinder number made any difference as long as the files in /boot were all located in cylinder numbers <= 1024. Of course, cylinder numbers and other disk geometry measurements are completely virtual in a hardware RAID since the controller translates from the ideal disk the OS sees to the physical locations on the real disks. The thought crossed my mind that maybe the files in /boot, particularly vmlinuz, just happened to end up on cylinders > 1024. If so, that might be an explanation for what happened. 
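To put the 1024-cylinder boundary in numbers, here is a small Python sketch. It assumes the translated geometry that fdisk commonly reports (255 heads, 63 sectors per track, 512-byte sectors); the geometry actually presented by the 3ware controller is not stated in this thread, so treat the figures as illustrative.

```python
SECTOR_BYTES = 512
HEADS = 255              # assumed geometry, not taken from the thread
SECTORS_PER_TRACK = 63   # assumed geometry, not taken from the thread


def cylinders_to_bytes(cylinders):
    """Size in bytes of a span covering the given number of cylinders."""
    return cylinders * HEADS * SECTORS_PER_TRACK * SECTOR_BYTES


boot_bytes = cylinders_to_bytes(13)      # /boot ending at cylinder 13
limit_bytes = cylinders_to_bytes(1024)   # everything below the 1024-cylinder mark
print(f"13 cylinders   = {boot_bytes / 1e6:.0f} MB")
print(f"1024 cylinders = {limit_bytes / 1e9:.2f} GB")
```

With that assumed geometry, a partition ending at cylinder 13 is roughly 107 MB, which fits the 100MB /boot partition described above, while 1024 cylinders cover only about 8.4 GB, so an 80GB root partition extends far past that mark.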
Be that as it may, I consider this a bug in the 3ware BIOS. Every modern motherboard BIOS I've seen in the last 5 years, at least, has no problem booting from gigantic root partitions. Why should the 3ware BIOS be any different? I've opened a case with them to try to get to the bottom of this. I appreciate all the comments I received about this problem. I hope that other people who have this problem find this discussion via Google so that they can save themselves some time. Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From gus at ldeo.columbia.edu Mon Aug 11 14:26:03 2008 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 11 Aug 2008 17:26:03 -0400 Subject: [Beowulf] OTS "computation" stories In-Reply-To: <449784.26677.qm@web56407.mail.re3.yahoo.com> References: <449784.26677.qm@web56407.mail.re3.yahoo.com> Message-ID: <48A0AE6B.70604@ldeo.columbia.edu> Hello Ken and list The origin of the term "bug" seems to have preceded Admiral Grace Hopper and the moth found on the Mark II relay: http://en.wikipedia.org/wiki/Software_bug#Etymology http://en.wikipedia.org/wiki/Grace_Hopper#Anecdotes http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?tp=&arnumber=728224&isnumber=15706 Bugs pester my programs so much that I had to find where they came from. :) Gus Correa -- --------------------------------------------------------------------- Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu Lamont-Doherty Earth Observatory - Columbia University P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Ken Schuster wrote: > Aug. 7, 1991: Ladies and Gentlemen, the World Wide Web > > http://www.wired.com/science/discoveries/news/2007/08/dayintech_0807 > > > > Where did the term "computer bug" originate? Now you have the answer (:>) > > Aug.
7, 1944: Still a Few Bugs in the System > > http://www.wired.com/science/discoveries/news/2008/08/dayintech_0807 > >------------------------------------------------------------------------ > >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From atchley at myri.com Mon Aug 11 15:26:15 2008 From: atchley at myri.com (Scott Atchley) Date: Mon, 11 Aug 2008 18:26:15 -0400 Subject: [Beowulf] Gigabit Ethernet and RDMA In-Reply-To: <48A0A0DA.6010001@ldeo.columbia.edu> References: <48A0A0DA.6010001@ldeo.columbia.edu> Message-ID: <0A784298-63F8-4DAB-BAEF-60F1C4657F76@myri.com> Hi Gus, Are you trying to find software for NICs you currently have? Or are you looking for gigabit Ethernet NICs that natively support some form of kernel-bypass/zero-copy? I do not know of any of the latter (do Chelsio or others offer 1G NICs with iWarp?). As for the former, there are several options for cluster use: I believe Scyld has an optimized Ethernet stack. GAMMA has special drivers for certain Intel NICs. PM/Ethernet-HXB is under active development. Open-MX works over any Ethernet driver and works with any MPI that works with native MX. If you are interested in 10G Ethernet, MX on Myricom 10G NICs can work with our Myrinet switches or any brand of Ethernet switches (some Ethernet switches provide lower latency than others). Scott On Aug 11, 2008, at 4:28 PM, Gus Correa wrote: > Hello Beowulf fans > > Does anyone know the status of RDMA on Gigabit Ethernet? > > Is it a stable solution for a cluster interconnect, or still an > experimental thing? > > Is it effective in offloading network tasks from the CPU? > (Myrinet and Infiniband seem to use RDMA effectively, right?) > > What does it take for it to work under typical Linux distributions? > A driver? > A special kernel? > Something else? Just plug the NIC in and play? 
> > Does it support standard MPICH2 and/or OpenMPI compiled out of the > box, > or does it require linking to some type of special low level > communication library, > or perhaps requires the use of a special flavor or MPI (say, from > the NIC vendor)? > > I poked around on the web, > and learned that Ammasso seems to have pioneered RDMA-enabled GigE > NICs (Ammasso 1100). > Broadcom advertises a NIC with similar characteristics (BCM5706). > However, it is unclear if RDMA GigE NICs would work with standard > Linux distros, > if it is effective, how much it costs, and how much hassle is > required to make it work. > > Thank you, > Gus Correa > > -- > --------------------------------------------------------------------- > Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu > Lamont-Doherty Earth Observatory - Columbia University > P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA > --------------------------------------------------------------------- > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jlb17 at duke.edu Mon Aug 11 15:57:51 2008 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Mon, 11 Aug 2008 18:57:51 -0400 (EDT) Subject: Resolved (Re: [Beowulf] Weird CentOS Install Problem) In-Reply-To: <48A0AC9F.1020401@berkeley.edu> References: <489CE052.4020908@berkeley.edu> <48A095BD.6080904@georgetown.edu> <48A096B4.4000701@georgetown.edu> <48A0AC9F.1020401@berkeley.edu> Message-ID: On Mon, 11 Aug 2008 at 2:18pm, Jon Forrest wrote > Be that as it may, I consider this a bug in the 3ware BIOS. > Every modern motherboard BIOS I've seen in the last 5 years, > at least, has no problem booting from gigantic root partitions. > Why should the 3ware BIOS be any different? I've opened > a case with them to try to get to the bottom of this. 
Thanks very much for posting the conclusion of this tale here. I definitely would consider that a bug in 3ware's BIOS, and I hope it's something they'll fix. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From gerry.creager at tamu.edu Mon Aug 11 16:57:25 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Mon, 11 Aug 2008 18:57:25 -0500 Subject: Resolved (Re: [Beowulf] Weird CentOS Install Problem) In-Reply-To: References: <489CE052.4020908@berkeley.edu> <48A095BD.6080904@georgetown.edu> <48A096B4.4000701@georgetown.edu> <48A0AC9F.1020401@berkeley.edu> Message-ID: <48A0D1E5.5070009@tamu.edu> Joshua Baker-LePain wrote: > On Mon, 11 Aug 2008 at 2:18pm, Jon Forrest wrote > >> Be that as it may, I consider this a bug in the 3ware BIOS. >> Every modern motherboard BIOS I've seen in the last 5 years, >> at least, has no problem booting from gigantic root partitions. >> Why should the 3ware BIOS be any different? I've opened >> a case with them to try to get to the bottom of this. > > Thanks very much for posting the conclusion of this tale here. I > definitely would consider that a bug in 3ware's BIOS, and I hope it's > something they'll fix. At least my experience with them has been good, and responsive. I suspect they'll get it right. gerry From larry.stewart at sicortex.com Mon Aug 11 17:21:25 2008 From: larry.stewart at sicortex.com (Lawrence Stewart) Date: Mon, 11 Aug 2008 20:21:25 -0400 Subject: [Beowulf] copying big files In-Reply-To: <223eadbc0808101203y298b2441gcee0914e6b26854f@mail.gmail.com> References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> <223eadbc0808101203y298b2441gcee0914e6b26854f@mail.gmail.com> Message-ID: <0370197F-9103-4C63-B9B3-06E7097EA305@sicortex.com> We use sbcast, which is part of the slurm resource manager. Wicked fast, it uses a fanout tree. At least this works if your cluster is up far enough to run slurm. 
-L From csamuel at vpac.org Tue Aug 12 00:01:30 2008 From: csamuel at vpac.org (Chris Samuel) Date: Tue, 12 Aug 2008 17:01:30 +1000 (EST) Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: Message-ID: <1278943052.86771218524490230.JavaMail.root@mail.vpac.org> ----- "I Kozin (Igor)" wrote: > > Generally speaking, MPI programs will not be fetching/writing data > > from/to storage at the same time they are doing MPI calls so there > > tends to not be very much contention to worry about at the node > > level. > > I tend to agree with this. But that assumes you're not sharing a node with other jobs that may well be doing I/O. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From perry at piermont.com Tue Aug 12 09:37:21 2008 From: perry at piermont.com (Perry E. Metzger) Date: Tue, 12 Aug 2008 12:37:21 -0400 Subject: [Beowulf] Re: Kerberos + HPC In-Reply-To: <878wv2z06c.fsf@liv.ac.uk> (Dave Love's message of "Tue\, 12 Aug 2008 17\:07\:55 +0100") References: <371991977.6661217569032542.JavaMail.root@mail.vpac.org> <1217575735.4977.1.camel@Vigor13> <4898609D.7060405@ias.edu> <87hc9zjt3h.fsf@snark.cb.piermont.com> <878wv2z06c.fsf@liv.ac.uk> Message-ID: <877iamqjem.fsf@snark.cb.piermont.com> Dave Love writes: > "Perry E. Metzger" writes: > >> Standard documentation can tell you how to do >> it -- just read the manuals. > > I don't know what the `standard documentation' means, but you won't find > a recipe in the MIT or Heimdal manuals. I'm not sure about a recipe, but the "kinit" man page trivially explains how to get a password from a stashed location, and also explains how to set the lifetime of the requested ticket.
So, you just run kinit in cron as the specified daemon user with the appropriate flags and it will renew its own tickets and all is well. I'm not sure why people think this is all so mysterious. Can you explain what is hard about this? > The canonical tool for daemonic use is > , but it's probably > not so useful for jobs in a batch system. Why bother when kinit will do the job? That's what it is for. Perry From dnlombar at ichips.intel.com Tue Aug 12 10:10:55 2008 From: dnlombar at ichips.intel.com (Lombard, David N) Date: Tue, 12 Aug 2008 10:10:55 -0700 Subject: [Beowulf] copying big files In-Reply-To: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> Message-ID: <20080812171055.GA3846@nlxdcldnl2.cl.intel.com> On Fri, Aug 08, 2008 at 08:37:13AM -0700, Henning Fehrmann wrote: > Hi everybody, > > Coping a big file onto all nodes in a cluster is a rather common problem. > I would have thought that there might be a standard tool for > distributing the files in an efficient way. So far, I haven't found one. > > Assuming one has a network design which allows non blocking full duplex > wire-speed connections between N/2 pairs of nodes where N is the number > of nodes in the cluster. It is basically a non blocking coreswitch. > > In this case the following scheme would be convenient and rather simple: > > The file is placed on node n1 and one builds a chain of nodes n1 , n2 .... nN. See Brent Chen's pcp at You'll want pcp, authd, and libe. Get gexec while you're at it... -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From Craig.Tierney at noaa.gov Tue Aug 12 11:09:28 2008 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Tue, 12 Aug 2008 12:09:28 -0600 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? 
In-Reply-To: <1278943052.86771218524490230.JavaMail.root@mail.vpac.org> References: <1278943052.86771218524490230.JavaMail.root@mail.vpac.org> Message-ID: <48A1D1D8.6040405@noaa.gov> Chris Samuel wrote: > ----- "I Kozin (Igor)" wrote: > >>> Generally speaking, MPI programs will not be fetching/writing data >>> from/to storage at the same time they are doing MPI calls so there >>> tends to not be very much contention to worry about at the node >>> level. >> I tend to agree with this. > > But that assumes you're not sharing a node with other > jobs that may well be doing I/O. > > cheers, > Chris I am wondering, who shares nodes in cluster systems with MPI codes? We have never shared nodes for codes that need multiple cores since we built our first SMP cluster in 2001. The contention for shared resources (like memory bandwidth and disk IO) would lead to unpredictable code performance. Also, a poorly behaved program can cause the other codes on that node to crash (which we don't want). Even at TACC (62000+ cores) with 16 cores per node, nodes are dedicated to jobs. Craig -- Craig Tierney (craig.tierney at noaa.gov) From d.love at liverpool.ac.uk Tue Aug 12 09:03:45 2008 From: d.love at liverpool.ac.uk (Dave Love) Date: Tue, 12 Aug 2008 17:03:45 +0100 Subject: [Beowulf] Re: Linux cluster authenticating against multiple Active Directory domains In-Reply-To: <1050108012.6491217568524892.JavaMail.root@mail.vpac.org> (Chris Samuel's message of "Fri, 1 Aug 2008 15:28:44 +1000 (EST)") References: <1217492171.3072.4.camel@w1199.insrv.cf.ac.uk> <1050108012.6491217568524892.JavaMail.root@mail.vpac.org> Message-ID: <87d4kez0da.fsf@liv.ac.uk> Chris Samuel writes: >> It required some patches to nss_ldap to make it work properly and the >> pam config was a little bit tricky, but it did work. > > Yeah, we'd looked at some of the NSS stuff and realised it > would need patching..
:-( It's been a while since I looked at NSS's guts, but I'd guess it just needs another instance of nss_ldap with a different name (service) built with a different config file name wired in; that file points at the alternate server. Then decide on the priority between them in lookups. You could also script building db databases from one server, updated via cron, similarly to nss_updatedb. From robertkubrick at gmail.com Mon Aug 11 16:02:04 2008 From: robertkubrick at gmail.com (Robert Kubrick) Date: Mon, 11 Aug 2008 19:02:04 -0400 Subject: [Beowulf] Gigabit Ethernet and RDMA In-Reply-To: <0A784298-63F8-4DAB-BAEF-60F1C4657F76@myri.com> References: <48A0A0DA.6010001@ldeo.columbia.edu> <0A784298-63F8-4DAB-BAEF-60F1C4657F76@myri.com> Message-ID: CriticalIO has a silicon Ethernet NIC for GigE: http://www.criticalio.com/XGE-Silicon-Stack-Ethernet.asp?gclid=CKvllaDwhpUCFQkcHgodOC1irA An interesting recent article on GigE TOE: TOE cards have fallen out of favor in recent years as server processors caught up to the task of processing Gigabit Ethernet. However, some observers predict TOE cards will be back as 10 Gigabit Ethernet network capacity leapfrogs processing power again. http://searchstorage.techtarget.com/news/article/0,289142,sid5_gci1322922,00.html I think what Gus is asking is whether there exists some native *RDMA* support on GigE NICs. I suspect most efforts in this area are concentrated on 10 Gig Ethernet. However, even if there were some GigE RDMA available, would you use the vendor API directly or need to fetch it through MPI? On Aug 11, 2008, at 6:26 PM, Scott Atchley wrote: > Hi Gus, > > Are you trying to find software for NICs you currently have? Or are > you looking for gigabit Ethernet NICs that natively support some > form of kernel-bypass/zero-copy? > > I do not know of any of the latter (do Chelsio or others offer 1G > NICs with iWarp?). > > As for the former, there are several options for cluster use: > > I believe Scyld has an optimized Ethernet stack.
GAMMA has special > drivers for certain Intel NICs. PM/Ethernet-HXB is under active > development. Open-MX works over any Ethernet driver and works with > any MPI that works with native MX. > > If you are interested in 10G Ethernet, MX on Myricom 10G NICs can > work with our Myrinet switches or any brand of Ethernet switches > (some Ethernet switches provide lower latency than others). > > Scott > > On Aug 11, 2008, at 4:28 PM, Gus Correa wrote: > >> Hello Beowulf fans >> >> Does anyone know the status of RDMA on Gigabit Ethernet? >> >> Is it a stable solution for a cluster interconnect, or still an >> experimental thing? >> >> Is it effective in offloading network tasks from the CPU? >> (Myrinet and Infiniband seem to use RDMA effectively, right?) >> >> What does it take for it to work under typical Linux >> distributions? A driver? >> A special kernel? >> Something else? Just plug the NIC in and play? >> >> Does it support standard MPICH2 and/or OpenMPI compiled out of the >> box, >> or does it require linking to some type of special low level >> communication library, >> or perhaps requires the use of a special flavor or MPI (say, from >> the NIC vendor)? >> >> I poked around on the web, >> and learned that Ammasso seems to have pioneered RDMA-enabled GigE >> NICs (Ammasso 1100). >> Broadcom advertises a NIC with similar characteristics (BCM5706). >> However, it is unclear if RDMA GigE NICs would work with standard >> Linux distros, >> if it is effective, how much it costs, and how much hassle is >> required to make it work. >> >> Thank you, >> Gus Correa >> >> -- >> --------------------------------------------------------------------- >> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu >> Lamont-Doherty Earth Observatory - Columbia University >> P.O. 
Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA >> --------------------------------------------------------------------- From d.love at liverpool.ac.uk Tue Aug 12 09:04:32 2008 From: d.love at liverpool.ac.uk (Dave Love) Date: Tue, 12 Aug 2008 17:04:32 +0100 Subject: [Beowulf] Re: Linux cluster authenticating against multiple Active Directory domains In-Reply-To: <371991977.6661217569032542.JavaMail.root@mail.vpac.org> (Chris Samuel's message of "Fri, 1 Aug 2008 15:37:12 +1000 (EST)") References: <810152863.6581217568750370.JavaMail.root@mail.vpac.org> <371991977.6661217569032542.JavaMail.root@mail.vpac.org> Message-ID: <87bpzyz0bz.fsf@liv.ac.uk> Chris Samuel writes: > My information is that it's NSS that's more the problem > here rather than PAm, because of the assumptions it makes. Well, the OP only talked about authentication. > We'd prefer to steer clear of Kerberos, it introduces > arbitrary job limitations through ticket lives that > are not tolerable for HPC work. > > Say you submit a job that is in the queue for a week > and then will run for 3 months - we don't know if the > AD admins will permit the creation of a 4 month ticket > "just in case".. Why do you need to re-authenticate, and if you do, surely you need to stash a credential somewhere however you do it?
From d.love at liverpool.ac.uk Tue Aug 12 09:06:09 2008 From: d.love at liverpool.ac.uk (Dave Love) Date: Tue, 12 Aug 2008 17:06:09 +0100 Subject: [Beowulf] Re: Kerberos + HPC In-Reply-To: <4898609D.7060405@ias.edu> (Prentice Bisbal's message of "Tue, 05 Aug 2008 10:15:57 -0400") References: <371991977.6661217569032542.JavaMail.root@mail.vpac.org> <1217575735.4977.1.camel@Vigor13> <4898609D.7060405@ias.edu> Message-ID: <87abfiz09a.fsf@liv.ac.uk> Prentice Bisbal writes: > John Hearns wrote: >> Kerberos is heavily used at CERN. [As far as I know, the main issue at HEP sites is acquiring tokens for AFS. Obviously Kerberos is used all over the place.] >> They have a solution for that issue - >> the job can ask for an extension to the tickets. >> Sorry, I don't have a reference handy but its worth documenting this for >> the list. >> > > If ANYONE has more information on how this is done at CERN, I'd be very > interested in hearing about it. I know, I know... GIYF... The DESY solution with Grid Engine is discussed at From d.love at liverpool.ac.uk Tue Aug 12 09:07:55 2008 From: d.love at liverpool.ac.uk (Dave Love) Date: Tue, 12 Aug 2008 17:07:55 +0100 Subject: [Beowulf] Re: Kerberos + HPC In-Reply-To: <87hc9zjt3h.fsf@snark.cb.piermont.com> (Perry E. Metzger's message of "Tue, 05 Aug 2008 12:59:30 -0400") References: <371991977.6661217569032542.JavaMail.root@mail.vpac.org> <1217575735.4977.1.camel@Vigor13> <4898609D.7060405@ias.edu> <87hc9zjt3h.fsf@snark.cb.piermont.com> Message-ID: <878wv2z06c.fsf@liv.ac.uk> "Perry E. Metzger" writes: > Standard documentation can tell you how to do > it -- just read the manuals. I don't know what the `standard documentation' means, but you won't find a recipe in the MIT or Heimdal manuals. The canonical tool for daemonic use is , but it's probably not so useful for jobs in a batch system. 
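For readers wondering what the "kinit from cron" approach mentioned earlier in this thread might look like, here is a minimal Python sketch. It assumes the stashed credential is a keytab file and that the local kinit accepts the -k, -t and -l options; the principal name, keytab path and function names are hypothetical, and nothing here reflects the specific setups at CERN or DESY.

```python
import subprocess


def kinit_command(principal, keytab, lifetime="10h"):
    """Build a kinit command line that obtains a ticket from a keytab.

    -k/-t select a keytab instead of a password prompt and -l requests a
    ticket lifetime; the KDC may still cap the lifetime it grants.
    """
    return ["kinit", "-k", "-t", keytab, "-l", lifetime, principal]


def refresh_ticket(principal, keytab, lifetime="10h"):
    """Run kinit; raises CalledProcessError if it exits with an error."""
    subprocess.run(kinit_command(principal, keytab, lifetime), check=True)


# Hypothetical principal and keytab path, for illustration only.
cmd = kinit_command("batchuser@EXAMPLE.ORG", "/etc/security/batchuser.keytab")
print(" ".join(cmd))
```

A cron entry or a batch prologue could call refresh_ticket periodically. Whether this is adequate for long-running jobs depends on site policy for ticket lifetimes and on where the keytab can safely be stored, which is exactly the point under debate in this thread.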
From csamuel at vpac.org Tue Aug 12 20:29:31 2008 From: csamuel at vpac.org (Chris Samuel) Date: Wed, 13 Aug 2008 13:29:31 +1000 (EST) Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: <443537045.95101218598091657.JavaMail.root@mail.vpac.org> Message-ID: <1083107021.95121218598171378.JavaMail.root@mail.vpac.org> ----- "Craig Tierney" wrote: > I am wondering, who shares nodes in cluster systems with > MPI codes? People in countries outside of the US where investment in HPC results in insufficient resources to meet the growing demand and where not even the peak *national* HPC facility makes the Top 500. For ourselves in a state based organisation our new top of the range cluster has 760 cores, which is more than all our previous clusters combined. We have over 600 registered users from 8 universities and our systems are continuously over subscribed and we have to run our systems to try and get the best throughput. We do use things like cpusets to try and limit the impact that jobs can have on other jobs on the same nodes, and users can request entire nodes for themselves should they so wish, just that their project will be tallied as having used all the cores on that node as they're not available to others. Hmm, that wasn't meant to be a whinge, just that we have to cut our cloth to fit. cheers! Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. 
Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From csamuel at vpac.org Tue Aug 12 21:21:45 2008 From: csamuel at vpac.org (Chris Samuel) Date: Wed, 13 Aug 2008 14:21:45 +1000 (EST) Subject: [Beowulf] Re: Linux cluster authenticating against multiple Active Directory domains In-Reply-To: <1109283354.96451218600521759.JavaMail.root@mail.vpac.org> Message-ID: <315345181.96691218601305738.JavaMail.root@mail.vpac.org> ----- "Dave Love" wrote: > It's been a while since I looked at NSS's guts, but > I'd guess it just needs another instance of nss_ldap > with a different name (service) built with a different > config file name wired in I think that's pretty much what we'd concluded, but I don't think that's going to be the route we're going to follow here. cheers! Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From csamuel at vpac.org Tue Aug 12 21:27:40 2008 From: csamuel at vpac.org (Chris Samuel) Date: Wed, 13 Aug 2008 14:27:40 +1000 (EST) Subject: [Beowulf] Re: Linux cluster authenticating against multiple Active Directory domains In-Reply-To: <87bpzyz0bz.fsf@liv.ac.uk> Message-ID: <193947989.96781218601660712.JavaMail.root@mail.vpac.org> ----- "Dave Love" wrote: > Chris Samuel writes: > > > My information is that it's NSS that's more the problem > > here rather than PAm, because of the assumptions it makes. > > Well, the OP only talked about authentication. I was the OP. ;-) To clarify, we'd need to both auth and do NSS lookups against the two AD systems. > > We'd prefer to steer clear of Kerberos, it introduces > > arbitrary job limitations through ticket lives that > > are not tolerable for HPC work. 
> > Why do you need to re-authenticate, If I create a 3 month long Kerberos ticket, and my PBS job will run for 3 months but ends up waiting in the queue for 2 weeks before it can start due to demand then that ticket will have expired before the job can complete. Now, if I don't do anything that requires further re-authentication then it'll probably be OK. But if I do, then it may not work.. > and if you do, surely you need to stash a credential > somewhere however you do it? The GSSAPI branch of Torque will cache the ticket for you, but (AFAIK) cannot extend the life of it. But it's academic anyway as I don't think that branch is usable in production currently. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From apittman at concurrent-thinking.com Wed Aug 13 03:29:05 2008 From: apittman at concurrent-thinking.com (Ashley Pittman) Date: Wed, 13 Aug 2008 11:29:05 +0100 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: <48A1D1D8.6040405@noaa.gov> References: <1278943052.86771218524490230.JavaMail.root@mail.vpac.org> <48A1D1D8.6040405@noaa.gov> Message-ID: <1218623345.7749.15.camel@bruce.priv.wark.uk.streamline-computing.com> On Tue, 2008-08-12 at 12:09 -0600, Craig Tierney wrote: > Chris Samuel wrote: > > ----- "I Kozin (Igor)" wrote: > > But that assumes you're not sharing a node with other > > jobs that may well be doing I/O. > > > I am wondering, who shares nodes in cluster systems with > MPI codes? In my experience, almost everyone. In practise though most jobs ask for even numbers of CPU's so larger jobs rarely get scheduled this way. > We never have shared nodes for codes that need > multiple cores since be built our first SMP cluster > in 2001. 
The contention for shared resources (like memory > bandwidth and disk IO) would lead to unpredictable code performance. Unpredictable maybe but if the alternative is to not run at all then it's still a win. What you wouldn't want is to have a small number of processes in a big job sharing a node with a resource hogging job and slow down the entire big job however I've never seen this happening in the wild. > Also, a poorly behaved program can cause the other codes on > that node to crash (which we don't want). It goes without saying that this shouldn't be able to happen. Ashley. From landman at scalableinformatics.com Wed Aug 13 04:27:11 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 13 Aug 2008 07:27:11 -0400 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: <48A1D1D8.6040405@noaa.gov> References: <1278943052.86771218524490230.JavaMail.root@mail.vpac.org> <48A1D1D8.6040405@noaa.gov> Message-ID: <48A2C50F.5010305@scalableinformatics.com> Craig Tierney wrote: > Chris Samuel wrote: >> ----- "I Kozin (Igor)" wrote: >> >>>> Generally speaking, MPI programs will not be fetching/writing data >>>> from/to storage at the same time they are doing MPI calls so there >>>> tends to not be very much contention to worry about at the node >>>> level. >>> I tend to agree with this. >> >> But that assumes you're not sharing a node with other >> jobs that may well be doing I/O. >> >> cheers, >> Chris > > I am wondering, who shares nodes in cluster systems with > MPI codes? We never have shared nodes for codes that need The vast majority of our customers/users do. Limited resources, they have to balance performance against cost and opportunity cost. Sadly not every user has an infinite budget to invest in contention free hardware (nodes, fabrics, or disks). So they have to maximize the utilization of what they have, while (hopefully) not trashing the efficiency too badly. 
> multiple cores since be built our first SMP cluster > in 2001. The contention for shared resources (like memory > bandwidth and disk IO) would lead to unpredictable code performance. Yes it does. As does OS jitter and other issues. > Also, a poorly behaved program can cause the other codes on > that node to crash (which we don't want). Yes this happens as well, but some users simply have no choice. > > Even at TACC (62000+ cores) with 16 cores per node, nodes > are dedicated to jobs. I think every user would love to run on a TACC like system. I think most users have a budget for something less than 1/100th the size. Its easy to forget how much resource (un)availability constrains actions when you have very large resources to work with. Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From perry at piermont.com Wed Aug 13 04:38:44 2008 From: perry at piermont.com (Perry E. Metzger) Date: Wed, 13 Aug 2008 07:38:44 -0400 Subject: [Beowulf] Re: Linux cluster authenticating against multiple Active Directory domains In-Reply-To: <87bpzyz0bz.fsf@liv.ac.uk> (Dave Love's message of "Tue\, 12 Aug 2008 17\:04\:32 +0100") References: <810152863.6581217568750370.JavaMail.root@mail.vpac.org> <371991977.6661217569032542.JavaMail.root@mail.vpac.org> <87bpzyz0bz.fsf@liv.ac.uk> Message-ID: <87skt9m9ff.fsf@snark.cb.piermont.com> Dave Love writes: >> We'd prefer to steer clear of Kerberos, it introduces >> arbitrary job limitations through ticket lives that >> are not tolerable for HPC work. Which of course isn't true. If Wall Street firms, which really cannot afford to have their trading systems go down even for a second, can happily use kerberos in servers, so can anyone. 
>> Say you submit a job that is in the queue for a week >> and then will run for 3 months - we don't know if the >> AD admins will permit the creation of a 4 month ticket >> "just in case".. > > Why do you need to re-authenticate, and if you do, surely you need to > stash a credential somewhere however you do it? Indeed, and if you have stashed your key appropriately you can just have a cron job kinit as often as you like. The kinit man page gives the command line flag for requesting credentials using a key taken from a file, and also lists the flag for setting your ticket expiry time. All you do is put one line in a crontab with kinit and those two options, say every 24 hours. I keep seeing these messages go by over and over making it sound like this is difficult. It is not difficult. I've seen people say "I have seen no document with a recipe for how to do it", perhaps because a single kinit command in a cron job is too simple for a HOWTO. Maybe some sort of strange myth has been going by so long on this that people refuse to believe that the ticket refresh is a single easy command? Perry -- Perry E. Metzger perry at piermont.com From prentice at ias.edu Wed Aug 13 06:10:18 2008 From: prentice at ias.edu (Prentice Bisbal) Date: Wed, 13 Aug 2008 09:10:18 -0400 Subject: [Beowulf] Re: Linux cluster authenticating against multiple Active Directory domains In-Reply-To: <87skt9m9ff.fsf@snark.cb.piermont.com> References: <810152863.6581217568750370.JavaMail.root@mail.vpac.org> <371991977.6661217569032542.JavaMail.root@mail.vpac.org> <87bpzyz0bz.fsf@liv.ac.uk> <87skt9m9ff.fsf@snark.cb.piermont.com> Message-ID: <48A2DD3A.7000906@ias.edu> Perry E. Metzger wrote: > Dave Love writes: >>> We'd prefer to steer clear of Kerberos, it introduces >>> arbitrary job limitations through ticket lives that >>> are not tolerable for HPC work. > > Which of course isn't true.
If Wall Street firms, which really cannot > afford to have their trading systems go down even for a second, can > happily use kerberos in servers, so can anyone. > >>> Say you submit a job that is in the queue for a week >>> and then will run for 3 months - we don't know if the >>> AD admins will permit the creation of a 4 month ticket >>> "just in case".. >> Why do you need to re-authenticate, and if you do, surely you need to >> stash a credential somewhere however you do it? > > Indeed, and if you have stashed your key appropriately you can just > have a cron job kinit as often as you like. The kinit man page > gives the command line flag for requesting credentials using a key > taken from a file, and also lists the flag for setting your ticket > expiry time. All you do is put one line in a crontab with kinit and > those two options, say every 24 hours. Isn't stashing a bunch of user keys on a single system a major security risk? One could argue that the cluster should be on a private, secured network so it's okay, but then I'll argue that the node storing the keys will most likely be the master node, which is often on a "public" network so users can access it to submit jobs. I can't imagine an arrangement like this passing the scrutiny of the local information security officer. And if you're going to use a key stored on a disk, why not just use SSH keys? > > I keep seeing these messages go by over and over making it sound like > this is difficult. It is not difficult. I've seen people say "I have > seen no document with a recipe for how to do it", perhaps because a > single kinit command in a cron job is too simple for a HOWTO.
> > Maybe some sort of strange myth has been going by so long on this that > people refuse to believe that the ticket refresh is a single easy > command? Maybe you're not reading the questions correctly. In my original question about how to do this, I asked how to do this using the queuing system to refresh the keys -- I was asking for an integrated solution. Above, you describe doing it with a cron job, which does not answer my question. -- Prentice Bisbal Linux Software Support Specialist/System Administrator School of Natural Sciences Institute for Advanced Study Princeton, NJ From d.love at liverpool.ac.uk Wed Aug 13 07:14:05 2008 From: d.love at liverpool.ac.uk (Dave Love) Date: Wed, 13 Aug 2008 15:14:05 +0100 Subject: [Beowulf] Re: Linux cluster authenticating against multiple Active Directory domains In-Reply-To: <193947989.96781218601660712.JavaMail.root@mail.vpac.org> (Chris Samuel's message of "Wed, 13 Aug 2008 14:27:40 +1000 (EST)") References: <87bpzyz0bz.fsf@liv.ac.uk> <193947989.96781218601660712.JavaMail.root@mail.vpac.org> Message-ID: <87r68tvw7m.fsf@liv.ac.uk> Chris Samuel writes: > I was the OP. ;-) [`Post', not `poster'!] >> Why do you need to re-authenticate, > > If I create a 3 month long Kerberos ticket, and my PBS > job will run for 3 months but ends up waiting in the > queue for 2 weeks before it can start due to demand > then that ticket will have expired before the job can > complete. Yes, I realize that, but typically that isn't an issue. I have operated a cluster with Kerberos authN to AD (spit), and am about to do it again (sigh). > Now, if I don't do anything that requires > further re-authentication then it'll probably be OK. > But if I do, then it may not work.. Yes.
That's what I meant in reply to John Hearns, which was ambiguous, according to mail he sent me. The problems arise when you need tickets to access something like AFS (which seems to be much the most common case), but I'd guess that's a non-issue for the majority of cases. >> and if you do, surely you need to stash a credential >> somewhere however you do it? > > The GSSAPI branch of Torque will cache the ticket > for you, but (AFAIK) cannot extend the life of it. I mean that I don't see the objection to Kerberos per se, because if you use any other authN mechanism, you have essentially the same problem, and sure, GSSAPI doesn't solve it. From d.love at liverpool.ac.uk Wed Aug 13 07:15:04 2008 From: d.love at liverpool.ac.uk (Dave Love) Date: Wed, 13 Aug 2008 15:15:04 +0100 Subject: [Beowulf] Re: Kerberos + HPC In-Reply-To: <877iamqjem.fsf@snark.cb.piermont.com> (Perry E. Metzger's message of "Tue, 12 Aug 2008 12:37:21 -0400") References: <371991977.6661217569032542.JavaMail.root@mail.vpac.org> <1217575735.4977.1.camel@Vigor13> <4898609D.7060405@ias.edu> <87hc9zjt3h.fsf@snark.cb.piermont.com> <878wv2z06c.fsf@liv.ac.uk> <877iamqjem.fsf@snark.cb.piermont.com> Message-ID: <87prodvw5z.fsf@liv.ac.uk> "Perry E. Metzger" writes: > So, you just run kinit in cron as the specified daemon user with the > appropriate flags and it will renew its own tickets and all is well. Who says you can even run kinit from cron if it was appropriate? > I'm not sure why people think this is all so mysterious. Can you > explain what is hard about this? That's just hand-waving. Hard things include how you integrate it with a distributed batch system, for a start. Making it tolerably secure too. I don't want all users to keep keytabs around everywhere (synchronized with password changes), even if they were practically going to solve the problem of having valid credential caches at the relevant times on the relevant nodes. 
>> The canonical tool for daemonic use is >> , but it's probably >> not so useful for jobs in a batch system. > > Why bother when kinit will do the job? That's what it is for. Russ Allbery can doubtless justify it better than I can, if the doc doesn't help. He's an MIT Kerberos maintainer(?)/contributor and runs a large Kerberos infrastructure; I don't think he was wasting his time. From tortay at cc.in2p3.fr Wed Aug 13 07:20:37 2008 From: tortay at cc.in2p3.fr (Loic Tortay) Date: Wed, 13 Aug 2008 16:20:37 +0200 Subject: [Beowulf] Re: Linux cluster authenticating against multiple Active Directory domains In-Reply-To: <87skt9m9ff.fsf@snark.cb.piermont.com> References: <810152863.6581217568750370.JavaMail.root@mail.vpac.org> <371991977.6661217569032542.JavaMail.root@mail.vpac.org> <87bpzyz0bz.fsf@liv.ac.uk> <87skt9m9ff.fsf@snark.cb.piermont.com> Message-ID: <48A2EDB5.6090407@cc.in2p3.fr> Perry E. Metzger wrote: [...] > > Maybe some sort of strange myth has been going by so long on this that > people refuse to believe that the ticket refresh is a single easy > command? > The "myth" is the ability to automatically get a Kerberos ticket on any node in a cluster *especially* for the nodes on which you can neither login nor run cron jobs to renew tickets (which is ugly and likely to be non practical and/or insecure in any but the most simple environment anyway). That's the point of "kstart" and similar tools, as well as specific modifications/extensions to batch queueing systems used where a Kerberos ticket is required for jobs (including many HEP sites): *transparently* get and renew Kerberos tickets (for the local realm) on *any* node in the cluster without the need to ever enter a password on the computing nodes. The tickets are discarded when the process/job ends (unlike the "kinit" in a cron job thingy). The version of LSF used at CERN is modified to be able to renew and transmit Kerberos tickets in CERN's realm as long as needed (queue time + execution time). 
AFAIK this is a (non free) extra feature developed by Platform Computing. If I'm not mistaken, the same (also paid for) LSF modification is used at SLAC and BNL. As someone mentioned, DESY (the German HEP organisation) has something similar for SGE, as we (the French HEP organisation) do for our own batch system and others certainly have similar things. Everyday use case example: the user job runs a program binary stored in CERN's AFS cell with input data in our AFS cell and writes its output in BNL's AFS cell (Kerberos tickets for at least two realms/cells required). This is the way things have been routinely going on in the HEP world (where people usually read manuals) during the last decade or so. Loïc. -- | Loïc Tortay - IN2P3 Computing Centre | From d.love at liverpool.ac.uk Wed Aug 13 07:29:33 2008 From: d.love at liverpool.ac.uk (Dave Love) Date: Wed, 13 Aug 2008 15:29:33 +0100 Subject: [Beowulf] Re: Linux cluster authenticating against multiple Active Directory domains In-Reply-To: <87skt9m9ff.fsf@snark.cb.piermont.com> (Perry E. Metzger's message of "Wed, 13 Aug 2008 07:38:44 -0400") References: <810152863.6581217568750370.JavaMail.root@mail.vpac.org> <371991977.6661217569032542.JavaMail.root@mail.vpac.org> <87bpzyz0bz.fsf@liv.ac.uk> <87skt9m9ff.fsf@snark.cb.piermont.com> Message-ID: <87ljz1vvhu.fsf@liv.ac.uk> "Perry E. Metzger" writes: > I keep seeing these messages go by over and over making it sound like > this is difficult. It is not difficult. I've seen people say "I have > seen no document with a recipe for how to do it", perhaps because a > single kinit command in a cron job is too simple for a HOWTO. How about commenting on the DESY paper I linked to and pointing out exactly how they were wasting their time? > Maybe some sort of strange myth has been going by so long on this that > people refuse to believe that the ticket refresh is a single easy > command?
Because it simply isn't, in the context of typical Beowulf batch systems, especially if you're not going to pretty well chuck out the Kerberos security model. (Those of us who've contributed to a Kerberos implementation -- particularly the documentation -- know all about kinit, obviously.) From Craig.Tierney at noaa.gov Wed Aug 13 07:55:19 2008 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Wed, 13 Aug 2008 08:55:19 -0600 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: <48A2C50F.5010305@scalableinformatics.com> References: <1278943052.86771218524490230.JavaMail.root@mail.vpac.org> <48A1D1D8.6040405@noaa.gov> <48A2C50F.5010305@scalableinformatics.com> Message-ID: <48A2F5D7.9080406@noaa.gov> Joe Landman wrote: > Craig Tierney wrote: >> Chris Samuel wrote: >>> ----- "I Kozin (Igor)" wrote: >>> >>>>> Generally speaking, MPI programs will not be fetching/writing data >>>>> from/to storage at the same time they are doing MPI calls so there >>>>> tends to not be very much contention to worry about at the node >>>>> level. >>>> I tend to agree with this. >>> >>> But that assumes you're not sharing a node with other >>> jobs that may well be doing I/O. >>> >>> cheers, >>> Chris >> >> I am wondering, who shares nodes in cluster systems with >> MPI codes? We never have shared nodes for codes that need > > The vast majority of our customers/users do. Limited resources, they > have to balance performance against cost and opportunity cost. > > Sadly not every user has an infinite budget to invest in contention free > hardware (nodes, fabrics, or disks). So they have to maximize the > utilization of what they have, while (hopefully) not trashing the > efficiency too badly. > >> multiple cores since be built our first SMP cluster >> in 2001. The contention for shared resources (like memory >> bandwidth and disk IO) would lead to unpredictable code performance. > > Yes it does. As does OS jitter and other issues. 
> >> Also, a poorly behaved program can cause the other codes on >> that node to crash (which we don't want). > > Yes this happens as well, but some users simply have no choice. > >> >> Even at TACC (62000+ cores) with 16 cores per node, nodes >> are dedicated to jobs. > > I think every user would love to run on a TACC like system. I think > most users have a budget for something less than 1/100th the size. Its > easy to forget how much resource (un)availability constrains actions > when you have very large resources to work with. > TACC probably wasn't a good example for the "rest of us". It hasn't been difficult to dedicate nodes to jobs when the number of cores was 2 or 4. We now have some 8 core nodes, and we are wondering if the policy of not sharing nodes is going to continue, or at least modified to minimize waste. Craig > Joe > > -- Craig Tierney (craig.tierney at noaa.gov) From d.love at liverpool.ac.uk Wed Aug 13 09:03:46 2008 From: d.love at liverpool.ac.uk (Dave Love) Date: Wed, 13 Aug 2008 17:03:46 +0100 Subject: [Beowulf] Infinipath memory parity errors Message-ID: <87d4kcx5p9.fsf@liv.ac.uk> [I know in an ideal world the vendor between us and PathScale^WQlogic would sort this out.] I'm interested in the cause (and possible cure!) of intermittent errors on various nodes in our Infinipath system which stop MPI jobs with kernel messages like this, in case anyone's familiar with them: lvinfi095:21.Hardware problem: {[RXE EAGERTID Memory Parity]} They seem to be new with an upgrade to Linux 2.6.22 from 2.6.11, but probably just manifested themselves in some other way previously. Google didn't produce any leads, and a brief look in the source suggests that tracking it down where it's generated in the ib_ipath module is non-trivial and likely won't tell me a lot. For what it's worth, the adaptors are 06:00.0 InfiniBand: PathScale, Inc InfiniPath HT-400 (rev 02) in two different sorts of Supermicro whose model numbers I don't know. Thanks for any leads. 
From gus at ldeo.columbia.edu Wed Aug 13 10:00:23 2008 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed, 13 Aug 2008 13:00:23 -0400 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: <48A2F5D7.9080406@noaa.gov> References: <1278943052.86771218524490230.JavaMail.root@mail.vpac.org> <48A1D1D8.6040405@noaa.gov> <48A2C50F.5010305@scalableinformatics.com> <48A2F5D7.9080406@noaa.gov> Message-ID: <48A31327.8060801@ldeo.columbia.edu> Hi resource management concerned experts and list I started the thread, before it gained a life of its own and its current incarnation. So, please let me add my two cents to this interesting discussion. We do share nodes on our cluster. After all, we have only 32 nodes (64 processors; dual-processor, single-core) on our 6.5+ year old cluster, and many climate and ocean modeling projects to run there. We gladly thank NOAA for helping us to get the cluster in 2002! :) It's been hard to get support for a replacement ... We share nodes for the reasons pointed out by Chris, Joe, and others. One reason not mentioned is serial programs. Well, a cluster is meant to run parallel jobs. However, we don't have the money to buy a farm of serial machines, and a few users have valuable scientific projects but don't know (or don't want to know) how to translate their serial code into a parallel algorithm. You really want multiple instances of this type of job to share nodes whenever possible. Last time I checked our cluster was used more heavily than NCAR machines, for instance. We had an average of less than 72 hours downtime per year (I stayed awake, I am the IT team, programmer, factotum), and an average of about 75% use of its maximum capacity (i.e. all nodes and processors working all the time 24 / 7 / 365 / 6.5+years). I couldn't find usage data of other public, academic, or industry machines to compare.
However, I guess there are small clusters like ours out there which are under more intensive use than ours, doing good science and useful applications, and sharing nodes. Yes, we did have cases of jobs failing on a node and breaking another job sharing the node. After I banned the use of Matlab (to the dismay and revolt of many users) things improved on this front, but such failures still happen occasionally. As Craig pointed out, the current trend of overpopulating a single node with many cores may pose further challenges, to manage things like processor and memory affinity requests, etc. Gus Correa -- --------------------------------------------------------------------- Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu Lamont-Doherty Earth Observatory - Columbia University P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Craig Tierney wrote: > Joe Landman wrote: > >> Craig Tierney wrote: >> >>> Chris Samuel wrote: >>> >>>> ----- "I Kozin (Igor)" wrote: >>>> >>>>>> Generally speaking, MPI programs will not be fetching/writing data >>>>>> from/to storage at the same time they are doing MPI calls so there >>>>>> tends to not be very much contention to worry about at the node >>>>>> level. >>>>> >>>>> I tend to agree with this. >>>> >>>> >>>> But that assumes you're not sharing a node with other >>>> jobs that may well be doing I/O. >>>> >>>> cheers, >>>> Chris >>> >>> >>> I am wondering, who shares nodes in cluster systems with >>> MPI codes? We never have shared nodes for codes that need >> >> >> The vast majority of our customers/users do. Limited resources, they >> have to balance performance against cost and opportunity cost. >> >> Sadly not every user has an infinite budget to invest in contention >> free hardware (nodes, fabrics, or disks). So they have to maximize >> the utilization of what they have, while (hopefully) not trashing the >> efficiency too badly.
>> >>> multiple cores since be built our first SMP cluster >>> in 2001. The contention for shared resources (like memory >>> bandwidth and disk IO) would lead to unpredictable code performance. >> >> >> Yes it does. As does OS jitter and other issues. >> >>> Also, a poorly behaved program can cause the other codes on >>> that node to crash (which we don't want). >> >> >> Yes this happens as well, but some users simply have no choice. >> >>> >>> Even at TACC (62000+ cores) with 16 cores per node, nodes >>> are dedicated to jobs. >> >> >> I think every user would love to run on a TACC like system. I think >> most users have a budget for something less than 1/100th the size. >> Its easy to forget how much resource (un)availability constrains >> actions when you have very large resources to work with. >> > > TACC probably wasn't a good example for the "rest of us". It hasn't been > difficult to dedicate nodes to jobs when the number of cores was 2 or 4. > We now have some 8 core nodes, and we are wondering if the policy of > not sharing nodes is going to continue, or at least modified to minimize > waste. > > Craig > > >> Joe >> >> > > From perry at piermont.com Wed Aug 13 10:09:34 2008 From: perry at piermont.com (Perry E. Metzger) Date: Wed, 13 Aug 2008 13:09:34 -0400 Subject: [Beowulf] Re: Kerberos + HPC In-Reply-To: <87prodvw5z.fsf@liv.ac.uk> (Dave Love's message of "Wed\, 13 Aug 2008 15\:15\:04 +0100") References: <371991977.6661217569032542.JavaMail.root@mail.vpac.org> <1217575735.4977.1.camel@Vigor13> <4898609D.7060405@ias.edu> <87hc9zjt3h.fsf@snark.cb.piermont.com> <878wv2z06c.fsf@liv.ac.uk> <877iamqjem.fsf@snark.cb.piermont.com> <87prodvw5z.fsf@liv.ac.uk> Message-ID: <87r68skfjl.fsf@snark.cb.piermont.com> Dave Love writes: > "Perry E. Metzger" writes: > >> So, you just run kinit in cron as the specified daemon user with the >> appropriate flags and it will renew its own tickets and all is well. 
> > Who says you can even run kinit from cron if it was appropriate? > >> I'm not sure why people think this is all so mysterious. Can you >> explain what is hard about this? > > That's just hand-waving. Hard things include how you integrate it with > a distributed batch system, for a start. Kerberos is already a distributed system. Machines at MIT have been refreshing their server tickets for what, 20 years now? This is not hard. > Making it tolerably secure too. That's why you use kerberos. > I don't want all users to keep keytabs around everywhere > (synchronized with password changes), You don't need to do that. If the issue is a user process on a remote machine that needs user rather than server credentials, you forward tickets or design things so server credentials are good enough to get the needed resources once things have started. You can re-forward tickets as often as you want. There are large firms I know that run this stuff in production and it really does work. Perry From perry at piermont.com Wed Aug 13 10:16:52 2008 From: perry at piermont.com (Perry E. Metzger) Date: Wed, 13 Aug 2008 13:16:52 -0400 Subject: [Beowulf] Re: Linux cluster authenticating against multiple Active Directory domains In-Reply-To: <87ljz1vvhu.fsf@liv.ac.uk> (Dave Love's message of "Wed\, 13 Aug 2008 15\:29\:33 +0100") References: <810152863.6581217568750370.JavaMail.root@mail.vpac.org> <371991977.6661217569032542.JavaMail.root@mail.vpac.org> <87bpzyz0bz.fsf@liv.ac.uk> <87skt9m9ff.fsf@snark.cb.piermont.com> <87ljz1vvhu.fsf@liv.ac.uk> Message-ID: <87myjgkf7f.fsf@snark.cb.piermont.com> Dave Love writes: > "Perry E. Metzger" writes: >> I keep seeing these messages go by over and over making it sound like >> this is difficult. It is not difficult. I've seen people say "I have >> seen no document with a recipe for how to do it", perhaps because a >> single kinit command in a cron job is too simple for a HOWTO. 
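For reference, the keytab-driven refresh described in the quoted text can be sketched as below. This is a minimal sketch, assuming MIT kinit (where -k -t reads the key from a keytab file and -l requests a ticket lifetime); the principal name, keytab path and daily schedule are hypothetical placeholders, and the KDC may cap whatever lifetime is requested. The code only builds the command and the crontab entry; it does not run kinit.

```python
import shlex


def kinit_argv(principal, keytab, lifetime="25h"):
    """Build a kinit command line that authenticates from a keytab.

    -k -t KEYTAB : take the key from a file instead of prompting for a password
    -l LIFETIME  : requested ticket lifetime (the KDC may grant less)
    """
    return ["kinit", "-k", "-t", keytab, "-l", lifetime, principal]


def crontab_line(principal, keytab, lifetime="25h", hour=3, minute=0):
    """Render a crontab entry that re-runs kinit once a day at hour:minute."""
    command = shlex.join(kinit_argv(principal, keytab, lifetime))
    return f"{minute} {hour} * * * {command}"


if __name__ == "__main__":
    # Hypothetical service principal and keytab location.
    print(crontab_line("batchsvc@EXAMPLE.ORG", "/etc/krb5.keytab.batchsvc"))
```

Requesting a lifetime slightly longer than the refresh interval (25h against a 24h schedule here) is one way to avoid a gap between runs. Note that this covers only the single-host case; it says nothing about getting valid credentials onto compute nodes through a batch system, which is the objection raised elsewhere in this thread.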
> > How about commenting on the DESY paper I linked to and pointing out > exactly how they were wasting their time? I didn't see that link. Please re-forward it. >> Maybe some sort of strange myth has been going by so long on this >> that people refuse to believe that the ticket refresh is a single >> easy command? > > Because it simply isn't, in the context of typical Beowulf batch > systems, especially if you're not going to pretty well chuck out the > Kerberos security model. (Those of us who've contributed to a Kerberos > implementation -- particularly the documentation -- know all about > kinit, obviously.) Maybe I'm not getting the problem domain here. There are, as I see it, two contexts in which you want kerberos tickets: you want to authenticate access to compute nodes, in which case the remote server is doing nothing that kerberized services haven't done for 20 years to get its tickets, and you may need user credentials to get resources for the user process once it is running on the cluster node. The latter isn't an issue in the average cluster which runs on a segregated network and isn't trying to mount the user's home file system or what have you. If it were a real issue, I would give the user a new instance just for remote jobs so that you could restrict the permissions for that particular instance down to what was absolutely needed, and forward the tickets at intervals from his trusted machine to the compute nodes. This is, after all, more or less what forwarding credentials were made for. -- Perry E. Metzger perry at piermont.com From jlb17 at duke.edu Wed Aug 13 10:18:50 2008 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Wed, 13 Aug 2008 13:18:50 -0400 (EDT) Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? 
In-Reply-To: <48A31327.8060801@ldeo.columbia.edu> References: <1278943052.86771218524490230.JavaMail.root@mail.vpac.org> <48A1D1D8.6040405@noaa.gov> <48A2C50F.5010305@scalableinformatics.com> <48A2F5D7.9080406@noaa.gov> <48A31327.8060801@ldeo.columbia.edu> Message-ID: On Wed, 13 Aug 2008 at 1:00pm, Gus Correa wrote > After I banned the use of Matlab (to the dismay and revolt of many users) I am *so* jealous... -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From perry at piermont.com Wed Aug 13 10:48:37 2008 From: perry at piermont.com (Perry E. Metzger) Date: Wed, 13 Aug 2008 13:48:37 -0400 Subject: [Beowulf] Re: Linux cluster authenticating against multiple Active Directory domains In-Reply-To: <48A2EDB5.6090407@cc.in2p3.fr> (Loic Tortay's message of "Wed\, 13 Aug 2008 16\:20\:37 +0200") References: <810152863.6581217568750370.JavaMail.root@mail.vpac.org> <371991977.6661217569032542.JavaMail.root@mail.vpac.org> <87bpzyz0bz.fsf@liv.ac.uk> <87skt9m9ff.fsf@snark.cb.piermont.com> <48A2EDB5.6090407@cc.in2p3.fr> Message-ID: <87ej4skdqi.fsf@snark.cb.piermont.com> Loic Tortay writes: > Perry E. Metzger wrote: > [...] >> >> Maybe some sort of strange myth has been going by so long on this that >> people refuse to believe that the ticket refresh is a single easy >> command? >> > The "myth" is the ability to automatically get a Kerberos ticket on any > node in a cluster *especially* for the nodes on which you can neither > login nor run cron jobs to renew tickets (which is ugly and likely to be > non practical and/or insecure in any but the most simple environment > anyway). It is the way virtually all server credentials are handled. If you have any kerberized service on the network, it almost always works with stashed creds. > That's the point of "kstart" and similar tools, kstart is a modified version of kinit. 
It is just a more sophisticated version of what I described already -- it uses an srvtab or keytab to get the tickets, forks a job, waits, and then does a kdestroy at the exit. I'm not going to say it is a stupid program -- it is very useful -- but it isn't doing anything terribly deep or special. > as well as specific modifications/extensions to batch queueing > systems used where a Kerberos ticket is required for jobs (including > many HEP sites): *transparently* get and renew Kerberos tickets (for > the local realm) on *any* node in the cluster without the need to > ever enter a password on the computing nodes. One doesn't "enter a password" using tools like kstart because one uses a servtab or keytab -- you are putting the crypto key into a file. > The tickets are discarded when the process/job ends (unlike the > "kinit" in a cron job thingy). It appears you are talking about distributing *user* credentials to the remote systems. What exactly is it that these jobs are doing that require user tickets rather than the tickets for the locally provided service? In any case, for user credentials, kstart isn't the appropriate mechanism, forwarding a ticket from a trusted machine is the appropriate mechanism. > Everyday use case example: the user job runs a program binary stored in > CERN's AFS cell with input data in our AFS cell and writes its output in > BNL's AFS cell (Kerberos tickets for at least two realms/cells required). Actually, with cross realm auth you only need tickets for one realm along with an appropriate trust relationship between the two KDCs. It seems like a bad move for performance to use AFS this way, but it seems reasonably straightforward. 
AFS has an advanced ACL mechanism so it is not necessary to give the user's normal credentials away to the job -- it is more than sufficient to set up a distinct instance for the job that has permission to read and write only the appropriate files, and to forward the credentials for the segregated instance to the compute nodes. Naturally the last thing the job should do is a kdestroy or the moral equivalent. Perry -- Perry E. Metzger perry at piermont.com From robl at mcs.anl.gov Wed Aug 13 11:38:53 2008 From: robl at mcs.anl.gov (Robert Latham) Date: Wed, 13 Aug 2008 13:38:53 -0500 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel file system? In-Reply-To: <588c11220808061131l6c3ecf49hc45b5da7151e64b8@mail.gmail.com> References: <4898C560.6040606@ldeo.columbia.edu> <588c11220808061131l6c3ecf49hc45b5da7151e64b8@mail.gmail.com> Message-ID: <20080813183853.GJ11340@mcs.anl.gov> On Wed, Aug 06, 2008 at 01:31:09PM -0500, Jason Clinton wrote: > Generally speaking, MPI programs will not be fetching/writing data > from/to storage at the same time they are doing MPI calls so there > tends to not be very much contention to worry about at the node level. Well... if the MPI program uses collective I/O (maybe they use MPI_File_write_all directly, or they are using collective parallel-NetCDF or parallel-HDF5 calls), it is likely the MPI-IO implementation will optimize that access with "two-phase" buffering. I would like to think this situation is common in high performance I/O, especially when accessing a parallel file system. Some tremendous performance gains can be had with collective I/O.
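For readers unfamiliar with two-phase buffering, the idea can be sketched in plain Python. This is only an illustration under stated assumptions: there is no MPI here, and the function name `two_phase_write` and the data layout are hypothetical, not taken from any MPI-IO implementation. Each rank holds small interleaved pieces of the file; in phase one the pieces are exchanged so that each aggregator owns one contiguous file domain, and in phase two each aggregator issues a single large write. In a real run the phase-one exchange is interconnect traffic that happens alongside the file system traffic.

```python
def two_phase_write(rank_pieces, file_size, n_aggregators):
    """Simulate a two-phase collective write (illustrative only).

    rank_pieces: one list per rank of (offset, data) byte pieces.
    Returns (file_bytes, writes), where writes lists the
    (offset, length) contiguous writes issued in phase two.
    """
    # Split the file into contiguous "file domains", one per aggregator.
    domain = -(-file_size // n_aggregators)  # ceiling division
    buffers = [bytearray(min(domain, max(0, file_size - a * domain)))
               for a in range(n_aggregators)]

    # Phase 1: each rank hands every byte range to the aggregator
    # that owns it (this stands in for the MPI data exchange).
    for pieces in rank_pieces:
        for offset, data in pieces:
            pos = 0
            while pos < len(data):
                agg = (offset + pos) // domain
                start = (offset + pos) - agg * domain
                n = min(len(data) - pos, domain - start)
                buffers[agg][start:start + n] = data[pos:pos + n]
                pos += n

    # Phase 2: each aggregator issues one contiguous write.
    out = bytearray(file_size)
    writes = []
    for agg, buf in enumerate(buffers):
        if buf:
            out[agg * domain:agg * domain + len(buf)] = buf
            writes.append((agg * domain, len(buf)))
    return bytes(out), writes
```

With four ranks each holding two 2-byte pieces of a 16-byte file, independent I/O would issue eight small writes; with two aggregators the sketch issues two 8-byte writes instead.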
==rob -- Rob Latham Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF Argonne National Lab, IL USA B29D F333 664A 4280 315B From bernard at vanhpc.org Wed Aug 13 16:14:50 2008 From: bernard at vanhpc.org (Bernard Li) Date: Wed, 13 Aug 2008 16:14:50 -0700 Subject: [Beowulf] copying big files In-Reply-To: <20080812171055.GA3846@nlxdcldnl2.cl.intel.com> References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> <20080812171055.GA3846@nlxdcldnl2.cl.intel.com> Message-ID: Hi: On Tue, Aug 12, 2008 at 10:10 AM, Lombard, David N wrote: > See Brent Chen's pcp at > You'll want pcp, authd, and libe. Get gexec while you're at it... Dave!! You beat me to mention pcp! :-) I really wished somebody would pick up the project and make it better. I have tested it a few years back and it was really fast. The general idea is that the host that has the original file will send to the neighbour and before that transfer completes the neighbour will send to its neighbour and so on and so forth. But it has some shortcomings: 1) If one host in the path goes down, you need to start over again 2) The command option/interface is a little bit awkward, if I remember correctly you need to specify all the hosts manually in the command line etc. And yes, I agree BitTorrent is a simple solution that works well, that is why we integrated BitTorrent as part of the distribution mechanism of OS images in SystemImager. However, it would be nice if someone actually wrote an application which eliminates the manual setup of the tracker, seed, etc. -- better yet, code something from scratch as BitTorrent cannot handle user/file permissions, and thus the way around it is to tar up the files you wanted transfer, and untar it after, which adds additional overhead. I look forward to trying out XGET, though. 
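A rough timing model of the neighbour-to-neighbour pipelining described above, as a sketch only. The function names are hypothetical, and the model assumes an ideal non-blocking network where every hop forwards one chunk per time step and no host fails.

```python
def chain_copy_time(n_nodes, n_chunks, chunk_time):
    """Ideal time for a pipelined chain copy.

    The source sends chunk k at step k; each chunk then needs one
    further step per remaining hop, so the last chunk reaches the
    last node after n_chunks + (n_nodes - 1) - 1 steps.
    """
    hops = n_nodes - 1
    return (n_chunks + hops - 1) * chunk_time


def sequential_copy_time(n_nodes, n_chunks, chunk_time):
    """Time if the source copies the whole file to each node in turn."""
    return (n_nodes - 1) * n_chunks * chunk_time


if __name__ == "__main__":
    # Assumed numbers for illustration: 1 GB file, 10 kB chunks,
    # 0.1 ms per chunk per hop, 1024 nodes.
    chunks = 100_000
    print(chain_copy_time(1024, chunks, 1e-4))       # roughly 10 s
    print(sequential_copy_time(1024, chunks, 1e-4))  # roughly 10,000 s
```

The model also shows why a single failed host is so costly: every node downstream of the failure stops receiving chunks, which matches the first shortcoming listed above.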
Cheers, Bernard From bernard at vanhpc.org Wed Aug 13 16:20:56 2008 From: bernard at vanhpc.org (Bernard Li) Date: Wed, 13 Aug 2008 16:20:56 -0700 Subject: [Beowulf] copying big files In-Reply-To: References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> <20080812171055.GA3846@nlxdcldnl2.cl.intel.com> Message-ID: I'd like to add a comment -- is the reason why this "issue" hasn't been brought up as frequently as I think it should be mainly because a lot of folks use distributed FS that eliminates the need to do one to many file transfers? Cheers, Bernard On Wed, Aug 13, 2008 at 4:14 PM, Bernard Li wrote: > Hi: > > On Tue, Aug 12, 2008 at 10:10 AM, Lombard, David N > wrote: > >> See Brent Chen's pcp at >> You'll want pcp, authd, and libe. Get gexec while you're at it... > > Dave!! You beat me to mention pcp! :-) > > I really wished somebody would pick up the project and make it better. > I have tested it a few years back and it was really fast. The > general idea is that the host that has the original file will send to > the neighbour and before that transfer completes the neighbour will > send to its neighbour and so on and so forth. But it has some > shortcomings: > > 1) If one host in the path goes down, you need to start over again > 2) The command option/interface is a little bit awkward, if I remember > correctly you need to specify all the hosts manually in the command > line > etc. > > And yes, I agree BitTorrent is a simple solution that works well, that > is why we integrated BitTorrent as part of the distribution mechanism > of OS images in SystemImager. However, it would be nice if someone > actually wrote an application which eliminates the manual setup of the > tracker, seed, etc. -- better yet, code something from scratch as > BitTorrent cannot handle user/file permissions, and thus the way > around it is to tar up the files you wanted transfer, and untar it > after, which adds additional overhead. > > I look forward to trying out XGET, though. 
> > Cheers, > > Bernard > From niftyompi at niftyegg.com Wed Aug 13 17:12:40 2008 From: niftyompi at niftyegg.com (Nifty niftyompi Mitch) Date: Wed, 13 Aug 2008 17:12:40 -0700 Subject: [Beowulf] Infinipath memory parity errors In-Reply-To: <87d4kcx5p9.fsf@liv.ac.uk> References: <87d4kcx5p9.fsf@liv.ac.uk> Message-ID: <20080814001240.GA5557@hpegg.wr.niftyegg.com> On Wed, Aug 13, 2008 at 05:03:46PM +0100, Dave Love wrote: > [I know in an ideal world the vendor between us and PathScale^WQlogic > would sort this out.] > > I'm interested in the cause (and possible cure!) of intermittent errors > on various nodes in our Infinipath system which stop MPI jobs with > kernel messages like this, in case anyone's familiar with them: > > lvinfi095:21.Hardware problem: {[RXE EAGERTID Memory Parity]} > > They seem to be new with an upgrade to Linux 2.6.22 from 2.6.11, but > probably just manifested themselves in some other way previously. > > Google didn't produce any leads, and a brief look in the source suggests > that tracking it down where it's generated in the ib_ipath module is > non-trivial and likely won't tell me a lot. > > For what it's worth, the adaptors are > > 06:00.0 InfiniBand: PathScale, Inc InfiniPath HT-400 (rev 02) > > in two different sorts of Supermicro whose model numbers I don't know. > Dave, Which driver is active? Which Infinipath software release is installed? The tool "ipath_control -i" can show which... The kernel.org/ofed driver does not have as rich a set of error recovery code for this card as the shipped driver. The recovery code was seen as a badness and not accepted by the kernel.org folk.... With a kernel update the driver will not have been recompiled and the kernel.org driver would become active. Look for this stuff in the Install Guide. # To rebuild the drivers, do the following (as root): # cd /usr/src/infinipath/drivers # ./make-install.sh # /etc/init.d/infinipath restart -- T o m M i t c h e l l Got a great hat... now what. 
From carsten.aulbert at aei.mpg.de Wed Aug 13 22:23:10 2008 From: carsten.aulbert at aei.mpg.de (Carsten Aulbert) Date: Thu, 14 Aug 2008 07:23:10 +0200 Subject: [Beowulf] Distributed FS (Was: copying big files) In-Reply-To: References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> <20080812171055.GA3846@nlxdcldnl2.cl.intel.com> Message-ID: <48A3C13E.5010508@aei.mpg.de> Hi all Bernard Li wrote: > I'd like to add a comment -- is the reason why this "issue" hasn't > been brought up as frequently as I think it should be mainly because a > lot of folks use distributed FS that eliminates the need to do one to > many file transfers? > Speaking of this, what do people use when they have say ~ 200 nodes with an extra 1 TB drive in it. I know of glusterFS but very few others who will be able to utilize this in a somewhat efficient matter. Are the good/better alternatives out there? Extending your storage that way is just darn cheap. Cheers Carsten -------------- next part -------------- A non-text attachment was scrubbed... Name: carsten_aulbert.vcf Type: text/x-vcard Size: 414 bytes Desc: not available URL: From csamuel at vpac.org Wed Aug 13 22:37:35 2008 From: csamuel at vpac.org (Chris Samuel) Date: Thu, 14 Aug 2008 15:37:35 +1000 (EST) Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: <56844017.103401218676989887.JavaMail.root@mail.vpac.org> Message-ID: <749528484.108171218692255152.JavaMail.root@mail.vpac.org> ----- "Kilian CAVALOTTI" wrote: > Hi Chris, Hello Kilian, > On Tuesday 12 August 2008 08:29:31 pm Chris Samuel wrote: > > > We do use things like cpusets to try and limit the impact > > that jobs can have on other jobs on the same nodes, > > I'm actually curious about how you implemented that. Not a problem. > Do you have NUMA hardware? Yes, the cluster we're using this on has dual quad core Barcelona CPUs and 32GB RAM per node (to get it to the 4GB/core level). It's running CentOS 5 with the mainline kernel. 
> Do you use a resources manager, and is the cpusets creation > process integrated with it? We are using Torque (an open source PBS derivative) and that has built in cpusets support. It previously had some support for the older SGI Altix cpusets but that has now been replaced with support for the 2.6 kernel implementation (which itself has now been pulled into the more generic cgroups work). The 2.6 cpuset support in Torque came out of a long discussion between Garrick Staples and myself at SC'07 where we nutted out the basic design and Garrick then did the hard work of implementing it. > How do you manage concurrent jobs running on the same > machine: do you pin them on specific CPUs and keep track > of what CPU is busy and which is not, or do you have a > way to just limit the number of CPUs they're using? There are two major assumptions in the current Torque code: 1) There is a direct mapping between Torque's concept of vnodes (cpus) and cores. I.e. if you have told Torque a node has 8 cpus then it has 8 cores to bind to. 2) The cpus are contiguous and start at 0. So if you are using a boot cpuset then it's best to reserve the *last* core in the box for that and not the first. You will also need to tell Torque that the node has N-1 cpus. The design is sort of hierarchical: 1) A top level "torque" cpuset is created by the pbs_mom when it starts if it does not already exist. It adds all the cpus and mems into it. 2) When a job is scheduled onto the node(s) the pbs_mom creates a job cpuset which includes the specific cpus (vnodes) that have been allocated by the scheduler, and all the mems present (it currently makes no attempt to be clever about that). 3) Prior to the 2.3.2 release there was a per vnode (core) cpuset created within the job cpuset and then processes launched via the PBS tm_spawn interface by tools like Pete Wyckoff's mpiexec would get locked to a core. Great in theory, but... 
That's been changed now to just put processes in the job cpuset as MPI tools like OpenMPI's mpiexec only make a single tm_spawn call *per node* and then fork the MPI processes from that so you would end up with all the processes of an OpenMPI job locked to a single core with the old code. This still leaves issues for codes that use rsh/ssh based MPI launchers but we're playing around with a drop-in script that makes it do the right thing using pbsdsh instead. > As you can guess, I'd be interested in some technical details. :) Hope that's useful! We also have an init script that does: mkdir /dev/cpuset mount -t cpuset none /dev/cpuset to make sure the cpuset VFS is there on boot. Tangent: Linux cpusets were how we found that the noacpi boot option broke the kernel's detection of NUMA capabilities [1] on Barcelona as /dev/cpuset/mems only had "0" in it, not "0-1" as it should have had! [1] - it first tries a K8 specific hack and then uses ACPI, so for K10 no ACPI - no NUMA. ;-) cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From csamuel at vpac.org Wed Aug 13 22:48:48 2008 From: csamuel at vpac.org (Chris Samuel) Date: Thu, 14 Aug 2008 15:48:48 +1000 (EST) Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: <48A31327.8060801@ldeo.columbia.edu> Message-ID: <1495958777.108261218692928122.JavaMail.root@mail.vpac.org> ----- "Gus Correa" wrote: > One reason not mentioned is serial programs. > Well a cluster is to run parallel jobs. Hmm, a cluster is to run HPC codes, there are plenty of legitimate single CPU codes to solve embarrassingly parallel problems! :-) [...] > and an average of about 75% use of its maximum capacity [..] > I couldn't find usage data of other public, academic, or industry > machines to compare.
It appears we've averaged almost 77% utilisation since the beginning of 2004 (when our current usage system records begin). cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From csamuel at vpac.org Thu Aug 14 01:03:16 2008 From: csamuel at vpac.org (Chris Samuel) Date: Thu, 14 Aug 2008 18:03:16 +1000 (EST) Subject: Linux cpusets and HPC (was Re: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem?) In-Reply-To: <1703956069.109811218700794811.JavaMail.root@mail.vpac.org> Message-ID: <440195143.109831218700996351.JavaMail.root@mail.vpac.org> ----- "Paul Jackson" wrote: Hi Paul, > Chris wrote: > > The 2.6 cpuset support in Torque came out of a long > > Would you have any pointers to some more details of what you've > done here? Sure - mostly it was discussed on the torquedev list after the initial discussion at SC'07, Garrick started the thread here: http://www.supercluster.org/pipermail/torquedev/2007-November/000748.html His announcement of the initial implementation, along with notes on the differences from the plan is here: http://www.supercluster.org/pipermail/torquedev/2008-January/000842.html There is a Wiki page on it too, but that isn't up to date as it doesn't mention not using the per-vnode/core cpusets due to the OpenMPI issues. http://www.clusterresources.com/wiki/doku.php?id=torque:3.5_linux_cpuset_support > I'm the maintainer, and one of the authors, of Linux 2.6 > cpusets, and would like to do what I can with cpusets to make life > easier (or at least no more painful) for cluster and MPI folks. Wonderful! First of all thanks so much for the code. 
The only major issue we've come across is not due to cpusets themselves but just the way that things like OpenMPI tend to work in that they launch a single process per node via the MPI launcher and then that forks off all the child processes necessary. This means it's not easy to lock MPI tasks to cores via this method, and it's also not trivial for the MPI program to be able to work out what cores it can try and bind itself to via sched_setaffinity(). > My background comes more from the "big honkin NUMA iron" > running a Single System Image on 100's or 1000's of CPUs > (SGI Irix/Origin and later Linux/Altix), which was the > "country of origin" for cpusets, so my interest (and > ignorance) in asking this question is more to gain > an understanding of how cpusets have been adapted to > clusters, as I understand less well the needs of clusters, > and what if anything cpusets might do here to be of more use. The main purpose we're using them for is a quick and easy way to catch users who don't know better doing things like running an OpenMP code as a single CPU job and overloading a node (and causing chaos for other users) when it discovers 8 cores. Single CPU jobs get the benefit of being locked to a single core, and even MPI jobs get some benefit in that they can only be migrated between cores they've been allocated. > Totally totally trivial nit -- you wrote: [...] > I prefer in my setups to have that mount command be: > > mount -t cpuset cpuset /dev/cpuset > > so that the mount shows up in the output of the mount(8) command > with 'cpuset' in the mount 'device' field, not 'none'. Thanks for that, much appreciated! cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O.
Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From csamuel at vpac.org Thu Aug 14 01:08:17 2008 From: csamuel at vpac.org (Chris Samuel) Date: Thu, 14 Aug 2008 18:08:17 +1000 (EST) Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: <20080814020045.9b1961bb.pj@sgi.com> Message-ID: <361311091.109861218701297236.JavaMail.root@mail.vpac.org> ----- "Paul Jackson" wrote: > I have recently open sourced a major user level C library, > called libcpusets, which includes routines to map cpus to > their corresponding memory nodes. Aha, a bunch of us had been badgering the local Melbourne SGI rep about getting that published when we found it was included in the ProPack that supported the 2.6 series cpusets as we could tell from the RPM that it was LGPL'd. Great to see it finally arrive! cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From csamuel at vpac.org Thu Aug 14 03:42:10 2008 From: csamuel at vpac.org (Chris Samuel) Date: Thu, 14 Aug 2008 20:42:10 +1000 (EST) Subject: Linux cpusets and HPC (was Re: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem?) In-Reply-To: <1818932448.110281218710420889.JavaMail.root@mail.vpac.org> Message-ID: <491682710.110301218710530374.JavaMail.root@mail.vpac.org> ----- "Paul Jackson" wrote: Hi Paul, > Let me see if I understand this. Is the following right: > > Without the cpuset constraint, such a 'bad' job could tell the > cluster management software (PBS or Torque or ...) it needed just > one CPU, which could end up putting it on a cluster node with > say eight CPUs, along with some other jobs that expect to use the > other seven CPUs. It's the user that specifies how many CPUs their particular problem will require, but effectively that's correct.
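From the job's side, one way to stay within the allocation is to size the thread pool from the scheduling affinity mask instead of the node's core count. The following is a minimal sketch, assuming a Linux node where the cpuset narrows the affinity mask; the helper names `usable_cpus` and `thread_count` are hypothetical.

```python
import os


def usable_cpus():
    """Return the set of CPU ids this process is allowed to run on.

    Inside a cpuset this is the allocated subset, whereas
    os.cpu_count() still reports every core in the node.
    """
    try:
        return set(os.sched_getaffinity(0))  # available on Linux
    except AttributeError:
        # Platforms without sched_getaffinity: assume all CPUs.
        return set(range(os.cpu_count() or 1))


def thread_count(requested=None):
    """Clamp a requested thread count to the CPUs actually allowed."""
    allowed = len(usable_cpus())
    if requested is None:
        return allowed
    return max(1, min(requested, allowed))


if __name__ == "__main__":
    print("cores in node:   ", os.cpu_count())
    print("cores we may use:", sorted(usable_cpus()))
    print("threads to start:", thread_count())
```

A job confined to one core of an eight-core node would then start one thread rather than eight.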
> But then OpenMP code in that 'bad' job could notice it had eight > CPUs, think to itself 'wow - cool', and proceed to hog all eight > CPUs, messing up those other jobs. That's correct, some OpenMP codes automatically detect the number of cores in a system and will (if not told otherwise) use them all. Alternatively some users forget they've reduced the number of cores they've said the job needs and have (in a config file say) still a larger number specified. > With the cpuset constraint, that 'bad' job -will- only be able to > use that one CPU, and if OpenMP or other code in that job can't > deal reasonably with that circumstance, well, tough, the owner of > that job should fix something. Well, "tough" might be a tad hard on them, but yes. > But at least the other jobs that were > hoping to use the other seven CPUs won't be bothered much by this. Spot on. > Did I say that right? That's a pretty fair summary for our main use case, yes. The memory locality possibilities are also important, just not currently covered by Torque and will probably require more smarts in the pbs_mom and how it detects core/socket relationships, locality, hyperthreading/SMT, etc.. It will also change how it reports that to the pbs_server and then how the scheduler (Maui or Moab) allocates resources to jobs based upon policies and what the job has requested. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. 
Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From d.love at liverpool.ac.uk Thu Aug 14 03:51:41 2008 From: d.love at liverpool.ac.uk (Dave Love) Date: Thu, 14 Aug 2008 11:51:41 +0100 Subject: [Beowulf] Infinipath memory parity errors In-Reply-To: <20080814001240.GA5557@hpegg.wr.niftyegg.com> (Nifty niftyompi Mitch's message of "Wed, 13 Aug 2008 17:12:40 -0700") References: <87d4kcx5p9.fsf@liv.ac.uk> <20080814001240.GA5557@hpegg.wr.niftyegg.com> Message-ID: <878wuzvphe.fsf@liv.ac.uk> Nifty niftyompi Mitch writes: > Which driver is active? Which Infinipath software release > is installed? The tool "ipath_control -i" can show which... QLogic kernel.org driver 00: Version: Driver 2.0, InfiniPath_QLE7140, InfiniPath1 4.2, PCI 2, SW Compat 2 I think this is a 2.1 distribution, whereas there's at least 2.2 now available. > The kernel.org/ofed driver does not have as rich a set of error recovery > code for this card as the shipped driver. The recovery code was seen > as a badness and not accepted by the kernel.org folk.... Hmm... > With a kernel update the driver will not have been recompiled > and the kernel.org driver would become active. [Actually it wasn't just a kernel update -- the SuSE 9.3 system disk was removed and replaced by a 10.3 one shortly after I arrived, trashing all the configuration, so I'm a little at sea, without infiniband experience.] > Look for this stuff in the Install Guide. > > # To rebuild the drivers, do the following (as root): > # cd /usr/src/infinipath/drivers > # ./make-install.sh > # /etc/init.d/infinipath restart I didn't realize that there's a driver significantly different from the kernel.org one, and haven't had time to read up enough. I'll give it a go when we can restart the relevant nodes. Many thanks for the info. 
From csamuel at vpac.org Thu Aug 14 04:15:43 2008 From: csamuel at vpac.org (Chris Samuel) Date: Thu, 14 Aug 2008 21:15:43 +1000 (EST) Subject: [Beowulf] Gigabit Ethernet and RDMA In-Reply-To: <1728580444.110441218712351526.JavaMail.root@mail.vpac.org> Message-ID: <569110383.110461218712543056.JavaMail.root@mail.vpac.org> ----- "Robert Kubrick" wrote: > However, some observers predict TOE cards will be back as 10 Gigabit > Ethernet network capacity leapfrogs processing power again. Van Jacobson's "channelised" mods to the Linux TCP stack described at Linux.Conf.Au in Dunedin, NZ in 2006 showed that the 4.3Gb/s limit he hit on 10GigE was due to the DDR333 RAM in the box and not the CPUs; he commented during the presentation that they'd proved it to themselves by putting in faster memory and getting a speedup. :-) http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf He estimated that you would need at least DDR800 RAM to be able to have enough memory bandwidth to drive 10GigE at capacity. Whilst VJ's code didn't (to my knowledge) get published it did inspire other work by kernel hackers, though I'm not sure how much of it (if any) got into the mainline. http://lwn.net/Articles/182060/ cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From csamuel at vpac.org Thu Aug 14 04:19:55 2008 From: csamuel at vpac.org (Chris Samuel) Date: Thu, 14 Aug 2008 21:19:55 +1000 (EST) Subject: [Beowulf] Distributed FS (Was: copying big files) In-Reply-To: <48A3C13E.5010508@aei.mpg.de> Message-ID: <1492037214.110491218712795387.JavaMail.root@mail.vpac.org> ----- "Carsten Aulbert" wrote: > Speaking of this, what do people use when they have say ~ 200 nodes > with an extra 1 TB drive in it.
We're just using them for local scratch space, we thought it was overkill until two weeks after the first test node arrived when we got a user whose job needed over 600GB of scratch whilst it ran and it was the only place we could run it... cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From robl at mcs.anl.gov Thu Aug 14 05:55:05 2008 From: robl at mcs.anl.gov (Robert Latham) Date: Thu, 14 Aug 2008 07:55:05 -0500 Subject: [Beowulf] Distributed FS (Was: copying big files) In-Reply-To: <48A3C13E.5010508@aei.mpg.de> References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> <20080812171055.GA3846@nlxdcldnl2.cl.intel.com> <48A3C13E.5010508@aei.mpg.de> Message-ID: <20080814125505.GB30610@mcs.anl.gov> On Thu, Aug 14, 2008 at 07:23:10AM +0200, Carsten Aulbert wrote: > Speaking of this, what do people use when they have say ~ 200 nodes > with an extra 1 TB drive in it. I know of glusterFS but very few > others who will be able to utilize this in a somewhat efficient > matter. Are the good/better alternatives out there? > > Extending your storage that way is just darn cheap. Take a look at PVFS (www.pvfs.org). Disclaimer: I work on PVFS, but seriously, PVFS has been used this way for a decade. 
==rob -- Rob Latham Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF Argonne National Lab, IL USA B29D F333 664A 4280 315B From dnlombar at ichips.intel.com Thu Aug 14 06:27:09 2008 From: dnlombar at ichips.intel.com (Lombard, David N) Date: Thu, 14 Aug 2008 06:27:09 -0700 Subject: [Beowulf] copying big files In-Reply-To: References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> <20080812171055.GA3846@nlxdcldnl2.cl.intel.com> Message-ID: <20080814132709.GA1751@nlxdcldnl2.cl.intel.com> On Wed, Aug 13, 2008 at 04:14:50PM -0700, Bernard Li wrote: > Hi: > > On Tue, Aug 12, 2008 at 10:10 AM, Lombard, David N > wrote: > > > See Brent Chen's pcp at > > You'll want pcp, authd, and libe. Get gexec while you're at it... > > Dave!! You beat me to mention pcp! :-) > > I really wished somebody would pick up the project and make it better. It's now part of Ganglia. It was *always* intended to support ganglia, hence the g of gexec, but now it's in the Ganglia tree @ SF. > I have tested it a few years back and it was really fast. The > general idea is that the host that has the original file will send to > the neighbour and before that transfer completes the neighbour will > send to its neighbour and so on and so forth. It's essentially a file transfer pipeline through the hosts, so O(1) transfers. gexec is a tree. > But it has some > shortcomings: > > 1) If one host in the path goes down, you need to start over again Yup > 2) The command option/interface is a little bit awkward, if I remember > correctly you need to specify all the hosts manually in the command > line > etc. Running via ganglia eases this. I once provided patches to allow for the various naming shortcuts, e.g., 'n[1-42]', but Brent was more focused on Ganglia integration than general purpose tools. > And yes, I agree BitTorrent is a simple solution that works well, that > is why we integrated BitTorrent as part of the distribution mechanism > of OS images in SystemImager. 
Yes, BT is an outstanding application-level multicast. BUT, it depends on having a suitable number of contributers to the stream, which are built up over time in this usage. Having said that, it really is a good capability. > However, it would be nice if someone > actually wrote an application which eliminates the manual setup of the > tracker, seed, etc. -- better yet, code something from scratch as > BitTorrent cannot handle user/file permissions, and thus the way > around it is to tar up the files you wanted transfer, and untar it > after, which adds additional overhead. Hmmm... > I look forward to trying out XGET, though. me2 -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From hahn at mcmaster.ca Thu Aug 14 07:17:41 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 14 Aug 2008 10:17:41 -0400 (EDT) Subject: [Beowulf] Distributed FS (Was: copying big files) In-Reply-To: <20080814125505.GB30610@mcs.anl.gov> References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> <20080812171055.GA3846@nlxdcldnl2.cl.intel.com> <48A3C13E.5010508@aei.mpg.de> <20080814125505.GB30610@mcs.anl.gov> Message-ID: >> Extending your storage that way is just darn cheap. in fact, we really have to stop thinking of storage as a significant cost component to clusters. if adding a terabyte disk to a node increases its cost by ~2%, then it's in the noise. > Take a look at PVFS (www.pvfs.org). Disclaimer: I work on PVFS, but > seriously, PVFS has been used this way for a decade. the premise of this approach is that whoever is using the node doesn't mind the overhead of external accesses. do you have a sense (or even measurements) on how bad this loss is (cpu, cache, memory, interconnect overheads)? 
if you follow the reasoning that current machines are pretty 'fat' wrt IB bandwidth and cpu power, there's still a question of who does the work of raid/fec - ideally, it would be on the client side to minimize the imposed jitter. thanks, mark hahn. From carsten.aulbert at aei.mpg.de Thu Aug 14 07:26:52 2008 From: carsten.aulbert at aei.mpg.de (Carsten Aulbert) Date: Thu, 14 Aug 2008 16:26:52 +0200 Subject: [Beowulf] Distributed FS (Was: copying big files) In-Reply-To: References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> <20080812171055.GA3846@nlxdcldnl2.cl.intel.com> <48A3C13E.5010508@aei.mpg.de> <20080814125505.GB30610@mcs.anl.gov> Message-ID: <48A440AC.5080508@aei.mpg.de> Hi Mark Mark Hahn wrote: > the premise of this approach is that whoever is using the node doesn't > mind the overhead of external accesses. do you have a sense (or even > measurements) on how bad this loss is (cpu, cache, memory, interconnect > overheads)? if you follow the reasoning that current machines are > pretty 'fat' wrt IB bandwidth and cpu power, there's still a question > of who does the work of raid/fec - ideally, it would be on the client > side to minimize the imposed jitter. As always: It depends. All our nodes run on single GigE but mostly their computations are non-MPI and even local to their core, i.e. the bandwidth should not be a problem. Of course you add more heat to the system, e.g. 1000 extra disks might be around 10 kW sustained, but OTOH you gain a lot, provided you can efficiently use these extra disks. I need to look into PVFS, if this would provide a kind of uniform namespace (and maybe some kind of automatically duplicated files) that would already be perfect. But I need to read first. Cheers Carsten -- Dr. 
Carsten Aulbert - Max Planck Institute for Gravitational Physics Callinstrasse 38, 30167 Hannover, Germany Phone/Fax: +49 511 762-17185 / -17193 http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31 -------------- next part -------------- A non-text attachment was scrubbed... Name: carsten_aulbert.vcf Type: text/x-vcard Size: 414 bytes Desc: not available URL: From erwan at seanodes.com Thu Aug 14 07:42:03 2008 From: erwan at seanodes.com (Erwan Velu) Date: Thu, 14 Aug 2008 16:42:03 +0200 Subject: [Beowulf] copying big files In-Reply-To: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> Message-ID: <48A4443B.5060705@seanodes.com> Henning Fehrmann wrote: > Hi everybody, > > Coping a big file onto all nodes in a cluster is a rather common problem. > I would have thought that there might be a standard tool for > distributing the files in an efficient way. So far, I haven't found one. > > Assuming one has a network design which allows non blocking full duplex > wire-speed connections between N/2 pairs of nodes where N is the number > of nodes in the cluster. It is basically a non blocking coreswitch. > > In this case the following scheme would be convenient and rather simple: > > The file is placed on node n1 and one builds a chain of nodes n1 , n2 .... nN. > > One splits the file into many packages (p1..pM), lets say a fragment fits > into one TCP package. In the first step n1 transmits the package p1 to node n2. > In the second step n1 transmits the package p2 to n2 and n2 transmits p1 to node n3. > > The transmission of a single package is fast. The time of passing a particular > package through the whole chain of nodes is short compared with time of the > entire copying process. E.g., using jumbo frames a package can have the size of ca 10kB. > In Gb network the transmission time of a single package between nodes is > of the order of 0.1 ms. 
Even in a cluster with 1024 nodes it takes > in an ideal case just 0.1s to pass a package from node n1 through all nodes to n1024. > > On each node the package is stored and, in the end, one reassembles the file. > For big files (size >> 10Mb) the required time is approximately > the same as one needs for copying the file between two nodes plus 0.1s. > > One needs basically a daemon which handles copying requests and establishes > the connection to next node in the chain. > > Has somebody written such a tool? > Sounds like you are looking for http://taktuk.gforge.inria.fr/ -- Erwan Velu Pre-Sales Engineer Seanodes http://www.seanodes.com +33 (0)1 41 22 13 83 From gus at ldeo.columbia.edu Thu Aug 14 08:58:20 2008 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 14 Aug 2008 11:58:20 -0400 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: <1495958777.108261218692928122.JavaMail.root@mail.vpac.org> References: <1495958777.108261218692928122.JavaMail.root@mail.vpac.org> Message-ID: <48A4561C.2040302@ldeo.columbia.edu> Hello Chris and list Chris Samuel wrote: >----- "Gus Correa" wrote: > > > >>One reason not mentioned is serial programs. >>Well a cluster is to run parallel jobs. >> >> > >Hmm, a cluster is to run HPC codes, there are plenty of >legitimate single CPU codes to solve embarrassingly >parallel problems! :-) > > > We seem to agree on this. An example: Millions of cross correlations of micro-earthquake seismograms (time series), to locate the foci precisely, and produce a high-resolution map of geologic faults and potential hazard in California took several days using shared nodes. The code is serial, the size of each calculation doesn't justify parallelism, but the large number of them requires massive computational resources. I wouldn't bother the people who wrote the program to parallelize it (in the sense of using MPI or OpenMP).
The script that launched the tons of serial jobs was the "embarrassingly parallel" component of it. Some people would say it is a waste to run this type of program on a cluster with Myrinet. If we had a farm of serial computers we would have used it, but we don't have one. >[...] > > >>and an average of about 75% use of its maximum capacity >> >> >[..] > > >>I couldn't find usage data of other public, academic, or industry >>machines to compare. >> >> > >It appears we've averaged almost 77% utilisation >since the beginning of 2004 (when our current usage >system records begin). > > > Thank you very much for the data point! I've insisted here that above 70% utilization is very good, given the random nature of demand and jobs on queues in the academia, etc. However, some folks would want more than 90% efficiency to get happy. I had to resort to the Second Law of Thermodynamics, compare our efficiency with Carnot cycles, with the efficiency of thermal engines, of biological systems, of the atmosphere and ocean heat transport, etc, to make my point, and the theoretical argument almost jeopardized my job ... :) >cheers, >Chris > > Cheers, Gus Correa -- --------------------------------------------------------------------- Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu Lamont-Doherty Earth Observatory - Columbia University P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- From hahn at mcmaster.ca Thu Aug 14 09:44:07 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 14 Aug 2008 12:44:07 -0400 (EDT) Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem?
In-Reply-To: <48A4561C.2040302@ldeo.columbia.edu> References: <1495958777.108261218692928122.JavaMail.root@mail.vpac.org> <48A4561C.2040302@ldeo.columbia.edu> Message-ID: >> It appears we've averaged almost 77% utilisation >> since the beginning of 2004 (when our current usage >> system records begin). >> > Thank you very much for the data point! > > I've insisted here that above 70% utilization is very good, > given the random nature of demand and jobs on queues in the academia, etc. that sounds very strange to me. do you really mean that 30% of your cpu time is idle? I wonder whether there could be a big difference in methodology. for instance, if you're using an MPI library (probably based on tcp) that doesn't spin-wait but blocks as for disk IO say 20% of the time, then you might consider this to be 80% utilization. an MPI that spin-waits might show 100% with the same perf/throughput. 70% utilization is terrible if you really mean "fraction of allocatable cpu time occupied by jobs". that is at the job scheduler level, not at the kernel scheduler level. > However, some folks would want more than 90% efficiency to get happy. I would be embarrassed to have less than 90%. perhaps 70% would make sense for a cluster dedicated to a small or narrowly-defined group. I find that a sufficient userbase means you _always_ have something to run, of any size/resource available. From gus at ldeo.columbia.edu Thu Aug 14 10:45:01 2008 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 14 Aug 2008 13:45:01 -0400 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: References: <1495958777.108261218692928122.JavaMail.root@mail.vpac.org> <48A4561C.2040302@ldeo.columbia.edu> Message-ID: <48A46F1D.9030204@ldeo.columbia.edu> Hello Mark and list The measurement was based on walltime. It just refers to the user occupancy of the cluster, versus what was left idle (for all reasons, e.g.
lack of resources to serve large queued jobs, lack of enough jobs to fill all nodes, etc). The number is simply the utilized resources divided by the available resources. This gives a coarse measure of machine utilization. Take the walltime of all jobs multiplied by the number of nodes (or CPUs) each job used, sum them, and divide by the duration of this period (say, one year) times the number of nodes (or CPUs) in the cluster. Maybe 70% utilization is low compared to airplane seats, subway occupancy, hotel rooms, restaurant tables, Internet, telephone networks, and perhaps to other clusters. I don't know, I am not an operations research person. The only other number I could find for a (well used) large cluster in our science field was below 70%, and now Chris mentioned 77%. Are there published numbers of resource utilization for other machines, say, public clusters in the US, Canada, Europe, world? Yes, our cluster is dedicated to a small group of earth scientists and students (20-40 users) and it is small (32 nodes, 64 cpus). Cluster size and user population size most likely make a difference, but in any case, I would be interested in seeing any other numbers for any kind of cluster. Regards, Gus Correa -- --------------------------------------------------------------------- Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu Lamont-Doherty Earth Observatory - Columbia University P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Mark Hahn wrote: >>> It appears we've averaged almost 77% utilisation >>> since the beginning of 2004 (when our current usage >>> system records begin). >>> >> Thank you very much for the data point! >> >> I've insisted here that above 70% utilization is very good, >> given the random nature of demand and jobs on queues in the academia, >> etc. > > > that sounds very strange to me. do you really mean that 30% of your > cpu time is idle? 
I wonder whether there could be a big > difference in methodology. for instance, if you're using an MPI library > (probably based on tcp) that doesn't spin-wait but blocks as for disk IO > say 20% of the time, then you might consider this to be 80% utilization. > an MPI that spin-waits might show 100% with the same perf/throughput. > > 70% utilization is terrible if you really mean "fraction of > allocatable cpu > time occupied by jobs". that is at the job scheduler level, not at > the kernel scheduler level. > >> However, some folks would want more than 90% efficiency to get happy. > > > I would be embarassed to have less than 90%. perhaps 70% would make > sense > for a cluster dedicated to a small or narrowly-defined group. I find > that a sufficient userbase means you _always_ have something to run, > of any size/resource available. From mark.kosmowski at gmail.com Thu Aug 14 11:47:58 2008 From: mark.kosmowski at gmail.com (Mark Kosmowski) Date: Thu, 14 Aug 2008 14:47:58 -0400 Subject: [Beowulf] Infinipath memory parity errors Message-ID: > > > Which driver is active? Which Infinipath software release > > is installed? The tool "ipath_control -i" can show which... > > QLogic kernel.org driver > 00: Version: Driver 2.0, InfiniPath_QLE7140, InfiniPath1 4.2, PCI 2, SW Compat 2 > > I think this is a 2.1 distribution, whereas there's at least 2.2 now > available. > > > The kernel.org/ofed driver does not have as rich a set of error recovery > > code for this card as the shipped driver. The recovery code was seen > > as a badness and not accepted by the kernel.org folk.... > > Hmm... > > > With a kernel update the driver will not have been recompiled > > and the kernel.org driver would become active. > > [Actually it wasn't just a kernel update -- the SuSE 9.3 system disk was > removed and replaced by a 10.3 one shortly after I arrived, trashing all > the configuration, so I'm a little at sea, without infiniband > experience.] 
Have you tried searching for Infinipath drivers at the SUSE 10.3 repositories? If you're using OpenSUSE rather than SLED / SLES, perhaps it would be worth checking the community build repository too. Maybe someone has already done the build work for you. I'm continually amazed at the useful stuff I find there that I was certain I'd have to build for myself. For that matter, a clean install may be in order as a last resort. Good luck! > > > Look for this stuff in the Install Guide. > > > > # To rebuild the drivers, do the following (as root): > > # cd /usr/src/infinipath/drivers > > # ./make-install.sh > > # /etc/init.d/infinipath restart > > I didn't realize that there's a driver significantly different from the > kernel.org one, and haven't had time to read up enough. I'll give it a > go when we can restart the relevant nodes. > > Many thanks for the info. > > From andrea at soe.ucsc.edu Tue Aug 12 16:36:35 2008 From: andrea at soe.ucsc.edu (Andrea Di Blas) Date: Tue, 12 Aug 2008 16:36:35 -0700 (PDT) Subject: [Beowulf] large MPI adopters Message-ID: <18076.148.87.1.172.1218584195.squirrel@squirrelmail.soe.ucsc.edu> hello, I am curious about what companies, besides the national labs of course, use any implementation of MPI to support large applications of any kind, whether only internally (like mapreduce for google, for example) or not. does anybody know of any cases? thank you and best regards, andrea -- Andrea Di Blas, UCSC School of Engineering From mm at yuhu.biz Wed Aug 13 07:28:33 2008 From: mm at yuhu.biz (Marian Marinov) Date: Wed, 13 Aug 2008 17:28:33 +0300 Subject: [Beowulf] Re: Linux cluster authenticating against multiple Active Directory domains In-Reply-To: <48A2DD3A.7000906@ias.edu> References: <810152863.6581217568750370.JavaMail.root@mail.vpac.org> <87skt9m9ff.fsf@snark.cb.piermont.com> <48A2DD3A.7000906@ias.edu> Message-ID: <200808131728.33988.mm@yuhu.biz> On Wednesday 13 August 2008 16:10:18 Prentice Bisbal wrote: > Perry E.
Metzger wrote: > > Dave Love writes: > >>> We'd prefer to steer clear of Kerberos, it introduces > >>> arbitrary job limitations through ticket lives that > >>> are not tolerable for HPC work. > > > > Which of course isn't true. If Wall Street firms, which really cannot > > afford to have their trading systems go down even for a second, can > > happily use kerberos in servers, so can anyone. > > > >>> Say you submit a job that is in the queue for a week > >>> and then will run for 3 months - we don't know if the > >>> AD admins will permit the creation of a 4 month ticket > >>> "just in case".. > >> > >> Why do you need to re-authenticate, and if you do, surely you need to > >> stash a credential somewhere however you do it? > > > > Indeed, and if you have stashed your key appropriately you can just > > have a cron job kinit as often as you like. The kinit man page > > gives the command line flag for requesting credentials using a key > > taken from a file, and also lists the flag for setting your ticket > > expiry time. All you do is put one line in a crontab with kinit and > > those two options, say every 24 hours. > > Isn't stashing a bunch of user keys on a single system a major security > risk? One could argue that the cluster should be on a private, secured > network so it's okay, but then I'll argue that the node storing the keys > will most likely be the master node, which is often on a "public" > network so users can access it to submit jobs. > > I can't imagine an arrangement like this passing the scrutiny of the > local information security officer. And if you're going to use a key > stored on a disk, why not just use SSH keys? > > > I keep seeing these messages go by over and over making it sound like > > this is difficult. It is not difficult. I've seen people say "I have > > seen no document with a recipe for how to do it", perhaps because a > > single kinit command in a cron job is too simple for a HOWTO.
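For reference, the one-line cron job described in the quoted text might look like the sketch below. It assumes an MIT-style kinit, where -k with -t reads the key from a keytab file and -l sets the requested ticket lifetime. The kinit path, keytab location, principal name and schedule are all hypothetical, and the KDC may cap the lifetime regardless of what is requested.

```crontab
# Hypothetical crontab entry: refresh the ticket once a day at 03:17.
# -k -t <file> : take the key from a keytab instead of prompting for a password
# -l 26h       : request a lifetime slightly longer than the refresh interval
# Path, keytab and principal below are made-up placeholders.
17 3 * * * /usr/bin/kinit -k -t /home/alice/alice.keytab -l 26h alice@EXAMPLE.ORG
```

Note that this does nothing to address the objection about keeping keys on disk; it only shows how small the refresh step itself is.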
> > Isn't stashing a bunch of user keys on a single system a major security > risk? One could argue that the cluster should be on a private, secured > network so it's okay, but then I'll argue that the node storing the keys > will most likely be the master node, which is often on a "public" > network so users can access it to submit jobs. > > I can't imagine an arrangement like this passing the scrutiny of the > local information security officer. And if you're going to use a key > stored on a disk, why not just use SSH keys? When using SSH keys you can have an agent which will keep all of your keys and you are not required to have them on the machine. But again the Information Security Officer would most likely drop the idea on first sight. > > > Maybe some sort of strange myth has been going by so long on this that > > people refuse to believe that the ticket refresh is a single easy > > command? > > Maybe you're not reading the questions correctly. In my original > question about how to do this, I asked how to do this using the queuing > system to refresh the keys -- I was asking for an integrated solution. > Above, you describe doing it with a cron job, which does not answer my > question. From kilian.cavalotti.work at gmail.com Wed Aug 13 16:35:16 2008 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Wed, 13 Aug 2008 16:35:16 -0700 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: <1083107021.95121218598171378.JavaMail.root@mail.vpac.org> References: <1083107021.95121218598171378.JavaMail.root@mail.vpac.org> Message-ID: <200808131635.18029.kilian.cavalotti.work@gmail.com> Hi Chris, On Tuesday 12 August 2008 08:29:31 pm Chris Samuel wrote: > We do use things like cpusets to try and limit the impact > that jobs can have on other jobs on the same nodes, I'm actually curious about how you implemented that. Do you have NUMA hardware? 
Do you use a resources manager, and is the cpusets creation process integrated with it? How do you manage concurrent jobs running on the same machine: do you pin them on specific CPUs and keep track of what CPU is busy and which is not, or do you have a way to just limit the number of CPUs they're using? As you can guess, I'd be interested in some technical details. :) Cheers, -- Kilian From pj at sgi.com Wed Aug 13 23:45:28 2008 From: pj at sgi.com (Paul Jackson) Date: Thu, 14 Aug 2008 01:45:28 -0500 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: <749528484.108171218692255152.JavaMail.root@mail.vpac.org> References: <56844017.103401218676989887.JavaMail.root@mail.vpac.org> <749528484.108171218692255152.JavaMail.root@mail.vpac.org> Message-ID: <20080814014528.6d7de113.pj@sgi.com> Chris wrote: > The 2.6 cpuset support in Torque came out of a long Would you have any pointers to some more details of what you've done here? I'm the maintainer, and one of the authors, of Linux 2.6 cpusets, and would like to do what I can with cpusets to make life easier (or at least no more painful) for cluster and MPI folks. My background comes more from the "big honkin NUMA iron" running a Single System Image on 100's or 1000's of CPUs (SGI Irix/Origin and later Linux/Altix), which was the "country of origin" for cpusets, so my interest (and ignorance) in asking this question is more to gain an understanding of how cpusets have been adapted to clusters, as I understand less well the needs of clusters, and what if anything cpusets might do here to be of more use. Totally totally trivial nit -- you wrote: > mkdir /dev/cpuset > mount -t cpuset none /dev/cpuset I prefer in my setups to have that mount command be: mount -t cpuset cpuset /dev/cpuset so that the mount shows up in the output of the mount(8) command with 'cpuset' in the mount 'device' field, not 'none'. -- I won't rest till it's the best ... 
Programmer, Linux Scalability Paul Jackson 1.940.382.4214 From pj at sgi.com Thu Aug 14 00:00:45 2008 From: pj at sgi.com (Paul Jackson) Date: Thu, 14 Aug 2008 02:00:45 -0500 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: <749528484.108171218692255152.JavaMail.root@mail.vpac.org> References: <56844017.103401218676989887.JavaMail.root@mail.vpac.org> <749528484.108171218692255152.JavaMail.root@mail.vpac.org> Message-ID: <20080814020045.9b1961bb.pj@sgi.com> Chris wrote: > creates a job cpuset which includes the specific cpus > (vnodes) that have been allocated by the scheduler, and > all the mems present (it currently makes no attempt to be > clever about that). I have recently open sourced a major user level C library, called libcpuset, which includes routines to map cpus to their corresponding memory nodes. See further the "User library support for cpusets" section, at the bottom of: http://oss.sgi.com/projects/cpusets/ Right now, just RPM forms of libcpuset (and the related libbitmask on which it depends) are on the above website. As soon as I can poke my webmaster, there should also be tarballs, as well as the key documents directly web accessible. See the libcpuset routine cpuset_localmems(). It maps a set of CPUs to the set of matching Memory Nodes. -- I won't rest till it's the best ...
Programmer, Linux Scalability Paul Jackson 1.940.382.4214 From mm at yuhu.biz Thu Aug 14 01:01:27 2008 From: mm at yuhu.biz (Marian Marinov) Date: Thu, 14 Aug 2008 11:01:27 +0300 Subject: [Beowulf] Distributed FS (Was: copying big files) In-Reply-To: <48A3C13E.5010508@aei.mpg.de> References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> <48A3C13E.5010508@aei.mpg.de> Message-ID: <200808141101.27918.mm@yuhu.biz> On Thursday 14 August 2008 08:23:10 Carsten Aulbert wrote: > Hi all > > Bernard Li wrote: > > I'd like to add a comment -- is the reason why this "issue" hasn't > > been brought up as frequently as I think it should be mainly because a > > lot of folks use distributed FS that eliminates the need to do one to > > many file transfers? > > Speaking of this, what do people use when they have say ~ 200 nodes with > an extra 1 TB drive in it. I know of glusterFS but very few others who > will be able to utilize this in a somewhat efficient manner. Are there > good/better alternatives out there? > > Extending your storage that way is just darn cheap. Have you looked at GFarm and Hadoop? Marian > > Cheers > > Carsten From pj at sgi.com Thu Aug 14 01:54:58 2008 From: pj at sgi.com (Paul Jackson) Date: Thu, 14 Aug 2008 03:54:58 -0500 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: <361311091.109861218701297236.JavaMail.root@mail.vpac.org> References: <20080814020045.9b1961bb.pj@sgi.com> <361311091.109861218701297236.JavaMail.root@mail.vpac.org> Message-ID: <20080814035458.291d5a03.pj@sgi.com> > the RPM that is was LGPL'd. Yes, libcpuset and libbitmask are LGPL. Have fun! -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson 1.940.382.4214 From pj at sgi.com Thu Aug 14 01:56:02 2008 From: pj at sgi.com (Paul Jackson) Date: Thu, 14 Aug 2008 03:56:02 -0500 Subject: Linux cpusets and HPC (was Re: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem?)
In-Reply-To: <440195143.109831218700996351.JavaMail.root@mail.vpac.org> References: <1703956069.109811218700794811.JavaMail.root@mail.vpac.org> <440195143.109831218700996351.JavaMail.root@mail.vpac.org> Message-ID: <20080814035602.453fbc37.pj@sgi.com> Chris wrote: > The main purpose we're using them for is a quick and > easy way to catch users who don't know better doing > things like running an OpenMP code as a single CPU job > and overloading a node (and causing chaos for other > users) when it discovers 8 cores. Let me see if I understand this. Is the following right: Without the cpuset constraint, such a 'bad' job could tell the cluster management software (PBS or Torque or ...) it needed just one CPU, which could end up putting it on a cluster node with say eight CPUs, along with some other jobs that expect to use the other seven CPUs. But then OpenMP code in that 'bad' job could notice it had eight CPUs, think to itself 'wow - cool', and proceed to hog all eight CPUs, messing up those other jobs. With the cpuset constraint, that 'bad' job -will- only be able to use that one CPU, and if OpenMP or other code in that job can't deal reasonably with that circumstance, well, tough, the owner of that job should fix something. But at least the other jobs that were hoping to use the other seven CPUs won't be bothered much by this. Did I say that right? > http://www.supercluster.org/pipermail/torquedev/2007-November/000748.html > http://www.supercluster.org/pipermail/torquedev/2008-January/000842.html > http://www.clusterresources.com/wiki/doku.php?id=torque:3.5_linux_cpuset_support Thanks for the links! -- I won't rest till it's the best ... 
Programmer, Linux Scalability Paul Jackson 1.940.382.4214 From niftyompi at niftyegg.com Thu Aug 14 13:45:01 2008 From: niftyompi at niftyegg.com (Nifty niftyompi Mitch) Date: Thu, 14 Aug 2008 13:45:01 -0700 Subject: [Beowulf] Infinipath memory parity errors In-Reply-To: References: Message-ID: <20080814204501.GA4599@hpegg.wr.niftyegg.com> On Thu, Aug 14, 2008 at 02:47:58PM -0400, Mark Kosmowski wrote: > > > Which driver is active? Which Infinipath software release > > > is installed? The tool "ipath_control -i" can show which... > > > > QLogic kernel.org driver > > 00: Version: Driver 2.0, InfiniPath_QLE7140, InfiniPath1 4.2, PCI 2, SW Compat 2 > > > > I think this is a 2.1 distribution, whereas there's at least 2.2 now > > available. > > > > > The kernel.org/ofed driver does not have as rich a set of error recovery > > > code for this card as the shipped driver. The recovery code was seen > > > as a badness and not accepted by the kernel.org folk.... > > > > Hmm... > > > > > With a kernel update the driver will not have been recompiled > > > and the kernel.org driver would become active. > > > > [Actually it wasn't just a kernel update -- the SuSE 9.3 system disk was > > removed and replaced by a 10.3 one shortly after I arrived, trashing all > > the configuration, so I'm a little at sea, without infiniband > > experience.] > > Have you tried searching for Infinipath drivers at the SUSE 10.3 > repositories? If you're using OpenSUSE rather than SLED / SLES, > perhaps it would be worth checking the community build repository too. > Maybe someone has already done the build work for you. I'm > continually amazed at the useful stuff I find there that I was certain > I'd have to build for myself. > > For that matter, a clean install may be in order as a last resort. > > Good luck! Yes you have the OFED/kernel.org driver. For this card do pull and load your drivers from the QLogic support download area! For this card the latest and greatest will be on the QLogic site. 
Also, pick up the documentation at the same time..... The driver is built from source on the system for the active kernel. And yes recall the OFED/kernel.org driver is missing the nonstop recovery code for parity errors that have been observed on this card. The kernel.org driver for this card detects the error and stops rather than risk passing a data error. -- T o m M i t c h e l l Got a great hat... now what. From hbugge at platform.com Thu Aug 14 13:32:22 2008 From: hbugge at platform.com (Håkon Bugge) Date: Thu, 14 Aug 2008 22:32:22 +0200 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: <48A46F1D.9030204@ldeo.columbia.edu> References: <1495958777.108261218692928122.JavaMail.root@mail.vpac.org> <48A4561C.2040302@ldeo.columbia.edu> <48A46F1D.9030204@ldeo.columbia.edu> Message-ID: Gus' numbers make sense to me. I assume his workload consists of multiple sized jobs, serial, modest parallel, and parallel jobs using all resources. Without pre-emptive scheduling, the batch queue system has to starve the system in order to run the larger jobs. Obviously, before a job which consumes all resources can start, all resources have to be idle. Which means no jobs can be scheduled on those resources, even though they're idle. Another interesting metric is of course how many of the jobs run to successful completion, i.e., are not killed due to resource limits, or crashes, or for other reasons. That's what I call net vs. gross utilization. Thanks, Håkon (opinions of myself, now working for Platform Computing) At 19:45 14.08.2008, Gus Correa wrote: >Hello Mark and list > >The measurement was based on walltime. >It just refers to the user occupancy of the cluster, versus what was left idle >(for all reasons, e.g. lack of resources to >serve large queued jobs, lack of enough jobs to fill all nodes, etc). >The number is simply the utilized resources >divided by the available resources.
>This gives a coarse measure of machine utilization. >Take the walltime of all jobs multiplied by the >number of nodes (or CPUs) each job used, >sum them, >and divide by the duration of this period (say, >one year) times the number of nodes (or CPUs) in the cluster. > >Maybe 70% utilization is low compared to >airplane seats, subway occupancy, hotel rooms, restaurant tables, >Internet, telephone networks, and perhaps to other clusters. >I don't know, I am not an operations research person. >The only other number I could find for a (well >used) large cluster in our science field was below 70%, >and now Chris mentioned 77%. > >Are there published numbers of resource utilization for other machines, >say, public clusters in the US, Canada, Europe, world? > >Yes, our cluster is dedicated to a small group >of earth scientists and students (20-40 users) >and it is small (32 nodes, 64 cpus). Cluster >size and user population size most likely make a difference, >but in any case, I would be interested in seeing any other numbers >for any kind of cluster. > >Regards, >Gus Correa > >-- >--------------------------------------------------------------------- >Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu >Lamont-Doherty Earth Observatory - Columbia University >P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA >--------------------------------------------------------------------- > > >Mark Hahn wrote: > >>>>It appears we've averaged almost 77% utilisation >>>>since the beginning of 2004 (when our current usage >>>>system records begin). >>>Thank you very much for the data point! >>> >>>I've insisted here that above 70% utilization is very good, >>>given the random nature of demand and jobs on queues in the academia, etc. >> >> >>that sounds very strange to me. do you really >>mean that 30% of your cpu time is idle? I wonder whether there could be a big >>difference in methodology. 
for instance, if you're using an MPI library >>(probably based on tcp) that doesn't spin-wait but blocks as for disk IO >>say 20% of the time, then you might consider this to be 80% utilization. >>an MPI that spin-waits might show 100% with the same perf/throughput. >> >>70% utilization is terrible if you really mean "fraction of allocatable cpu >>time occupied by jobs". that is at the job >>scheduler level, not at the kernel scheduler level. >> >>>However, some folks would want more than 90% efficiency to get happy. >> >> >>I would be embarrassed to have less than 90%. perhaps 70% would make sense >>for a cluster dedicated to a small or >>narrowly-defined group. I find that a >>sufficient userbase means you _always_ have >>something to run, of any size/resource available. > > >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org >To change your subscription (digest mode or >unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Håkon Bugge Chief Technologist mob. +47 92 48 45 14 off. +47 21 37 93 19 fax. +47 22 23 36 66 Hakon.Bugge at platform.com Skype: hakon_bugge Platform Computing, Inc. From alscheinine at tuffmail.us Thu Aug 14 20:32:48 2008 From: alscheinine at tuffmail.us (Alan Louis Scheinine) Date: Thu, 14 Aug 2008 22:32:48 -0500 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: <48A46F1D.9030204@ldeo.columbia.edu> References: <1495958777.108261218692928122.JavaMail.root@mail.vpac.org> <48A4561C.2040302@ldeo.columbia.edu> <48A46F1D.9030204@ldeo.columbia.edu> Message-ID: <48A4F8E0.7020806@tuffmail.us> This thread has moved to the question of utilization, discussed by Mark Hahn, Gus Correa and Håkon Bugge. In my previous job most people developed code, though test runs could run for days and use as many as 64 cores.
It was convenient for most people to have immediate access due to the excess computation capacity whereas some people in top management wanted maximum utilization. I was at a parallel computing workshop where other people described the contrast between their needs and the goals of their computer centers. The computer centers wanted maximum utilization whereas the spare capacity of the various clusters in the labs was especially useful for the researchers. They could bring to bear the computational power of their informally administered clusters for special tasks such as when a huge block of data needed to be analyzed in nearly realtime to see if an experiment of limited duration was going well. When most work involves code development, waiting for jobs in a batch queue means that the human resources are not being used efficiently. Of course, maximum utilization of computer resources is necessary for production code, I just want to emphasize the wide range of needs. I would like to add that maximum utilization and fast turnaround are contradictory goals, it would seem to me based on the following reasoning. Consider packing a truck with boxes where the height of the boxes represents the number of cores and the width of the boxes represents the time of execution (leaving aside third spatial dimension). To most efficiently solve the packing problem we would like to have all boxes visible on the loading dock before we start packing. On the other hand, if boxes arrive a few at a time and we must put the boxes into the truck as they arrive (low queue wait time) then the packing will not be efficient.
Moreover, as a very rough estimate, the size of the box defines the scale of the problem, specifically, if the average running time is 4 hours, then to have efficient "packing" the time spent waiting in a queue must be on the order of at least 4 and more likely 8 hours in order to have enough requests visible to be able to find an efficient solution to the scheduling problem. Best regards, Alan -- Alan Scheinine 5010 Mancuso Lane, Apt. 621 Baton Rouge, LA 70809 Email: alscheinine at tuffmail.us Office phone: 225 578 0294 Mobile phone USA: 225 288 4176 [+1 225 288 4176] From gerry.creager at tamu.edu Thu Aug 14 22:08:27 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Fri, 15 Aug 2008 00:08:27 -0500 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: <48A4F8E0.7020806@tuffmail.us> References: <1495958777.108261218692928122.JavaMail.root@mail.vpac.org> <48A4561C.2040302@ldeo.columbia.edu> <48A46F1D.9030204@ldeo.columbia.edu> <48A4F8E0.7020806@tuffmail.us> Message-ID: <48A50F4B.70703@tamu.edu> Alan Louis Scheinine wrote: > This thread has moved to the question of utilization, > discussed by Mark Hahn, Gus Correa and Håkon Bugge. > In my previous job most people developed code, though test runs > could run for days and use as many as 64 cores. It was > convenient for most people to have immediate access due to > the excess computation capacity whereas some people in top > management wanted maximum utilization. > > I was at a parallel computing workshop where other people > described the contrast between their needs and the goals of > their computer centers. The computer centers wanted maximum > utilization whereas the spare capacity of the various clusters > in the labs was especially useful for the researchers.
They > could bring to bear the computational power of their informally > administered clusters for special tasks such as when a huge > block of data needed to be analyzed in nearly realtime to see > if an experiment of limited duration was going well. > > When most work involves code development, waiting for jobs in > a batch queue means that the human resources are not being > used efficiently. Of course, maximum utilization of computer > resources is necessary for production code, I just want to > emphasize the wide range of needs. > > I would like to add that maximum utilization and fast > turnaround are contradictory goals, it would seem to me based > on the following reasoning. Consider packing a truck with > boxes where the height of the boxes represents the number > of cores and the width of the boxes represents the time of > execution (leaving aside third spatial dimension). To most > efficiently solve the packing problem we would like to have > all boxes visible on the loading dock before we start packing. > On the other hand, if boxes arrive a few at a time and we must > put the boxes into the truck as they arrive (low queue wait time) > then the packing will not be efficient. Moreover, as a very > rough estimate, the size of the box defines the scale of the > problem, specifically, if the average running time is 4 hours, > then to have efficient "packing" the time spent waiting in a > queue must be on the order of at least 4 and more likely 8 hours > in order to have enough requests visible to be able to find > an efficient solution to the scheduling problem. An interesting analogy, and the thread as a whole has been interesting. However, it doesn't even begin to really address near-realtime processing requirements. Examples of these are common in the weather modeling I'm engaged in.
In some cases, looking at severe weather and predictive models, a model run needs to start shortly after a watch or warning is issued, something that's controlled by humans and is not scheduled, hence somewhat difficult to model for job scheduling. These models would likely be re-run with new data assimilated into the forcings, and a new solution produced. Similarly, models of toxic release plumes are driven by unscheduled events and have a high priority and a low queue-wait-time requirement. Other weather models are more predictable but have fairly hard requirements for when output must be available. Conventional batch scheduling handles these conditions pretty poorly. A full queue with even reasonable matching of available cores to requests isn't likely to get these jobs out very quickly on a loaded system. Preemption is the easy answer but unpopular with administrators who have to answer the phone, users whose jobs are preempted (some never to see their jobs return), and the guy who's the preemptor... who gets blamed for all the problems. Worse, arbitrary preemption assignment means someone made a value judgment that someone's science is more important than someone else's, a sure plan for troubles when the parties all gather somewhere... like a faculty meeting. OK, so I've laid out a piece of the problem. I've got some ideas on solutions, and avenues for investigation to address these, but I'd like to see others' ideas. I don't want to influence the outcome any more than I already have. Oh, and, yeah, I'm aware of SPRUCE but I see a few potential problems there, although that framework has some potential.
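To put a rough number on how long an urgent job can wait on a loaded system when nothing is preempted, here is a minimal Python sketch. It assumes the urgent job goes straight to the head of the queue, that remaining run times are known exactly, and that cores only become free as running jobs finish. The 128-core cluster and the job list are made up for illustration.

```python
def earliest_start(total_cores, running, needed):
    """Return the earliest time (hours from now) at which `needed` cores are free.

    `running` is a list of (cores, remaining_hours) for jobs already on the
    machine. Nothing is preempted, so cores are released only at job end.
    """
    if needed > total_cores:
        raise ValueError("job is larger than the cluster")
    free = total_cores - sum(cores for cores, _ in running)
    if free >= needed:
        return 0.0
    # Release cores in order of job completion until the request fits.
    for cores, remaining in sorted(running, key=lambda job: job[1]):
        free += cores
        if free >= needed:
            return float(remaining)
    raise AssertionError("unreachable: every core has been released")


# Hypothetical fully loaded 128-core cluster: (cores, remaining hours).
running = [(64, 6.0), (32, 2.0), (16, 9.0), (16, 0.5)]
print(earliest_start(128, running, 48))  # 48-core urgent job
print(earliest_start(128, running, 64))  # 64-core urgent job
```

Even with the urgent job at the head of the queue, its start time is set entirely by when the running jobs happen to finish, which is the difficulty with conventional batch scheduling described above.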
gc -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From carsten.aulbert at aei.mpg.de Thu Aug 14 22:09:50 2008 From: carsten.aulbert at aei.mpg.de (Carsten Aulbert) Date: Fri, 15 Aug 2008 07:09:50 +0200 Subject: [Beowulf] Distributed FS (Was: copying big files) In-Reply-To: <200808141101.27918.mm@yuhu.biz> References: <20080808153713.GA15753@gretchen.aei.uni-hannover.de> <48A3C13E.5010508@aei.mpg.de> <200808141101.27918.mm@yuhu.biz> Message-ID: <48A50F9E.9000207@aei.mpg.de> Marian Marinov wrote: > Have you looked at GFarm and Hadoop ? Very briefly at GFarm ("ages" ago), Hadoop was unknown to me. Thanks! Carsten From hahn at mcmaster.ca Thu Aug 14 22:23:53 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 15 Aug 2008 01:23:53 -0400 (EDT) Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: References: <1495958777.108261218692928122.JavaMail.root@mail.vpac.org> <48A4561C.2040302@ldeo.columbia.edu> <48A46F1D.9030204@ldeo.columbia.edu> Message-ID: > Gus' numbers makes sense to me. I assume his workload consists of multiple > sized jobs, serial, modest parallel, and parallel jobs using all resources. > Without pre-emptive scheduling, the batch queue system has to starve the > system in order to run the larger jobs. unless backfill can utilize those temporarily idle cpus. > Obviously, before a job which > consumes all resources starts , then all resources have to be idle. Which > means no jobs can't be scheduled, even though they're idle. true enough, but does depend on the size of large, high-prio jobs relative to the size of the cluster.
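The backfill point can be illustrated with a small Python sketch: while the cluster drains for a job that needs every core, queued jobs may run in the idle gaps as long as they finish before the big job could start. This is a deliberately simplified greedy scheme, not the algorithm of any particular scheduler. It assumes exact run-time estimates and that a backfilled job must fit inside a single block of freed cores; all job names and sizes are hypothetical.

```python
def backfill_during_drain(total_cores, running, queued):
    """Greedy backfill while draining for a job that needs all cores.

    running: list of (cores, remaining_hours) for executing jobs (non-empty).
    queued:  list of (name, cores, hours) waiting behind the big job.
    Returns (started, idle_before, idle_after), where `started` is a list of
    (name, start_time) and the idle figures are core-hours left unused
    before the big job starts.
    """
    drain_time = max(remaining for _, remaining in running)
    # Each hole is [cores, free_from]: cores idle from free_from to drain_time.
    holes = [[cores, remaining] for cores, remaining in running
             if remaining < drain_time]
    spare = total_cores - sum(cores for cores, _ in running)
    if spare > 0:
        holes.append([spare, 0.0])
    idle_before = sum(cores * (drain_time - start) for cores, start in holes)

    started = []
    used = 0.0
    for name, cores, hours in queued:
        for hole in holes:
            hole_cores, start = hole
            # Only start a job that ends before the big job could begin.
            if cores <= hole_cores and start + hours <= drain_time:
                started.append((name, start))
                used += cores * hours
                hole[0] -= cores                      # cores still idle in this hole
                holes.append([cores, start + hours])  # cores idle again afterwards
                break
    return started, idle_before, idle_before - used


# Hypothetical 64-core cluster draining for a 64-core job.
running = [(32, 8.0), (16, 2.0), (16, 5.0)]
queued = [("a", 8, 4.0), ("b", 16, 3.0), ("c", 16, 7.0), ("d", 8, 1.0)]
started, idle_before, idle_after = backfill_during_drain(64, running, queued)
print(started, idle_before, idle_after)
```

In this made-up case the idle time during the drain drops from 144 to 56 core-hours, and job "c" is left waiting because it could not finish before the full-machine job is due to start.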
> Another interesting metric is of course how many of the jobs runs to > successful completion, i.e., are not killed due to resource limits, or > crashes, or for other reasons. That's what I call net vs. gross utilization. surely this survival rate is quite high, no? again, it depends largely on the design of the cluster (I see few node crashes, maybe 1 of 768 nodes per week, and few resource crashes (perhaps a couple buggy jobs per week)) From csamuel at vpac.org Fri Aug 15 00:23:37 2008 From: csamuel at vpac.org (Chris Samuel) Date: Fri, 15 Aug 2008 17:23:37 +1000 (EST) Subject: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem? In-Reply-To: <1359915506.118821218784450441.JavaMail.root@mail.vpac.org> Message-ID: <1670664645.118911218785017196.JavaMail.root@mail.vpac.org> ----- "Mark Hahn" wrote: > I would be embarassed to have less than 90%. perhaps > 70% would make sense for a cluster dedicated to a small > or narrowly-defined group. Also depends on external criteria too. For instance we have to try to balance usage between the 8 different universities who are partners in VPAC, and they have hugely varying usage profiles. So we are not given the luxury of just letting the scheduler completely reorder queued jobs for backfill and we have to impose limits on number of CPUs per user and institute as well as using fair share to count in prioritising jobs. We could get a higher utilisation if we gave carte blanche to the handful of users who queue hundreds of single CPU jobs each and had them backfill every gap, but then we'd get hammered by the members on the board (quite rightly) for not giving others a look in. :-) cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. 
Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From mark.kosmowski at gmail.com Fri Aug 15 05:03:32 2008 From: mark.kosmowski at gmail.com (Mark Kosmowski) Date: Fri, 15 Aug 2008 08:03:32 -0400 Subject: utilization / efficiency - was: Re: [Beowulf] Can one Infiniband net support MPI and a Message-ID: > > Message: 4 > Date: Fri, 15 Aug 2008 00:08:27 -0500 > From: Gerry Creager > Subject: Re: [Beowulf] Can one Infiniband net support MPI and a > parallel filesystem? > To: alscheinine at tuffmail.us > Cc: Beowulf > Message-ID: <48A50F4B.70703 at tamu.edu> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > Alan Louis Scheinine wrote: > > This thread has moved to the question of utilization, > > discussed by Mark Hahn, Gus Correa and Håkon Bugge. > > In my previous job most people developed code, though test runs > > could run for days and use as many as 64 cores. It was > > convenient for most people to have immediate access due to > > the excess computation capacity whereas some people in top > > management wanted maximum utilization. > > > > I was at a parallel computing workshop where other people > > described the contrast between their needs and the goals of > > their computer centers. The computer centers wanted maximum > > utilization whereas the spare capacity of the various clusters > > in the labs were especially useful for the researchers. They > > could bring to bear the computational power of their informally > > administered clusters for special tasks such as when a huge > > block of data needed to be analyzed in nearly realtime to see > > if an experiment of limited duration was going well. > > > > When most work involves code development, waiting for jobs in > > a batch queue means that the human resources are not being > > used efficiently. Of course, maximum utilization of computer > > resources is necessary for production code, I just want to > > emphasize the wide range of needs.
> > > > I would like to add that maximum utilization and fast > > turnaround are contradictory goals, it would seem to me based > > on the following reasoning. Consider packing a truck with > > boxes where the height of the boxes represents the number > > of cores and the width of the boxes represents the time of > > execution (leaving aside third spatial dimension). To most > > efficiently solve the packing problem we would like to have > > all boxes visible on the loading dock before we start packing. > > On the other hand, if boxes arrive a few at a time and we must > > put the boxes into the truck as they arrive (low queue wait time) > > then the packing will not be efficient. Moreover, as a very > > rough estimate, the size of the box defines the scale of the > > problem, specifically, if the average running time is 4 hours, > > then to have efficient "packing" the time spent waiting in a > > queue must be on the order of at least 4 and more likely 8 hours > > in order to have enough requests visible to be able to find > > an efficient solution to the scheduling problem. > So far the utilization discussion has treated the number of CPUs as the only bottleneck. Especially for general-use clusters, RAM may also be a bottleneck. It is easy to imagine a case where giving a large-RAM job 16 instead of 32 cores, with each node allocating 75% of its RAM to that job, might be preferable, so that a small-RAM, CPU-intensive job could then use the remaining 16 cores with each node allocating 10% of its RAM to this second job. Multi-dimensional scheduling gets difficult quickly when different jobs have very different resource profiles. > An interesting analogy, and further, the thread has been interesting. > However, it doesn't even begin to really address near-realtime > processing requirements. Examples of these are common in the weather > modeling I'm engaged in.
In some cases, looking at severe weather and > predictive models, a model needs to initiate shortly after a watch or > warning is issued, something that's controlled by humans and is not > scheduled, hence somewhat difficult to model for job scheduling. These > models would likely be re-run with new data assimilated into the > forcings, and a new solution produced. Similarly, models of toxic > release plumes are unscheduled events with a high priority and low > queue-wait time requirement. > > Other weather models are more predictable but have fairly hard > requirements for when output must be available. > > Conventional batch scheduling handles these conditions pretty poorly. A > full queue with even reasonable matching of available cores to request > isn't likely to get these jobs out very quickly on a loaded system. > Preemption is the easy answer but unpopular with administrators who have > to answer the phone, users whose jobs are preempted (some never to see > their jobs return), and the guy who's the preemptor... who gets blamed > for all the problems. Worse, arbitrary preemption assignment means > someone made a value judgment that someone's science is more important > than someone else's, a sure plan for troubles when the parties all > gather somewhere... like a faculty meeting. This may mark me as hopelessly naive, but for emergency-critical clusters, couldn't there be a terms-of-use agreement in place stating that the cluster exists for emergency events and that non-emergency usage, while allowed so that the cluster creates more value, is subject to preemption in emergency situations? Maybe have some sort of policy in place to give restarts of preempted jobs an earlier place in the post-emergency queue? At least this way folks might be upset that their jobs died unexpectedly due to preemption, but reasonable folks (I know, a big assumption here) will understand that this was explained at the beginning.
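A policy along these lines could be sketched as follows in Python. This is only an illustration of the idea, not how any real batch system implements preemption. The three priority classes, the choice to preempt the most recently started jobs first (on the assumption that they lose the least work), and all job names are hypothetical.

```python
import heapq
import itertools

EMERGENCY, RESTART, NORMAL = 0, 1, 2  # smaller number is served first
_counter = itertools.count()          # tie-breaker: FIFO within a class


def enqueue(queue, name, cores, klass=NORMAL):
    """Add a job to the priority queue (a heap of tuples)."""
    heapq.heappush(queue, (klass, next(_counter), name, cores))


def preempt_for(emergency_cores, free_cores, running, queue):
    """Free enough cores for an emergency job by preempting running jobs.

    running: list of (name, cores) for non-emergency jobs, oldest first.
    Preempted jobs are requeued in the RESTART class, so they run ahead of
    ordinary queued work once the emergency is over.
    Returns the names of the preempted jobs.
    """
    available = free_cores + sum(cores for _, cores in running)
    if available < emergency_cores:
        raise ValueError("emergency job is larger than what can be freed")
    preempted = []
    while free_cores < emergency_cores:
        name, cores = running.pop()  # most recently started job
        free_cores += cores
        enqueue(queue, name, cores, RESTART)
        preempted.append(name)
    return preempted


queue = []
enqueue(queue, "waiting-1", 16)
enqueue(queue, "waiting-2", 8)
running = [("sim-A", 32), ("sim-B", 16), ("sim-C", 16)]

preempted = preempt_for(24, 0, running, queue)
order = [heapq.heappop(queue)[2] for _ in range(len(queue))]
print(preempted)  # jobs stopped to make room
print(order)      # post-emergency queue order
```

The preempted jobs come back ahead of work that was merely waiting, which is the "earlier place in the post-emergency queue" suggested above; whether that is fair to the waiting users is exactly the kind of policy question a terms-of-use agreement would have to settle.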
Mark Kosmowski > > OK, so I've laid out a piece of the problem. I've got some ideas on > solutions, and avenues for investigation to address these but I'd like > to see others' ideas. I don't want to influence the outcome any more > than I already have. > > Oh, and, yeah, I'm aware of SPRUCE but I see a few potential problems > there, although that framework has some potential. > > gc > -- > Gerry Creager -- gerry.creager at tamu.edu From d.love at liverpool.ac.uk Fri Aug 15 08:04:05 2008 From: d.love at liverpool.ac.uk (Dave Love) Date: Fri, 15 Aug 2008 16:04:05 +0100 Subject: [Beowulf] Re: Infinipath memory parity errors References: Message-ID: <87iqu2s4ka.fsf@liv.ac.uk> "Mark Kosmowski" writes: > Have you tried searching for Infinipath drivers at the SUSE 10.3 > repositories? They're certainly not in opensuse separate from the kernel packages. > If you're using OpenSUSE rather than SLED / SLES, > perhaps it would be worth checking the community build repository too. > Maybe someone has already done the build work for you. Thanks, but it's not a problem building from source, and I want the source to be able to do kernel upgrades. > I'm continually amazed at the useful stuff I find there that I was > certain I'd have to build for myself. [SuSE isn't nearly as good as Debian in that respect.] > For that matter, a clean install may be in order as a last resort. Er, no! The relevant info I was missing was that there was better module source to use, but thanks anyhow. In case it helps anyone else: openSuSE 10.3 (openSuSE generally?) isn't supported in the InfiniPath 2.2 distribution, and the modules collection won't build directly with make-install.sh. I hacked build-guards.sh to recognize the system as 2.6.22_FC6, since it has a 2.6.22 kernel, and killed the code in drivers-2.6.22_FC6/kernel_addons/backport/2.6.22/include/scsi/scsi_cmnd.h that duplicates stuff in the scsi_cmnd.h in the SuSE kernel sources.
(Although I only want the ib_ipath module, it was simplest just to make that change.) Then it will build/install. From d.love at liverpool.ac.uk Fri Aug 15 08:06:01 2008 From: d.love at liverpool.ac.uk (Dave Love) Date: Fri, 15 Aug 2008 16:06:01 +0100 Subject: [Beowulf] Re: Infinipath memory parity errors In-Reply-To: <20080814204501.GA4599@hpegg.wr.niftyegg.com> (Nifty niftyompi Mitch's message of "Thu, 14 Aug 2008 13:45:01 -0700") References: