From john.hearns at streamline-computing.com Fri Aug 1 00:28:45 2008
From: john.hearns at streamline-computing.com (John Hearns)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] Re: Linux cluster authenticating against multiple Active Directory domains
In-Reply-To: <371991977.6661217569032542.JavaMail.root@mail.vpac.org>
References: <371991977.6661217569032542.JavaMail.root@mail.vpac.org>
Message-ID: <1217575735.4977.1.camel@Vigor13>

On Fri, 2008-08-01 at 15:37 +1000, Chris Samuel wrote:

> We'd prefer to steer clear of Kerberos, it introduces
> arbitrary job limitations through ticket lives that
> are not tolerable for HPC work.
>

Kerberos is heavily used at CERN. They have a solution for that issue -
the job can ask for an extension to the tickets.
Sorry, I don't have a reference handy, but it's worth documenting this
for the list.

From hahn at mcmaster.ca Fri Aug 1 07:06:17 2008
From: hahn at mcmaster.ca (Mark Hahn)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] Re: Building new cluster - estimate (Ivan Oleynik)
In-Reply-To: <488FEF6A.30003@harddata.com>
References: <200807261957.m6QJv6HE031997@bluewest.scyld.com> <488BD53A.2010907@harddata.com> <488EC70F.8090906@harddata.com> <488FEF6A.30003@harddata.com>
Message-ID: 

> BTW, where a lot of people are jumping on the "Get IPMI" bandwagon, I
> suggest getting PDUs with remote IP controlled ports is more useful.

the thing I don't like about controlled PDUs is that they're pretty
harsh - don't you expect a higher failure rate of node PSUs if you go
yanking the power this way?

I have only seen a handful of different IPMI interfaces, but they all
were reasonably reliable.

> If you set your machines BIOS to start on power up, it is trivial to stop and
> start machines with the PDU power, and that is definitely reliable.

huh? we're talking about network-attached IPMI, which is fully independent
of the controlled motherboard's bios.
are you talking about those hybrid systems where the IPMI controller
shares an ethernet port with the host? or IPMI through a kernel driver?

> Plus, with a lot of those PDUs you can add thermal sensors and trigger power
> off on high temperature conditions.

IPMI normally provides all the motherboard's sensors as well. it seems like
those are far more relevant than the temp of the PDU...

using lm_sensors is a poor substitute for IPMI.

From mathog at caltech.edu Fri Aug 1 09:11:25 2008
From: mathog at caltech.edu (David Mathog)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] reboot without passing through BIOS?
Message-ID: 

Kilian CAVALOTTI wrote:

> I may be totally missing the point, but doesn't the memory need to be
> physically (as in electrically) reset in order to clean out those bad
> bits? And doesn't this require a hard reboot, for the machine to be
> power cycled, so that memory cells are reinitialized?

The type of errors I am talking about are random bit flips, for
instance, from ambient radiation. When the OS reboots it will overwrite
memory and so remove those errors. The affected cells were not damaged,
just in the wrong state. This should work so long as none of the
flipped bits prevent kexec from doing its job. Presumably the OS will
also reinitialize all memory structures stored elsewhere in hardware (as
in storage controllers and NICs) since it should not trust the BIOS to
have done this.

Regards,

David Mathog
mathog@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From hahn at mcmaster.ca Fri Aug 1 09:12:00 2008
From: hahn at mcmaster.ca (Mark Hahn)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] Re: Building new cluster - estimate (Ivan Oleynik)
In-Reply-To: <48932EAB.1070500@harddata.com>
References: <200807261957.m6QJv6HE031997@bluewest.scyld.com> <488BD53A.2010907@harddata.com> <488EC70F.8090906@harddata.com> <488FEF6A.30003@harddata.com> <48932EAB.1070500@harddata.com>
Message-ID: 

>> the thing I don't like about controlled PDUs is that they're pretty
>> harsh - don't you expect a higher failure rate of node PSUs if you go
>> yanking the power this way?
> Why?
> If nodes shutdown, on commands from the scheduler, that is good.
> And, if they do not, how is cutting power by the PDU socket any different
> than a power switch on the node?

I don't design PSU's, but yanking the cord seems "rude" compared to
simply raising the "please power off" signal. the latter is part of
all PSU's these days, and is what IPMI uses (via I2C, I guess).

perhaps it's superstition - I do always prefer to use the off button,
rather than yanking the cord. but thinking of how a switching PSU works,
perhaps it doesn't really matter - it views the input power as highly
variable anyway (ie, 90-250V, and with that annoying 50-60 Hz flutter ;)

>>> If you set your machines BIOS to start on power up, it is trivial to stop
>>> and start machines with the PDU power, and that is definitely reliable.
>>
>> huh? we're talking about network-attached IPMI, which is fully independent
>> of the controlled motherboard's bios. are you talking about those hybrid
>> systems where the IPMI controller shares an ethernet port with the host?
>> or IPMI through a kernel driver?
>>
> Either.
> Most share a port, some have dedicated ports on board.

I'm not sure about the "most" part - HP's don't, and it looks like supermicro
offers options both ways.
all the recent tyan boards I've looked at had dedicated IPMI/OPMA
onboard. all HP machines have dedicated ports.

but to me this has all the hallmarks of a religious issue, so...

From kus at free.net Fri Aug 1 09:27:44 2008
From: kus at free.net (Mikhail Kuzminsky)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] Re: Building new cluster - estimate (Ivan Oleynik)
In-Reply-To: 
Message-ID: 

In message from Mark Hahn (Fri, 1 Aug 2008 10:06:17 -0400 (EDT)):
>> ... Plus, with a lot of those PDUs you can add thermal sensors and
>> trigger power off on high temperature conditions.
>IPMI normally provides all the motherboard's sensors as well. it
>seems like those are far more relevant than the temp of the PDU...
>
>using lm_sensors is a poor substitute for IPMI.

IMHO the only disadvantage of lm_sensors is the problem of building
the right sensors.conf file.

Mikhail Kuzminsky
Computer Assistance to Chemical Research Center
Zelinsky Institute of Organic Chemistry
Moscow

From jmdavis1 at vcu.edu Fri Aug 1 07:50:11 2008
From: jmdavis1 at vcu.edu (Mike Davis)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] Re: Building new cluster - estimate (Ivan Oleynik)
In-Reply-To: 
References: <200807261957.m6QJv6HE031997@bluewest.scyld.com> <488BD53A.2010907@harddata.com> <488EC70F.8090906@harddata.com> <488FEF6A.30003@harddata.com>
Message-ID: <489322A3.1010600@vcu.edu>

>
> the thing I don't like about controlled PDUs is that they're pretty
> harsh - don't you expect a higher failure rate of node PSUs if you go
> yanking the power this way?
>
> I have only seen a handful of different IPMI interfaces, but they all
> were reasonably reliable.

In using the ethernet interfaced PDU's for the past 8 years on several
clusters, I haven't noticed a high PSU failure rate. In all honesty, we
haven't had a unit connected to them ever lose a power supply. One
benefit of using the PDU solution is one enet for X machines rather than
X additional enets for X machines.
That being said, IPMI offers additional functionality not provided
by the PDU's.

Mike Davis

From hahn at mcmaster.ca Fri Aug 1 10:28:47 2008
From: hahn at mcmaster.ca (Mark Hahn)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] Re: Building new cluster - estimate (Ivan Oleynik)
In-Reply-To: 
References: 
Message-ID: 

>> using lm_sensors is a poor substitute for IPMI.
>
> IMHO the only disadvantage of lm_sensors is the problem of building the right
> sensors.conf file.

well, there's the little matter of being able to get data when
the node is crashed, offline, busy, etc. I also very much like
the ability to query status, temps, fan speeds out-of-band - that is,
without stealing cycles from the job.

From maurice at harddata.com Fri Aug 1 06:48:49 2008
From: maurice at harddata.com (Maurice Hilarius)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] Re: Building new cluster - estimate (Ivan, Oleynik)
In-Reply-To: <200808010608.m716894U008585@bluewest.scyld.com>
References: <200808010608.m716894U008585@bluewest.scyld.com>
Message-ID: <48931441.9050603@harddata.com>

Chris Samuel wrote:
..
>> > BTW, where a lot of people are jumping on the "Get IPMI"
>> > bandwagon, I suggest getting PDUs with remote IP controlled
>> > ports is more useful.
>
> Well, it depends on what you're trying to do, if it's get
> the system and CPU temperatures then a PDU isn't much cop.. :)

True, but on most boards lm_sensors will do that for you for free..

-- 
With our best regards,

Maurice W. Hilarius        Telephone: 01-780-456-9771
Hard Data Ltd.             FAX:       01-780-456-9772
11060 - 166 Avenue         email: maurice@harddata.com
Edmonton, AB, Canada       http://www.harddata.com/
T5X 1Y3

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20080801/16805ed0/attachment.html

From pal at di.fct.unl.pt Fri Aug 1 08:40:42 2008
From: pal at di.fct.unl.pt (Paulo Afonso Lopes)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] Seeing ECC errors since upgraded from Opteron 246 to 275
Message-ID: <2405.10.170.133.93.1217605242.squirrel@www.di.fct.unl.pt>

Dear all:

Around 2/Apr I removed 2 Opteron 246s and their "companion" 4x 512 MB DIMMs
from two HP DL145-G2s, leaving them empty, to populate two other HPs (which
got 2 CPUs and 4GB per node). Then I installed 2 dual-core Opterons per
emptied DL145-G2, together with 4 sticks of 1GB (2 sticks per CPU).

So I have 2 DL145-G2 nodes with 2 single-core 246 / 4GB each, and 2
DL145-G2 nodes with 2 dual-core 275 / 4GB each.

On 18th/Apr, one of the dual-core nodes crashed with an ECC error. From
IPMI, for that node,

04/18/2008 | 20:26:26 | Memory #0x02 | Uncorrectable ECC | Asserted
06/18/2008 | 12:00:16 | Memory #0x02 | Uncorrectable ECC | Asserted
06/23/2008 | 11:58:34 | Memory #0x02 | Uncorrectable ECC | Asserted
07/19/2008 | 22:41:12 | Memory #0x02 | Uncorrectable ECC | Asserted
07/22/2008 | 17:18:00 | Memory #0x02 | Uncorrectable ECC | Asserted
07/23/2008 | 22:08:15 | Memory #0x02 | Uncorrectable ECC | Asserted
07/28/2008 | 17:52:23 | Memory #0x02 | Uncorrectable ECC | Asserted

On 07/19 the memory of CPU0 was replaced; on the 27th, the remaining
memory was replaced. ECC crashes do continue, from 1 per day to 1 per
week.

07/28: first ECC error on the other Opteron-275 populated node.

07/28/2008 | 18:54:23 | Memory #0x02 | Uncorrectable ECC | Asserted

All nodes have IB boards, and I swapped the boards between the first
crashing and second crashing nodes (that's when, a few days later, the
second node crashed for the very first time).
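
To see how the failure rate on the first node has changed over time, the event log entries listed above can be turned into intervals between events. The following is only a rough sketch: it assumes exactly the `date | time | sensor | event | state` layout pasted above with dates in MM/DD/YYYY form, and the function names `parse_sel` and `ecc_intervals` are made up for illustration.

```python
from datetime import datetime

# The seven entries reported for the first crashing node, copied from the message.
SEL_TEXT = """\
04/18/2008 | 20:26:26 | Memory #0x02 | Uncorrectable ECC | Asserted
06/18/2008 | 12:00:16 | Memory #0x02 | Uncorrectable ECC | Asserted
06/23/2008 | 11:58:34 | Memory #0x02 | Uncorrectable ECC | Asserted
07/19/2008 | 22:41:12 | Memory #0x02 | Uncorrectable ECC | Asserted
07/22/2008 | 17:18:00 | Memory #0x02 | Uncorrectable ECC | Asserted
07/23/2008 | 22:08:15 | Memory #0x02 | Uncorrectable ECC | Asserted
07/28/2008 | 17:52:23 | Memory #0x02 | Uncorrectable ECC | Asserted
"""


def parse_sel(text):
    """Return (timestamp, sensor, event) tuples for each five-field line."""
    events = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        stamp = datetime.strptime(fields[0] + " " + fields[1],
                                  "%m/%d/%Y %H:%M:%S")
        events.append((stamp, fields[2], fields[3]))
    return events


def ecc_intervals(events):
    """Days elapsed between consecutive uncorrectable ECC events."""
    stamps = sorted(t for t, _, event in events if event == "Uncorrectable ECC")
    return [(later - earlier).total_seconds() / 86400.0
            for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    for days in ecc_intervals(parse_sel(SEL_TEXT)):
        print("%6.1f days since previous uncorrectable ECC event" % days)
```

Shrinking intervals after the DIMM swaps would point away from the memory sticks themselves, which is consistent with the errors continuing after both replacements.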

I have observed that within 2 minutes of each ECC error these events are
always logged:

06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S0/G0: working | Asserted
06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S5/G2: soft-off | Deasserted

(but they are also logged at other times)

I am running Scientific Linux 5; the (lam) MPI application uses almost
100% CPU and exchanges lots of small packets through IPoIB (I have
not used "native" IB yet). "Everything" is 64-bit (kernel, apps).

Any thoughts?

Best Regards,
paulo lopes

-- 
Paulo Afonso Lopes                  | Tel: +351- 21 294 8536
Departamento de Informática         |      294 8300 ext.10763
Faculdade de Ciências e Tecnologia  | Fax: +351- 21 294 8541
Universidade Nova de Lisboa         | e-mail: pal@di.fct.unl.pt
2829-516 Caparica, PORTUGAL

From maurice at harddata.com Fri Aug 1 08:41:31 2008
From: maurice at harddata.com (Maurice Hilarius)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] Re: Building new cluster - estimate (Ivan Oleynik)
In-Reply-To: 
References: <200807261957.m6QJv6HE031997@bluewest.scyld.com> <488BD53A.2010907@harddata.com> <488EC70F.8090906@harddata.com> <488FEF6A.30003@harddata.com>
Message-ID: <48932EAB.1070500@harddata.com>

Mark Hahn wrote:
>> BTW, where a lot of people are jumping on the "Get IPMI" bandwagon,
>> I suggest getting PDUs with remote IP controlled ports is more useful.
>
> the thing I don't like about controlled PDUs is that they're pretty
> harsh - don't you expect a higher failure rate of node PSUs if you go
> yanking the power this way?

Why?
If nodes shutdown, on commands from the scheduler, that is good.
And, if they do not, how is cutting power by the PDU socket any
different than a power switch on the node?
Obviously we want to avoid "dropping the hammer" on a mounted
filesystem, at least until it has its cache cleared. That is not hard to
accomplish.

>
> I have only seen a handful of different IPMI interfaces, but they all
> were reasonably reliable.
>

I have used the Supermicro, Tyan, ASUS, and Dell, and they all had some
tendency to choke sometimes.

The thing is, at the nominal cost of $50 to $100 per machine for BMC
(IPMI) cards, one can buy a couple of network controlled PDUs, with the
thermal and humidity sensors. As you are likely to at least buy "dumb"
PDUs, this means the typical cost per node added by this is usually
around $30 per node, resulting in a tidy savings.

It also means you are "talking" to only one device per 10 to 30 nodes,
versus 10 to 30 BMC devices.
Further, these IPMI cards typically "steal" a GbE port on the nodes.

>> If you set your machines BIOS to start on power up, it is trivial to
>> stop and start machines with the PDU power, and that is definitely
>> reliable.
>
> huh? we're talking about network-attached IPMI, which is fully
> independent
> of the controlled motherboard's bios. are you talking about those
> hybrid systems where the IPMI controller shares an ethernet port with
> the host?
> or IPMI through a kernel driver?
>

Either.
Most share a port, some have dedicated ports on board.

>> Plus, with a lot of those PDUs you can add thermal sensors and
>> trigger power off on high temperature conditions.
>
> IPMI normally provides all the motherboard's sensors as well. it
> seems like those are far more relevant than the temp of the PDU...

I would rather monitor the room temperature at the racks, and shut the
whole works down in a hurry if something is wrong, such as air
conditioning failure.

> using lm_sensors is a poor substitute for IPMI.

Yes, and no.
For monitoring the temps and fans and such on nodes it is quite sufficient.
For power control it is useless, of course.

-- 
With our best regards,

Maurice W. Hilarius        Telephone: 01-780-456-9771
Hard Data Ltd.             FAX:       01-780-456-9772
11060 - 166 Avenue         email: maurice@harddata.com
Edmonton, AB, Canada       http://www.harddata.com/
T5X 1Y3

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20080801/2f04d789/attachment.html

From rreis at aero.ist.utl.pt Fri Aug 1 10:36:47 2008
From: rreis at aero.ist.utl.pt (Ricardo Reis)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran
Message-ID: 

which means... segfault.

Hi all

I've scoured the net for answers to no avail and the fftw project seems
to have ground to a halt. Maybe someone has had this problem and can
throw some light.

I've coded a small program that reads a vorticity field, uses FFTW2 to
send it from the physical to the spectral space and then computes its
energy spectrum. Everything works on my laptop (32 bit, Linux), serial,
threaded and mpi (using openmpi). On a 64 bit machine the mpi version
kaputs. Any thoughts?

Ricardo Reis

'Non Serviam'

PhD student @ Lasef
Computational Fluid Dynamics, High Performance Computing, Turbulence
http://www.lasef.ist.utl.pt

&

Cultural Instigator @ Rádio Zero
http://www.radiozero.pt

http://www.flickr.com/photos/rreis/

From hahn at mcmaster.ca Fri Aug 1 11:23:13 2008
From: hahn at mcmaster.ca (Mark Hahn)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran
In-Reply-To: 
References: 
Message-ID: 

> I've scoured the net for answers to no avail and the fftw project seems to
> have ground to a halt. Maybe someone has had this problem and can throw some
> light.

I don't know the status of the project, but fftw is definitely still
widely used, and definitely works in 64b.

> I've coded a small program that reads a vorticity field, uses FFTW2 to send
> it from the physical to the spectral space and then computes its energy
> spectrum. Everything works on my laptop (32 bit, Linux), serial, threaded and
> mpi (using openmpi). On a 64 bit machine the mpi version kaputs. Any
> thoughts?

32-64 problems usually stem from someone conflating ints and pointers.
we have fftw2 installed on all our machines, which are all 64b (for years).
no reports of problems.

did you compile your own fftw2, and if so, did you run the test cases?

From hahn at mcmaster.ca Fri Aug 1 11:25:54 2008
From: hahn at mcmaster.ca (Mark Hahn)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] Seeing ECC errors since upgraded from Opteron 246 to 275
In-Reply-To: <2405.10.170.133.93.1217605242.squirrel@www.di.fct.unl.pt>
References: <2405.10.170.133.93.1217605242.squirrel@www.di.fct.unl.pt>
Message-ID: 

> So I have 2 DL145-G2 nodes with 2 single-core 246 / 4GB each, and 2
> DL145-G2 nodes with 2 dual-core 275 / 4GB each.

it's worth making sure you have current bios installed.

> 07/28/2008 | 17:52:23 | Memory #0x02 | Uncorrectable ECC | Asserted

it may also be useful to run mcelog, which will tell you about
any ongoing _correctable_ ECC activity.

From glen.beane at jax.org Fri Aug 1 11:34:50 2008
From: glen.beane at jax.org (Glen Beane)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran
In-Reply-To: 
References: 
Message-ID: <4893574A.3010405@jax.org>

Mark Hahn wrote:
>> I've scourged the net for answers to no avail and the fftw project
>> seems to have grinded to a halt. Maybe someone has had this problem
>> and can throw some light.
>
> I don't know the status of the project, but fftw is definitely still
> widely used, and definitely works in 64b.
>
>> I've coded a small program that reads a vorticity field, uses FFTW2 to
>> send it from the physical to the spectral space and then computes its
>> energy spectrum. Everything works in my laptop (32 bit, Linux),
>> serial, threaded and mpi (using openmpi). In a 64 bit machine the mpi
>> version kaputs. Any thoughts?
>
> 32-64 problems usually stem from someone conflating ints and pointers.
> we have fftw2 installed on all our machines, which are all 64b (for years).
> no reports of problems.

I am also using fftw2 on our 64-bit Linux cluster without any issues

-- 
Glen L.
Beane
Software Engineer
The Jackson Laboratory
Phone (207) 288-6153

From jason at acm.org Fri Aug 1 12:29:08 2008
From: jason at acm.org (Jason Riedy)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] Re: fftw2, mpi, from 32 bit to 64 and fortran
In-Reply-To: (Ricardo Reis's message of "Fri, 1 Aug 2008 18:36:47 +0100 (WEST)")
References: 
Message-ID: <87proszg8r.fsf@sparse.dyndns.org>

And Ricardo Reis writes:
> In a 64 bit machine the mpi version kaputs. Any thoughts?

I'd bet that you're calling MPI routines directly from your
Fortran code somewhere, and fftw is a red herring...

When calling MPI routines directly from your Fortran code, be
very, very careful about the arguments being passed. Many MPI
routines stuff a pointer in a "large enough" integer, but some of
the MPI/Fortran "header" files make too many assumptions about
the particular compiler and flags in use.

You might want to use the ISO_C_BINDING module and the
BIND(C, NAME="...") gizmos to declare the specific routines
you're using.

Jason

From mark.kosmowski at gmail.com Fri Aug 1 12:45:07 2008
From: mark.kosmowski at gmail.com (Mark Kosmowski)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran
Message-ID: 

> Message: 3
> Date: Fri, 1 Aug 2008 14:23:13 -0400 (EDT)
> From: Mark Hahn
> Subject: Re: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran
> To: Ricardo Reis
> Cc: beowulf@beowulf.org
> Message-ID:
>
> Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
>
> > I've scourged the net for answers to no avail and the fftw project seems to
> > have grinded to a halt. Maybe someone has had this problem and can throw some
> > light.
>
> I don't know the status of the project, but fftw is definitely still
> widely used, and definitely works in 64b.
>
> > I've coded a small program that reads a vorticity field, uses FFTW2 to send
> > it from the physical to the spectral space and then computes its energy
> > spectrum.
Everything works in my laptop (32 bit, Linux), serial, threaded and
> > mpi (using openmpi). In a 64 bit machine the mpi version kaputs. Any
> > thoughts?
>
> 32-64 problems usually stem from someone conflating ints and pointers.
> we have fftw2 installed on all our machines, which are all 64b (for years).
> no reports of problems.
>
> did you compile your own fftw2, and if so, did you run the test cases?

What exactly is going wrong? Is your program failing to link or does
it die during execution?

Have you built 64-bit versions of fftw and mpi libraries?

Are you positive that you've changed from 32-bit to 64-bit all of the
libraries your code links to, even the ones not related to fftw or
mpi?

You imply that 64-bit serial with fftw works - are you able to get a
different code to run with 64-bit mpi?

Good luck,

Mark Kosmowski

From jmdavis1 at vcu.edu Fri Aug 1 13:26:52 2008
From: jmdavis1 at vcu.edu (Mike Davis)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] Re: fftw2, mpi, from 32 bit to 64 and fortran
In-Reply-To: <87proszg8r.fsf@sparse.dyndns.org>
References: <87proszg8r.fsf@sparse.dyndns.org>
Message-ID: <4893718C.5030109@vcu.edu>

My only issue with fftw is that some of our software will only work with
fftw2 and not fftw3. That being said, running both is relatively trivial.

Mike

From john.hearns at streamline-computing.com Fri Aug 1 15:01:18 2008
From: john.hearns at streamline-computing.com (John Hearns)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] Re: Building new cluster - estimate (Ivan Oleynik)
In-Reply-To: 
References: <200807261957.m6QJv6HE031997@bluewest.scyld.com> <488BD53A.2010907@harddata.com> <488EC70F.8090906@harddata.com> <488FEF6A.30003@harddata.com> <48932EAB.1070500@harddata.com>
Message-ID: <1217628088.4725.3.camel@Vigor13>

On Fri, 2008-08-01 at 12:12 -0400, Mark Hahn wrote:

> I'm not sure about the "most" part - HP's don't, and it looks like supermicro
> offers options both ways.
all the recent tyan boards I've looked at had
> dedicated IPMI/OPMA onboard. all HP machines have dedicated ports.
>
> but to me this has all the hallmarks of a religious issue, so...

On the contrary Mark, my honest advice - through experience - is to go
for systems with the separate ethernet port if you have a choice in the
matter.
Yes, this involves double the cabling and the installation of a set of
10/100 switches or similar.

From landman at scalableinformatics.com Fri Aug 1 16:23:34 2008
From: landman at scalableinformatics.com (Joe Landman)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] Re: Building new cluster - estimate (Ivan Oleynik)
In-Reply-To: 
References: <200807261957.m6QJv6HE031997@bluewest.scyld.com> <488BD53A.2010907@harddata.com> <488EC70F.8090906@harddata.com> <488FEF6A.30003@harddata.com> <48932EAB.1070500@harddata.com>
Message-ID: <48939AF6.8010100@scalableinformatics.com>

Mark Hahn wrote:

> I'm not sure about the "most" part - HP's don't, and it looks like
> supermicro
> offers options both ways. all the recent tyan boards I've looked at had
> dedicated IPMI/OPMA onboard. all HP machines have dedicated ports.
>
> but to me this has all the hallmarks of a religious issue, so...

Hmmm... we try to take a more pragmatic approach.

IPMI is great. When it doesn't get wedged. And every now and then it
does in fact take an operational excursion. Not often enough to be more
than an annoyance, but often enough that you want to think about redundancy.

Yeah, I know, it's strange, but if your data center is remote, and going
over to it is hard for any reason, redundancy is a *very good idea*.

Switchable PDUs don't cost much more than plain old PDUs. Network
access to them is generally easy to set up. They are a good backup to
IPMI.

But switchable PDUs don't give you console access. IPMI 2.0 can give
you SOL (serial over lan, not the other meaning). So we usually suggest a
console server to back that path up.
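
The backup-path idea above can be sketched roughly as follows: try IPMI first, and fall back to the switched PDU only if the BMC does not respond. This is an illustration under stated assumptions, not anyone's production tooling. The `ipmitool ... chassis power cycle` command line is the usual ipmitool CLI form; the BMC host name, credentials, and the `pdu_power_cycle` placeholder are hypothetical, since PDU control is vendor-specific.

```python
import subprocess


def ipmi_power_cycle(bmc_host, user, password, timeout=30):
    """Power-cycle a node through its BMC with ipmitool; True on success."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", bmc_host,
           "-U", user, "-P", password, "chassis", "power", "cycle"]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # ipmitool is not installed, or the BMC is wedged and never answers.
        return False
    return result.returncode == 0


def power_cycle_with_fallback(primary, fallback):
    """Try the primary control path, then the fallback.

    Both arguments are zero-argument callables returning True on success.
    Returns "primary", "fallback", or None if both paths failed.
    """
    if primary():
        return "primary"
    if fallback():
        return "fallback"
    return None


if __name__ == "__main__":
    def pdu_power_cycle():
        # Hypothetical placeholder: a real version would talk to the
        # switched PDU over whatever interface the vendor provides.
        return False

    outcome = power_cycle_with_fallback(
        lambda: ipmi_power_cycle("node01-bmc.example.org", "admin", "secret"),
        pdu_power_cycle,
    )
    print("power cycle path used:", outcome)
```

Keeping the two paths behind plain callables means the same wrapper works whether the fallback is a PDU, a console server, or a phone call to someone on site.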
The Supermicro units give you KVM over IP on selected motherboards.

IPMI is great, but when it fails, you lose control. And console access.
If this is important (that you never lose control/console access) then
you need alternative paths. Given the relatively low cost of these
control systems, it's not such a bad idea to do this.

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615

From gus at ldeo.columbia.edu Fri Aug 1 14:25:34 2008
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran
In-Reply-To: 
References: 
Message-ID: <48937F4E.7030400@ldeo.columbia.edu>

Ricardo Reis

Could you have used a 32-bit fftw library when you linked the program?
I have made this mistake before.
Both the 32- and 64-bit versions are installed on my Fedora Core 8
64-bit machine.

32-bit in /usr/lib/
64-bit in /usr/lib64/

Could this be the reason for the segmentation fault?

As Álvaro de Campos, poor fellow, used to say: compilation is a
political rally inside the soul. :)

Gus Correa

-- 
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus@ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

Ricardo Reis wrote:

>
> which means... segfault.
>
> Hi all
>
> I've scourged the net for answers to no avail and the fftw project
> seems to have grinded to a halt. Maybe someone has had this problem
> and can throw some light.
>
> I've coded a small program that reads a vorticity field, uses FFTW2
> to send it from the physical to the spectral space and then computes
> its energy spectrum.
Everything works in my laptop (32 bit, Linux),
> serial, threaded and mpi (using openmpi). In a 64 bit machine the mpi
> version kaputs. Any thoughts?
>
>
>
> Ricardo Reis
>
> 'Non Serviam'
>
> PhD student @ Lasef
> Computational Fluid Dynamics, High Performance Computing, Turbulence
> http://www.lasef.ist.utl.pt
>
> &
>
> Cultural Instigator @ Rádio Zero
> http://www.radiozero.pt
>
> http://www.flickr.com/photos/rreis/
>

From rreis at aero.ist.utl.pt Sat Aug 2 04:25:27 2008
From: rreis at aero.ist.utl.pt (Ricardo Reis)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran
In-Reply-To: 
References: 
Message-ID: 

Hi

Thanks for replying. Answering all the questions:

- This is a debian box, X86_64 native. So everything that is compiled is
naturally 64 bit;

- I've compiled fftw-2.1.5 myself because fftw3 has only
experimental MPI support, without Fortran bindings. I asked whether the
project has stopped because the last release (fftw, 3.2 alpha) is dated
Nov. 13, 2007

- I'm using openmpi, from the debian package. I've also compiled openmpi
by hand and the same problem happens. I've compiled the latest LAM
(although I had to explicitly select the 4.1 version of the gcc suite
because I found a problem with 4.3: it says g++ isn't boolean capable).
I can run other mpi codes on this machine (a pseudo-spectral DNS code I
parallelized myself) with this openmpi installation;

- Using LAM it works for 1 processor. It blows up for more than 2. I can
run my DNS code with lam without problem.

- The only 64 bit caveat in the fftw notes relates to the declaration of
the plan variables, which should be integer(8). I've carefully done that.
I even went to the extreme of placing -fdefault-integer-8 in the
compilation flags of this code;

- I can run this code as serial or threaded without problems;

- The 32 bit test was on my laptop, a 32 bit machine. The 64 bit test was
on the 64 bit machine. No libraries are transported (svn co and make and
so on...)

- Yes, I've managed to run the tests (but they are C programs, alas!).

- The program only blows up when going to do the fft r2c (my first
transform). Before that it is able to call other mpi functions.

- Gus, Ode Triunfal by Alvaro de Campos is one of my favourite poems.
The early XX century machine emotion fever of electricity. The furious
hunger to be alive and eating the world full :)

- I've tried it on another debian box, X86_64, with openmpi from debian
and the same problem happens...

- if I compile with -fdefault-integer-8 this is the error message

5068.0 $ mpirun -np 2 ~/bin/spec2.mpi
 Launching MPI program with 2 proc.
[tenorio:21099] *** Process received signal ***
[tenorio:21100] *** Process received signal ***
[tenorio:21099] Signal: Segmentation fault (11)
[tenorio:21099] Signal code: (128)
[tenorio:21099] Failing at address: (nil)
[tenorio:21099] [ 0] /lib/libpthread.so.0 [0x7f13ca893a90]
[tenorio:21099] [ 1] /usr/lib/libopen-pal.so.0(_int_malloc+0x962) [0x7f13cb3057c2]
[tenorio:21099] [ 2] /usr/lib/libopen-pal.so.0(malloc+0x8f) [0x7f13cb3068ef]
[tenorio:21099] [ 3] /home/rreis/bin/spec2.mpi(MAIN__+0x79a) [0x40eb0a]
[tenorio:21099] [ 4] /home/rreis/bin/spec2.mpi(main+0x2c) [0x46d3cc]
[tenorio:21099] [ 5] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f13ca5501a6]
[tenorio:21099] [ 6] /home/rreis/bin/spec2.mpi [0x407d59]
[tenorio:21099] *** End of error message ***
[tenorio:21100] Signal: Segmentation fault (11)
[tenorio:21100] Signal code: (128)
[tenorio:21100] Failing at address: (nil)
[tenorio:21100] [ 0] /lib/libpthread.so.0 [0x7f858af35a90]
[tenorio:21100] [ 1] /usr/lib/libopen-pal.so.0(_int_malloc+0x962) [0x7f858b9a77c2]
[tenorio:21100] [ 2]
/usr/lib/libopen-pal.so.0(malloc+0x8f) [0x7f858b9a88ef]
[tenorio:21100] [ 3] /home/rreis/bin/spec2.mpi(MAIN__+0x79a) [0x40eb0a]
[tenorio:21100] [ 4] /home/rreis/bin/spec2.mpi(main+0x2c) [0x46d3cc]
[tenorio:21100] [ 5] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f858abf21a6]
[tenorio:21100] [ 6] /home/rreis/bin/spec2.mpi [0x407d59]
[tenorio:21100] *** End of error message ***
mpirun noticed that job rank 0 with PID 21099 on node tenorio exited on signal 11 (Segmentation fault).
1 additional process aborted (not shown)

- if I take the flag out

5070.0 $ mpirun -np 2 ~/bin/spec2.mpi
 Launching MPI program with 2 proc.
 Read field (DONE)
[tenorio:21234] *** Process received signal ***
[tenorio:21234] Signal: Segmentation fault (11)
[tenorio:21234] Signal code: Address not mapped (1)
[tenorio:21234] Failing at address: 0x4840
[tenorio:21234] [ 0] /lib/libpthread.so.0 [0x7fd57da65a90]
[tenorio:21234] [ 1] /home/rreis/bin/spec2.mpi(rfftwnd_f77_mpi_+0x16) [0x40f676]
[tenorio:21234] [ 2] /home/rreis/bin/spec2.mpi(MAIN__+0xb69) [0x40f1fe]
[tenorio:21234] [ 3] /home/rreis/bin/spec2.mpi(main+0x2c) [0x46d6bc]
[tenorio:21234] [ 4] /lib/libc.so.6(__libc_start_main+0xe6) [0x7fd57d7221a6]
[tenorio:21234] [ 5] /home/rreis/bin/spec2.mpi [0x407d59]
[tenorio:21234] *** End of error message ***
mpirun noticed that job rank 0 with PID 21234 on node tenorio exited on signal 11 (Segmentation fault).
1 additional process aborted (not shown)

Maybe I should try mpich or compile the openmpi with all bells and
whistles and give it another run...
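
On the integer(8) caveat for the plan variables mentioned above: the plan is a C pointer that Fortran holds in an integer, so on a 64-bit machine a 4-byte integer cannot hold the whole value. Here is a minimal sketch of what gets lost, written in Python for brevity. The address is simply one taken from the backtrace above as an example of a typical 64-bit value, and `store_in_int32` / `store_in_int64` are illustrative helpers, not part of FFTW or MPI.

```python
import struct


def store_in_int32(address):
    """Mimic stuffing a pointer-sized value into a 4-byte signed integer.

    Only the low 32 bits survive, as with a default Fortran integer.
    """
    low_bytes = struct.pack("<Q", address)[:4]   # little-endian: low 4 bytes
    return struct.unpack("<i", low_bytes)[0]     # reinterpret as signed 32-bit


def store_in_int64(address):
    """An 8-byte signed integer (like integer(8)) keeps the value intact."""
    return struct.unpack("<q", struct.pack("<Q", address))[0]


if __name__ == "__main__":
    addr = 0x7F13CB3057C2  # example 64-bit address from the backtrace
    print("pointer size here:", struct.calcsize("P") * 8, "bits")
    print("original :", hex(addr))
    print("in int32 :", hex(store_in_int32(addr) & 0xFFFFFFFF))
    print("in int64 :", hex(store_in_int64(addr)))
```

A truncated handle like this usually looks fine until the library dereferences it, which matches the pattern of a crash inside the first transform call rather than at plan creation. It is only one possible cause among several for this kind of segfault.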

greets,

Ricardo Reis

'Non Serviam'

PhD student @ Lasef
Computational Fluid Dynamics, High Performance Computing, Turbulence
http://www.lasef.ist.utl.pt

&

Cultural Instigator @ Rádio Zero
http://www.radiozero.pt

http://www.flickr.com/photos/rreis/

From pal at di.fct.unl.pt Sat Aug 2 04:57:37 2008
From: pal at di.fct.unl.pt (Paulo Afonso Lopes)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] Seeing ECC errors since upgraded from Opteron 246 to 275
In-Reply-To: 
References: <2405.10.170.133.93.1217605242.squirrel@www.di.fct.unl.pt>
Message-ID: <20670.89.180.225.196.1217678257.squirrel@www.di.fct.unl.pt>

Thanks, Mark

>> So I have 2 DL145-G2 nodes with 2 single-core 246 / 4GB each, and 2
>> DL145-G2 nodes with 2 dual-core 275 / 4GB each.
>
> it's worth making sure you have current bios installed.
>

Not the latest, but the previous; according to "Fixes" just a single,
unrelated fix. Anyway I'm upgrading it...

>
>> 07/28/2008 | 17:52:23 | Memory #0x02 | Uncorrectable ECC | Asserted
>
> it may also be useful to run mcelog, which will tell you about
> any ongoing _correctable_ ECC activity.

No output in any of the 4 hosts; tried with/without --k8, --dmi, etc.

(Just a side note, as it is being pursued in another thread): I have been
quite happy with DL145-G2's IPMI and BMC board: I was able to power it
remotely on every occasion, including after crashes.

-- 
Paulo Afonso Lopes                  | Tel: +351- 21 294 8536
Departamento de Informática         |      294 8300 ext.10763
Faculdade de Ciências e Tecnologia  | Fax: +351- 21 294 8541
Universidade Nova de Lisboa         | e-mail: pal@di.fct.unl.pt
2829-516 Caparica, PORTUGAL

From rreis at aero.ist.utl.pt Sat Aug 2 05:49:11 2008
From: rreis at aero.ist.utl.pt (Ricardo Reis)
Date: Fri Dec 5 01:07:33 2008
Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran
In-Reply-To: 
References: 
Message-ID: 

Hi all

After backtracing and lots of going around I found out the problem.
The routine to calculate the FFT had a parameter for using or not using a buffer array which I wasn't passing through. Thanks all for your help and sorry to disturb. There should be a way to force the Fortran compiler to check that every variable is passed in the interface... damn it. Greets, Ricardo Reis 'Non Serviam' PhD student @ Lasef Computational Fluid Dynamics, High Performance Computing, Turbulence http://www.lasef.ist.utl.pt & Cultural Instigator @ Rádio Zero http://www.radiozero.pt http://www.flickr.com/photos/rreis/ From csamuel at vpac.org Sun Aug 3 16:12:02 2008 From: csamuel at vpac.org (Chris Samuel) Date: Fri Dec 5 01:07:33 2008 Subject: [Beowulf] Re: Linux cluster authenticating against multiple Active Directory domains In-Reply-To: <1217575735.4977.1.camel@Vigor13> Message-ID: <1669319801.11171217805122430.JavaMail.root@mail.vpac.org> ----- "John Hearns" wrote: > On Fri, 2008-08-01 at 15:37 +1000, Chris Samuel wrote: > > > We'd prefer to steer clear of Kerberos, it introduces > > arbitrary job limitations through ticket lives that > > are not tolerable for HPC work. > > Kerberos is heavily used at CERN. They have a solution for > that issue - the job can ask for an extension to the tickets. That's useful to know, though it doesn't help in this situation because there is no GSSAPI support in the mainline Torque at present. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O.
Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From gerry.creager at tamu.edu Mon Aug 4 05:04:15 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <841759592.334291217464770684.JavaMail.root@zimbra.vpac.org> References: <841759592.334291217464770684.JavaMail.root@zimbra.vpac.org> Message-ID: <4896F03F.3080803@tamu.edu> Chris Samuel wrote: > ----- "Bogdan Costescu" wrote: > >> On Tue, 29 Jul 2008, Chris Samuel wrote: >> >>> 1) Use a mainline kernel, we've found benefit of that >>> over stock CentOS kernels. >> Care to comment on this statement ? > > a) We found that we got better performance out of > the mainline kernels than the CentOS ones; we guess > because they handle newer hardware better (RHEL is > meant to aim for stability over performance) Hadn't thought about this, but it makes a lot of sense. > b) We can use XFS for scratch space rather than being > tied to the RHEL One True Filesystem (ext3) which > (in our experience) can't handle large amounts of disk > I/O. Mirrors our experience, too. -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From landman at scalableinformatics.com Mon Aug 4 05:31:48 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <841759592.334291217464770684.JavaMail.root@zimbra.vpac.org> References: <841759592.334291217464770684.JavaMail.root@zimbra.vpac.org> Message-ID: <4896F6B4.60602@scalableinformatics.com> Chris Samuel wrote: > ----- "Bogdan Costescu" wrote: > >> On Tue, 29 Jul 2008, Chris Samuel wrote: >> >>> 1) Use a mainline kernel, we've found benefit of that >>> over stock CentOS kernels. 
>> Care to comment on this statement ? > > a) We found that we got better performance out of > the mainline kernels than the CentOS ones; we guess > because they handle newer hardware better (RHEL is > meant to aim for stability over performance) This mirrors our experience, though RHEL stability under intense loads is questionable IMO (talking about the kernel BTW). We find that the missing drivers, the omitted drivers, the backported drivers along with some odd and often useless "features" (4k stacks anyone?) render the RHEL default kernels (and by definition the CentOS kernels) less useful for HPC and storage tasks than what we build. Our current standard is a 2.6.23.14 kernel which is rock solid under load. Working on a 2.6.26 based version now (even though I am on vacation/holiday, I just updated it to 2.6.26.1 to address an observed crashing issue with the RDMA server) > b) We can use XFS for scratch space rather than being > tied to the RHEL One True Filesystem (ext3) which > (in our experience) can't handle large amounts of disk > I/O. Combine this with the small upper limit of ext3 partition sizes, the file size limits in ext3, the serialization in the journaling code (ext4 is extents based to help deal with this), and ext3 just doesn't make much sense in a storage/HPC system (apart from possibly boot/root file system where performance is less critical). Yeah I have seen studies from folks who had done 1E6 removes, file creates, and other things who claim xfs is slower than ext3. Yeah, those are bad benchmarks in that they really don't touch on real end user use cases for the most part (apart from possibly large scale mail servers and other things like that). > > YMMV! Always ... and with gas in the ~$4 USD region, you need to conserve. Having been in London a few months ago, seeing almost $10 USD/gallon (about 3.8 liters), I am gonna stop complaining about our price over here.
-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From mark.kosmowski at gmail.com Mon Aug 4 13:37:30 2008 From: mark.kosmowski at gmail.com (Mark Kosmowski) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran Message-ID: > > Message: 4 > Date: Sat, 2 Aug 2008 13:49:11 +0100 (WEST) > From: Ricardo Reis > Subject: Re: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran > To: beowulf@beowulf.org > Message-ID: > Content-Type: text/plain; charset="iso-8859-15" > > > Hi all > > After backtracing and lots of going around I found out the problem. The > routine to calculate the fft had a parameter for using or not using a > buffer array which I wasn't passing through. > > Thanks all for your help and sorry to disturbe. > > there should be a way to force the fortan compiler to check every > variable is passed in the interface... damn it. > > Greets, > > Ricardo Reis > So, why did the 32-bit test case work? Shouldn't the same problem crash both systems if it is a code issue? In any event, I am glad you got your problem sorted out. Mark E. Kosmowski From matt at technoronin.com Mon Aug 4 13:54:19 2008 From: matt at technoronin.com (Matt Lawrence) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <4896F6B4.60602@scalableinformatics.com> References: <841759592.334291217464770684.JavaMail.root@zimbra.vpac.org> <4896F6B4.60602@scalableinformatics.com> Message-ID: On Mon, 4 Aug 2008, Joe Landman wrote: > This mirrors our experience, though RHEL stability under intense loads is > questionable IMO (talking about the kernel BTW). We find that the missing > drivers, the omitted drivers, the backported drivers along with some odd and > often useless "features" (4k stacks anyone?) 
render the RHEL default kernels > (and by definition the Centos kernels) less useful for HPC and storage tasks > than what we build. Our current standard is a 2.6.23.14 kernel which is rock > solid under load. Working on a 2.6.26 based version now (even though I am on > vacation/holiday, I just updated it to 2.6.26.1 to address an observed > crashing issue with the RDMA server) Since I plan to continue running CentOS, it sounds like building a much later kernel rpm is the way I want to approach the problem. Will going to a much later kernel break any of the utilities? Other problems I can expect to see? What do you recommend for the kernel config? > Combine this with the small upper limit of ext3 partition sizes, the file > size limits in ext3, the serialization in the journaling code (ext4 is > extents based to help deal with this), ext3 just doesn't make much sense in a > storage/HPC system (apart from possibly boot/root file system where > performance is less critical). Yeah I have seen studies from folks whom had > done 1E6 removes, file creates, and other things who claim xfs is slower than > ext3. Yeah, those are bad benchmarks in that they really don't touch on real > end user use cases for the most part (apart from possible large scale mail > servers and other things like that). I have never had any problems with ext3. I had dinner with a friend who is an expert Linux sysadmin who was warning me to stay away from xfs. He cited lots of fragmentation problems that routinely locked up his systems. I am willing to be convinced otherwise, but he is a very sharp fellow. -- Matt It's not what I know that counts. It's what I can remember in time to use. 
From landman at scalableinformatics.com Mon Aug 4 15:02:17 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: References: <841759592.334291217464770684.JavaMail.root@zimbra.vpac.org> <4896F6B4.60602@scalableinformatics.com> Message-ID: <48977C69.1000703@scalableinformatics.com> Matt Lawrence wrote: > On Mon, 4 Aug 2008, Joe Landman wrote: > >> This mirrors our experience, though RHEL stability under intense loads >> is questionable IMO (talking about the kernel BTW). We find that the >> missing drivers, the omitted drivers, the backported drivers along >> with some odd and often useless "features" (4k stacks anyone?) render >> the RHEL default kernels (and by definition the Centos kernels) less >> useful for HPC and storage tasks than what we build. Our current >> standard is a 2.6.23.14 kernel which is rock solid under load. >> Working on a 2.6.26 based version now (even though I am on >> vacation/holiday, I just updated it to 2.6.26.1 to address an observed >> crashing issue with the RDMA server) > > Since I plan to continue running CentOS, it sounds like building a much > later kernel rpm is the way I want to approach the problem. Will going > to a much later kernel break any of the utilities? Other problems I can > expect to see? Doesn't break most things. We usually insert a new RPM and off it goes. > > What do you recommend for the kernel config? > >> Combine this with the small upper limit of ext3 partition sizes, the >> file size limits in ext3, the serialization in the journaling code >> (ext4 is extents based to help deal with this), ext3 just doesn't make >> much sense in a storage/HPC system (apart from possibly boot/root file >> system where performance is less critical). Yeah I have seen studies >> from folks whom had done 1E6 removes, file creates, and other things >> who claim xfs is slower than ext3. 
Yeah, those are bad benchmarks in >> that they really don't touch on real end user use cases for the most >> part (apart from possible large scale mail servers and other things >> like that). > > I have never had any problems with ext3. I had dinner with a friend who > is an expert Linux sysadmin who was warning me to stay away from xfs. > He cited lots of fragmentation problems that routinely locked up his > systems. I am willing to be convinced otherwise, but he is a very sharp > fellow. I haven't seen or heard anyone claim xfs 'routinely locks up their system'. I won't comment on your friends "sharpness". I will point out that several very large data stores/large cluster sites use xfs. By definition, no large data store can be built with ext3 (16 TB limit with patches, 8 TB in practice), so if your sharp friend is advising you to do this ... -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From matt at technoronin.com Mon Aug 4 17:35:47 2008 From: matt at technoronin.com (Matt Lawrence) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <48977C69.1000703@scalableinformatics.com> References: <841759592.334291217464770684.JavaMail.root@zimbra.vpac.org> <4896F6B4.60602@scalableinformatics.com> <48977C69.1000703@scalableinformatics.com> Message-ID: On Mon, 4 Aug 2008, Joe Landman wrote: > I haven't seen or heard anyone claim xfs 'routinely locks up their system'. > I won't comment on your friends "sharpness". I will point out that several > very large data stores/large cluster sites use xfs. By definition, no large > data store can be built with ext3 (16 TB limit with patches, 8 TB in > practice), so if your sharp friend is advising you to do this ... 
He currently works for a phone company, so the amount of data is quite large, but the usage pattern is probably quite different. As far as skill level, I would rate him much higher than any of the folks I work with as far as being a sysadmin. So, any good info on kernel configuration when I go to build a new rpm? There are a huge number of options and you have obviously gone through them much more recently than I have. -- Matt It's not what I know that counts. It's what I can remember in time to use. From landman at scalableinformatics.com Mon Aug 4 17:47:36 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: References: <841759592.334291217464770684.JavaMail.root@zimbra.vpac.org> <4896F6B4.60602@scalableinformatics.com> <48977C69.1000703@scalableinformatics.com> Message-ID: <4897A328.9070004@scalableinformatics.com> Matt Lawrence wrote: > So, any good info on kernel configuration when I go to build a new rpm? Don't start with the distro .src.rpm for the kernel. Build your own, and integrate your patches manually. Best way is take the barebones kernel from kernel.org, do a 'make rpm-pkg' on it (will generate a source RPM and spec file for you). Then install this source rpm, and voila, you have a working spec file. Integrate your patches into this, and use this to build. Decide what you need to support on your machines to spec your kernel version. Late model 2.6.25.x support NFS over RDMA so if you want that, you need the latest flavor of this. Decide which file system options you want, and make sure to integrate them (as modules). Remove things that you wont use (ISDN, Telephony, ARCNET, ...) > There are a huge number of options and you have obviously gone through > them much more recently than I have. A make xconfig can be quite helpful in changing the .config. > > -- Matt > It's not what I know that counts. > It's what I can remember in time to use. 
> _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From gus at ldeo.columbia.edu Mon Aug 4 21:37:38 2008 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran In-Reply-To: References: Message-ID: <4897D912.6040304@ldeo.columbia.edu> Salve Ricardo Reis and list Ricardo Reis wrote: > > Hi all > > After backtracing and lots of going around I found out the problem. > The routine to calculate the FFT had a parameter for using or not > using a buffer array which I wasn't passing through. > > Thanks all for your help and sorry to disturb. > > There should be a way to force the Fortran compiler to check that every > variable is passed in the interface... damn it. > Fortran 90 (and later) has this capability with module interfaces, which resemble C function prototypes. However, I don't think FFTW uses it, although I haven't used FFTW in a while to be sure about it. Smuggling various parameter types across the same subroutine interface seems to have been a desired feature of older Fortran. Are there any Fortran compilers that can actually check the subroutine parameter number, type, array dimensions, etc, through mere compilation flags? I don't remember any; the language features are probably not enough to ensure such checks (except for Fortran 90 as noted above). Glad to know your code now works! Gus Correa -- --------------------------------------------------------------------- Gustavo J.
Ponce Correa, PhD - Email: gus@ldeo.columbia.edu Lamont-Doherty Earth Observatory - Columbia University P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- > Greets, > > Ricardo Reis > > 'Non Serviam' > > PhD student @ Lasef > Computational Fluid Dynamics, High Performance Computing, Turbulence > http://www.lasef.ist.utl.pt > > & > > Cultural Instigator @ Rádio Zero > http://www.radiozero.pt > > http://www.flickr.com/photos/rreis/ From prentice at ias.edu Tue Aug 5 07:15:57 2008 From: prentice at ias.edu (Prentice Bisbal) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Kerberos + HPC In-Reply-To: <1217575735.4977.1.camel@Vigor13> References: <371991977.6661217569032542.JavaMail.root@mail.vpac.org> <1217575735.4977.1.camel@Vigor13> Message-ID: <4898609D.7060405@ias.edu> John Hearns wrote: > On Fri, 2008-08-01 at 15:37 +1000, Chris Samuel wrote: > >> We'd prefer to steer clear of Kerberos, it introduces >> arbitrary job limitations through ticket lives that >> are not tolerable for HPC work. >> > Kerberos is heavily used at CERN. They have a solution for that issue - > the job can ask for an extension to the tickets. > Sorry, I don't have a reference handy but it's worth documenting this for > the list. > If ANYONE has more information on how this is done at CERN, I'd be very interested in hearing about it. I know, I know... GIYF...
-- Prentice From kus at free.net Tue Aug 5 09:34:22 2008 From: kus at free.net (Mikhail Kuzminsky) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: Message-ID: In message from Matt Lawrence (Mon, 4 Aug 2008 19:35:47 -0500 (CDT)): >On Mon, 4 Aug 2008, Joe Landman wrote: >> I haven't seen or heard anyone claim xfs 'routinely locks up their >>system'. >> I won't comment on your friends "sharpness". I will point out that >>several >> very large data stores/large cluster sites use xfs. By definition, >>no large >> data store can be built with ext3 (16 TB limit with patches, 8 TB in >> practice), so if your sharp friend is advising you to do this ... > >He currently works for a phone company, so the amount of data is >quite large, but the usage pattern is probably quite different. As >far as skill level, I would rate him much higher than any of the >folks I work with as far as being a sysadmin. I work w/xfs for HPC since 1995: I used xfs w/SGI SMP servers under IRIX, and then on Linux/x86 clusters. I didn't have any hang-ups because of xfs. But xfs is optimal for work w/large files; when you work w/a lot of relative small files, xfs isn't the better choice. The question about fragmentation itself is more interesting. We have in xfs filesystem a set of small files (1st of all, input data) in addition to large (usually temporary) files. So the fragmentation may be present. xfs has a rich set of utilities, but AFAIK no defragmentation tools (I don't know what will be after xfsdump/xfsrestore). But which modern linux filesystems have defragmentation possibilities ? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry Moscow From perry at piermont.com Tue Aug 5 09:59:30 2008 From: perry at piermont.com (Perry E. 
Metzger) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Kerberos + HPC In-Reply-To: <4898609D.7060405@ias.edu> (Prentice Bisbal's message of "Tue\, 05 Aug 2008 10\:15\:57 -0400") References: <371991977.6661217569032542.JavaMail.root@mail.vpac.org> <1217575735.4977.1.camel@Vigor13> <4898609D.7060405@ias.edu> Message-ID: <87hc9zjt3h.fsf@snark.cb.piermont.com> Prentice Bisbal writes: > John Hearns wrote: >> On Fri, 2008-08-01 at 15:37 +1000, Chris Samuel wrote: >> >>> We'd prefer to steer clear of Kerberos, it introduces >>> arbitrary job limitations through ticket lives that >>> are not tolerable for HPC work. >> >> Kerberos is heavily used at CERN. They have a solution for that issue - >> the job can ask for an extension to the tickets. >> Sorry, I don't have a reference handy but it's worth documenting this for >> the list. > > If ANYONE has more information on how this is done at CERN, I'd be very > interested in hearing about it. I know, I know... GIYF... I doubt they're doing anything unusual -- this is a completely normal thing any Kerberos setup deals with. You just stash the private key on the server, request a long ticket lifetime and refresh tickets reasonably frequently. Standard documentation can tell you how to do it -- just read the manuals. Perry -- Perry E. Metzger perry@piermont.com From prentice at ias.edu Tue Aug 5 10:07:03 2008 From: prentice at ias.edu (Prentice Bisbal) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Kerberos + HPC In-Reply-To: <87hc9zjt3h.fsf@snark.cb.piermont.com> References: <371991977.6661217569032542.JavaMail.root@mail.vpac.org> <1217575735.4977.1.camel@Vigor13> <4898609D.7060405@ias.edu> <87hc9zjt3h.fsf@snark.cb.piermont.com> Message-ID: <489888B7.10806@ias.edu> Perry E.
Metzger wrote: > Prentice Bisbal writes: >> John Hearns wrote: >>> On Fri, 2008-08-01 at 15:37 +1000, Chris Samuel wrote: >>> >>>> We'd prefer to steer clear of Kerberos, it introduces >>>> arbitrary job limitations through ticket lives that >>>> are not tolerable for HPC work. >>> Kerberos is heavily used at CERN. They have a solution for that issue - >>> the job can ask for an extension to the tickets. >>> Sorry, I don't have a reference handy but it's worth documenting this for >>> the list. >> If ANYONE has more information on how this is done at CERN, I'd be very >> interested in hearing about it. I know, I know... GIYF... > > I doubt they're doing anything unusual -- this is a completely normal > thing any Kerberos setup deals with. You just stash the private key on > the server, request a long ticket lifetime and refresh tickets > reasonably frequently. Standard documentation can tell you how to do > it -- just read the manuals. > > Perry I don't believe it. It sounds way too simple! -- Prentice From jlb17 at duke.edu Tue Aug 5 11:10:33 2008 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: References: Message-ID: On Tue, 5 Aug 2008 at 8:34pm, Mikhail Kuzminsky wrote > xfs has a rich set of utilities, but AFAIK no defragmentation tools (I don't > know what will be after xfsdump/xfsrestore). But which modern linux Not true -- see xfs_fsr(8). Back in the IRIX days, it was recommended to run this regularly. However, ISTR that the current recommendation is "as needed, but it really shouldn't be needed".
-- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From prentice at ias.edu Tue Aug 5 13:38:08 2008 From: prentice at ias.edu (Prentice Bisbal) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Kerberos + HPC In-Reply-To: <48989873.2090402@tuffmail.us> References: <371991977.6661217569032542.JavaMail.root@mail.vpac.org> <1217575735.4977.1.camel@Vigor13> <4898609D.7060405@ias.edu> <87hc9zjt3h.fsf@snark.cb.piermont.com> <489888B7.10806@ias.edu> <48989873.2090402@tuffmail.us> Message-ID: <4898BA30.1020201@ias.edu> Alan Louis Scheinine wrote: >> I don't believe it. It sounds way to simple! > > Perhaps the tricky part begins with the seemingly innocent > phrase "Standard documentation can tell you how to do > it -- just read the manuals." Are you saying RTFM? I've read the O'Reilly book on Kerberos several times, and I'm well-versed in Kerberos administration. I know how to adjust ticket TTLs. Perry suggests stashing the private key on the server and then refreshing the ticket automatically. How do you refresh the ticket automatically for a user while a job is waiting to run? That would have to be done by the queuing system, so the queuing system would have to be GSSAPI-aware. Someone already pointed out that Torque is NOT GSSAPI-aware so that leaves SGE and commercial applications. -- Prentice From gus at ldeo.columbia.edu Tue Aug 5 14:25:52 2008 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel file system? Message-ID: <4898C560.6040606@ldeo.columbia.edu> Hello Beowulf fans Is anybody using Infiniband to provide both MPI connection and parallel file system services on a Beowulf cluster? I thought to have a storage node that would serve a parallel file system to the beowulf nodes over IB (something like a NFS on steroids). The same IB net would also work as the MPI interconnect. Is this design possible? 
On a small cluster, does it require two separate IB physical networks (cards and switch), or can it be done with a single IB card per node and one switch? Is this design efficient? Are there other practical and cost effective alternatives to this idea? Would this type of design work with GigE instead of IB? I confess I know nothing about parallel file systems and IB. So, please forgive me if my questions are nonsense. I also appreciate any links to readings that would mitigate my ignorance on these subjects. Thank you, Gus Correa -- --------------------------------------------------------------------- Gustavo J. Ponce Correa, PhD - Email: gus@ldeo.columbia.edu Lamont-Doherty Earth Observatory - Columbia University P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- From kus at free.net Tue Aug 5 15:38:23 2008 From: kus at free.net (Mikhail Kuzminsky) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: Message-ID: In message from Joshua Baker-LePain (Tue, 5 Aug 2008 14:10:33 -0400 (EDT)): >On Tue, 5 Aug 2008 at 8:34pm, Mikhail Kuzminsky wrote > >> xfs has a rich set of utilities, but AFAIK no defragmentation tools >>(I don't >> know what will be after xfsdump/xfsrestore). But which modern linux > >Not true -- see xfs_fsr(8). Thanks !! I didn't look to xfs details many years :-( - it's my mistake. > Back in the IRIX days, it was >recommended to run this regularly. I don't remember that xfs_fsr was included in IRIX 6.1-6.4 we used. Mikhail > However, ISTR that the current >recommendation is "as needed, but it really shouldn't be needed". > >-- >Joshua Baker-LePain >QB3 Shared Cluster Sysadmin >UCSF From hahn at mcmaster.ca Tue Aug 5 16:37:39 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel file system? 
In-Reply-To: <4898C560.6040606@ldeo.columbia.edu> References: <4898C560.6040606@ldeo.columbia.edu> Message-ID: > Is anybody using Infiniband to provide both > MPI connection and parallel file system services on a Beowulf cluster? of course! many people have a strong opinion that sharing networks with file and mpi traffic is a bad thing, but I haven't seen anyone actually produce numbers. obviously contention increases the chances that a latency-sensitive operation (say, small synchronous mpi message) will be hurt by a stream of large file packets. and moreso when the fabric is not full-bisection - even if it's multiple cores sharing a node's single interface. but consider gigabit - 1500-byte packets consume <20 us of wire time, and most people who are using gb for mpi are expecting zero-byte latency quite a lot higher than that (say 50 us). by contrast, a max-size packet on old-gen SDR IB is about 4 us wire time, about the same as 0B latency. as has been pointed out here recently, the fabric will drop pretty significantly in performance once links become contended; this would make the latency-vs-bandwidth conflict more painful. (it also affects certain networks more than others - depending on their ability to adjust routes dynamically.) IMO, you have to ponder in your heart whether your expected workload will suffer from these issues. there is really no general rule, since workloads vary so widely in latency sensitivity and in bandwidth demands, all convolved with the fabric properties... if you have a well-defined workload, why not measure it? run an mpi app that has some sort of performance feedback while applying an increasing large-transfer NFS load... regards, mark hahn. 
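A rough back-of-envelope sketch of the wire-time arithmetic in the message above, for anyone who wants to plug in their own fabric. It only computes serialization time (bytes times 8, divided by the data rate) and compares it with a base small-message latency. Several inputs are assumptions rather than figures from the thread: the 8 Gb/s usable data rate (a 4x SDR link after 8b/10b encoding), the 4096-byte maximum IB payload, and the 4 us IB base latency (the message only says it is "about the same" as the wire time). Headers, preambles and switch hops are ignored, and the function names are made up for illustration.

```python
def wire_time_us(payload_bytes, data_rate_bits_per_s):
    """Time to serialize one packet onto the link, in microseconds."""
    return payload_bytes * 8 / data_rate_bits_per_s * 1e6


def relative_delay(wire_us, base_latency_us):
    """Extra delay from waiting behind one full-size packet,
    expressed as a fraction of the base small-message latency."""
    return wire_us / base_latency_us


# Assumed link parameters (not measured, not from the thread):
GIGE_RATE = 1e9        # gigabit ethernet, 1 Gb/s
SDR_IB_RATE = 8e9      # 4x SDR IB: 10 Gb/s signalling, 8 Gb/s data after 8b/10b
GIGE_MTU = 1500        # bytes, standard ethernet payload
IB_MTU = 4096          # bytes, assumed maximum IB payload

# Base zero-byte MPI latencies: 50 us is the figure quoted for gigabit;
# 4 us for SDR IB is an assumption based on "about the same as wire time".
GIGE_LATENCY_US = 50.0
IB_LATENCY_US = 4.0

if __name__ == "__main__":
    for name, mtu, rate, lat in [
        ("GigE", GIGE_MTU, GIGE_RATE, GIGE_LATENCY_US),
        ("SDR IB", IB_MTU, SDR_IB_RATE, IB_LATENCY_US),
    ]:
        wire = wire_time_us(mtu, rate)
        print(f"{name}: {wire:.1f} us on the wire, "
              f"{relative_delay(wire, lat):.0%} of base latency")
```

Under these assumptions a full-size gigabit packet costs about 12 us, roughly a quarter of the 50 us base latency, while a full-size SDR IB packet costs about 4 us, which is comparable to the whole base latency. That is only the single-packet case; a sustained stream of file traffic on a contended link would need the measurement suggested above.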
From csamuel at vpac.org Tue Aug 5 16:44:15 2008 From: csamuel at vpac.org (Chris Samuel) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: Message-ID: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> ----- "Matt Lawrence" wrote: > I have never had any problems with ext3. I suspect you're not doing a lot of disk I/O, we found NFS servers using ext3 as a back end would crumble under the weight of lots of writes as ext3 is single threaded through the journal daemon. That means that you end up with all your NFS daemons blocking on that, stalling everything else. :-( > I had dinner with a friend who is an expert Linux > sysadmin who was warning me to stay away from xfs. There have been occasional bugs in XFS in older kernel releases, but then there have been bugs in other filesystems too. > He cited lots of fragmentation problems that routinely > locked up his systems. Never had that problem here. Does he know that he can use xfs_fsr to defragment XFS filesystems online ? Is he sure he's not hitting another kernel bug ? cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From csamuel at vpac.org Tue Aug 5 16:47:57 2008 From: csamuel at vpac.org (Chris Samuel) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <4896F03F.3080803@tamu.edu> Message-ID: <373299326.37931217980077289.JavaMail.root@mail.vpac.org> ----- "Gerry Creager" wrote: > Chris Samuel wrote: > > > b) We can use XFS for scratch space rather than being > > tied to the RHEL One True Filesystem (ext3) which > > (in our experience) can't handle large amounts of disk > > I/O. > > Mirrors our experience, too. I should point out that our actual NFS servers run Debian Linux not CentOS. 
Those who want to run the stock CentOS kernel might like to know that the "plus" repository includes an RPM for the XFS kernel module for the mainline kernel. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From csamuel at vpac.org Tue Aug 5 16:56:28 2008 From: csamuel at vpac.org (Chris Samuel) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Kerberos + HPC In-Reply-To: <4898BA30.1020201@ias.edu> Message-ID: <188691817.38081217980588864.JavaMail.root@mail.vpac.org> ----- "Prentice Bisbal" wrote: > Someone already pointed out that Torque is NOT > GSSAPI-aware so that leaves SGE and commercial > applications. To be fair to the Torque devs I did say that the release versions don't have GSSAPI support, but there is a GSSAPI branch in SVN. But I don't think it gets much development and probably even less testing which leads to a chicken/egg situation. Nobody is going to deploy Kerberos on a cluster if it'll break the queueing system and nobody will get GSSAPI support into a queueing system if there's not clusters needing it which can tolerate downtime and lost jobs to get it working. cheers! Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From perry at piermont.com Tue Aug 5 17:06:33 2008 From: perry at piermont.com (Perry E. 
Metzger) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> (Chris Samuel's message of "Wed\, 6 Aug 2008 09\:44\:15 +1000 \(EST\)") References: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> Message-ID: <87d4knhura.fsf@snark.cb.piermont.com> Chris Samuel writes: > ----- "Matt Lawrence" wrote: > >> I have never had any problems with ext3. > > I suspect you're not doing a lot of disk I/O, we > found NFS servers using ext3 as a back end would > crumble under the weight of lots of writes as ext3 > is single threaded through the journal daemon. > > That means that you end up with all your NFS daemons > blocking on that, stalling everything else. :-( Put your journal onto a battery backed RAM card or the equivalent on the RAID controller and it significantly speeds up dealing with a journal. Perry -- Perry E. Metzger perry@piermont.com From gerry.creager at tamu.edu Tue Aug 5 17:16:24 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <373299326.37931217980077289.JavaMail.root@mail.vpac.org> References: <373299326.37931217980077289.JavaMail.root@mail.vpac.org> Message-ID: <4898ED58.8020001@tamu.edu> Chris Samuel wrote: > ----- "Gerry Creager" wrote: > >> Chris Samuel wrote: >> >>> b) We can use XFS for scratch space rather than being >>> tied to the RHEL One True Filesystem (ext3) which >>> (in our experience) can't handle large amounts of disk >>> I/O. >> Mirrors our experience, too. > > I should point out that our actual NFS servers run > Debian Linux not CentOS. > > Those who want to run the stock CentOS kernel might > like to know that the "plus" repository includes an > RPM for the XFS kernel module for the mainline kernel. And, of course, we do. Good point. 
-- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From matt at technoronin.com Tue Aug 5 19:27:24 2008 From: matt at technoronin.com (Matt Lawrence) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <373299326.37931217980077289.JavaMail.root@mail.vpac.org> References: <373299326.37931217980077289.JavaMail.root@mail.vpac.org> Message-ID: On Wed, 6 Aug 2008, Chris Samuel wrote: > Those who want to run the stock CentOS kernel might > like to know that the "plus" repository includes an > RPM for the XFS kernel module for the mainline kernel. It works well as long as you remember to install the xfs-progs package. I spent five minutes today going "where the heck is mkfs.xfs and the man pages?". -- Matt It's not what I know that counts. It's what I can remember in time to use. From matt at technoronin.com Tue Aug 5 19:37:13 2008 From: matt at technoronin.com (Matt Lawrence) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> References: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> Message-ID: On Wed, 6 Aug 2008, Chris Samuel wrote: > I suspect you're not doing a lot of disk I/O, we > found NFS servers using ext3 as a back end would > crumble under the weight of lots of writes as ext3 > is single threaded through the journal daemon. > > That means that you end up with all your NFS daemons > blocking on that, stalling everything else. :-( Could be. Given the long and sordid history of NFS, I prefer to not use it whenever there are practical alternatives. I'm also not a Solaris fanboy. So, different mindset that a lot of unix sysadmins. 
> There have been occasional bugs in XFS in older kernel > releases, but then there have been bugs in other filesystems > too. That could be it, he does spend a fair amount of time cleaning up systems that others have built. > Never had that problem here. > > Does he know that he can use xfs_fsr to defragment > XFS filesystems online ? He certainly does. He was talking about using OpenNMS to determine the best time to run it. He had lots of good things to say about how easy it is to track through performance data with it. > Is he sure he's not hitting another kernel bug ? It wouldn't surprise me. This is someone who I trust enough that if he warns me of something, I make a real effort to doublecheck if it is currently a problem. It doesn't mean he is always right, just that I think the research effort is a really good idea. -- Matt It's not what I know that counts. It's what I can remember in time to use. From landman at scalableinformatics.com Tue Aug 5 19:43:47 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: References: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> Message-ID: <48990FE3.3080500@scalableinformatics.com> As a note: I was pointed to a recent lockup (double lock acquisition) in XFS with NFS. I don't think I have seen this one in the wild myself. Right now I am fighting an NFS over RDMA crash in 2.6.26 which seems to have been cured in 2.6.26.1 . .2 is almost out, so will test with that as well. This said, our experience with xfs has been quite good (performance, reliability, etc). Some vendors kernels (2.6.18 ahem!) have some issues with xfs (and a bunch of other things), so we usually update them anyway. 
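On the question of when to run xfs_fsr: a minimal sketch of one way to decide from numbers rather than habit. It assumes the report line printed by `xfs_db -r -c frag <device>` has the form `actual N, ideal M, fragmentation factor P%`, so check that against your own xfsprogs version. The function names and the 20% threshold are made up for illustration, not taken from anyone's setup in this thread.

```python
import re

# Assumed shape of the `xfs_db -r -c frag <device>` report line, e.g.
#   actual 4501, ideal 1203, fragmentation factor 73.27%
# (an assumption: verify against the xfsprogs version you run).
FRAG_RE = re.compile(r"actual (\d+), ideal (\d+), fragmentation factor ([\d.]+)%")


def parse_frag(output):
    """Return (actual_extents, ideal_extents, factor_percent) from the report text."""
    match = FRAG_RE.search(output)
    if match is None:
        raise ValueError("unrecognised xfs_db frag output")
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


def should_defrag(output, threshold_pct=20.0):
    """True if the reported fragmentation factor reaches the (hypothetical) threshold."""
    _, _, factor = parse_frag(output)
    return factor >= threshold_pct
```

Something like this could be fed from cron or a monitoring system, with xfs_fsr started only when should_defrag() says so and the machine is otherwise quiet.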
Joe Matt Lawrence wrote: > On Wed, 6 Aug 2008, Chris Samuel wrote: > >> I suspect you're not doing a lot of disk I/O, we >> found NFS servers using ext3 as a back end would >> crumble under the weight of lots of writes as ext3 >> is single threaded through the journal daemon. >> >> That means that you end up with all your NFS daemons >> blocking on that, stalling everything else. :-( > > Could be. Given the long and sordid history of NFS, I prefer to not use > it whenever there are practical alternatives. I'm also not a Solaris > fanboy. So, different mindset that a lot of unix sysadmins. > >> There have been occasional bugs in XFS in older kernel >> releases, but then there have been bugs in other filesystems >> too. > > That could be it, he does spend a fair amount of time cleaning up > systems that others have built. > >> Never had that problem here. >> >> Does he know that he can use xfs_fsr to defragment >> XFS filesystems online ? > > He certainly does. He was talking about using OpenNMS to determine the > best time to run it. He had lots of good things to say about how easy > it is to track through performance data with it. > >> Is he sure he's not hitting another kernel bug ? > > It wouldn't surprise me. > > This is someone who I trust enough that if he warns me of something, I > make a real effort to doublecheck if it is currently a problem. It > doesn't mean he is always right, just that I think the research effort > is a really good idea. > > -- Matt > It's not what I know that counts. > It's what I can remember in time to use. 
> _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From john.hearns at streamline-computing.com Tue Aug 5 23:57:26 2008 From: john.hearns at streamline-computing.com (John Hearns) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel file system? In-Reply-To: <4898C560.6040606@ldeo.columbia.edu> References: <4898C560.6040606@ldeo.columbia.edu> Message-ID: <1218005856.5116.5.camel@Vigor13> On Tue, 2008-08-05 at 17:25 -0400, Gus Correa wrote: > Hello Beowulf fans > > Is anybody using Infiniband to provide both > MPI connection and parallel file system services on a Beowulf cluster? > > I thought to have a storage node that would > serve a parallel file system to the beowulf nodes over IB > (something like a NFS on steroids). > The same IB net would also work as the MPI interconnect. > > Is this design possible? > Yes - just look at the fastest cluster in the world, Roadrunner. It uses Infiniband to access the Panasas parallel file system. In that architecture there are storage routers between the Infiniband and the Panasas. I'd imagine that TACC Ranger runs Lustre over Infiniband. From jiteshbdundas at gmail.com Wed Aug 6 01:07:37 2008 From: jiteshbdundas at gmail.com (jitesh dundas) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel file system? In-Reply-To: <1218005856.5116.5.camel@Vigor13> References: <4898C560.6040606@ldeo.columbia.edu> <1218005856.5116.5.camel@Vigor13> Message-ID: Dear Sir, I have this query for which I request your reply. 
It is possible to transfer data from one node to another using parallel computing. However, I wish to know if it is possible to migrate the settings of one node to another, assuming we are using multiple computers, not necessarily in the same network. Would not security and performance issues come into the picture here? Please excuse my ignorance, I have just started getting involved in Beowulf and I am trying to get my concepts clear. If you have any articles on Beowulf that could help, please do share them with me. Thanks & Regards, Jitesh Dundas On 8/6/08, John Hearns wrote: > On Tue, 2008-08-05 at 17:25 -0400, Gus Correa wrote: >> Hello Beowulf fans >> >> Is anybody using Infiniband to provide both >> MPI connection and parallel file system services on a Beowulf cluster? >> >> I thought to have a storage node that would >> serve a parallel file system to the beowulf nodes over IB >> (something like a NFS on steroids). >> The same IB net would also work as the MPI interconnect. >> >> Is this design possible? >> > > Yes - just look at the fastest cluster in the world, Roadrunner. > It uses Infiniband to access the Panasas parallel file system. In that > architecture there are storage routers between the Infiniband and the > Panasas. > > I'd imagine that TACC Ranger runs Lustre over Infiniband. > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From andrew at moonet.co.uk Wed Aug 6 02:26:27 2008 From: andrew at moonet.co.uk (andrew holway) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel file system? In-Reply-To: <4898C560.6040606@ldeo.columbia.edu> References: <4898C560.6040606@ldeo.columbia.edu> Message-ID: Gus, It works(ish) and people are doing it but my research has shown that it is not yet stable.
I have been talking to various companies offering lustre support. They have all told me that they can do it but none have been able to offer a reference site. I bet you the chaps behind roadrunner aren't going to be publishing any downtime figures. as mentioned by mark, If you try and force lots of stuff down the tubes you are going to break something. I guess its a _bit_ like torrents on a naff home router, lots and lots of small torrent connections filling up the nat table which cannot purge itself fast enough which mean that larger downloads time out as they fall off the bottom of the table. Data on how hard IB switches have to work would be interesting. I have a feeling that many people are taking their fabric to the very edge and back again! Perhaps someone can shed more light on the problems? ta Andy On Tue, Aug 5, 2008 at 10:25 PM, Gus Correa wrote: > Hello Beowulf fans > > Is anybody using Infiniband to provide both > MPI connection and parallel file system services on a Beowulf cluster? > > I thought to have a storage node that would > serve a parallel file system to the beowulf nodes over IB > (something like a NFS on steroids). > The same IB net would also work as the MPI interconnect. > > Is this design possible? > > On a small cluster, does it require two separate IB physical networks (cards > and switch), > or can it be done with a single IB card per node and one switch? > > Is this design efficient? > > Are there other practical and cost effective alternatives to this idea? > > Would this type of design work with GigE instead of IB? > > I confess I know nothing about parallel file systems and IB. > So, please forgive me if my questions are nonsense. > > I also appreciate any links to readings that would mitigate my ignorance > on these subjects. > > Thank you, > Gus Correa > > -- > --------------------------------------------------------------------- > Gustavo J. 
Ponce Correa, PhD - Email: gus@ldeo.columbia.edu > Lamont-Doherty Earth Observatory - Columbia University > P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA > --------------------------------------------------------------------- > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From perry at piermont.com Wed Aug 6 06:41:35 2008 From: perry at piermont.com (Perry E. Metzger) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: (Matt Lawrence's message of "Tue\, 5 Aug 2008 21\:37\:13 -0500 \(CDT\)") References: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> Message-ID: <87k5eu45ww.fsf@snark.cb.piermont.com> Matt Lawrence writes: > Could be. Given the long and sordid history of NFS, I prefer to not > use it whenever there are practical alternatives. NFS is a fine protocol and works very well. However, traditionally the Linux implementation of NFS has been of less than perfect quality. You shouldn't confuse NFS with NFS on Linux. Perry -- Perry E. Metzger perry@piermont.com From hahn at mcmaster.ca Wed Aug 6 07:15:22 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel file system? In-Reply-To: References: <4898C560.6040606@ldeo.columbia.edu> Message-ID: > It works(ish) and people are doing it but my research has shown that > it is not yet stable. "not stable" sounds like a bit of a smear. file and mpi activity _do_ coexist on a single network - the only issue is possible contention. it's not like NFS somehow ionizes the wires so MPI packets sort out ;) > I have been talking to various companies > offering lustre support. They have all told me that they can do it but > none have been able to offer a reference site. 
my organization has at least 4 production clusters which use the interconnect for both MPI and file (lustre) traffic. ironically, our one IB cluster has no local filestore, but two are quadrics, one is myri 2g and one is plain old gigabit. actually, now that I think of it, we have ~6 other myri 2g clusters that also share the IC between MPI and NFS. > as mentioned by mark, If you try and force lots of stuff down the > tubes you are going to break something. I guess its a _bit_ like contention is possible, but mixing NFS+MPI doesn't change anything. you can still run into fabric contention with pure MPI - after all, it's not as if _every_ MPI program was equally latency-tolerant or only used sparse tinygrams. there is NOTHING wrong with using a single network for NFS and MPI - just consider, preferably measure, your workload's traffic beforehand. if you can handle NFS purely via gigabit (ie, ~80 MB/s), it's probably very cheap to add a decent gigabit switch. of course, you can just as easily see the same contention with the right mix of MPI traffic - no panacea. From gerry.creager at tamu.edu Wed Aug 6 07:59:59 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: References: <373299326.37931217980077289.JavaMail.root@mail.vpac.org> <4898ED58.8020001@tamu.edu> Message-ID: <4899BC6F.8030404@tamu.edu> Robert Kubrick wrote: > Or use solid-state data disks? Does anybody here have experience with > SSD disk in HPC? Not on OUR budget! ;-) > On Aug 5, 2008, at 8:16 PM, Gerry Creager wrote: > >> Chris Samuel wrote: >>> ----- "Gerry Creager" wrote: >>>> Chris Samuel wrote: >>>> >>>>> b) We can use XFS for scratch space rather than being >>>>> tied to the RHEL One True Filesystem (ext3) which >>>>> (in our experience) can't handle large amounts of disk >>>>> I/O. >>>> Mirrors our experience, too. >>> I should point out that our actual NFS servers run >>> Debian Linux not CentOS. 
>>> Those who want to run the stock CentOS kernel might >>> like to know that the "plus" repository includes an >>> RPM for the XFS kernel module for the mainline kernel. >> >> And, of course, we do. Good point. -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From kus at free.net Wed Aug 6 08:39:43 2008 From: kus at free.net (Mikhail Kuzminsky) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <4899BC6F.8030404@tamu.edu> Message-ID: In message from Gerry Creager (Wed, 06 Aug 2008 09:59:59 -0500): >Robert Kubrick wrote: >> Or use solid-state data disks? Does anybody here have experience >>with >> SSD disk in HPC? > >Not on OUR budget! ;-) It was the proposal for journal part only ;-) SSD/flash disks (for increasing of lifetime) attempt not to erase data really - if it's physically possible. But if I use practically whole HDD partition for scratch files (and therefore whole SSD) - IMHO it'll be impossible not to erase flash RAM. What will be w/SSD disk lifetime in that case ? Mikhail Kuzminsky Computer Assistance to Chemical Research Zelinsky Institute of Organic Chemistry Moscow From Craig.Tierney at noaa.gov Wed Aug 6 11:47:03 2008 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel file system? In-Reply-To: References: <4898C560.6040606@ldeo.columbia.edu> Message-ID: <4899F1A7.805@noaa.gov> andrew holway wrote: > Gus, > > It works(ish) and people are doing it but my research has shown that > it is not yet stable. I have been talking to various companies > offering lustre support. They have all told me that they can do it but > none have been able to offer a reference site. > We are running our filesystems and MPI traffic over the same IB network. 
We are having no problems with this configuration. The system consists of two trees (each with ~70% bisection bandwidth) connected via a top-level tree to share IB communications between the filesystems and the compute nodes. One side of the tree has ~350 woodcrest nodes, the other ~250 harpertown nodes. We don't run jobs between the two systems, but both systems share the same filesystems. Just to complicate matters, we are supporting both Rapidscale and Lustre (v1.6.5.1) on our nodes. The most obvious job contention we have seen on the IB network is at the filesystem, not between the filesystem traffic and the MPI traffic. We had some issues with the subnet manager initially, but we have worked through them. The latest version of Lustre has been quite stable in our environment. As another poster already stated, I suspect that this configuration could be an issue for codes that are very latency sensitive. Our codes are more latency sensitive than bandwidth sensitive, and we haven't seen any significant issues (and the configuration has been stable so far). Craig > I bet you the chaps behind roadrunner aren't going to be publishing > any downtime figures. > > as mentioned by mark, If you try and force lots of stuff down the > tubes you are going to break something. I guess its a _bit_ like > torrents on a naff home router, lots and lots of small torrent > connections filling up the nat table which cannot purge itself fast > enough which mean that larger downloads time out as they fall off the > bottom of the table. > > Data on how hard IB switches have to work would be interesting. I have > a feeling that many people are taking their fabric to the very edge > and back again! > > Perhaps someone can shed more light on the problems? > > ta > > Andy > > On Tue, Aug 5, 2008 at 10:25 PM, Gus Correa wrote: >> Hello Beowulf fans >> >> Is anybody using Infiniband to provide both >> MPI connection and parallel file system services on a Beowulf cluster?
>> >> I thought to have a storage node that would >> serve a parallel file system to the beowulf nodes over IB >> (something like a NFS on steroids). >> The same IB net would also work as the MPI interconnect. >> >> Is this design possible? >> >> On a small cluster, does it require two separate IB physical networks (cards >> and switch), >> or can it be done with a single IB card per node and one switch? >> >> Is this design efficient? >> >> Are there other practical and cost effective alternatives to this idea? >> >> Would this type of design work with GigE instead of IB? >> >> I confess I know nothing about parallel file systems and IB. >> So, please forgive me if my questions are nonsense. >> >> I also appreciate any links to readings that would mitigate my ignorance >> on these subjects. >> >> Thank you, >> Gus Correa >> >> -- >> --------------------------------------------------------------------- >> Gustavo J. Ponce Correa, PhD - Email: gus@ldeo.columbia.edu >> Lamont-Doherty Earth Observatory - Columbia University >> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA >> --------------------------------------------------------------------- >> >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Craig Tierney (craig.tierney@noaa.gov) From rreis at aero.ist.utl.pt Tue Aug 5 02:57:42 2008 From: rreis at aero.ist.utl.pt (Ricardo Reis) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran In-Reply-To: References: Message-ID: On Mon, 4 Aug 2008, Mark Kosmowski wrote: > So, why did the 32-bit test case work? 
Shouldn't the same problem > crash both systems if it is a code issue?

I asked the same question myself...

The function interface is:

    call rfftwnd_f77_mpi(plan_c2r, &
         1, local_data, work, use_work, FFTW_NORMAL_ORDER)

where use_work is an integer, value 1 if you use the work temporary
array, 0 otherwise. This was the variable I wasn't passing.
FFTW_NORMAL_ORDER instructs fftw to return a proper ordering of the
array and FFTW_TRANSPOSED_ORDER cuts some comm steps making it more
efficient (and then you have to work out the array ordering yourself).

The wrapper function for this is (from rfftw_f77_mpi.c):

    void F77_FUNC_(rfftwnd_f77_mpi,RFFTWND_F77_MPI)
    (rfftwnd_mpi_plan *p, int *n_fields, fftw_real *local_data,
     fftw_real *work, int *use_work, int *ioutput_order)
    {
         fftwnd_mpi_output_order output_order = *ioutput_order ?
              FFTW_TRANSPOSED_ORDER : FFTW_NORMAL_ORDER;

         rfftwnd_mpi(*p, *n_fields, local_data, *use_work ? work : NULL,
                     output_order);
    }

and the code was blocking in the

    fftwnd_mpi_output_order output_order = *ioutput_order ?
         FFTW_TRANSPOSED_ORDER : FFTW_NORMAL_ORDER;

line. So it must be a pointer issue revealed by the 64 bit, no? When I
wasn't doing it "properly" the value of *ioutput_order wasn't set.

greets,

Ricardo Reis

'Non Serviam'

PhD student @ Lasef
Computational Fluid Dynamics, High Performance Computing, Turbulence
http://www.lasef.ist.utl.pt

&

Cultural Instigator @ Rádio Zero
http://www.radiozero.pt

http://www.flickr.com/photos/rreis/

From robertkubrick at gmail.com Tue Aug 5 17:34:51 2008
From: robertkubrick at gmail.com (Robert Kubrick)
Date: Fri Dec 5 01:07:34 2008
Subject: [Beowulf] Building new cluster - estimate
In-Reply-To: <4898ED58.8020001@tamu.edu>
References: <373299326.37931217980077289.JavaMail.root@mail.vpac.org> <4898ED58.8020001@tamu.edu>
Message-ID:

Or use solid-state data disks? Does anybody here have experience with SSD disk in HPC?
On Aug 5, 2008, at 8:16 PM, Gerry Creager wrote: > Chris Samuel wrote: >> ----- "Gerry Creager" wrote: >>> Chris Samuel wrote: >>> >>>> b) We can use XFS for scratch space rather than being >>>> tied to the RHEL One True Filesystem (ext3) which >>>> (in our experience) can't handle large amounts of disk >>>> I/O. >>> Mirrors our experience, too. >> I should point out that our actual NFS servers run >> Debian Linux not CentOS. >> Those who want to run the stock CentOS kernel might >> like to know that the "plus" repository includes an >> RPM for the XFS kernel module for the mainline kernel. > > And, of course, we do. Good point. > -- > Gerry Creager -- gerry.creager@tamu.edu > Texas Mesonet -- AATLT, Texas A&M University > Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 > Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From jimqiao at hotmail.com Wed Aug 6 09:29:26 2008 From: jimqiao at hotmail.com (Lei Qiao) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Torque manager Message-ID: Hello, forks, I am currently building a small Beowulf cluster and using Torque manager to schedule batch jobs. Now I have a problems with Torque command 'qstat'. as the manual said, all users' jobs can be seen when the command is issued. But for my case, I only see the my own jobs when i login with a regular user but i can see it when login with a root user ( before check the jobs, I remotely login with other regular users to run some jobs with 'qsub -l nodes=1 ./run.sh') I guess the reason is ssh and permission configuration. the root can access any user in any machine without password, whereas the regular user can not access the information of other users. Does anyone has similar experience or have the solution? Thanks for your help in advance. 
by the way, my cluster is Fedora Core 6 based, and good when running C program with MPI. Firewall is enabled and ssh, NFS and NIS is allowed. Lei Qiao Department of Electrical and Computer Engineering School of Engineering and Applied Sciences University of Rochester Rochester, NY 14627 _________________________________________________________________ Get more from your digital life. Find out how. http://www.windowslive.com/default.html?ocid=TXT_TAGLM_WL_Home2_082008 From jclinton at advancedclustering.com Wed Aug 6 11:31:09 2008 From: jclinton at advancedclustering.com (Jason Clinton) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Can one Infiniband net support MPI and a parallel file system? In-Reply-To: <4898C560.6040606@ldeo.columbia.edu> References: <4898C560.6040606@ldeo.columbia.edu> Message-ID: <588c11220808061131l6c3ecf49hc45b5da7151e64b8@mail.gmail.com> On Tue, Aug 5, 2008 at 4:25 PM, Gus Correa wrote: > Is anybody using Infiniband to provide both > MPI connection and parallel file system services on a Beowulf cluster? > > I thought to have a storage node that would > serve a parallel file system to the beowulf nodes over IB > (something like a NFS on steroids). > The same IB net would also work as the MPI interconnect. > > Is this design possible? We have customers doing Lustre and MPI with IB successfully. They still have a good-old gigabit management network to fall back on: it makes sense to keep this around because gigabit is so low-cost by comparison and it's rock-solid. But, you should know that you need more than a single node to provide disk I/O before you start to see the performance benefit. I/O from a single node can--generally--barely fill a gigabit link. To exceed that gigabit level of performance, you'd need more than one storage node delivering storage to the Lustre network. > On a small cluster, does it require two separate IB physical networks (cards > and switch), > or can it be done with a single IB card per node and one switch? 
It can be done with a single IB network. > Is this design efficient? Generally speaking, MPI programs will not be fetching/writing data from/to storage at the same time they are doing MPI calls so there tends to not be very much contention to worry about at the node level. > Are there other practical and cost effective alternatives to this idea? If the cluster is small enough, using gigabit with a shared filesystem is preferred since IB's low latency has relatively little effect on the big source of latency in any storage system: the physical disks. It's not until you cross the gigabit bandwidth barrier that IB really starts to make sense--and that's a barrier that's not crossed that often in a small cluster. > Would this type of design work with GigE instead of IB? Yes, but you'd still want IB for low latency MPI traffic. > I confess I know nothing about parallel file systems and IB. > So, please forgive me if my questions are nonsense. Lustre and Panasas are certainly both stable options in this area. -- Jason D. Clinton Advanced Clustering Technologies, Inc. From rgb at phy.duke.edu Wed Aug 6 12:34:52 2008 From: rgb at phy.duke.edu (Robert G. Brown) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <87k5eu45ww.fsf@snark.cb.piermont.com> References: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> <87k5eu45ww.fsf@snark.cb.piermont.com> Message-ID: On Wed, 6 Aug 2008, Perry E. Metzger wrote: > > Matt Lawrence writes: >> Could be. Given the long and sordid history of NFS, I prefer to not >> use it whenever there are practical alternatives. > > NFS is a fine protocol and works very well. However, traditionally the > Linux implementation of NFS has been of less than perfect quality. You > shouldn't confuse NFS with NFS on Linux. And even on Linux machines, NFS has been, well, "functional" is a good way to describe it. For its primary original purpose, which is serving home directories or remote mount e.g.
binaries in midsize and smaller workstation LANs, it is adequate and has worked well for us for almost ten years (not without some pain, mind you, but with no more pain than anything else). For the last five or six years even most of the pain has gone away and things like automounting work most of the time with only rare hangs or stale mount problems (on highly reliable server hardware and with a very reliable network). Once upon a time, running NFS in a LAN that wasn't controlled at the port level was basically openly inviting anyone that could plug into a wired port to have open access to all exported files, and I'm not sure that has fundamentally changed as to change it would be very difficult. A host that is permitted to mount a directory is typically known only by IP number (which of course anybody can set to masquerade as any host) and no hard authentication tokens are required. Also, traffic is typically not encrypted IIRC so anybody can snoop the wire if they're on it. I once upon a time had a few lovely cracking tools that let me just mount any user's home directory with no special privileges from userspace -- it didn't even require rootspace. I think things are better now, but still think of it as a tool to use primarily on trusted internal networks for primarily bandwidth-limited (few larger files) and not stat-limited (many smaller files) traffic. rgb > > Perry > -- Robert G. Brown Phone(cell): 1-919-280-8443 Duke University Physics Dept, Box 90305 Durham, N.C. 27708-0305 Web: http://www.phy.duke.edu/~rgb Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From perry at piermont.com Wed Aug 6 12:44:23 2008 From: perry at piermont.com (Perry E. Metzger) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: (Robert G.
Brown's message of "Wed\, 6 Aug 2008 15\:34\:52 -0400 \(EDT\)") References: <66655588.37901217979855077.JavaMail.root@mail.vpac.org> <87k5eu45ww.fsf@snark.cb.piermont.com> Message-ID: <87wsiuylm0.fsf@snark.cb.piermont.com> "Robert G. Brown" writes: > Once upon a time, running NFS in a LAN that wasn't controlled at the > port level was basically openly inviting anyone that could plug into a > wired port to have open access to all exported files, and I'm not sure > that has fundamentally changed as to change it would be very difficult. NFSv4 changes the security situation a bunch, but it is not widely implemented and deployed. One can also use IPSec with NFSv3 -- appropriate IPSec policies will assure that you get reasonable security, at the price of some performance because of the crypto. Perry -- Perry E. Metzger perry@piermont.com From gus at ldeo.columbia.edu Wed Aug 6 13:08:05 2008 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Torque manager In-Reply-To: References: Message-ID: <489A04A5.9060803@ldeo.columbia.edu> Hi Lei Qiao and list It may be just the Torque/PBS configuration. As root, try: qmgr -c 'set server your-pbs-server-name query_other_jobs = True' Then try qstat again as a regular user. I hope this helps. Gus Correa -- --------------------------------------------------------------------- Gustavo J. Ponce Correa, PhD - Email: gus@ldeo.columbia.edu Lamont-Doherty Earth Observatory - Columbia University P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Lei Qiao wrote: >Hello, forks, > >I am currently building a small Beowulf cluster and using Torque manager to schedule batch jobs. Now I have a problems with Torque command 'qstat'. >as the manual said, all users' jobs can be seen when the command is issued. 
But in my case, I only see my own jobs when I log in as a regular user, but I can see them all when I log in as root (before checking the jobs, I remotely logged in as other regular users to run some jobs with 'qsub -l nodes=1 ./run.sh') > >I guess the reason is ssh and permission configuration. Root can access any user on any machine without a password, whereas a regular user cannot access the information of other users. Does anyone have similar experience or a solution? Thanks for your help in advance. > >By the way, my cluster is Fedora Core 6 based, and works well when running C programs with MPI. The firewall is enabled and ssh, NFS and NIS are allowed. > > >Lei Qiao >Department of Electrical and Computer Engineering >School of Engineering and Applied Sciences >University of Rochester >Rochester, NY 14627 > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From jclinton at advancedclustering.com Wed Aug 6 12:56:51 2008 From: jclinton at advancedclustering.com (Jason Clinton) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Seeing ECC errors since upgraded from Opteron 246 to 275 In-Reply-To: <20670.89.180.225.196.1217678257.squirrel@www.di.fct.unl.pt> References: <2405.10.170.133.93.1217605242.squirrel@www.di.fct.unl.pt> <20670.89.180.225.196.1217678257.squirrel@www.di.fct.unl.pt> Message-ID: <588c11220808061256o7140c9aekdb9b88f7047d4dde@mail.gmail.com> On Sat, Aug 2, 2008 at 6:57 AM, Paulo Afonso Lopes wrote: > Thanks, Mark > >>> So I have 2 DL145-G2 nodes with 2 single-core 246 / 4GB each, and 2 >>> DL145-G2 nodes with 2 dual-core 275 / 4GB each. >> >> it's worth making sure you have current bios installed.
>> > Not the latest, but the previous; according to "Fixes" just a single, > unrelated fix. Anyway I'm upgrading it... >> >>> 07/28/2008 | 17:52:23 | Memory #0x02 | Uncorrectable ECC | Asserted >> >> it may also be useful to run mcelog, which will tell you about >> any ongoing _correctable_ ECC activity. > > No output in any of the 4 hosts; tried with/without --k8, --dmi, etc. We have a tool on our website called "breakin" that is Linux 2.6.25.9 patched with K8 and K10f Opteron EDAC reporting facilities. It can usually find and identify failed RAM in fifteen minutes (two hours at most). The EDAC patches to the kernel aren't that great about naming the correct memory rank, though. Make sure you have multibit (sometimes says 4-bit) ECC enabled in your BIOS. http://www.advancedclustering.com/software/breakin.html From dnlombar at ichips.intel.com Wed Aug 6 14:56:01 2008 From: dnlombar at ichips.intel.com (Lombard, David N) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] fftw2, mpi, from 32 bit to 64 and fortran In-Reply-To: References: Message-ID: <20080806215601.GA2375@nlxdcldnl2.cl.intel.com> On Tue, Aug 05, 2008 at 02:57:42AM -0700, Ricardo Reis wrote: > On Mon, 4 Aug 2008, Mark Kosmowski wrote: > > > So, why did the 32-bit test case work? Shouldn't the same problem > > crash both systems if it is a code issue? Not necessarily given the error described below. > I asked the same question myself... The function interface is: > > call rfftwnd_f77_mpi(plan_c2r, & > 1, local_data, work, use_work, FFTW_NORMAL_ORDER) > > where use_work is an integer, value 1 if you use the work temporary > array, 0 otherwise. This was the variable I wasn't passing. ... > The wrapper function for this is (from rfftw_f77_mpi.c): > > void F77_FUNC_(rfftwnd_f77_mpi,RFFTWND_F77_MPI) > (rfftwnd_mpi_plan *p, int *n_fields, fftw_real *local_data, > fftw_real *work, int *use_work, int *ioutput_order) > .... So it must be a pointer issue revealed by the 64 bit, no? 
When I > wasn't doing it "properly" the value of *ioutput_order wasn't set. The value of the first element of local_data was used for the n_fields scalar. The work array was being laid down starting at the location of the use_work scalar. The FFTW_NORMAL_ORDER value was being interpreted as the use_work scalar. Finally, the ioutput_order scalar was some random value. So, a lot was going wrong there. It's just one of life's little, um, pleasures that it looked like it was working for your 32-bit test case. Don't worry, you'll likely do this again, as likely *every* one of us on this list has, too. BTW, Fortran passes by reference; that's why all args are pointers. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From csamuel at vpac.org Wed Aug 6 16:48:32 2008 From: csamuel at vpac.org (Chris Samuel) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <1675460398.47611218066218095.JavaMail.root@mail.vpac.org> Message-ID: <382428251.47851218066512841.JavaMail.root@mail.vpac.org> ----- "Robert G. Brown" wrote: > And even on Linux machines, NFS has been, well, "functional" > is a good way to describe it. It actually seems to work pretty well these days, our general config is: 1) No automounter 2) Hard mounts (so jobs just hang if they lose contact) 3) NFS over TCP (NFS over UDP is sooo 1990's :-)) 4) Jumbo frames (9000 byte MTUs) on the NFS network 5) NFS file server has hardwired fsid's to prevent stale file handles on a reboot 6) Debian, not RHEL on the server 7) XFS for /home on the server cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O.
Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From kyron at neuralbs.com Wed Aug 6 17:44:39 2008 From: kyron at neuralbs.com (Eric Thibodeau) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: References: <627174098.322311217331658556.JavaMail.root@zimbra.vpac.org> Message-ID: <489A4577.5040508@neuralbs.com> Bogdan Costescu wrote: > On Tue, 29 Jul 2008, Chris Samuel wrote: > >> 1) Use a mainline kernel, we've found benefit of that >> over stock CentOS kernels. > > Care to comment on this statement ? I do ;) Simply download a kernel from kernel.org and build the kernel yourself and set: CONFIG_HZ_100=y CONFIG_HZ=100 CONFIG_PREEMPT_NONE=y And select the main stuff (HDD drivers) as built in and don't fsck around with the initrd stuff, that's only useful for kernels that need to be generic and adapt to all hardware (ie: install CDs)...other than that, a monolithic kernel works fine ;) ...and such. I'd tell you to use the Gentoo Clustering LiveCD but that's work in progress...you could still build the cluster using Gentoo...if you're performance savvy...and want things like an OpenMP capable compiler (gcc-4.3.1, or ICC ;) ) _integrated_ into your system (not a hackish afterthought of an RPM that pulls in a new glibc that breaks the install anyways ;) ...but, then again...no distribution war, seems people want the easy install solution and veil that fact with the "it has to be supported" catch phrase Eh!
Eric Thibodeau From landman at scalableinformatics.com Wed Aug 6 18:05:01 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <489A4577.5040508@neuralbs.com> References: <627174098.322311217331658556.JavaMail.root@zimbra.vpac.org> <489A4577.5040508@neuralbs.com> Message-ID: <489A4A3D.1070000@scalableinformatics.com> Eric Thibodeau wrote: > And select the main stuff (HDD drivers) as built in and don't fsck > around with the initrd stuff, that's only usefull for kernels that need > to be generic and adapt to all hardware (ie: install CDs)...other than > that, monolithic a kernel works fine ;) Advantage of modules is you can upgrade them without upgrading the kernel. Go ahead, build in that e1000 driver. I dare yah... :( More to the point it does give some good flexibility for end users with a need to keep the core "separate" from the drivers for maintenance. Initrd is subtle and quick to anger. One must use burnt offerings to placate the spirits of initrd. Well, it would be a heck of a lot nicer if the tools were a little more forgiving ... Oh you don't have this driver in your initrd ... ok ... PANIC (mwahahahaha) > > > ...and such. I'd tell you to use the Gentoo Clustering LiveCD but that's > work in progress...you could still build the cluster using Gentoo...if > you're performance savvy...and want things like OpenMP capable compiler I have been hearing claims like this for a long time. I have not seen any real tests that back these claims up. Do you have any? Most of the arguments I have heard are "oh but its compiled with -O3" or whatever. Any decent HPC code person will tell you that that is most definitely not a guaranteed way to a faster system ... > (gcc-4.3.1, or ICC ;) ) _integrated_ into your system (not a hackish Er... We often use several different compilers in several different trees. Several gccs, pgi, icc, eieio ... you name it. All are integrated. 
> afterthought of an RPM that pulls in a new glibc that breaks the install Er ... not the slightest clue as to what you are talking about. I haven't seen gcc, icc, pgi, ... touch our glibc. Maybe I am missing the fun. Which ICC version is this? Which gcc is this, which glibc is this? -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From kyron at neuralbs.com Wed Aug 6 18:07:01 2008 From: kyron at neuralbs.com (Eric Thibodeau) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: References: <841759592.334291217464770684.JavaMail.root@zimbra.vpac.org> <4896F6B4.60602@scalableinformatics.com> Message-ID: <489A4AB5.9090704@neuralbs.com> Matt Lawrence wrote: > On Mon, 4 Aug 2008, Joe Landman wrote: > >> This mirrors our experience, though RHEL stability under intense >> loads is questionable IMO (talking about the kernel BTW). We find >> that the missing drivers, the omitted drivers, the backported drivers >> along with some odd and often useless "features" (4k stacks anyone?) >> render the RHEL default kernels (and by definition the Centos >> kernels) less useful for HPC and storage tasks than what we build. >> Our current standard is a 2.6.23.14 kernel which is rock solid under >> load. Working on a 2.6.26 based version now (even though I am on >> vacation/holiday, I just updated it to 2.6.26.1 to address an >> observed crashing issue with the RDMA server) > > Since I plan to continue running CentOS, it sounds like building a > much later kernel rpm is the way I want to approach the problem. Will > going to a much later kernel break any of the utilities? Other > problems I can expect to see? > > What do you recommend for the kernel config? 
> >> Combine this with the small upper limit of ext3 partition sizes, the >> file size limits in ext3, the serialization in the journaling code >> (ext4 is extents based to help deal with this), ext3 just doesn't >> make much sense in a storage/HPC system (apart from possibly >> boot/root file system where performance is less critical). Yeah I >> have seen studies from folks who had done 1E6 removes, file creates, >> and other things who claim xfs is slower than ext3. Yeah, those are >> bad benchmarks in that they really don't touch on real end user use >> cases for the most part (apart from possible large scale mail servers >> and other things like that). > > I have never had any problems with ext3. I had dinner with a friend > who is an expert Linux sysadmin who was warning me to stay away from > xfs. He cited lots of fragmentation problems that routinely locked up > his systems. I am willing to be convinced otherwise, but he is a very > sharp fellow. Check the kernel mailing list for XFS problems with RAID5 if you use mdadm...just a gentle suggestion ;) > > -- Matt > It's not what I know that counts. > It's what I can remember in time to use.
From kyron at neuralbs.com Wed Aug 6 18:33:10 2008 From: kyron at neuralbs.com (Eric Thibodeau) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <489A4A3D.1070000@scalableinformatics.com> References: <627174098.322311217331658556.JavaMail.root@zimbra.vpac.org> <489A4577.5040508@neuralbs.com> <489A4A3D.1070000@scalableinformatics.com> Message-ID: <489A50D6.4060302@neuralbs.com> Joe Landman wrote: > > > Eric Thibodeau wrote: > >> And select the main stuff (HDD drivers) as built in and don't fsck >> around with the initrd stuff, that's only usefull for kernels that >> need to be generic and adapt to all hardware (ie: install >> CDs)...other than that, monolithic a kernel works fine ;) > > Advantage of modules is you can upgrade them without upgrading the > kernel. Go ahead, build in that e1000 driver. I dare yah... :( Ok...I didn't put enough emphasis on "main" stuff....as in, _all you need to get the system booted_, which essentially means HDD chipset drivers, the rest I do build as a module (NIC, video and such). > > More to the point it does give some good flexibility for end users > with a need to keep the core "separate" from the drivers for maintenance. > > Initrd is subtle and quick to anger. One must use burnt offerings to > placate the spirits of initrd. LOL! > > Well, it would be a heck of a lot nicer if the tools were a little > more forgiving ... Oh you don't have this driver in your initrd ... ok > ... PANIC (mwahahahaha) Pahahahahah... Case in point, I am building a CD-only cluster system (based on Gentoo) and I am currently _NOT_ using initrd because all that really needs to be built in is NFSroot support and all NICs I care to put in.
Obviously this is a deprecated approach but it's proven to be the most effective and easy to maintain in my case. >> >> >> ...and such. I'd tell you to use the Gentoo Clustering LiveCD but >> that's work in progress...you could still build the cluster using >> Gentoo...if you're performance savvy...and want things like OpenMP >> capable compiler > > I have been hearing claims like this for a long time. I have not seen > any real tests that back these claims up. Do you have any? I'm actually working on such benchmarks. Did you know that compiling with the default ICC optimization will cause your bridge to crumble due to floating point assumptions?... Ok, so my computations have diverged horribly mostly because I am computing 47(vector size)*5000(K-Means clusters)*6,787,955(learning dataset)*5(iterations to convergence) for a total of 7,975,847,125,000 FLOPS (or about 8Tera FLOPS) as part of an iterative learning process, the error adds up. So performance is very sensitive to what your intended goal is too ;) > Most of the arguments I have heard are "oh but its compiled with > -O3" or whatever. Any decent HPC code person will tell you that that > is most definitely not a guaranteed way to a faster system ... Hey...as I stated above, one would have to be quite silly to claim -O3 as the be-all and end-all optimization solution. At least you can rest assured your solutions will add up correctly with GCC. To get a "faster" system, you really have to look at your app, use strace, ltrace and gprof, then you can play with that. What I _am_ saying though is that Gentoo _does_ empower the administrator by giving him the ability to customize the OS if a bottleneck is to be identified. > >> (gcc-4.3.1, or ICC ;) ) _integrated_ into your system (not a hackish > > Er... We often use several different compilers in several different > trees. Several gccs, pgi, icc, eieio ... you name it. All are > integrated.
Are you currently able to run GCC-4.3.x versions on your current setup? I'm actually eager to know. I'm still living under the ASSumption of binary distributions not coping too well with multi-library environments. Case in point, one of my colleagues _really_ wanted Firefox 3 on his Ubuntu system. The installer trickled down to having to uninstall glibc...and he forced it to YES (and this is just a browser, not something that is used to _make_ code and would be tied to glibc) > >> afterthought of an RPM that pulls in a new glibc that breaks the install > > Er ... not the slightest clue as to what you are talking about. I > haven't seen gcc, icc, pgi, ... touch our glibc. > > Maybe I am missing the fun. Which ICC version is this? Which gcc is > this, which glibc is this? > Sorry about that, I might have been misleading: GCC is generally the one most sensitive to glibc, not the other ones, although the latest ICC (10.1.x series) does claim compatibility with the GNU environment so it might get a little more dependency there. Cheers! Eric From landman at scalableinformatics.com Wed Aug 6 19:01:17 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Fri Dec 5 01:07:34 2008 Subject: [Beowulf] Building new cluster - estimate In-Reply-To: <489A50D6.4060302@neuralbs.com> References: <627174098.322311217331658556.JavaMail.root@zimbra.vpac.org> <489A4577.5040508@neuralbs.com> <489A4A3D.1070000@scalableinformatics.com> <489A50D6.4060302@neuralbs.com> Message-ID: <489A576D.4000005@scalableinformatics.com> Eric Thibodeau wrote: >> Advantage of modules is you can upgrade them without upgrading the >> kernel. Go ahead, build in that e1000 driver. I dare yah... :( > Ok...I didn't put enough emphasis on "main" stuff....as in, _all you > need to get the system booted, which essentially means HDD chipset > drivers, the rest I do build as a module (NIC, video and such).
>> >> More to the point it does give some good flexibility for end users >> with a need to keep the core "separate" from the drivers for maintenance. >> >> Initrd is subtle and quick to anger. One must use burnt offerings to >> placate the spirits of initrd. > LOL! ... now I don't mean hardware burnt offerings ... smoke rising from your motherboard may not placate the spirits of initrd, they definitely may impede further operations ... >> >> Well, it would be a heck of a lot nicer if the tools were a little >> more forgiving ... Oh you don't have this driver in your initrd ... ok >> ... PANIC (mwahahahaha) > Pahahahahah... Point in case, I am building a CD-only cluster system > (based on Gentoo) and I am currently _NOT_ using initrd because all that > really needs to be built in is NFSroot support an all NICs I care to put > in. Obviously this is a deprecated approach but it's proven to be the > most effective and easy to maintain in my case. We build an integrated NFSroot and e1000 and a few other things for a customer. Fixed hardware for their cluster. From bare-metal-off to operational infiniband compute node in ~45-60 seconds (I say 45, but a few things took a little longer to start, like SGE). >>> >>> >>> ...and such. I'd tell you to use the Gentoo Clustering LiveCD but >>> that's work in progress...you could still build the cluster using >>> Gentoo...if you're performance savvy...and want things like OpenMP >>> capable compiler >> >> I have been hearing claims like this for a long time. I have not seen >> any real tests that back these claims up. Do you have any? > I'm actually working on such benchmarks. Did you know that compiling > with the default ICC optimization will cause your bridge to crumble due > to floating point assumptions?... 
> > Ok, so my computation have diverged horribly mostly because I am > computing 47(vector size)*5000(K-Means clusters)*6,787,955(learning > dataset)*5(iterations to convergence) for a total of 7,975,847,125,000 > FLOPS (or about 8Tera FLOPS) as part of an iterative learning process, > the error adds up. So performance is very sensitive to what your > intended goal is too ;) Hmmm.... sounds like a fun computation. Error definitely adds up. Renormalization is your friend (well, some times, assuming a linear system). >> Most of the arguments I have heard are "oh but its compiled wit