From landman at scalableinformatics.com Tue Dec 1 10:02:43 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 01 Dec 2009 13:02:43 -0500 Subject: [Beowulf] Forwarded from a long time reader having trouble posting Message-ID: <4B155A43.3010304@scalableinformatics.com> My apologies if this is bad form, I know Toon from his past participation on this list, and he asked me to forward. -------- Original Message -------- Dear all, I've been working on hpux-itanium for the last 2 years (and even unsubscribed to beowulf-ml during most of that time, my bad) but soon will turn back to a beowulf cluster (HP DL380G6's with Xeon X5570, amcc/3ware 9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have a few questions on the config. 1) our company is standardised on RHEL 5.1. Would sticking with rhel 5.1 instead of going to the latest make a difference. 2) What are the advantages of the hpc version of rhel. I browsed the doc but unless having to compile mpi myself I do not see a difference or did I miss soth. 3) which filesystem is advisable knowing that we're calculating on large berkeley db databases thanks in advance, toon Toon Knapen toon.knapen at gmail.com ----------------------------------- -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From gerry.creager at tamu.edu Tue Dec 1 11:45:13 2009 From: gerry.creager at tamu.edu (Gerald Creager) Date: Tue, 01 Dec 2009 13:45:13 -0600 Subject: [Beowulf] Forwarded from a long time reader having trouble posting In-Reply-To: <4B155A43.3010304@scalableinformatics.com> References: <4B155A43.3010304@scalableinformatics.com> Message-ID: <4B157249.80701@tamu.edu> Toon, welcome back! I've been quite happy with CentOS 5.3 and we're experimenting with CentOS 5.4 now. I see good stability in 5.[34] and the incorporation of a couple of tools worth having in a distribution for 'Wulf use. I'd not recommend sticking with the old version, but of course, once you're established, not carelessly upgrading, either. gerry Joe Landman wrote: > My apologies if this is bad form, I know Toon from his past > participation on this list, and he asked me to forward. > > -------- Original Message -------- > > Dear all, > > I've been working on hpux-itanium for the last 2 years (and even > unsubscribed to beowulf-ml during most of that time, my bad) but soon > will turn back to a beowulf cluster (HP DL380G6's with Xeon X5570, > amcc/3ware 9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have a few > questions on the config. > > 1) our company is standardised on RHEL 5.1. Would sticking with rhel 5.1 > instead of going to the latest make a difference. > 2) What are the advantages of the hpc version of rhel. I browsed the doc > but unless having to compile mpi myself I do not see a difference or did > I miss soth. 
> 3) which filesystem is advisable knowing that we're calculating on large > berkeley db databases > > thanks in advance, > > toon > > Toon Knapen toon.knapen at gmail.com > > ----------------------------------- > From lindahl at pbm.com Tue Dec 1 11:57:38 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 1 Dec 2009 11:57:38 -0800 Subject: [Beowulf] MPI Processes + Auto Vectorization In-Reply-To: <428810f20911302214x4b85a07du4684bbb57f60a72b@mail.gmail.com> References: <428810f20911301224u64783b27q465a1bc73918849@mail.gmail.com> <20091130225024.GA28311@nlxdcldnl2.cl.intel.com> <428810f20911302214x4b85a07du4684bbb57f60a72b@mail.gmail.com> Message-ID: <20091201195738.GA24566@bx9.net> On Tue, Dec 01, 2009 at 01:14:13AM -0500, amjad ali wrote: > My question is that if we do not have free cpu cores in a PC or cluster (all > cores are running MPI processes), still the auto-vertorization is > beneficial? Or it is beneficial only if we have some free cpu cores locally? Perhaps you're confusing auto-parallelization and auto-vectorization? Auto-vectorization does not use any more cpu cores than unvectorized code. -- greg From tom.elken at qlogic.com Tue Dec 1 11:57:29 2009 From: tom.elken at qlogic.com (Tom Elken) Date: Tue, 1 Dec 2009 11:57:29 -0800 Subject: [Beowulf] Forwarded from a long time reader having trouble posting In-Reply-To: <4B157249.80701@tamu.edu> References: <4B155A43.3010304@scalableinformatics.com> <4B157249.80701@tamu.edu> Message-ID: <35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org> > On Behalf Of Gerald Creager > I've been quite happy with CentOS 5.3 and we're experimenting with > CentOS 5.4 now. I see good stability in 5.[34] I have to second the recommendation of 5.3 or 5.4. Some time ago, we saw significant performance improvements on Nehalem (Xeon X5570) in moving from RHEL 5.2 to 5.3. So I expect that moving from 5.1 to 5.[34] would also be a significant improvement in performance. Cheers, -Tom > and the incorporation > of > a couple of tools worth having in a distribution for 'Wulf use. I'd > not > recommend sticking with the old version, but of course, once you're > established, not carelessly upgrading, either. > > gerry > > Joe Landman wrote: > > My apologies if this is bad form, I know Toon from his past > > participation on this list, and he asked me to forward. > > > > -------- Original Message -------- > > > > Dear all, > > > > I've been working on hpux-itanium for the last 2 years (and even > > unsubscribed to beowulf-ml during most of that time, my bad) but soon > > will turn back to a beowulf cluster (HP DL380G6's with Xeon X5570, > > amcc/3ware 9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have a few > > questions on the config. > > > > 1) our company is standardised on RHEL 5.1. Would sticking with rhel > 5.1 > > instead of going to the latest make a difference. > > 2) What are the advantages of the hpc version of rhel. I browsed the > doc > > but unless having to compile mpi myself I do not see a difference or > did > > I miss soth. 
> > 3) which filesystem is advisable knowing that we're calculating on > large > > berkeley db databases > > > > thanks in advance, > > > > toon > > > > Toon Knapen toon.knapen at gmail.com > > > > ----------------------------------- > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From gerry.creager at tamu.edu Tue Dec 1 13:04:32 2009 From: gerry.creager at tamu.edu (Gerald Creager) Date: Tue, 01 Dec 2009 15:04:32 -0600 Subject: [Beowulf] Forwarded from a long time reader having trouble posting In-Reply-To: References: <4B155A43.3010304@scalableinformatics.com> <4B157249.80701@tamu.edu> <35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org> Message-ID: <4B1584E0.8060704@tamu.edu> A combination of mostly kernel improvements, and some useful middleware as RedHat and by extension, CentOS, seek to get farther into the cluster space. gerry Toon Knapen wrote: > Any idea why it gives better performance? Was it on memory bw intensive > apps? Could it be due to changes in the kernel that take into account > the Numa architecture (affinity) or ... > > On Tue, Dec 1, 2009 at 8:57 PM, Tom Elken > wrote: > > > On Behalf Of Gerald Creager > > I've been quite happy with CentOS 5.3 and we're experimenting with > > CentOS 5.4 now. I see good stability in 5.[34] > > I have to second the recommendation of 5.3 or 5.4. > > Some time ago, we saw significant performance improvements on > Nehalem (Xeon X5570) in moving from RHEL 5.2 to 5.3. So I expect > that moving from 5.1 to 5.[34] would also be a significant > improvement in performance. > > Cheers, > -Tom > > > and the incorporation > > of > > a couple of tools worth having in a distribution for 'Wulf use. I'd > > not > > recommend sticking with the old version, but of course, once you're > > established, not carelessly upgrading, either. > > > > gerry > > > > Joe Landman wrote: > > > My apologies if this is bad form, I know Toon from his past > > > participation on this list, and he asked me to forward. > > > > > > -------- Original Message -------- > > > > > > Dear all, > > > > > > I've been working on hpux-itanium for the last 2 years (and even > > > unsubscribed to beowulf-ml during most of that time, my bad) > but soon > > > will turn back to a beowulf cluster (HP DL380G6's with Xeon X5570, > > > amcc/3ware 9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have > a few > > > questions on the config. > > > > > > 1) our company is standardised on RHEL 5.1. Would sticking with > rhel > > 5.1 > > > instead of going to the latest make a difference. > > > 2) What are the advantages of the hpc version of rhel. I > browsed the > > doc > > > but unless having to compile mpi myself I do not see a > difference or > > did > > > I miss soth. 
> > > 3) which filesystem is advisable knowing that we're calculating on > > large > > > berkeley db databases > > > > > > thanks in advance, > > > > > > toon > > > > > > Toon Knapen toon.knapen at gmail.com > > > > > > ----------------------------------- > > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org > sponsored by Penguin > > Computing > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > From amacater at galactic.demon.co.uk Tue Dec 1 13:42:18 2009 From: amacater at galactic.demon.co.uk (Andrew M.A. Cater) Date: Tue, 1 Dec 2009 21:42:18 +0000 Subject: [Beowulf] Forwarded from a long time reader having trouble posting In-Reply-To: <4B1584E0.8060704@tamu.edu> References: <4B155A43.3010304@scalableinformatics.com> <4B157249.80701@tamu.edu> <35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org> <4B1584E0.8060704@tamu.edu> Message-ID: <20091201214217.GA19804@galactic.demon.co.uk> On Tue, Dec 01, 2009 at 03:04:32PM -0600, Gerald Creager wrote: > A combination of mostly kernel improvements, and some useful middleware > as RedHat and by extension, CentOS, seek to get farther into the cluster > space. > gerry > Maybe also some licensing breaks on large volume licensing. Red Hat is primarily a sales and service organisation that also produces a Linux by-product :) The HPC variant is targetted at areas which deal in large clusters at cheaper than Red Hat Enterprise Linux for servers at equivalent volume, IIRC. > Toon Knapen wrote: >> Any idea why it gives better performance? Was it on memory bw intensive >> apps? Could it be due to changes in the kernel that take into account >> the Numa architecture (affinity) or ... >> >> On Tue, Dec 1, 2009 at 8:57 PM, Tom Elken > > wrote: >> >> > On Behalf Of Gerald Creager >> > I've been quite happy with CentOS 5.3 and we're experimenting with >> > CentOS 5.4 now. I see good stability in 5.[34] >> >> I have to second the recommendation of 5.3 or 5.4. I like 5.2 and 5.4 but my normal experience is on relatively stock IBM hardware. As ever, YMMV. >> > > Dear all, >> > > >> > > I've been working on hpux-itanium for the last 2 years (and even >> > > unsubscribed to beowulf-ml during most of that time, my bad) >> but soon >> > > will turn back to a beowulf cluster (HP DL380G6's with Xeon X5570, >> > > amcc/3ware 9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have >> a few >> > > questions on the config. >> > > >> > > 1) our company is standardised on RHEL 5.1. Would sticking with >> rhel >> > 5.1 >> > > instead of going to the latest make a difference. Since you have up to date hardware - also check on the necessary version of 3Ware drivers and where they are supported. The command line utilities are particularly useful. >> > > 2) What are the advantages of the hpc version of rhel. I >> browsed the >> > doc >> > > but unless having to compile mpi myself I do not see a >> difference or >> > did >> > > I miss soth. See above. >> > > 3) which filesystem is advisable knowing that we're calculating on >> > large >> > > berkeley db databases >> > > You get ext3 or Red Hat's cluster filesystem ?? GFS ??, I think. No xfs / Reiser by default. 
Check also with HP as to what file systems they would recommend. >> > > thanks in advance, >> > > >> > > toon >> > > >> > > Toon Knapen toon.knapen at gmail.com >> > > Always happy to pontificate :) All the best, Andy From tom.elken at qlogic.com Tue Dec 1 13:51:18 2009 From: tom.elken at qlogic.com (Tom Elken) Date: Tue, 1 Dec 2009 13:51:18 -0800 Subject: [Beowulf] Forwarded from a long time reader having trouble posting In-Reply-To: References: <4B155A43.3010304@scalableinformatics.com> <4B157249.80701@tamu.edu> <35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org> Message-ID: <35AAF1E4A771E142979F27B51793A48887030F12E9@AVEXMB1.qlogic.org> Toon wrote: "Any idea why it gives better performance? Was it on memory bw intensive apps? Could it be due to changes in the kernel that take into account the Numa architecture (affinity)" This is very likely it. Dredging thru old e-mails, I see that, in Jan-09, the application of a then-current 2.6.28 kernel to a RHEL 5.2 system provided about a 2x improvement in 8-thread OpenMP STREAM performance. Subsequently, moving to RHEL 5.3 and its default kernel provided the same good STREAM performance. -Tom From: Toon Knapen [mailto:toon.knapen at gmail.com] Sent: Tuesday, December 01, 2009 12:53 PM To: Tom Elken Cc: gerry.creager at tamu.edu; landman at scalableinformatics.com; beowulf Subject: Re: [Beowulf] Forwarded from a long time reader having trouble posting Any idea why it gives better performance? Was it on memory bw intensive apps? Could it be due to changes in the kernel that take into account the Numa architecture (affinity) or ... On Tue, Dec 1, 2009 at 8:57 PM, Tom Elken > wrote: > On Behalf Of Gerald Creager > I've been quite happy with CentOS 5.3 and we're experimenting with > CentOS 5.4 now. I see good stability in 5.[34] I have to second the recommendation of 5.3 or 5.4. Some time ago, we saw significant performance improvements on Nehalem (Xeon X5570) in moving from RHEL 5.2 to 5.3. So I expect that moving from 5.1 to 5.[34] would also be a significant improvement in performance. Cheers, -Tom > and the incorporation > of > a couple of tools worth having in a distribution for 'Wulf use. I'd > not > recommend sticking with the old version, but of course, once you're > established, not carelessly upgrading, either. > > gerry > > Joe Landman wrote: > > My apologies if this is bad form, I know Toon from his past > > participation on this list, and he asked me to forward. > > > > -------- Original Message -------- > > > > Dear all, > > > > I've been working on hpux-itanium for the last 2 years (and even > > unsubscribed to beowulf-ml during most of that time, my bad) but soon > > will turn back to a beowulf cluster (HP DL380G6's with Xeon X5570, > > amcc/3ware 9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have a few > > questions on the config. > > > > 1) our company is standardised on RHEL 5.1. Would sticking with rhel > 5.1 > > instead of going to the latest make a difference. > > 2) What are the advantages of the hpc version of rhel. I browsed the > doc > > but unless having to compile mpi myself I do not see a difference or > did > > I miss soth. 
> > 3) which filesystem is advisable knowing that we're calculating on > large > > berkeley db databases > > > > thanks in advance, > > > > toon > > > > Toon Knapen toon.knapen at gmail.com > > > > ----------------------------------- > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From gerry.creager at tamu.edu Tue Dec 1 14:05:36 2009 From: gerry.creager at tamu.edu (Gerald Creager) Date: Tue, 01 Dec 2009 16:05:36 -0600 Subject: [Beowulf] Forwarded from a long time reader having trouble posting In-Reply-To: <20091201214217.GA19804@galactic.demon.co.uk> References: <4B155A43.3010304@scalableinformatics.com> <4B157249.80701@tamu.edu> <35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org> <4B1584E0.8060704@tamu.edu> <20091201214217.GA19804@galactic.demon.co.uk> Message-ID: <4B159330.5070802@tamu.edu> Andrew M.A. Cater wrote: > On Tue, Dec 01, 2009 at 03:04:32PM -0600, Gerald Creager wrote: >> A combination of mostly kernel improvements, and some useful middleware >> as RedHat and by extension, CentOS, seek to get farther into the cluster >> space. >> gerry >> > > Maybe also some licensing breaks on large volume licensing. Red Hat is > primarily a sales and service organisation that also produces a Linux > by-product :) The HPC variant is targetted at areas which deal in large > clusters at cheaper than Red Hat Enterprise Linux for servers at > equivalent volume, IIRC. > >> Toon Knapen wrote: >>> Any idea why it gives better performance? Was it on memory bw intensive >>> apps? Could it be due to changes in the kernel that take into account >>> the Numa architecture (affinity) or ... >>> >>> On Tue, Dec 1, 2009 at 8:57 PM, Tom Elken >> > wrote: >>> >>> > On Behalf Of Gerald Creager >>> > I've been quite happy with CentOS 5.3 and we're experimenting with >>> > CentOS 5.4 now. I see good stability in 5.[34] >>> >>> I have to second the recommendation of 5.3 or 5.4. > > I like 5.2 and 5.4 but my normal experience is on relatively stock IBM > hardware. As ever, YMMV. > >>> > > Dear all, >>> > > >>> > > I've been working on hpux-itanium for the last 2 years (and even >>> > > unsubscribed to beowulf-ml during most of that time, my bad) >>> but soon >>> > > will turn back to a beowulf cluster (HP DL380G6's with Xeon X5570, >>> > > amcc/3ware 9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have >>> a few >>> > > questions on the config. >>> > > >>> > > 1) our company is standardised on RHEL 5.1. Would sticking with >>> rhel >>> > 5.1 >>> > > instead of going to the latest make a difference. > > Since you have up to date hardware - also check on the necessary version > of 3Ware drivers and where they are supported. The command line > utilities are particularly useful. > >>> > > 2) What are the advantages of the hpc version of rhel. I >>> browsed the >>> > doc >>> > > but unless having to compile mpi myself I do not see a >>> difference or >>> > did >>> > > I miss soth. > > See above. 
> >>> > > 3) which filesystem is advisable knowing that we're calculating on >>> > large >>> > > berkeley db databases >>> > > > > You get ext3 or Red Hat's cluster filesystem ?? GFS ??, I think. No xfs > / Reiser by default. Check also with HP as to what file systems they > would recommend. > > >>> > > thanks in advance, >>> > > >>> > > toon >>> > > >>> > > Toon Knapen toon.knapen at gmail.com >>> > > > > Always happy to pontificate :) I believe xfs is now available in 5.4. I'd have to check. We've found xfs to be our preference (but we're revisiting gluster and lustre). I've not played with gfs so far. gerry From lindahl at pbm.com Tue Dec 1 14:25:07 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 1 Dec 2009 14:25:07 -0800 Subject: [Beowulf] Forwarded from a long time reader having trouble posting In-Reply-To: <4B155A43.3010304@scalableinformatics.com> References: <4B155A43.3010304@scalableinformatics.com> Message-ID: <20091201222507.GB17474@bx9.net> > 3) which filesystem is advisable knowing that we're calculating on large > berkeley db databases I've had friends tell me that I should never use long-lived berkeley db databases without a good backup-and-recovery or recreate-from-scratch plan. Berkeley db comes with a test suite for integrity, and last time I used it under Linux, it didn't pass. -- g From bill at cse.ucdavis.edu Tue Dec 1 14:50:53 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Tue, 01 Dec 2009 14:50:53 -0800 Subject: [Beowulf] Forwarded from a long time reader having trouble posting In-Reply-To: <4B155A43.3010304@scalableinformatics.com> References: <4B155A43.3010304@scalableinformatics.com> Message-ID: <4B159DCD.6020102@cse.ucdavis.edu> Joe Landman wrote: > My apologies if this is bad form, I know Toon from his past > participation on this list, and he asked me to forward. > > -------- Original Message -------- Hi Toon, long time no type. > Dear all, > I've been working on hpux-itanium for the last 2 years (and even > unsubscribed to beowulf-ml during most of that time, my bad) but soon > will turn back to a beowulf cluster (HP DL380G6's with Xeon X5570, > amcc/3ware 9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have a few > questions on the config. > 1) our company is standardised on RHEL 5.1. Would sticking with rhel 5.1 > instead of going to the latest make a difference. That's kind of strange. So you never patch? A patched RHEL 5.1 box auto upgrades to 5.4, doesn't it? Or is that something specific to CentOS? 5.1 is rather old I'd worry about poor support for hyperthreading, and things like GigE drivers when using a release from 2007, especially with hardware released in 2009. I believe the older kernels handle the extra cores rather poorly, and don't even recognize the intel CPUs as NUMA enabled. You didn't mention hardware or software RAID. I'd recommend RAID scrubbing, and if software that requires (I think) >= 2.6.21, although (I think) Redhat back ported it into their newest kernels in 5.4, or maybe 5.3. Definitely not in 5.1 though. > 2) What are the advantages of the hpc version of rhel. I browsed the doc > but unless having to compile mpi myself I do not see a difference or did > I miss soth. I've never seen the HPC version of RHEL, but I have build a cluster distribution based on RHEL a few times. It's pretty common to need to tweak the various cluster related pieces, like say tight integration which often requires tweaks to the MPI layer and the batch queue. 
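As a concrete example of the kind of tight integration meant above (the paths are purely illustrative, not something the RHEL HPC product ships): Open MPI can be built against the batch system's TM library so that mpirun launches ranks under Torque/PBS control rather than over ssh.

  # assumes Torque was installed under /usr/local/torque -- adjust to taste
  ./configure --prefix=/opt/openmpi --with-tm=/usr/local/torque
  make -j4 && make install

  # inside a Torque job script no hostfile is needed; the TM interface
  # hands mpirun the allocated node list
  mpirun -np 16 ./my_app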
I suspect the biggest advantage for the HPC version of RHEL is a cheaper per seat license. If you end up layering things on top of RHEL yourself I recommend cobbler, puppet, ganglia, and openmpi. > 3) which filesystem is advisable knowing that we're calculating on large > berkeley db databases I've not seen a particularly big difference on random workloads typical of databases. Are the databases bigger than ram? Does your 3ware have a battery? Allowing the raid controller to acknowledge writes before they hit the disk might be a big win (if your DB has lots of writes)? Can you afford a SSD to hold the berkeley DB? From csamuel at vpac.org Tue Dec 1 15:42:39 2009 From: csamuel at vpac.org (Chris Samuel) Date: Wed, 2 Dec 2009 10:42:39 +1100 (EST) Subject: [Beowulf] Forwarded from a long time reader having trouble posting In-Reply-To: <566807952.7177751259710895916.JavaMail.root@mail.vpac.org> Message-ID: <45900213.7177791259710959407.JavaMail.root@mail.vpac.org> ----- "Gerald Creager" wrote: > I believe xfs is now available in 5.4. I'd have to check. My meagre understanding based totally on rumours is that it's still a preview release and that you need a special support contract with Red Hat to get access. I'd love to know that I'm wrong there though! cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From artpoon at gmail.com Tue Dec 1 12:45:52 2009 From: artpoon at gmail.com (Art Poon) Date: Tue, 1 Dec 2009 12:45:52 -0800 Subject: [Beowulf] Re: cluster fails to boot with managed switch, but 5-port switch works OK Message-ID: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com> Dear colleagues, I am in charge of managing a cluster at our research centre and am stuck with a vexing (to me) problem! (Disclaimer: I am a biologist by training and a mostly self-taught programmer. I am still learning about networking and cluster management, so please bear with me!) This is an asymmetric Intel Xeon cluster running 4 compute nodes on CentOS 5.4 and Scyld Clusterware 5. We managed to get it up and running using a dinky little NetGear 5-port 10/100/1000 switch. Now that I'm looking to expand the cluster, I need to get the managed switch working (an SMC 8824M, though we have several other switches available). What's got me and the IT guys stumped is that while the compute nodes boot via PXE from the head node without trouble on the NetGear, they barf with the SMC. To be specific, after the initial boot with a minimal Linux kernel, there is a "fatal error" with "timeout waiting for getfile" when the compute node attempts to download the provisioning image from head. However, when they were running Rocks before I arrived, the cluster worked fine with the SMC switch. I've tried resetting the SMC switch to factory defaults (with auto-negotiate on). I've checked the /etc/beowulf/modprobe.conf and it doesn't seem to be demanding anything exotic. We've tried swapping out to another SMC switch but that didn't change anything. I'm grateful if you could weigh in with your expertise. Thank you, - Art. 
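One way to narrow that down before touching the switch config is to capture the boot-time traffic on the head node while a single compute node tries to come up through the SMC, then repeat the identical capture on the NetGear and compare the two. The interface name, MAC address, server IP and TFTP test file below are only placeholders -- substitute whatever your Scyld head node actually uses:

  # on the head node: record everything to/from the booting node's NIC
  tcpdump -n -i eth0 -w smc.pcap ether host 00:25:b3:aa:bb:cc

  # from any Linux box plugged into the SMC switch: sanity-check that a
  # plain DHCP lease and a TFTP fetch work at all through that switch
  dhclient eth0
  tftp 10.54.50.1 -c get pxelinux.0

If the DHCP offer arrives but the image transfer never starts (or only times out behind the managed switch), the usual suspects are spanning-tree convergence delay and broadcast/DHCP filtering features on the switch ports.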
From toon.knapen at gmail.com Tue Dec 1 12:53:19 2009 From: toon.knapen at gmail.com (Toon Knapen) Date: Tue, 1 Dec 2009 21:53:19 +0100 Subject: [Beowulf] Forwarded from a long time reader having trouble posting In-Reply-To: <35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org> References: <4B155A43.3010304@scalableinformatics.com> <4B157249.80701@tamu.edu> <35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org> Message-ID: Any idea why it gives better performance? Was it on memory bw intensive apps? Could it be due to changes in the kernel that take into account the Numa architecture (affinity) or ... On Tue, Dec 1, 2009 at 8:57 PM, Tom Elken wrote: > > On Behalf Of Gerald Creager > > I've been quite happy with CentOS 5.3 and we're experimenting with > > CentOS 5.4 now. I see good stability in 5.[34] > > I have to second the recommendation of 5.3 or 5.4. > > Some time ago, we saw significant performance improvements on Nehalem (Xeon > X5570) in moving from RHEL 5.2 to 5.3. So I expect that moving from 5.1 to > 5.[34] would also be a significant improvement in performance. > > Cheers, > -Tom > > > and the incorporation > > of > > a couple of tools worth having in a distribution for 'Wulf use. I'd > > not > > recommend sticking with the old version, but of course, once you're > > established, not carelessly upgrading, either. > > > > gerry > > > > Joe Landman wrote: > > > My apologies if this is bad form, I know Toon from his past > > > participation on this list, and he asked me to forward. > > > > > > -------- Original Message -------- > > > > > > Dear all, > > > > > > I've been working on hpux-itanium for the last 2 years (and even > > > unsubscribed to beowulf-ml during most of that time, my bad) but soon > > > will turn back to a beowulf cluster (HP DL380G6's with Xeon X5570, > > > amcc/3ware 9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have a few > > > questions on the config. > > > > > > 1) our company is standardised on RHEL 5.1. Would sticking with rhel > > 5.1 > > > instead of going to the latest make a difference. > > > 2) What are the advantages of the hpc version of rhel. I browsed the > > doc > > > but unless having to compile mpi myself I do not see a difference or > > did > > > I miss soth. > > > 3) which filesystem is advisable knowing that we're calculating on > > large > > > berkeley db databases > > > > > > thanks in advance, > > > > > > toon > > > > > > Toon Knapen toon.knapen at gmail.com > > > > > > ----------------------------------- > > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > > Computing > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From toon.knapen at gmail.com Wed Dec 2 03:47:19 2009 From: toon.knapen at gmail.com (Toon Knapen) Date: Wed, 2 Dec 2009 12:47:19 +0100 Subject: [Beowulf] Forwarded from a long time reader having trouble posting In-Reply-To: <20091201214217.GA19804@galactic.demon.co.uk> References: <4B155A43.3010304@scalableinformatics.com> <4B157249.80701@tamu.edu> <35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org> <4B1584E0.8060704@tamu.edu> <20091201214217.GA19804@galactic.demon.co.uk> Message-ID: > > > Maybe also some licensing breaks on large volume licensing. Red Hat is > primarily a sales and service organisation that also produces a Linux > by-product :) The HPC variant is targetted at areas which deal in large > clusters at cheaper than Red Hat Enterprise Linux for servers at > equivalent volume, IIRC. > AFAICT the HPC version is more expensive but you get extra tools for that such as benchmarks, pre-compiled mpi, batch-scheduler from Platform. But MPI is easy to intall and I would prefer other batch-schedulers instead of the one of Platform so ... I wonder if the kernel/distribution itself is more optimised ? > Since you have up to date hardware - also check on the necessary version > of 3Ware drivers and where they are supported. The command line > utilities are particularly useful. > thanks for the tip. > > > You get ext3 or Red Hat's cluster filesystem ?? GFS ??, I think. No xfs > / Reiser by default. Check also with HP as to what file systems they > would recommend. > No, we'll be using local file systems primarily and also a connection to a SAN but no global filesystem. -------------- next part -------------- An HTML attachment was scrubbed... URL: From toon.knapen at gmail.com Wed Dec 2 03:49:27 2009 From: toon.knapen at gmail.com (Toon Knapen) Date: Wed, 2 Dec 2009 12:49:27 +0100 Subject: [Beowulf] Forwarded from a long time reader having trouble posting In-Reply-To: <4B159330.5070802@tamu.edu> References: <4B155A43.3010304@scalableinformatics.com> <4B157249.80701@tamu.edu> <35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org> <4B1584E0.8060704@tamu.edu> <20091201214217.GA19804@galactic.demon.co.uk> <4B159330.5070802@tamu.edu> Message-ID: > > > I believe xfs is now available in 5.4. I'd have to check. We've found xfs > to be our preference (but we're revisiting gluster and lustre). I've not > played with gfs so far. > And why do you prefer xfs if I may ask. Performance? Do you many small files or large files? -------------- next part -------------- An HTML attachment was scrubbed... URL: From toon.knapen at gmail.com Wed Dec 2 03:53:28 2009 From: toon.knapen at gmail.com (Toon Knapen) Date: Wed, 2 Dec 2009 12:53:28 +0100 Subject: [Beowulf] Forwarded from a long time reader having trouble posting In-Reply-To: <20091201222507.GB17474@bx9.net> References: <4B155A43.3010304@scalableinformatics.com> <20091201222507.GB17474@bx9.net> Message-ID: > > I've had friends tell me that I should never use long-lived berkeley > db databases without a good backup-and-recovery or recreate-from-scratch > plan. > > Berkeley db comes with a test suite for integrity, and last time I > used it under Linux, it didn't pass. > You mean that subsequent (minor) versions of bdb are not necessarily totally compatible ? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From toon.knapen at gmail.com Wed Dec 2 03:59:55 2009 From: toon.knapen at gmail.com (Toon Knapen) Date: Wed, 2 Dec 2009 12:59:55 +0100 Subject: [Beowulf] Forwarded from a long time reader having trouble posting In-Reply-To: <4B159DCD.6020102@cse.ucdavis.edu> References: <4B155A43.3010304@scalableinformatics.com> <4B159DCD.6020102@cse.ucdavis.edu> Message-ID: > > I believe the older kernels handle the extra cores rather poorly, and don't > even recognize the intel CPUs as NUMA enabled. You didn't mention hardware > or > software RAID. I'd recommend RAID scrubbing, and if software that requires > (I > think) >= 2.6.21, although (I think) Redhat back ported it into their > newest > kernels in 5.4, or maybe 5.3. Definitely not in 5.1 though. > For performance reasons we added the 3ware card to handle the raid. > I've not seen a particularly big difference on random workloads typical of > databases. Are the databases bigger than ram? Does your 3ware have a > battery? Allowing the raid controller to acknowledge writes before they > hit > the disk might be a big win (if your DB has lots of writes)? Can you > afford a > SSD to hold the berkeley DB? > The card has 512 MB of memory. I suppose it will cache the writes there. But the bdb's can be up to 70 GB large so we'll never be able to pull them in memory. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jlb17 at duke.edu Wed Dec 2 10:14:32 2009 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Wed, 2 Dec 2009 13:14:32 -0500 (EST) Subject: [Beowulf] Forwarded from a long time reader having trouble posting In-Reply-To: <45900213.7177791259710959407.JavaMail.root@mail.vpac.org> References: <45900213.7177791259710959407.JavaMail.root@mail.vpac.org> Message-ID: On Wed, 2 Dec 2009 at 10:42am, Chris Samuel wrote >> I believe xfs is now available in 5.4. I'd have to check. > > My meagre understanding based totally on rumours is > that it's still a preview release and that you need > a special support contract with Red Hat to get access. > > I'd love to know that I'm wrong there though! In CentOS, at least, the xfs module comes with the regular kernel (so I'm guessing it's the same with stock RHEL). What Red Hat is *not* shipping by default are any of the filesystem utilities, so you can't, e.g., actually mkfs an XFS filesystem. But you can get the xfsprogs RPM from the CentOS extras repo and that should work just fine. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From eugen at leitl.org Wed Dec 2 10:22:43 2009 From: eugen at leitl.org (Eugen Leitl) Date: Wed, 2 Dec 2009 19:22:43 +0100 Subject: [Beowulf] Intel shows 48-core 'datacentre on a chip' Message-ID: <20091202182243.GG17686@leitl.org> http://news.zdnet.co.uk/hardware/0,1000000091,39918721,00.htm Intel shows 48-core 'datacentre on a chip' * Tags: * Manycore, * Cloud, * Operating System, * Processor Rupert Goodwins ZDNet UK Published: 02 Dec 2009 18:00 GMT Intel has announced the Single-chip Cloud Computer (SCC), an experimental 48-core processor designed to encourage research and development in massively parallel computation. Measuring 567 square millimetres ? about the size of a postage stamp ? the SCC combines 24 dual-core processing elements, each with its own router, four DDR3 memory controllers capable of handling up to 8GB apiece, and a very fast on-chip network. 
Although no performance, speed or total bandwidth figures were revealed, the chip has 1.3 billion transistors, consumes up to 125 watts in operation, and will become available to researchers around the world on a standard-sized motherboard in the first half of 2010. Intel said it expects to sign up dozens of partners within six months, with more to come over time. "This is the prototype of the microprocessor of the future", Joseph Sch?tz, director of microprocessor and programming research at Intel Labs, said on Tuesday. The announcement took place at the company's R&D centre in Braunschweig, Germany, at the first of three SCC launch events around the world. "Before, if you needed to design software for this level of computing, you needed your own datacentre. Now, you just need your own chip," said Sch?tz. The SCC, previously code-named Rock Creek, is fabricated in 45nm technology. The on-chip network is configured as a 6x4 node, two-dimensional mesh. It has a bandwidth of 256GBps, and each core can run its own independent software as a fully functional IA-32 processor. Memory, 384KB of it, is shared between all cores, primarily to speed message passing, while power management can independently control eight variable-voltage and 28 variable-frequency areas of the chip. This controls power consumption, setting it between 25 and 125 watts. Intel SCC Intel's Single-chip Cloud Computer: 24 dual-core tiles on a 567mm2 die "We called it the Single-chip Cloud Computer, but it was very difficult to know how to name it", said Sch?tz. "You can easily envision many more cores, just as you could add more servers to a real datacentre. We could build relatively small systems with hundreds or thousands of cores." Unlike other manycore systems, the SCC does not maintain data integrity through cache coherency, where special circuitry keeps multiple caches in sync across cores. "Cache coherency consumes a lot of power but isn't that useful," said Sch?tz. "You're tempted to do it because you're on one die, but as soon as you go off-die, it doesn't work. This is a slice of the future in silicon, and we're giving to people to play with." Microsoft, a partner in the SCC project, has built support into its developer toolchain, according to Sch?tz. Professor Timothy Roscoe of ETH Zurich, who is developing the experimental Barrelfish operating system also in conjunction with Microsoft, told ZDNet UK that the SCC was a particularly good fit for his work. "Our multikernel architecture has many independent processors with different attributes, but sharing the same state", he said, "and the SCC looks a dream platform for testing many of our ideas". Intel said it will publish full details of the SCC at the International Solid-State Circuits Conference in San Francisco on 8 February, 2010. From john.hearns at mclaren.com Wed Dec 2 10:26:28 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 2 Dec 2009 18:26:28 -0000 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> Message-ID: <68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com> I'm a new member to this list, but the research group that I work for has had a working cluster for many years. I am now looking at upgrading our current configuration. Go for a new cluster any time. Upgrading is fraught with pitfalls ? 
keep that old cluster running till it dies, but look at a fresh new machine. Nehalem is around now, and you can get a lot of power in a single node, not to mention GPUs. Mixing modern multi-core hardware with an older OS release which worked with those old disk drivers and Ethernet drivers will be a nightmare. I was wondering if anyone has actual experience with running more than one node from a single power supply. Even just two boards on one PSU would be nice. We will be using barely 200W per node for 50 nodes and it just seems like a big waste to buy 50 power supply units. I have read the old posts but did not see any reports of success. Look at the Supermicro twin systems, they have two motherboards in 1U or four motherboards in 2U. I believe HP have similar. Or of course any of the blade chassis ? Supermicro, HP, Sun and dare I say it SGI. On a smaller scale you could look at the ?personal supercomputers? from Cray and SGI. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. -------------- next part -------------- An HTML attachment was scrubbed... URL: From landman at scalableinformatics.com Wed Dec 2 10:30:02 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 02 Dec 2009 13:30:02 -0500 Subject: [Beowulf] Re: cluster fails to boot with managed switch, but 5-port switch works OK In-Reply-To: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com> References: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com> Message-ID: <4B16B22A.8070603@scalableinformatics.com> Art Poon wrote: > Dear colleagues, [...] > What's got me and the IT guys stumped is that while the compute nodes > boot via PXE from the head node without trouble on the NetGear, they > barf with the SMC. To be specific, after the initial boot with a > minimal Linux kernel, there is a "fatal error" with "timeout waiting > for getfile" when the compute node attempts to download the > provisioning image from head. However, when they were running Rocks > before I arrived, the cluster worked fine with the SMC switch. Is it the switch of the dhcp/bootp/tftp setup thats the problem? Are you sure the tftp daemon is up, or bootp is configured correctly? Switches sometimes have broadcast storm suppression turned on, or worse, sometimes they have spanning tree turned on. You want the switch to be as dumb as you can possibly make it for most linux clusters. Fast, but dumb. > I've tried resetting the SMC switch to factory defaults (with > auto-negotiate on). I've checked the /etc/beowulf/modprobe.conf and > it doesn't seem to be demanding anything exotic. We've tried > swapping out to another SMC switch but that didn't change anything. This sounds more on the server software stack than the switch. Could you describe this? Are you using Scyld/Rocks for that? Rocks is quite sensitive to configuration issues, and really doesn't like altered configurations (it is possible to do, though non-trivial). -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. 
email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From john.hearns at mclaren.com Wed Dec 2 10:36:47 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 2 Dec 2009 18:36:47 -0000 Subject: [Beowulf] Re: cluster fails to boot with managed switch,but 5-port switch works OK In-Reply-To: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com> References: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com> Message-ID: <68A57CCFD4005646957BD2D18E60667B0E75A725@milexchmb1.mil.tagmclarengroup.com> I've tried resetting the SMC switch to factory defaults (with auto-negotiate on). I've checked the /etc/beowulf/modprobe.conf and it doesn't seem to be demanding anything exotic. We've tried swapping out to another SMC switch but that didn't change anything. No idea really, as I don't use SMC switches. First thing I would do would be to get a laptop with Linux on, and attach it to the SMC switch. Configure the Ethernet interface to use DHCP, then ifconfig it down then ifcoinfig it up. Run tcpdump eth0 as you do this, and tail -f /var/log/messages If it gets a DHCP address (and of course on the cluster head node there is a pool of free DHCP addresses configured) the test tftp file transfer. Set the tftp daemon on the head node to run in debug mode, and start it up. Try a tftp get of a test file on the laptop. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From james.p.lux at jpl.nasa.gov Wed Dec 2 10:52:25 2009 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Wed, 2 Dec 2009 10:52:25 -0800 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> Message-ID: From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Ross Tucker Sent: Wednesday, November 25, 2009 1:54 PM To: beowulf at beowulf.org Subject: [Beowulf] New member, upgrading our existing Beowulf cluster Greetings! I'm a new member to this list, but the research group that I work for has had a working cluster for many years. I am now looking at upgrading our current configuration. I was wondering if anyone has actual experience with running more than one node from a single power supply. Even just two boards on one PSU would be nice. We will be using barely 200W per node for 50 nodes and it just seems like a big waste to buy 50 power supply units. I have read the old posts but did not see any reports of success. Best regards, Ross Tucker ------ Unless your time is free, it's probably not cost effective. You'd have to come up with the following things: Packaging that accommodates two boards. Cabling from the PSU to the board that splits the power to two destinations Somehow managing the power on/standby/off controls coming from the mobo to the PSU Making sure that any voltage sequencing requirements are met. Making sure that with your new cabling, you meet the voltage regulation requirements. There's also the issue of EMI/EMC compliance, if you are concerned about such things. 
Given the low cost of power supplies, particularly in large quantities, and the ability to use commodity (low price) packaging in the traditional 1 PSU per mobo configuration, you'd have to have a really good reason to consider this. Jim Lux From dag at sonsorol.org Wed Dec 2 11:12:13 2009 From: dag at sonsorol.org (Chris Dagdigian) Date: Wed, 02 Dec 2009 14:12:13 -0500 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> Message-ID: <4B16BC0D.7040003@sonsorol.org> Not sure if you are looking at DIY or commercial options but this has been done well on a commercial scale by at least some integrators. I've never used them in a cluster but they make great virtualization platforms. This is just one example, the marketing term is "1U Twin Server" http://www.siliconmechanics.com/c1159/1u-twin-servers.php Other vendors have various other options and the range of "dedicated" vs "shared" resources among the twin and quad server configurations can vary quite a bit. There is a good chance someone is already making a box with the specs you are interested in. Regards, Chris Lux, Jim (337C) wrote: > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Ross Tucker > Sent: Wednesday, November 25, 2009 1:54 PM > To: beowulf at beowulf.org > Subject: [Beowulf] New member, upgrading our existing Beowulf cluster > > Greetings! > > I'm a new member to this list, but the research group that I work for has had a working cluster for many years. I am now looking at upgrading our current configuration. I was wondering if anyone has actual experience with running more than one node from a single power supply. Even just two boards on one PSU would be nice. We will be using barely 200W per node for 50 nodes and it just seems like a big waste to buy 50 power supply units. I have read the old posts but did not see any reports of success. > > Best regards, > Ross Tucker From mathog at caltech.edu Wed Dec 2 11:40:09 2009 From: mathog at caltech.edu (David Mathog) Date: Wed, 02 Dec 2009 11:40:09 -0800 Subject: [Beowulf] Re: cluster fails to boot with managed switch, but 5-port switch works OK Message-ID: > What's got me and the IT guys stumped is that while the compute nodes boot via PXE from the head node without trouble on the NetGear, they barf with the SMC. To be specific, after the initial boot with a minimal Linux kernel, there is a "fatal error" with "timeout waiting for getfile" when the compute node attempts to download the provisioning image from head. However, when they were running Rocks before I arrived, the cluster worked fine with the SMC switch. Use tcpdump or some equivalent. Run it once with the dumb switch, once with the managed one, and then compare and contrast. > I've tried resetting the SMC switch to factory defaults (with auto-negotiate on). I've checked the /etc/beowulf/modprobe.conf and it doesn't seem to be demanding anything exotic. We've tried swapping out to another SMC switch but that didn't change anything. Detach from the world at large then turn off the firewall on the master. (Probably not it this time, but whenever there are network problems always rule out the firewall before spending time on anything else.) Ipv6 vs. Ipv4? By which I mean, once the kernel boots, perhaps it goes to ipv6, which the netgear handles properly, but maybe that is turned off on the SMC? 
Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From mwill at penguincomputing.com Wed Dec 2 11:48:17 2009 From: mwill at penguincomputing.com (Michael Will) Date: Wed, 02 Dec 2009 11:48:17 -0800 Subject: [Beowulf] Re: cluster fails to boot with managed switch, but5-port switch works OK Message-ID: <00cc01ca7388$68bf5259$3504650a@penguincomputing.com> I don't know anything about smc switches, but for cisco switches I had to enable 'spanning-tree portfast default' before to allow a pxe booting node to stay up. Maybe the smc switch has something similar that prevents the port from being fully useable until some spanning tree algorithm terminates? Cheers, Michael ----- Original Message ----- From:"Hearns, John" To:"beowulf at beowulf.org" Sent:12/2/2009 10:38 AM Subject:RE: [Beowulf] Re: cluster fails to boot with managed switch,but5-port switch works OK I've tried resetting the SMC switch to factory defaults (with auto-negotiate on). I've checked the /etc/beowulf/modprobe.conf and it doesn't seem to be demanding anything exotic. We've tried swapping out to another SMC switch but that didn't change anything. No idea really, as I don't use SMC switches. First thing I would do would be to get a laptop with Linux on, and attach it to the SMC switch. Configure the Ethernet interface to use DHCP, then ifconfig it down then ifcoinfig it up. Run tcpdump eth0 as you do this, and tail -f /var/log/messages If it gets a DHCP address (and of course on the cluster head node there is a pool of free DHCP addresses configured) the test tftp file transfer. Set the tftp daemon on the head node to run in debug mode, and start it up. Try a tftp get of a test file on the laptop. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From landman at scalableinformatics.com Wed Dec 2 11:58:27 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 02 Dec 2009 14:58:27 -0500 Subject: [Beowulf] Re: cluster fails to boot with managed switch, but 5-port switch works OK In-Reply-To: References: Message-ID: <4B16C6E3.3060808@scalableinformatics.com> David Mathog wrote: >> What's got me and the IT guys stumped is that while the compute nodes > boot via PXE from the head node without trouble on the NetGear, they > barf with the SMC. To be specific, after the initial boot with a > minimal Linux kernel, there is a "fatal error" with "timeout waiting for > getfile" when the compute node attempts to download the provisioning > image from head. However, when they were running Rocks before I > arrived, the cluster worked fine with the SMC switch. Wondering aloud whether or not the ethernet driver has been correctly included in the kernel/initrd for the PXE booted image. I've seen/experienced this before, PXE works fine, the kernel boots, and is missing the ethernet driver. Usually happens with newer hardware and older kernels. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. 
email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From james.p.lux at jpl.nasa.gov Wed Dec 2 12:02:03 2009 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Wed, 2 Dec 2009 12:02:03 -0800 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <4B16BC0D.7040003@sonsorol.org> References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> <4B16BC0D.7040003@sonsorol.org> Message-ID: > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Chris Dagdigian > Sent: Wednesday, December 02, 2009 11:12 AM > Cc: beowulf at beowulf.org; Ross Tucker > Subject: Re: [Beowulf] New member, upgrading our existing Beowulf cluster > > > Not sure if you are looking at DIY or commercial options but this has > been done well on a commercial scale by at least some integrators. > > I've never used them in a cluster but they make great virtualization > platforms. > > This is just one example, the marketing term is "1U Twin Server" > > http://www.siliconmechanics.com/c1159/1u-twin-servers.php > > Other vendors have various other options and the range of "dedicated" vs > "shared" resources among the twin and quad server configurations can > vary quite a bit. There is a good chance someone is already making a box > with the specs you are interested in. > I note that those chassis seem to have a PSU specifically designed for driving two mobos, and are rated fairly high (980W for the first one on the page referenced). I took Ross's original request to be one of using existing power supplies and adding a mobo (essentially the DIY option). From cap at nsc.liu.se Wed Dec 2 12:28:07 2009 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Wed, 2 Dec 2009 21:28:07 +0100 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com> References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> <68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com> Message-ID: <200912022128.07314.cap@nsc.liu.se> On Wednesday 02 December 2009, Hearns, John wrote: > I'm a new member to this list, but the research group that I work for has > had a working cluster for many years. I am now looking at upgrading our > current configuration. ... > Mixing modern multi-core hardware with an older OS release which worked > with those old disk drivers and Ethernet drivers will be a nightmare. But why run an older OS release? Something like CentOS-5.latest will run fine on your new hardware and it's no problem getting all sorts of old HPC code to run on it (disclaimer: of course you can find a zillion apps that break on any given OS...). > I was wondering if anyone has actual experience with running more than one > node from a single power supply. ... > Look at the Supermicro twin systems, they have two motherboards in 1U or > four motherboards in 2U. > > I believe HP have similar. They have 4-nodes in 2U (it has the added benefint of using large 8cm fans instead of those inefficient 1U fans...). Supermicro also has a 4-nodes in 2U. > Or of course any of the blade chassis ? Supermicro, HP, Sun and dare I say > it SGI. We've typically found that blade chassi type hardware is far from cost effective for HPC, but YMMV. 
> On a smaller scale you could look at the ?personal supercomputers? from > Cray and SGI. Even less cost effective (I think). > The contents of this email are confidential and for the exclusive use of > the intended recipient... Good job sending it to a public e-mail list then. > If you receive this email in error you should not > copy it, retransmit it, use it or disclose its contents but should return > it to the sender immediately and delete your copy. /Peter -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. URL: From bill at cse.ucdavis.edu Wed Dec 2 12:36:17 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 02 Dec 2009 12:36:17 -0800 Subject: [Beowulf] Re: cluster fails to boot with managed switch, but 5-port switch works OK In-Reply-To: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com> References: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com> Message-ID: <4B16CFC1.8060603@cse.ucdavis.edu> Art Poon wrote: > I've tried resetting the SMC switch to factory defaults (with > auto-negotiate on). I've checked the /etc/beowulf/modprobe.conf and it > doesn't seem to be demanding anything exotic. We've tried swapping out to > another SMC switch but that didn't change anything. I had a very unpleasant experience with an SMC switch awhile back. I was having problems trying to bootstrap a rocks cluster. Turns out the SMC (and Dell relabel) was so evil that it warranted a mention in the Rocks FAQ. I believe the solution was to manually turn on edge node routing or similar on each port. Unfortunately there was a bug and you could only turn on the first 16 ports. There was a fix with new firmware, but there were 2 firmware images and you couldn't tell which from looking at the switch. Said firmware upgrade caused other problems. Eventually it worked well enough. I've used quite a variety of switches without problem, I was shocked that a default switch config wouldn't work with DHCP and PXEboot. > > I'm grateful if you could weigh in with your expertise. > > Thank you, > - Art. > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rpnabar at gmail.com Wed Dec 2 13:42:04 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 2 Dec 2009 15:42:04 -0600 Subject: [Beowulf] Re: cluster fails to boot with managed switch, but5-port switch works OK In-Reply-To: <00cc01ca7388$68bf5259$3504650a@penguincomputing.com> References: <00cc01ca7388$68bf5259$3504650a@penguincomputing.com> Message-ID: On Wed, Dec 2, 2009 at 1:48 PM, Michael Will wrote: > I don't know anything about smc switches, but for cisco switches I had to enable 'spanning-tree portfast default' before to allow a >pxe booting node to stay up. Maybe the smc switch has something similar that prevents the port from being fully useable until some >spanning tree algorithm terminates? +1 for the spanning tree suggestion. I needed to do the same on my Dell Catalyst. Check if "DHCP squash" on the port connected to the master node is enabled. This can prevent DHCP. Just a thought. 
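For reference, on a Cisco IOS style switch those two knobs look roughly like this (the port range is just an example, and the SMC 8824M CLI uses its own keywords, so take this as the idea rather than the exact syntax):

  conf t
   interface range GigabitEthernet0/1 - 24
    spanning-tree portfast
   exit
   no ip dhcp snooping
  end

Same goal either way: node-facing ports should start forwarding immediately on link-up, and nothing on the switch should be intercepting or dropping DHCP.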
-- Rahul From hearnsj at googlemail.com Wed Dec 2 14:13:34 2009 From: hearnsj at googlemail.com (John Hearns) Date: Wed, 2 Dec 2009 22:13:34 +0000 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <200912022128.07314.cap@nsc.liu.se> References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> <68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com> <200912022128.07314.cap@nsc.liu.se> Message-ID: <9f8092cc0912021413h3b61b827peed73cf99cf39ec9@mail.gmail.com> 2009/12/2 Peter Kjellstrom : > > Good job sending it to a public e-mail list then. > You know fine well that such disclaimers are inserted by corporate email servers. Keep your sarcasm to yourself. From mwill at penguincomputing.com Wed Dec 2 14:14:54 2009 From: mwill at penguincomputing.com (Michael Will) Date: Wed, 02 Dec 2009 14:14:54 -0800 Subject: [Beowulf] Re: cluster fails to boot with managed switch, but5-port switch works OK Message-ID: <00da01ca739c$e443c877$3504650a@penguincomputing.com> I got another one for you from penguins support team: enable port forward Sent from Moxier Mail (http://www.moxier.com) ----- Original Message ----- From:"Rahul Nabar" To:"Michael Will" Cc:"Hearns, John" , "beowulf at beowulf.org" Sent:12/2/2009 1:42 PM Subject:Re: [Beowulf] Re: cluster fails to boot with managed switch, but5-port switch works OK On Wed, Dec 2, 2009 at 1:48 PM, Michael Will wrote: > I don't know anything about smc switches, but for cisco switches I had to enable 'spanning-tree portfast default' before to allow a >pxe booting node to stay up. Maybe the smc switch has something similar that prevents the port from being fully useable until some >spanning tree algorithm terminates? +1 for the spanning tree suggestion. I needed to do the same on my Dell Catalyst. Check if "DHCP squash" on the port connected to the master node is enabled. This can prevent DHCP. Just a thought. -- Rahul From csamuel at vpac.org Wed Dec 2 14:15:07 2009 From: csamuel at vpac.org (Chris Samuel) Date: Thu, 3 Dec 2009 09:15:07 +1100 (EST) Subject: [Beowulf] Forwarded from a long time reader having trouble posting In-Reply-To: Message-ID: <1122079789.7251391259792107624.JavaMail.root@mail.vpac.org> ----- "Joshua Baker-LePain" wrote: > What Red Hat is *not* shipping by default are any > of the filesystem utilities, so you can't, e.g., > actually mkfs an XFS filesystem. But you can get > the xfsprogs RPM from the CentOS extras repo and > that should work just fine. Ah yes, that was what it was explained to me as! Mea culpa, I blame the Beowulf bash.. ;-) -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From lindahl at pbm.com Wed Dec 2 14:29:35 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 2 Dec 2009 14:29:35 -0800 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <9f8092cc0912021413h3b61b827peed73cf99cf39ec9@mail.gmail.com> References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> <68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com> <200912022128.07314.cap@nsc.liu.se> <9f8092cc0912021413h3b61b827peed73cf99cf39ec9@mail.gmail.com> Message-ID: <20091202222935.GF12204@bx9.net> On Wed, Dec 02, 2009 at 10:13:34PM +0000, John Hearns wrote: > You know fine well that such disclaimers are inserted by corporate > email servers. 
Actually, I had no idea, probably a lot of other people don't either. Can't you work for a company that doesn't have disclaimers? ;-) -- greg From csamuel at vpac.org Wed Dec 2 14:36:14 2009 From: csamuel at vpac.org (Chris Samuel) Date: Thu, 3 Dec 2009 09:36:14 +1100 (EST) Subject: [Beowulf] Forwarded from a long time reader having trouble posting In-Reply-To: Message-ID: <1384774161.7252631259793374343.JavaMail.root@mail.vpac.org> ----- "Toon Knapen" wrote: > And why do you prefer xfs if I may ask. Performance? For us, yes, plus the fact that ext3 is (maybe was, but not from what I've heard) single threaded through the journal daemon so if you get a lot of writers (say NFS daemons for instance) you can get horribly backlogged and end up with horrendous load averages. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From james.p.lux at jpl.nasa.gov Wed Dec 2 15:19:08 2009 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Wed, 2 Dec 2009 15:19:08 -0800 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <20091202222935.GF12204@bx9.net> References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> <68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com> <200912022128.07314.cap@nsc.liu.se> <9f8092cc0912021413h3b61b827peed73cf99cf39ec9@mail.gmail.com> <20091202222935.GF12204@bx9.net> Message-ID: > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Greg Lindahl > Sent: Wednesday, December 02, 2009 2:30 PM > To: beowulf at beowulf.org > Subject: Re: [Beowulf] New member, upgrading our existing Beowulf cluster > > On Wed, Dec 02, 2009 at 10:13:34PM +0000, John Hearns wrote: > > > You know fine well that such disclaimers are inserted by corporate > > email servers. > > Actually, I had no idea, probably a lot of other people don't either. > > Can't you work for a company that doesn't have disclaimers? ;-) > > -- greg Surely you jest... Of course.. but the reality is that a lot of folks work for places that have multiple stakeholders of one sort or another who want *different* disclaimers, notwithstanding that the disclaimer is of dubious legal value. Having had entirely too many discussions and training on this, here's some not so obvious observations.. (obligatory IANAL) "This system is private and subject to monitoring. Do not use if not authorized"... that one comes out as a startup or login banner a lot, and you think... doh, you're already connected, is this going to scare you off? Nope.. that's not why it's there.. it's to provide a legal basis for prosecution and collection of the data. If you don't warn that you monitor, then log files, etc, might not be admissible as evidence. If you don't say that "hey this isn't open to the public", then a defense of "I didn't know" can be raised. The "Information is confidential. If not addressed to you destroy it and notify" one is in the same sort of classification..but this one is for trade secrets. If you didn't identify it, then you can't claim it's a trade secret. And, if you haven't put the "if its not for you, don't look", then an inadvertent disclosure could be (possibly) legally copied and passed on. 
It's also, to a lesser degree, protection for the recipient...There's been more than one case of someone in Silicon Valley getting proprietary info by mistake (oops, there's two John Smiths.. or We put Joe's info in John's envelope and vice versa).. Company A that "lost" the info then sues company B employing the unwitting recipient to enjoin that recipient from working on anything that might be competitive. If the employee happens to be the key toiler on Company B's product that's going to whip Company A in the market, you can see there is a problem. In fact, the mere possibility (not even threat) of this kind of thing can be more effective than a non-compete agreement( which would be legally unenforceable in California in most cases) But that's just civil stuff...Now we get to the exciting one... "The information in this email may be subject to export controls" Yep.. that's the one that warns you that now that it's in your hot little hands, you assume the responsibility and potential prison term if you transmit it to someone you shouldn't. Doesn't matter that the originator might have been stupid and shouldn't have sent it, now it's your little baby to worry about. Mind you, I find this kind of amusing when below things like birthday party invitations or, one of the first times I saw it, printed on the bottom of the packing slip for a tube of 74LS138s back in 1979. Yep.. the evil doers gain a strategic advantage by knowing that I've got those 3-8 decoders in my garage, corroding away in their tube, just in case I need them 30 years later. (I'll bet there's more than one person on this list with parts in their house or desk drawer, not soldered into something, with date codes beginning with a 6 or 7, or even a 3 digit code) Jim From hahn at mcmaster.ca Wed Dec 2 16:02:13 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 2 Dec 2009 19:02:13 -0500 (EST) Subject: [Beowulf] Intel shows 48-core 'datacentre on a chip' In-Reply-To: <20091202182243.GG17686@leitl.org> References: <20091202182243.GG17686@leitl.org> Message-ID: > SCC combines 24 dual-core processing elements, each with its own router, four > DDR3 memory controllers capable of handling up to 8GB apiece, and a very fast sounds a fair amount like larrabee to me. > you needed your own datacentre. Now, you just need your own chip," said that has to be one of the most asinine things I've heard recently. > The on-chip network is configured as a 6x4 node, two-dimensional mesh. It has hmm, larrabee rumors indicate a dual-ring bus, not 2d mesh ala tilera. non-coherent cores sharing a small number of dram interfaces also sounds like an interesting trick. it implies that at some level, there's a table controlling which cores see which chunks of memory... From csamuel at vpac.org Wed Dec 2 18:35:01 2009 From: csamuel at vpac.org (Chris Samuel) Date: Thu, 3 Dec 2009 13:35:01 +1100 (EST) Subject: [Beowulf] Re: cluster fails to boot with managed switch, but 5-port switch works OK In-Reply-To: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com> Message-ID: <1558552513.7267981259807701826.JavaMail.root@mail.vpac.org> ----- "Art Poon" wrote: > To be specific, after the initial boot with a > minimal Linux kernel, there is a "fatal error" > with "timeout waiting for getfile" when the > compute node attempts to download the provisioning > image from head. I've seen similar issues with Cisco switches in IBM Cluster 1350 systems where the switch was in its default configuration. 
The fix was to configure each port pointing to a compute node as an "edge port" to suppress the switch's instinct to (IIRC) try and snoop for spanning tree information when bringing the port up, as that meant that the vital packets were being dropped. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From ashley at pittman.co.uk Thu Dec 3 01:20:58 2009 From: ashley at pittman.co.uk (Ashley Pittman) Date: Thu, 03 Dec 2009 09:20:58 +0000 Subject: [Beowulf] Re: cluster fails to boot with managed switch, but 5-port switch works OK In-Reply-To: <4B16C6E3.3060808@scalableinformatics.com> References: <4B16C6E3.3060808@scalableinformatics.com> Message-ID: <1259832058.6352.60.camel@alpha> On Wed, 2009-12-02 at 14:58 -0500, Joe Landman wrote: > David Mathog wrote: > >> What's got me and the IT guys stumped is that while the compute nodes > > boot via PXE from the head node without trouble on the NetGear, they > > barf with the SMC. To be specific, after the initial boot with a > > minimal Linux kernel, there is a "fatal error" with "timeout waiting for > > getfile" when the compute node attempts to download the provisioning > > image from head. However, when they were running Rocks before I > > arrived, the cluster worked fine with the SMC switch. > > Wondering aloud whether or not the ethernet driver has been correctly > included in the kernel/initrd for the PXE booted image. I've > seen/experienced this before, PXE works fine, the kernel boots, and is > missing the ethernet driver. Or the new distro you are trying enumerates the ethernet devices differently and it's trying to load the getfile from a different unconnected ethernet port. That's fairly common as well. It could even be worse than that, in that the enumeration could be non-deterministic to really confuse you. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From prentice at ias.edu Thu Dec 3 07:25:49 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 03 Dec 2009 10:25:49 -0500 Subject: [Beowulf] Cluster Users in Clusters Linux and Windows In-Reply-To: <4788ffe70911261015t2817fcd4i55044d692b1aed64@mail.gmail.com> References: <4788ffe70911261015t2817fcd4i55044d692b1aed64@mail.gmail.com> Message-ID: <4B17D87D.1040608@ias.edu> Leonardo Machado Moreira wrote: > Hi! > > I am trying to create a cluster with only two machines. > > The server will be a Linux machine, an Arch Linux distribution to be > more specific. The slave machine will be a Windows 7 machine. While it may be technically possible, why would you want to do this? If this is for your work, computers are so cheap that from a business point of view, it makes more sense to just buy an additional computer and put Arch Linux on it than to waste your time trying to make this work. If you're in it for the technical challenge and sense of adventure, then good for you, and good luck. > I have found it is possible, but I was looking and have found that each > machine on the cluster must have the same user for the cluster. I would recommend having both systems use the same LDAP server or Active Directory (AD) server. I have made Linux systems use AD for LDAP/Kerberos servers. It's not that hard, but you need AD to support Posix/Unix attributes like shell, home directory, and GECOS field.
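A quick way to confirm that a Linux client is actually pulling those attributes out of AD is sketched below; the account name "testuser" is only an example, not anything from this thread:

  # should print uid, gid, GECOS, home directory and shell from the directory
  getent passwd testuser
  # should show the numeric uid/gid and group memberships
  id testuser

If either command comes back empty, the POSIX attributes are not being served (or nsswitch.conf is not pointing at ldap), and no amount of client-side tweaking will fix it.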
Most new versions of AD have this built in; earlier versions require an additional package, Microsoft Services for Unix (also known as SFU, msSFU, or MSSFU), that can be downloaded from Microsoft. I wouldn't try this unless you are very well-versed in LDAP and Kerberos administration. On RH-based Linux distros, /etc/ldap.conf should already have the necessary configuration for SFU in it, you just need to uncomment it. I have never set up Windows systems as LDAP clients. > > I was wondering how I would deal with it with the Windows machine? You'd have to have a Windows-specific binary, for one. > > Do I have to implement a specific program in it? Would it find the rsh? See above. SFU might come with a Windows implementation, but you'd have problems with the fact that the programs might have different names, and would probably have different paths. That could confuse the queuing system (if you're using one), and/or your MPI implementation. > > Thanks in advance! > > Leonardo Machado Moreira From prentice at ias.edu Thu Dec 3 07:40:12 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 03 Dec 2009 10:40:12 -0500 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com> References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> <68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4B17DBDC.5060106@ias.edu> Hearns, John wrote: > > I was wondering if anyone has actual experience with running more than > one node from a single power supply. Even just two boards on one PSU > would be nice. We will be using barely 200W per node for 50 nodes and it > just seems like a big waste to buy 50 power supply units. I have read > the old posts but did not see any reports of success. > > Look at the Supermicro twin systems, they have two motherboards in 1U or > four motherboards in 2U. > > I believe HP have similar. What I learned at SC09: HP does make twin nodes similar to SuperMicro, but the HP nodes are not hot-swappable: if a single node goes down, you need to take down all the nodes in the chassis before you can remove the dead node. Not very practical. The SuperMicro nodes are definitely hot-swappable. > > Or of course any of the blade chassis ? Supermicro, HP, Sun and dare I > say it SGI. > > On a smaller scale you could look at the "personal supercomputers" from > Cray and SGI. > The one problem with most blade chassis is that they are still relatively expensive. There's a lot of technology in the backplanes (ethernet, IB, KVM, etc.), making them expensive. At SC09, I saw that Appro has a "dumb" blade chassis - the chassis only provides power, and the networking, KVM access, etc., are accessed from the front of the blade. The logic being that this makes the chassis/blades cheaper since there's less custom (costly) technology. I didn't get any pricing information, so I don't know if the cost savings is real or just a marketing claim.
-- Prentice From lynesh at Cardiff.ac.uk Thu Dec 3 07:56:57 2009 From: lynesh at Cardiff.ac.uk (Huw Lynes) Date: Thu, 03 Dec 2009 15:56:57 +0000 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <4B17DBDC.5060106@ias.edu> References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> <68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com> <4B17DBDC.5060106@ias.edu> Message-ID: <1259855817.6329.3.camel@w609.insrv.cf.ac.uk> On Thu, 2009-12-03 at 10:40 -0500, Prentice Bisbal wrote: > Hearns, John wrote: > > > > I was wondering if anyone has actual experience with running more than > > one node from a single power supply. Even just two boards on one PSU > > would be nice. We will be using barely 200W per node for 50 nodes and it > > just seems like a big waste to buy 50 power supply units. I have read > > the old posts but did not see any reports of success. > > > > Look at the Supermicro twin systems, they have two motherboards in 1U or > > four motherboards in 2U. > > > > I believe HP have similar. > > What I learned at SC09: > > HP does make twin nodes similar to SuperMicro, but the HP nodes are not > hot-swappable, if a single node goes down, you need to take down all the > nodes in the chassis before you can remove the dead node. Not very > practical. The SuperMicro nodes are definitely hot-swappable. > The Supermicro 2U twin nodes with 4 mobos in each box are hot-swappable. The 1U twin nodes with 2 mobos in each box are not. Yes I don't understand why the 2U box isn't called "quad" either. Cheers, Huw -- Huw Lynes | Advanced Research Computing HEC Sysadmin | Cardiff University | Redwood Building, Tel: +44 (0) 29208 70626 | King Edward VII Avenue, CF10 3NB From jeff.johnson at aeoncomputing.com Wed Dec 2 10:34:20 2009 From: jeff.johnson at aeoncomputing.com (Jeff Johnson) Date: Wed, 02 Dec 2009 10:34:20 -0800 Subject: [Beowulf] Re: Beowulf Digest, Vol 70, Issue 4 In-Reply-To: <200912021821.nB2ILBIK030413@bluewest.scyld.com> References: <200912021821.nB2ILBIK030413@bluewest.scyld.com> Message-ID: <4B16B32C.60703@aeoncomputing.com> On 12/2/09 10:21 AM, beowulf-request at beowulf.org wrote: > ------------------------------ > > Message: 8 > Date: Tue, 1 Dec 2009 12:45:52 -0800 > From: Art Poon > Subject: [Beowulf] Re: cluster fails to boot with managed switch, but > 5-port switch works OK > To:beowulf at beowulf.org > Message-ID:<825EEAB3-C58F-46B8-A9C4-A806C5B682D3 at gmail.com> > Content-Type: text/plain; charset=us-ascii > > Dear colleagues, > > [snip] > > What's got me and the IT guys stumped is that while the compute nodes boot via PXE from the head node without trouble on the NetGear, they barf with the SMC. To be specific, after the initial boot with a minimal Linux kernel, there is a "fatal error" with "timeout waiting for getfile" when the compute node attempts to download the provisioning image from head. However, when they were running Rocks before I arrived, the cluster worked fine with the SMC switch. > > I've tried resetting the SMC switch to factory defaults (with auto-negotiate on). I've checked the /etc/beowulf/modprobe.conf and it doesn't seem to be demanding anything exotic. We've tried swapping out to another SMC switch but that didn't change anything. > > I'm grateful if you could weigh in with your expertise. > I don't know if my $.02 here could be classified as 'expertise'. 
With that disclaimer out of the way I can say that SMC switches do have a tendency to have very old firmware when they are stocked in warehouses and they are not often updated. Their update process is a PITA compared to other switches out there. I have seen cases where their old firmware and STP (spanning tree protocol) causes enough delay when a port comes up on the switch for the first time in a pxe/dhcp operation that the process times out while the switch is trying to figure out if there are network loops. The firmware update can be obtained from www.smc.com and is at v2.3.0.0 updated in March. Check your switch to see where you are at now. The Netgear switches are layer-2 and too dumb to cause problems. > Thank you, > - Art. > > > > > ------------------------------ > > -- ------------------------------ Jeff Johnson Manager Aeon Computing jeff.johnson at aeoncomputing.com www.aeoncomputing.com t: 858-412-3810 f: 858-412-3845 4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117 From artpoon at gmail.com Wed Dec 2 10:42:28 2009 From: artpoon at gmail.com (Art Poon) Date: Wed, 2 Dec 2009 10:42:28 -0800 Subject: [Beowulf] Re: cluster fails to boot with managed switch, but 5-port switch works OK In-Reply-To: <4B16B22A.8070603@scalableinformatics.com> References: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com> <4B16B22A.8070603@scalableinformatics.com> Message-ID: <9AB77A0E-3092-4395-A299-410B8C97C095@gmail.com> Hi all, Thanks for your responses! I finally fixed this yesterday afternoon but neglected to update my post, my apologies. After discussing our problem to the Penguin Computing service rep, I reconfigured the switch to enable fast spanning-tree mode for compute node ports. That apparently fixed the problem and thanks to your feedback I am starting to understand why. Thanks again, - Art. On Dec 2, 2009, at 10:30 AM, Joe Landman wrote: > Art Poon wrote: >> Dear colleagues, > > [...] > >> What's got me and the IT guys stumped is that while the compute nodes >> boot via PXE from the head node without trouble on the NetGear, they >> barf with the SMC. To be specific, after the initial boot with a >> minimal Linux kernel, there is a "fatal error" with "timeout waiting >> for getfile" when the compute node attempts to download the >> provisioning image from head. However, when they were running Rocks >> before I arrived, the cluster worked fine with the SMC switch. > > Is it the switch of the dhcp/bootp/tftp setup thats the problem? Are you sure the tftp daemon is up, or bootp is configured correctly? > > Switches sometimes have broadcast storm suppression turned on, or worse, sometimes they have spanning tree turned on. You want the switch to be as dumb as you can possibly make it for most linux clusters. Fast, but dumb. > >> I've tried resetting the SMC switch to factory defaults (with >> auto-negotiate on). I've checked the /etc/beowulf/modprobe.conf and >> it doesn't seem to be demanding anything exotic. We've tried >> swapping out to another SMC switch but that didn't change anything. > > This sounds more on the server software stack than the switch. Could you describe this? Are you using Scyld/Rocks for that? > > Rocks is quite sensitive to configuration issues, and really doesn't like altered configurations (it is possible to do, though non-trivial). > > -- > Joseph Landman, Ph.D > Founder and CEO > Scalable Informatics Inc. 
> email: landman at scalableinformatics.com > web : http://scalableinformatics.com > http://scalableinformatics.com/jackrabbit > phone: +1 734 786 8423 x121 > fax : +1 866 888 3112 > cell : +1 734 612 4615 From vallard at benincosa.com Wed Dec 2 10:59:09 2009 From: vallard at benincosa.com (Vallard Benincosa) Date: Wed, 2 Dec 2009 10:59:09 -0800 Subject: [Beowulf] Re: cluster fails to boot with managed switch,but 5-port switch works OK In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0E75A725@milexchmb1.mil.tagmclarengroup.com> References: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com> <68A57CCFD4005646957BD2D18E60667B0E75A725@milexchmb1.mil.tagmclarengroup.com> Message-ID: Spanning Trees are usually the biggest culprit in pxe booting and switch problems. For SMC switches I think you just need to do something like: telnet smc enable config interface ethernet 1/1 # if this is the interface your client node is on spanning-tree edge-port exit copy run start On Wed, Dec 2, 2009 at 10:36 AM, Hearns, John wrote: > > I've tried resetting the SMC switch to factory defaults (with > auto-negotiate on). I've checked the /etc/beowulf/modprobe.conf and it > doesn't seem to be demanding anything exotic. We've tried swapping out > to another SMC switch but that didn't change anything. > > No idea really, as I don't use SMC switches. > First thing I would do would be to get a laptop with Linux on, and > attach it to the SMC switch. > Configure the Ethernet interface to use DHCP, then ifconfig it down then > ifcoinfig it up. > Run tcpdump eth0 as you do this, and tail -f /var/log/messages > > > If it gets a DHCP address (and of course on the cluster head node there > is a pool of free DHCP addresses configured) > the test tftp file transfer. Set the tftp daemon on the head node to run > in debug mode, and start it up. > Try a tftp get of a test file on the laptop. > > > The contents of this email are confidential and for the exclusive use of > the intended recipient. If you receive this email in error you should not > copy it, retransmit it, use it or disclose its contents but should return it > to the sender immediately and delete your copy. > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From crhea at mayo.edu Wed Dec 2 11:14:29 2009 From: crhea at mayo.edu (Cris Rhea) Date: Wed, 2 Dec 2009 13:14:29 -0600 Subject: [Beowulf] Re: Booting nodes with PXE... In-Reply-To: <200912021853.nB2IrssM031377@bluewest.scyld.com> References: <200912021853.nB2IrssM031377@bluewest.scyld.com> Message-ID: <20091202191429.GA6361@kaizen.mayo.edu> > > What's got me and the IT guys stumped is that while the compute nodes > > boot via PXE from the head node without trouble on the NetGear, they > > barf with the SMC. To be specific, after the initial boot with a > > minimal Linux kernel, there is a "fatal error" with "timeout waiting > > for getfile" when the compute node attempts to download the > > provisioning image from head. However, when they were running Rocks > > before I arrived, the cluster worked fine with the SMC switch. > > Switches sometimes have broadcast storm suppression turned on, or worse, > sometimes they have spanning tree turned on. You want the switch to be > as dumb as you can possibly make it for most linux clusters. 
Fast, but > dumb. As some have already commented, I'm assuming you have tested each service (DHCP, tftp, etc.). My bet is on "spanning tree", as mentioned above. Watch the Ethernet lights on the node when booting and see if the port comes alive/stable before you get the timeout. I've seen this in spades if "spanning tree portfast" isn't set on Cisco switches-- just takes too long to negotiate the GbE interface. --- Cris -- Cristopher J. Rhea Mayo Clinic - Research Computing Facility 200 First St SW, Rochester, MN 55905 crhea at Mayo.EDU (507) 284-0587 From mclewis at ucdavis.edu Wed Dec 2 13:06:43 2009 From: mclewis at ucdavis.edu (Michael Lewis) Date: Wed, 2 Dec 2009 13:06:43 -0800 Subject: [Beowulf] Re: cluster fails to boot with managed switch, but 5-port switch works OK In-Reply-To: <4B16CFC1.8060603@cse.ucdavis.edu> References: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com> <4B16CFC1.8060603@cse.ucdavis.edu> Message-ID: <20091202210643.GX12440@durian.genomecenter.ucdavis.edu> On Wed, Dec 02, 2009 at 12:36:17PM -0800, Bill Broadley wrote: > Art Poon wrote: > > I've tried resetting the SMC switch to factory defaults (with > > auto-negotiate on). I've checked the /etc/beowulf/modprobe.conf and it > > doesn't seem to be demanding anything exotic. We've tried swapping out to > > another SMC switch but that didn't change anything. > > I had a very unpleasant experience with an SMC switch awhile back. I was > having problems trying to bootstrap a rocks cluster. Turns out the SMC (and > Dell relabel) was so evil that it warranted a mention in the Rocks FAQ. I run the cluster that Bill is describing here. Indeed, the default configuration of the SMC switches was to have spanning-tree turned on on all ports. The symptom we had was that the machines would PXEboot fine and load a kernel, but then fail to DHCP later. Even worse, the switches would occasionally revert back to this setting if they lost power. Also, as Bill notes, there is a Dell rebrand of the same switch, which runs slightly different firmware. If you've got one of those, get the firmware from the Dell site, not from SMC. > I believe the solution was to manually turn on edge node routing or similar on > each port. Unfortunately there was a bug and you could only turn on the first > 16 ports. There was a fix with new firmware, but there were 2 firmware images > and you couldn't tell which from looking at the switch. Said firmware upgrade > caused other problems. Here was the fix we used. For each port (replace 1/5 with 1/N for N=1..): Console#config Console(config)#interface ethernet 1/5 Console(config-if)#spanning-tree edge-port And don't forget to write back to flash when you're done. After the firmware updates, we haven't had the issue of the configuration resetting anymore, but we've also upgraded the cluster to a better switch. The SMCs now run other non-cluster servers. 
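To put the laptop test suggested earlier in the thread into concrete terms, something like the following is usually enough to tell a switch problem from a dhcp/tftp problem; the interface name, head-node address and file name are placeholders:

  # watch DHCP and TFTP traffic on the port in question
  tcpdump -n -i eth0 port 67 or port 68 or port 69 &
  # ask the head node for a lease
  dhclient eth0
  # then pull a test file from the head node's tftp server
  tftp 10.1.1.1
  tftp> get pxelinux.0
  tftp> quit

If the lease or the transfer only succeeds 30 seconds or so after the link light comes up, that delay is the spanning-tree listening/learning period that the portfast/edge-port settings above are meant to skip.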
-- Michael Lewis | mclewis at ucdavis.edu Systems Administrator | Voice: (530) 754-7978 Genome Center | University of California, Davis | From christiansuhendra at gmail.com Wed Dec 2 22:16:00 2009 From: christiansuhendra at gmail.com (christian suhendra) Date: Thu, 3 Dec 2009 14:16:00 +0800 Subject: [Beowulf] lamexec vs mpirun Message-ID: hello guys i've buliding a cluster system with mpich-1.2.7p1,when i test my machine it work..but when i run the mpirun it doesn't work ini cluster...but its running and i got this error It seems that [at least] one of the processes that was started with mpirun did not invoke MPI_INIT before quitting (it is possible that more than one process did not invoke MPI_INIT -- mpirun was only notified of the first one, which was on node n0). mpirun can *only* be used with MPI programs (i.e., programs that invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program to run non-MPI programs over the lambooted nodes. and then i try to run lamexec it work.. i wanna ask you guys the different between of lamexec and mpirun thank you for your advice regards christian -------------- next part -------------- An HTML attachment was scrubbed... URL: From lindahl at pbm.com Thu Dec 3 11:02:00 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 3 Dec 2009 11:02:00 -0800 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <4B17DBDC.5060106@ias.edu> References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> <68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com> <4B17DBDC.5060106@ias.edu> Message-ID: <20091203190200.GA647@bx9.net> On Thu, Dec 03, 2009 at 10:40:12AM -0500, Prentice Bisbal wrote: > if a single node goes down, you need to take down all the > nodes in the chassis before you can remove the dead node. Not very > practical. Eh? What's so hard about marking the other nodes as unusable in your batch system, and waiting for them to become free? -- greg From Glen.Beane at jax.org Thu Dec 3 11:22:23 2009 From: Glen.Beane at jax.org (Glen Beane) Date: Thu, 3 Dec 2009 14:22:23 -0500 Subject: [Beowulf] lamexec vs mpirun In-Reply-To: Message-ID: On 12/3/09 1:16 AM, "christian suhendra" wrote: hello guys i've buliding a cluster system with mpich-1.2.7p1,when i test my machine it work..but when i run the mpirun it doesn't work ini cluster...but its running and i got this error It seems that [at least] one of the processes that was started with mpirun did not invoke MPI_INIT before quitting (it is possible that more than one process did not invoke MPI_INIT -- mpirun was only notified of the first one, which was on node n0). mpirun can *only* be used with MPI programs (i.e., programs that invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program to run non-MPI programs over the lambooted nodes. and then i try to run lamexec it work.. i wanna ask you guys the different between of lamexec and mpirun thank you for your advice You are not using mpich, you are using LAM-MPI. LAM uses daemons that must be running on all of the nodes before mpirun can launch the MPI executables. The lamexec command essentially does a "lamboot" and "mpirun" in a single command. By the way, LAM is deprecated, and its users are encouraged to switch to OpenMPI. If you were intending to use mpich and have it installed on your system LAM is being found first in your path and not mpich. -- Glen L. 
Beane Software Engineer The Jackson Laboratory Phone (207) 288-6153 -------------- next part -------------- An HTML attachment was scrubbed... URL: From hahn at mcmaster.ca Thu Dec 3 11:29:10 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 3 Dec 2009 14:29:10 -0500 (EST) Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <20091203190200.GA647@bx9.net> References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> <68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com> <4B17DBDC.5060106@ias.edu> <20091203190200.GA647@bx9.net> Message-ID: >> if a single node goes down, you need to take down all the >> nodes in the chassis before you can remove the dead node. Not very >> practical. > > Eh? What's so hard about marking the other nodes as unusable in your > batch system, and waiting for them to become free? depends on your max job length. but yeah, idling three nodes for a week is not going to be noticable in anything but a quite small cluster... From Glen.Beane at jax.org Thu Dec 3 11:31:58 2009 From: Glen.Beane at jax.org (Glen Beane) Date: Thu, 3 Dec 2009 14:31:58 -0500 Subject: [Beowulf] lamexec vs mpirun In-Reply-To: Message-ID: On 12/3/09 2:22 PM, "Glen Beane" wrote: On 12/3/09 1:16 AM, "christian suhendra" wrote: hello guys i've buliding a cluster system with mpich-1.2.7p1,when i test my machine it work..but when i run the mpirun it doesn't work ini cluster...but its running and i got this error It seems that [at least] one of the processes that was started with mpirun did not invoke MPI_INIT before quitting (it is possible that more than one process did not invoke MPI_INIT -- mpirun was only notified of the first one, which was on node n0). mpirun can *only* be used with MPI programs (i.e., programs that invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program to run non-MPI programs over the lambooted nodes. and then i try to run lamexec it work.. i wanna ask you guys the different between of lamexec and mpirun thank you for your advice You are not using mpich, you are using LAM-MPI. LAM uses daemons that must be running on all of the nodes before mpirun can launch the MPI executables. The lamexec command essentially does a "lamboot" and "mpirun" in a single command. By the way, LAM is deprecated, and its users are encouraged to switch to OpenMPI. If you were intending to use mpich and have it installed on your system LAM is being found first in your path and not mpich. Actually, I take this back - lamexec is not like lamboot and mpirun in one, for some reason I had LAM's mpiexec command on my mind, which if I remember right is equivalent to lamboot + mpirun (It has been a while). lamexec is like a distributed shell that launches non-mpi programs on nodes running the LAM daemon (lamd). but my main point still stands, you aren't using mpich even if you think you are. -- Glen L. Beane Software Engineer The Jackson Laboratory Phone (207) 288-6153 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jlb17 at duke.edu Thu Dec 3 11:35:45 2009 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Thu, 3 Dec 2009 14:35:45 -0500 (EST) Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> <68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com> <4B17DBDC.5060106@ias.edu> <20091203190200.GA647@bx9.net> Message-ID: On Thu, 3 Dec 2009 at 2:29pm, Mark Hahn wrote >>> if a single node goes down, you need to take down all the >>> nodes in the chassis before you can remove the dead node. Not very >>> practical. >> >> Eh? What's so hard about marking the other nodes as unusable in your >> batch system, and waiting for them to become free? > > depends on your max job length. but yeah, idling three nodes for a week > is not going to be noticable in anything but a quite small cluster... But doesn't the engineer in you just bristle at the (admittedly, rather slight) inefficiency? Call me OCD (you wouldn't be the first), but it just bugs me. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From hahn at mcmaster.ca Thu Dec 3 11:54:13 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 3 Dec 2009 14:54:13 -0500 (EST) Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> <68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com> <4B17DBDC.5060106@ias.edu> <20091203190200.GA647@bx9.net> Message-ID: >> depends on your max job length. but yeah, idling three nodes for a week >> is not going to be noticable in anything but a quite small cluster... > > But doesn't the engineer in you just bristle at the (admittedly, rather > slight) inefficiency? Call me OCD (you wouldn't be the first), but it just > bugs me. well, what I prefer to do is set a reservation on the node that starts one max-job-period in the future. that means shorter jobs get to use the node until then. ultimately, it depends both on how much individual hotpluggability costs (the HW changes don't sound trivial to me) as well as the joblength/etc. I doubt I'd choose to pay more than a few percent for HPability, assuming 2-4 nodes-per-box and ~<7d job limits... From prentice at ias.edu Thu Dec 3 11:54:47 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 03 Dec 2009 14:54:47 -0500 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <20091203190200.GA647@bx9.net> References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> <68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com> <4B17DBDC.5060106@ias.edu> <20091203190200.GA647@bx9.net> Message-ID: <4B181787.2050407@ias.edu> Greg Lindahl wrote: > On Thu, Dec 03, 2009 at 10:40:12AM -0500, Prentice Bisbal wrote: > >> if a single node goes down, you need to take down all the >> nodes in the chassis before you can remove the dead node. Not very >> practical. > > Eh? What's so hard about marking the other nodes as unusable in your > batch system, and waiting for them to become free? > > -- greg > I didn't say it was hard - just impractical. ;) I thought the same thing when HP told me the nodes weren't hot-swappable. But then when I learned the SuperMicros were hot swappable, I figured if SuperMicro can do it, why not HP? I'm sure you'll agree that taking just one node down instead of 4 is more convenient, and is less likely to draw the ire of your number-crunchers. 
-- Prentice From lindahl at pbm.com Thu Dec 3 12:07:45 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 3 Dec 2009 12:07:45 -0800 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> <68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com> <4B17DBDC.5060106@ias.edu> <20091203190200.GA647@bx9.net> Message-ID: <20091203200745.GA14991@bx9.net> On Thu, Dec 03, 2009 at 02:35:45PM -0500, Joshua Baker-LePain wrote: > But doesn't the engineer in you just bristle at the (admittedly, rather > slight) inefficiency? Call me OCD (you wouldn't be the first), but it > just bugs me. No. If I saved a bit of $$ with the funky case, that's probably a lot more than I lose from having some idle nodes now and then. -- greg From Greg at keller.net Thu Dec 3 12:17:56 2009 From: Greg at keller.net (Greg Keller) Date: Thu, 3 Dec 2009 14:17:56 -0600 Subject: [Beowulf] Re: cluster fails to boot with managed switch, but 5-port switch works OK In-Reply-To: <200912031846.nB3Ikesg029202@bluewest.scyld.com> References: <200912031846.nB3Ikesg029202@bluewest.scyld.com> Message-ID: <481727C4-110B-4D18-8361-DA12469D70FF@Keller.net> >>>> What's got me and the IT guys stumped is that while the compute >>>> nodes >>> boot via PXE from the head node without trouble on the NetGear, they >>> barf with the SMC. To be specific, after the initial boot with a >>> minimal Linux kernel, there is a "fatal error" with "timeout >>> waiting for >>> getfile" when the compute node attempts to download the provisioning >>> image from head. However, when they were running Rocks before I >>> arrived, the cluster worked fine with the SMC switch. This is very common with Spanning tree enabled. Essentially, once the port has a physical link light it may take a while before spanning tree allows traffic to actually flow through the port. Longer than a typical timeout. When loading/reloading the driver there seems to be an instantaneous drop of the link that forces a new delay cycle. With the Dell PowerConnect (SMC Rebrand??) series you have to "enable" port fast or "disable" spanning tree to avoid this delay before traffic passes. I generally do both. The Web based GUI is sufficiently bad enough to make this more difficult than it needs to be, but you can globally disable spanning tree through it. I use the command line, connect to interface range all, and then configure my ports as: ! enable config interface range ethernet all spanning-tree disable spanning-tree portfast mtu 9216 exit ! Hope this helps! Cheers! Greg Technical Principal R Systems NA, inc. From lindahl at pbm.com Thu Dec 3 12:21:25 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 3 Dec 2009 12:21:25 -0800 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster Message-ID: <20091203202125.GF14991@bx9.net> On Thu, Dec 03, 2009 at 02:54:47PM -0500, Prentice Bisbal wrote: > I didn't say it was hard - just impractical. ;) I suspect we have different ideas about what is impractical. I take nodes out of service gracefully all the time to fix fans, intermittant memory errors, system disks going bad, and the like. If anyone complained, I would explain that I'm saving them from seeing interrupted jobs. 
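In TORQUE/PBS-style batch systems that drain-and-wait approach is a one-liner each way; the node name below is made up:

  # stop new jobs landing on the suspect node, let running work finish
  pbsnodes -o node042
  # after the fan/disk/DIMM swap, put the node back in service
  pbsnodes -c node042

Most other schedulers have an equivalent way of disabling a host, so the cost of the policy is administrative rather than technical.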
-- greg From gerry.creager at tamu.edu Thu Dec 3 12:30:54 2009 From: gerry.creager at tamu.edu (Gerald Creager) Date: Thu, 03 Dec 2009 14:30:54 -0600 Subject: [Beowulf] Forwarded from a long time reader having trouble posting In-Reply-To: References: <4B155A43.3010304@scalableinformatics.com> <4B157249.80701@tamu.edu> <35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org> <4B1584E0.8060704@tamu.edu> <20091201214217.GA19804@galactic.demon.co.uk> <4B159330.5070802@tamu.edu> Message-ID: <4B181FFE.2070401@tamu.edu> Toon Knapen wrote: > > I believe xfs is now available in 5.4. I'd have to check. We've > found xfs to be our preference (but we're revisiting gluster and > lustre). I've not played with gfs so far. > > > > And why do you prefer xfs if I may ask. Performance? Do you many small > files or large files? Our experience was that XFS was both the most reliable and best performing of the various file systems we experimented with. While the majority of our files are relatively small, it also performs well with large files. XFS repair is usually a simple manner and very fast, when the XFS equivalent of an fsck is required. gerry From gerry.creager at tamu.edu Thu Dec 3 13:01:15 2009 From: gerry.creager at tamu.edu (Gerald Creager) Date: Thu, 03 Dec 2009 15:01:15 -0600 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <4B181787.2050407@ias.edu> References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> <68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com> <4B17DBDC.5060106@ias.edu> <20091203190200.GA647@bx9.net> <4B181787.2050407@ias.edu> Message-ID: <4B18271B.4050005@tamu.edu> Prentice Bisbal wrote: > Greg Lindahl wrote: >> On Thu, Dec 03, 2009 at 10:40:12AM -0500, Prentice Bisbal wrote: >> >>> if a single node goes down, you need to take down all the >>> nodes in the chassis before you can remove the dead node. Not very >>> practical. >> Eh? What's so hard about marking the other nodes as unusable in your >> batch system, and waiting for them to become free? >> >> -- greg >> > > I didn't say it was hard - just impractical. ;) > > I thought the same thing when HP told me the nodes weren't > hot-swappable. But then when I learned the SuperMicros were hot > swappable, I figured if SuperMicro can do it, why not HP? > > I'm sure you'll agree that taking just one node down instead of 4 is > more convenient, and is less likely to draw the ire of your > number-crunchers. Because of our ill-advised choice of a specific APC rack, whenever I've got to remove one of the supermicro's from the 2uTwin chassis, I have to power 'em all off. I then have to ease the chassis out so I've enough room to get a node out. I learned to "just do it" after a 2 month whine-in. gerry From jlforrest at berkeley.edu Thu Dec 3 13:50:27 2009 From: jlforrest at berkeley.edu (Jon Forrest) Date: Thu, 03 Dec 2009 13:50:27 -0800 Subject: [Beowulf] Forwarded from a long time reader having trouble posting In-Reply-To: <4B181FFE.2070401@tamu.edu> References: <4B155A43.3010304@scalableinformatics.com> <4B157249.80701@tamu.edu> <35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org> <4B1584E0.8060704@tamu.edu> <20091201214217.GA19804@galactic.demon.co.uk> <4B159330.5070802@tamu.edu> <4B181FFE.2070401@tamu.edu> Message-ID: <4B1832A3.4030507@berkeley.edu> Gerald Creager wrote: > Toon Knapen wrote: >> I believe xfs is now available in 5.4. I'd have to check. >> We've >> found xfs to be our preference (but we're revisiting gluster and >> lustre). 
I've not played with gfs so far. I've used xfs in several CentOS servers. In general, it works great but I've had some scary looking crashes with 'xfs' in the call stack (see below). I stupidly haven't taken the time to send the crashes the the xfs people so maybe these problems are known and fixed. I also haven't tried xfs with CentOS 5.4, which contains official support in the kernel. I'm sticking with ext3 for now, not because it's great, but because it's reliable and good enough. I'd love to switch to xfs, though. Here's one stack trace: Sep 28 08:41:08 frank kernel: Filesystem "md1": XFS internal error xfs_btree_check_sblock at line 307 of file fs/xfs/xfs_btree.c. Caller 0xffffffff8836deaf Sep 28 08:41:08 frank kernel: Sep 28 08:41:08 frank kernel: Call Trace: Sep 28 08:41:08 frank kernel: [] :xfs:xfs_btree_check_sblock+0xaf/0xbe Sep 28 08:41:08 frank kernel: [] :xfs:xfs_inobt_increment+0x156/0x17e Sep 28 08:41:08 frank kernel: [] :xfs:xfs_dialloc+0x4d0/0x80c Sep 28 08:41:08 frank kernel: [] :xfs:xfs_ialloc+0x5f/0x57f Sep 28 08:41:08 frank kernel: [] :xfs:xfs_dir_ialloc+0x86/0x2b7 Sep 28 08:41:08 frank kernel: [] :xfs:xlog_grant_log_space+0x204/0x25c Sep 28 08:41:08 frank kernel: [] :xfs:xfs_create+0x237/0x45c Sep 28 08:41:08 frank kernel: [] :xfs:xfs_attr_get+0x8e/0x9f Sep 28 08:41:08 frank kernel: [] :xfs:xfs_vn_mknod+0x144/0x215 Sep 28 08:41:08 frank kernel: [] vfs_create+0xe6/0x158 Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From rpnabar at gmail.com Thu Dec 3 15:08:39 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 3 Dec 2009 17:08:39 -0600 Subject: [Beowulf] Re: cluster fails to boot with managed switch, but 5-port switch works OK In-Reply-To: <481727C4-110B-4D18-8361-DA12469D70FF@Keller.net> References: <200912031846.nB3Ikesg029202@bluewest.scyld.com> <481727C4-110B-4D18-8361-DA12469D70FF@Keller.net> Message-ID: On Thu, Dec 3, 2009 at 2:17 PM, Greg Keller wrote: > This is very common with Spanning tree enabled. ?Essentially, once the port > has a physical link light it may take a while before spanning tree allows > traffic to actually flow through the port. I've had to do this same thing that Greg describes with my Dell power connect (port fast and spanning tree disable) but on a more fundamental level: Why is the long delay with spanning tree? And is it possible to (a) reduce this delay (b) Increase the timeout for PXE / DHCP? -- Rahul From csamuel at vpac.org Thu Dec 3 16:53:39 2009 From: csamuel at vpac.org (Chris Samuel) Date: Fri, 4 Dec 2009 11:53:39 +1100 (EST) Subject: [Beowulf] Forwarded from a long time reader having trouble posting In-Reply-To: <4B181FFE.2070401@tamu.edu> Message-ID: <1523002622.7329231259888019196.JavaMail.root@mail.vpac.org> ----- "Gerald Creager" wrote: > XFS repair is usually a simple manner and very fast, You can also do online growing of XFS filesystems (if you use LVM for instance) which is dead useful. Added bonus - you can also do online defragmentation of XFS filesystems with xfs_fsr (for some reason packaged in the xfsdump .deb). cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. 
Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From csamuel at vpac.org Thu Dec 3 17:57:07 2009 From: csamuel at vpac.org (Chris Samuel) Date: Fri, 4 Dec 2009 12:57:07 +1100 (EST) Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <20091203190200.GA647@bx9.net> Message-ID: <588682529.7331271259891827276.JavaMail.root@mail.vpac.org> ----- "Greg Lindahl" wrote: > Eh? What's so hard about marking the other nodes > as unusable in your batch system, and waiting for > them to become free? If you've got a job running on there for a month or two then there's a fairly high opportunity cost involved. -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From lindahl at pbm.com Thu Dec 3 18:11:22 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 3 Dec 2009 18:11:22 -0800 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <588682529.7331271259891827276.JavaMail.root@mail.vpac.org> References: <20091203190200.GA647@bx9.net> <588682529.7331271259891827276.JavaMail.root@mail.vpac.org> Message-ID: <20091204021122.GA17286@bx9.net> On Fri, Dec 04, 2009 at 12:57:07PM +1100, Chris Samuel wrote: > If you've got a job running on there for a month > or two then there's a fairly high opportunity cost > involved. That kind of policy has a fairly high opportunity cost, even before you factor in linked nodes. E.g. you see a system disk going bad, but the user will lose all their output unless the job runs for 4 more weeks... -- greg From bill at cse.ucdavis.edu Thu Dec 3 18:19:08 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Thu, 03 Dec 2009 18:19:08 -0800 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <20091204021122.GA17286@bx9.net> References: <20091203190200.GA647@bx9.net> <588682529.7331271259891827276.JavaMail.root@mail.vpac.org> <20091204021122.GA17286@bx9.net> Message-ID: <4B18719C.5080008@cse.ucdavis.edu> Greg Lindahl wrote: > On Fri, Dec 04, 2009 at 12:57:07PM +1100, Chris Samuel wrote: > >> If you've got a job running on there for a month >> or two then there's a fairly high opportunity cost >> involved. > > That kind of policy has a fairly high opportunity cost, even before > you factor in linked nodes. E.g. you see a system disk going bad, but > the user will lose all their output unless the job runs for 4 more > weeks... Indeed. You'd hope that such long running jobs would checkpoint. Seems like the perfect place for virtualization. Seems like for mostly CPU bound jobs the overhead is getting pretty low. Then you get all kinds of benefits: * Checkpointing * Migration * easy backfill Seems like it would be real popular with the admins. Anyone doing this? From csamuel at vpac.org Thu Dec 3 18:32:12 2009 From: csamuel at vpac.org (Chris Samuel) Date: Fri, 4 Dec 2009 13:32:12 +1100 (EST) Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <195152680.7331771259893408253.JavaMail.root@mail.vpac.org> Message-ID: <1193336713.7333671259893932486.JavaMail.root@mail.vpac.org> ----- "Greg Lindahl" wrote: > That kind of policy has a fairly high opportunity > cost, even before you factor in linked nodes. 
Well we cannot dictate to our users what they do, we set a maximum walltime of 3 months and tell users that they should checkpoint (if they have control of the application and have coding skills). > E.g. you see a system disk going bad, but the user > will lose all their output unless the job runs for > 4 more weeks... We run SMART tests and the like trying to proactively spot bad disks (and other hardware) prior to failures, but yes, that's inevitable. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From richard.walsh at comcast.net Thu Dec 3 18:34:39 2009 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Fri, 4 Dec 2009 02:34:39 +0000 (UTC) Subject: [Beowulf] Dual head or service node related question ... Message-ID: <816343207.8941431259894079516.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> All, In the typical cluster, with a single head node, that node provides login services, batch job submission services, and often supports a shared file space mounted via NFS from the head node to the compute nodes. This approach works reasonably well for not-too-large cluster systems. What is viewed as the best practice (or what are people doing) on something like an SGI ICE system with multiple service or head nodes? Does one service node generally assume the same role as the head node above (serving NFS, logins, and running services like PBS pro)? Or ... if NFS is used, is it perhaps served from another service node and mounted both on the login node and the compute nodes? Read-only? Is it better to support a shared file space via Lustre across all the nodes? The architecture chosen has implications ... for instance in the common case above PBS Pro would be installed on the head node, perhaps in the shared space, and its server and scheduler would be run by /etc/init.d/pbs off of the shared partition. The bin and sbin commands would shared by the compute nodes. In a case where the login service node and the shared file space, NFS service node are different, PBS installation must be done on the NFS service node in the case where the share space is mounted read-only and only the commands and man pages would be installed on the login node. What are the implications for other user applications the one would like to install in the share space for use from the login nodes? Some might have write requirements into the installation directory? Does this indicate that the NFS partition should be mounted read-write on the login node, but read-only on the compute nodes? Comments and suggestions, particularly from those that have set things up on SGI ICE cluster systems would be much appreciated. rbw -------------- next part -------------- An HTML attachment was scrubbed... URL: From csamuel at vpac.org Thu Dec 3 18:34:42 2009 From: csamuel at vpac.org (Chris Samuel) Date: Fri, 4 Dec 2009 13:34:42 +1100 (EST) Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <4B18719C.5080008@cse.ucdavis.edu> Message-ID: <1587074959.7333741259894082772.JavaMail.root@mail.vpac.org> ----- "Bill Broadley" wrote: > Indeed. You'd hope that such long running jobs would checkpoint. They didn't at first, but when we asked them to they decided they were going to checkpoint all 2GB of RAM every minute. Over NFS. We got them to fix that.. > Seems like the perfect place for virtualization. ..or blcr! [...] 
> Seems like it would be real popular with the admins. > Anyone doing this? How does it deal with pinned DMA memory on NICs ? cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From lindahl at pbm.com Thu Dec 3 18:41:29 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 3 Dec 2009 18:41:29 -0800 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <1193336713.7333671259893932486.JavaMail.root@mail.vpac.org> References: <195152680.7331771259893408253.JavaMail.root@mail.vpac.org> <1193336713.7333671259893932486.JavaMail.root@mail.vpac.org> Message-ID: <20091204024129.GA25462@bx9.net> > > E.g. you see a system disk going bad, but the user > > will lose all their output unless the job runs for > > 4 more weeks... > > We run SMART tests and the like trying to proactively > spot bad disks (and other hardware) prior to failures, > but yes, that's inevitable. It's not inevitable that the policy be that 3 month jobs are allowed. But you know me: I never saw a battle I didn't want to fight :-) Arrr, mateys, this be the BOFH, and I'm heere to educate you about the right way to use this here supercomputer... my way... or walk the plank! -- greg From hahn at mcmaster.ca Thu Dec 3 22:30:58 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 4 Dec 2009 01:30:58 -0500 (EST) Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <20091204024129.GA25462@bx9.net> References: <195152680.7331771259893408253.JavaMail.root@mail.vpac.org> <1193336713.7333671259893932486.JavaMail.root@mail.vpac.org> <20091204024129.GA25462@bx9.net> Message-ID: >>> E.g. you see a system disk going bad, but the user >>> will lose all their output unless the job runs for >>> 4 more weeks... until fairly recently (sometime this year), we didn't constrain the length of jobs. we now have a 1 week limit - generally argued on the basis of expecting longer jobs to checkpoint. we also provide blcr for serial/threaded jobs. I have mixed feelings about this. the purpose of organizations providing HPC is to _enable_, not obstruct. in some cases, this could mean working with a group to find an alternative better than, for instance, not checkpointing a resource-intensive job. our node/power failure rates are pretty low - not enough to justify a 1-week limit. but to he honest, the main motive is probably to increase cluster churn - essentially improving scheduler fairness. > It's not inevitable that the policy be that 3 month jobs are allowed. if a length limit is to be justified based on probability-of-failure, it should be ~ 1/nnodes; if fail-cost-based, 1/ncpus. unfortunately, the other extreme would be a sort of "invisible hand" where users experimentally derive the failure rate by their rate of failed jobs jobs ;( personally, I think facilities should permit longer jobs, though perhaps only after discussing the risks and alternatives. an economic approach might reward checkpointing with a fairshare bonus - merely rewarding short jobs seems wrong-headed. 
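For readers unfamiliar with the BLCR workflow being discussed, the checkpoint/restart cycle for a serial or threaded job reduces to a few commands. This is only a minimal sketch -- the application name, PID handling and file locations are placeholders, and the default context.<PID> output name is assumed:

  # start the job under BLCR control (preloads the checkpoint library)
  cr_run ./my_solver input.dat &
  APP_PID=$!

  # later, on the same node, write a checkpoint of the running process
  # (by default this produces context.<PID> in the current directory)
  cr_checkpoint $APP_PID

  # after a failure or a requeue, resume from the saved context file
  cr_restart context.$APP_PID

Scheduler integration (periodic checkpoints, restart on requeue) builds on these same three commands.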
From h-bugge at online.no Thu Dec 3 23:29:29 2009 From: h-bugge at online.no (=?ISO-8859-1?Q?H=E5kon_Bugge?=) Date: Fri, 4 Dec 2009 08:29:29 +0100 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <1587074959.7333741259894082772.JavaMail.root@mail.vpac.org> References: <1587074959.7333741259894082772.JavaMail.root@mail.vpac.org> Message-ID: <8B9ED9B6-6D97-4F77-B8EF-A8A0A0D49131@online.no> Hi, On Dec 4, 2009, at 3:34 , Chris Samuel wrote: > > How does it deal with pinned DMA memory on NICs ? What we did in Platform (Scali) MPI, was to drain the HPC interconnect, then close it down. The problem was then reduced to checkpoint (e.g. using BLCR) N processes. Continuing from checkpoint and restarting from it would both re-open the HPC fabric (could be on another physical medium though). You could take the checkpoint on IB and restart using Gbe. Combined with an agnostic interconnect support, this feature allows you in the case of a failing IB HCA (or failing switch port or cable) to restart from last the checkpoint, runn M-1 nodes communicating with other M-2 IB capable nodes using IB, and the last node communicating with the M-1 nodes using Gbe. Traditional checkpointing requires snap-shot of the file-system in the general case (and restore of the correct snap-shot at restart), whereas checkpoint-and-kill (for migration or preemptive batch scheduling) does not require integration with file-systems. H?kon From john.hearns at mclaren.com Fri Dec 4 01:24:23 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Fri, 4 Dec 2009 09:24:23 -0000 Subject: [Beowulf] Dual head or service node related question ... In-Reply-To: <816343207.8941431259894079516.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> References: <816343207.8941431259894079516.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: <68A57CCFD4005646957BD2D18E60667B0E7D0152@milexchmb1.mil.tagmclarengroup.com> What is viewed as the best practice (or what are people doing) on something like an SGI ICE system with multiple service or head nodes? Does one service node generally assume the same role as the head node above (serving NFS, logins, and running services like PBS pro)? Or ... if NFS is used, is it perhaps served from another service node and mounted both on the login node and the compute nodes? Two service nodes which act as login/batch submission nodes. PBSpro configured to fail over between them (ie one is the PBS primary server). Separate server for storage ? SGI connect these storage servers via the Infiniband fabric, and use multiple Infiniband ports to spread the load ? you can easily configure this at cluster install time, ie. every nth node connects to a different Infiniband port on the storage server. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From john.hearns at mclaren.com Fri Dec 4 01:45:43 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Fri, 4 Dec 2009 09:45:43 -0000 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <20091204024129.GA25462@bx9.net> References: <195152680.7331771259893408253.JavaMail.root@mail.vpac.org><1193336713.7333671259893932486.JavaMail.root@mail.vpac.org> <20091204024129.GA25462@bx9.net> Message-ID: <68A57CCFD4005646957BD2D18E60667B0E7D01AE@milexchmb1.mil.tagmclarengroup.com> It's not inevitable that the policy be that 3 month jobs are allowed. Three MONTHS. Some celebrities careers are shorter than that these days. If people running jobs like this don't checkpoint, they deserve everything they get. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From reuti at staff.uni-marburg.de Fri Dec 4 03:13:08 2009 From: reuti at staff.uni-marburg.de (Reuti) Date: Fri, 4 Dec 2009 12:13:08 +0100 Subject: [Beowulf] Dual head or service node related question ... In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0E7D0152@milexchmb1.mil.tagmclarengroup.com> References: <816343207.8941431259894079516.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> <68A57CCFD4005646957BD2D18E60667B0E7D0152@milexchmb1.mil.tagmclarengroup.com> Message-ID: <8A56A1CE-7199-4EB9-BC3B-76BCF5ABB8D3@staff.uni-marburg.de> Hi, Am 04.12.2009 um 10:24 schrieb Hearns, John: > What is viewed as the best practice (or what are people doing) on > > something like an SGI ICE system with multiple service or head nodes? > > Does one service node generally assume the same role as the > > head node above (serving NFS, logins, and running services like > > PBS pro)? Or ... if NFS is used, is it perhaps served from another > > service node and mounted both on the login node and the compute > > nodes? > I don't know for the original system you mentioned. We use SGE (not PBSpro) and I prefer putting it's qmaster also on the fileserver (the additional load by the fileserver is easier to predict than the varying work of interactive users). Then you can have as many login/ submission machines as you like or need - there is no daemon running at all on them (though it might be different for PBSpro). The submission machines just need read access to /usr/sge or whereever it's installed to source the settings file and have access to the commands. Nevertheless it could be installed w/o NFS access at all - even the nodes could spare NFS, but you would lose some fucntionality and need some kind of file-staging for the jobs files. SGE's options regarding NFS are explained here: http:// gridengine.sunsource.net/howto/nfsreduce.html The options having just local spool directories fits my needs best. Maybe PBSpro has similar possibilities. How is PBSpro doing its spooling - do they have some kind of database like SGE? Is anyone putting the qmaster(s) in separate virtual machine(s) on the file server for failover - I got this idea recently? -- Reuti > > > Two service nodes which act as login/batch submission nodes. > > PBSpro configured to fail over between them (ie one is the PBS > primary server). > > Separate server for storage ? SGI connect these storage servers via > the Infiniband fabric, > > and use multiple Infiniband ports to spread the load ? 
you can > easily configure this at cluster install time, > > ie. every nth node connects to a different Infiniband port on the > storage server. > From christiansuhendra at gmail.com Fri Dec 4 06:47:37 2009 From: christiansuhendra at gmail.com (christian suhendra) Date: Fri, 4 Dec 2009 22:47:37 +0800 Subject: [Beowulf] mpirun command Message-ID: hello guys... im using mpich-1.2.7p1 installed on my PC but when i run mpirun i've got this error : suhendra18 at cluster2:/mirror/mpich-1.2.7p1/examples$ mpirun -np 1 cpi ----------------------------------------------------------------------------- It seems that there is no lamd running on the host cluster2. This indicates that the LAM/MPI runtime environment is not operating. The LAM/MPI runtime environment is necessary for MPI programs to run (the MPI program tired to invoke the "MPI_Init" function). Please run the "lamboot" command the start the LAM/MPI runtime environment. See the LAM/MPI documentation for how to invoke "lamboot" across multiple machines. ----------------------------------------------------------------------------- did my mpich switch to LAM/MPI?? so what should i do?? regards christian -------------- next part -------------- An HTML attachment was scrubbed... URL: From gus at ldeo.columbia.edu Fri Dec 4 13:04:51 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 04 Dec 2009 16:04:51 -0500 Subject: [Beowulf] mpirun command In-Reply-To: References: Message-ID: <4B197973.5000604@ldeo.columbia.edu> Hi Christian Your default mpirun seems to be the old LAM MPI. Do "which mpirun", "mpirun --showme". You can use the full path name to your MPICH mpirun. You should also use the full path name to MPICH mpicc to compile cpi.c, for compatibility. It is likely that both are somewhere in your /mirror/mpich-1.2.7p1/ directory tree. I would guess /mirror/mpich-1.2.7p1/mpicc and /mirror/mpich-1.2.7p1/mpirun, but you need to check. Well, MPICH-1 is also old, and often problematic. Better upgrade from MPICH1 to MPICH2 and/or from LAM/MPI to OpenMPI, which are open source and relatively easy to build using the free Gnu compilers (gcc, gfortran): http://www.mcs.anl.gov/research/projects/mpich2/ http://www.open-mpi.org/ You may also take a look at their mailing lists, for specific help w.r.t. MPI. If you are on Linux, there are also MPICH2 RPMs for some Linux distributions: http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads My $0.02 Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- christian suhendra wrote: > hello guys... > im using mpich-1.2.7p1 installed on my PC > but when i run mpirun > i've got this error : > suhendra18 at cluster2:/mirror/mpich-1.2.7p1/examples$ mpirun -np 1 cpi > ----------------------------------------------------------------------------- > > It seems that there is no lamd running on the host cluster2. > > This indicates that the LAM/MPI runtime environment is not operating. > The LAM/MPI runtime environment is necessary for MPI programs to run > (the MPI program tired to invoke the "MPI_Init" function). > > Please run the "lamboot" command the start the LAM/MPI runtime > environment. See the LAM/MPI documentation for how to invoke > "lamboot" across multiple machines. > ----------------------------------------------------------------------------- > did my mpich switch to LAM/MPI?? 
so what should i do?? > > > regards > christian > > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From djholm at fnal.gov Fri Dec 4 13:22:09 2009 From: djholm at fnal.gov (Don Holmgren) Date: Fri, 04 Dec 2009 15:22:09 -0600 (CST) Subject: [Beowulf] mpirun command In-Reply-To: References: Message-ID: Your PC is likely running a Linux distribution that has LAM installed by default, and "mpirun" is in your path ahead of mpich's mpirun. You can confirm this with `which mpirun`. Try /mirror/mpich-1.2.7p1/bin/mpirun -np 1 cpi instead. Or make sure that /mirror/mpich-1.2.7p1/bin is at the front of your path. Don On Fri, 4 Dec 2009, christian suhendra wrote: > hello guys... > im using mpich-1.2.7p1 installed on my PC > but when i run mpirun > i've got this error : > suhendra18 at cluster2:/mirror/mpich-1.2.7p1/examples$ mpirun -np 1 cpi > ----------------------------------------------------------------------------- > > It seems that there is no lamd running on the host cluster2. > > This indicates that the LAM/MPI runtime environment is not operating. > The LAM/MPI runtime environment is necessary for MPI programs to run > (the MPI program tired to invoke the "MPI_Init" function). > > Please run the "lamboot" command the start the LAM/MPI runtime > environment. See the LAM/MPI documentation for how to invoke > "lamboot" across multiple machines. > ----------------------------------------------------------------------------- > did my mpich switch to LAM/MPI?? so what should i do?? > > > regards > christian From bcostescu at gmail.com Fri Dec 4 13:23:06 2009 From: bcostescu at gmail.com (Bogdan Costescu) Date: Fri, 4 Dec 2009 22:23:06 +0100 Subject: [Beowulf] Re: cluster fails to boot with managed switch, but 5-port switch works OK In-Reply-To: <481727C4-110B-4D18-8361-DA12469D70FF@Keller.net> References: <200912031846.nB3Ikesg029202@bluewest.scyld.com> <481727C4-110B-4D18-8361-DA12469D70FF@Keller.net> Message-ID: On Thu, Dec 3, 2009 at 9:17 PM, Greg Keller wrote: > Essentially, once the port > has a physical link light it may take a while before spanning tree allows > traffic to actually flow through the port. ?Longer than a typical timeout. The time taken to activate the link is around 60s, but I've been told that it can be even higher. I've seen many times laptops randomly not getting addresses via DHCP because the DHCP timeout and the STP time on a Cisco switch were both around 60s - makes for very frustrating network diagnostic. > ?When loading/reloading the driver there seems to be an instantaneous drop > of the link that forces a new delay cycle. Most likely the PXE stack doesn't reset the link; the link is up soon after the computer is powered on so, by the time the POST has finished, the link is active. Again most likely, the Linux driver does a link reset as part of the initialization; I remember that the 3c59x driver was changed ~6years ago to not do this anymore (at Don Becker's suggestion, IIRC) and it would allow the established link to remain active, making DHCP succeed all the time. 
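For what it's worth, the usual remedy for that spanning-tree delay is to mark host-facing ports as edge ports so they go straight to forwarding. A sketch in Cisco-IOS-style syntax -- the interface range is hypothetical, and other vendors have equivalent "edge port" or "fast link" settings:

  ! compute-node ports: skip the ~30-60 s listening/learning delay
  interface range GigabitEthernet0/1 - 24
   spanning-tree portfast
   ! optional: disable the port if another switch is ever plugged in
   spanning-tree bpduguard enable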
Bogdan From Greg at keller.net Fri Dec 4 14:36:00 2009 From: Greg at keller.net (Greg Keller) Date: Fri, 4 Dec 2009 16:36:00 -0600 Subject: [Beowulf] Re: cluster fails to boot with managed switch, but 5-port switch works OK In-Reply-To: References: <200912031846.nB3Ikesg029202@bluewest.scyld.com> <481727C4-110B-4D18-8361-DA12469D70FF@Keller.net> Message-ID: <77FC36D1-4E0C-49DA-BA31-8ECB88318434@keller.net> On Dec 4, 2009, at 3:23 PM, Bogdan Costescu wrote: >> When loading/reloading the driver there seems to be an >> instantaneous drop >> of the link that forces a new delay cycle. > Most likely the PXE stack doesn't reset the link; the link is up soon > after the computer is powered on so, by the time the POST has > finished, the link is active. Again most likely, the Linux driver does > a link reset as part of the initialization; I remember that the 3c59x > driver was changed ~6years ago to not do this anymore (at Don Becker's > suggestion, IIRC) and it would allow the established link to remain > active, making DHCP succeed all the time. That's true for some ports. Most IPMI (duplex'd) ports seem to come up at 100Mb and then switch to 1Gb at some point in the Post. Normally PXE seems to work but later in the boot it fails to get a DHCP the address, so I suspect you are correct for many cases where the System brings up the 1Gb Link early in the post before the PXE. I like the fix you mention where 3com based cards don't reset the link. Most Lan On Motherboards seem to be Broadcom or Intel e1000 based in my world... but it would be kuel if "they" figured out the same magic for those drivers. Ultimately I think it's a workaround for overly cautious defaults on switches, but some times it's easier to drive around the pothole than fix it. Cheers! Greg From christiansuhendra at gmail.com Fri Dec 4 15:57:15 2009 From: christiansuhendra at gmail.com (christian suhendra) Date: Fri, 4 Dec 2009 11:57:15 -1200 Subject: [Beowulf] Re: mpicc error In-Reply-To: References: Message-ID: 2009/12/4 christian suhendra > hello..may ask your favor?? > when i run mpicc i've got error..(error attached) > i need your help..would you check my listing program if its true or false > i wanna make matrix operation and the input takes from file txt,, i'm using > C in my matrix and use canon algorithm.. > but i'm not sure of what i've done.. > please..my deadline almost 3 week.. > thank you for your advice..God Bless.. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Michael.Frese at NumerEx-LLC.com Sun Dec 6 12:12:18 2009 From: Michael.Frese at NumerEx-LLC.com (Michael H. Frese) Date: Sun, 06 Dec 2009 13:12:18 -0700 Subject: [Beowulf] Cluster Communications Test App Message-ID: <6.2.5.6.2.20091206102842.0c040d70@NumerEx-LLC.com> We'd like to test our cluster communications hardware and software -- MPI/TCP over GigE -- with something besides our favorite -- and failing -- parallel application. Is there some sort of parallel NetPipe out there that sends and receives scads of messages and keeps track of integrity and performance results? 
Mike From cap at nsc.liu.se Mon Dec 7 02:38:40 2009 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Mon, 7 Dec 2009 11:38:40 +0100 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <4B17DBDC.5060106@ias.edu> References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> <68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com> <4B17DBDC.5060106@ias.edu> Message-ID: <200912071138.44455.cap@nsc.liu.se> On Thursday 03 December 2009, Prentice Bisbal wrote: > Hearns, John wrote: > > I was wondering if anyone has actual experience with running more than > > one node from a single power supply. Even just two boards on one PSU > > would be nice. We will be using barely 200W per node for 50 nodes and it > > just seems like a big waste to buy 50 power supply units. I have read > > the old posts but did not see any reports of success. > > > > Look at the Supermicro twin systems, they have two motherboards in 1U or > > four motherboards in 2U. > > > > I believe HP have similar. > > What I learned at SC09: > > HP does make twin nodes similar to SuperMicro, but the HP nodes are not > hot-swappable, Almost true, the DL1000 is not hot-swapable, the SL6000 kind of is (you can pull a 1U sub-unit which can be either one or two nodes). > if a single node goes down, you need to take down all the > nodes in the chassis before you can remove the dead node. Not very > practical. The SuperMicro nodes are definitely hot-swappable. Almost true, Supermicro has (or at least had) both hot-swap and non-hot-swap. /Peter -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. URL: From brockp at umich.edu Mon Dec 7 07:05:51 2009 From: brockp at umich.edu (Brock Palen) Date: Mon, 7 Dec 2009 10:05:51 -0500 Subject: [Beowulf] MPI-3 Podcast, Message-ID: <5D32BF5F-FB5D-4D6D-BCED-A8F887F9F01A@umich.edu> Of interest for members of the list is our recent show about MPI-3 and the MPI standards process: http://www.rce-cast.com/index.php/Podcast/rce-22-mpi-3-forum.html RSS, and iTunes subscribe are available on that page, Brock Palen and Jeff Squyres speak with Dr. Bill Gropp of University of Illinois Urbana-Champaign Urbana and Dr. Richard Graham of ORNL (Oak Ridge National Lab) on the MPI Forum, MPI-2.2 and the upcoming MPI-3 standards for parallel programming with MPI. William Gropp is the Paul and Cynthia Saylor Professor in the Department of Computer Science and Deputy Directory for Research for the Institute of Advanced Computing Applications and Technologies at the University of Illinois in Urbana-Champaign. He received his Ph.D. in Computer Science from Stanford University in 1982 and worked at Yale University and Argonne National Laboratory. His research interests are in parallel computing, software for scientific computing, and numerical methods for partial differential equations, and he is well known for the MPICH2 and PETSc libraries. Richard Graham has been at ONRL since Jan, 2007 and is the Group Leader for the Application Performance Tools group in the Computer Science and Mathematics division at ONRL, and is a Distinguished member of the Research Staff. Prior to joining ORNL he spent eight years at ORNL serving in a range of technical and managerial roles, leaving as the acting group leader for the Advanced Computing Laboratory. He is currently chairman of the MPI Forum, and is leading the MPI-3 effort. 
He led the LA-MPI development effort, and is one of three founders of the Open MPI project. Dr. Graham received his PhD in Theoretical Chemistry from Texas A&M University in 1990, and a BS in Chemistry from Seattle Pacific University in 1983. Brock Palen www.umich.edu/~brockp Center for Advanced Computing brockp at umich.edu (734)936-1985 From csamuel at vpac.org Mon Dec 7 16:51:27 2009 From: csamuel at vpac.org (Chris Samuel) Date: Tue, 8 Dec 2009 11:51:27 +1100 (EST) Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <20091204024129.GA25462@bx9.net> Message-ID: <767373014.7554451260233487577.JavaMail.root@mail.vpac.org> ----- "Greg Lindahl" wrote: > It's not inevitable that the policy be that 3 > month jobs are allowed. Our users own our organisation (we are a partnership of the 8 universities in our state), generally they get what they ask for (within budget constraints). :-) cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From csamuel at vpac.org Mon Dec 7 16:55:09 2009 From: csamuel at vpac.org (Chris Samuel) Date: Tue, 8 Dec 2009 11:55:09 +1100 (EST) Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <1671342945.7554501260233643669.JavaMail.root@mail.vpac.org> Message-ID: <954667269.7554531260233709381.JavaMail.root@mail.vpac.org> ----- "H?kon Bugge" wrote: > What we did in Platform (Scali) MPI, was to drain > the HPC interconnect, then close it down. The problem > was then reduced to checkpoint (e.g. using BLCR) > N processes. I suspect this is what Open-MPI does too, but I don't know if the VM based systems can migrate such jobs without this application layer support. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From prentice at ias.edu Tue Dec 8 07:50:28 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 08 Dec 2009 10:50:28 -0500 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <4B18719C.5080008@cse.ucdavis.edu> References: <20091203190200.GA647@bx9.net> <588682529.7331271259891827276.JavaMail.root@mail.vpac.org> <20091204021122.GA17286@bx9.net> <4B18719C.5080008@cse.ucdavis.edu> Message-ID: <4B1E75C4.6090707@ias.edu> Prentice Bisbal Linux Software Support Specialist/System Administrator School of Natural Sciences Institute for Advanced Study Princeton, NJ Bill Broadley wrote: > Greg Lindahl wrote: >> On Fri, Dec 04, 2009 at 12:57:07PM +1100, Chris Samuel wrote: >> >>> If you've got a job running on there for a month >>> or two then there's a fairly high opportunity cost >>> involved. >> That kind of policy has a fairly high opportunity cost, even before >> you factor in linked nodes. E.g. you see a system disk going bad, but >> the user will lose all their output unless the job runs for 4 more >> weeks... > > Indeed. You'd hope that such long running jobs would checkpoint. You'd hope that. Most of my current clusters users are scientific researchers in academia, not computer scientists. While some are extremely computer savvy, others have learned just enough about programming to do their calculations. Expecting the latter to write code with checkpointing is unrealistic, and working in academia, I can't force them to. 
Which is why taking down 4 nodes instead of just one is less than ideal. -- Prentice From jbardin at bu.edu Tue Dec 8 09:22:27 2009 From: jbardin at bu.edu (james bardin) Date: Tue, 8 Dec 2009 12:22:27 -0500 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <4B1E75C4.6090707@ias.edu> References: <20091203190200.GA647@bx9.net> <588682529.7331271259891827276.JavaMail.root@mail.vpac.org> <20091204021122.GA17286@bx9.net> <4B18719C.5080008@cse.ucdavis.edu> <4B1E75C4.6090707@ias.edu> Message-ID: On Tue, Dec 8, 2009 at 10:50 AM, Prentice Bisbal wrote: > You'd hope that. Most of my current clusters users are scientific > researchers in academia, not computer scientists. While some are > extremely computer savvy, others have learned just enough about > programming to do their calculations. Expecting the latter to write code > with checkpointing is unrealistic, and working in academia, I can't > force them to. Which is why taking down 4 nodes instead of just one is > less than ideal. > I find it's still advantageous to push them to learn it. A researcher working with a tight deadline for a grant will often see the light when a hardware failure loses them a month or more of data processing. It really is in their own best interests to learn about their tools. From h-bugge at online.no Tue Dec 8 09:26:50 2009 From: h-bugge at online.no (=?ISO-8859-1?Q?H=E5kon_Bugge?=) Date: Tue, 8 Dec 2009 18:26:50 +0100 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: <954667269.7554531260233709381.JavaMail.root@mail.vpac.org> References: <954667269.7554531260233709381.JavaMail.root@mail.vpac.org> Message-ID: On Dec 8, 2009, at 1:55 , Chris Samuel wrote: > I suspect this is what Open-MPI does too, but I > don't know if the VM based systems can migrate > such jobs without this application layer support. Anyone who knows about migration of VMs where the MPI processes use ibverbs (i.e. user space access to the HCAs)? H?kon From james.p.lux at jpl.nasa.gov Tue Dec 8 09:56:49 2009 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 8 Dec 2009 09:56:49 -0800 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: Message-ID: On 12/8/09 9:22 AM, "james bardin" wrote: > On Tue, Dec 8, 2009 at 10:50 AM, Prentice Bisbal wrote: > >> You'd hope that. Most of my current clusters users are scientific >> researchers in academia, not computer scientists. While some are >> extremely computer savvy, others have learned just enough about >> programming to do their calculations. Expecting the latter to write code >> with checkpointing is unrealistic, and working in academia, I can't >> force them to. Which is why taking down 4 nodes instead of just one is >> less than ideal. >> > > I find it's still advantageous to push them to learn it. A researcher > working with a tight deadline for a grant will often see the light > when a hardware failure loses them a month or more of data processing. > It really is in their own best interests to learn about their tools. What about some form of "image checkpoint" like "hibernation"... Should be application unaware, just snapshots memory. 
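One application-unaware route along the lines Jim describes is to snapshot an entire virtual machine rather than the process itself. A sketch using libvirt's virsh (the domain name and paths are hypothetical), with the caveat already raised in this thread that guests using pinned DMA or pass-through NICs, e.g. native IB, generally cannot be saved this way:

  # pause the guest and dump its memory and device state to a file
  virsh save compute-vm07 /ckpt/compute-vm07.img

  # later, possibly on another host that can see the same storage,
  # resume the guest exactly where it left off
  virsh restore /ckpt/compute-vm07.img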
From prentice at ias.edu Tue Dec 8 10:52:28 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 08 Dec 2009 13:52:28 -0500 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster In-Reply-To: References: Message-ID: <4B1EA06C.3070307@ias.edu> Lux, Jim (337C) wrote: > > > On 12/8/09 9:22 AM, "james bardin" wrote: > >> On Tue, Dec 8, 2009 at 10:50 AM, Prentice Bisbal wrote: >> >>> You'd hope that. Most of my current clusters users are scientific >>> researchers in academia, not computer scientists. While some are >>> extremely computer savvy, others have learned just enough about >>> programming to do their calculations. Expecting the latter to write code >>> with checkpointing is unrealistic, and working in academia, I can't >>> force them to. Which is why taking down 4 nodes instead of just one is >>> less than ideal. >>> >> I find it's still advantageous to push them to learn it. A researcher >> working with a tight deadline for a grant will often see the light >> when a hardware failure loses them a month or more of data processing. >> It really is in their own best interests to learn about their tools. > > > What about some form of "image checkpoint" like "hibernation"... Should be > application unaware, just snapshots memory. That's fine when the problem is on one system and there's only one system image to worry about check pointing once you start spreading the job around to multiple systems, things get complicated, especially if your node is heterogeneous w.r.t hardware. I fear we're straying off the topic of the original post... -- Prentice From jeff.johnson at aeoncomputing.com Mon Dec 7 12:11:44 2009 From: jeff.johnson at aeoncomputing.com (Jeff Johnson) Date: Mon, 07 Dec 2009 12:11:44 -0800 Subject: [Beowulf] Re: Beowulf Digest, Vol 70, Issue 17 In-Reply-To: <200912072000.nB7K0BYf028155@bluewest.scyld.com> References: <200912072000.nB7K0BYf028155@bluewest.scyld.com> Message-ID: <4B1D6180.2070604@aeoncomputing.com> On 12/7/09 12:00 PM, beowulf-request at beowulf.org wrote: > Date: Sun, 06 Dec 2009 13:12:18 -0700 > From: "Michael H. Frese" > Subject: [Beowulf] Cluster Communications Test App > > [...] > Is there some sort of parallel NetPipe out there that sends and > receives scads of messages and keeps track of integrity and > performance results? > What used to be Pallas, now Intel's IMB (v3.2) is what you are looking for: http://software.intel.com/en-us/articles/intel-mpi-benchmarks/ Compile against your MPI environment and run. Very detailed instructions are provided. I would suggest the Alltoall test method within IMB, it is a good way to stress your topology. --Jeff -- ------------------------------ Jeff Johnson Manager Aeon Computing jeff.johnson at aeoncomputing.com www.aeoncomputing.com t: 858-412-3810 f: 858-412-3845 m: 619-204-9061 4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117 From jellogum at gmail.com Wed Dec 9 10:19:30 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Wed, 9 Dec 2009 10:19:30 -0800 Subject: [Beowulf] Sony PS3, random news Message-ID: DoD buys PS3 for HPC. CNN brief at http://scitech.blogs.cnn.com/2009/12/09/military-purchases-2200-ps3s/ Clip from report: "Though a single 3.2 GHz cell processor can deliver over 200 GFLOPS, whereas the Sony PS3 configuration delivers approximately 150 GFLOPS, the approximately tenfold cost difference per GFLOP makes the Sony PS3 the only viable technology for HPC applications." 
-- Jeremy Baker PO 297 Johnson, VT 05656 -------------- next part -------------- An HTML attachment was scrubbed... URL: From amjad11 at gmail.com Wed Dec 9 17:22:44 2009 From: amjad11 at gmail.com (amjad ali) Date: Wed, 9 Dec 2009 20:22:44 -0500 Subject: [Beowulf] scalability Message-ID: <428810f20912091722lca5d633g45b3feee7543bbf@mail.gmail.com> Hi all, I have, with my group, a small cluster of about 16 nodes (each one with single socket Xeon 3085 or 3110; And I face problem of poor scalability. Its network is quite ordinary GiGE (perhaps DLink DGS-1024D 24-Port 10/100/1000), store and forward switch, of price about $250 only. ftp://ftp10.dlink.com/pdfs/products/DGS-1024D/DGS-1024D_ds.pdf How should I work on that for better scalability? What could be better affordable options of fast switches? (Myrinet, Infiniband are quite costly). When buying a switch what should we see in it? What latency? Thank you very much. -------------- next part -------------- An HTML attachment was scrubbed... URL: From amjad11 at gmail.com Wed Dec 9 17:46:25 2009 From: amjad11 at gmail.com (amjad ali) Date: Wed, 9 Dec 2009 20:46:25 -0500 Subject: [Beowulf] cluster sharing Message-ID: <428810f20912091746k74184ef4n125c2b5ada94606a@mail.gmail.com> Hi all, I am usually running my parallel jobs on the university cluster of several hundred nodes (accessible to all). So often I observe very different total MPI_Wtime value when I run my program (even of same problem size) on different occasion. How should I make a reasonable performance measurement in such a case? Is it fine to get speedup/measurements on such a shared cluster? Thanks. -------------- next part -------------- An HTML attachment was scrubbed... URL: From gus at ldeo.columbia.edu Wed Dec 9 18:11:44 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed, 09 Dec 2009 21:11:44 -0500 Subject: [Beowulf] scalability In-Reply-To: <428810f20912091722lca5d633g45b3feee7543bbf@mail.gmail.com> References: <428810f20912091722lca5d633g45b3feee7543bbf@mail.gmail.com> Message-ID: <4B2058E0.9050602@ldeo.columbia.edu> Hi Amjad There is relatively inexpensive Infiniband SDR: http://www.colfaxdirect.com/store/pc/showsearchresults.asp?customfield=5&SearchValues=65 http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=12 http://www.colfaxdirect.com/store/pc/viewCategories.asp?SFID=12&SFNAME=Brand&SFVID=50&SFVALUE=Mellanox&SFCount=0&page=0&pageStyle=m&idcategory=2&VS12=0&VS9=0&VS10=0&VS4=0&VS3=0&VS11=0 Not the latest greatest, but faster than Gigabit Ethernet. A better Gigabit Ethernet switch may help also, but I wonder if the impact will be as big as expected. However, are you sure the scalability problems you see are due to poor network connection? Could it be perhaps related to the code itself, or maybe to the processors' memory bandwidth? You could test if it is network running the program inside a node (say on 4 cores) and across 4 nodes with one core in use on each node, or other combinations (2 cores on 2 nodes). You could have an indication of the processors' scalability by timing program runs inside a single node using 1,2,3,4 cores. My experience with dual socket dual core Xeons vs. dual socket dual core Opterons, with the type of code we run here (ocean,atmosphere,climate models, which are not totally far from your CFD) is that Opterons scale close to linear, but Xeons get nearly stuck in terms of scaling when there are more than 2 processes (3 or 4) running in a single node. My two cents. 
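A concrete way to run the comparison suggested above, sketched with Open MPI's mpirun -- hostnames, slot counts and the binary name are placeholders, and other MPIs have equivalent machinefile mechanisms:

  # hostfile describing how many cores (slots) each node offers
  cat > hosts.txt <<EOF
  node01 slots=4
  node02 slots=4
  node03 slots=4
  node04 slots=4
  EOF

  # 4 ranks packed onto the first node: exercises memory bandwidth, no network
  mpirun -np 4 --hostfile hosts.txt -byslot ./my_cfd_code

  # 4 ranks spread one per node: exercises the GigE network instead
  mpirun -np 4 --hostfile hosts.txt -bynode ./my_cfd_code

Comparing wall-clock times for the two runs helps separate the memory-bandwidth effect from the interconnect effect.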
Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- amjad ali wrote: > Hi all, > > I have, with my group, a small cluster of about 16 nodes (each one with > single socket Xeon 3085 or 3110; And I face problem of poor scalability. > Its network is quite ordinary GiGE (perhaps DLink DGS-1024D 24-Port > 10/100/1000), store and forward switch, of price about $250 only. > ftp://ftp10.dlink.com/pdfs/products/DGS-1024D/DGS-1024D_ds.pdf > > How should I work on that for better scalability? > > What could be better affordable options of fast switches? (Myrinet, > Infiniband are quite costly). > > When buying a switch what should we see in it? What latency? > > > Thank you very much. > > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From amjad11 at gmail.com Wed Dec 9 19:14:23 2009 From: amjad11 at gmail.com (amjad ali) Date: Wed, 9 Dec 2009 22:14:23 -0500 Subject: [Beowulf] scalability In-Reply-To: <4B2058E0.9050602@ldeo.columbia.edu> References: <428810f20912091722lca5d633g45b3feee7543bbf@mail.gmail.com> <4B2058E0.9050602@ldeo.columbia.edu> Message-ID: <428810f20912091914r4955515wa53f12611a848af6@mail.gmail.com> Hi Gus, your nice reply; as usual. I ran my code on single socket xeon node having two cores; It ran linear 97+% efficient. Then I ran my code on single socket xeon node having four cores ( Xeon 3220 -which really not a good quad core) I got the efficiency of around 85%. But on four single socket nodes I ran 4 processes (1 process on each node); I got the efficiency of around 62%. Yes, CFD codes are memory bandwidth bound usually. Thank you very much. run with 2core On Wed, Dec 9, 2009 at 9:11 PM, Gus Correa wrote: > Hi Amjad > > There is relatively inexpensive Infiniband SDR: > > http://www.colfaxdirect.com/store/pc/showsearchresults.asp?customfield=5&SearchValues=65 > http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=12 > > http://www.colfaxdirect.com/store/pc/viewCategories.asp?SFID=12&SFNAME=Brand&SFVID=50&SFVALUE=Mellanox&SFCount=0&page=0&pageStyle=m&idcategory=2&VS12=0&VS9=0&VS10=0&VS4=0&VS3=0&VS11=0 > Not the latest greatest, but faster than Gigabit Ethernet. > A better Gigabit Ethernet switch may help also, > but I wonder if the impact will be as big as expected. > > However, are you sure the scalability problems you see are > due to poor network connection? > Could it be perhaps related to the code itself, > or maybe to the processors' memory bandwidth? > > You could test if it is network running the program inside a node > (say on 4 cores) and across 4 nodes with > one core in use on each node, or other combinations > (2 cores on 2 nodes). > > You could have an indication of the processors' scalability > by timing program runs inside a single node using 1,2,3,4 cores. > > My experience with dual socket dual core Xeons vs. 
> dual socket dual core Opterons, > with the type of code we run here (ocean,atmosphere,climate models, > which are not totally far from your CFD) is that Opterons > scale close to linear, but Xeons get nearly stuck in terms of scaling > when there are more than 2 processes (3 or 4) running in a single node. > > My two cents. > Gus Correa > --------------------------------------------------------------------- > Gustavo Correa > Lamont-Doherty Earth Observatory - Columbia University > Palisades, NY, 10964-8000 - USA > --------------------------------------------------------------------- > > > amjad ali wrote: > >> Hi all, >> >> I have, with my group, a small cluster of about 16 nodes (each one with >> single socket Xeon 3085 or 3110; And I face problem of poor scalability. Its >> network is quite ordinary GiGE (perhaps DLink DGS-1024D 24-Port >> 10/100/1000), store and forward switch, of price about $250 only. >> ftp://ftp10.dlink.com/pdfs/products/DGS-1024D/DGS-1024D_ds.pdf >> >> How should I work on that for better scalability? >> >> What could be better affordable options of fast switches? (Myrinet, >> Infiniband are quite costly). >> >> When buying a switch what should we see in it? What latency? >> >> >> Thank you very much. >> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From toon.knapen at gmail.com Wed Dec 9 23:20:34 2009 From: toon.knapen at gmail.com (Toon Knapen) Date: Thu, 10 Dec 2009 08:20:34 +0100 Subject: [Beowulf] cluster sharing In-Reply-To: <428810f20912091746k74184ef4n125c2b5ada94606a@mail.gmail.com> References: <428810f20912091746k74184ef4n125c2b5ada94606a@mail.gmail.com> Message-ID: Hello Amjad, You need exclusive access to (part of) the cluster. Generally a batch scheduler is available that will schedule the jobs of multiple users on generally a FIFO basis, guaranteeing exclusive access and thus optimal performance. On Thu, Dec 10, 2009 at 2:46 AM, amjad ali wrote: > Hi all, > > I am usually running my parallel jobs on the university cluster of several > hundred nodes (accessible to all). So often I observe very different total > MPI_Wtime value when I run my program (even of same problem size) on > different occasion. > > How should I make a reasonable performance measurement in such a case? Is > it fine to get speedup/measurements on such a shared cluster? > > > Thanks. > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From deadline at eadline.org Thu Dec 10 05:45:30 2009 From: deadline at eadline.org (Douglas Eadline) Date: Thu, 10 Dec 2009 08:45:30 -0500 (EST) Subject: [Beowulf] scalability In-Reply-To: <428810f20912091722lca5d633g45b3feee7543bbf@mail.gmail.com> References: <428810f20912091722lca5d633g45b3feee7543bbf@mail.gmail.com> Message-ID: <51346.192.168.1.1.1260452730.squirrel@mail.eadline.org> The performance of GigE can vary widely based on several issues: - chipset - driver - driver settings (interrupt coalescing etc.) - switch performance under load - switch manufacturer A well tuned GigE network is never as good as IB or Myrinet, but it can work well for some codes. You can also try using Open-MX instead of TCP for MPI communications. -- Doug > Hi all, > > I have, with my group, a small cluster of about 16 nodes (each one with > single socket Xeon 3085 or 3110; And I face problem of poor scalability. > Its > network is quite ordinary GiGE (perhaps DLink DGS-1024D 24-Port > 10/100/1000), store and forward switch, of price about $250 only. > ftp://ftp10.dlink.com/pdfs/products/DGS-1024D/DGS-1024D_ds.pdf > > How should I work on that for better scalability? > > What could be better affordable options of fast switches? (Myrinet, > Infiniband are quite costly). > > When buying a switch what should we see in it? What latency? > > > Thank you very much. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Doug From gus at ldeo.columbia.edu Thu Dec 10 10:43:07 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 10 Dec 2009 13:43:07 -0500 Subject: [Beowulf] scalability In-Reply-To: <428810f20912091914r4955515wa53f12611a848af6@mail.gmail.com> References: <428810f20912091722lca5d633g45b3feee7543bbf@mail.gmail.com> <4B2058E0.9050602@ldeo.columbia.edu> <428810f20912091914r4955515wa53f12611a848af6@mail.gmail.com> Message-ID: <4B21413B.7020000@ldeo.columbia.edu> Hi Amjad amjad ali wrote: > Hi Gus, > your nice reply; as usual. > > I ran my code on single socket xeon node having two cores; It ran linear > 97+% efficient. > > Then I ran my code on single socket xeon node having four cores ( Xeon > 3220 -which really not a good quad core) I got the efficiency of around 85%. > > But on four single socket nodes I ran 4 processes (1 process on each > node); I got the efficiency of around 62%. > This is about the same number I've got for an atmospheric model in a dual-socket dual-core Xeon computer. Somehow the memory path/bus on these systems is not very efficient, and saturates when more than two processes do intensive work concurrently. A similar computer configuration with dual- dual- AMD Opterons performed significantly better on the same atmospheric code (efficiency close to 90%). I was told that some people used to run two processes only on dual-socket dual-core Xeon nodes , leaving the other two cores idle. Although it is an apparent waste, the argument was that it paid off in terms of overall efficiency. Have you tried to run your programs this way on your cluster? Say, with one process only per node, and N nodes, then with two processes per node, and N/2 nodes, then with four processes per node, and N/4 nodes. This may tell what is optimal for the hardware you have. With OpenMPI you can use the "mpiexec" flags "-bynode" and "-byslot" to control this behavior. "man mpiexec" is your friend! 
:) > Yes, CFD codes are memory bandwidth bound usually. > Indeed, and so is most of our atmosphere/ocean/climate codes, which has a lot of CFD, but also radiative processes, mixing, thermodynamics, etc. However, most of our models use fixed grids, and I suppose some of your aerodynamics may use adaptive meshes, right? I guess you are doing aerodynamics, right? > Thank you very much. > My pleasure. I hope this helps, Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- > > > > > run with 2core > On Wed, Dec 9, 2009 at 9:11 PM, Gus Correa > wrote: > > Hi Amjad > > There is relatively inexpensive Infiniband SDR: > http://www.colfaxdirect.com/store/pc/showsearchresults.asp?customfield=5&SearchValues=65 > > http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=12 > http://www.colfaxdirect.com/store/pc/viewCategories.asp?SFID=12&SFNAME=Brand&SFVID=50&SFVALUE=Mellanox&SFCount=0&page=0&pageStyle=m&idcategory=2&VS12=0&VS9=0&VS10=0&VS4=0&VS3=0&VS11=0 > > Not the latest greatest, but faster than Gigabit Ethernet. > A better Gigabit Ethernet switch may help also, > but I wonder if the impact will be as big as expected. > > However, are you sure the scalability problems you see are > due to poor network connection? > Could it be perhaps related to the code itself, > or maybe to the processors' memory bandwidth? > > You could test if it is network running the program inside a node > (say on 4 cores) and across 4 nodes with > one core in use on each node, or other combinations > (2 cores on 2 nodes). > > You could have an indication of the processors' scalability > by timing program runs inside a single node using 1,2,3,4 cores. > > My experience with dual socket dual core Xeons vs. > dual socket dual core Opterons, > with the type of code we run here (ocean,atmosphere,climate models, > which are not totally far from your CFD) is that Opterons > scale close to linear, but Xeons get nearly stuck in terms of scaling > when there are more than 2 processes (3 or 4) running in a single node. > > My two cents. > Gus Correa > --------------------------------------------------------------------- > Gustavo Correa > Lamont-Doherty Earth Observatory - Columbia University > Palisades, NY, 10964-8000 - USA > --------------------------------------------------------------------- > > > amjad ali wrote: > > Hi all, > > I have, with my group, a small cluster of about 16 nodes (each > one with single socket Xeon 3085 or 3110; And I face problem of > poor scalability. Its network is quite ordinary GiGE (perhaps > DLink DGS-1024D 24-Port 10/100/1000), store and forward switch, > of price about $250 only. > ftp://ftp10.dlink.com/pdfs/products/DGS-1024D/DGS-1024D_ds.pdf > > How should I work on that for better scalability? > > What could be better affordable options of fast switches? > (Myrinet, Infiniband are quite costly). > > When buying a switch what should we see in it? What latency? > > > Thank you very much. 
> > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > From jorg.sassmannshausen at strath.ac.uk Thu Dec 10 06:56:29 2009 From: jorg.sassmannshausen at strath.ac.uk (=?iso-8859-1?q?J=F6rg_Sa=DFmannshausen?=) Date: Thu, 10 Dec 2009 14:56:29 +0000 Subject: [Beowulf] Re: scalability In-Reply-To: <200912101350.nBADnxv1012984@bluewest.scyld.com> References: <200912101350.nBADnxv1012984@bluewest.scyld.com> Message-ID: <200912101456.29738.jorg.sassmannshausen@strath.ac.uk> Hi Doug, I have heard of Open-MX before, do you need special hardware for that? We are currently using one GB network here on the cluster (for everything: NFS, MPI...) and I would like to increase the performance for the parallel codes I am using (NWChem, cp2k, GAMESS) without dishing out too much money for IB or Myrinet as the cluster is too small for them (22 nodes right now). All the best J?rg On Thursday 10 December 2009 13:50:51 beowulf-request at beowulf.org wrote: > Date: Thu, 10 Dec 2009 08:45:30 -0500 (EST) > From: "Douglas Eadline" > Subject: Re: [Beowulf] scalability > To: "amjad ali" > Cc: Beowulf Mailing List > Message-ID: <51346.192.168.1.1.1260452730.squirrel at mail.eadline.org> > Content-Type: text/plain;charset=iso-8859-1 > > > The performance of GigE can vary widely based on several > issues: > > ?- chipset > ?- driver > ?- driver settings (interrupt coalescing etc.) > ?- switch performance under load > ?- switch manufacturer > > A well tuned GigE network is never as good as > IB or Myrinet, but it can work well for some codes. > You can also try using Open-MX instead of TCP > for MPI communications. > > -- > Doug -- ************************************************************* J?rg Sa?mannshausen Research Fellow University of Strathclyde Department of Pure and Applied Chemistry 295 Cathedral St. Glasgow G1 1XL email: jorg.sassmannshausen at strath.ac.uk web: http://sassy.formativ.net Please avoid sending me Word or PowerPoint attachments. See http://www.gnu.org/philosophy/no-word-attachments.html From atchley at myri.com Thu Dec 10 11:47:06 2009 From: atchley at myri.com (Scott Atchley) Date: Thu, 10 Dec 2009 14:47:06 -0500 Subject: [Beowulf] Re: scalability In-Reply-To: <200912101456.29738.jorg.sassmannshausen@strath.ac.uk> References: <200912101350.nBADnxv1012984@bluewest.scyld.com> <200912101456.29738.jorg.sassmannshausen@strath.ac.uk> Message-ID: <4881CFFF-C36E-4AD6-BEDC-300474AD44C9@myri.com> On Dec 10, 2009, at 9:56 AM, J?rg Sa?mannshausen wrote: > I have heard of Open-MX before, do you need special hardware for that? No, any Ethernet driver on Linux. 
http://open-mx.org Scott From deadline at eadline.org Thu Dec 10 11:53:58 2009 From: deadline at eadline.org (Douglas Eadline) Date: Thu, 10 Dec 2009 14:53:58 -0500 (EST) Subject: [Beowulf] Re: scalability In-Reply-To: <200912101456.29738.jorg.sassmannshausen@strath.ac.uk> References: <200912101350.nBADnxv1012984@bluewest.scyld.com> <200912101456.29738.jorg.sassmannshausen@strath.ac.uk> Message-ID: <54300.192.168.1.1.1260474838.squirrel@mail.eadline.org> > Hi Doug, > > I have heard of Open-MX before, do you need special hardware for that? We > are > currently using one GB network here on the cluster (for everything: NFS, > MPI...) and I would like to increase the performance for the parallel > codes I > am using (NWChem, cp2k, GAMESS) without dishing out too much money for IB > or > Myrinet as the cluster is too small for them (22 nodes right now). > > All the best Open-MX will work over any Ethernet connection. If you have a recent kernel, it should work in an incremental fashion, that is, you can install it, build any MPI that supports MX and run it. It will co-exist just fine with TCP/IP traffic. Check the home page: http://open-mx.gforge.inria.fr/ I think the authors read this list. -- Doug > > J?rg > > On Thursday 10 December 2009 13:50:51 beowulf-request at beowulf.org wrote: >> Date: Thu, 10 Dec 2009 08:45:30 -0500 (EST) >> From: "Douglas Eadline" >> Subject: Re: [Beowulf] scalability >> To: "amjad ali" >> Cc: Beowulf Mailing List >> Message-ID: <51346.192.168.1.1.1260452730.squirrel at mail.eadline.org> >> Content-Type: text/plain;charset=iso-8859-1 >> >> >> The performance of GigE can vary widely based on several >> issues: >> >> ?- chipset >> ?- driver >> ?- driver settings (interrupt coalescing etc.) >> ?- switch performance under load >> ?- switch manufacturer >> >> A well tuned GigE network is never as good as >> IB or Myrinet, but it can work well for some codes. >> You can also try using Open-MX instead of TCP >> for MPI communications. >> >> -- >> Doug > > -- > ************************************************************* > J?rg Sa?mannshausen > Research Fellow > University of Strathclyde > Department of Pure and Applied Chemistry > 295 Cathedral St. > Glasgow > G1 1XL > > email: jorg.sassmannshausen at strath.ac.uk > web: http://sassy.formativ.net > > Please avoid sending me Word or PowerPoint attachments. > See http://www.gnu.org/philosophy/no-word-attachments.html > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Doug From bernard at vanhpc.org Thu Dec 10 12:20:21 2009 From: bernard at vanhpc.org (Bernard Li) Date: Thu, 10 Dec 2009 12:20:21 -0800 Subject: [Beowulf] Sony PS3, random news In-Reply-To: References: Message-ID: Curious what software they use to provision these PS3s ;-) Cheers, Bernard On Wed, Dec 9, 2009 at 10:19 AM, Jeremy Baker wrote: > DoD buys PS3 for HPC. > > CNN brief at > > http://scitech.blogs.cnn.com/2009/12/09/military-purchases-2200-ps3s/ > > Clip from report: > > ?"Though a single 3.2 GHz cell processor can deliver over 200 GFLOPS, > whereas the Sony PS3 configuration delivers approximately 150 GFLOPS, the > approximately tenfold cost difference per GFLOP makes the Sony PS3 the only > viable technology for HPC applications." 
> > > -- > Jeremy Baker > PO 297 > Johnson, VT > 05656 > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > From gerry.creager at tamu.edu Thu Dec 10 13:22:10 2009 From: gerry.creager at tamu.edu (Gerald Creager) Date: Thu, 10 Dec 2009 15:22:10 -0600 Subject: [Beowulf] Sony PS3, random news In-Reply-To: References: Message-ID: <4B216682.7060506@tamu.edu> Grand Theft Auto? More'n'likely, YellowDog Linux, which is what the folks in Colorado adapted, under contract with Sony, to generate for the PS3. gerry Bernard Li wrote: > Curious what software they use to provision these PS3s ;-) > > Cheers, > > Bernard > > On Wed, Dec 9, 2009 at 10:19 AM, Jeremy Baker wrote: >> DoD buys PS3 for HPC. >> >> CNN brief at >> >> http://scitech.blogs.cnn.com/2009/12/09/military-purchases-2200-ps3s/ >> >> Clip from report: >> >> "Though a single 3.2 GHz cell processor can deliver over 200 GFLOPS, >> whereas the Sony PS3 configuration delivers approximately 150 GFLOPS, the >> approximately tenfold cost difference per GFLOP makes the Sony PS3 the only >> viable technology for HPC applications." >> >> >> -- >> Jeremy Baker >> PO 297 >> Johnson, VT >> 05656 >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> >> > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From csamuel at vpac.org Thu Dec 10 16:28:05 2009 From: csamuel at vpac.org (Chris Samuel) Date: Fri, 11 Dec 2009 11:28:05 +1100 (EST) Subject: [Beowulf] cluster sharing In-Reply-To: <428810f20912091746k74184ef4n125c2b5ada94606a@mail.gmail.com> Message-ID: <1433546203.7750671260491285873.JavaMail.root@mail.vpac.org> ----- "amjad ali" wrote: > How should I make a reasonable performance measurement > in such a case? I would suggest that you request entire nodes through the batch scheduler. So if you are using Torque (PBS) for instance and wanted to run an 80 CPU job on dual socket quad core nodes you would request: #PBS -l nodes=10:ppn=8 The scheduler would then only allocate you nodes that were not used by other people. All the best, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From csamuel at vpac.org Thu Dec 10 16:33:46 2009 From: csamuel at vpac.org (Chris Samuel) Date: Fri, 11 Dec 2009 11:33:46 +1100 (EST) Subject: [Beowulf] scalability In-Reply-To: <4B21413B.7020000@ldeo.columbia.edu> Message-ID: <236624678.7750761260491626840.JavaMail.root@mail.vpac.org> ----- "Gus Correa" wrote: > This is about the same number I've got for an atmospheric model > in a dual-socket dual-core Xeon computer. > Somehow the memory path/bus on these systems is not very efficient, > and saturates when more than two processes do > intensive work concurrently. > A similar computer configuration with dual- dual- > AMD Opterons performed significantly better on the same atmospheric > code (efficiency close to 90%). 
The issue is that there are Xeon's and then there are Xeon's. The older Woodcrest/Clovertown type CPUs had the standard Intel bottleneck of a single memory controller for both sockets. The newer Nehalem Xeon's have HyperTransport^W QPI which involves each socket having its own memory controller with connections to local RAM. This is essentially what AMD have been doing with Opteron for years and why they've traditionally done better than Intel with memory intensive codes. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From bernard at vanhpc.org Fri Dec 11 09:26:41 2009 From: bernard at vanhpc.org (Bernard Li) Date: Fri, 11 Dec 2009 09:26:41 -0800 Subject: [Beowulf] Sony PS3, random news In-Reply-To: <4B216682.7060506@tamu.edu> References: <4B216682.7060506@tamu.edu> Message-ID: Hi Gerry: On Thu, Dec 10, 2009 at 1:22 PM, Gerald Creager wrote: > Grand Theft Auto? ?More'n'likely, YellowDog Linux, which is what the folks > in Colorado adapted, under contract with Sony, to generate for the PS3. :-) Actually by now you could install pretty much any Linux flavour on the PS3. I was talking specifically about what system software they used to provision the 2000+ PS3s. You would still need to manually switch each PS3 to boot from "otheros", but beyond that it would be nice to have an automated system of provisioning the OS... Cheers, Bernard From gus at ldeo.columbia.edu Fri Dec 11 10:25:34 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 11 Dec 2009 13:25:34 -0500 Subject: [Beowulf] scalability In-Reply-To: <236624678.7750761260491626840.JavaMail.root@mail.vpac.org> References: <236624678.7750761260491626840.JavaMail.root@mail.vpac.org> Message-ID: <4B228E9E.8030407@ldeo.columbia.edu> Hi Chris Chris Samuel wrote: > ----- "Gus Correa" wrote: > >> This is about the same number I've got for an atmospheric model >> in a dual-socket dual-core Xeon computer. >> Somehow the memory path/bus on these systems is not very efficient, >> and saturates when more than two processes do >> intensive work concurrently. >> A similar computer configuration with dual- dual- >> AMD Opterons performed significantly better on the same atmospheric >> code (efficiency close to 90%). > > The issue is that there are Xeon's and then there > are Xeon's. > > The older Woodcrest/Clovertown type CPUs had the > standard Intel bottleneck of a single memory > controller for both sockets. > Yes, that is for fact, but didn't the Harpertown generation still have a similar problem? Amjad's Xeon small cluster machines are dual socket dual core, perhaps a bit older than the type I had used here (Intel Xeon 5160 3.00GHz) in standalone workstations and tested an atmosphere model with the efficiency numbers I mentioned above. According to Amjad: "I have, with my group, a small cluster of about 16 nodes (each one with single socket Xeon 3085 or 3110; And I face problem of poor scalability. " I lost track of the Intel number/naming convention. Are Amjad's and mine Woodcrest? Clovertown? Harpertown? > The newer Nehalem Xeon's have HyperTransport^W QPI > which involves each socket having its own memory > controller with connections to local RAM. That has been widely reported, at least in SPEC2000 type of tests. Unfortunately I don't have any Nehalem to play with our codes. 
However, please take a look at the ongoing discussion on the OpenMPI list about memory issues with Nehalem (perhaps combined with later versions of GCC) on MPI programs: http://www.open-mpi.org/community/lists/users/2009/12/11462.php http://www.open-mpi.org/community/lists/users/2009/12/11499.php http://www.open-mpi.org/community/lists/users/2009/12/11500.php http://www.open-mpi.org/community/lists/users/2009/12/11516.php http://www.open-mpi.org/community/lists/users/2009/12/11515.php > > This is essentially what AMD have been doing with > Opteron for years and why they've traditionally > done better than Intel with memory intensive codes. > Yes, and we're happy with their performance, memory bandwidth and scalability on the codes we run (mostly ocean/atmosphere/climate). Steady workhorses. Not advocating any manufacturer's cause, just telling our experience. > cheers, > Chris Cheers, Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- From tom.elken at qlogic.com Fri Dec 11 11:18:11 2009 From: tom.elken at qlogic.com (Tom Elken) Date: Fri, 11 Dec 2009 11:18:11 -0800 Subject: [Beowulf] scalability In-Reply-To: <4B228E9E.8030407@ldeo.columbia.edu> References: <236624678.7750761260491626840.JavaMail.root@mail.vpac.org> <4B228E9E.8030407@ldeo.columbia.edu> Message-ID: <35AAF1E4A771E142979F27B51793A48887030F1FE7@AVEXMB1.qlogic.org> > > The older Woodcrest/Clovertown type CPUs had the > > standard Intel bottleneck of a single memory > > controller for both sockets. > > Yes, that is for fact, but didn't the > Harpertown generation still have a similar problem? Yes. > > Amjad's Xeon small cluster machines are dual socket dual core, > perhaps a bit older than the type I had used here > (Intel Xeon 5160 3.00GHz) in standalone workstations ... > According to Amjad: > "I have, with my group, a small cluster of about 16 nodes > (each one with single socket Xeon 3085 or 3110; > And I face problem of poor scalability. " > > I lost track of the Intel number/naming convention. > Are Amjad's and mine Woodcrest? Yours (Xeon 5160) are Woodcrest. Amjad's are Conroe (Xeon 3085) and Wolfdale (Xeon 3110). But they all appear to be very similar. A good reference is: http://en.wikipedia.org/wiki/Xeon#3000-series_.22Conroe.22 and further down the page. Microarchitecturally they are probably virtually identical. Main differences differences in - L2 size (Wolfdale (Xeon 3110) had 6MB, and the others 4MB, all shared between 2 cores), - power dissipation, - process size, and - max # of CPUs in a system. -Tom > Clovertown? > Harpertown? > > > The newer Nehalem Xeon's have HyperTransport^W QPI > > which involves each socket having its own memory > > controller with connections to local RAM. > From gerry.creager at tamu.edu Fri Dec 11 11:28:40 2009 From: gerry.creager at tamu.edu (Gerald Creager) Date: Fri, 11 Dec 2009 13:28:40 -0600 Subject: [Beowulf] scalability In-Reply-To: <4B228E9E.8030407@ldeo.columbia.edu> References: <236624678.7750761260491626840.JavaMail.root@mail.vpac.org> <4B228E9E.8030407@ldeo.columbia.edu> Message-ID: <4B229D68.4060204@tamu.edu> Howdy! Gus Correa wrote: > Hi Chris > > Chris Samuel wrote: >> ----- "Gus Correa" wrote: >> >>> This is about the same number I've got for an atmospheric model >>> in a dual-socket dual-core Xeon computer. 
>>> Somehow the memory path/bus on these systems is not very efficient, >>> and saturates when more than two processes do >>> intensive work concurrently. >>> A similar computer configuration with dual- dual- >>> AMD Opterons performed significantly better on the same atmospheric >>> code (efficiency close to 90%). >> >> The issue is that there are Xeon's and then there >> are Xeon's. >> >> The older Woodcrest/Clovertown type CPUs had the >> standard Intel bottleneck of a single memory >> controller for both sockets. >> > > Yes, that is for fact, but didn't the > Harpertown generation still have a similar problem? Yes. > Amjad's Xeon small cluster machines are dual socket dual core, > perhaps a bit older than the type I had used here > (Intel Xeon 5160 3.00GHz) in standalone workstations > and tested an atmosphere model with the efficiency > numbers I mentioned above. > According to Amjad: > > "I have, with my group, a small cluster of about 16 nodes > (each one with single socket Xeon 3085 or 3110; > And I face problem of poor scalability. " What's the application? WRF may fail to scale even at modest numbers of cores if the domain size is sufficiently large. IS this a NWP code? (sorry for coming in late on the discussion... I've been hacking on RWFv3.1.1 with openMPI and PGI and seeing some interesting problems. gerry > I lost track of the Intel number/naming convention. > Are Amjad's and mine Woodcrest? > Clovertown? > Harpertown? > >> The newer Nehalem Xeon's have HyperTransport^W QPI >> which involves each socket having its own memory >> controller with connections to local RAM. > > That has been widely reported, at least in SPEC2000 type of tests. > Unfortunately I don't have any Nehalem to play with our codes. > However, please take a look at the ongoing discussion on the OpenMPI > list about memory issues with Nehalem > (perhaps combined with later versions of GCC) on MPI programs: > > http://www.open-mpi.org/community/lists/users/2009/12/11462.php > http://www.open-mpi.org/community/lists/users/2009/12/11499.php > http://www.open-mpi.org/community/lists/users/2009/12/11500.php > http://www.open-mpi.org/community/lists/users/2009/12/11516.php > http://www.open-mpi.org/community/lists/users/2009/12/11515.php > >> >> This is essentially what AMD have been doing with >> Opteron for years and why they've traditionally >> done better than Intel with memory intensive codes. >> > > Yes, and we're happy with their performance, memory bandwidth > and scalability on the codes we run (mostly ocean/atmosphere/climate). > Steady workhorses. > > Not advocating any manufacturer's cause, > just telling our experience. 
> >> cheers, >> Chris > > Cheers, > Gus Correa > --------------------------------------------------------------------- > Gustavo Correa > Lamont-Doherty Earth Observatory - Columbia University > Palisades, NY, 10964-8000 - USA > --------------------------------------------------------------------- > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From gus at ldeo.columbia.edu Fri Dec 11 15:34:45 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 11 Dec 2009 18:34:45 -0500 Subject: [Beowulf] Re: scalability In-Reply-To: <428810f20912101944w30e9fe36r64216533ac218d7d@mail.gmail.com> References: <428810f20912091722lca5d633g45b3feee7543bbf@mail.gmail.com> <4B2058E0.9050602@ldeo.columbia.edu> <428810f20912091914r4955515wa53f12611a848af6@mail.gmail.com> <4B21413B.7020000@ldeo.columbia.edu> <428810f20912101944w30e9fe36r64216533ac218d7d@mail.gmail.com> Message-ID: <4B22D715.3050809@ldeo.columbia.edu> Hi Amjad amjad ali wrote: > Hi Gus, > > I was told that some people used to run two processes only on > dual-socket dual-core Xeon nodes , leaving the other two cores idle. > Although it is an apparent waste, the argument was that it paid > off in terms of overall efficiency. > > > I guess I fully agree with this. > > > Have you tried to run your programs this way on your cluster? > Say, with one process only per node, and N nodes, > then with two processes per node, and N/2 nodes, > then with four processes per node, and N/4 nodes. > This may tell what is optimal for the hardware you have. > With OpenMPI you can use the "mpiexec" flags > "-bynode" and "-byslot" to control this behavior. > "man mpiexec" is your friend! :) > > > does mpich also provide this? > or it will be controlled by the scheduler?? A mixed answer. I think you can do this with MPICH2, but it is not so easy as it is with OpenMPI, depending on other things, particularly the mpiexec that you use. 1) You can use Torque/Maui and request full nodes, as Chris Samuel suggested to you in another thread. E.g.: #PBS -l nodes=10:ppn=8 This doesn't guarantee or requires that your job will run on a single core per node, but it ensures that nobody else will be running anything there but you. Hence, this is kind of a preliminary step. 2) If you use mpd and the native MPICH2 mpiexec to launch programs compiled with MPICH2, you could control where the processes run by using the "-machinefile" or the "-configfile" option. To take effect, you also need to tweak with the contents of your "machinefile"/"configfile". For instance, you could write a script to run inside the PBS script, but before the mpiexec command, to read the PBS_NODES file, and build the "machinefile"/"configfile" with selected nodes. That is more involved than with OpenMPI, but not too hard to do. You must read "man mpiexec" to do this right. 3) If you use the OSC mpiexec (http://www.osc.edu/~djohnson/mpiexec/index.php) to launch programs compiled with MPICH2, you can use the "-pernode" option to run in a single core per node, which is similar to OpenMPI "-bynode", and easy to do. 4) If you still use MPICH1, which is too old, unmaintained, and troublesome, then upgrade to OpenMPI or to MPICH2, and use the solutions proposed here and in previous emails. 
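To make option 2) concrete, here is a minimal sketch of a Torque/PBS script that requests full nodes but starts only one MPICH2 rank per node by rebuilding the machinefile from $PBS_NODEFILE (the executable name is a placeholder, and it assumes an mpd ring is already running on the allocated nodes):

    #!/bin/bash
    #PBS -l nodes=10:ppn=8
    #PBS -l walltime=12:00:00
    cd $PBS_O_WORKDIR
    # $PBS_NODEFILE lists one hostname per allocated core;
    # keeping each hostname once gives one MPI process per node.
    sort -u $PBS_NODEFILE > nodes.1perNode
    NNODES=$(wc -l < nodes.1perNode)
    mpiexec -machinefile nodes.1perNode -np $NNODES ./my_solver

With OpenMPI the same effect needs no machinefile editing at all, just "mpiexec -bynode -np $NNODES ./my_solver".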
> > But still if it is a shared cluster (as in my case) then the cores you > left unbusy may be allocated to another process of another user by the > Batch scheduler. Right?? Unless you request full nodes, as Chris Samuel suggested: #PBS -l nodes=10:ppn=8 However, beware that this greedy and wasteful behavior may drive your system administrator and the other cluster users mad at you! :) Well, you can always justify it in the name of science, of course. ;) > > > > Yes, CFD codes are memory bandwidth bound usually. > > > Indeed, and so is most of our atmosphere/ocean/climate codes, > which has a lot of CFD, but also radiative processes, mixing, > thermodynamics, etc. > However, most of our models use fixed grids, and I suppose > some of your aerodynamics may use adaptive meshes, right? > I guess you are doing aerodynamics, right? > > > Amazing!! > but I would really love to know (infact, learn) which > factors/indications made you to guess so correctly. > Google is not only your friend. It is also *my* friend! :) Is the Amjad Ali Pasha listed here yourself or somebody else? http://www.aero.iitb.ac.in/aero/people/students/phd.html Dialog here is a two way road, a cooperative and open exchange. My identity is stamped on my signature block in all messages, no secret about it. Why not yours? > > I would offer you 6 cents. > 2 cents --- you missed below. > 2 extra. > 2 cents for next email. > > > Thank you very much. > My two Rupees. :) Best, Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- > > My pleasure. > > > I hope this helps, > > Gus Correa > --------------------------------------------------------------------- > Gustavo Correa > Lamont-Doherty Earth Observatory - Columbia University > Palisades, NY, 10964-8000 - USA > --------------------------------------------------------------------- > From rpnabar at gmail.com Fri Dec 11 22:59:43 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Sat, 12 Dec 2009 00:59:43 -0600 Subject: [Beowulf] Performance tuning for Jumbo Frames Message-ID: I have seen a considerable performance boost for my codes by using Jumbo Frames. But are there any systematic tools or strategies to select the optimum MTU size? I have it set as 9000. (Of course, all switiching hardware supports jumbo frames and no talking to the external world required of the interfaces) Have you guys found performance to be MTU sensitive? Also, are there any switch side parameters that can affect the performance of HPC codes? Specifically I was trying to run VASP which is known to be latency sensitive. I have a 10 Gig E network with a RDMA offload card and am getting average latencies (ping pong) using rping of around 14 microsecs in the MPI tests. Is there a way to figure out what percentage of this latency is in the switch and what %age in the stack, cards and cables? Just trying to figure out which are the battles one picks to fight. Any tips? 
-- Rahu From hearnsj at googlemail.com Sat Dec 12 00:45:56 2009 From: hearnsj at googlemail.com (John Hearns) Date: Sat, 12 Dec 2009 08:45:56 +0000 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: References: Message-ID: <9f8092cc0912120045lb9b415fn9f8b3d9f6edca586@mail.gmail.com> 2009/12/12 Rahul Nabar : > Is there a way to > figure out what percentage of this latency is in the switch and what > %age in the stack, cards and cables? Just trying to figure out which > are the battles one picks to fight. I would say take the switch out and do a direct point-to-point link between two systems. Is this possible with 10gig ethernet? From rpnabar at gmail.com Sat Dec 12 01:02:03 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Sat, 12 Dec 2009 03:02:03 -0600 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: <9f8092cc0912120045lb9b415fn9f8b3d9f6edca586@mail.gmail.com> References: <9f8092cc0912120045lb9b415fn9f8b3d9f6edca586@mail.gmail.com> Message-ID: On Sat, Dec 12, 2009 at 2:45 AM, John Hearns wrote: > 2009/12/12 Rahul Nabar : >> ?Is there a way to >> figure out what percentage of this latency is in the switch and what >> %age in the stack, cards and cables? Just trying to figure out which >> are the battles one picks to fight. > > I would say take the switch out and do a direct point-to-point link > between two systems. > Is this possible with 10gig ethernet? Yes! Great idea. I never thought of that. I'll try that. Yes, I'm no expert, but as far I know a direct link is possible. -- Rahul From h-bugge at online.no Sat Dec 12 03:27:48 2009 From: h-bugge at online.no (=?ISO-8859-1?Q?H=E5kon_Bugge?=) Date: Sat, 12 Dec 2009 12:27:48 +0100 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: References: Message-ID: <83F599A7-0A85-4887-9178-0D98E31EB0F0@online.no> On Dec 12, 2009, at 7:59 , Rahul Nabar wrote: > I have seen a considerable performance boost for my codes by using > Jumbo Frames. But are there any systematic tools or strategies to > select the optimum MTU size? I have it set as 9000. (Of course, all > switiching hardware supports jumbo frames and no talking to the > external world required of the interfaces) Have you guys found > performance to be MTU sensitive? Once (i.e. several years ago) I tested an application on Gbe and found that around 1/3rd of the MTU was the optimum. But I guess YMMV. H?kon From richard.walsh at comcast.net Sat Dec 12 05:39:30 2009 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Sat, 12 Dec 2009 13:39:30 +0000 (UTC) Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: <83F599A7-0A85-4887-9178-0D98E31EB0F0@online.no> Message-ID: <1399280631.2109511260625170185.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> >On Dec 12, 2009 H?kon Bugge wrote: > >On Dec 12, 2009, at 7:59 , Rahul Nabar wrote: > >> I have seen a considerable performance boost for my codes by using >> Jumbo Frames. But are there any systematic tools or strategies to >> select the optimum MTU size? I have it set as 9000. (Of course, all >> switiching hardware supports jumbo frames and no talking to the >> external world required of the interfaces) Have you guys found >> performance to be MTU sensitive? > >Once (i.e. several years ago) I tested an application on Gbe and found >that around 1/3rd of the MTU was the optimum. But I guess YMMV. 
I would seem that a larger MTU would help in at least two situations, clearly applications with very large messages, but also those that have transmission bursts of messages below the MTU that could take advantage of hardware coalescing. The common MTU of 1500 was not chosen arbitrarily, but was probably not tuned to serve the "average" HPC application. If one is going to run one application predominately, then that application might prefer the default or not. If one is running a suite of codes, the it will probably help some and hurt others and should be "weighted average" tuned. I think the first-order focal point for performance should be on writing good code while choosing the "right" MTU size is a third order concern. Here is a stupid question ... it is the >>Maximum<< Transmission Unit ... Right? You don't get it all the time ... how is the run-time value set and adjusted based on the quality of message traffic? Experts ... ?? rbw _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpnabar at gmail.com Sat Dec 12 07:54:23 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Sat, 12 Dec 2009 09:54:23 -0600 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: <83F599A7-0A85-4887-9178-0D98E31EB0F0@online.no> References: <83F599A7-0A85-4887-9178-0D98E31EB0F0@online.no> Message-ID: On Sat, Dec 12, 2009 at 5:27 AM, H?kon Bugge wrote: > > Once (i.e. several years ago) I tested an application on Gbe and found that > around 1/3rd of the MTU was the optimum. But I guess YMMV. i.e MTU = 0.3 * 9000? or 0.3 * 1500 Sorry, I was confused. -- Rahul From patrick at myri.com Sat Dec 12 08:40:49 2009 From: patrick at myri.com (Patrick Geoffray) Date: Sat, 12 Dec 2009 11:40:49 -0500 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: References: Message-ID: <4B23C791.5070701@myri.com> Rahul, Rahul Nabar wrote: > I have seen a considerable performance boost for my codes by using > Jumbo Frames. But are there any systematic tools or strategies to > select the optimum MTU size? There is no optimal MTU size. This is the maximum payload you can fit in one packet, so there is no drawback to a bigger MTU. Actually, there is one in terms of wormhole switching, but switch contention is an issue happily ignored by most HPC users. > external world required of the interfaces) Have you guys found > performance to be MTU sensitive? A large MTU means fewer packets for the same amount of data transfered. In all stack processing, there is a per-packet overhead (decoding header, integrity, sequence number, etc) and a per-byte overhead (copy). A large MTU reduces the total per-packet overhead because there are less packets to process. Most 10GE NIC have no problems reaching line rate at 1500 Bytes (the standard Ethernet MTU), the problem is the host OS stack (mainly TCP) where the per-packet overhead is important. One trick that all 10GE NICs worth their salt are doing these days is to fake a large MTU at the OS level, while keeping the wire MTU at 1500 Bytes (for compatibility). This is called TSO (Transmit Send Offload) and LRO (Large Receive Offload). The OS stack is using a virtual MTU of 64K and the NIC does segmentation/reassembly in hardware, sort of. 
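Whether a given NIC/driver pair is actually doing these offloads is easy to check from user space with ethtool; the interface name below is just a placeholder, and not every driver lets you toggle every offload:

    # show the current offload settings (TSO, LRO, checksum offload, ...)
    ethtool -k eth2
    # try to enable TCP segmentation offload and large receive offload
    ethtool -K eth2 tso on lro on
    # interrupt coalescing settings, which trade latency for CPU load
    ethtool -c eth2

These knobs only matter for the TCP path; an OS-bypass/RDMA path does not go through the kernel stack at all.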
> Also, are there any switch side parameters that can affect the > performance of HPC codes? Specifically I was trying to run VASP which > is known to be latency sensitive. A large MTU has little to no impact on latency. > I have a 10 Gig E network with a > RDMA offload card and am getting average latencies (ping pong) using > rping of around 14 microsecs in the MPI tests. It is most likely due to the switch. Try back-to-back to measure without it. I don't know what hardware you are using, but you can get close to 10us latency over TCP with a standard 10GE NIC and interrupt coalescing disabled. With a NIC supporting OS-bypass (RDMA only make sense for bandwidth), you should get at least half that, ideally below 3us. Patrick From patrick at myri.com Sat Dec 12 08:41:48 2009 From: patrick at myri.com (Patrick Geoffray) Date: Sat, 12 Dec 2009 11:41:48 -0500 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: <9f8092cc0912120045lb9b415fn9f8b3d9f6edca586@mail.gmail.com> References: <9f8092cc0912120045lb9b415fn9f8b3d9f6edca586@mail.gmail.com> Message-ID: <4B23C7CC.4060004@myri.com> John Hearns wrote: > I would say take the switch out and do a direct point-to-point link > between two systems. > Is this possible with 10gig ethernet? Yes, no need for crossover cables with 10GE. Patrick From patrick at myri.com Sat Dec 12 10:45:35 2009 From: patrick at myri.com (Patrick Geoffray) Date: Sat, 12 Dec 2009 13:45:35 -0500 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: <1399280631.2109511260625170185.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> References: <1399280631.2109511260625170185.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: <4B23E4CF.6010306@myri.com> Hi Richard, richard.walsh at comcast.net wrote: > I would seem that a larger MTU would help in at least two situations, > clearly applications with very large messages, but also those that > have transmission bursts of messages below the MTU that could > take advantage of hardware coalescing. Such coalescing is typically not done in hardware. TCP with Nagle will coalesce small messages going to the same destination. However, Nagle trades latency for bandwidth, so latency junkies such as HPC apps often disable Nagle when using TCP. Some MPI implementations do coalesce small messages, mainly to look good on packet rate benchmarks. > The common MTU of 1500 was not chosen arbitrarily, Never underestimate what a standard committee can do. Patrick From hearnsj at googlemail.com Sun Dec 13 02:23:42 2009 From: hearnsj at googlemail.com (John Hearns) Date: Sun, 13 Dec 2009 10:23:42 +0000 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: <4B240378.6020101@earlham.edu> References: <9f8092cc0912120045lb9b415fn9f8b3d9f6edca586@mail.gmail.com> <4B23C7CC.4060004@myri.com> <4B240378.6020101@earlham.edu> Message-ID: <9f8092cc0912130223r69c820bw60b2f47ca3281895@mail.gmail.com> 2009/12/12 Kevin Hunter : > > I don't have access to all hardware (obviously), but it's been my > experience that *any* NIC made in the last 6- years (10G, 1G, 100M) no > longer needs crossover cables to do direct NIC-to-NIC communication. 10gig uses fibre cables, CX-4 copper or only very recently 1000-GBASE-T so 'crossover' cables only applicable (or not) in the last case. 
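Putting a number on the switch contribution is then just a matter of running the same latency test back-to-back and through the switch and subtracting. A rough sketch with NetPIPE's NPtcp, assuming Linux and a made-up private subnet for the direct link:

    # node A (receiver), after cabling the two 10GbE ports together
    ip addr add 192.168.77.1/24 dev eth2
    NPtcp
    # node B (transmitter)
    ip addr add 192.168.77.2/24 dev eth2
    NPtcp -h 192.168.77.1

The small-message latency NPtcp reports for the direct link, compared with the same run made through the switch, gives the switch's share; the remainder lives in the cards, cables and the host stack.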
From csamuel at vpac.org Sun Dec 13 18:26:04 2009 From: csamuel at vpac.org (Chris Samuel) Date: Mon, 14 Dec 2009 13:26:04 +1100 (EST) Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: <1269675451.7906811260757522831.JavaMail.root@mail.vpac.org> Message-ID: <553360800.7906831260757564672.JavaMail.root@mail.vpac.org> ----- "Patrick Geoffray" wrote: > richard.walsh at comcast.net wrote: > > > The common MTU of 1500 was not chosen arbitrarily, > > Never underestimate what a standard committee can do. Viz the compromise of a 48 byte payload (53 byte cell) for ATM between the US and Europe.. http://en.wikipedia.org/wiki/Asynchronous_transfer_mode#Why_cells.3F cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From rpnabar at gmail.com Sun Dec 13 19:26:52 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Sun, 13 Dec 2009 21:26:52 -0600 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: <4B23C791.5070701@myri.com> References: <4B23C791.5070701@myri.com> Message-ID: On Sat, Dec 12, 2009 at 10:40 AM, Patrick Geoffray wrote: > Rahul, > > Rahul Nabar wrote: >> >> I have seen a considerable performance boost for my codes by using >> Jumbo Frames. But are there any systematic tools or strategies to >> select the optimum MTU size? > > There is no optimal MTU size. This is the maximum payload you can fit in one > packet, so there is no drawback to a bigger MTU. Thanks! So I could push it beyond 9000 as well? Reason is I've seen a steady boost in performance so far. 1500 < 4000 < 9000. Maybe my performance continues to increase beyond 9000 too? -- Rahul From bcostescu at gmail.com Mon Dec 14 02:59:21 2009 From: bcostescu at gmail.com (Bogdan Costescu) Date: Mon, 14 Dec 2009 11:59:21 +0100 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: References: Message-ID: On Sat, Dec 12, 2009 at 7:59 AM, Rahul Nabar wrote: > I have seen a considerable performance boost for my codes by using > Jumbo Frames. But are there any systematic tools or strategies to > select the optimum MTU size? I have it set as 9000. I played with this as well several times and found variable results. At one time, the memory allocation proved to be the limiting factor: because the page size was 4K, a packet with a MTU smaller than that would fit into one page, while a packet of 9000bytes would require 3 contiguous pages, making the search more time consuming; when plotting the bandwidth vs. MTU it peaked at just below 4K, so an increased MTU was beneficial compared with the default 1500bytes one, but only as long as it fits in one page. At another time, the switch was more likely to drop large frames under high load (maybe something to do with internal memory management), so the 9000bytes frames worked most of the time while the 1500bytes ones worked all the time... At yet another time, the high interrupt load generated by the 1500bytes fragments would make the computer unstable (probably an Athlon MP based system, but memory is fuzzy), so larger frames and/or interrupt coalescing was the only way to actually use that computer. The MTU can be set to higher values than 9000bytes if all the components involved support it - switch, network cards and driver. 
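On the host side the driver will tell you directly what it accepts, since setting an unsupported MTU simply fails; the interface name and values below are placeholders:

    # current MTU
    ip link show dev eth2
    # raise it; the driver rejects sizes it cannot handle with an error
    ip link set dev eth2 mtu 9000
    ip link set dev eth2 mtu 16000

The switch side still has to be checked against its own documentation, of course.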
I remember seeing 2-3 years ago for some network equipment a MTU of 16K - but again the memory is fuzzy on what equipment that was - so it's definitely possible to have it higher than 9000bytes. Usually setting a too large MTU would be seen in bandwidth testing - if fragments above a certain MTU are dropped or only partly transferred, there will be retransmissions and the useful bandwidth will drop significantly (and for a trained eye, the statistics of the network driver and stack will provide clues as well). > Also, are there any switch side parameters that can affect the > performance of HPC codes? Most (all ?) switches do their job in hardware, to arrive at wire speed. There is usually nothing that can be set to affect the way the engine works. Cheers, Bogdan From Greg at keller.net Mon Dec 14 07:15:56 2009 From: Greg at keller.net (Greg Keller) Date: Mon, 14 Dec 2009 09:15:56 -0600 Subject: [Beowulf] Re: scalability In-Reply-To: <200912121641.nBCGfhxi010197@bluewest.scyld.com> References: <200912121641.nBCGfhxi010197@bluewest.scyld.com> Message-ID: <22C7676A-4693-4BA5-9348-653AC8E65935@Keller.net> On Dec 12, 2009, at 10:41 AM, beowulf-request at beowulf.org wrote: > > From: Gus Correa > > Hi Amjad > > amjad ali wrote: >> Hi Gus, >> >> I was told that some people used to run two processes only on >> dual-socket dual-core Xeon nodes , leaving the other two cores >> idle. >> Although it is an apparent waste, the argument was that it paid >> off in terms of overall efficiency. >> >> >> I guess I fully agree with this. >> The other reason I see folks choose this is licensing. In some cases the "cost" of the license tokens to use the extra cores is too expensive given the minimal benefit since they still compete for Memory Bandwidth, CPU Cache, or less commonly Network Bandwidth/Latency. ... >> >> But still if it is a shared cluster (as in my case) then the cores >> you >> left unbusy may be allocated to another process of another user by >> the >> Batch scheduler. Right?? > > Unless you request full nodes, as Chris Samuel suggested: > #PBS -l nodes=10:ppn=8 > > However, beware that this greedy and wasteful behavior > may drive your system administrator and > the other cluster users mad at you! :) > Well, you can always justify it in the name of science, of course. ;) > They may thank you for not sharing a "maxed out" node. If someone is doing benchmarking or running a similarly sensitive code you're saving everyone a lot of head scratching. In my world users sharing nodes is almost always trouble looking for a way to ruin a weekend. Cheers! Greg -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpnabar at gmail.com Mon Dec 14 10:20:20 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 14 Dec 2009 12:20:20 -0600 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: <20091214125720.218a9397.chekh@pcbi.upenn.edu> References: <4B23C791.5070701@myri.com> <20091214125720.218a9397.chekh@pcbi.upenn.edu> Message-ID: On Mon, Dec 14, 2009 at 11:57 AM, Alex Chekholko wrote: > On Sun, 13 Dec 2009 21:26:52 -0600 > Well, remember, your hardware has to support it, first. Right. I am checking with the Switch and eth Card specs to make sure now. 
-- Rahul From cap at nsc.liu.se Mon Dec 14 10:35:26 2009 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Mon, 14 Dec 2009 19:35:26 +0100 Subject: [Beowulf] Sony PS3, random news In-Reply-To: References: Message-ID: <200912141935.26586.cap@nsc.liu.se> On Wednesday 09 December 2009, Jeremy Baker wrote: > DoD buys PS3 for HPC. > > CNN brief at > > http://scitech.blogs.cnn.com/2009/12/09/military-purchases-2200-ps3s/ > > Clip from report: > > "Though a single 3.2 GHz cell processor can deliver over 200 GFLOPS, > whereas the Sony PS3 configuration delivers approximately 150 GFLOPS, the > approximately tenfold cost difference per GFLOP makes the Sony PS3 the only > viable technology for HPC applications." Given that the "new" PS3s does not support linux (or any "other OS" for that matter) this price/performance sweet spot may be going away... /Peter -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. URL: From rpnabar at gmail.com Mon Dec 14 11:35:14 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 14 Dec 2009 13:35:14 -0600 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: <4B23C791.5070701@myri.com> References: <4B23C791.5070701@myri.com> Message-ID: On Sat, Dec 12, 2009 at 10:40 AM, Patrick Geoffray wrote: > > Most 10GE NIC have no problems reaching line rate at 1500 Bytes (the > standard Ethernet MTU), the problem is the host OS stack (mainly TCP) where > the per-packet overhead is important. One trick that all 10GE NICs worth > their salt are doing these days is to fake a large MTU at the OS level, > while keeping the wire MTU at 1500 Bytes (for compatibility). This is called > TSO (Transmit Send Offload) and LRO (Large Receive Offload). The OS stack is > using a virtual MTU of 64K and the NIC does segmentation/reassembly in > hardware, sort of. The TSO and LRO are only relevant to TCP though, aren't they? I am using RDMA so that shouldn't matter. Maybe I am wrong. -- Rahul From rpnabar at gmail.com Mon Dec 14 11:54:11 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 14 Dec 2009 13:54:11 -0600 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: <4B23C791.5070701@myri.com> References: <4B23C791.5070701@myri.com> Message-ID: On Sat, Dec 12, 2009 at 10:40 AM, Patrick Geoffray wrote: > > It is most likely due to the switch. Try back-to-back to measure without it. > I don't know what hardware you are using, but you can get close to 10us > latency over TCP with a standard 10GE NIC and interrupt coalescing disabled. > With a NIC supporting OS-bypass (RDMA only make sense for bandwidth), you > should get at least half that, ideally below 3us. What was your tool to measure this latency? Just curious. -- Rahul From eugen at leitl.org Tue Dec 15 06:39:42 2009 From: eugen at leitl.org (Eugen Leitl) Date: Tue, 15 Dec 2009 15:39:42 +0100 Subject: [Beowulf] Ich bitte um Feedback bzgl. MPI / please I need your feedback Message-ID: <20091215143942.GG17686@leitl.org> ----- Forwarded message from Rolf Rabenseifner ----- From: Rolf Rabenseifner Date: Tue, 15 Dec 2009 15:42:50 +0100 (CET) To: eugen at leitl.org Subject: Ich bitte um Feedback bzgl. MPI / please I need your feedback Sehr geehrte Damen und Herren, zur Weiterentwicklung des Message Passing Interface (MPI) Standards moechten wir, das MPI-3 Forum, Sie bitten uns ein paar wichtige Fragen zu beantworten. 
Ich bitte Sie, mir 10 Minuten zu schenken und den kurzen Fragen-Katalog jetzt gleich zu beantworten. Bitte verwenden Sie nicht zu viel Zeit auf die einzelnen Fragen, wenn Sie Probleme bei der Beantwortung einer Frage haben sollten. Koennten Sie bitte diese Mail auch an Kollegen weiterleiten, die MPI nutzen. Hier die URL und das Passwort zu der Umfrage: URL: http://mpi-forum.questionpro.com/ Password: mpi3 Herzlichen Dank im Voraus - eine frohe Weihnachtszeit wuenscht Ihnen Ihr Rolf Rabenseifner ------------------------------------- Dear Madam, dear Sir, for improving the Message Passing Interface (MPI) standard, we (the MPI-3 Forum) kindly ask you to answer a few important questions. I ask you to spend 10 minutes of your time to answer the short questionnaire. Please do not spend too much time on a single question if there occur problems with one question. Please can you also forward this email to colleagues who use MPI. Here the URL and the password of the questionnaire. URL: http://mpi-forum.questionpro.com/ Password: mpi3 Thank you in advance and a Merry Christmas, Rolf Rabenseifner --------------------------------------------------------------------- Dr. Rolf Rabenseifner .. . . . . . . . . . email rabenseifner at hlrs.de High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530 University of Stuttgart .. . . . . . . . . fax : ++49(0)711/685-65832 Head of Dpmt Parallel Computing .. .. www.hlrs.de/people/rabenseifner Nobelstr. 19, D-70550 Stuttgart, Germany . . (Office: Allmandring 30) --------------------------------------------------------------------- ----- End forwarded message ----- -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From jorg.sassmannshausen at strath.ac.uk Mon Dec 14 07:17:15 2009 From: jorg.sassmannshausen at strath.ac.uk (=?iso-8859-15?q?J=F6rg_Sa=DFmannshausen?=) Date: Mon, 14 Dec 2009 15:17:15 +0000 Subject: [Beowulf] Performance degrading Message-ID: <200912141517.15432.jorg.sassmannshausen@strath.ac.uk> Dear all, I am scratching my head but apart from getting splinters into my fingers I cannot find a good answer for the following problem: I am running a DFT program (NWChem) in parallel on our cluster (AMD Opterons, single quad cores in the node, 12 GB of RAM, Gigabit network) and at certain stages of the run top is presenting me with that: top - 15:10:48 up 13 days, 22:20, 1 user, load average: 0.26, 0.24, 0.19 Tasks: 106 total, 1 running, 105 sleeping, 0 stopped, 0 zombie Cpu0 : 8.0% us, 2.7% sy, 0.0% ni, 82.7% id, 0.0% wa, 1.3% hi, 5.3% si Cpu1 : 4.1% us, 1.4% sy, 0.0% ni, 94.6% id, 0.0% wa, 0.0% hi, 0.0% si Cpu2 : 2.7% us, 0.0% sy, 0.0% ni, 97.3% id, 0.0% wa, 0.0% hi, 0.0% si Cpu3 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si Mem: 12250540k total, 5581756k used, 6668784k free, 273396k buffers Swap: 16779884k total, 0k used, 16779884k free, 3841688k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 16885 sassy 15 0 3928m 1.7g 1.4g S 4 14.4 312:19.92 nwchem 16886 sassy 15 0 3928m 1.7g 1.4g S 4 14.5 313:08.77 nwchem 16887 sassy 15 0 3920m 1.7g 1.4g S 3 14.4 316:18.24 nwchem 16888 sassy 15 0 3923m 1.6g 1.3g S 3 13.3 316:13.55 nwchem 16890 sassy 15 0 2943m 1.7g 1.7g S 3 14.8 104:32.33 nwchem It is not a few seconds it does it, it appears to be for a prolonged period of time. 
I checked it randomly for say 1 min and the performance is well below 50 % (most of the time around 20 %). I have not noticed that when I am running the job within one node. I have the suspicion that the Gigabit network is the problem, but I really would like to pinpoint that so I can get my boss to upgrade to a better network for parallel computing (hence my previous question about Open-MX). Now how, as I am not an admin of that cluster, would I be able to do that? Thanks for your comments. Best wishes from Glasgow! J?rg -- ************************************************************* J?rg Sa?mannshausen Research Fellow University of Strathclyde Department of Pure and Applied Chemistry 295 Cathedral St. Glasgow G1 1XL email: jorg.sassmannshausen at strath.ac.uk web: http://sassy.formativ.net Please avoid sending me Word or PowerPoint attachments. See http://www.gnu.org/philosophy/no-word-attachments.html From chekh at pcbi.upenn.edu Mon Dec 14 09:57:20 2009 From: chekh at pcbi.upenn.edu (Alex Chekholko) Date: Mon, 14 Dec 2009 12:57:20 -0500 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: References: <4B23C791.5070701@myri.com> Message-ID: <20091214125720.218a9397.chekh@pcbi.upenn.edu> On Sun, 13 Dec 2009 21:26:52 -0600 Rahul Nabar wrote: > On Sat, Dec 12, 2009 at 10:40 AM, Patrick Geoffray wrote: > > Rahul, > > > > Rahul Nabar wrote: > >> > >> I have seen a considerable performance boost for my codes by using > >> Jumbo Frames. But are there any systematic tools or strategies to > >> select the optimum MTU size? > > > > There is no optimal MTU size. This is the maximum payload you can fit in one > > packet, so there is no drawback to a bigger MTU. > > Thanks! So I could push it beyond 9000 as well? Reason is I've seen a > steady boost in performance so far. 1500 < 4000 < 9000. > > Maybe my performance continues to increase beyond 9000 too? Well, remember, your hardware has to support it, first. I have a Foundry FGS648P switch which lists in the specs: "Jumbo Frames up to 10,240 bytes for 10/100/1000 and 10GbE ports". I turn that on by issuing the command "jumbo frames" (then saving to flash, etc). The 10GigE NICs that I have are the HP NC510C NetXen-based cards. I use the driver from the HP support site, and that driver only supports up to 8000 MTU. So I use 8000 bytes as my MTU. Set it as high as you can; there is no downside except ensuring all your devices are set to handle that large unit size. Typically, if the device doesn't support jumbo frames, it just drops the jumbo frames silently, which can result in odd intermittent problems. -- Alex Chekholko chekh at pcbi.upenn.edu From atchley at myri.com Tue Dec 15 09:49:23 2009 From: atchley at myri.com (Scott Atchley) Date: Tue, 15 Dec 2009 12:49:23 -0500 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: <20091214125720.218a9397.chekh@pcbi.upenn.edu> References: <4B23C791.5070701@myri.com> <20091214125720.218a9397.chekh@pcbi.upenn.edu> Message-ID: <437A0D5B-D5F4-4F7A-B8C6-9B09EC723F47@myri.com> On Dec 14, 2009, at 12:57 PM, Alex Chekholko wrote: > Set it as high as you can; there is no downside except ensuring all > your devices are set to handle that large unit size. Typically, if > the > device doesn't support jumbo frames, it just drops the jumbo frames > silently, which can result in odd intermittent problems. You can test it by using the size parameter with ping: $ ping -s If they all drop, then you have exceeded the MTU of some device. 
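For example, with Linux iputils ping you can pin this down by forbidding fragmentation and accounting for the 28 bytes of IP+ICMP header (the hostname is a placeholder):

    # 9000-byte MTU - 20 (IP header) - 8 (ICMP header) = 8972 bytes of payload
    ping -M do -s 8972 node002
    # 1500-byte path for comparison
    ping -M do -s 1472 node002

If the 8972-byte probe gets errors or no replies while the 1472-byte one works, some device on the path is still limited to the standard MTU.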
Scott From chenyon1 at iit.edu Wed Dec 9 18:09:16 2009 From: chenyon1 at iit.edu (Yong Chen) Date: Wed, 09 Dec 2009 20:09:16 -0600 Subject: [Beowulf] [hpc-announce] Call For Papers: Intl. Workshop on Parallel Programming Models and Systems Software for HEC (P2S2) Message-ID: CALL FOR PAPERS =============== Third International Workshop on Parallel Programming Models and Systems Software for High-end Computing (P2S2) Sept. 13th, 2010 To be held in conjunction with ICPP-2010: The 39th International Conference on Parallel Processing, Sept. 13-16, 2010, San Diego, CA, USA Website: http://www.mcs.anl.gov/events/workshops/p2s2 SCOPE ----- The goal of this workshop is to bring together researchers and practitioners in parallel programming models and systems software for high-end computing systems. Please join us in a discussion of new ideas, experiences, and the latest trends in these areas at the workshop. TOPICS OF INTEREST ------------------ The focus areas for this workshop include, but are not limited to: * Systems software for high-end scientific and enterprise computing architectures o Communication sub-subsystems for high-end computing o High-performance file and storage systems o Fault-tolerance techniques and implementations o Efficient and high-performance virtualization and other management mechanisms for high-end computing * Programming models and their high-performance implementations o MPI, Sockets, OpenMP, Global Arrays, X10, UPC, Chapel, Fortress and others o Hybrid Programming Models * Tools for Management, Maintenance, Coordination and Synchronization o Software for Enterprise Data-centers using Modern Architectures o Job scheduling libraries o Management libraries for large-scale system o Toolkits for process and task coordination on modern platforms * Performance evaluation, analysis and modeling of emerging computing platforms PROCEEDINGS ----------- Proceedings of this workshop will be published in CD format and will be available at the conference (together with the ICPP conference proceedings) . SUBMISSION INSTRUCTIONS ----------------------- Submissions should be in PDF format in U.S. Letter size paper. They should not exceed 8 pages (all inclusive). Submissions will be judged based on relevance, significance, originality, correctness and clarity. Please visit workshop website at: http://www.mcs.anl.gov/events/workshops/p2s2/ for the submission link. JOURNAL SPECIAL ISSUE --------------------- The best papers of P2S2'10 will be included in a special issue of the International Journal of High Performance Computing Applications (IJHPCA) on Programming Models, Software and Tools for High-End Computing. IMPORTANT DATES --------------- Paper Submission: March 3rd, 2010 Author Notification: May 3rd, 2010 Camera Ready: June 14th, 2010 PROGRAM CHAIRS -------------- * Pavan Balaji, Argonne National Laboratory * Abhinav Vishnu, Pacific Northwest National Laboratory PUBLICITY CHAIR --------------- * Yong Chen, Illinois Institute of Technology STEERING COMMITTEE ------------------ * William D. Gropp, University of Illinois Urbana-Champaign * Dhabaleswar K. 
Panda, Ohio State University * Vijay Saraswat, IBM Research PROGRAM COMMITTEE ----------------- * Ahmad Afsahi, Queen's University * George Almasi, IBM Research * Taisuke Boku, Tsukuba University * Ron Brightwell, Sandia National Laboratory * Franck Cappello, INRIA, France * Yong Chen, Illinois Institute of Technology * Ada Gavrilovska, Georgia Tech * Torsten Hoefler, Indiana University * Zhiyi Huang, University of Otago, New Zealand * Hyun-Wook Jin, Konkuk University, Korea * Zhiling Lan, Illinois Institute of Technology * Doug Lea, State University of New York at Oswego * Jiuxing Liu, IBM Research * Guillaume Mercier, INRIA, France * Scott Pakin, Los Alamos National Laboratory * Fabrizio Petrini, IBM Research * Bronis de Supinksi, Lawrence Livermore National Laboratory * Sayantan Sur, IBM Research * Rajeev Thakur, Argonne National Laboratory * Vinod Tipparaju, Oak Ridge National Laboratory * Jesper Traff, NEC, Europe * Weikuan Yu, Auburn University If you have any questions, please contact us at p2s2-chairs at mcs.anl.gov ======================================================================== If you do not want to receive any more announcements regarding the P2S2 workshop, please unsubscribe here: https://lists.mcs.anl.gov/mailman/listinfo/hpc-announce ======================================================================== From djk at lanl.gov Mon Dec 14 06:48:20 2009 From: djk at lanl.gov (Darren) Date: Mon, 14 Dec 2009 07:48:20 -0700 Subject: [Beowulf] [hpc-announce] CFP: 24th International Conference on Supercomputing (ICS'10) - 4 weeks remaining Message-ID: <4B265034.6090806@lanl.gov> [Please accept our apologies if you have received this announcement multiple times] ***************************************************************************** CALL FOR PAPERS - Submission deadline in 4 weeks 24th International Conference on Supercomputing (ICS'10) http://www.ics-conference.org June 1-4, 2010 Epochal Tsukuba (Tsukuba International Congress Center) Tsukuba, Japan http://www.epochal.or.jp/eng/ Sponsored by ACM/SIGARCH ***************************************************************************** ICS is the premier international forum for the presentation of research results in high-performance computing systems. In 2010 the conference will be held at the Epochal Tsukuba (Tsukuba International Congress Center) in Tsukuba City, the largest high-tech and academic city in Japan. Papers are solicited on all aspects of research, development, and application of high-performance experimental and commercial systems. Special emphasis will be given to work that leads to better understanding of the implications of the new era of million-scale parallelism and Exa-scale performance; including (but not limited to): * Computationally challenging scientific and commercial applications: studies and experiences to exploit ultra large scale parallelism, a large number of accelerators, and/or cloud computing paradigm. * High-performance computational and programming models: studies and proposals of new models, paradigms and languages for scalable application development, seamless exploitation of accelerators, and grid/cloud computing. * Architecture and hardware aspects: processor, accelerator, memory, interconnection network, storage and I/O architecture to make future systems scalable, reliable and power efficient. 
* Software aspects: compilers and runtime systems, programming and development tools, middleware and operating systems to enable us to scale applications and systems easily, efficiently and reliably. * Performance evaluation studies and theoretical underpinnings of any of the above topics, especially those giving us perspective toward future generation high-performance computing. * Large scale installations in the Petaflop era: design, scaling, power, and reliability, including case studies and experience reports, to show the baselines for future systems. In order to encourage open discussion on future directions, the program committee will provide higher priority for papers that present highly innovative and challenging ideas. Papers should not exceed 6,000 words, and should be submitted electronically, in PDF format using the ICS'10 submission web site. Submissions should be blind. The review process will include a rebuttal period. Please refer to the ICS'10 web site for detailed instructions. Workshop and tutorial proposals are also be solicited and due by January 18, 2010. For further information and future updates, refer to the ICS'10 web site at http://www.ics-conference.org or contact the General Chair (ics10-chair at hpcs.cs.tsukuba.ac.jp) or Program Co-Chairs (ics10-chairs at ac.upc.edu). Important Dates Abstract submission: January 11, 2010 Paper submission: January 18, 2010 Author notification: March 22, 2010 Final papers: April 15, 2010 For more information, please visit the conference web site at http://www.ics-conference.org [ICS 2010 Committee Members] GENRAL CHAIR Taisuke Boku, U. Tsukuba PROGRAM CO-CHAIRS Hiroshi Nakashima, Kyoto U. Avi Mendelson, Microsoft FINANCE CHAIR Kazuki Joe, Nara Women's U. PUBLICATION CHAIR Osamu Tatebe, U. Tsukuba PUBLICITY CO-CHAIRS Darren Kerbyson, LANL Hironori Nakajo, Tokyo U. Agric. & Tech. Serge Petiton, CNRS/LIFL WORKSHOP & TUTORIAL CHAIR Koji Inoue, Kyushu U. POSTER CHAIR Masahiro Goshima, U. Tokyo WEB & SUBMISSION CO-CHAIRS Eduard Ayguade, BSC/UPC Alex Ramirez, BSC/UPC LOCAL ARRANGEMENT CHAIR Daisuke Takahashi, U. Tsukuba PROGRAM COMMITTEE Jung Ho Ahn, Seoul NU. Eduard Ayguade, BSC/UPC Carl Beckmann, Intel Muli Ben-Yehuda, IBM Gianfranco Bilardi, U. Padova Greg Byrd, NCSU Franck Cappello, INRIA/UIUC Marcelo Cintra, U. Edinburgh Luiz De Rose, Cray Bronis De Supinski, LLNL/CASC Jack Dongarra, UTenn/ORNL Eitan Frachtenberg, Facebook Kyle Gallivan, FSU Stratis Gallopoulos, ,U. Patras Milind Girkar, Intel Bill Gropp, UIUC Mike Heroux, SNL Adolfy Hoisie, LANL Koh Hotta, Fujitsu Yutaka Ishikawa, U. Tokyo Takeshi Iwashita, Kyoto U. Kazuki Joe, Nara Woman's U. Hironori Kasahara, U. Waseda Arun Kejariwal, Yahoo Darren Kerbyson, LANL Moe Khaleel, PNNL Bill Kramer, NCSA Andrew Lewis, Griffith U. Jose Moreira, IBM Walid Najjar, U.C. Riverside Kengo Nakajima, U. Tokyo Hironori Nakajo, Tokyo U. Agric. & Tech. Hiroshi Nakamura, U. Tokyo Toshio Nakatani, IBM Research Tokyo Michael O'Boyle, U. Edinburgh Lenny Oliker, LBNL Theodore Papatheodoro, U. Patras Miquel Pericas, BSC Keshav Pingali, U. Texas Depei Qian, Beihang U. Alex Ramirez, BSC/UPC Valentina Salapura, IBM Mitsuhisa Sato, U. Tsukuba John Shalf, LBNL Takeshi Shimizu, Fujitsu Joshua Simons, Sun Microsystems Shinji Sumimoto, Fujitsu Makoto Taiji, Riken Toshikazu Takada, Riken Daisuke Takahashi, U. Tsukuba Guangming Tan, ICT Osamu Tatebe, U. Tsukuba Kenjiro Taura, U. 
Tokyo Rajeev Thakur, ANL Rong Tian, NCIC Robert Van Engelen, FSU Harry Wijshoff, Leiden Mitsuo Yokokawa, Riken Ayal Zaks, IBM Yunquan Zhang, ISCAS From djk at lanl.gov Mon Dec 14 07:32:46 2009 From: djk at lanl.gov (Darren) Date: Mon, 14 Dec 2009 08:32:46 -0700 Subject: [Beowulf] [hpc-announce] Extended Deadline: Workshop on Large-Scale Parallel Processing (LSPP'10) Message-ID: <4B265A9E.5010609@lanl.gov> [Please accept our apologies if you receive multiple copies] ----------------------------------------------------------------- Call for papers: Workshop on LARGE-SCALE PARALLEL PROCESSING to be held in conjunction with IEEE International Parallel and Distributed Processing Symposium Atlanta, Georgia April 23rd, 2010 EXTENDED DEADLINE: December 18th 2009 (Final) Selected work presented at the workshop will be published in a special issue of Parallel Processing Letters. ----------------------------------------------------------------- The workshop on Large-Scale Parallel Processing is a forum that focuses on computer systems that utilize thousands of processors and beyond. This is a very active area given the goals of many researchers world-wide to enhance science-by-simulation through installing large-scale multi-petaflop systems at the start of the next decade. Large-scale systems, referred to by some as extreme-scale and Ultra-scale, have many important research aspects that need detailed examination in order for their effective design, deployment, and utilization to take place. These include handling the substantial increase in multi-core on a chip, the ensuing interconnection hierarchy, communication, and synchronization mechanisms. The workshop aims to bring together researchers from different communities working on challenging problems in this area for a dynamic exchange of ideas. Work at early stages of development as well as work that has been demonstrated in practice is equally welcome. Of particular interest are papers that identify and analyze novel ideas rather than providing incremental advances in the following areas: - LARGE-SCALE SYSTEMS : exploiting parallelism at large-scale, the coordination of large numbers of processing elements, synchronization and communication at large-scale, programming models and productivity - MULTI-CORE : utilization of increased parallelism on a single chip (MPP on a chip such as the Cell and GPUs), the possible integration of these into large-scale systems, and dealing with the resulting hierarchical connectivity. - NOVEL ARCHITECTURES AND EXPERIMENTAL SYSTEMS : the design of novel systems, the use of processors in memory (PIMS), parallelism in emerging technologies, future trends. - APPLICATIONS : novel algorithmic and application methods, experiences in the design and use of applications that scale to large-scales, overcoming of limitations, performance analysis and insights gained. Results of both theoretical and practical significance will be considered, as well as work that has demonstrated impact at small-scale that will also affect large-scale systems. Work may involve algorithms, languages, various types of models, or hardware. ----------------------------------------------------------------- SUBMISSION GUIDELINES Papers should not exceed eight single-space pages (including figures, tables and references) using a 12-point font on 8.5x11 inch pages. Submissions in PostScript or PDF should be made using EDAS (www.edas.info). Informal enquiries can be made to djk at lanl.gov. 
Submissions will be judged on correctness, originality, technical strength, significance, presentation quality and appropriateness. Submitted papers should not have appeared in or under consideration for another venue. IMPORTANT DATES Submission deadline: December 18th 2009 (Final) Notification of acceptance: January 15th 2010 Camera-Ready Papers due: February 1st 2010 ----------------------------------------------------------------- WORKSHOP CO-CHAIRS Darren J. Kerbyson Los Alamos National Laboratory Ram Rajamony IBM Austin Research Lab Charles Weems University of Massachusetts STEERING COMMITTEE Johnnie Baker Kent State University Alex Jones University of Pittsburgh H.J. Siegel Colorado State University PROGRAM COMMITTEE Ghoerge Almasi IBM T.J. Watson Research Lab Taisuke Boku University of Tsukuba, Japan Marco Daneluto University of Pisa Martin Herbordt Boston University Lei Huang University of Houston Daniel Katz University of Chicago Jesus Labarta Barcelona Supercomputer Center, Spain John Michalakes NCAR, Boulder Celso Mendes University of Illinois Urbana-Champagne Bernd Mohr Forschungszentrum Juelich, Germany Stathis Papaefstathiou Microsoft Michael Scherger Texas A&M University-Corpus Christi Harvey Wasserman NERSC/LBNL Gerhard Wellein University of Erlangen, Germany Pat Worley Oak Ridge National Laboratory Workshop Webpage: http://www.ccs3.lanl.gov/LSPP From sbyna at nec-labs.com Tue Dec 15 08:59:30 2009 From: sbyna at nec-labs.com (Surendra Byna) Date: Tue, 15 Dec 2009 11:59:30 -0500 Subject: [Beowulf] [hpc-announce] CfP: Special Issue of JPDC on "Data Intensive Computing", Submission: One month from Today Message-ID: <951A499AA688EF47A898B45F25BD8EE807039EEC@mailer.nec-labs.com> Dear Colleagues: The paper submission deadline for the Special Issue of Journal of Parallel and Distributed Computing (JPDC) on "Data Intensive Computing" is a month from Today (January 15th 2010). We welcome your submissions. We appreciate sharing this announcement with anyone who might be interested. Thank you. Suren Byna NEC Labs America, Inc. 4 Independence Way, Suite 200 Princeton, NJ. Xian-He Sun Department of Computer Science Illinois Institute of Technology Chicago, IL. ====================================================================== Our apologies for duplicated copies for this CfP ====================================================================== Call for Papers: Special Issue of Journal of Parallel and Distributed Computing on "Data Intensive Computing" ------------------------------------------------------------------------ --- Data intensive computing is posing many challenges in exploiting parallelism of current and upcoming computer architectures. Data volumes of applications in the fields of sciences and engineering, finance, media, online information resources, etc. are expected to double every two years over the next decade and further. With this continuing data explosion, it is necessary to store and process data efficiently by utilizing enormous computing power that is available in the form of multicore/manycore platforms. There is no doubt in the industry and research community that the importance of data intensive computing has been raising and will continue to be the foremost fields of research. This raise brings up many research issues, in forms of capturing and accessing data effectively and fast, processing it while still achieving high performance and high throughput, and storing it efficiently for future use. 
Programming for high performance yielding data intensive computing is an important challenging issue. Expressing data access requirements of applications and designing programming language abstractions to exploit parallelism are at immediate need. Application and domain specific optimizations are also parts of a viable solution in data intensive computing. While these are a few examples of issues, research in data intensive computing has become quite intense during the last few years yielding strong results. This special issue of the Journal Parallel and Distributed Computing (JPDC) is seeking original unpublished research articles that describe recent advances and efforts in the design and development of data intensive computing, functionalities and capabilities that will benefit many applications. Topics of interest include (but are not limited to): * Data-intensive applications and their challenges * Storage and file systems * High performance data access toolkits * Fault tolerance, reliability, and availability * Meta-data management * Remote data access * Programming models, abstractions for data intensive computing * Compiler and runtime support * Data capturing, management, and scheduling techniques * Future research challenges of data intensive computing * Performance optimization techniques * Replication, archiving, preservation strategies * Real-time data intensive computing * Network support for data intensive computing * Challenges and solutions in the era of multi/many-core platforms * Stream computing * Green (Power efficient) data intensive computing * Security and protection of sensitive data in collaborative environments * Data intensive computing on accelerators and GPUs Guide for Authors Papers need not be solely abstract or conceptual in nature: proofs and experimental results can be included as appropriate. Authors should follow the JPDC manuscript format as described in the "Information for Authors" at the end of each issue of JPDC or at http://ees.elsevier.com/jpdc/ . The journal version will be reviewed as per JPDC review process for special issues. Important Dates: Paper Submission : January 15, 2010 Notification of Acceptance/Rejection : May 31, 2010 Final Version of the Paper : September 15, 2010 Submission Guidelines All manuscripts and any supplementary material should be submitted through Elsevier Editorial System (EES) at http://ees.elsevier.com/jpdc. Authors must select "Special Issue: Data Intensive Computing" when they reach the "Article Type" step in the submission process. First time users must register themselves as Author. For the latest details of the JPDC special issue see http://www.cs.iit.edu/~suren/jpdc. Guest Editors: Dr. Surendra Byna NEC Labs America E-mail: sbyna at nec-labs.com Prof. Xian-He Sun Illinois Institute of Technology E-mail: sun at cs.iit.edu -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rpnabar at gmail.com Tue Dec 15 11:24:27 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 15 Dec 2009 13:24:27 -0600 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: <437A0D5B-D5F4-4F7A-B8C6-9B09EC723F47@myri.com> References: <4B23C791.5070701@myri.com> <20091214125720.218a9397.chekh@pcbi.upenn.edu> <437A0D5B-D5F4-4F7A-B8C6-9B09EC723F47@myri.com> Message-ID: On Tue, Dec 15, 2009 at 11:49 AM, Scott Atchley wrote: > On Dec 14, 2009, at 12:57 PM, Alex Chekholko wrote: > >> Set it as high as you can; there is no downside except ensuring all >> your devices are set to handle that large unit size. ?Typically, if the >> device doesn't support jumbo frames, it just drops the jumbo frames >> silently, which can result in odd intermittent problems. > > You can test it by using the size parameter with ping: > > $ ping -s > > If they all drop, then you have exceeded the MTU of some device. Thanks Scott. 9000 seems the max. Neither my Switch nor my eth adapter like higher values. -- Rahul From gus at ldeo.columbia.edu Tue Dec 15 11:36:51 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 15 Dec 2009 14:36:51 -0500 Subject: [Beowulf] Performance degrading In-Reply-To: <200912141517.15432.jorg.sassmannshausen@strath.ac.uk> References: <200912141517.15432.jorg.sassmannshausen@strath.ac.uk> Message-ID: <4B27E553.2020702@ldeo.columbia.edu> Hi Jorg If you have single quad core nodes as you said, then top shows that you are oversubscribing the cores. There are five nwchem processes are running. In my experience, oversubscription only works in relatively light MPI programs (say the example programs that come with OpenMPI or MPICH). Real world applications tend to be very inefficient, and can even hang on oversubscribed CPUs. What happens when you launch four or less processes on a node instead of five? My $0.02. Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- J?rg Sa?mannshausen wrote: > Dear all, > > I am scratching my head but apart from getting splinters into my fingers I > cannot find a good answer for the following problem: > I am running a DFT program (NWChem) in parallel on our cluster (AMD Opterons, > single quad cores in the node, 12 GB of RAM, Gigabit network) and at certain > stages of the run top is presenting me with that: > > top - 15:10:48 up 13 days, 22:20, 1 user, load average: 0.26, 0.24, 0.19 > Tasks: 106 total, 1 running, 105 sleeping, 0 stopped, 0 zombie > Cpu0 : 8.0% us, 2.7% sy, 0.0% ni, 82.7% id, 0.0% wa, 1.3% hi, 5.3% si > Cpu1 : 4.1% us, 1.4% sy, 0.0% ni, 94.6% id, 0.0% wa, 0.0% hi, 0.0% si > Cpu2 : 2.7% us, 0.0% sy, 0.0% ni, 97.3% id, 0.0% wa, 0.0% hi, 0.0% si > Cpu3 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si > Mem: 12250540k total, 5581756k used, 6668784k free, 273396k buffers > Swap: 16779884k total, 0k used, 16779884k free, 3841688k cached > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 16885 sassy 15 0 3928m 1.7g 1.4g S 4 14.4 312:19.92 nwchem > 16886 sassy 15 0 3928m 1.7g 1.4g S 4 14.5 313:08.77 nwchem > 16887 sassy 15 0 3920m 1.7g 1.4g S 3 14.4 316:18.24 nwchem > 16888 sassy 15 0 3923m 1.6g 1.3g S 3 13.3 316:13.55 nwchem > 16890 sassy 15 0 2943m 1.7g 1.7g S 3 14.8 104:32.33 nwchem > > It is not a few seconds it does it, it appears to be for a prolonged period of > time. 
I checked it randomly for say 1 min and the performance is well below > 50 % (most of the time around 20 %). I have not noticed that when I am > running the job within one node. > > I have the suspicion that the Gigabit network is the problem, but I really > would like to pinpoint that so I can get my boss to upgrade to a better > network for parallel computing (hence my previous question about Open-MX). > Now how, as I am not an admin of that cluster, would I be able to do that? > > Thanks for your comments. > > Best wishes from Glasgow! > > J?rg > From Glen.Beane at jax.org Tue Dec 15 12:10:47 2009 From: Glen.Beane at jax.org (Glen Beane) Date: Tue, 15 Dec 2009 15:10:47 -0500 Subject: [Beowulf] Performance degrading In-Reply-To: <4B27E553.2020702@ldeo.columbia.edu> Message-ID: On 12/15/09 2:36 PM, "Gus Correa" wrote: If you have single quad core nodes as you said, then top shows that you are oversubscribing the cores. There are five nwchem processes are running. It has been a very long time, but wasn't that normal behavior for mpich under certain instances? If I recall correctly it had an extra process that was required by the implementation. I don't think it returned from MPI_Init, so you'd have a bunch of processes consuming nearly a full CPU and then one that was mostly idle doing something behind the scenes. I don't remember if this was for mpich/p4 (with or without -with-comm=shared) or for mpich-gm. -- Glen L. Beane Software Engineer The Jackson Laboratory Phone (207) 288-6153 -------------- next part -------------- An HTML attachment was scrubbed... URL: From lindahl at pbm.com Tue Dec 15 13:22:56 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 15 Dec 2009 13:22:56 -0800 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: <437A0D5B-D5F4-4F7A-B8C6-9B09EC723F47@myri.com> References: <4B23C791.5070701@myri.com> <20091214125720.218a9397.chekh@pcbi.upenn.edu> <437A0D5B-D5F4-4F7A-B8C6-9B09EC723F47@myri.com> Message-ID: <20091215212256.GC28010@bx9.net> On Tue, Dec 15, 2009 at 12:49:23PM -0500, Scott Atchley wrote: > You can test it by using the size parameter with ping: > > $ ping -s > > If they all drop, then you have exceeded the MTU of some device. [lindahl at greg-desk b]$ ping -s 60000 rich-desk PING rich-desk (64.13.159.69) 60000(60028) bytes of data. 60008 bytes from rich-desk (64.13.159.69): icmp_seq=1 ttl=64 time=1.36 ms 60008 bytes from rich-desk (64.13.159.69): icmp_seq=2 ttl=64 time=1.32 ms I never knew I bought such advanced network gear! -- greg From chekh at pcbi.upenn.edu Tue Dec 15 13:43:49 2009 From: chekh at pcbi.upenn.edu (Alex Chekholko) Date: Tue, 15 Dec 2009 16:43:49 -0500 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: <20091215212256.GC28010@bx9.net> References: <4B23C791.5070701@myri.com> <20091214125720.218a9397.chekh@pcbi.upenn.edu> <437A0D5B-D5F4-4F7A-B8C6-9B09EC723F47@myri.com> <20091215212256.GC28010@bx9.net> Message-ID: <20091215164349.6d413f8d.chekh@pcbi.upenn.edu> On Tue, 15 Dec 2009 13:22:56 -0800 Greg Lindahl wrote: > On Tue, Dec 15, 2009 at 12:49:23PM -0500, Scott Atchley wrote: > > > You can test it by using the size parameter with ping: > > > > $ ping -s > > > > If they all drop, then you have exceeded the MTU of some device. > > [lindahl at greg-desk b]$ ping -s 60000 rich-desk > PING rich-desk (64.13.159.69) 60000(60028) bytes of data. 
> 60008 bytes from rich-desk (64.13.159.69): icmp_seq=1 ttl=64 time=1.36 ms > 60008 bytes from rich-desk (64.13.159.69): icmp_seq=2 ttl=64 time=1.32 ms > > I never knew I bought such advanced network gear! Not sure if you're joking, but yes, you also have to tell ping to set the "don't fragment" bit. So on Ubuntu 9.04 it would be: ping -M do -s SIZE whatever.host.com Regards, -- Alex Chekholko chekh at pcbi.upenn.edu From jac67 at georgetown.edu Tue Dec 15 14:22:24 2009 From: jac67 at georgetown.edu (Jess Cannata) Date: Tue, 15 Dec 2009 17:22:24 -0500 Subject: [Beowulf] PXE/TFTP and Xen Kernel Issues Message-ID: <4B280C20.5060700@georgetown.edu> I'm having a problem booting Xen kernels via PXE. I want to boot a machine via PXE that will then host Xen virtual machines. The client machine PXE boots, receives the pxelinux.0 file, and then grabs the Xen kernel (vmlinuz-2.6.18-164.6.1.el5xen). However, it can never load the Xen kernel. On the client, I get the following error: Invalid or corrupt kernel image. I have tried the following three kernels (two stock Centos kernels and one custom compiled kernel) and only the Xen kernel fails: -rw-r--r-- 1 root root 2030154 Dec 10 15:28 vmlinuz-2.6.18-164.6.1.el5xen -rw-r--r-- 1 root root 1932284 Sep 25 16:17 vmlinuz-2.6.18-164.el5 -rw-r--r-- 1 root root 3277584 Dec 10 15:29 vmlinuz-2.6.27.15-jw-node The others load without error. I have checked multiple times that the Xen kernel is not corrupt via md5sums and by booting it via grub. It just seems not to like the PXE system. Here is a snippet of the dnsmasq log to show that the file is sent correctly to the client: Dec 11 04:12:57 julie dnsmasq[9117]: TFTP sent /tftpboot/pxelinux.0 to 192.168.0.6 Dec 11 04:12:57 julie dnsmasq[9117]: TFTP sent /tftpboot/pxelinux.cfg/default to 192.168.0.6 Dec 11 04:12:57 julie dnsmasq[9117]: TFTP sent /tftpboot/vmlinuz-2.6.18-164.6.1.el5xen to 192.168.0.6 I have tried three different systems for the DHCP, TFTP, and PXE Servers (using stock RHEL/Centos packages). Here are the specs: System 1 Centos 5.4 (64-bit) with nvidia Ethernet adapters dnsmasq for both DHCP and TFTP Servers syslinux for PXE System 2 Centos 5.4 (64-bit) with e1000 Ethernet adapters dnsmasq for both DHCP and TFTP Servers syslinux for PXE System 3 Centos 5.3 (32-bit) with e1000 Ethernet adapters (trying 32-bit version of the Xen kernel) Config One: dnsmasq for both DHCP and TFTP Servers syslinux for PXE Config Two: dnsmasq for DHCP Server tftp-server for TFTP Server syslinux for PXE The client machines use the same hardware as the servers. I haven't seen anything about Xen kernels having issues with PXE. Before I start trying different flavors of Linux, I'm curious if anyone else has seen or heard of this problem. Many thanks in advance. Jess From gus at ldeo.columbia.edu Tue Dec 15 17:04:10 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 15 Dec 2009 20:04:10 -0500 Subject: [Beowulf] Performance degrading In-Reply-To: References: Message-ID: <4B28320A.7040604@ldeo.columbia.edu> Hi Glen, Jorg Glen: Yes, you are right about MPICH1/P4 starting extra processes. However, I wonder if that is what is happening to Jorg, of if what he reported is just plain CPU oversubscription. Jorg: Do you use MPICH1/P4? How many processes did you launch on a single node, four or five? Glen: Out of curiosity, I dug out the MPICH1/P4 I still have on an old system, compiled and ran "cpi.c". Indeed there are extra processes there, besides the ones that I intentionally started in the mpirun command line. 
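A quick way to tell idle launcher/helper processes apart from real CPU oversubscription is to compare the number of runnable processes against the number of cores on one node — a minimal sketch, assuming standard procps tools and that the application's processes are literally named nwchem:

    # cores the kernel sees on this node
    grep -c '^processor' /proc/cpuinfo
    # all nwchem processes, with state and %CPU (no header)
    ps -C nwchem -o pid=,stat=,pcpu=
    # how many of them are actually runnable (state R) right now
    ps -C nwchem -o stat= | grep -c '^R'

If the runnable count stays above the core count, the ranks are time-slicing against each other; if the extra processes sit in state S at roughly 0% CPU, they are just launch/helper daemons of the kind discussed in this thread.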
When I launch two processes on a two-single-core-CPU machine, I also get two (not only one) extra processes, for a total of four. However, as you mentioned, the extra processes do not seem to use any significant CPU. Top shows the two actual processes close to 100% and the extra ones close to zero. Furthermore, the extra processes don't use any significant memory either. Anyway, in Jorg's case all processes consumed about the same (low) amount of CPU, but ~15% memory each, and there were 5 processes (only one "extra"?, is it one per CPU socket? is it one per core? one per node?). Hence, I would guess Jorg's context is different. But ... who knows ... only Jorg can clarify. These extra processes seem to be related to the mechanism used by MPICH1/P4 to launch MPI programs. They don't seem to appear in recent OpenMPI or MPICH2, which have other launching mechanisms. Hence my guess that Jorg had an oversubscription problem. Considering that MPICH1/P4 is old, no longer maintained, and seems to cause more distress than joy in current kernels, I would not recommend it to Jorg or to anybody anyway. Thank you, Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Glen Beane wrote: > > > > On 12/15/09 2:36 PM, "Gus Correa" wrote: > > If you have single quad core nodes as you said, > then top shows that you are oversubscribing the cores. > There are five nwchem processes are running. > > > > It has been a very long time, but wasn't that normal behavior for mpich > under certain instances? If I recall correctly it had an extra process > that was required by the implementation. I don't think it returned from > MPI_Init, so you'd have a bunch of processes consuming nearly a full CPU > and then one that was mostly idle doing something behind the scenes. I > don't remember if this was for mpich/p4 (with or without > -with-comm=shared) or for mpich-gm. > > > > > -- > Glen L. Beane > Software Engineer > The Jackson Laboratory > Phone (207) 288-6153 > > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From patrick at myri.com Tue Dec 15 20:05:26 2009 From: patrick at myri.com (Patrick Geoffray) Date: Tue, 15 Dec 2009 23:05:26 -0500 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: References: <4B23C791.5070701@myri.com> Message-ID: <4B285C86.10902@myri.com> Rahul Nabar wrote: > Thanks! So I could push it beyond 9000 as well? 1500 Bytes is the standard MTU for Ethernet, anything larger is out of spec. The convention for a larger MTU is Jumbo Frames at 9000 Bytes, and most switches support it these days. Some hardware even supports Super Jumbo Frames at 64K, but it's rare (and useless IMHO). Since Jumbo Frames are out of spec, they are typically not enabled by default in switches.

Patrick
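Pulling the thread's ping advice together into one end-to-end check — a minimal sketch, assuming Linux iputils ping, a 9000-byte target MTU, and placeholder interface/host names:

    # 8972 = 9000 minus 20 bytes of IP header and 8 bytes of ICMP header;
    # -M do sets the don't-fragment bit so fragmentation can't hide a smaller path MTU
    ping -M do -c 3 -s 8972 remotehost
    # if that fails, try smaller -s values to find the real path MTU, then set the
    # interface only once every NIC and switch port in the path agrees
    ip link set dev eth0 mtu 9000     # or: ifconfig eth0 mtu 9000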
From patrick at myri.com Tue Dec 15 20:26:23 2009 From: patrick at myri.com (Patrick Geoffray) Date: Tue, 15 Dec 2009 23:26:23 -0500 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: References: <4B23C791.5070701@myri.com> Message-ID: <4B28616F.3000600@myri.com> Rahul Nabar wrote: > The TSO and LRO are only relevant to TCP though, aren't they? I am > using RDMA so that shouldn't matter. Maybe I am wrong. TSO/LRO applies to TCP, but you can have the same technique with a different protocol, USO for UDP Send Offload for example. RDMA is everything you want it to be, but it is not a wire protocol. Anyway, NICs that implement zero-copy communication are in a way offloading segmentation and reassembly too. BTW, if your RDMA-capable hardware is running iWarp, it is using TCP on the wire. Patrick From patrick at myri.com Tue Dec 15 20:32:38 2009 From: patrick at myri.com (Patrick Geoffray) Date: Tue, 15 Dec 2009 23:32:38 -0500 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: References: <4B23C791.5070701@myri.com> Message-ID: <4B2862E6.2030501@myri.com> Rahul Nabar wrote: > What was your tool to measure this latency? Just curious. I like to use netperf to measure performance over Sockets, including latency (it's there but not obvious). For OS-bypass interfaces, your favorite MPI benchmark is fine. Patrick From patrick at myri.com Tue Dec 15 20:41:59 2009 From: patrick at myri.com (Patrick Geoffray) Date: Tue, 15 Dec 2009 23:41:59 -0500 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: References: Message-ID: <4B286517.10009@myri.com> Bogdan Costescu wrote: > long as it fits in one page. At another time, the switch was more > likely to drop large frames under high load (maybe something to do > with internal memory management), so the 9000bytes frames worked most > of the time while the 1500bytes ones worked all the time... This is an important point. The way hardware flow-control works in Ethernet, a switch has to be able to buffer two full frames plus the time on the wire for the round-trip. For the curious, the PAUSE packets are sent in-band and you cannot send or receive partial frames. So, instead of requiring ~4K per port minimum, you need about ~20K per port. Add to that up to 8 priorities with DCB and the buffering requirements are quickly getting out of hand. That's one of the big drawbacks of large MTUs, along with contention with wormhole switching. Patrick From lindahl at pbm.com Wed Dec 16 00:01:18 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 16 Dec 2009 00:01:18 -0800 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: <4B286517.10009@myri.com> References: <4B286517.10009@myri.com> Message-ID: <20091216080118.GB8679@bx9.net> On Tue, Dec 15, 2009 at 11:41:59PM -0500, Patrick Geoffray wrote: > So, instead of requiring ~4K per port minimum, you need about ~20K per > port. Add to that up to 8 priorities with DCB and the buffering > requirements are quickly getting out of hand. Don't worry, switch vendors will simply implement it all poorly, just like InfiniBand. That's what always happens with overly-complicated QOS schemes.
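On Patrick's netperf tip above: the latency number is hiding in the request/response tests, where the reported transaction rate is the inverse of the round-trip time. A sketch, assuming netperf and its netserver daemon are installed on both hosts and the host name is a placeholder:

    netserver                                         # start the listener on the remote host
    netperf -H remotehost -t TCP_RR -l 30 -- -r 1,1   # 1-byte request/response test for 30 s
    # round-trip latency in microseconds is roughly 1e6 / (transactions per second reported)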
-- greg From rpnabar at gmail.com Wed Dec 16 08:02:29 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 16 Dec 2009 10:02:29 -0600 Subject: [Beowulf] Performance tuning for Jumbo Frames In-Reply-To: <4B286517.10009@myri.com> References: <4B286517.10009@myri.com> Message-ID: On Tue, Dec 15, 2009 at 10:41 PM, Patrick Geoffray wrote: > Bogdan Costescu wrote: > So, instead of requiring ~4K per port minimum, you need about ~20K per port. > Add to that up to 8 priorities with DCB and the buffering requirement are > quickly getting out of hand. That's one big drawbacks of large MTUs, along > with contention with wormhole switching. On closer investigation I am seeing TXPause frames and dropped packets. Have to dig deeper into this.Gotta figure out how much RAM per port this switch has. -- Rahul From lindahl at pbm.com Wed Dec 16 09:33:05 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 16 Dec 2009 09:33:05 -0800 Subject: [Beowulf] Intel compiler part of the anti-trust lawsuit Message-ID: <20091216173305.GA4233@bx9.net> You folks will recall that Intel, a while ago, stopped enabling their compiler's highest optimization levels for chips that weren't "Genuine Intel(tm)". Well, that's part of the new FTC complaint against Intel: Intel secretly redesigned key software, known as a compiler, in a way that deliberately stunted the performance of competitors? CPU chips. Intel told its customers and the public that software performed better on Intel CPUs than on competitors? CPUs, but the company deceived them by failing to disclose that these differences were due largely or entirely to Intel?s compiler design. PathScale was subpoenaed a long time ago by both AMD and Intel about this issue for the AMD/Intel lawsuit, recently settled. The bundling of chipsets with Atom processors (it's cheaper to buy both than a naked cpu) seems to also be part of the suit. -- greg From gus at ldeo.columbia.edu Wed Dec 16 12:03:34 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed, 16 Dec 2009 15:03:34 -0500 Subject: [Beowulf] A question about antique hardware Message-ID: <4B293D16.5090704@ldeo.columbia.edu> Dear Beowulfers Did anybody ever get Gigabit Ethernet NICs to work on the Tyan Tiger S2466-4M motherboards under Linux? If so, I would appreciate any words of wisdom about which NICs work, the appropriate BIOS settings, which PCI slots to use, etc. *** I flashed the Tyan S2466-4M BIOS to the latest version, V4.06 (super, final 2003 edition). I need to set this head node up with two GigE ports. I have two Intel 82543 Fiber Gigabit Ethernet PCI adapters, which use the e1000 driver. However, I would happily use other NICs and drivers, anything that works, including copper based GigE. *** I googled up to find tips and solutions, and I tried a number of different combinations: disabling the onboard 3Com Ethernet 100 port with a jumper; placing the NICs on the PCI-64 and on the PCI-32 slots; disabling the BIOS "option RAM scan" on the NICs' PCI slots; disabling USB on BIOS; trying one NIC at a time; etc. However, so far no game. The NICs are recognized, link LEDs light up, ping works, but the system seems to be unstable, ifdown/ifup hangs, hence the system hangs when it tries to take down the GigE ports during shutdown. Moreover, I get many of this kernel message on dmesg: Warning: kfree_skb on hard IRQ f88e47b2 *** This is the head node of our old Linux NetworX cluster. The original head node motherboard, ASUS A7M266, supported the aforementioned Intel NICs. Unfortunately it seems to have died. 
I bought an used-but-functional Tyan S2466-4M board on E-Bay as a replacement. These S2466-4M boards seem to have been very popular on servers and Beowulfs. It sounded to me as a good choice. After all, we have this board on all compute nodes. The compute nodes don't have GigE, only the onboard 3Com Ethernet 100 for service and I/O, plus Myrinet-2000 for MPI. They have been working fine for 8 years now. *** Thank you. Happy Holidays! Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- From gus at ldeo.columbia.edu Wed Dec 16 12:32:41 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed, 16 Dec 2009 15:32:41 -0500 Subject: [Beowulf] Intel compiler part of the anti-trust lawsuit In-Reply-To: <20091216173305.GA4233@bx9.net> References: <20091216173305.GA4233@bx9.net> Message-ID: <4B2943E9.8080202@ldeo.columbia.edu> Incidentally, have you watched the "cannonbells" movie? http://www.reghardware.co.uk/2009/12/16/intel_chime_stunt/print.html http://www.theregister.co.uk/2009/12/16/intel_ftc/ Gus Correa Greg Lindahl wrote: > You folks will recall that Intel, a while ago, stopped enabling their > compiler's highest optimization levels for chips that weren't "Genuine > Intel(tm)". Well, that's part of the new FTC complaint against Intel: > > Intel secretly redesigned key software, known as a compiler, in a way > that deliberately stunted the performance of competitors? CPU > chips. Intel told its customers and the public that software performed > better on Intel CPUs than on competitors? CPUs, but the company > deceived them by failing to disclose that these differences were due > largely or entirely to Intel?s compiler design. > > PathScale was subpoenaed a long time ago by both AMD and Intel about > this issue for the AMD/Intel lawsuit, recently settled. > > The bundling of chipsets with Atom processors (it's cheaper to buy > both than a naked cpu) seems to also be part of the suit. > > -- greg > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at pbm.com Wed Dec 16 13:26:34 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 16 Dec 2009 13:26:34 -0800 Subject: [Beowulf] A question about antique hardware In-Reply-To: <4B293D16.5090704@ldeo.columbia.edu> References: <4B293D16.5090704@ldeo.columbia.edu> Message-ID: <20091216212634.GB29173@bx9.net> > Did anybody ever get Gigabit Ethernet NICs to work on > the Tyan Tiger S2466-4M motherboards under Linux? You talked a lot about the BIOS but didn't say if it was a 6 year old Linux version. Presumably your old mobos worked fine with the same version of Linux that this guy is failing with, but still, it could be a Linux-related problem, and not the motherboard or BIOS. -- greg From mathog at caltech.edu Wed Dec 16 14:27:29 2009 From: mathog at caltech.edu (David Mathog) Date: Wed, 16 Dec 2009 14:27:29 -0800 Subject: [Beowulf] Geriatric computer does not stay up Message-ID: So we have a cluster of Tyan S2466 nodes and one of them has failed in an odd way. (Yes, these are very old, and they would be gone if we had a replacment.) 
On applying power the system boots normally and gets far into the boot sequence, sometimes to the login prompt, then it locks up. If booted failsafe it will stay up for tens of minutes before locking. It locked once on "man smartctl" and once on "service network start". However, on the next reboot, it didn't lock with another "man smartctl", so it isn't like it hit a bad part of the disk and died. Smartctl test has not been run, but "smartctl -a /dev/hda" on the one disk shows it as healthy with no blocks swapped out. Power stays on when it locks, and the display remains as it was just before the lock. When it locks it will not respond to either the keyboard or the network. (The network interface light still flashes.) There is nothing in any of the logs to indicate the nature of the problem. The odd thing is that the system is remarkably stable in some ways. For instance, the PS tests good and heat isn't the issue: after running sensors in a tight loop to a log file, waiting for it to lock up, then looking at the log on the next failsafe boot, there were negligible fluctuation on any of the voltages, fan speeds, or temperatures. It will happily sit for 30 minutes in the BIOS, or hours running memtest86 (without errors). The motherboard battery is good, and the inside of the case is very clean, with no dust visible at all. Reset the BIOS but it didn't change anything. Here are my current hypotheses for what's wrong with this beast: 1. The drive is failing electrically, puts voltage spikes out on some operations, and these crash the system. 2. The motherboard capacitors are failing and letting too much noise in. The noise which is fatal is only seen on an active system, so sitting in the BIOS or in Memtest86 does not do it. (But the caps all look good, no swelling, no leaks.) It will run memtest86 overnight though, just in case. 3. The PS capacitors are failing, so that when loaded there is enough voltage fluctuation to crash the system. (Does not agree very well with the sensors measurements, but it could be really high frequency noise superimposed on a steady base voltage.) 4. Evil Djinn ;-( Any thoughts on what else this might be? Thanks. David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From prentice at ias.edu Wed Dec 16 14:39:06 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Wed, 16 Dec 2009 17:39:06 -0500 Subject: [Beowulf] Intel Cannonbells In-Reply-To: <4B2943E9.8080202@ldeo.columbia.edu> References: <20091216173305.GA4233@bx9.net> <4B2943E9.8080202@ldeo.columbia.edu> Message-ID: <4B29618A.3080808@ias.edu> Gus Correa wrote: > Incidentally, have you watched the "cannonbells" movie? > > http://www.reghardware.co.uk/2009/12/16/intel_chime_stunt/print.html > http://www.theregister.co.uk/2009/12/16/intel_ftc/ > > I'm calling BS on that one. While possible, I suspect the human projectiles were CG'ed in. -- Prentice From gerry.creager at tamu.edu Wed Dec 16 15:46:54 2009 From: gerry.creager at tamu.edu (Gerald Creager) Date: Wed, 16 Dec 2009 17:46:54 -0600 Subject: [Beowulf] Geriatric computer does not stay up In-Reply-To: References: Message-ID: <4B29716E.4090600@tamu.edu> David Mathog wrote: > So we have a cluster of Tyan S2466 nodes and one of them has failed in > an odd way. (Yes, these are very old, and they would be gone if we had a > replacment.) On applying power the system boots normally and gets far > into the boot sequence, sometimes to the login prompt, then it locks up. 
> If booted failsafe it will stay up for tens of minutes before locking. > It locked once on "man smartctl" and once on "service network start". > However, on the next reboot, it didn't lock with another "man smartctl", > so it isn't like it hit a bad part of the disk and died. Smartctl test > has not been run, but "smartctl -a /dev/hda" on the one disk shows it as > healthy with no blocks swapped out. Power stays on when it locks, and > the display remains as it was just before the lock. When it locks it > will not respond to either the keyboard or the network. (The network > interface light still flashes.) There is nothing in any of the logs to > indicate the nature of the problem. > > The odd thing is that the system is remarkably stable in some ways. For > instance, the PS tests good and heat isn't the issue: after running > sensors in a tight loop to a log file, waiting for it to lock up, then > looking at the log on the next failsafe boot, there were negligible > fluctuation on any of the voltages, fan speeds, or temperatures. It > will happily sit for 30 minutes in the BIOS, or hours running memtest86 > (without errors). The motherboard battery is good, and the inside of > the case is very clean, with no dust visible at all. Reset the BIOS but > it didn't change anything. > > Here are my current hypotheses for what's wrong with this beast: > > 1. The drive is failing electrically, puts voltage spikes out on some > operations, and these crash the system. > 2. The motherboard capacitors are failing and letting too much noise in. > The noise which is fatal is only seen on an active system, so sitting > in the BIOS or in Memtest86 does not do it. (But the caps all look good, > no swelling, no leaks.) It will run memtest86 overnight though, just in > case. > 3. The PS capacitors are failing, so that when loaded there is enough > voltage fluctuation to crash the system. (Does not agree very well with > the sensors measurements, but it could be really high frequency noise > superimposed on a steady base voltage.) > 4. Evil Djinn ;-( > > Any thoughts on what else this might be? I'd also be suspicious of memory failures. We have had DIMM failures that were unseen on repeated MemTest86 runs until they failed hard, hard, HARD. While they were still trying to decide, they'd pass MemTest and we'd try using them. Capacitor failures are a potential problem but if the systems have been in a stable environment and not subject to a lot of thermal stressors, they should be fine. Especially the power supply caps shouldn't decide to get old and fail (I'm assuming you're talking electrolytics). The old paper electrolytics might have exhibited this behavior, but not even tantalums will do this. And, if tantalum caps go, they tend to be more spectacular and take lots of other parts with them. More to the point, (ceramic) chip caps that haven't been in a wet/moist/temp-varying, humid environment shouldn't crack and fail. Option 4 has potential, though. gc From bernard at vanhpc.org Wed Dec 16 16:13:08 2009 From: bernard at vanhpc.org (Bernard Li) Date: Wed, 16 Dec 2009 16:13:08 -0800 Subject: [Beowulf] PXE/TFTP and Xen Kernel Issues In-Reply-To: <4B280C20.5060700@georgetown.edu> References: <4B280C20.5060700@georgetown.edu> Message-ID: Hi Jess: With Xen-based kernels, you should be using the xen.gz "kernel" instead of vmlinuz. 
Here's what a grub entry looks like for booting Xen-based kernels: title CentOS (2.6.18-164.el5xen) root (hd0,0) kernel /xen.gz-3.4.0 module /vmlinuz-2.6.18-164.el5xen ro root=/dev/VolGroup00/LogVol00 module /initrd-2.6.18-164.el5xen.img Good luck! Cheers, Bernard On Tue, Dec 15, 2009 at 2:22 PM, Jess Cannata wrote: > I'm having a problem booting Xen kernels via PXE. I want to boot a machine > via PXE that will then host Xen virtual machines. The client machine PXE > boots, receives the pxelinux.0 file, and then grabs the Xen kernel > (vmlinuz-2.6.18-164.6.1.el5xen). However, it can never load the Xen kernel. > On the client, I get the following error: > > Invalid or corrupt kernel image. > > I have tried the following three kernels (two stock Centos kernels and one > custom compiled kernel) and only the Xen kernel fails: > > -rw-r--r-- 1 root root 2030154 Dec 10 15:28 vmlinuz-2.6.18-164.6.1.el5xen > -rw-r--r-- 1 root root 1932284 Sep 25 16:17 vmlinuz-2.6.18-164.el5 > -rw-r--r-- 1 root root 3277584 Dec 10 15:29 vmlinuz-2.6.27.15-jw-node > > The others load without error. I have checked multiple times that the Xen > kernel is not corrupt via md5sums and by booting it via grub. It just seems > not to like the PXE system. Here is a snippet of the dnsmasq log to show > that the file is sent correctly to the client: > > Dec 11 04:12:57 julie dnsmasq[9117]: TFTP sent /tftpboot/pxelinux.0 to > 192.168.0.6 > Dec 11 04:12:57 julie dnsmasq[9117]: TFTP sent > /tftpboot/pxelinux.cfg/default to 192.168.0.6 > Dec 11 04:12:57 julie dnsmasq[9117]: TFTP sent > /tftpboot/vmlinuz-2.6.18-164.6.1.el5xen to 192.168.0.6 > > I have tried three different systems for the DHCP, TFTP, and PXE Servers > (using stock RHEL/Centos packages). Here are the specs: > > System 1 > Centos 5.4 (64-bit) with nvidia Ethernet adapters > dnsmasq for both DHCP and TFTP Servers > syslinux for PXE > > System 2 > Centos 5.4 (64-bit) with e1000 Ethernet adapters > dnsmasq for both DHCP and TFTP Servers > syslinux for PXE > > System 3 > Centos 5.3 (32-bit) with e1000 Ethernet adapters (trying 32-bit version of > the Xen kernel) > Config One: > dnsmasq for both DHCP and TFTP Servers > syslinux for PXE > > Config Two: > dnsmasq for DHCP Server > tftp-server for TFTP Server > syslinux for PXE > > The client machines use the same hardware as the servers. I haven't seen > anything about Xen kernels having issues with PXE. Before I start trying > different flavors of Linux, I'm curious if anyone else has seen or heard of > this problem. > > Many thanks in advance. > > Jess > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From gus at ldeo.columbia.edu Wed Dec 16 17:28:04 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed, 16 Dec 2009 20:28:04 -0500 Subject: [Beowulf] Geriatric computer does not stay up In-Reply-To: <4B29716E.4090600@tamu.edu> References: <4B29716E.4090600@tamu.edu> Message-ID: <4B298924.3090805@ldeo.columbia.edu> Hi David Some of the built-in 3Com Ethernet 100 interfaces on Tyan S2466[-4M] motherboards we have here became flaky/failed after many years of use. Those are main boards in in several standalone workstations/PCs. I don't administer those systems, but I believe the symptoms were somewhat random, as those you describe. 
Disabling the onboard Ethernet (by jumper), and replacing them by PCI Ethernet 100 cards, gave those systems additional lifetime. Would this be the case of your cluster node? Interesting that I also posted today a note asking for help with Gigabit Ethernet on these very same motherboards! We also have them in an old workhorse cluster. Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Gerald Creager wrote: > David Mathog wrote: >> So we have a cluster of Tyan S2466 nodes and one of them has failed in >> an odd way. (Yes, these are very old, and they would be gone if we had a >> replacment.) On applying power the system boots normally and gets far >> into the boot sequence, sometimes to the login prompt, then it locks up. >> If booted failsafe it will stay up for tens of minutes before locking. >> It locked once on "man smartctl" and once on "service network start". >> However, on the next reboot, it didn't lock with another "man smartctl", >> so it isn't like it hit a bad part of the disk and died. Smartctl test >> has not been run, but "smartctl -a /dev/hda" on the one disk shows it as >> healthy with no blocks swapped out. Power stays on when it locks, and >> the display remains as it was just before the lock. When it locks it >> will not respond to either the keyboard or the network. (The network >> interface light still flashes.) There is nothing in any of the logs to >> indicate the nature of the problem. >> >> The odd thing is that the system is remarkably stable in some ways. For >> instance, the PS tests good and heat isn't the issue: after running >> sensors in a tight loop to a log file, waiting for it to lock up, then >> looking at the log on the next failsafe boot, there were negligible >> fluctuation on any of the voltages, fan speeds, or temperatures. It >> will happily sit for 30 minutes in the BIOS, or hours running memtest86 >> (without errors). The motherboard battery is good, and the inside of >> the case is very clean, with no dust visible at all. Reset the BIOS but >> it didn't change anything. >> >> Here are my current hypotheses for what's wrong with this beast: >> >> 1. The drive is failing electrically, puts voltage spikes out on some >> operations, and these crash the system. >> 2. The motherboard capacitors are failing and letting too much noise in. >> The noise which is fatal is only seen on an active system, so sitting >> in the BIOS or in Memtest86 does not do it. (But the caps all look good, >> no swelling, no leaks.) It will run memtest86 overnight though, just in >> case. >> 3. The PS capacitors are failing, so that when loaded there is enough >> voltage fluctuation to crash the system. (Does not agree very well with >> the sensors measurements, but it could be really high frequency noise >> superimposed on a steady base voltage.) >> 4. Evil Djinn ;-( >> >> Any thoughts on what else this might be? > > > I'd also be suspicious of memory failures. We have had DIMM failures > that were unseen on repeated MemTest86 runs until they failed hard, > hard, HARD. While they were still trying to decide, they'd pass MemTest > and we'd try using them. > > Capacitor failures are a potential problem but if the systems have been > in a stable environment and not subject to a lot of thermal stressors, > they should be fine. 
Especially the power supply caps shouldn't decide > to get old and fail (I'm assuming you're talking electrolytics). The > old paper electrolytics might have exhibited this behavior, but not even > tantalums will do this. And, if tantalum caps go, they tend to be more > spectacular and take lots of other parts with them. > > More to the point, (ceramic) chip caps that haven't been in a > wet/moist/temp-varying, humid environment shouldn't crack and fail. > > Option 4 has potential, though. > > gc > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at pbm.com Wed Dec 16 18:05:48 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 16 Dec 2009 18:05:48 -0800 Subject: [Beowulf] Kernel action relevant to us Message-ID: <20091217020548.GC19867@bx9.net> The following patch, not yet accepted into the kernel, should allow local TCP connections to start up faster, while remote ones keep the same behavior of slow start. ----- Forwarded message from chavey at google.com ----- From: chavey at google.com Date: Tue, 15 Dec 2009 13:15:28 -0800 To: davem at davemloft.net CC: netdev at vger.kernel.org, therbert at google.com, chavey at google.com, eric.dumazet at gmail.com Subject: [PATCH] Add rtnetlink init_rcvwnd to set the TCP initial receive window X-Mailing-List: netdev at vger.kernel.org Add rtnetlink init_rcvwnd to set the TCP initial receive window size advertised by passive and active TCP connections. The current Linux TCP implementation limits the advertised TCP initial receive window to the one prescribed by slow start. For short lived TCP connections used for transaction type of traffic (i.e. http requests), bounding the advertised TCP initial receive window results in increased latency to complete the transaction. Support for setting initial congestion window is already supported using rtnetlink init_cwnd, but the feature is useless without the ability to set a larger TCP initial receive window. The rtnetlink init_rcvwnd allows increasing the TCP initial receive window, allowing TCP connection to advertise larger TCP receive window than the ones bounded by slow start. Signed-off-by: Laurent Chavey --- include/linux/rtnetlink.h | 2 ++ include/net/dst.h | 2 -- include/net/tcp.h | 3 ++- net/ipv4/syncookies.c | 3 ++- net/ipv4/tcp_output.c | 17 +++++++++++++---- net/ipv6/syncookies.c | 3 ++- 6 files changed, 21 insertions(+), 9 deletions(-) diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h index adf2068..db6f614 100644 --- a/include/linux/rtnetlink.h +++ b/include/linux/rtnetlink.h @@ -371,6 +371,8 @@ enum #define RTAX_FEATURES RTAX_FEATURES RTAX_RTO_MIN, #define RTAX_RTO_MIN RTAX_RTO_MIN + RTAX_INITRWND, +#define RTAX_INITRWND RTAX_INITRWND __RTAX_MAX }; diff --git a/include/net/dst.h b/include/net/dst.h index 5a900dd..6ef812a 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -84,8 +84,6 @@ struct dst_entry * (L1_CACHE_SIZE would be too much) */ #ifdef CONFIG_64BIT - long __pad_to_align_refcnt[2]; -#else long __pad_to_align_refcnt[1]; #endif /* diff --git a/include/net/tcp.h b/include/net/tcp.h index 03a49c7..6f95d32 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -972,7 +972,8 @@ static inline void tcp_sack_reset(struct tcp_options_received *rx_opt) /* Determine a window scaling and initial window to offer. 
*/ extern void tcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd, __u32 *window_clamp, - int wscale_ok, __u8 *rcv_wscale); + int wscale_ok, __u8 *rcv_wscale, + __u32 init_rcv_wnd); static inline int tcp_win_from_space(int space) { diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c index a6e0e07..d43173c 100644 --- a/net/ipv4/syncookies.c +++ b/net/ipv4/syncookies.c @@ -356,7 +356,8 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb, tcp_select_initial_window(tcp_full_space(sk), req->mss, &req->rcv_wnd, &req->window_clamp, - ireq->wscale_ok, &rcv_wscale); + ireq->wscale_ok, &rcv_wscale, + dst_metric(&rt->u.dst, RTAX_INITRWND)); ireq->rcv_wscale = rcv_wscale; diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index fcd278a..ee42c75 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -179,7 +179,8 @@ static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts) */ void tcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd, __u32 *window_clamp, - int wscale_ok, __u8 *rcv_wscale) + int wscale_ok, __u8 *rcv_wscale, + __u32 init_rcv_wnd) { unsigned int space = (__space < 0 ? 0 : __space); @@ -228,7 +229,13 @@ void tcp_select_initial_window(int __space, __u32 mss, init_cwnd = 2; else if (mss > 1460) init_cwnd = 3; - if (*rcv_wnd > init_cwnd * mss) + /* when initializing use the value from init_rcv_wnd + * rather than the default from above + */ + if (init_rcv_wnd && + (*rcv_wnd > init_rcv_wnd * mss)) + *rcv_wnd = init_rcv_wnd * mss; + else if (*rcv_wnd > init_cwnd * mss) *rcv_wnd = init_cwnd * mss; } @@ -2254,7 +2261,8 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst, &req->rcv_wnd, &req->window_clamp, ireq->wscale_ok, - &rcv_wscale); + &rcv_wscale, + dst_metric(dst, RTAX_INITRWND)); ireq->rcv_wscale = rcv_wscale; } @@ -2342,7 +2350,8 @@ static void tcp_connect_init(struct sock *sk) &tp->rcv_wnd, &tp->window_clamp, sysctl_tcp_window_scaling, - &rcv_wscale); + &rcv_wscale, + dst_metric(dst, RTAX_INITRWND)); tp->rx_opt.rcv_wscale = rcv_wscale; tp->rcv_ssthresh = tp->rcv_wnd; diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c index 6b6ae91..c8982aa 100644 --- a/net/ipv6/syncookies.c +++ b/net/ipv6/syncookies.c @@ -267,7 +267,8 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb) req->window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW); tcp_select_initial_window(tcp_full_space(sk), req->mss, &req->rcv_wnd, &req->window_clamp, - ireq->wscale_ok, &rcv_wscale); + ireq->wscale_ok, &rcv_wscale, + dst_metric(dst, RTAX_INITRWND)); ireq->rcv_wscale = rcv_wscale; -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo at vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ----- End forwarded message ----- From h-bugge at online.no Thu Dec 17 00:33:36 2009 From: h-bugge at online.no (=?ISO-8859-1?Q?H=E5kon_Bugge?=) Date: Thu, 17 Dec 2009 09:33:36 +0100 Subject: [Beowulf] FTC and Intel Message-ID: <1E5EA6E5-75A5-486C-9B7E-E98B78BF5DD6@online.no> In http://www.ftc.gov/os/adjpro/d9341/091216intelcmpt.pdf, there are allegations against Intel, such as "20. Intel?s efforts to deny interoperability between competitors? (e.g., Nvidia, AMD, and Via) GPUs and Intel?s newest CPUs". I was unaware of this. Anyone know what kind of interoperability we are talking about here? 
H?kon From bcostescu at gmail.com Thu Dec 17 10:07:42 2009 From: bcostescu at gmail.com (Bogdan Costescu) Date: Thu, 17 Dec 2009 19:07:42 +0100 Subject: [Beowulf] A question about antique hardware In-Reply-To: <4B293D16.5090704@ldeo.columbia.edu> References: <4B293D16.5090704@ldeo.columbia.edu> Message-ID: On Wed, Dec 16, 2009 at 9:03 PM, Gus Correa wrote: > Did anybody ever get Gigabit Ethernet NICs to work on > the Tyan Tiger S2466-4M motherboards under Linux? I had a few nodes with this mainboard and used one Intel 82541 (not sure about the number though, it's a short card which supports both PCI-32 and PCI-64, Gbit over copper) in each. The nodes were usually stable under load-moderate load, but would not survive a combined net+disk load if I was to use them f.e. as NFS servers. Generally speaking, I had the impression that the interrupt handling on these mainboards was faulty; plus maybe also the power regulation... Nodes with these mainboards were the most unstable that I've ever had so, for reduced headache, I would suggest just going with something more recent. Cheers, Bogdan From mathog at caltech.edu Thu Dec 17 10:32:15 2009 From: mathog at caltech.edu (David Mathog) Date: Thu, 17 Dec 2009 10:32:15 -0800 Subject: [Beowulf] Re: Geriatric computer does not stay up Message-ID: Gus Correa wrote > Some of the built-in 3Com Ethernet 100 interfaces on > Tyan S2466[-4M] motherboards we have here became flaky/failed > after many years of use. > Those are main boards in in several standalone workstations/PCs. > I don't administer those systems, but I believe the symptoms > were somewhat random, as those you describe. > > Disabling the onboard Ethernet (by jumper), and replacing them by > PCI Ethernet 100 cards, gave those systems additional lifetime. > Would this be the case of your cluster node? Tyan S2466 MPX does not seem to have such a jumper. Possibly it can be disabled in the BIOS. Oddly, the system is fine PXE booting over that interface, but every attempt at: service network start hangs instantly. Tried booting with a serial console like this from pxelinux.cfg: LABEL serial KERNEL vmlinuz-2.6.24.7-desktop-2mnb APPEND initrd=initrd-2.6.24.7-desktop-2mnb.img root=/dev/hda3 failsafe console=ttyS0,38400 which uses the initrd and vmlinuz downloaded from the server, and the disk from the iffy machine for the programs. That booted fine, but the kernel emitted nothing on the serial line when the machine hung. Running smartctl now, after that will boot a rescue linux and see if that too has network issues. Ran memory tests for over 20 hours without a single hiccup. I'll keep looking. Thanks all. David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From gus at ldeo.columbia.edu Thu Dec 17 11:39:48 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 17 Dec 2009 14:39:48 -0500 Subject: [Beowulf] Re: Geriatric computer does not stay up In-Reply-To: References: Message-ID: <4B2A8904.3010002@ldeo.columbia.edu> Hi David There is more than one version of this motherboard, hence I may be dead wrong about yours. In any case, I have the mobo user manual here. You can download it from Tyan also (with an appendix): http://www.tyan.com/archive/products/html/tigermpx.html ftp://ftp.tyan.com/manuals/m_s2466_120.pdf ftp://ftp.tyan.com/manuals/a_s2466_110.pdf The manual says that jumper J86 disables/enables the onboard "LAN" (3Com 3C905c). Onboard Ethernet is *disabled* when the jumper is *closed*. 
(It is probably open now, no jumper, assuming onboard Ethernet is currently enabled.) J86 is next to the motherboard edge, near PCI slot #4, probably near the back of the node case. Also, it is unclear whether PXE boot will work with a PCI NIC, but it may. Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- David Mathog wrote: > Gus Correa wrote > >> Some of the built-in 3Com Ethernet 100 interfaces on >> Tyan S2466[-4M] motherboards we have here became flaky/failed >> after many years of use. >> Those are main boards in in several standalone workstations/PCs. >> I don't administer those systems, but I believe the symptoms >> were somewhat random, as those you describe. >> >> Disabling the onboard Ethernet (by jumper), and replacing them by >> PCI Ethernet 100 cards, gave those systems additional lifetime. >> Would this be the case of your cluster node? > > Tyan S2466 MPX does not seem to have such a jumper. Possibly it can be > disabled in the BIOS. Oddly, the system is fine PXE booting over that > interface, but every attempt at: > > service network start > > hangs instantly. Tried booting with a serial console like this from > pxelinux.cfg: > > LABEL serial > KERNEL vmlinuz-2.6.24.7-desktop-2mnb > APPEND initrd=initrd-2.6.24.7-desktop-2mnb.img root=/dev/hda3 failsafe > console=ttyS0,38400 > > which uses the initrd and vmlinuz downloaded from the server, and the > disk from the iffy machine for the programs. That booted fine, but the > kernel emitted nothing on the serial line when the machine hung. > > Running smartctl now, after that will boot a rescue linux and see if > that too has network issues. > > Ran memory tests for over 20 hours without a single hiccup. > > I'll keep looking. Thanks all. > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From cousins at umit.maine.edu Thu Dec 17 11:55:52 2009 From: cousins at umit.maine.edu (Steve Cousins) Date: Thu, 17 Dec 2009 14:55:52 -0500 (EST) Subject: [Beowulf] Re: A question about antique hardware In-Reply-To: <200912170130.nBH1Uvvr016863@bluewest.scyld.com> References: <200912170130.nBH1Uvvr016863@bluewest.scyld.com> Message-ID: Gus Correa wrote: > Dear Beowulfers > > Did anybody ever get Gigabit Ethernet NICs to work on > the Tyan Tiger S2466-4M motherboards under Linux? Hi Gus, I have a S2466 Tiger MPX board that has been running for years with a copper Intel Pro/1000 MT NIC (82540EM) without trouble. The onboard 3Com 3c905C is running too and I've never had any problems with it (that I can remember!). The Gigabit card is using the e1000 driver and the 3Com is using 3c59x with a 2.4.20-20.7smp kernel with Redhat 7.3. (yikes!) As you can see it is a machine that I have pretty much forgotten about but it is closed off to the internet and it still serves a purpose. I hope this helps. Steve > If so, I would appreciate any words of wisdom about which > NICs work, the appropriate BIOS settings, > which PCI slots to use, etc. > > *** > > I flashed the Tyan S2466-4M BIOS to the latest version, > V4.06 (super, final 2003 edition). 
> > I need to set this head node up with two GigE ports. > I have two Intel 82543 Fiber Gigabit Ethernet PCI adapters, > which use the e1000 driver. > However, I would happily use other NICs and drivers, > anything that works, including copper based GigE. > > *** > > I googled up to find tips and solutions, > and I tried a number of different combinations: > disabling the onboard 3Com Ethernet 100 port with a jumper; > placing the NICs on the PCI-64 and on the PCI-32 slots; > disabling the BIOS "option RAM scan" on the NICs' PCI slots; > disabling USB on BIOS; > trying one NIC at a time; etc. > > However, so far no game. > The NICs are recognized, > link LEDs light up, > ping works, > but the system seems to be unstable, > ifdown/ifup hangs, > hence the system hangs when it tries to > take down the GigE ports during shutdown. > > Moreover, I get many of this kernel message on dmesg: > > Warning: kfree_skb on hard IRQ f88e47b2 > > *** > > This is the head node of our old Linux NetworX cluster. > The original head node motherboard, ASUS A7M266, > supported the aforementioned Intel NICs. > Unfortunately it seems to have died. > > I bought an used-but-functional Tyan S2466-4M board > on E-Bay as a replacement. > These S2466-4M boards seem to have been very popular > on servers and Beowulfs. > It sounded to me as a good choice. > After all, we have this board on all compute nodes. > The compute nodes don't have GigE, > only the onboard 3Com Ethernet 100 for service and I/O, > plus Myrinet-2000 for MPI. > They have been working fine for 8 years now. > > *** > > Thank you. > > Happy Holidays! > > Gus Correa > --------------------------------------------------------------------- > Gustavo Correa > Lamont-Doherty Earth Observatory - Columbia University > Palisades, NY, 10964-8000 - USA > --------------------------------------------------------------------- From jac67 at georgetown.edu Thu Dec 17 13:05:46 2009 From: jac67 at georgetown.edu (Jess Cannata) Date: Thu, 17 Dec 2009 16:05:46 -0500 Subject: [Beowulf] PXE/TFTP and Xen Kernel Issues In-Reply-To: References: <4B280C20.5060700@georgetown.edu> Message-ID: <4B2A9D2A.2060104@georgetown.edu> Thanks Bernard! This is what I was looking for. Jess On 12/16/2009 07:13 PM, Bernard Li wrote: > Hi Jess: > > With Xen-based kernels, you should be using the xen.gz "kernel" > instead of vmlinuz. Here's what a grub entry looks like for booting > Xen-based kernels: > > title CentOS (2.6.18-164.el5xen) > root (hd0,0) > kernel /xen.gz-3.4.0 > module /vmlinuz-2.6.18-164.el5xen ro root=/dev/VolGroup00/LogVol00 > module /initrd-2.6.18-164.el5xen.img > > Good luck! > > Cheers, > > Bernard > > On Tue, Dec 15, 2009 at 2:22 PM, Jess Cannata wrote: > >> I'm having a problem booting Xen kernels via PXE. I want to boot a machine >> via PXE that will then host Xen virtual machines. The client machine PXE >> boots, receives the pxelinux.0 file, and then grabs the Xen kernel >> (vmlinuz-2.6.18-164.6.1.el5xen). However, it can never load the Xen kernel. >> On the client, I get the following error: >> >> Invalid or corrupt kernel image. >> >> I have tried the following three kernels (two stock Centos kernels and one >> custom compiled kernel) and only the Xen kernel fails: >> >> -rw-r--r-- 1 root root 2030154 Dec 10 15:28 vmlinuz-2.6.18-164.6.1.el5xen >> -rw-r--r-- 1 root root 1932284 Sep 25 16:17 vmlinuz-2.6.18-164.el5 >> -rw-r--r-- 1 root root 3277584 Dec 10 15:29 vmlinuz-2.6.27.15-jw-node >> >> The others load without error. 
I have checked multiple times that the Xen >> kernel is not corrupt via md5sums and by booting it via grub. It just seems >> not to like the PXE system. Here is a snippet of the dnsmasq log to show >> that the file is sent correctly to the client: >> >> Dec 11 04:12:57 julie dnsmasq[9117]: TFTP sent /tftpboot/pxelinux.0 to >> 192.168.0.6 >> Dec 11 04:12:57 julie dnsmasq[9117]: TFTP sent >> /tftpboot/pxelinux.cfg/default to 192.168.0.6 >> Dec 11 04:12:57 julie dnsmasq[9117]: TFTP sent >> /tftpboot/vmlinuz-2.6.18-164.6.1.el5xen to 192.168.0.6 >> >> I have tried three different systems for the DHCP, TFTP, and PXE Servers >> (using stock RHEL/Centos packages). Here are the specs: >> >> System 1 >> Centos 5.4 (64-bit) with nvidia Ethernet adapters >> dnsmasq for both DHCP and TFTP Servers >> syslinux for PXE >> >> System 2 >> Centos 5.4 (64-bit) with e1000 Ethernet adapters >> dnsmasq for both DHCP and TFTP Servers >> syslinux for PXE >> >> System 3 >> Centos 5.3 (32-bit) with e1000 Ethernet adapters (trying 32-bit version of >> the Xen kernel) >> Config One: >> dnsmasq for both DHCP and TFTP Servers >> syslinux for PXE >> >> Config Two: >> dnsmasq for DHCP Server >> tftp-server for TFTP Server >> syslinux for PXE >> >> The client machines use the same hardware as the servers. I haven't seen >> anything about Xen kernels having issues with PXE. Before I start trying >> different flavors of Linux, I'm curious if anyone else has seen or heard of >> this problem. >> >> Many thanks in advance. >> >> Jess >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> >> From gus at ldeo.columbia.edu Thu Dec 17 13:17:31 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 17 Dec 2009 16:17:31 -0500 Subject: [Beowulf] Re: Geriatric computer does not stay up In-Reply-To: References: Message-ID: <4B2A9FEB.4080103@ldeo.columbia.edu> Hi David A PCI riser card may help: http://www.logicsupply.com/categories/accessories/pci_riser_cards http://www.plinkusa.net/riser.htm We have a single riser card on our old cluster compute nodes for the bulky Myrinet-2000 cards (same S2466 motherboard, 2U chassis). However, we don't have a graphics card on the compute nodes. I wonder how you can fit both (if graphics is needed). Maybe with two riser cards of different heights. Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- David Mathog wrote: >> The manual says that jumper J86 disables/enables >> the onboard "LAN" (3Com 3C905c). >> Onboard Ethernet is *disabled* when the jumper is *closed*. >> (It is probably open now, no jumper, >> assuming onboard Ethernet is currently enabled.) > > You are right. The jumper was under the graphics card, which was > mounted sideways in an AGP to PCI adapter. (No idea why they didn't > just put it in a PCI slot in the first place.) Hopefully disabling the > onboard NIC will do it, because the PS and disk have been eliminated as > possible trouble spots. > > >> Also, it is unclear whether PXE boot will work with a PCI NIC, >> but it may. > > First got to find a half height NIC to fit in the case, then I'll let > you know. 
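Whether PXE works from a plug-in card mostly comes down to whether it carries a boot ROM and whether the BIOS will run it; lspci will at least show if an option ROM is present. A sketch, the bus address being a placeholder:

  lspci | grep -i ethernet            # find the card's bus address
  lspci -v -s 01:08.0 | grep -i rom   # an "Expansion ROM at ..." line means the card at least exposes a boot ROM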
> > Putting that jumper on did eliminate the onboard PXE part of the boot > sequence. > > Thanks, > > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech From mdidomenico4 at gmail.com Fri Dec 18 05:40:47 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Fri, 18 Dec 2009 08:40:47 -0500 Subject: [Beowulf] tg3 driver and rx dropped packets Message-ID: Perhaps my brain is already checking out for the holidays, perhaps someone might be able to shed some light... I have several Dell t3500 workstations which we've installed PCI-X BroadCom cards specifically the BCM5703 Fiber cards. We're also using RedHat v5.4 2.6.18-164.6.1 kernel with all the stock drivers. For some reason which i cannot determine, when we read a large amount of data into the workstation we see the RX dropped counter steadily (rapidly) increase, eventually locking the TCP transmissions, which results in an aborted file operation. This only happens on reads, if we do a write operation we do not see TX drops. We've tested this between all types of devices (nfs, http) servers at various points in the network and the only common thing is the NIC. I'd think it was just a bad nic, but we have several nics across t3500 and t3400's doing this. Is anyone aware of such an issue? Can anyone recommend some steps i can take to isolate why the packets are being dropped? I hooked up wireshark on one of the servers while we were running the test and i see a lot of Duplicate ACK and TCP Checksum errors, in the communications between the two hosts. But im not sure that actually points to anything. Thanks From bcostescu at gmail.com Fri Dec 18 06:00:34 2009 From: bcostescu at gmail.com (Bogdan Costescu) Date: Fri, 18 Dec 2009 15:00:34 +0100 Subject: [Beowulf] tg3 driver and rx dropped packets In-Reply-To: References: Message-ID: On Fri, Dec 18, 2009 at 2:40 PM, Michael Di Domenico wrote: > For some reason which i cannot determine, when we read a large amount > of data into the workstation we see the RX dropped counter steadily > (rapidly) increase, eventually locking the TCP transmissions, which > results in an aborted file operation. Try turning off rx-checksumming and/or TSO. You can find it you have it enabled with: ethtool -k eth0 and turn it on/off with: ethtool -K eth0 tso off Cheers, Bogdan From lindahl at pbm.com Fri Dec 18 11:36:35 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 18 Dec 2009 11:36:35 -0800 Subject: [Beowulf] tg3 driver and rx dropped packets In-Reply-To: References: Message-ID: <20091218193635.GA14097@bx9.net> On Fri, Dec 18, 2009 at 08:40:47AM -0500, Michael Di Domenico wrote: > I hooked up wireshark on one of the servers while we were running the > test and i see a lot of Duplicate ACK and TCP Checksum errors, in the > communications between the two hosts. But im not sure that actually > points to anything. Well, it points to there being a significant problem. Packets are protected on the wire by a strong checksum, and so if there's corruption, it should be detected there. If that checksum is correct but the weak TCP checksum is wrong, that means something corrupted the packet in the host, for example a bad PCI card. The TCP checksum is so weak that if you see a lot of errors detected, you probably have some undetected errors sneaking through. 
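A quick way to watch the error counters and rule the offload engines in or out is something like this (eth0 stands in for the BCM5703 port):

  ethtool -S eth0 | grep -i -e err -e drop           # per-NIC statistics, including receive discards
  ethtool -k eth0                                    # current offload settings
  ethtool -K eth0 rx off tso off                     # disable rx checksum offload and TSO, then repeat the transfer
  netstat -s | grep -i -e retrans -e "bad segment"   # what the TCP stack itself has been counting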
-- greg From mathog at caltech.edu Fri Dec 18 16:12:03 2009 From: mathog at caltech.edu (David Mathog) Date: Fri, 18 Dec 2009 16:12:03 -0800 Subject: [Beowulf] Re: Geriatric computer does not stay up Message-ID: No Joy. Disabled the on board NIC and plugged in a couple of different NICs in different PCI slots - and they were all just as bad as the one on the motherboard. Let it rest unplugged overnight (just in case it was a stuck bit) and it didn't recover. Loading the PCI graphics card also crashed it. Seems like a failing circuit in the chipset in or around the PCI bridge. First time I've seen that failure. Replaced that motherboard with an even older one, a tiny bit slower, but still working. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From peter.st.john at gmail.com Sat Dec 19 13:20:48 2009 From: peter.st.john at gmail.com (Peter St. John) Date: Sat, 19 Dec 2009 16:20:48 -0500 Subject: [Beowulf] FTC and Intel In-Reply-To: <1E5EA6E5-75A5-486C-9B7E-E98B78BF5DD6@online.no> References: <1E5EA6E5-75A5-486C-9B7E-E98B78BF5DD6@online.no> Message-ID: Haakon, What I saw (in a recent Slashdot, which I ought to be able to find) was the idea that an Intel compiler disables it's own highest levels of optimizations if it detects that the host processor is not Intel. The complaint is based on that I believe. FWIW, I imagine that if an automobile engine detected poor octane in the fuel, it might throttle down the maximum speed of the car; but if it did so after detecting a competitor's brand of gasoline, it could be considered anti-competitive. But of course IMNAL or however we announce we ain't lawyers so CGS (cum grano salis). Peter On Thu, Dec 17, 2009 at 3:33 AM, H?kon Bugge wrote: > In http://www.ftc.gov/os/adjpro/d9341/091216intelcmpt.pdf, there are > allegations against Intel, such as "20. Intel?s efforts to deny > interoperability between competitors? (e.g., Nvidia, AMD, and Via) > GPUs and Intel?s newest CPUs". > > I was unaware of this. Anyone know what kind of interoperability we are > talking about here? > > > H?kon > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sassy1 at gmx.de Tue Dec 15 14:22:11 2009 From: sassy1 at gmx.de (=?iso-8859-1?q?J=F6rg_Sa=DFmannshausen?=) Date: Tue, 15 Dec 2009 22:22:11 +0000 Subject: [Beowulf] Performance degrading In-Reply-To: <200912152000.nBFK05ZA010546@bluewest.scyld.com> References: <200912152000.nBFK05ZA010546@bluewest.scyld.com> Message-ID: <200912152222.11820.sassy1@gmx.de> Hi Gus, thanks for your comments. The problem is not that there are 5 NWChem running. I am only starting 4 processes and the additional one is the master which does nothing more but coordinating the slaves. Other parts of the program behaving more as you expect it (again parallel between nodes, taken from one node): 14902 sassy 25 0 2161m 325m 124m R 100 2.8 14258:15 nwchem 14903 sassy 25 0 2169m 335m 128m R 100 2.9 14231:15 nwchem 14901 sassy 25 0 2177m 338m 133m R 100 2.9 14277:23 nwchem 14904 sassy 25 0 2161m 333m 132m R 97 2.9 14213:44 nwchem 14906 sassy 15 0 978m 71m 69m S 3 0.6 582:57.22 nwchem As you can see, there are 5 NWChem running but the fifth one does very little. 
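One way to check whether that fifth process is really competing for a core, rather than just sleeping, is to see where the scheduler is putting things -- a sketch:

  ps -o pid,psr,pcpu,etime,comm -C nwchem   # PSR is the core each process last ran on
  # (or press '1' inside top to watch the per-core idle figures while the job runs)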
So for me it looks like that the internode communication is a problem here and I would like to pin that down. For example, on the new dual quadcore I can get: 13555 sassy 20 0 2073m 212m 113m R 100 0.9 367:57.27 nwchem 13556 sassy 20 0 2074m 209m 109m R 100 0.9 369:11.21 nwchem 13557 sassy 20 0 2074m 206m 107m R 100 0.9 369:13.76 nwchem 13558 sassy 20 0 2072m 203m 103m R 100 0.8 368:18.53 nwchem 13559 sassy 20 0 2072m 178m 78m R 100 0.7 369:11.49 nwchem 13560 sassy 20 0 2072m 172m 73m R 100 0.7 369:14.35 nwchem 13561 sassy 20 0 2074m 171m 72m R 100 0.7 369:12.34 nwchem 13562 sassy 20 0 2072m 170m 72m R 100 0.7 368:56.30 nwchem So here there is no internode communication and hence I get the performance I would expect. The main problem is I am no longer the administrator of that cluster so anything which requires root access is not possible for me :-( But thanks for your 2 cent! :-) All the best J?rg Am Dienstag 15 Dezember 2009 schrieb beowulf-request at beowulf.org: > Hi Jorg > > If you have single quad core nodes as you said, > then top shows that you are oversubscribing the cores. > There are five nwchem processes are running. > > In my experience, oversubscription only works in relatively > light MPI programs (say the example programs that come with OpenMPI or > MPICH). > Real world applications tend to be very inefficient, > and can even hang on oversubscribed CPUs. > > What happens when you launch four or less processes > on a node instead of five? > > My $0.02. > Gus Correa > --------------------------------------------------------------------- > Gustavo Correa > Lamont-Doherty Earth Observatory - Columbia University > Palisades, NY, 10964-8000 - USA > --------------------------------------------------------------------- From jorg.sassmannshausen at strath.ac.uk Tue Dec 15 14:31:36 2009 From: jorg.sassmannshausen at strath.ac.uk (=?iso-8859-1?q?J=F6rg_Sa=DFmannshausen?=) Date: Tue, 15 Dec 2009 22:31:36 +0000 Subject: [Beowulf] Performance degrading Message-ID: <200912152231.36348.jorg.sassmannshausen@strath.ac.uk> Hi Gus, thanks for your comments. The problem is not that there are 5 NWChem running. I am only starting 4 processes and the additional one is the master which does nothing more but coordinating the slaves. Other parts of the program behaving more as you expect it (again parallel between nodes, taken from one node): 14902 sassy 25 0 2161m 325m 124m R 100 2.8 14258:15 nwchem 14903 sassy 25 0 2169m 335m 128m R 100 2.9 14231:15 nwchem 14901 sassy 25 0 2177m 338m 133m R 100 2.9 14277:23 nwchem 14904 sassy 25 0 2161m 333m 132m R 97 2.9 14213:44 nwchem 14906 sassy 15 0 978m 71m 69m S 3 0.6 582:57.22 nwchem As you can see, there are 5 NWChem running but the fifth one does very little. So for me it looks like that the internode communication is a problem here and I would like to pin that down. For example, on the new dual quadcore I can get: 13555 sassy 20 0 2073m 212m 113m R 100 0.9 367:57.27 nwchem 13556 sassy 20 0 2074m 209m 109m R 100 0.9 369:11.21 nwchem 13557 sassy 20 0 2074m 206m 107m R 100 0.9 369:13.76 nwchem 13558 sassy 20 0 2072m 203m 103m R 100 0.8 368:18.53 nwchem 13559 sassy 20 0 2072m 178m 78m R 100 0.7 369:11.49 nwchem 13560 sassy 20 0 2072m 172m 73m R 100 0.7 369:14.35 nwchem 13561 sassy 20 0 2074m 171m 72m R 100 0.7 369:12.34 nwchem 13562 sassy 20 0 2072m 170m 72m R 100 0.7 368:56.30 nwchem So here there is no internode communication and hence I get the performance I would expect. 
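One way to pin the internode path down without root access is to time raw MPI point-to-point traffic between a pair of nodes, for instance with the OSU micro-benchmarks or NetPIPE. A sketch, hostnames being placeholders; the older OSU releases are single C files, newer ones ship a configure script:

  mpicc osu_bw.c -o osu_bw
  mpicc osu_latency.c -o osu_latency
  mpiexec -np 2 -host comp12,comp18 ./osu_bw        # bandwidth between the two nodes
  mpiexec -np 2 -host comp12,comp18 ./osu_latency   # latency over the same pair

If those numbers look like healthy Gigabit (or Myrinet) figures, the network itself is probably not the culprit.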
The main problem is I am no longer the administrator of that cluster so anything which requires root access is not possible for me :-( But thanks for your 2 cent! :-) All the best J?rg Am Dienstag 15 Dezember 2009 schrieb beowulf-request at beowulf.org: > Hi Jorg > > If you have single quad core nodes as you said, > then top shows that you are oversubscribing the cores. > There are five nwchem processes are running. > > In my experience, oversubscription only works in relatively > light MPI programs (say the example programs that come with OpenMPI or > MPICH). > Real world applications tend to be very inefficient, > and can even hang on oversubscribed CPUs. > > What happens when you launch four or less processes > on a node instead of five? > > My $0.02. > Gus Correa > --------------------------------------------------------------------- > Gustavo Correa > Lamont-Doherty Earth Observatory - Columbia University > Palisades, NY, 10964-8000 - USA > --------------------------------------------------------------------- -- ************************************************************* J?rg Sa?mannshausen Research Fellow University of Strathclyde Department of Pure and Applied Chemistry 295 Cathedral St. Glasgow G1 1XL email: jorg.sassmannshausen at strath.ac.uk web: http://sassy.formativ.net Please avoid sending me Word or PowerPoint attachments. See http://www.gnu.org/philosophy/no-word-attachments.html From jorg.sassmannshausen at strath.ac.uk Wed Dec 16 01:41:39 2009 From: jorg.sassmannshausen at strath.ac.uk (=?iso-8859-1?q?J=F6rg_Sa=DFmannshausen?=) Date: Wed, 16 Dec 2009 09:41:39 +0000 Subject: [Beowulf] Re: Performance degrading In-Reply-To: <200912160442.nBG4gYni023476@bluewest.scyld.com> References: <200912160442.nBG4gYni023476@bluewest.scyld.com> Message-ID: <200912160941.39449.jorg.sassmannshausen@strath.ac.uk> Hi guys, ok, some more information. I am using OpenMPI-1.2.8 and I only start 4 processes per node. So my hostfile looks like that: comp12 slots=4 comp18 slots=4 comp08 slots=4 And yes, one process is the idle one which does things in the background. I have observed similar degradions before with a different program (GAMESS) where in the end, running a job on one node was _faster_ then running it on more than one nodes. Clearly, there is a problem here. Interesting to note that the fith process is consuming memory as well, I did not see that at the time when I posted it. That is somehow odd as well, as a different calculation (same program) does not show that behaviour. I assume it is one extra process per job-group which will act as a master or shepherd for the slave processes. I know that GAMESS (which does not use MPI but ddi) has one additional process as data-server. IIRC, the extra process does come from NWChem, but I doubt I am oversubscribing the node as it usually should not do much, as mentioned before. I am still wondering whether that could be a network issue? Thanks for your comments! All the best Jorg On Wednesday 16 December 2009 04:42:59 beowulf-request at beowulf.org wrote: > Hi Glen, Jorg > > Glen: Yes, you are right about MPICH1/P4 starting extra processes. > However, I wonder if that is what is happening to Jorg, > of if what he reported is just plain CPU oversubscription. > > Jorg: ?Do you use MPICH1/P4? > How many processes did you launch on a single node, four or five? > > Glen: ?Out of curiosity, I dug out the MPICH1/P4 I still have on an > old system, compiled and ran "cpi.c". 
> Indeed there are extra processes there, besides the ones that > I intentionally started in the mpirun command line. > When I launch two processes on a two-single-core-CPU machine, > I also get two (not only one) extra processes, in a total of four. > > However, as you mentioned, > the extra processes do not seem to use any significant CPU. > Top shows the two actual processes close to 100% and the > extra ones close to zero. > Furthermore, the extra processes don't use any > significant memory either. > > Anyway, in Jorg's case all processes consumed about > the same (low) amount of CPU, but ~15% memory each, > and there were 5 processes (only one "extra"?, is it one per CPU socket? > is it one per core? one per node?). > Hence, I would guess Jorg's context is different. > But ... who knows ... only Jorg can clarify. > > These extra processes seem to be related to the > mechanism used by MPICH1/P4 to launch MPI programs. > They don't seem to appear in recent OpenMPI or MPICH2, > which have other launching mechanisms. > Hence my guess that Jorg had an oversubscription problem. > > Considering that MPICH1/P4 is old, no longer maintained, > and seems to cause more distress than joy in current kernels, > I would not recommend it to Jorg or to anybody anyway. > > Thank you, > Gus Correa > --------------------------------------------------------------------- > Gustavo Correa > Lamont-Doherty Earth Observatory - Columbia University > Palisades, NY, 10964-8000 - USA > --------------------------------------------------------------------- -- ************************************************************* J?rg Sa?mannshausen Research Fellow University of Strathclyde Department of Pure and Applied Chemistry 295 Cathedral St. Glasgow G1 1XL email: jorg.sassmannshausen at strath.ac.uk web: http://sassy.formativ.net Please avoid sending me Word or PowerPoint attachments. See http://www.gnu.org/philosophy/no-word-attachments.html From jack at crepinc.com Wed Dec 16 14:36:05 2009 From: jack at crepinc.com (Jack Carrozzo) Date: Wed, 16 Dec 2009 17:36:05 -0500 Subject: [Beowulf] Geriatric computer does not stay up In-Reply-To: References: Message-ID: <2ad0f9f60912161436w522c1e53n124a720c6031f7f3@mail.gmail.com> I assume you've done this but forgot to mention it in the email - did you test the RAM? -Jack Carrozzo On Wed, Dec 16, 2009 at 5:27 PM, David Mathog wrote: > So we have a cluster of Tyan S2466 nodes and one of them has failed in > an odd way. (Yes, these are very old, and they would be gone if we had a > replacment.) ?On applying power the system boots normally and gets far > into the boot sequence, sometimes to the login prompt, then it locks up. > ?If booted failsafe it will stay up for tens of minutes before locking. > ?It locked once on "man smartctl" and once on "service network start". > However, on the next reboot, it didn't lock with another "man smartctl", > so it isn't like it hit a bad part of the disk and died. ?Smartctl test > has not been run, but "smartctl -a /dev/hda" on the one disk shows it as > healthy with no blocks swapped out. ?Power stays on when it locks, and > the display remains as it was just before the lock. ?When it locks it > will not respond to either the keyboard or the network. ?(The network > interface light still flashes.) ?There is nothing in any of the logs to > indicate the nature of the problem. > > The odd thing is that the system is remarkably stable in some ways. 
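When a box locks hard and nothing reaches the local logs, netconsole can sometimes push the last kernel messages to another machine before the hang. A sketch only; the addresses, interface and receiver MAC are placeholders, and some netcat builds want "nc -u -l 6666" instead:

  modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.1/00:11:22:33:44:55
  # on the receiving machine:
  nc -u -l -p 6666 | tee lockup.log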
?For > instance, the PS tests good and heat isn't the issue: after running > sensors in a tight loop to a log file, waiting for it to lock up, then > looking at the log on the next failsafe boot, there were negligible > fluctuation on any of the voltages, fan speeds, or temperatures. ?It > will happily sit for 30 minutes in the BIOS, or hours running memtest86 > (without errors). ?The motherboard battery is good, and the inside of > the case is very clean, with no dust visible at all. ?Reset the BIOS but > it didn't change anything. > > Here are my current hypotheses for what's wrong with this beast: > > 1. The drive is failing electrically, puts voltage spikes out on some > operations, and these crash the system. > 2. The motherboard capacitors are failing and letting too much noise in. > ?The noise which is fatal is only seen on an active system, so sitting > in the BIOS or in Memtest86 does not do it. (But the caps all look good, > no swelling, no leaks.) ?It will run memtest86 overnight though, just in > case. > 3. The PS capacitors are failing, so that when loaded there is enough > voltage fluctuation to crash the system. ?(Does not agree very well with > the sensors measurements, but it could be really high frequency noise > superimposed on a steady base voltage.) > 4. Evil Djinn ;-( > > Any thoughts on what else this might be? > > Thanks. > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From pauljohn32 at gmail.com Sun Dec 20 12:59:08 2009 From: pauljohn32 at gmail.com (Paul Johnson) Date: Sun, 20 Dec 2009 14:59:08 -0600 Subject: [Beowulf] if you had 2 switches and a bunch of cable, what would you do? Message-ID: <13e802630912201259k72971fa9x2627fb9ae4f68174@mail.gmail.com> I inherited a few big racks of unused Dell PowerEdge servers. There are 2 1GB switches in each rack, and each node in the cluster has at least two ethernet connections. I have a head node and some compute nodes up and running in Rocks Cluster 5.2, but in my haste to see that the hardware actually works, I've only only cut and patched enough cables to use one switch for the internal network. What to do with the other switch? As far as I understand it, I have 2 options (aside from selling the extra switch on Ebay :)) Option 1. Create 2 separate internal networks. In wiring drawings for clusters, I often see one administrative network and one for computations (mpi, and so forth). The downside for that is that I don't yet understand how user programs are supposed to differentiate the 2 internal networks and send messages through the computation network. I've not had the luxury of 2 ethernet cards before. Option 2. Try "channel bonding" to try to increase the throughput on a single ethernet node. That has some appeal because the users just see one network. I'm not aiming for the "high availability" approach of using the second switch as a fallback. Rather, I'm aiming for the fastest & widest connection possible in and out of each compute node. My head node still has just one 1GB ethernet connection going into it, and if I could believe that channel bonding would actually improve throughput, I suppose I could get another line (or even 2 more) going into the head node. 
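For reference, channel bonding on a RHEL/CentOS-style install (which is what Rocks sits on) is configured roughly as below. A sketch only: interface names and addresses are placeholders, and balance-rr is the one mode that can raise single-stream throughput, at the price of possible packet reordering.

  # /etc/modprobe.conf
  alias bond0 bonding
  options bond0 mode=balance-rr miimon=100

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  IPADDR=10.1.1.10
  NETMASK=255.255.255.0
  BOOTPROTO=none
  ONBOOT=yes

  # /etc/sysconfig/network-scripts/ifcfg-eth0  (and the same for eth1)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  BOOTPROTO=none
  ONBOOT=yes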
The headnode has 4 ethernet jacks, so I suppose I could double up inside and outside. I'd be glad to hear your thoughts. -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas From hackercasta at esdebian.org Sat Dec 19 13:13:44 2009 From: hackercasta at esdebian.org (=?ISO-8859-1?Q?_=09Iker_Casta=F1os_Chavarri_?=) Date: Sat, 19 Dec 2009 22:13:44 +0100 Subject: [Beowulf] new distribution to create a beowulf cluster: ABC(Automated Beowulf Cluster) GNU/Linux In-Reply-To: <90202de50912180405sc9bd612t67f2fdc17da285b@mail.gmail.com> References: <90202de50912180405sc9bd612t67f2fdc17da285b@mail.gmail.com> Message-ID: <90202de50912191313h34a77c09md2625fe63001f6ba@mail.gmail.com> Good night, This Ubuntu GNU/Linux based distribution alaws to automatically build Beowulf clusters either live or installing the software in the frontend. All nodes run diskless. Connect your computers with a switch and insert the DVD on one of them. It has the ganglia monitor, which is nice for seeing the details of the operation of the cluser. The system is beeing suported sporadically and you may contact technical support at abclinuxsupport at gmail.com ABC has been presented at the symposium ICAT2009 and published a research paper in the IEEE. The developer is Iker Casta?os at the EUITI of Bilbao, University of the Basque Country (Spain). The proyect website: http://www.ehu.es/AC/ABC.htm This is a demostration video: http://www.youtube.com/watch?v=Xn2M1SoVg6U Best regards, Iker Casta?os From h-bugge at online.no Mon Dec 21 01:05:08 2009 From: h-bugge at online.no (=?WINDOWS-1252?Q?H=E5kon_Bugge?=) Date: Mon, 21 Dec 2009 10:05:08 +0100 Subject: [Beowulf] FTC and Intel In-Reply-To: References: <1E5EA6E5-75A5-486C-9B7E-E98B78BF5DD6@online.no> Message-ID: Peter, On Dec 19, 2009, at 22:20 , Peter St. John wrote: > Haakon, > What I saw (in a recent Slashdot, which I ought to be able to find) > was the idea that an Intel compiler disables it's own highest levels > of optimizations if it detects that the host processor is not Intel. > The complaint is based on that I believe. > > FWIW, I imagine that if an automobile engine detected poor octane in > the fuel, it might throttle down the maximum speed of the car; but > if it did so after detecting a competitor's brand of gasoline, it > could be considered anti-competitive. But of course IMNAL or however > we announce we ain't lawyers so CGS (cum grano salis). > Peter Intel's compiler generates code that will not run on AMD CPUs (i.e. non GenuineIntel) if an instruction-set higher than SSE2 is selected. This issue is covered by the referenced complaints elsewhere. I know this issue pretty well, as I wrote a piece of software used by Scali/ Platform MPI which reverted the Intel compiler's check for GenuineIntel, which transparently allowed MPI programs to run on AMD CPUs with SSE3/SSSE3 instruction set enabled. Claim 20 in the complaints is not related to this, but _interoperability_ between GPUs and Intel CPUs. That is what I tried to get a better understanding of. I have received insightful comments around this issue off-list. Thanks anyway, H?kon > On Thu, Dec 17, 2009 at 3:33 AM, H?kon Bugge > wrote: > In http://www.ftc.gov/os/adjpro/d9341/091216intelcmpt.pdf, there are > allegations against Intel, such as "20. Intel?s efforts to deny > interoperability between competitors? (e.g., Nvidia, AMD, and Via) > GPUs and Intel?s newest CPUs". > > I was unaware of this. Anyone know what kind of interoperability we > are talking about here? 
> > > H?kon > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Mvh., H?kon Bugge h-bugge at online.no +47 924 84 514 -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.hearns at mclaren.com Mon Dec 21 01:54:05 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Mon, 21 Dec 2009 09:54:05 -0000 Subject: [Beowulf] if you had 2 switches and a bunch of cable,what would you do? In-Reply-To: <13e802630912201259k72971fa9x2627fb9ae4f68174@mail.gmail.com> References: <13e802630912201259k72971fa9x2627fb9ae4f68174@mail.gmail.com> Message-ID: <68A57CCFD4005646957BD2D18E60667B0EA7BDEE@milexchmb1.mil.tagmclarengroup.com> Option 1. Create 2 separate internal networks. In wiring drawings for clusters, I often see one administrative network and one for computations (mpi, and so forth). Paul, definitely recommend option 1. Use the second switch for MPI traffic. The way you achieve this is to use your batch scheduling system and run a script which takes the machines list provided by the batch system and translates it into one which is fed to to the mpirun utility, ie the hostnames are turned into something like hostname-eth1 The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From john.hearns at mclaren.com Mon Dec 21 02:04:37 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Mon, 21 Dec 2009 10:04:37 -0000 Subject: [Beowulf] Performance degrading In-Reply-To: <200912152231.36348.jorg.sassmannshausen@strath.ac.uk> References: <200912152231.36348.jorg.sassmannshausen@strath.ac.uk> Message-ID: <68A57CCFD4005646957BD2D18E60667B0EA7BE0D@milexchmb1.mil.tagmclarengroup.com> The main problem is I am no longer the administrator of that cluster so anything which requires root access is not possible for me :-( But thanks for your 2 cent! :-) Jorg, you live in Glasgow. The solution is simple. Proceed without delay to the Scotch Whisky Shop in the Buchanan Galleries (just down the hill). Buy one bottle of single malt whisky - you may like to taste a few just to get the right one. Present this bottle as a Christmas gift to your BOFH (ahem, sorry, friendly systems admin). The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From csamuel at vpac.org Mon Dec 21 03:57:10 2009 From: csamuel at vpac.org (Chris Samuel) Date: Mon, 21 Dec 2009 22:57:10 +1100 (EST) Subject: [Beowulf] if you had 2 switches and a bunch of cable, what would you do? In-Reply-To: <777881520.276391261396089675.JavaMail.root@mail.vpac.org> Message-ID: <60238355.276411261396630847.JavaMail.root@mail.vpac.org> ----- "Paul Johnson" wrote: > What to do with the other switch? We use one ethernet switch for management, one ethernet switch (with jumbo frames) for our NFS storage network and IB/Myrinet for MPI. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. 
Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From reuti at staff.uni-marburg.de Mon Dec 21 04:38:06 2009 From: reuti at staff.uni-marburg.de (Reuti) Date: Mon, 21 Dec 2009 13:38:06 +0100 Subject: [Beowulf] if you had 2 switches and a bunch of cable, what would you do? In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0EA7BDEE@milexchmb1.mil.tagmclarengroup.com> References: <13e802630912201259k72971fa9x2627fb9ae4f68174@mail.gmail.com> <68A57CCFD4005646957BD2D18E60667B0EA7BDEE@milexchmb1.mil.tagmclarengroup.com> Message-ID: <6C5F76ED-A3F0-4601-97FA-3F6FB670A225@staff.uni-marburg.de> Hi, Am 21.12.2009 um 10:54 schrieb Hearns, John: > Option 1. Create 2 separate internal networks. In wiring drawings for > clusters, I often see one administrative network and one for > computations (mpi, and so forth). first a matter of definition: what's "administrative network"? Just the option to ssh to a node & SGE, or to have access to some facitility of a dedicated service processor like "Lights Out" management? In the former case, I would do it the other way round: use the primary one for ssh, SGE and MPI, and the second one for NFS. Simply because then there is no need to alter the generated list of hosts (ssh to the nodes is in my case only for admin staff anyway), and SGE's is communication to the nodes is not so high (When local spool directories on the nodes are used, there is no further communication needed to store this informaton. Otherwise it will go via NFS to the central spool directory, but this would be the second network then.) With some MPI implementations it can be tricky (but possible) to force them to use the secondary interface, especially for both directions. In MPICH(1) (old, but sometimes still used) also the environment variable MPI_HOST must be set to have the name of the secondary interface. Well, you need a second (or third) network for the headnode then: one for NFS and maybe one for going to the outside world (this way all internal traffic can use private addresses and are invisible from the outside). == If the administrative network is "Lights Out" management, I would look for a switch with less performance laying around. If your servers have it built-in, I would use it. If you have enough ports on the switches, you can also connect it to the first network from above. -- Reuti > Paul, definitely recommend option 1. > Use the second switch for MPI traffic. > > The way you achieve this is to use your batch scheduling system and > run > a script > which takes the machines list provided by the batch system and > translates it into one > which is fed to to the mpirun utility, ie the hostnames are turned > into > something like > hostname-eth1 > > The contents of this email are confidential and for the exclusive > use of the intended recipient. If you receive this email in error > you should not copy it, retransmit it, use it or disclose its > contents but should return it to the sender immediately and delete > your copy. > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From peter.st.john at gmail.com Mon Dec 21 09:41:38 2009 From: peter.st.john at gmail.com (Peter St. 
John) Date: Mon, 21 Dec 2009 12:41:38 -0500 Subject: [Beowulf] FTC and Intel In-Reply-To: References: <1E5EA6E5-75A5-486C-9B7E-E98B78BF5DD6@online.no> Message-ID: Haakon, And thanks for correcting me. I had been surprised your question went unanswered so long as it did; I should always be suspicious that means I had superficially misconstrued the question. Peter On Mon, Dec 21, 2009 at 4:05 AM, H?kon Bugge wrote: > Peter, > > On Dec 19, 2009, at 22:20 , Peter St. John wrote: > > Haakon, > What I saw (in a recent Slashdot, which I ought to be able to find) was the > idea that an Intel compiler disables it's own highest levels of > optimizations if it detects that the host processor is not Intel. The > complaint is based on that I believe. > > FWIW, I imagine that if an automobile engine detected poor octane in the > fuel, it might throttle down the maximum speed of the car; but if it did so > after detecting a competitor's brand of gasoline, it could be considered > anti-competitive. But of course IMNAL or however we announce we ain't > lawyers so CGS (cum grano salis). > Peter > > > Intel's compiler generates code that will not run on AMD CPUs (i.e. non > GenuineIntel) if an instruction-set higher than SSE2 is selected. This issue > is covered by the referenced complaints elsewhere. I know this issue pretty > well, as I wrote a piece of software used by Scali/Platform MPI which > reverted the Intel compiler's check for GenuineIntel, which transparently > allowed MPI programs to run on AMD CPUs with SSE3/SSSE3 instruction set > enabled. > > Claim 20 in the complaints is not related to this, but _interoperability_ > between GPUs and Intel CPUs. That is what I tried to get a better > understanding of. I have received insightful comments around this issue > off-list. > > > Thanks anyway, H?kon > > On Thu, Dec 17, 2009 at 3:33 AM, H?kon Bugge wrote: > >> In http://www.ftc.gov/os/adjpro/d9341/091216intelcmpt.pdf, there are >> allegations against Intel, such as "20. Intel?s efforts to deny >> interoperability between competitors? (e.g., Nvidia, AMD, and Via) >> GPUs and Intel?s newest CPUs". >> >> I was unaware of this. Anyone know what kind of interoperability we are >> talking about here? >> >> >> H?kon >> >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > > > Mvh., > > H?kon Bugge > h-bugge at online.no > +47 924 84 514 > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gus at ldeo.columbia.edu Mon Dec 21 10:06:04 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 21 Dec 2009 13:06:04 -0500 Subject: [Beowulf] Re: Performance degrading In-Reply-To: <200912160941.39449.jorg.sassmannshausen@strath.ac.uk> References: <200912160442.nBG4gYni023476@bluewest.scyld.com> <200912160941.39449.jorg.sassmannshausen@strath.ac.uk> Message-ID: <4B2FB90C.5040109@ldeo.columbia.edu> Hi Jorg To clarify what is going on, I would try the cpi.c (comes with MPICH2), or the "connectivity_c.c", "ring_c.c" (come with OpenMPI) programs. Get them with the source code in the MPICH2 and OpenMPI sites. These programs are in the "examples" directories. Compiliation is straightforward with mpicc. Run these programs on one node (4 processes) first, then on several nodes (say -np 8, -np 12, etc). 
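Compiling and running them is quick -- something along these lines, with the hostfile name a placeholder:

  mpicc connectivity_c.c -o connectivity_c
  mpicc ring_c.c -o ring_c
  mpiexec -np 4 ./connectivity_c                      # one node first
  mpiexec -hostfile myhosts -np 12 ./connectivity_c   # then across nodes
  mpiexec -hostfile myhosts -np 12 ./ring_c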
Remember OpenMPI mpiexec has the "-byslot" and "-bynode" options that allow you to experiment with different process vs. core/node configurations (see "man mpiexec"). If cpi.c runs, then it should not be an OpenMPI problem, or with your network, etc. This will narrow your investigation also. I know nothing about your programs, but I find strange that it starts more processes than you request. As noted by Glen this used to be the case with the old MPICH1, which you are not using. Hence, your program seems to be doing stuff under the hood, and beyond mpiexec. In any case, have you tried "mpiexec -np 3 newchem" (if 3 is an acceptable number for newchem)? Also, not being the administrator doesn't prevent you from installing a newer OpenMPI (current version is 1.4) from source code in your area and using it. You just need to set the right PATH, LD_LIBRARY_PATH, and MANPATH to your own OpenMPI on your .bashrc/.cshrc file. I am no expert, just a user, but here is what I think may be happening. Oversubscribing processors/cores leads to context switching across the processes, which is a killer for MPI performance. Oversubscribing memory (e.g. total of all user processes memory above 80% or so), leads to memory paging, another performance killer. I would guess both situations open plenty of opportunity for gridlocks, one process trying to communicate with another that is on hold, and when the other that is on hold becomes active, the one that was trying to talk goes on hold, and so on. Sometimes the programs just hang, sometimes one MPI process goes astray losing communication with the others. Something like this may be happening to you. I think the message is: MPI and oversubscription (of processors or memory) don't mix well. Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- J?rg Sa?mannshausen wrote: > Hi guys, > > ok, some more information. > I am using OpenMPI-1.2.8 and I only start 4 processes per node. So my hostfile > looks like that: > comp12 slots=4 > comp18 slots=4 > comp08 slots=4 > > And yes, one process is the idle one which does things in the background. I > have observed similar degradions before with a different program (GAMESS) > where in the end, running a job on one node was _faster_ then running it on > more than one nodes. Clearly, there is a problem here. > > Interesting to note that the fith process is consuming memory as well, I did > not see that at the time when I posted it. That is somehow odd as well, as a > different calculation (same program) does not show that behaviour. I assume > it is one extra process per job-group which will act as a master or shepherd > for the slave processes. I know that GAMESS (which does not use MPI but ddi) > has one additional process as data-server. > > IIRC, the extra process does come from NWChem, but I doubt I am > oversubscribing the node as it usually should not do much, as mentioned > before. > > I am still wondering whether that could be a network issue? > > Thanks for your comments! > > All the best > > Jorg > > > On Wednesday 16 December 2009 04:42:59 beowulf-request at beowulf.org wrote: >> Hi Glen, Jorg >> >> Glen: Yes, you are right about MPICH1/P4 starting extra processes. >> However, I wonder if that is what is happening to Jorg, >> of if what he reported is just plain CPU oversubscription. >> >> Jorg: Do you use MPICH1/P4? 
>> How many processes did you launch on a single node, four or five? >> >> Glen: Out of curiosity, I dug out the MPICH1/P4 I still have on an >> old system, compiled and ran "cpi.c". >> Indeed there are extra processes there, besides the ones that >> I intentionally started in the mpirun command line. >> When I launch two processes on a two-single-core-CPU machine, >> I also get two (not only one) extra processes, in a total of four. >> >> However, as you mentioned, >> the extra processes do not seem to use any significant CPU. >> Top shows the two actual processes close to 100% and the >> extra ones close to zero. >> Furthermore, the extra processes don't use any >> significant memory either. >> >> Anyway, in Jorg's case all processes consumed about >> the same (low) amount of CPU, but ~15% memory each, >> and there were 5 processes (only one "extra"?, is it one per CPU socket? >> is it one per core? one per node?). >> Hence, I would guess Jorg's context is different. >> But ... who knows ... only Jorg can clarify. >> >> These extra processes seem to be related to the >> mechanism used by MPICH1/P4 to launch MPI programs. >> They don't seem to appear in recent OpenMPI or MPICH2, >> which have other launching mechanisms. >> Hence my guess that Jorg had an oversubscription problem. >> >> Considering that MPICH1/P4 is old, no longer maintained, >> and seems to cause more distress than joy in current kernels, >> I would not recommend it to Jorg or to anybody anyway. >> >> Thank you, >> Gus Correa >> --------------------------------------------------------------------- >> Gustavo Correa >> Lamont-Doherty Earth Observatory - Columbia University >> Palisades, NY, 10964-8000 - USA >> --------------------------------------------------------------------- > From atp at piskorski.com Mon Dec 21 11:56:10 2009 From: atp at piskorski.com (Andrew Piskorski) Date: Mon, 21 Dec 2009 14:56:10 -0500 Subject: [Beowulf] new distribution to create a beowulf cluster: ABC(Automated Beowulf Cluster) GNU/Linux In-Reply-To: <90202de50912191313h34a77c09md2625fe63001f6ba@mail.gmail.com> References: <90202de50912191313h34a77c09md2625fe63001f6ba@mail.gmail.com> Message-ID: <20091221195610.GA59020@piskorski.com> On Sat, Dec 19, 2009 at 10:13:44PM +0100, Iker Casta?os Chavarri wrote: > This Ubuntu GNU/Linux based distribution alaws to automatically build > Beowulf clusters either live or installing the software in the > frontend. All nodes run diskless. > http://www.ehu.es/AC/ABC.htm How is this similar/different to Perceus/Warewulf, xCAT, Scyld, etc.? What use cases drove you to build your own cluster provisioning toolkit rather than use an existing one? Is there a document somewhere explaining your overall system design and rationale? -- Andrew Piskorski http://www.piskorski.com/ From kyron at neuralbs.com Mon Dec 21 11:05:45 2009 From: kyron at neuralbs.com (Eric Thibodeau) Date: Mon, 21 Dec 2009 14:05:45 -0500 Subject: [Beowulf] Geriatric computer does not stay up In-Reply-To: <2ad0f9f60912161436w522c1e53n124a720c6031f7f3@mail.gmail.com> References: <2ad0f9f60912161436w522c1e53n124a720c6031f7f3@mail.gmail.com> Message-ID: This smells like the hell I went through when one of the CPUs needed to be changed in our dep's Tyan VX50... Try swapping CPUs if you have spares. ET On 2009-12-16, at 5:36 PM, Jack Carrozzo wrote: > I assume you've done this but forgot to mention it in the email - did > you test the RAM? 
> > -Jack Carrozzo > > On Wed, Dec 16, 2009 at 5:27 PM, David Mathog wrote: >> So we have a cluster of Tyan S2466 nodes and one of them has failed in >> an odd way. (Yes, these are very old, and they would be gone if we had a >> replacment.) On applying power the system boots normally and gets far >> into the boot sequence, sometimes to the login prompt, then it locks up. >> If booted failsafe it will stay up for tens of minutes before locking. >> It locked once on "man smartctl" and once on "service network start". >> However, on the next reboot, it didn't lock with another "man smartctl", >> so it isn't like it hit a bad part of the disk and died. Smartctl test >> has not been run, but "smartctl -a /dev/hda" on the one disk shows it as >> healthy with no blocks swapped out. Power stays on when it locks, and >> the display remains as it was just before the lock. When it locks it >> will not respond to either the keyboard or the network. (The network >> interface light still flashes.) There is nothing in any of the logs to >> indicate the nature of the problem. >> >> The odd thing is that the system is remarkably stable in some ways. For >> instance, the PS tests good and heat isn't the issue: after running >> sensors in a tight loop to a log file, waiting for it to lock up, then >> looking at the log on the next failsafe boot, there were negligible >> fluctuation on any of the voltages, fan speeds, or temperatures. It >> will happily sit for 30 minutes in the BIOS, or hours running memtest86 >> (without errors). The motherboard battery is good, and the inside of >> the case is very clean, with no dust visible at all. Reset the BIOS but >> it didn't change anything. >> >> Here are my current hypotheses for what's wrong with this beast: >> >> 1. The drive is failing electrically, puts voltage spikes out on some >> operations, and these crash the system. >> 2. The motherboard capacitors are failing and letting too much noise in. >> The noise which is fatal is only seen on an active system, so sitting >> in the BIOS or in Memtest86 does not do it. (But the caps all look good, >> no swelling, no leaks.) It will run memtest86 overnight though, just in >> case. >> 3. The PS capacitors are failing, so that when loaded there is enough >> voltage fluctuation to crash the system. (Does not agree very well with >> the sensors measurements, but it could be really high frequency noise >> superimposed on a steady base voltage.) >> 4. Evil Djinn ;-( >> >> Any thoughts on what else this might be? >> >> Thanks. 
>> >> David Mathog >> mathog at caltech.edu >> Manager, Sequence Analysis Facility, Biology Division, Caltech >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf >> > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From reuti at staff.uni-marburg.de Mon Dec 21 16:01:57 2009 From: reuti at staff.uni-marburg.de (Reuti) Date: Tue, 22 Dec 2009 01:01:57 +0100 Subject: [Beowulf] Performance degrading In-Reply-To: <200912152222.11820.sassy1@gmx.de> References: <200912152000.nBFK05ZA010546@bluewest.scyld.com> <200912152222.11820.sassy1@gmx.de> Message-ID: Hi, Am 15.12.2009 um 23:22 schrieb J?rg Sa?mannshausen: > Hi Gus, > > thanks for your comments. The problem is not that there are 5 > NWChem running. > I am only starting 4 processes and the additional one is the master > which > does nothing more but coordinating the slaves. isn't NWChem using Global Arrays internally, and Open MPI is only used for communication? Which version of GA is included with your current NWChem? -- Reuti > Other parts of the program behaving more as you expect it (again > parallel > between nodes, taken from one node): > 14902 sassy 25 0 2161m 325m 124m R 100 2.8 14258:15 nwchem > 14903 sassy 25 0 2169m 335m 128m R 100 2.9 14231:15 nwchem > 14901 sassy 25 0 2177m 338m 133m R 100 2.9 14277:23 nwchem > 14904 sassy 25 0 2161m 333m 132m R 97 2.9 14213:44 nwchem > 14906 sassy 15 0 978m 71m 69m S 3 0.6 582:57.22 nwchem > > As you can see, there are 5 NWChem running but the fifth one does > very little. > So for me it looks like that the internode communication is a > problem here and > I would like to pin that down. > > For example, on the new dual quadcore I can get: > 13555 sassy 20 0 2073m 212m 113m R 100 0.9 367:57.27 nwchem > 13556 sassy 20 0 2074m 209m 109m R 100 0.9 369:11.21 nwchem > 13557 sassy 20 0 2074m 206m 107m R 100 0.9 369:13.76 nwchem > 13558 sassy 20 0 2072m 203m 103m R 100 0.8 368:18.53 nwchem > 13559 sassy 20 0 2072m 178m 78m R 100 0.7 369:11.49 nwchem > 13560 sassy 20 0 2072m 172m 73m R 100 0.7 369:14.35 nwchem > 13561 sassy 20 0 2074m 171m 72m R 100 0.7 369:12.34 nwchem > 13562 sassy 20 0 2072m 170m 72m R 100 0.7 368:56.30 nwchem > So here there is no internode communication and hence I get the > performance I > would expect. > > The main problem is I am no longer the administrator of that > cluster so > anything which requires root access is not possible for me :-( > > But thanks for your 2 cent! :-) > > All the best > > J?rg > > Am Dienstag 15 Dezember 2009 schrieb beowulf-request at beowulf.org: >> Hi Jorg >> >> If you have single quad core nodes as you said, >> then top shows that you are oversubscribing the cores. >> There are five nwchem processes are running. >> >> In my experience, oversubscription only works in relatively >> light MPI programs (say the example programs that come with >> OpenMPI or >> MPICH). >> Real world applications tend to be very inefficient, >> and can even hang on oversubscribed CPUs. >> >> What happens when you launch four or less processes >> on a node instead of five? >> >> My $0.02. 
>> Gus Correa >> --------------------------------------------------------------------- >> Gustavo Correa >> Lamont-Doherty Earth Observatory - Columbia University >> Palisades, NY, 10964-8000 - USA >> --------------------------------------------------------------------- > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From jorg.sassmannshausen at strath.ac.uk Tue Dec 22 02:30:54 2009 From: jorg.sassmannshausen at strath.ac.uk (=?iso-8859-15?q?J=F6rg_Sa=DFmannshausen?=) Date: Tue, 22 Dec 2009 10:30:54 +0000 Subject: [Beowulf] Re: Performance degrading Message-ID: <200912221030.54293.jorg.sassmannshausen@strath.ac.uk> Hi all, right, following the various suggestion and the idea about oversubscribing the node, I have started the same calculation again on 3 nodes but this time only with 3 processes on the node (started), so that will leave room for the 4th process which I believe will be started by NWChem. Has anything changed? No. top - 10:24:06 up 21 days, 17:29, 1 user, load average: 0.25, 0.26, 0.26 Tasks: 131 total, 1 running, 130 sleeping, 0 stopped, 0 zombie Cpu0 : 4.3% us, 1.0% sy, 0.0% ni, 93.7% id, 0.0% wa, 0.0% hi, 1.0% si Cpu1 : 2.7% us, 0.0% sy, 0.0% ni, 97.3% id, 0.0% wa, 0.0% hi, 0.0% si Cpu2 : 0.7% us, 0.0% sy, 0.0% ni, 99.3% id, 0.0% wa, 0.0% hi, 0.0% si Cpu3 : 0.0% us, 0.0% sy, 0.0% ni, 99.0% id, 1.0% wa, 0.0% hi, 0.0% si Mem: 12308356k total, 5251744k used, 7056612k free, 377052k buffers Swap: 24619604k total, 0k used, 24619604k free, 3647568k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 29017 sassy 15 0 3921m 1.7g 1.4g S 3 14.8 384:50.34 nwchem 29019 sassy 15 0 3920m 1.8g 1.5g S 2 15.4 379:26.55 nwchem 29018 sassy 15 0 3920m 1.8g 1.5g S 1 15.5 380:31.15 nwchem 29021 sassy 15 0 2943m 1.7g 1.7g S 1 14.9 42:33.81 nwchem <- process started by NWChem I suppose As Reuti pointed out to me, NWChem is using Global Arrays internally and only MPI for communication. I don't think the problem is the OpenMPI I have. I could upgrade to the latest, but that means I have to re-link all the programs I am using. Could the problem be the GA? All the best J?rg -- ************************************************************* J?rg Sa?mannshausen Research Fellow University of Strathclyde Department of Pure and Applied Chemistry 295 Cathedral St. Glasgow G1 1XL email: jorg.sassmannshausen at strath.ac.uk web: http://sassy.formativ.net Please avoid sending me Word or PowerPoint attachments. See http://www.gnu.org/philosophy/no-word-attachments.html From gus at ldeo.columbia.edu Tue Dec 22 16:35:49 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 22 Dec 2009 19:35:49 -0500 Subject: [Beowulf] Re: Performance degrading In-Reply-To: <200912221030.54293.jorg.sassmannshausen@strath.ac.uk> References: <200912221030.54293.jorg.sassmannshausen@strath.ac.uk> Message-ID: <4B3165E5.1000403@ldeo.columbia.edu> Hi Jorg I agree your old OpenMPI 1.2.8 should not be the problem, and upgrading now will only add confusion. I only suggested running simple test programs (cpi.c, connectivity_c.c) to make sure all works right, including your network setup. However, you or somebody else may already have done this in the past. Hybrid communication schemes have to be handled with care. We have plenty of MPI+OpenMP programs here. Normally we request one processor per OpenMP thread for each MPI task. 
For instance, if each MPI task opens four threads, and you want three MPI tasks, then a total of 12 processors, is requested from the batch system (e.g. #PBS -l nodes=3:ppn=4). However, only 3 MPI processes are launched in mpiexec (mpiexec -bynode -np 3 executable_name). The "-bynode" option will put one MPI task on each of three nodes, and each of these three MPI tasks will launch 4 OpenMP threads, using the 4 local processors. However, GA is not the same as OpenMP, and another scheme may apply. I don't know about GA. I looked up their web page (at PNL), but I didn't find a direct answer to your problem. I would have to read more to learn about GA. See if this GA support page may shed some light on how you configured it (or how it is configured inside NWChem): http://www.emsl.pnl.gov/docs/global/support.html See their comments about the SHMMAX (shared memory segment size) in Linux Kernels, as this may perhaps be the problem. Here, on 32-bit Linux machines I have a number smaller than they recommend (134217728 bytes, 128MB): cat /proc/sys/kernel/shmmax 33554432 But on 64-bit machines it is much larger: cat /proc/sys/kernel/shmmax 68719476736 You may check this out on your nodes, and if you have a low number (say in a 32-bit node), perhaps try their suggestion, and ask the system administrator to change this kernel parameter on the nodes by doing: echo "134217728" >/proc/sys/kernel/shmmax GA seems to be a heavy user of shared memory, hence it is likely to require more shared memory resources than normal programs do. Therefore, there is a flimsy chance that increasing SHMMAX may help. I also found the NWChem web site (also at PNL). You may know well all about this, so forgive me any silly suggestions. I am not a Chemist, computational or otherwise. I still need to understand PH and Hydrogen bridges right. The NWChem User Guide, Appendix D (around page 401, big guide!) has suggestions on how to run in different machines, including Linux clusters with MPI (section D.3). http://www.emsl.pnl.gov/capabilities/computing/nwchem/docs/usermanual.pdf They have also FAQ about Linux clusters: http://www.emsl.pnl.gov/capabilities/computing/nwchem/support/faq.jsp#Linux They also have a "known bugs" list: http://www.emsl.pnl.gov/root/capabilities/computing/nwchem/support/knownbugs/ Somehow they seem to talk only about MPICH (not sure if MPICH1 or MPICH2), but not about OpenMPI. Likewise, browsing through the GA stuff I could find no direct reference to OpenMPI, only to MPICH, although theoretically MPI is supposed to be a standard, and portable programs ans libraries should work right with any MPI flavor (in practice I am not so sure this is true). Also, GA seems to have specific instructions for Infiniband (but not for Ethernet), see the web link on GA above. What do you have IB or Ethernet? If you have both you can select one (say, IB) with the OpenMPI mca parameters in the mpiexec command line. I know it won't help ... but I tried ... :( Good luck, and Happy Holidays! 
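To make those last two points concrete -- a sketch, with hostfile, process count and input file as placeholders:

  # keep MPI traffic on InfiniBand (openib BTL), shared memory for on-node ranks:
  mpiexec --mca btl openib,sm,self -hostfile myhosts -np 12 nwchem input.nw
  # or pin TCP traffic to one particular Ethernet interface:
  mpiexec --mca btl tcp,sm,self --mca btl_tcp_if_include eth1 -hostfile myhosts -np 12 nwchem input.nw
  # and, if the larger SHMMAX helps, make it stick across reboots (root needed):
  sysctl -w kernel.shmmax=134217728
  echo "kernel.shmmax = 134217728" >> /etc/sysctl.conf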
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

Jörg Saßmannshausen wrote:
> Hi all,
>
> right, following the various suggestions and the idea about oversubscribing
> the node, I have started the same calculation again on 3 nodes, but this
> time with only 3 processes started per node, so that leaves room for the
> 4th process, which I believe is started by NWChem.
> Has anything changed? No.
>
> top - 10:24:06 up 21 days, 17:29, 1 user, load average: 0.25, 0.26, 0.26
> Tasks: 131 total, 1 running, 130 sleeping, 0 stopped, 0 zombie
> Cpu0 : 4.3% us, 1.0% sy, 0.0% ni, 93.7% id, 0.0% wa, 0.0% hi, 1.0% si
> Cpu1 : 2.7% us, 0.0% sy, 0.0% ni, 97.3% id, 0.0% wa, 0.0% hi, 0.0% si
> Cpu2 : 0.7% us, 0.0% sy, 0.0% ni, 99.3% id, 0.0% wa, 0.0% hi, 0.0% si
> Cpu3 : 0.0% us, 0.0% sy, 0.0% ni, 99.0% id, 1.0% wa, 0.0% hi, 0.0% si
> Mem: 12308356k total, 5251744k used, 7056612k free, 377052k buffers
> Swap: 24619604k total, 0k used, 24619604k free, 3647568k cached
>
>   PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
> 29017 sassy 15  0 3921m 1.7g 1.4g S    3 14.8 384:50.34 nwchem
> 29019 sassy 15  0 3920m 1.8g 1.5g S    2 15.4 379:26.55 nwchem
> 29018 sassy 15  0 3920m 1.8g 1.5g S    1 15.5 380:31.15 nwchem
> 29021 sassy 15  0 2943m 1.7g 1.7g S    1 14.9  42:33.81 nwchem  <- process started by NWChem, I suppose
>
> As Reuti pointed out to me, NWChem is using Global Arrays internally and
> only MPI for communication. I don't think the problem is the OpenMPI
> version I have. I could upgrade to the latest, but that means I have to
> re-link all the programs I am using.
>
> Could the problem be the GA?
>
> All the best
>
> Jörg
>

From reuti at staff.uni-marburg.de Tue Dec 22 17:39:32 2009
From: reuti at staff.uni-marburg.de (Reuti)
Date: Wed, 23 Dec 2009 02:39:32 +0100
Subject: [Beowulf] Re: Performance degrading
In-Reply-To: <4B3165E5.1000403@ldeo.columbia.edu>
References: <200912221030.54293.jorg.sassmannshausen@strath.ac.uk> <4B3165E5.1000403@ldeo.columbia.edu>
Message-ID: <5B9409EF-5C77-4D11-A867-14C9EBCB406E@staff.uni-marburg.de>

Hi,

On 23.12.2009, at 01:35, Gus Correa wrote:
>
> Likewise, browsing through the GA stuff I could find no direct reference
> to OpenMPI, only to MPICH,
> although theoretically MPI is supposed to be a standard,
> and portable programs and libraries should work right with any MPI
> flavor (in practice I am not so sure this is true).

GA can be used with Open MPI. I just did this for Molpro.

Side note: Molpro now also supports a pure MPI-2 compilation besides the
former route via TCGMSG over MPI. The TCGMSG-MPI version is faster than the
plain MPI-2 compilation. Okay, one extra step, and you get the time back.

--
Reuti

From hackercasta at esdebian.org Mon Dec 21 12:12:23 2009
From: hackercasta at esdebian.org (Iker Castaños Chavarri)
Date: Mon, 21 Dec 2009 21:12:23 +0100
Subject: [Beowulf] new distribution to create a beowulf cluster: ABC (Automated Beowulf Cluster) GNU/Linux
In-Reply-To: <20091221195610.GA59020@piskorski.com>
References: <90202de50912191313h34a77c09md2625fe63001f6ba@mail.gmail.com> <20091221195610.GA59020@piskorski.com>
Message-ID: <90202de50912211212v7f95e608kc793c32e283fe119@mail.gmail.com>

Please excuse my English.
This is similar to PelicanHPC, but this distribution allows installing the
operating system on the frontend using "ubiquity", and the installation is
very easy. I'm writing a quickstart guide, but there is already an article
published by the IEEE; I'm the author:
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?tp=&arnumber=5348420&isnumber=5348395

If you go to the PelicanHPC website you can see that Michael Creel
recommends giving ABC GNU/Linux a try:
http://pareto.uab.es/mcreel/PelicanHPC/

Best regards.

2009/12/21 Andrew Piskorski :
> On Sat, Dec 19, 2009 at 10:13:44PM +0100, Iker Castaños Chavarri wrote:
>
>> This Ubuntu GNU/Linux based distribution allows you to automatically build
>> Beowulf clusters, either live or by installing the software on the
>> frontend. All nodes run diskless.
>
>> http://www.ehu.es/AC/ABC.htm
>
> How is this similar/different to Perceus/Warewulf, xCAT, Scyld, etc.?
> What use cases drove you to build your own cluster provisioning
> toolkit rather than use an existing one? Is there a document
> somewhere explaining your overall system design and rationale?
>
> --
> Andrew Piskorski
> http://www.piskorski.com/
>

--
Iker Castaños

From csamuel at vpac.org Tue Dec 29 17:28:18 2009
From: csamuel at vpac.org (Chris Samuel)
Date: Wed, 30 Dec 2009 12:28:18 +1100 (EST)
Subject: [Beowulf] A change of address..
In-Reply-To: <1799055416.758081262136340551.JavaMail.root@mail.vpac.org>
Message-ID: <1413881953.758131262136498659.JavaMail.root@mail.vpac.org>

Hi all,

As some of you already know, I'm leaving VPAC on the 8th of January to take
up a post at the University of Melbourne, running some large HPC clusters
for the Victorian Life Sciences Computational Initiative (VLSCI) project:
http://www.vlsci.unimelb.edu.au/overview.html

Yes, I'm upgrading from a 4 to a 5 letter acronym. ;-)

In preparation for that I'm going to move my many mailing list memberships
from my VPAC address to my home email address (chris at csamuel.org), as I
don't know how well the UniMelb mail system will cope with mailing lists.

cheers!
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

From vlad at geociencias.unam.mx Sat Dec 26 20:04:43 2009
From: vlad at geociencias.unam.mx (Vlad Manea)
Date: Sat, 26 Dec 2009 22:04:43 -0600
Subject: [Beowulf] PERC 5/E problems
Message-ID: <4B36DCDB.2030802@geociencias.unam.mx>

Hi,

I have a PERC 5/E card installed on my frontend (a Dell PE 2970) that will
be used to connect an MD1000 from Dell.
I have a problem: the PERC 5/E is not showing up in the BIOS. When the
server is starting up, I cannot press Ctrl-R to launch the PowerEdge
Expandable RAID Controller BIOS (the card does not display its BIOS boot
message saying "hit ctrl-r for perc 5...").

I tried a different PCI slot and riser, but with no luck.

Is there anybody out there who might give a hand fixing this?

Thanks,
V.

From skylar at cs.earlham.edu Thu Dec 31 09:02:09 2009
From: skylar at cs.earlham.edu (Skylar Thompson)
Date: Thu, 31 Dec 2009 11:02:09 -0600
Subject: [Beowulf] PERC 5/E problems
In-Reply-To: <4B36DCDB.2030802@geociencias.unam.mx>
References: <4B36DCDB.2030802@geociencias.unam.mx>
Message-ID: <4B3CD911.2030705@cs.earlham.edu>

Vlad Manea wrote:
> Hi,
>
> I have a PERC 5/E card installed on my frontend (a Dell PE 2970) that will
> be used to connect an MD1000 from Dell.
> I have a problem: the PERC 5/E is not showing up in the BIOS.
> When the server is starting up, I cannot press Ctrl-R to launch the
> PowerEdge Expandable RAID Controller BIOS (the card does not display its
> BIOS boot message saying "hit ctrl-r for perc 5...").
>
> I tried a different PCI slot and riser, but with no luck.
>
> Is there anybody out there who might give a hand fixing this?

I can't remember if the Dell BIOS has this option, but some BIOSs allow you
to clear the PCI bus cache. That will trigger a full rescan of all the
cards that are attached and could get it listed in the boot process again.
If the BIOS doesn't have that option, you could try setting the BIOS clear
jumper.

--
-- Skylar Thompson (skylar at cs.earlham.edu)
-- http://www.cs.earlham.edu/~skylar/

From skylar at cs.earlham.edu Thu Dec 31 09:22:08 2009
From: skylar at cs.earlham.edu (Skylar Thompson)
Date: Thu, 31 Dec 2009 11:22:08 -0600
Subject: [Beowulf] PERC 5/E problems
In-Reply-To: <4B3CDBD4.6000106@geociencias.unam.mx>
References: <4B36DCDB.2030802@geociencias.unam.mx> <4B3CD911.2030705@cs.earlham.edu> <4B3CDBD4.6000106@geociencias.unam.mx>
Message-ID: <4B3CDDC0.2080305@cs.earlham.edu>

Vlad Manea wrote:
> Thanks all for your replies,
>
> In the end I think I found the problem: it looks like I have the PERC
> model M778G, which apparently does NOT do RAID (maybe some of you can
> confirm that :-) ). I was thinking (wrongly maybe...) that all PERC
> cards do RAID...

I can't speak to that card specifically, but Dell in the past did sneaky
things like calling a system "RAID-capable", but in order to make it
actually do RAID you'd have to buy a hardware key or daughter card at some
inflated price.

--
-- Skylar Thompson (skylar at cs.earlham.edu)
-- http://www.cs.earlham.edu/~skylar/

From vlad at geociencias.unam.mx Thu Dec 31 09:13:56 2009
From: vlad at geociencias.unam.mx (Vlad Manea)
Date: Thu, 31 Dec 2009 11:13:56 -0600
Subject: [Beowulf] PERC 5/E problems
In-Reply-To: <4B3CD911.2030705@cs.earlham.edu>
References: <4B36DCDB.2030802@geociencias.unam.mx> <4B3CD911.2030705@cs.earlham.edu>
Message-ID: <4B3CDBD4.6000106@geociencias.unam.mx>

Thanks all for your replies,

In the end I think I found the problem: it looks like I have the PERC model
M778G, which apparently does NOT do RAID (maybe some of you can confirm
that :-) ). I was thinking (wrongly maybe...) that all PERC cards do RAID...

Cheers,
Vlad

Skylar Thompson wrote:
> Vlad Manea wrote:
>
>> Hi,
>>
>> I have a PERC 5/E card installed on my frontend (a Dell PE 2970) that
>> will be used to connect an MD1000 from Dell.
>> I have a problem: the PERC 5/E is not showing up in the BIOS. When the
>> server is starting up, I cannot press Ctrl-R to launch the PowerEdge
>> Expandable RAID Controller BIOS (the card does not display its BIOS boot
>> message saying "hit ctrl-r for perc 5...").
>>
>> I tried a different PCI slot and riser, but with no luck.
>>
>> Is there anybody out there who might give a hand fixing this?
>>
>
> I can't remember if the Dell BIOS has this option, but some BIOSs allow
> you to clear the PCI bus cache. That will trigger a full rescan of all
> the cards that are attached and could get it listed in the boot process
If the BIOS doesn't have that option, you could try setting the > BIOS clear jumper. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: