From landman at scalableinformatics.com  Tue Dec  1 10:02:43 2009
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 01 Dec 2009 13:02:43 -0500
Subject: [Beowulf] Forwarded from a long time reader having trouble posting
Message-ID: <4B155A43.3010304@scalableinformatics.com>

My apologies if this is bad form, I know Toon from his past 
participation on this list, and he asked me to forward.

-------- Original Message --------

Dear all,

I've been working on hpux-itanium for the last 2 years (and even
unsubscribed to beowulf-ml during most of that time, my bad) but soon
will turn back to a beowulf cluster (HP DL380G6's with Xeon X5570,
amcc/3ware 9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have a few
questions on the config.

1) our company is standardised on RHEL 5.1. Would sticking with rhel 5.1
instead of going to the latest make a difference.
2) What are the advantages of the hpc version of rhel. I browsed the doc
but unless having to compile mpi myself I do not see a difference or did
I miss soth.
3) which filesystem is advisable knowing that we're calculating on large
berkeley db databases

thanks in advance,

toon

Toon Knapen toon.knapen at gmail.com

-----------------------------------

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


From gerry.creager at tamu.edu  Tue Dec  1 11:45:13 2009
From: gerry.creager at tamu.edu (Gerald Creager)
Date: Tue, 01 Dec 2009 13:45:13 -0600
Subject: [Beowulf] Forwarded from a long time reader having trouble posting
In-Reply-To: <4B155A43.3010304@scalableinformatics.com>
References: <4B155A43.3010304@scalableinformatics.com>
Message-ID: <4B157249.80701@tamu.edu>

Toon, welcome back!

I've been quite happy with CentOS 5.3 and we're experimenting with 
CentOS 5.4 now.  I see good stability in 5.[34] and the incorporation of 
a couple of tools worth having in a distribution for 'Wulf use.  I'd not 
recommend sticking with the old version, but of course, once you're 
established, not carelessly upgrading, either.

gerry

Joe Landman wrote:
> My apologies if this is bad form, I know Toon from his past 
> participation on this list, and he asked me to forward.
> 
> -------- Original Message --------
> 
> Dear all,
> 
> I've been working on hpux-itanium for the last 2 years (and even
> unsubscribed to beowulf-ml during most of that time, my bad) but soon
> will turn back to a beowulf cluster (HP DL380G6's with Xeon X5570,
> amcc/3ware 9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have a few
> questions on the config.
> 
> 1) our company is standardised on RHEL 5.1. Would sticking with rhel 5.1
> instead of going to the latest make a difference.
> 2) What are the advantages of the hpc version of rhel. I browsed the doc
> but unless having to compile mpi myself I do not see a difference or did
> I miss soth.
> 3) which filesystem is advisable knowing that we're calculating on large
> berkeley db databases
> 
> thanks in advance,
> 
> toon
> 
> Toon Knapen toon.knapen at gmail.com
> 
> -----------------------------------
> 


From lindahl at pbm.com  Tue Dec  1 11:57:38 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Tue, 1 Dec 2009 11:57:38 -0800
Subject: [Beowulf] MPI Processes + Auto Vectorization
In-Reply-To: <428810f20911302214x4b85a07du4684bbb57f60a72b@mail.gmail.com>
References: <428810f20911301224u64783b27q465a1bc73918849@mail.gmail.com>
	<20091130225024.GA28311@nlxdcldnl2.cl.intel.com>
	<428810f20911302214x4b85a07du4684bbb57f60a72b@mail.gmail.com>
Message-ID: <20091201195738.GA24566@bx9.net>

On Tue, Dec 01, 2009 at 01:14:13AM -0500, amjad ali wrote:

> My question is that if we do not have free cpu cores in a PC or cluster (all
> cores are running MPI processes), still the auto-vertorization is
> beneficial? Or it is beneficial only if we have some free cpu cores locally?

Perhaps you're confusing auto-parallelization and auto-vectorization?

Auto-vectorization does not use any more cpu cores than unvectorized code.

-- greg


From tom.elken at qlogic.com  Tue Dec  1 11:57:29 2009
From: tom.elken at qlogic.com (Tom Elken)
Date: Tue, 1 Dec 2009 11:57:29 -0800
Subject: [Beowulf] Forwarded from a long time reader having trouble posting
In-Reply-To: <4B157249.80701@tamu.edu>
References: <4B155A43.3010304@scalableinformatics.com>
	<4B157249.80701@tamu.edu>
Message-ID: <35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org>

> On Behalf Of Gerald Creager
> I've been quite happy with CentOS 5.3 and we're experimenting with
> CentOS 5.4 now.  I see good stability in 5.[34] 

I have to second the recommendation of 5.3 or 5.4.

Some time ago, we saw significant performance improvements on Nehalem (Xeon X5570) in moving from RHEL 5.2 to 5.3.  So I expect that moving from 5.1 to 5.[34] would also be a significant improvement in performance.

Cheers,
-Tom

> and the incorporation
> of
> a couple of tools worth having in a distribution for 'Wulf use.  I'd
> not
> recommend sticking with the old version, but of course, once you're
> established, not carelessly upgrading, either.
> 
> gerry
> 
> Joe Landman wrote:
> > My apologies if this is bad form, I know Toon from his past
> > participation on this list, and he asked me to forward.
> >
> > -------- Original Message --------
> >
> > Dear all,
> >
> > I've been working on hpux-itanium for the last 2 years (and even
> > unsubscribed to beowulf-ml during most of that time, my bad) but soon
> > will turn back to a beowulf cluster (HP DL380G6's with Xeon X5570,
> > amcc/3ware 9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have a few
> > questions on the config.
> >
> > 1) our company is standardised on RHEL 5.1. Would sticking with rhel
> 5.1
> > instead of going to the latest make a difference.
> > 2) What are the advantages of the hpc version of rhel. I browsed the
> doc
> > but unless having to compile mpi myself I do not see a difference or
> did
> > I miss soth.
> > 3) which filesystem is advisable knowing that we're calculating on
> large
> > berkeley db databases
> >
> > thanks in advance,
> >
> > toon
> >
> > Toon Knapen toon.knapen at gmail.com
> >
> > -----------------------------------
> >
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
> Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf


From gerry.creager at tamu.edu  Tue Dec  1 13:04:32 2009
From: gerry.creager at tamu.edu (Gerald Creager)
Date: Tue, 01 Dec 2009 15:04:32 -0600
Subject: [Beowulf] Forwarded from a long time reader having trouble posting
In-Reply-To: <d5bdff000912011253x236ec695u87b14777f7ebb737@mail.gmail.com>
References: <4B155A43.3010304@scalableinformatics.com>	
	<4B157249.80701@tamu.edu>	
	<35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org>
	<d5bdff000912011253x236ec695u87b14777f7ebb737@mail.gmail.com>
Message-ID: <4B1584E0.8060704@tamu.edu>

A combination of mostly kernel improvements, and some useful middleware 
as RedHat and by extension, CentOS, seek to get farther into the cluster 
space.
gerry

Toon Knapen wrote:
> Any idea why it gives better performance? Was it on memory bw intensive 
> apps? Could it be due to changes in the kernel that take into account 
> the Numa architecture (affinity) or ...
> 
> On Tue, Dec 1, 2009 at 8:57 PM, Tom Elken <tom.elken at qlogic.com 
> <mailto:tom.elken at qlogic.com>> wrote:
> 
>      > On Behalf Of Gerald Creager
>      > I've been quite happy with CentOS 5.3 and we're experimenting with
>      > CentOS 5.4 now.  I see good stability in 5.[34]
> 
>     I have to second the recommendation of 5.3 or 5.4.
> 
>     Some time ago, we saw significant performance improvements on
>     Nehalem (Xeon X5570) in moving from RHEL 5.2 to 5.3.  So I expect
>     that moving from 5.1 to 5.[34] would also be a significant
>     improvement in performance.
> 
>     Cheers,
>     -Tom
> 
>      > and the incorporation
>      > of
>      > a couple of tools worth having in a distribution for 'Wulf use.  I'd
>      > not
>      > recommend sticking with the old version, but of course, once you're
>      > established, not carelessly upgrading, either.
>      >
>      > gerry
>      >
>      > Joe Landman wrote:
>      > > My apologies if this is bad form, I know Toon from his past
>      > > participation on this list, and he asked me to forward.
>      > >
>      > > -------- Original Message --------
>      > >
>      > > Dear all,
>      > >
>      > > I've been working on hpux-itanium for the last 2 years (and even
>      > > unsubscribed to beowulf-ml during most of that time, my bad)
>     but soon
>      > > will turn back to a beowulf cluster (HP DL380G6's with Xeon X5570,
>      > > amcc/3ware 9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have
>     a few
>      > > questions on the config.
>      > >
>      > > 1) our company is standardised on RHEL 5.1. Would sticking with
>     rhel
>      > 5.1
>      > > instead of going to the latest make a difference.
>      > > 2) What are the advantages of the hpc version of rhel. I
>     browsed the
>      > doc
>      > > but unless having to compile mpi myself I do not see a
>     difference or
>      > did
>      > > I miss soth.
>      > > 3) which filesystem is advisable knowing that we're calculating on
>      > large
>      > > berkeley db databases
>      > >
>      > > thanks in advance,
>      > >
>      > > toon
>      > >
>      > > Toon Knapen toon.knapen at gmail.com <mailto:toon.knapen at gmail.com>
>      > >
>      > > -----------------------------------
>      > >
>      > _______________________________________________
>      > Beowulf mailing list, Beowulf at beowulf.org
>     <mailto:Beowulf at beowulf.org> sponsored by Penguin
>      > Computing
>      > To change your subscription (digest mode or unsubscribe) visit
>      > http://www.beowulf.org/mailman/listinfo/beowulf
> 
>     _______________________________________________
>     Beowulf mailing list, Beowulf at beowulf.org
>     <mailto:Beowulf at beowulf.org> sponsored by Penguin Computing
>     To change your subscription (digest mode or unsubscribe) visit
>     http://www.beowulf.org/mailman/listinfo/beowulf
> 
> 


From amacater at galactic.demon.co.uk  Tue Dec  1 13:42:18 2009
From: amacater at galactic.demon.co.uk (Andrew M.A. Cater)
Date: Tue, 1 Dec 2009 21:42:18 +0000
Subject: [Beowulf] Forwarded from a long time reader having trouble posting
In-Reply-To: <4B1584E0.8060704@tamu.edu>
References: <4B155A43.3010304@scalableinformatics.com>
	<4B157249.80701@tamu.edu>
	<35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org>
	<d5bdff000912011253x236ec695u87b14777f7ebb737@mail.gmail.com>
	<4B1584E0.8060704@tamu.edu>
Message-ID: <20091201214217.GA19804@galactic.demon.co.uk>

On Tue, Dec 01, 2009 at 03:04:32PM -0600, Gerald Creager wrote:
> A combination of mostly kernel improvements, and some useful middleware  
> as RedHat and by extension, CentOS, seek to get farther into the cluster  
> space.
> gerry
>

Maybe also some licensing breaks on large volume licensing. Red Hat is 
primarily a sales and service organisation that also produces a Linux 
by-product :) The HPC variant is targetted at areas which deal in large 
clusters at cheaper than Red Hat Enterprise Linux for servers at 
equivalent volume, IIRC.

> Toon Knapen wrote:
>> Any idea why it gives better performance? Was it on memory bw intensive 
>> apps? Could it be due to changes in the kernel that take into account  
>> the Numa architecture (affinity) or ...
>>
>> On Tue, Dec 1, 2009 at 8:57 PM, Tom Elken <tom.elken at qlogic.com  
>> <mailto:tom.elken at qlogic.com>> wrote:
>>
>>      > On Behalf Of Gerald Creager
>>      > I've been quite happy with CentOS 5.3 and we're experimenting with
>>      > CentOS 5.4 now.  I see good stability in 5.[34]
>>
>>     I have to second the recommendation of 5.3 or 5.4.

I like 5.2 and 5.4 but my normal experience is on relatively stock IBM 
hardware. As ever, YMMV.

>>      > > Dear all,
>>      > >
>>      > > I've been working on hpux-itanium for the last 2 years (and even
>>      > > unsubscribed to beowulf-ml during most of that time, my bad)
>>     but soon
>>      > > will turn back to a beowulf cluster (HP DL380G6's with Xeon X5570,
>>      > > amcc/3ware 9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have
>>     a few
>>      > > questions on the config.
>>      > >
>>      > > 1) our company is standardised on RHEL 5.1. Would sticking with
>>     rhel
>>      > 5.1
>>      > > instead of going to the latest make a difference.

Since you have up to date hardware - also check on the necessary version 
of 3Ware drivers and where they are supported. The command line 
utilities are particularly useful.

>>      > > 2) What are the advantages of the hpc version of rhel. I
>>     browsed the
>>      > doc
>>      > > but unless having to compile mpi myself I do not see a
>>     difference or
>>      > did
>>      > > I miss soth.

See above.

>>      > > 3) which filesystem is advisable knowing that we're calculating on
>>      > large
>>      > > berkeley db databases
>>      > >

You get ext3 or Red Hat's cluster filesystem ?? GFS ??, I think. No xfs 
/ Reiser by default. Check also with HP as to what file systems they 
would recommend. 


>>      > > thanks in advance,
>>      > >
>>      > > toon
>>      > >
>>      > > Toon Knapen toon.knapen at gmail.com <mailto:toon.knapen at gmail.com>
>>      > >

Always happy to pontificate :)

All the best,

Andy


From tom.elken at qlogic.com  Tue Dec  1 13:51:18 2009
From: tom.elken at qlogic.com (Tom Elken)
Date: Tue, 1 Dec 2009 13:51:18 -0800
Subject: [Beowulf] Forwarded from a long time reader having trouble posting
In-Reply-To: <d5bdff000912011253x236ec695u87b14777f7ebb737@mail.gmail.com>
References: <4B155A43.3010304@scalableinformatics.com>
	<4B157249.80701@tamu.edu>
	<35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org>
	<d5bdff000912011253x236ec695u87b14777f7ebb737@mail.gmail.com>
Message-ID: <35AAF1E4A771E142979F27B51793A48887030F12E9@AVEXMB1.qlogic.org>

Toon wrote:  "Any idea why it gives better performance? Was it on memory bw intensive apps? Could it be due to changes in the kernel that take into account the Numa architecture (affinity)"

This is very likely it.  Dredging thru old e-mails, I see that, in Jan-09, the application of a then-current 2.6.28 kernel to a RHEL 5.2 system provided about a 2x improvement in 8-thread OpenMP STREAM performance.  Subsequently, moving to RHEL 5.3 and its default kernel provided the same good STREAM performance.

-Tom


From: Toon Knapen [mailto:toon.knapen at gmail.com]
Sent: Tuesday, December 01, 2009 12:53 PM
To: Tom Elken
Cc: gerry.creager at tamu.edu; landman at scalableinformatics.com; beowulf
Subject: Re: [Beowulf] Forwarded from a long time reader having trouble posting

Any idea why it gives better performance? Was it on memory bw intensive apps? Could it be due to changes in the kernel that take into account the Numa architecture (affinity) or ...
On Tue, Dec 1, 2009 at 8:57 PM, Tom Elken <tom.elken at qlogic.com<mailto:tom.elken at qlogic.com>> wrote:
> On Behalf Of Gerald Creager
> I've been quite happy with CentOS 5.3 and we're experimenting with
> CentOS 5.4 now.  I see good stability in 5.[34]
I have to second the recommendation of 5.3 or 5.4.

Some time ago, we saw significant performance improvements on Nehalem (Xeon X5570) in moving from RHEL 5.2 to 5.3.  So I expect that moving from 5.1 to 5.[34] would also be a significant improvement in performance.

Cheers,
-Tom

> and the incorporation
> of
> a couple of tools worth having in a distribution for 'Wulf use.  I'd
> not
> recommend sticking with the old version, but of course, once you're
> established, not carelessly upgrading, either.
>
> gerry
>
> Joe Landman wrote:
> > My apologies if this is bad form, I know Toon from his past
> > participation on this list, and he asked me to forward.
> >
> > -------- Original Message --------
> >
> > Dear all,
> >
> > I've been working on hpux-itanium for the last 2 years (and even
> > unsubscribed to beowulf-ml during most of that time, my bad) but soon
> > will turn back to a beowulf cluster (HP DL380G6's with Xeon X5570,
> > amcc/3ware 9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have a few
> > questions on the config.
> >
> > 1) our company is standardised on RHEL 5.1. Would sticking with rhel
> 5.1
> > instead of going to the latest make a difference.
> > 2) What are the advantages of the hpc version of rhel. I browsed the
> doc
> > but unless having to compile mpi myself I do not see a difference or
> did
> > I miss soth.
> > 3) which filesystem is advisable knowing that we're calculating on
> large
> > berkeley db databases
> >
> > thanks in advance,
> >
> > toon
> >
> > Toon Knapen toon.knapen at gmail.com<mailto:toon.knapen at gmail.com>
> >
> > -----------------------------------
> >
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org<mailto:Beowulf at beowulf.org> sponsored by Penguin
> Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org<mailto:Beowulf at beowulf.org> sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091201/c7c2bf7b/attachment.html>

From gerry.creager at tamu.edu  Tue Dec  1 14:05:36 2009
From: gerry.creager at tamu.edu (Gerald Creager)
Date: Tue, 01 Dec 2009 16:05:36 -0600
Subject: [Beowulf] Forwarded from a long time reader having trouble posting
In-Reply-To: <20091201214217.GA19804@galactic.demon.co.uk>
References: <4B155A43.3010304@scalableinformatics.com>	<4B157249.80701@tamu.edu>	<35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org>	<d5bdff000912011253x236ec695u87b14777f7ebb737@mail.gmail.com>	<4B1584E0.8060704@tamu.edu>
	<20091201214217.GA19804@galactic.demon.co.uk>
Message-ID: <4B159330.5070802@tamu.edu>

Andrew M.A. Cater wrote:
> On Tue, Dec 01, 2009 at 03:04:32PM -0600, Gerald Creager wrote:
>> A combination of mostly kernel improvements, and some useful middleware  
>> as RedHat and by extension, CentOS, seek to get farther into the cluster  
>> space.
>> gerry
>>
> 
> Maybe also some licensing breaks on large volume licensing. Red Hat is 
> primarily a sales and service organisation that also produces a Linux 
> by-product :) The HPC variant is targetted at areas which deal in large 
> clusters at cheaper than Red Hat Enterprise Linux for servers at 
> equivalent volume, IIRC.
> 
>> Toon Knapen wrote:
>>> Any idea why it gives better performance? Was it on memory bw intensive 
>>> apps? Could it be due to changes in the kernel that take into account  
>>> the Numa architecture (affinity) or ...
>>>
>>> On Tue, Dec 1, 2009 at 8:57 PM, Tom Elken <tom.elken at qlogic.com  
>>> <mailto:tom.elken at qlogic.com>> wrote:
>>>
>>>      > On Behalf Of Gerald Creager
>>>      > I've been quite happy with CentOS 5.3 and we're experimenting with
>>>      > CentOS 5.4 now.  I see good stability in 5.[34]
>>>
>>>     I have to second the recommendation of 5.3 or 5.4.
> 
> I like 5.2 and 5.4 but my normal experience is on relatively stock IBM 
> hardware. As ever, YMMV.
> 
>>>      > > Dear all,
>>>      > >
>>>      > > I've been working on hpux-itanium for the last 2 years (and even
>>>      > > unsubscribed to beowulf-ml during most of that time, my bad)
>>>     but soon
>>>      > > will turn back to a beowulf cluster (HP DL380G6's with Xeon X5570,
>>>      > > amcc/3ware 9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have
>>>     a few
>>>      > > questions on the config.
>>>      > >
>>>      > > 1) our company is standardised on RHEL 5.1. Would sticking with
>>>     rhel
>>>      > 5.1
>>>      > > instead of going to the latest make a difference.
> 
> Since you have up to date hardware - also check on the necessary version 
> of 3Ware drivers and where they are supported. The command line 
> utilities are particularly useful.
> 
>>>      > > 2) What are the advantages of the hpc version of rhel. I
>>>     browsed the
>>>      > doc
>>>      > > but unless having to compile mpi myself I do not see a
>>>     difference or
>>>      > did
>>>      > > I miss soth.
> 
> See above.
> 
>>>      > > 3) which filesystem is advisable knowing that we're calculating on
>>>      > large
>>>      > > berkeley db databases
>>>      > >
> 
> You get ext3 or Red Hat's cluster filesystem ?? GFS ??, I think. No xfs 
> / Reiser by default. Check also with HP as to what file systems they 
> would recommend. 
> 
> 
>>>      > > thanks in advance,
>>>      > >
>>>      > > toon
>>>      > >
>>>      > > Toon Knapen toon.knapen at gmail.com <mailto:toon.knapen at gmail.com>
>>>      > >
> 
> Always happy to pontificate :)


I believe xfs is now available in 5.4.  I'd have to check.  We've found 
xfs to be our preference (but we're revisiting gluster and lustre). 
I've not played with gfs so far.

gerry


From lindahl at pbm.com  Tue Dec  1 14:25:07 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Tue, 1 Dec 2009 14:25:07 -0800
Subject: [Beowulf] Forwarded from a long time reader having trouble posting
In-Reply-To: <4B155A43.3010304@scalableinformatics.com>
References: <4B155A43.3010304@scalableinformatics.com>
Message-ID: <20091201222507.GB17474@bx9.net>

> 3) which filesystem is advisable knowing that we're calculating on large
> berkeley db databases

I've had friends tell me that I should never use long-lived berkeley
db databases without a good backup-and-recovery or recreate-from-scratch
plan.

Berkeley db comes with a test suite for integrity, and last time I
used it under Linux, it didn't pass.

-- g


From bill at cse.ucdavis.edu  Tue Dec  1 14:50:53 2009
From: bill at cse.ucdavis.edu (Bill Broadley)
Date: Tue, 01 Dec 2009 14:50:53 -0800
Subject: [Beowulf] Forwarded from a long time reader having trouble posting
In-Reply-To: <4B155A43.3010304@scalableinformatics.com>
References: <4B155A43.3010304@scalableinformatics.com>
Message-ID: <4B159DCD.6020102@cse.ucdavis.edu>

Joe Landman wrote:
> My apologies if this is bad form, I know Toon from his past
> participation on this list, and he asked me to forward.
> 
> -------- Original Message --------

Hi Toon, long time no type.
> Dear all,

> I've been working on hpux-itanium for the last 2 years (and even
> unsubscribed to beowulf-ml during most of that time, my bad) but soon
> will turn back to a beowulf cluster (HP DL380G6's with Xeon X5570,
> amcc/3ware 9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have a few
> questions on the config. 
> 1) our company is standardised on RHEL 5.1. Would sticking with rhel 5.1
> instead of going to the latest make a difference.

That's kind of strange.  So you never patch?  A patched RHEL 5.1 box auto
upgrades to 5.4, doesn't it?  Or is that something specific to CentOS?  5.1 is
rather old I'd worry about poor support for hyperthreading, and things like
GigE drivers when using a release from 2007, especially with hardware released
in 2009.

I believe the older kernels handle the extra cores rather poorly, and don't
even recognize the intel CPUs as NUMA enabled.  You didn't mention hardware or
software RAID.  I'd recommend RAID scrubbing, and if software that requires (I
think) >= 2.6.21, although (I think) Redhat back ported it into their newest
kernels in 5.4, or maybe 5.3.  Definitely not in 5.1 though.

> 2) What are the advantages of the hpc version of rhel. I browsed the doc
> but unless having to compile mpi myself I do not see a difference or did
> I miss soth.

I've never seen the HPC version of RHEL, but I have build a cluster
distribution based on RHEL a few times.  It's pretty common to need to tweak
the various cluster related pieces, like say tight integration which often
requires tweaks to the MPI layer and the batch queue.   I suspect the biggest
advantage for the HPC version of RHEL is a cheaper per seat license.  If you
end up layering things on top of RHEL yourself I recommend cobbler, puppet,
ganglia, and openmpi.

> 3) which filesystem is advisable knowing that we're calculating on large
> berkeley db databases

I've not seen a particularly big difference on random workloads typical of
databases.  Are the databases bigger than ram?  Does your 3ware have a
battery?  Allowing the raid controller to acknowledge writes before they hit
the disk might be a big win (if your DB has lots of writes)?  Can you afford a
SSD to hold the berkeley DB?


From csamuel at vpac.org  Tue Dec  1 15:42:39 2009
From: csamuel at vpac.org (Chris Samuel)
Date: Wed, 2 Dec 2009 10:42:39 +1100 (EST)
Subject: [Beowulf] Forwarded from a long time reader having trouble posting
In-Reply-To: <566807952.7177751259710895916.JavaMail.root@mail.vpac.org>
Message-ID: <45900213.7177791259710959407.JavaMail.root@mail.vpac.org>


----- "Gerald Creager" <gerry.creager at tamu.edu> wrote:

> I believe xfs is now available in 5.4.  I'd have to check.

My meagre understanding based totally on rumours is
that it's still a preview release and that you need
a special support contract with Red Hat to get access.

I'd love to know that I'm wrong there though!

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


From artpoon at gmail.com  Tue Dec  1 12:45:52 2009
From: artpoon at gmail.com (Art Poon)
Date: Tue, 1 Dec 2009 12:45:52 -0800
Subject: [Beowulf] Re: cluster fails to boot with managed switch,
	but 5-port switch works OK
Message-ID: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com>

Dear colleagues,

I am in charge of managing a cluster at our research centre and am stuck with a vexing (to me) problem!

(Disclaimer: I am a biologist by training and a mostly self-taught programmer.  I am still learning about networking and cluster management, so please bear with me!)

This is an asymmetric Intel Xeon cluster running 4 compute nodes on CentOS 5.4 and Scyld Clusterware 5.  We managed to get it up and running using a dinky little NetGear 5-port 10/100/1000 switch.  Now that I'm looking to expand the cluster, I need to get the managed switch working (an SMC 8824M, though we have several other switches available).

What's got me and the IT guys stumped is that while the compute nodes boot via PXE from the head node without trouble on the NetGear, they barf with the SMC.  To be specific, after the initial boot with a minimal Linux kernel, there is a "fatal error" with "timeout waiting for getfile" when the compute node attempts to download the provisioning image from head.  However, when they were running Rocks before I arrived, the cluster worked fine with the SMC switch.

I've tried resetting the SMC switch to factory defaults (with auto-negotiate on).  I've checked the /etc/beowulf/modprobe.conf and it doesn't seem to be demanding anything exotic.  We've tried swapping out to another SMC switch but that didn't change anything.  

I'm grateful if you could weigh in with your expertise.

Thank you,
- Art.


From toon.knapen at gmail.com  Tue Dec  1 12:53:19 2009
From: toon.knapen at gmail.com (Toon Knapen)
Date: Tue, 1 Dec 2009 21:53:19 +0100
Subject: [Beowulf] Forwarded from a long time reader having trouble 
	posting
In-Reply-To: <35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org>
References: <4B155A43.3010304@scalableinformatics.com>
	<4B157249.80701@tamu.edu>
	<35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org>
Message-ID: <d5bdff000912011253x236ec695u87b14777f7ebb737@mail.gmail.com>

Any idea why it gives better performance? Was it on memory bw intensive
apps? Could it be due to changes in the kernel that take into account the
Numa architecture (affinity) or ...

On Tue, Dec 1, 2009 at 8:57 PM, Tom Elken <tom.elken at qlogic.com> wrote:

> > On Behalf Of Gerald Creager
> > I've been quite happy with CentOS 5.3 and we're experimenting with
> > CentOS 5.4 now.  I see good stability in 5.[34]
>
> I have to second the recommendation of 5.3 or 5.4.
>
> Some time ago, we saw significant performance improvements on Nehalem (Xeon
> X5570) in moving from RHEL 5.2 to 5.3.  So I expect that moving from 5.1 to
> 5.[34] would also be a significant improvement in performance.
>
> Cheers,
> -Tom
>
> > and the incorporation
> > of
> > a couple of tools worth having in a distribution for 'Wulf use.  I'd
> > not
> > recommend sticking with the old version, but of course, once you're
> > established, not carelessly upgrading, either.
> >
> > gerry
> >
> > Joe Landman wrote:
> > > My apologies if this is bad form, I know Toon from his past
> > > participation on this list, and he asked me to forward.
> > >
> > > -------- Original Message --------
> > >
> > > Dear all,
> > >
> > > I've been working on hpux-itanium for the last 2 years (and even
> > > unsubscribed to beowulf-ml during most of that time, my bad) but soon
> > > will turn back to a beowulf cluster (HP DL380G6's with Xeon X5570,
> > > amcc/3ware 9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have a few
> > > questions on the config.
> > >
> > > 1) our company is standardised on RHEL 5.1. Would sticking with rhel
> > 5.1
> > > instead of going to the latest make a difference.
> > > 2) What are the advantages of the hpc version of rhel. I browsed the
> > doc
> > > but unless having to compile mpi myself I do not see a difference or
> > did
> > > I miss soth.
> > > 3) which filesystem is advisable knowing that we're calculating on
> > large
> > > berkeley db databases
> > >
> > > thanks in advance,
> > >
> > > toon
> > >
> > > Toon Knapen toon.knapen at gmail.com
> > >
> > > -----------------------------------
> > >
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
> > Computing
> > To change your subscription (digest mode or unsubscribe) visit
> > http://www.beowulf.org/mailman/listinfo/beowulf
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091201/9f1511db/attachment.html>

From toon.knapen at gmail.com  Wed Dec  2 03:47:19 2009
From: toon.knapen at gmail.com (Toon Knapen)
Date: Wed, 2 Dec 2009 12:47:19 +0100
Subject: [Beowulf] Forwarded from a long time reader having trouble 
	posting
In-Reply-To: <20091201214217.GA19804@galactic.demon.co.uk>
References: <4B155A43.3010304@scalableinformatics.com>
	<4B157249.80701@tamu.edu>
	<35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org>
	<d5bdff000912011253x236ec695u87b14777f7ebb737@mail.gmail.com>
	<4B1584E0.8060704@tamu.edu>
	<20091201214217.GA19804@galactic.demon.co.uk>
Message-ID: <d5bdff000912020347n277286d8gfa94a9ad7aeb214@mail.gmail.com>

>
>
> Maybe also some licensing breaks on large volume licensing. Red Hat is
> primarily a sales and service organisation that also produces a Linux
> by-product :) The HPC variant is targetted at areas which deal in large
> clusters at cheaper than Red Hat Enterprise Linux for servers at
> equivalent volume, IIRC.
>


AFAICT the HPC version is more expensive but you get extra tools for that
such as benchmarks, pre-compiled mpi, batch-scheduler from Platform.

But MPI is easy to intall and I would prefer other batch-schedulers instead
of the one of Platform so ... I wonder if the kernel/distribution itself is
more optimised ?


> Since you have up to date hardware - also check on the necessary version
> of 3Ware drivers and where they are supported. The command line
> utilities are particularly useful.
>


thanks for the tip.


>
>
> You get ext3 or Red Hat's cluster filesystem ?? GFS ??, I think. No xfs
> / Reiser by default. Check also with HP as to what file systems they
> would recommend.
>


No, we'll be using local file systems primarily and also a connection to a
SAN but no global filesystem.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091202/d2da723a/attachment.html>

From toon.knapen at gmail.com  Wed Dec  2 03:49:27 2009
From: toon.knapen at gmail.com (Toon Knapen)
Date: Wed, 2 Dec 2009 12:49:27 +0100
Subject: [Beowulf] Forwarded from a long time reader having trouble 
	posting
In-Reply-To: <4B159330.5070802@tamu.edu>
References: <4B155A43.3010304@scalableinformatics.com>
	<4B157249.80701@tamu.edu>
	<35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org>
	<d5bdff000912011253x236ec695u87b14777f7ebb737@mail.gmail.com>
	<4B1584E0.8060704@tamu.edu>
	<20091201214217.GA19804@galactic.demon.co.uk>
	<4B159330.5070802@tamu.edu>
Message-ID: <d5bdff000912020349ka3830cfl679dead4a04f1581@mail.gmail.com>

>
>
> I believe xfs is now available in 5.4.  I'd have to check.  We've found xfs
> to be our preference (but we're revisiting gluster and lustre). I've not
> played with gfs so far.
>


And why do you prefer xfs if I may ask. Performance? Do you many small files
or large files?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091202/0479eb93/attachment.html>

From toon.knapen at gmail.com  Wed Dec  2 03:53:28 2009
From: toon.knapen at gmail.com (Toon Knapen)
Date: Wed, 2 Dec 2009 12:53:28 +0100
Subject: [Beowulf] Forwarded from a long time reader having trouble 
	posting
In-Reply-To: <20091201222507.GB17474@bx9.net>
References: <4B155A43.3010304@scalableinformatics.com>
	<20091201222507.GB17474@bx9.net>
Message-ID: <d5bdff000912020353k7301ee79k1de64beaa9c5d0b1@mail.gmail.com>

>
> I've had friends tell me that I should never use long-lived berkeley
> db databases without a good backup-and-recovery or recreate-from-scratch
> plan.
>
> Berkeley db comes with a test suite for integrity, and last time I
> used it under Linux, it didn't pass.
>


You mean that subsequent (minor) versions of bdb are not necessarily totally
compatible ?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091202/be36430c/attachment.html>

From toon.knapen at gmail.com  Wed Dec  2 03:59:55 2009
From: toon.knapen at gmail.com (Toon Knapen)
Date: Wed, 2 Dec 2009 12:59:55 +0100
Subject: [Beowulf] Forwarded from a long time reader having trouble 
	posting
In-Reply-To: <4B159DCD.6020102@cse.ucdavis.edu>
References: <4B155A43.3010304@scalableinformatics.com>
	<4B159DCD.6020102@cse.ucdavis.edu>
Message-ID: <d5bdff000912020359y1487b192o6ff20685700567e3@mail.gmail.com>

>
> I believe the older kernels handle the extra cores rather poorly, and don't
> even recognize the intel CPUs as NUMA enabled.  You didn't mention hardware
> or
> software RAID.  I'd recommend RAID scrubbing, and if software that requires
> (I
> think) >= 2.6.21, although (I think) Redhat back ported it into their
> newest
> kernels in 5.4, or maybe 5.3.  Definitely not in 5.1 though.
>
For performance reasons we added the 3ware card to handle the raid.


> I've not seen a particularly big difference on random workloads typical of
> databases.  Are the databases bigger than ram?  Does your 3ware have a
> battery?  Allowing the raid controller to acknowledge writes before they
> hit
> the disk might be a big win (if your DB has lots of writes)?  Can you
> afford a
> SSD to hold the berkeley DB?
>
The card has 512 MB of memory. I suppose it will cache the writes there. But
the bdb's can be up to 70 GB large so we'll never be able to pull them in
memory.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091202/ab912ac1/attachment.html>

From jlb17 at duke.edu  Wed Dec  2 10:14:32 2009
From: jlb17 at duke.edu (Joshua Baker-LePain)
Date: Wed, 2 Dec 2009 13:14:32 -0500 (EST)
Subject: [Beowulf] Forwarded from a long time reader having trouble posting
In-Reply-To: <45900213.7177791259710959407.JavaMail.root@mail.vpac.org>
References: <45900213.7177791259710959407.JavaMail.root@mail.vpac.org>
Message-ID: <alpine.LRH.2.00.0912021308440.8086@hogwarts.egr.duke.edu>

On Wed, 2 Dec 2009 at 10:42am, Chris Samuel wrote

>> I believe xfs is now available in 5.4.  I'd have to check.
>
> My meagre understanding based totally on rumours is
> that it's still a preview release and that you need
> a special support contract with Red Hat to get access.
>
> I'd love to know that I'm wrong there though!

In CentOS, at least, the xfs module comes with the regular kernel (so I'm 
guessing it's the same with stock RHEL).  What Red Hat is *not* shipping 
by default are any of the filesystem utilities, so you can't, e.g., 
actually mkfs an XFS filesystem.  But you can get the xfsprogs RPM from 
the CentOS extras repo and that should work just fine.

-- 
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF


From eugen at leitl.org  Wed Dec  2 10:22:43 2009
From: eugen at leitl.org (Eugen Leitl)
Date: Wed, 2 Dec 2009 19:22:43 +0100
Subject: [Beowulf] Intel shows 48-core 'datacentre on a chip' 
Message-ID: <20091202182243.GG17686@leitl.org>


http://news.zdnet.co.uk/hardware/0,1000000091,39918721,00.htm

Intel shows 48-core 'datacentre on a chip'

* Tags: * Manycore, * Cloud, * Operating System, * Processor

Rupert Goodwins ZDNet UK

Published: 02 Dec 2009 18:00 GMT

Intel has announced the Single-chip Cloud Computer (SCC), an experimental
48-core processor designed to encourage research and development in massively
parallel computation.

Measuring 567 square millimetres ? about the size of a postage stamp ? the
SCC combines 24 dual-core processing elements, each with its own router, four
DDR3 memory controllers capable of handling up to 8GB apiece, and a very fast
on-chip network.

Although no performance, speed or total bandwidth figures were revealed, the
chip has 1.3 billion transistors, consumes up to 125 watts in operation, and
will become available to researchers around the world on a standard-sized
motherboard in the first half of 2010. Intel said it expects to sign up
dozens of partners within six months, with more to come over time.

"This is the prototype of the microprocessor of the future", Joseph Sch?tz,
director of microprocessor and programming research at Intel Labs, said on
Tuesday. The announcement took place at the company's R&D centre in
Braunschweig, Germany, at the first of three SCC launch events around the
world. "Before, if you needed to design software for this level of computing,
you needed your own datacentre. Now, you just need your own chip," said
Sch?tz.

The SCC, previously code-named Rock Creek, is fabricated in 45nm technology.
The on-chip network is configured as a 6x4 node, two-dimensional mesh. It has
a bandwidth of 256GBps, and each core can run its own independent software as
a fully functional IA-32 processor.

Memory, 384KB of it, is shared between all cores, primarily to speed message
passing, while power management can independently control eight
variable-voltage and 28 variable-frequency areas of the chip. This controls
power consumption, setting it between 25 and 125 watts.  Intel SCC
 
Intel's Single-chip Cloud Computer: 24 dual-core tiles on a 567mm2 die

"We called it the Single-chip Cloud Computer, but it was very difficult to
know how to name it", said Sch?tz. "You can easily envision many more cores,
just as you could add more servers to a real datacentre. We could build
relatively small systems with hundreds or thousands of cores."

Unlike other manycore systems, the SCC does not maintain data integrity
through cache coherency, where special circuitry keeps multiple caches in
sync across cores.

"Cache coherency consumes a lot of power but isn't that useful," said Sch?tz.
"You're tempted to do it because you're on one die, but as soon as you go
off-die, it doesn't work. This is a slice of the future in silicon, and we're
giving to people to play with."

Microsoft, a partner in the SCC project, has built support into its developer
toolchain, according to Sch?tz.

Professor Timothy Roscoe of ETH Zurich, who is developing the experimental
Barrelfish operating system also in conjunction with Microsoft, told ZDNet UK
that the SCC was a particularly good fit for his work.

"Our multikernel architecture has many independent processors with different
attributes, but sharing the same state", he said, "and the SCC looks a dream
platform for testing many of our ideas".

Intel said it will publish full details of the SCC at the International
Solid-State Circuits Conference in San Francisco on 8 February, 2010.


From john.hearns at mclaren.com  Wed Dec  2 10:26:28 2009
From: john.hearns at mclaren.com (Hearns, John)
Date: Wed, 2 Dec 2009 18:26:28 -0000
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>
References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>
Message-ID: <68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com>

I'm a new member to this list, but the research group that I work for has had a working cluster for many years. I am now looking at upgrading our current configuration. 

 
Go for a new cluster any time.  Upgrading is fraught with pitfalls ? keep that old cluster running till it dies, but look at a fresh new machine.

Nehalem is around now, and you can get a lot of power in a single node, not to mention GPUs.

Mixing modern multi-core hardware with an older OS release which worked with those old disk drivers and Ethernet drivers will  be a nightmare.

 
I was wondering if anyone has actual experience with running more than one node from a single power supply. Even just two boards on one PSU would be nice. We will be using barely 200W per node for 50 nodes and it just seems like a big waste to buy 50 power supply units. I have read the old posts but did not see any reports of success.


Look at the Supermicro twin systems, they have two motherboards in 1U or four motherboards in 2U.

I believe HP have similar.                            

Or of course any of the blade chassis ? Supermicro, HP, Sun and dare I say it SGI.

On a smaller scale you could look at the ?personal supercomputers? from Cray and SGI.


The contents of this email are confidential and for the exclusive use of the intended recipient.  If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091202/2fb3ac71/attachment.html>

From landman at scalableinformatics.com  Wed Dec  2 10:30:02 2009
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed, 02 Dec 2009 13:30:02 -0500
Subject: [Beowulf] Re: cluster fails to boot with managed switch,	but
	5-port switch works OK
In-Reply-To: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com>
References: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com>
Message-ID: <4B16B22A.8070603@scalableinformatics.com>

Art Poon wrote:
> Dear colleagues,

[...]

> What's got me and the IT guys stumped is that while the compute nodes
> boot via PXE from the head node without trouble on the NetGear, they
> barf with the SMC.  To be specific, after the initial boot with a
> minimal Linux kernel, there is a "fatal error" with "timeout waiting
> for getfile" when the compute node attempts to download the
> provisioning image from head.  However, when they were running Rocks
> before I arrived, the cluster worked fine with the SMC switch.

Is it the switch of the dhcp/bootp/tftp setup thats the problem?  Are 
you sure the tftp daemon is up, or bootp is configured correctly?

Switches sometimes have broadcast storm suppression turned on, or worse, 
sometimes they have spanning tree turned on.  You want the switch to be 
as dumb as you can possibly make it for most linux clusters.  Fast, but 
dumb.

> I've tried resetting the SMC switch to factory defaults (with
> auto-negotiate on).  I've checked the /etc/beowulf/modprobe.conf and
> it doesn't seem to be demanding anything exotic.  We've tried
> swapping out to another SMC switch but that didn't change anything.

This sounds more on the server software stack than the switch.  Could 
you describe this?  Are you using Scyld/Rocks for that?

Rocks is quite sensitive to configuration issues, and really doesn't 
like altered configurations (it is possible to do, though non-trivial).

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


From john.hearns at mclaren.com  Wed Dec  2 10:36:47 2009
From: john.hearns at mclaren.com (Hearns, John)
Date: Wed, 2 Dec 2009 18:36:47 -0000
Subject: [Beowulf] Re: cluster fails to boot with managed switch,but
	5-port switch works OK
In-Reply-To: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com>
References: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com>
Message-ID: <68A57CCFD4005646957BD2D18E60667B0E75A725@milexchmb1.mil.tagmclarengroup.com>


I've tried resetting the SMC switch to factory defaults (with
auto-negotiate on).  I've checked the /etc/beowulf/modprobe.conf and it
doesn't seem to be demanding anything exotic.  We've tried swapping out
to another SMC switch but that didn't change anything.  

No idea really, as I don't use SMC switches.
First thing I would do would be to get a laptop with Linux on, and
attach it to the SMC switch.
Configure the Ethernet interface to use DHCP, then ifconfig it down then
ifcoinfig it up.
Run tcpdump eth0 as you do this, and tail -f /var/log/messages


If it gets a DHCP address (and of course on the cluster head node there
is a pool of free DHCP addresses configured)
the test tftp file transfer. Set the tftp daemon on the head node to run
in debug mode, and start it up.
Try a tftp get of a test file on the laptop.


The contents of this email are confidential and for the exclusive use of the intended recipient.  If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy.


From james.p.lux at jpl.nasa.gov  Wed Dec  2 10:52:25 2009
From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C))
Date: Wed, 2 Dec 2009 10:52:25 -0800
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>
References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>
Message-ID: <ECE7A93BD093E1439C20020FBE87C47FEB7EF4259A@ALTPHYEMBEVSP20.RES.AD.JPL>

From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Ross Tucker
Sent: Wednesday, November 25, 2009 1:54 PM
To: beowulf at beowulf.org
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster

Greetings!

I'm a new member to this list, but the research group that I work for has had a working cluster for many years. I am now looking at upgrading our current configuration. I was wondering if anyone has actual experience with running more than one node from a single power supply. Even just two boards on one PSU would be nice. We will be using barely 200W per node for 50 nodes and it just seems like a big waste to buy 50 power supply units. I have read the old posts but did not see any reports of success.

Best regards,
Ross Tucker

------


Unless your time is free, it's probably not cost effective.

You'd have to come up with the following things:
 Packaging that accommodates two boards.
 Cabling from the PSU to the board that splits the power to two destinations
 Somehow managing the power on/standby/off controls coming from the mobo to the PSU
 Making sure that any voltage sequencing requirements are met.
 Making sure that with your new cabling, you meet the voltage regulation requirements.

There's also the issue of EMI/EMC compliance, if you are concerned about such things.


Given the low cost of power supplies, particularly in large quantities, and the ability to use commodity (low price) packaging in the traditional 1 PSU per mobo configuration, you'd have to have a really good reason to consider this.

Jim Lux


From dag at sonsorol.org  Wed Dec  2 11:12:13 2009
From: dag at sonsorol.org (Chris Dagdigian)
Date: Wed, 02 Dec 2009 14:12:13 -0500
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <ECE7A93BD093E1439C20020FBE87C47FEB7EF4259A@ALTPHYEMBEVSP20.RES.AD.JPL>
References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>
	<ECE7A93BD093E1439C20020FBE87C47FEB7EF4259A@ALTPHYEMBEVSP20.RES.AD.JPL>
Message-ID: <4B16BC0D.7040003@sonsorol.org>


Not sure if you are looking at DIY or commercial options but this has 
been done well on a commercial scale by at least some integrators.

I've never used them in a cluster but they make great virtualization 
platforms.

This is just one example, the marketing term is "1U Twin Server"

http://www.siliconmechanics.com/c1159/1u-twin-servers.php

Other vendors have various other options and the range of "dedicated" vs 
"shared" resources among the twin and quad server configurations can 
vary quite a bit. There is a good chance someone is already making a box 
with the specs you are interested in.


Regards,
Chris


Lux, Jim (337C) wrote:
> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Ross Tucker
> Sent: Wednesday, November 25, 2009 1:54 PM
> To: beowulf at beowulf.org
> Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
>
> Greetings!
>
> I'm a new member to this list, but the research group that I work for has had a working cluster for many years. I am now looking at upgrading our current configuration. I was wondering if anyone has actual experience with running more than one node from a single power supply. Even just two boards on one PSU would be nice. We will be using barely 200W per node for 50 nodes and it just seems like a big waste to buy 50 power supply units. I have read the old posts but did not see any reports of success.
>
> Best regards,
> Ross Tucker


From mathog at caltech.edu  Wed Dec  2 11:40:09 2009
From: mathog at caltech.edu (David Mathog)
Date: Wed, 02 Dec 2009 11:40:09 -0800
Subject: [Beowulf] Re: cluster fails to boot with managed switch,
	but 5-port switch works OK
Message-ID: <E1NFv3Z-0006RD-Bb@mendel.bio.caltech.edu>

> What's got me and the IT guys stumped is that while the compute nodes
boot via PXE from the head node without trouble on the NetGear, they
barf with the SMC.  To be specific, after the initial boot with a
minimal Linux kernel, there is a "fatal error" with "timeout waiting for
getfile" when the compute node attempts to download the provisioning
image from head.  However, when they were running Rocks before I
arrived, the cluster worked fine with the SMC switch.

Use tcpdump or some equivalent.  Run it once with the dumb switch, once
with the managed one, and then compare and contrast.

> I've tried resetting the SMC switch to factory defaults (with
auto-negotiate on).  I've checked the /etc/beowulf/modprobe.conf and it
doesn't seem to be demanding anything exotic.  We've tried swapping out
to another SMC switch but that didn't change anything.  

Detach from the world at large then turn off the firewall on the master.
(Probably not it this time, but whenever there are network problems
always rule out the firewall before spending time on anything else.)

Ipv6 vs. Ipv4?  By which I mean, once the kernel boots, perhaps it goes
to ipv6, which the netgear handles properly, but maybe that is turned
off on the SMC?

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From mwill at penguincomputing.com  Wed Dec  2 11:48:17 2009
From: mwill at penguincomputing.com (Michael Will)
Date: Wed, 02 Dec 2009 11:48:17 -0800
Subject: [Beowulf] Re: cluster fails to boot with managed switch,
	but5-port switch works OK
Message-ID: <00cc01ca7388$68bf5259$3504650a@penguincomputing.com>

I don't know anything about smc switches, but for cisco switches I had to enable 'spanning-tree portfast default' before to allow a pxe booting node to stay up. Maybe the smc switch has something similar that prevents the port from being fully useable until some spanning tree algorithm terminates? 

Cheers, Michael

----- Original Message -----
From:"Hearns, John" <john.hearns at mclaren.com>
To:"beowulf at beowulf.org" <beowulf at beowulf.org>
Sent:12/2/2009 10:38 AM
Subject:RE: [Beowulf] Re: cluster fails to boot with managed switch,but5-port switch works OK


I've tried resetting the SMC switch to factory defaults (with
auto-negotiate on).  I've checked the /etc/beowulf/modprobe.conf and it
doesn't seem to be demanding anything exotic.  We've tried swapping out
to another SMC switch but that didn't change anything.  

No idea really, as I don't use SMC switches.
First thing I would do would be to get a laptop with Linux on, and
attach it to the SMC switch.
Configure the Ethernet interface to use DHCP, then ifconfig it down then
ifcoinfig it up.
Run tcpdump eth0 as you do this, and tail -f /var/log/messages


If it gets a DHCP address (and of course on the cluster head node there
is a pool of free DHCP addresses configured)
the test tftp file transfer. Set the tftp daemon on the head node to run
in debug mode, and start it up.
Try a tftp get of a test file on the laptop.


The contents of this email are confidential and for the exclusive use of the intended recipient.  If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy.


_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


From landman at scalableinformatics.com  Wed Dec  2 11:58:27 2009
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed, 02 Dec 2009 14:58:27 -0500
Subject: [Beowulf] Re: cluster fails to boot with managed switch,	but
	5-port switch works OK
In-Reply-To: <E1NFv3Z-0006RD-Bb@mendel.bio.caltech.edu>
References: <E1NFv3Z-0006RD-Bb@mendel.bio.caltech.edu>
Message-ID: <4B16C6E3.3060808@scalableinformatics.com>

David Mathog wrote:
>> What's got me and the IT guys stumped is that while the compute nodes
> boot via PXE from the head node without trouble on the NetGear, they
> barf with the SMC.  To be specific, after the initial boot with a
> minimal Linux kernel, there is a "fatal error" with "timeout waiting for
> getfile" when the compute node attempts to download the provisioning
> image from head.  However, when they were running Rocks before I
> arrived, the cluster worked fine with the SMC switch.

Wondering aloud whether or not the ethernet driver has been correctly 
included in the kernel/initrd for the PXE booted image.  I've 
seen/experienced this before, PXE works fine, the kernel boots, and is 
missing the ethernet driver.

Usually happens with newer hardware and older kernels.


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


From james.p.lux at jpl.nasa.gov  Wed Dec  2 12:02:03 2009
From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C))
Date: Wed, 2 Dec 2009 12:02:03 -0800
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <4B16BC0D.7040003@sonsorol.org>
References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>
	<ECE7A93BD093E1439C20020FBE87C47FEB7EF4259A@ALTPHYEMBEVSP20.RES.AD.JPL>
	<4B16BC0D.7040003@sonsorol.org>
Message-ID: <ECE7A93BD093E1439C20020FBE87C47FEB7EF425BA@ALTPHYEMBEVSP20.RES.AD.JPL>

> -----Original Message-----
> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Chris Dagdigian
> Sent: Wednesday, December 02, 2009 11:12 AM
> Cc: beowulf at beowulf.org; Ross Tucker
> Subject: Re: [Beowulf] New member, upgrading our existing Beowulf cluster
> 
> 
> Not sure if you are looking at DIY or commercial options but this has
> been done well on a commercial scale by at least some integrators.
> 
> I've never used them in a cluster but they make great virtualization
> platforms.
> 
> This is just one example, the marketing term is "1U Twin Server"
> 
> http://www.siliconmechanics.com/c1159/1u-twin-servers.php
> 
> Other vendors have various other options and the range of "dedicated" vs
> "shared" resources among the twin and quad server configurations can
> vary quite a bit. There is a good chance someone is already making a box
> with the specs you are interested in.
> 

I note that those chassis seem to have a PSU specifically designed for driving two mobos, and are rated fairly high (980W for the first one on the page referenced).

I took Ross's original request to be one of using existing power supplies and adding a mobo (essentially the DIY option).


From cap at nsc.liu.se  Wed Dec  2 12:28:07 2009
From: cap at nsc.liu.se (Peter Kjellstrom)
Date: Wed, 2 Dec 2009 21:28:07 +0100
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com>
References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>
	<68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com>
Message-ID: <200912022128.07314.cap@nsc.liu.se>

On Wednesday 02 December 2009, Hearns, John wrote:
> I'm a new member to this list, but the research group that I work for has
> had a working cluster for many years. I am now looking at upgrading our
> current configuration.
...
> Mixing modern multi-core hardware with an older OS release which worked
> with those old disk drivers and Ethernet drivers will  be a nightmare.

But why run an older OS release? Something like CentOS-5.latest will run fine 
on your new hardware and it's no problem getting all sorts of old HPC code to 
run on it (disclaimer: of course you can find a zillion apps that break on 
any given OS...).

> I was wondering if anyone has actual experience with running more than one
> node from a single power supply.
...
> Look at the Supermicro twin systems, they have two motherboards in 1U or
> four motherboards in 2U.
>
> I believe HP have similar.

They have 4-nodes in 2U (it has the added benefint of using large 8cm fans 
instead of those inefficient 1U fans...). Supermicro also has a 4-nodes in 
2U.

> Or of course any of the blade chassis ? Supermicro, HP, Sun and dare I say
> it SGI.

We've typically found that blade chassi type hardware is far from cost 
effective for HPC, but YMMV.

> On a smaller scale you could look at the ?personal supercomputers? from
> Cray and SGI.

Even less cost effective (I think).

> The contents of this email are confidential and for the exclusive use of
> the intended recipient...

Good job sending it to a public e-mail list then.

> If you receive this email in error you should not 
> copy it, retransmit it, use it or disclose its contents but should return
> it to the sender immediately and delete your copy.

/Peter
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part.
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091202/a39cf63d/attachment.sig>

From bill at cse.ucdavis.edu  Wed Dec  2 12:36:17 2009
From: bill at cse.ucdavis.edu (Bill Broadley)
Date: Wed, 02 Dec 2009 12:36:17 -0800
Subject: [Beowulf] Re: cluster fails to boot with managed switch,	but
	5-port switch works OK
In-Reply-To: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com>
References: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com>
Message-ID: <4B16CFC1.8060603@cse.ucdavis.edu>

Art Poon wrote:
> I've tried resetting the SMC switch to factory defaults (with
> auto-negotiate on).  I've checked the /etc/beowulf/modprobe.conf and it
> doesn't seem to be demanding anything exotic.  We've tried swapping out to
> another SMC switch but that didn't change anything.

I had a very unpleasant experience with an SMC switch awhile back.  I was
having problems trying to bootstrap a rocks cluster.  Turns out the SMC (and
Dell relabel) was so evil that it warranted a mention in the Rocks FAQ.

I believe the solution was to manually turn on edge node routing or similar on
each port.  Unfortunately there was a bug and you could only turn on the first
16 ports.  There was a fix with new firmware, but there were 2 firmware images
and you couldn't tell which from looking at the switch.  Said firmware upgrade
caused other problems.

Eventually it worked well enough.

I've used quite a variety of switches without problem, I was shocked that a
default switch config wouldn't work with DHCP and PXEboot.

> 
> I'm grateful if you could weigh in with your expertise.
> 
> Thank you,
> - Art.
> 
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


From rpnabar at gmail.com  Wed Dec  2 13:42:04 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Wed, 2 Dec 2009 15:42:04 -0600
Subject: [Beowulf] Re: cluster fails to boot with managed switch, 
	but5-port switch works OK
In-Reply-To: <00cc01ca7388$68bf5259$3504650a@penguincomputing.com>
References: <00cc01ca7388$68bf5259$3504650a@penguincomputing.com>
Message-ID: <c4d69730912021342j1f689a5dg27cba85515c0f548@mail.gmail.com>

On Wed, Dec 2, 2009 at 1:48 PM, Michael Will <mwill at penguincomputing.com> wrote:
> I don't know anything about smc switches, but for cisco switches I had to enable 'spanning-tree portfast default' before to allow a >pxe booting node to stay up. Maybe the smc switch has something similar that prevents the port from being fully useable until some >spanning tree algorithm terminates?

+1 for the spanning tree suggestion. I needed to do the same on my
Dell Catalyst.

Check if "DHCP squash" on the port connected to the master node is
enabled. This can prevent DHCP. Just a thought.

-- 
Rahul


From hearnsj at googlemail.com  Wed Dec  2 14:13:34 2009
From: hearnsj at googlemail.com (John Hearns)
Date: Wed, 2 Dec 2009 22:13:34 +0000
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <200912022128.07314.cap@nsc.liu.se>
References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>
	<68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com>
	<200912022128.07314.cap@nsc.liu.se>
Message-ID: <9f8092cc0912021413h3b61b827peed73cf99cf39ec9@mail.gmail.com>

2009/12/2 Peter Kjellstrom <cap at nsc.liu.se>:
>
> Good job sending it to a public e-mail list then.
>
You know fine well that such disclaimers are inserted by corporate
email servers.
Keep your sarcasm to yourself.


From mwill at penguincomputing.com  Wed Dec  2 14:14:54 2009
From: mwill at penguincomputing.com (Michael Will)
Date: Wed, 02 Dec 2009 14:14:54 -0800
Subject: [Beowulf] Re: cluster fails to boot with managed switch,
	but5-port switch works OK
Message-ID: <00da01ca739c$e443c877$3504650a@penguincomputing.com>

I got another one for you from penguins support team: enable port forward

Sent from Moxier Mail
(http://www.moxier.com)


----- Original Message -----
From:"Rahul Nabar" <rpnabar at gmail.com>
To:"Michael Will" <mwill at penguincomputing.com>
Cc:"Hearns, John" <john.hearns at mclaren.com>, "beowulf at beowulf.org" <beowulf at beowulf.org>
Sent:12/2/2009 1:42 PM
Subject:Re: [Beowulf] Re: cluster fails to boot with managed switch, but5-port switch works OK


On Wed, Dec 2, 2009 at 1:48 PM, Michael Will <mwill at penguincomputing.com> wrote:
> I don't know anything about smc switches, but for cisco switches I had to enable 'spanning-tree portfast default' before to allow a >pxe booting node to stay up. Maybe the smc switch has something similar that prevents the port from being fully useable until some >spanning tree algorithm terminates?

+1 for the spanning tree suggestion. I needed to do the same on my
Dell Catalyst.

Check if "DHCP squash" on the port connected to the master node is
enabled. This can prevent DHCP. Just a thought.

-- 
Rahul


From csamuel at vpac.org  Wed Dec  2 14:15:07 2009
From: csamuel at vpac.org (Chris Samuel)
Date: Thu, 3 Dec 2009 09:15:07 +1100 (EST)
Subject: [Beowulf] Forwarded from a long time reader having trouble posting
In-Reply-To: <alpine.LRH.2.00.0912021308440.8086@hogwarts.egr.duke.edu>
Message-ID: <1122079789.7251391259792107624.JavaMail.root@mail.vpac.org>


----- "Joshua Baker-LePain" <jlb17 at duke.edu> wrote:

> What Red Hat is *not* shipping by default are any
> of the filesystem utilities, so you can't, e.g., 
> actually mkfs an XFS filesystem.  But you can get
> the xfsprogs RPM from the CentOS extras repo and
> that should work just fine.

Ah yes, that was what it was explained to me as!

Mea culpa, I blame the Beowulf bash.. ;-)

-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


From lindahl at pbm.com  Wed Dec  2 14:29:35 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Wed, 2 Dec 2009 14:29:35 -0800
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <9f8092cc0912021413h3b61b827peed73cf99cf39ec9@mail.gmail.com>
References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>
	<68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com>
	<200912022128.07314.cap@nsc.liu.se>
	<9f8092cc0912021413h3b61b827peed73cf99cf39ec9@mail.gmail.com>
Message-ID: <20091202222935.GF12204@bx9.net>

On Wed, Dec 02, 2009 at 10:13:34PM +0000, John Hearns wrote:

> You know fine well that such disclaimers are inserted by corporate
> email servers.

Actually, I had no idea, probably a lot of other people don't either.

Can't you work for a company that doesn't have disclaimers? ;-)

-- greg


From csamuel at vpac.org  Wed Dec  2 14:36:14 2009
From: csamuel at vpac.org (Chris Samuel)
Date: Thu, 3 Dec 2009 09:36:14 +1100 (EST)
Subject: [Beowulf] Forwarded from a long time reader having trouble posting
In-Reply-To: <d5bdff000912020349ka3830cfl679dead4a04f1581@mail.gmail.com>
Message-ID: <1384774161.7252631259793374343.JavaMail.root@mail.vpac.org>


----- "Toon Knapen" <toon.knapen at gmail.com> wrote:

> And why do you prefer xfs if I may ask. Performance?

For us, yes, plus the fact that ext3 is (maybe was, but
not from what I've heard) single threaded through the
journal daemon so if you get a lot of writers (say NFS
daemons for instance) you can get horribly backlogged
and end up with horrendous load averages.

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


From james.p.lux at jpl.nasa.gov  Wed Dec  2 15:19:08 2009
From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C))
Date: Wed, 2 Dec 2009 15:19:08 -0800
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <20091202222935.GF12204@bx9.net>
References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>
	<68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com>
	<200912022128.07314.cap@nsc.liu.se>
	<9f8092cc0912021413h3b61b827peed73cf99cf39ec9@mail.gmail.com>
	<20091202222935.GF12204@bx9.net>
Message-ID: <ECE7A93BD093E1439C20020FBE87C47FEB7EF42600@ALTPHYEMBEVSP20.RES.AD.JPL>


> -----Original Message-----
> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Greg Lindahl
> Sent: Wednesday, December 02, 2009 2:30 PM
> To: beowulf at beowulf.org
> Subject: Re: [Beowulf] New member, upgrading our existing Beowulf cluster
> 
> On Wed, Dec 02, 2009 at 10:13:34PM +0000, John Hearns wrote:
> 
> > You know fine well that such disclaimers are inserted by corporate
> > email servers.
> 
> Actually, I had no idea, probably a lot of other people don't either.
> 
> Can't you work for a company that doesn't have disclaimers? ;-)
> 
> -- greg

Surely you jest...
Of course.. but the reality is that a lot of folks work for places that have multiple stakeholders of one sort or another who want *different* disclaimers, notwithstanding that the disclaimer is of dubious legal value.  Having had entirely too many discussions and training on this, here's some not so obvious observations..
(obligatory IANAL)

"This system is private and subject to monitoring. Do not use if not authorized"... that one comes out as a startup or login banner a lot, and you think... doh, you're already connected, is this going to scare you off?   Nope.. that's not why it's there.. it's to provide a legal basis for prosecution and collection of the data.  If you don't warn that you monitor, then log files, etc, might not be admissible as evidence. If you don't say that "hey this isn't open to the public", then a defense of "I didn't know" can be raised.

The "Information is confidential. If not addressed to you destroy it and notify" one is in the same sort of classification..but this one is for trade secrets. If you didn't identify it, then you can't claim it's a trade secret. And, if you haven't put the "if its not for you, don't look", then an inadvertent disclosure could be (possibly) legally copied and passed on. It's also, to a lesser degree, protection for the recipient...There's been more than one case of someone in Silicon Valley getting proprietary info by mistake (oops, there's two John Smiths.. or We put Joe's info in John's envelope and vice versa).. Company A that "lost" the info then sues company B employing the unwitting recipient to enjoin that recipient from working on anything that might be competitive. If the employee happens to be the key toiler on Company B's product that's going to whip Company A in the market, you can see there is a problem.  In fact, the mere possibility (not even threat) of this kind of thing can be more effective than a non-compete agreement( which would be legally unenforceable in California in most cases) 

But that's just civil stuff...Now we get to the exciting one...

"The information in this email may be subject to export controls"  Yep.. that's the one that warns you that now that it's in your hot little hands, you assume the responsibility and potential prison term if you transmit it to someone you shouldn't.  Doesn't matter that the originator might have been stupid and shouldn't have sent it, now it's your little baby to worry about.  Mind you, I find this kind of amusing when below things like birthday party invitations or, one of the first times I saw it, printed on the bottom of the packing slip for a tube of 74LS138s back in 1979.  Yep.. the evil doers gain a strategic advantage by knowing that I've got those 3-8 decoders in my garage, corroding away in their tube, just in case I need them 30 years later.  (I'll bet there's more than one person on this list with parts in their house or desk drawer, not soldered into something, with date codes beginning with a 6 or 7, or even a 3 digit code)


Jim


From hahn at mcmaster.ca  Wed Dec  2 16:02:13 2009
From: hahn at mcmaster.ca (Mark Hahn)
Date: Wed, 2 Dec 2009 19:02:13 -0500 (EST)
Subject: [Beowulf] Intel shows 48-core 'datacentre on a chip' 
In-Reply-To: <20091202182243.GG17686@leitl.org>
References: <20091202182243.GG17686@leitl.org>
Message-ID: <Pine.LNX.4.64.0912021857040.27412@coffee.psychology.mcmaster.ca>

> SCC combines 24 dual-core processing elements, each with its own router, four
> DDR3 memory controllers capable of handling up to 8GB apiece, and a very fast

sounds a fair amount like larrabee to me.

> you needed your own datacentre. Now, you just need your own chip," said

that has to be one of the most asinine things I've heard recently.

> The on-chip network is configured as a 6x4 node, two-dimensional mesh. It has

hmm, larrabee rumors indicate a dual-ring bus, not 2d mesh ala tilera.

non-coherent cores sharing a small number of dram interfaces also sounds
like an interesting trick.  it implies that at some level, there's a table
controlling which cores see which chunks of memory...


From csamuel at vpac.org  Wed Dec  2 18:35:01 2009
From: csamuel at vpac.org (Chris Samuel)
Date: Thu, 3 Dec 2009 13:35:01 +1100 (EST)
Subject: [Beowulf] Re: cluster fails to boot with managed switch, but
	5-port switch works OK
In-Reply-To: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com>
Message-ID: <1558552513.7267981259807701826.JavaMail.root@mail.vpac.org>


----- "Art Poon" <artpoon at gmail.com> wrote:

> To be specific, after the initial boot with a
> minimal Linux kernel, there is a "fatal error"
> with "timeout waiting for getfile" when the
> compute node attempts to download the provisioning
> image from head.

I've seen similar issues with Cisco switches in IBM
Cluster 1350 systems where the switch was in its default
configuration.

The fix was to configure each port pointing to a
compute node as an "edge port" to suppress the
switches instinct to (IIRC) try and snoop for
spanning tree information when bringing the port
up as that meant that the vital packets were
being dropped.

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


From ashley at pittman.co.uk  Thu Dec  3 01:20:58 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Thu, 03 Dec 2009 09:20:58 +0000
Subject: [Beowulf] Re: cluster fails to boot with managed switch,	but
	5-port switch works OK
In-Reply-To: <4B16C6E3.3060808@scalableinformatics.com>
References: <E1NFv3Z-0006RD-Bb@mendel.bio.caltech.edu>
	<4B16C6E3.3060808@scalableinformatics.com>
Message-ID: <1259832058.6352.60.camel@alpha>

On Wed, 2009-12-02 at 14:58 -0500, Joe Landman wrote:
> David Mathog wrote:
> >> What's got me and the IT guys stumped is that while the compute nodes
> > boot via PXE from the head node without trouble on the NetGear, they
> > barf with the SMC.  To be specific, after the initial boot with a
> > minimal Linux kernel, there is a "fatal error" with "timeout waiting for
> > getfile" when the compute node attempts to download the provisioning
> > image from head.  However, when they were running Rocks before I
> > arrived, the cluster worked fine with the SMC switch.
> 
> Wondering aloud whether or not the ethernet driver has been correctly 
> included in the kernel/initrd for the PXE booted image.  I've 
> seen/experienced this before, PXE works fine, the kernel boots, and is 
> missing the ethernet driver.

Or the new distro you are trying enumerates the ethernet devices
differently and it's trying to load the getfile from a different
unconnected ethernet port.  That's fairly common as well.  It could even
be worse that than in that the enumeration could be non-deterministic to
really confuse you.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk


From prentice at ias.edu  Thu Dec  3 07:25:49 2009
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu, 03 Dec 2009 10:25:49 -0500
Subject: [Beowulf] Cluster Users in Clusters Linux and Windows
In-Reply-To: <4788ffe70911261015t2817fcd4i55044d692b1aed64@mail.gmail.com>
References: <4788ffe70911261015t2817fcd4i55044d692b1aed64@mail.gmail.com>
Message-ID: <4B17D87D.1040608@ias.edu>


Leonardo Machado Moreira wrote:
> Hi!
> 
> I am trying to create a cluster with only two machines.
> 
> The server will be a Linux machine, an Arch Linux distribution to be
> more specific. The slave machine will be a Windows 7 machine.

While may be technically possible, why would you want to do this? If
this is for your work, computers are so cheap that from a business point
of view, it makes more sense to just by an additional computer and put
Arch Linux on it than to waste your time trying to make this work. If
you in it the technical challenge and sense of adventure, then good for
you, and good luck.

> I have found it is possible, but I was looking and have found that each
> machine on the cluster must have the same user for the cluster.

I would recommend having both systems use the same LDAP server or Active
Directory (AD) server. I have made Linux systems use AD for
LDAP/Kerberos servers. It's not that hard, but you need AD to support
Posix/Unix attributes like shell, home directory, GECOS field.

Most new versions of AD have this built in, earlier versions require an
additional package Microsoft Services for Unix (also know as SFU or
msSFU, or MSSFU) that can be downloaded from Microsoft.

I wouldn'd try this unless you are very well-versed in LDAP and Kerberos
administration.

On RH-based Linux distros, /etc/ldap.conf should already has the
necessary configuration for SFU in it, you just need to uncomment it.

I never set up windows systems as LDAP clients.

> 
> I was wondering how would I deal with it with the windows machine ?

You'd have to have a windows-specific binary, for one.

> 
> Do I have do implement a specific program in it? Would it found the rsh ?

See above. SFU might come with a Wibndows implementation, but you'd have
problems with the fact the programs might have different names, and
would probably have different paths. That could confuse the queuing
system (if your using one), and/or your MPI implementation.
> 
> Thanks in advance!
> 
> Leonardo Machado Moreira


From prentice at ias.edu  Thu Dec  3 07:40:12 2009
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu, 03 Dec 2009 10:40:12 -0500
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com>
References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>
	<68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com>
Message-ID: <4B17DBDC.5060106@ias.edu>

Hearns, John wrote:
> 
> I was wondering if anyone has actual experience with running more than
> one node from a single power supply. Even just two boards on one PSU
> would be nice. We will be using barely 200W per node for 50 nodes and it
> just seems like a big waste to buy 50 power supply units. I have read
> the old posts but did not see any reports of success.
> 
> Look at the Supermicro twin systems, they have two motherboards in 1U or
> four motherboards in 2U.
> 
> I believe HP have similar.      

What I learned at SC09:

HP does make twin nodes similar to SuperMicro, but the HP nodes are not
hot-swappable, if a single node goes down, you need to take down all the
nodes in the chassis before you can remove the dead node. Not very
practical. The SuperMicro nodes are definitely hot-swappable.

> 
> Or of course any of the blade chassis ? Supermicro, HP, Sun and dare I
> say it SGI.
> 
> On a smaller scale you could look at the ?personal supercomputers? from
> Cray and SGI.
> 

The one problem with most blade chassis is that they are still
relatively expensive. There's a lot of technology in the backplanes
(ethernet, IB, KVM, etc), making them expensive.

At SC09, I saw that Appro has a "dumb" blade chassis - the chassis only
provides power, and the networking, KVM access, etc, are accessed from
the front of the blade. The logic being this makes the chassis/blades
cheaper since there's less custom (costly) technology. I didn't get any
pricing information, so I don't know if the cost savings is real or just
a marketing claim.

--
Prentice


From lynesh at Cardiff.ac.uk  Thu Dec  3 07:56:57 2009
From: lynesh at Cardiff.ac.uk (Huw Lynes)
Date: Thu, 03 Dec 2009 15:56:57 +0000
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <4B17DBDC.5060106@ias.edu>
References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>
	<68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com>
	<4B17DBDC.5060106@ias.edu>
Message-ID: <1259855817.6329.3.camel@w609.insrv.cf.ac.uk>

On Thu, 2009-12-03 at 10:40 -0500, Prentice Bisbal wrote:
> Hearns, John wrote:
> > 
> > I was wondering if anyone has actual experience with running more than
> > one node from a single power supply. Even just two boards on one PSU
> > would be nice. We will be using barely 200W per node for 50 nodes and it
> > just seems like a big waste to buy 50 power supply units. I have read
> > the old posts but did not see any reports of success.
> > 
> > Look at the Supermicro twin systems, they have two motherboards in 1U or
> > four motherboards in 2U.
> > 
> > I believe HP have similar.      
> 
> What I learned at SC09:
> 
> HP does make twin nodes similar to SuperMicro, but the HP nodes are not
> hot-swappable, if a single node goes down, you need to take down all the
> nodes in the chassis before you can remove the dead node. Not very
> practical. The SuperMicro nodes are definitely hot-swappable.
> 

The Supermicro 2U twin nodes with 4 mobos in each box are hot-swappable.
The 1U twin nodes with 2 mobos in each box are not.

Yes I don't understand why the 2U box isn't called "quad" either.

Cheers,
Huw

-- 
Huw Lynes                       | Advanced Research Computing
HEC Sysadmin                    | Cardiff University
                                | Redwood Building, 
Tel: +44 (0) 29208 70626        | King Edward VII Avenue, CF10 3NB


From jeff.johnson at aeoncomputing.com  Wed Dec  2 10:34:20 2009
From: jeff.johnson at aeoncomputing.com (Jeff Johnson)
Date: Wed, 02 Dec 2009 10:34:20 -0800
Subject: [Beowulf] Re: Beowulf Digest, Vol 70, Issue 4
In-Reply-To: <200912021821.nB2ILBIK030413@bluewest.scyld.com>
References: <200912021821.nB2ILBIK030413@bluewest.scyld.com>
Message-ID: <4B16B32C.60703@aeoncomputing.com>

On 12/2/09 10:21 AM, beowulf-request at beowulf.org wrote:
> ------------------------------
>
> Message: 8
> Date: Tue, 1 Dec 2009 12:45:52 -0800
> From: Art Poon<artpoon at gmail.com>
> Subject: [Beowulf] Re: cluster fails to boot with managed switch,	but
> 	5-port switch works OK
> To:beowulf at beowulf.org
> Message-ID:<825EEAB3-C58F-46B8-A9C4-A806C5B682D3 at gmail.com>
> Content-Type: text/plain; charset=us-ascii
>
> Dear colleagues,
>
> [snip]
>
> What's got me and the IT guys stumped is that while the compute nodes boot via PXE from the head node without trouble on the NetGear, they barf with the SMC.  To be specific, after the initial boot with a minimal Linux kernel, there is a "fatal error" with "timeout waiting for getfile" when the compute node attempts to download the provisioning image from head.  However, when they were running Rocks before I arrived, the cluster worked fine with the SMC switch.
>
> I've tried resetting the SMC switch to factory defaults (with auto-negotiate on).  I've checked the /etc/beowulf/modprobe.conf and it doesn't seem to be demanding anything exotic.  We've tried swapping out to another SMC switch but that didn't change anything.
>
> I'm grateful if you could weigh in with your expertise.
>    
I don't know if my $.02 here could be classified as 'expertise'. With 
that disclaimer out of the way I can say that SMC switches do have a 
tendency to have very old firmware when they are stocked in warehouses 
and they are not often updated. Their update process is a PITA compared 
to other switches out there. I have seen cases where their old firmware 
and STP (spanning tree protocol) causes enough delay when a port comes 
up on the switch for the first time in a pxe/dhcp operation that the 
process times out while the switch is trying to figure out if there are 
network loops. The firmware update can be obtained from www.smc.com and 
is at v2.3.0.0 updated in March. Check your switch to see where you are 
at now.

The Netgear switches are layer-2 and too dumb to cause problems.
> Thank you,
> - Art.
>
>
>
>
> ------------------------------
>
>    

-- 
------------------------------
Jeff Johnson
Manager
Aeon Computing

jeff.johnson at aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810   f: 858-412-3845

4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117


From artpoon at gmail.com  Wed Dec  2 10:42:28 2009
From: artpoon at gmail.com (Art Poon)
Date: Wed, 2 Dec 2009 10:42:28 -0800
Subject: [Beowulf] Re: cluster fails to boot with managed switch,
	but 5-port switch works OK
In-Reply-To: <4B16B22A.8070603@scalableinformatics.com>
References: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com>
	<4B16B22A.8070603@scalableinformatics.com>
Message-ID: <9AB77A0E-3092-4395-A299-410B8C97C095@gmail.com>

Hi all,

Thanks for your responses!  I finally fixed this yesterday afternoon but neglected to update my post, my apologies.
  
After discussing our problem to the Penguin Computing service rep, I reconfigured the switch to enable fast spanning-tree mode for compute node ports.  That apparently fixed the problem and thanks to your feedback I am starting to understand why.

Thanks again,
- Art.

On Dec 2, 2009, at 10:30 AM, Joe Landman wrote:

> Art Poon wrote:
>> Dear colleagues,
> 
> [...]
> 
>> What's got me and the IT guys stumped is that while the compute nodes
>> boot via PXE from the head node without trouble on the NetGear, they
>> barf with the SMC.  To be specific, after the initial boot with a
>> minimal Linux kernel, there is a "fatal error" with "timeout waiting
>> for getfile" when the compute node attempts to download the
>> provisioning image from head.  However, when they were running Rocks
>> before I arrived, the cluster worked fine with the SMC switch.
> 
> Is it the switch of the dhcp/bootp/tftp setup thats the problem?  Are you sure the tftp daemon is up, or bootp is configured correctly?
> 
> Switches sometimes have broadcast storm suppression turned on, or worse, sometimes they have spanning tree turned on.  You want the switch to be as dumb as you can possibly make it for most linux clusters.  Fast, but dumb.
> 
>> I've tried resetting the SMC switch to factory defaults (with
>> auto-negotiate on).  I've checked the /etc/beowulf/modprobe.conf and
>> it doesn't seem to be demanding anything exotic.  We've tried
>> swapping out to another SMC switch but that didn't change anything.
> 
> This sounds more on the server software stack than the switch.  Could you describe this?  Are you using Scyld/Rocks for that?
> 
> Rocks is quite sensitive to configuration issues, and really doesn't like altered configurations (it is possible to do, though non-trivial).
> 
> -- 
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics Inc.
> email: landman at scalableinformatics.com
> web  : http://scalableinformatics.com
>       http://scalableinformatics.com/jackrabbit
> phone: +1 734 786 8423 x121
> fax  : +1 866 888 3112
> cell : +1 734 612 4615


From vallard at benincosa.com  Wed Dec  2 10:59:09 2009
From: vallard at benincosa.com (Vallard Benincosa)
Date: Wed, 2 Dec 2009 10:59:09 -0800
Subject: [Beowulf] Re: cluster fails to boot with managed switch,but 
	5-port switch works OK
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0E75A725@milexchmb1.mil.tagmclarengroup.com>
References: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com>
	<68A57CCFD4005646957BD2D18E60667B0E75A725@milexchmb1.mil.tagmclarengroup.com>
Message-ID: <f7572b050912021059t20b912f3m325ee377a0a05297@mail.gmail.com>

Spanning Trees are usually the biggest culprit in pxe booting and switch
problems.  For SMC switches I think you just need to do something like:

telnet smc
enable
config
interface ethernet 1/1  # if this is the interface your client node is on
spanning-tree edge-port
exit
copy run start


On Wed, Dec 2, 2009 at 10:36 AM, Hearns, John <john.hearns at mclaren.com>wrote:

>
> I've tried resetting the SMC switch to factory defaults (with
> auto-negotiate on).  I've checked the /etc/beowulf/modprobe.conf and it
> doesn't seem to be demanding anything exotic.  We've tried swapping out
> to another SMC switch but that didn't change anything.
>
> No idea really, as I don't use SMC switches.
> First thing I would do would be to get a laptop with Linux on, and
> attach it to the SMC switch.
> Configure the Ethernet interface to use DHCP, then ifconfig it down then
> ifcoinfig it up.
> Run tcpdump eth0 as you do this, and tail -f /var/log/messages
>
>
> If it gets a DHCP address (and of course on the cluster head node there
> is a pool of free DHCP addresses configured)
> the test tftp file transfer. Set the tftp daemon on the head node to run
> in debug mode, and start it up.
> Try a tftp get of a test file on the laptop.
>
>
> The contents of this email are confidential and for the exclusive use of
> the intended recipient.  If you receive this email in error you should not
> copy it, retransmit it, use it or disclose its contents but should return it
> to the sender immediately and delete your copy.
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091202/c5cb3192/attachment.html>

From crhea at mayo.edu  Wed Dec  2 11:14:29 2009
From: crhea at mayo.edu (Cris Rhea)
Date: Wed, 2 Dec 2009 13:14:29 -0600
Subject: [Beowulf] Re: Booting nodes with PXE...
In-Reply-To: <200912021853.nB2IrssM031377@bluewest.scyld.com>
References: <200912021853.nB2IrssM031377@bluewest.scyld.com>
Message-ID: <20091202191429.GA6361@kaizen.mayo.edu>

> > What's got me and the IT guys stumped is that while the compute nodes
> > boot via PXE from the head node without trouble on the NetGear, they
> > barf with the SMC.  To be specific, after the initial boot with a
> > minimal Linux kernel, there is a "fatal error" with "timeout waiting
> > for getfile" when the compute node attempts to download the
> > provisioning image from head.  However, when they were running Rocks
> > before I arrived, the cluster worked fine with the SMC switch.
> 
> Switches sometimes have broadcast storm suppression turned on, or worse, 
> sometimes they have spanning tree turned on.  You want the switch to be 
> as dumb as you can possibly make it for most linux clusters.  Fast, but 
> dumb.

As some have already commented, I'm assuming you have tested each service
(DHCP, tftp, etc.).

My bet is on "spanning tree", as mentioned above. Watch the Ethernet lights 
on the node when booting and see if the port comes alive/stable before 
you get the timeout. I've seen this in spades if "spanning tree portfast" isn't 
set on Cisco switches-- just takes too long to negotiate the GbE interface.


--- Cris

-- 
 Cristopher J. Rhea                     
 Mayo Clinic - Research Computing Facility
 200 First St SW, Rochester, MN 55905
 crhea at Mayo.EDU
 (507) 284-0587


From mclewis at ucdavis.edu  Wed Dec  2 13:06:43 2009
From: mclewis at ucdavis.edu (Michael Lewis)
Date: Wed, 2 Dec 2009 13:06:43 -0800
Subject: [Beowulf] Re: cluster fails to boot with managed switch,
	but 5-port switch works OK
In-Reply-To: <4B16CFC1.8060603@cse.ucdavis.edu>
References: <825EEAB3-C58F-46B8-A9C4-A806C5B682D3@gmail.com>
	<4B16CFC1.8060603@cse.ucdavis.edu>
Message-ID: <20091202210643.GX12440@durian.genomecenter.ucdavis.edu>

On Wed, Dec 02, 2009 at 12:36:17PM -0800, Bill Broadley wrote:
> Art Poon wrote:
> > I've tried resetting the SMC switch to factory defaults (with
> > auto-negotiate on).  I've checked the /etc/beowulf/modprobe.conf and it
> > doesn't seem to be demanding anything exotic.  We've tried swapping out to
> > another SMC switch but that didn't change anything.
> 
> I had a very unpleasant experience with an SMC switch awhile back.  I was
> having problems trying to bootstrap a rocks cluster.  Turns out the SMC (and
> Dell relabel) was so evil that it warranted a mention in the Rocks FAQ.

I run the cluster that Bill is describing here.  Indeed, the default
configuration of the SMC switches was to have spanning-tree turned on on all
ports.  The symptom we had was that the machines would PXEboot fine and load
a kernel, but then fail to DHCP later.  Even worse, the switches would 
occasionally revert back to this setting if they lost power.  

Also, as Bill notes, there is a Dell rebrand of the same switch, which runs
slightly different firmware.  If you've got one of those, get the firmware
from the Dell site, not from SMC.

> I believe the solution was to manually turn on edge node routing or similar on
> each port.  Unfortunately there was a bug and you could only turn on the first
> 16 ports.  There was a fix with new firmware, but there were 2 firmware images
> and you couldn't tell which from looking at the switch.  Said firmware upgrade
> caused other problems.

Here was the fix we used.  For each port (replace 1/5 with 1/N for N=1..):

        Console#config
        Console(config)#interface ethernet 1/5
        Console(config-if)#spanning-tree edge-port

And don't forget to write back to flash when you're done.

After the firmware updates, we haven't had the issue of the configuration
resetting anymore, but we've also upgraded the cluster to a better switch.
The SMCs now run other non-cluster servers.

-- 
Michael Lewis                       |   mclewis at ucdavis.edu
Systems Administrator               |   Voice: (530) 754-7978
Genome Center                       |  
University of California, Davis     |   


From christiansuhendra at gmail.com  Wed Dec  2 22:16:00 2009
From: christiansuhendra at gmail.com (christian suhendra)
Date: Thu, 3 Dec 2009 14:16:00 +0800
Subject: [Beowulf] lamexec vs mpirun
Message-ID: <c761caee0912022216t50ce0bcdyc12742bf0311ad33@mail.gmail.com>

hello guys i've buliding a cluster system with mpich-1.2.7p1,when i test my
machine it work..but when i run the mpirun it doesn't work ini cluster...but
its running
and i got this error
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).

mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.

and then i try to run lamexec it work..
 i wanna ask you guys the different between of lamexec and mpirun
thank you for your advice


regards
christian
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091203/eb719eb1/attachment.html>

From lindahl at pbm.com  Thu Dec  3 11:02:00 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Thu, 3 Dec 2009 11:02:00 -0800
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <4B17DBDC.5060106@ias.edu>
References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>
	<68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com>
	<4B17DBDC.5060106@ias.edu>
Message-ID: <20091203190200.GA647@bx9.net>

On Thu, Dec 03, 2009 at 10:40:12AM -0500, Prentice Bisbal wrote:

> if a single node goes down, you need to take down all the
> nodes in the chassis before you can remove the dead node. Not very
> practical.

Eh? What's so hard about marking the other nodes as unusable in your
batch system, and waiting for them to become free?

-- greg


From Glen.Beane at jax.org  Thu Dec  3 11:22:23 2009
From: Glen.Beane at jax.org (Glen Beane)
Date: Thu, 3 Dec 2009 14:22:23 -0500
Subject: [Beowulf] lamexec vs mpirun
In-Reply-To: <c761caee0912022216t50ce0bcdyc12742bf0311ad33@mail.gmail.com>
Message-ID: <C73D7A1F.56FC%glen.beane@jax.org>


On 12/3/09 1:16 AM, "christian suhendra" <christiansuhendra at gmail.com> wrote:

hello guys i've buliding a cluster system with mpich-1.2.7p1,when i test my machine it work..but when i run the mpirun it doesn't work ini cluster...but its running
and i got this error
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).

mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.

and then i try to run lamexec it work..
 i wanna ask you guys the different between of lamexec and mpirun
thank you for your advice


You are not using mpich,  you are using LAM-MPI. LAM uses daemons that must be running on all of the nodes before mpirun can launch the MPI executables.   The lamexec command essentially does a "lamboot" and "mpirun" in a single command.  By the way, LAM is deprecated, and its users are encouraged to switch to OpenMPI.  If you were intending to use mpich and have it installed on your system LAM is being found first in your path and not mpich.

--
Glen L. Beane
Software Engineer
The Jackson Laboratory
Phone (207) 288-6153

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091203/aae728b5/attachment.html>

From hahn at mcmaster.ca  Thu Dec  3 11:29:10 2009
From: hahn at mcmaster.ca (Mark Hahn)
Date: Thu, 3 Dec 2009 14:29:10 -0500 (EST)
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <20091203190200.GA647@bx9.net>
References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>
	<68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com>
	<4B17DBDC.5060106@ias.edu> <20091203190200.GA647@bx9.net>
Message-ID: <Pine.LNX.4.64.0912031426350.23226@coffee.psychology.mcmaster.ca>

>> if a single node goes down, you need to take down all the
>> nodes in the chassis before you can remove the dead node. Not very
>> practical.
>
> Eh? What's so hard about marking the other nodes as unusable in your
> batch system, and waiting for them to become free?

depends on your max job length.  but yeah, idling three nodes for a week
is not going to be noticable in anything but a quite small cluster...


From Glen.Beane at jax.org  Thu Dec  3 11:31:58 2009
From: Glen.Beane at jax.org (Glen Beane)
Date: Thu, 3 Dec 2009 14:31:58 -0500
Subject: [Beowulf] lamexec vs mpirun
In-Reply-To: <C73D7A1F.56FC%glen.beane@jax.org>
Message-ID: <C73D7C5E.56FF%glen.beane@jax.org>


On 12/3/09 2:22 PM, "Glen Beane" <glen.beane at jax.org> wrote:


On 12/3/09 1:16 AM, "christian suhendra" <christiansuhendra at gmail.com> wrote:

hello guys i've buliding a cluster system with mpich-1.2.7p1,when i test my machine it work..but when i run the mpirun it doesn't work ini cluster...but its running
and i got this error
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).

mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.

and then i try to run lamexec it work..
 i wanna ask you guys the different between of lamexec and mpirun
thank you for your advice


You are not using mpich,  you are using LAM-MPI. LAM uses daemons that must be running on all of the nodes before mpirun can launch the MPI executables.   The lamexec command essentially does a "lamboot" and "mpirun" in a single command.  By the way, LAM is deprecated, and its users are encouraged to switch to OpenMPI.  If you were intending to use mpich and have it installed on your system LAM is being found first in your path and not mpich.

Actually, I take this back - lamexec is not like lamboot and mpirun in one,  for some reason I had LAM's mpiexec command on my mind, which if I remember right is equivalent to  lamboot + mpirun  (It has been a while).  lamexec is like a distributed shell that launches non-mpi programs on nodes running the LAM daemon (lamd).  but my main point still stands,  you aren't using mpich even if you think you are.

--
Glen L. Beane
Software Engineer
The Jackson Laboratory
Phone (207) 288-6153

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091203/7a32e528/attachment.html>

From jlb17 at duke.edu  Thu Dec  3 11:35:45 2009
From: jlb17 at duke.edu (Joshua Baker-LePain)
Date: Thu, 3 Dec 2009 14:35:45 -0500 (EST)
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <Pine.LNX.4.64.0912031426350.23226@coffee.psychology.mcmaster.ca>
References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>
	<68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com>
	<4B17DBDC.5060106@ias.edu> <20091203190200.GA647@bx9.net>
	<Pine.LNX.4.64.0912031426350.23226@coffee.psychology.mcmaster.ca>
Message-ID: <alpine.LRH.2.00.0912031434350.8086@hogwarts.egr.duke.edu>

On Thu, 3 Dec 2009 at 2:29pm, Mark Hahn wrote

>>> if a single node goes down, you need to take down all the
>>> nodes in the chassis before you can remove the dead node. Not very
>>> practical.
>> 
>> Eh? What's so hard about marking the other nodes as unusable in your
>> batch system, and waiting for them to become free?
>
> depends on your max job length.  but yeah, idling three nodes for a week
> is not going to be noticable in anything but a quite small cluster...

But doesn't the engineer in you just bristle at the (admittedly, rather 
slight) inefficiency?  Call me OCD (you wouldn't be the first), but it 
just bugs me.

-- 
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF


From hahn at mcmaster.ca  Thu Dec  3 11:54:13 2009
From: hahn at mcmaster.ca (Mark Hahn)
Date: Thu, 3 Dec 2009 14:54:13 -0500 (EST)
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <alpine.LRH.2.00.0912031434350.8086@hogwarts.egr.duke.edu>
References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>
	<68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com>
	<4B17DBDC.5060106@ias.edu> <20091203190200.GA647@bx9.net>
	<Pine.LNX.4.64.0912031426350.23226@coffee.psychology.mcmaster.ca>
	<alpine.LRH.2.00.0912031434350.8086@hogwarts.egr.duke.edu>
Message-ID: <Pine.LNX.4.64.0912031447360.23226@coffee.psychology.mcmaster.ca>

>> depends on your max job length.  but yeah, idling three nodes for a week
>> is not going to be noticable in anything but a quite small cluster...
>
> But doesn't the engineer in you just bristle at the (admittedly, rather 
> slight) inefficiency?  Call me OCD (you wouldn't be the first), but it just 
> bugs me.

well, what I prefer to do is set a reservation on the node 
that starts one max-job-period in the future.  that means shorter jobs 
get to use the node until then.

ultimately, it depends both on how much individual hotpluggability costs
(the HW changes don't sound trivial to me) as well as the joblength/etc.
I doubt I'd choose to pay more than a few percent for HPability,
assuming 2-4 nodes-per-box and ~<7d job limits...


From prentice at ias.edu  Thu Dec  3 11:54:47 2009
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu, 03 Dec 2009 14:54:47 -0500
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <20091203190200.GA647@bx9.net>
References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>	<68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com>	<4B17DBDC.5060106@ias.edu>
	<20091203190200.GA647@bx9.net>
Message-ID: <4B181787.2050407@ias.edu>

Greg Lindahl wrote:
> On Thu, Dec 03, 2009 at 10:40:12AM -0500, Prentice Bisbal wrote:
> 
>> if a single node goes down, you need to take down all the
>> nodes in the chassis before you can remove the dead node. Not very
>> practical.
> 
> Eh? What's so hard about marking the other nodes as unusable in your
> batch system, and waiting for them to become free?
> 
> -- greg
> 

I didn't say it was hard - just impractical. ;)

I thought the same thing when HP told me the nodes weren't
hot-swappable. But then when I learned the SuperMicros were hot
swappable, I figured if SuperMicro can do it, why not HP?

I'm sure you'll agree that taking just one node down instead of 4 is
more convenient, and is less likely to draw the ire of your
number-crunchers.

--
Prentice


From lindahl at pbm.com  Thu Dec  3 12:07:45 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Thu, 3 Dec 2009 12:07:45 -0800
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <alpine.LRH.2.00.0912031434350.8086@hogwarts.egr.duke.edu>
References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>
	<68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com>
	<4B17DBDC.5060106@ias.edu> <20091203190200.GA647@bx9.net>
	<Pine.LNX.4.64.0912031426350.23226@coffee.psychology.mcmaster.ca>
	<alpine.LRH.2.00.0912031434350.8086@hogwarts.egr.duke.edu>
Message-ID: <20091203200745.GA14991@bx9.net>

On Thu, Dec 03, 2009 at 02:35:45PM -0500, Joshua Baker-LePain wrote:

> But doesn't the engineer in you just bristle at the (admittedly, rather  
> slight) inefficiency?  Call me OCD (you wouldn't be the first), but it  
> just bugs me.

No. If I saved a bit of $$ with the funky case, that's probably a lot
more than I lose from having some idle nodes now and then.

-- greg


From Greg at keller.net  Thu Dec  3 12:17:56 2009
From: Greg at keller.net (Greg Keller)
Date: Thu, 3 Dec 2009 14:17:56 -0600
Subject: [Beowulf] Re: cluster fails to boot with managed switch,
	but 5-port switch works OK
In-Reply-To: <200912031846.nB3Ikesg029202@bluewest.scyld.com>
References: <200912031846.nB3Ikesg029202@bluewest.scyld.com>
Message-ID: <481727C4-110B-4D18-8361-DA12469D70FF@Keller.net>

>>>> What's got me and the IT guys stumped is that while the compute  
>>>> nodes
>>> boot via PXE from the head node without trouble on the NetGear, they
>>> barf with the SMC.  To be specific, after the initial boot with a
>>> minimal Linux kernel, there is a "fatal error" with "timeout  
>>> waiting for
>>> getfile" when the compute node attempts to download the provisioning
>>> image from head.  However, when they were running Rocks before I
>>> arrived, the cluster worked fine with the SMC switch.


This is very common with Spanning tree enabled.  Essentially, once the  
port has a physical link light it may take a while before spanning  
tree allows traffic to actually flow through the port.  Longer than a  
typical timeout.  When loading/reloading the driver there seems to be  
an instantaneous drop of the link that forces a new delay cycle.

With the Dell PowerConnect (SMC Rebrand??) series you have to "enable"  
port fast or "disable" spanning tree to avoid this delay before  
traffic passes.  I generally do both.  The Web based GUI is  
sufficiently bad enough to make this more difficult than it needs to  
be, but you can globally disable spanning tree through it.  I use the  
command line, connect to interface range all, and then configure my  
ports as:

!
enable
config
interface range ethernet all
spanning-tree disable
spanning-tree portfast
mtu 9216
exit
!

Hope this helps!

Cheers!
Greg

Technical Principal
R Systems NA, inc.


From lindahl at pbm.com  Thu Dec  3 12:21:25 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Thu, 3 Dec 2009 12:21:25 -0800
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
Message-ID: <20091203202125.GF14991@bx9.net>

On Thu, Dec 03, 2009 at 02:54:47PM -0500, Prentice Bisbal wrote:

> I didn't say it was hard - just impractical. ;)

I suspect we have different ideas about what is impractical. I take
nodes out of service gracefully all the time to fix fans, intermittant
memory errors, system disks going bad, and the like.  If anyone
complained, I would explain that I'm saving them from seeing
interrupted jobs.

-- greg


From gerry.creager at tamu.edu  Thu Dec  3 12:30:54 2009
From: gerry.creager at tamu.edu (Gerald Creager)
Date: Thu, 03 Dec 2009 14:30:54 -0600
Subject: [Beowulf] Forwarded from a long time reader having trouble posting
In-Reply-To: <d5bdff000912020349ka3830cfl679dead4a04f1581@mail.gmail.com>
References: <4B155A43.3010304@scalableinformatics.com>	
	<4B157249.80701@tamu.edu>	
	<35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org>	
	<d5bdff000912011253x236ec695u87b14777f7ebb737@mail.gmail.com>	
	<4B1584E0.8060704@tamu.edu>	
	<20091201214217.GA19804@galactic.demon.co.uk>	
	<4B159330.5070802@tamu.edu>
	<d5bdff000912020349ka3830cfl679dead4a04f1581@mail.gmail.com>
Message-ID: <4B181FFE.2070401@tamu.edu>

Toon Knapen wrote:
>      
>     I believe xfs is now available in 5.4.  I'd have to check.  We've
>     found xfs to be our preference (but we're revisiting gluster and
>     lustre). I've not played with gfs so far.
> 
>  
>  
> And why do you prefer xfs if I may ask. Performance? Do you many small 
> files or large files?

Our experience was that XFS was both the most reliable and best 
performing of the various file systems we experimented with.  While the 
majority of our files are relatively small, it also performs well with 
large files.  XFS repair is usually a simple manner and very fast, when 
the XFS equivalent of an fsck is required.

gerry


From gerry.creager at tamu.edu  Thu Dec  3 13:01:15 2009
From: gerry.creager at tamu.edu (Gerald Creager)
Date: Thu, 03 Dec 2009 15:01:15 -0600
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <4B181787.2050407@ias.edu>
References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>	<68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com>	<4B17DBDC.5060106@ias.edu>	<20091203190200.GA647@bx9.net>
	<4B181787.2050407@ias.edu>
Message-ID: <4B18271B.4050005@tamu.edu>

Prentice Bisbal wrote:
> Greg Lindahl wrote:
>> On Thu, Dec 03, 2009 at 10:40:12AM -0500, Prentice Bisbal wrote:
>>
>>> if a single node goes down, you need to take down all the
>>> nodes in the chassis before you can remove the dead node. Not very
>>> practical.
>> Eh? What's so hard about marking the other nodes as unusable in your
>> batch system, and waiting for them to become free?
>>
>> -- greg
>>
> 
> I didn't say it was hard - just impractical. ;)
> 
> I thought the same thing when HP told me the nodes weren't
> hot-swappable. But then when I learned the SuperMicros were hot
> swappable, I figured if SuperMicro can do it, why not HP?
> 
> I'm sure you'll agree that taking just one node down instead of 4 is
> more convenient, and is less likely to draw the ire of your
> number-crunchers.


Because of our ill-advised choice of a specific APC rack, whenever I've 
got to remove one of the supermicro's from the 2uTwin chassis, I have to 
power 'em all off.  I then have to ease the chassis out so I've enough 
room to get a node out.  I learned to "just do it" after a 2 month whine-in.

gerry


From jlforrest at berkeley.edu  Thu Dec  3 13:50:27 2009
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Thu, 03 Dec 2009 13:50:27 -0800
Subject: [Beowulf] Forwarded from a long time reader having trouble posting
In-Reply-To: <4B181FFE.2070401@tamu.edu>
References: <4B155A43.3010304@scalableinformatics.com>		<4B157249.80701@tamu.edu>		<35AAF1E4A771E142979F27B51793A48887030F12C8@AVEXMB1.qlogic.org>		<d5bdff000912011253x236ec695u87b14777f7ebb737@mail.gmail.com>		<4B1584E0.8060704@tamu.edu>		<20091201214217.GA19804@galactic.demon.co.uk>		<4B159330.5070802@tamu.edu>	<d5bdff000912020349ka3830cfl679dead4a04f1581@mail.gmail.com>
	<4B181FFE.2070401@tamu.edu>
Message-ID: <4B1832A3.4030507@berkeley.edu>

Gerald Creager wrote:
> Toon Knapen wrote:
>>          I believe xfs is now available in 5.4.  I'd have to check.  
>> We've
>>     found xfs to be our preference (but we're revisiting gluster and
>>     lustre). I've not played with gfs so far.

I've used xfs in several CentOS servers. In general, it works
great but I've had some scary looking crashes with 'xfs'
in the call stack (see below). I stupidly haven't taken the
time to send the crashes the the xfs people so maybe these
problems are known and fixed. I also haven't tried xfs with CentOS 5.4,
which contains official support in the kernel.

I'm sticking with ext3 for now, not because it's great, but
because it's reliable and good enough. I'd love to switch to xfs,
though.

Here's one stack trace:

Sep 28 08:41:08 frank kernel: Filesystem "md1": XFS internal error 
xfs_btree_check_sblock at line 307 of file fs/xfs/xfs_btree.c.  Caller 
0xffffffff8836deaf

Sep 28 08:41:08 frank kernel:
Sep 28 08:41:08 frank kernel: Call Trace:
Sep 28 08:41:08 frank kernel:  [<ffffffff8835ebe6>] 
:xfs:xfs_btree_check_sblock+0xaf/0xbe
Sep 28 08:41:08 frank kernel:  [<ffffffff8836deaf>]
:xfs:xfs_inobt_increment+0x156/0x17e
Sep 28 08:41:08 frank kernel:  [<ffffffff8836d918>] 
:xfs:xfs_dialloc+0x4d0/0x80c
Sep 28 08:41:08 frank kernel:  [<ffffffff88373aed>] 
:xfs:xfs_ialloc+0x5f/0x57f
Sep 28 08:41:08 frank kernel:  [<ffffffff88385b0f>] 
:xfs:xfs_dir_ialloc+0x86/0x2b7
Sep 28 08:41:08 frank kernel:  [<ffffffff8837a4b0>] 
:xfs:xlog_grant_log_space+0x204/0x25c
Sep 28 08:41:08 frank kernel:  [<ffffffff883885e4>] 
:xfs:xfs_create+0x237/0x45c
Sep 28 08:41:08 frank kernel:  [<ffffffff8834dd47>] 
:xfs:xfs_attr_get+0x8e/0x9f
Sep 28 08:41:08 frank kernel:  [<ffffffff88391e50>] 
:xfs:xfs_vn_mknod+0x144/0x215
Sep 28 08:41:08 frank kernel:  [<ffffffff8003a5cd>] vfs_create+0xe6/0x158

Cordially,
-- 
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest at berkeley.edu


From rpnabar at gmail.com  Thu Dec  3 15:08:39 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Thu, 3 Dec 2009 17:08:39 -0600
Subject: [Beowulf] Re: cluster fails to boot with managed switch, but 
	5-port switch works OK
In-Reply-To: <481727C4-110B-4D18-8361-DA12469D70FF@Keller.net>
References: <200912031846.nB3Ikesg029202@bluewest.scyld.com>
	<481727C4-110B-4D18-8361-DA12469D70FF@Keller.net>
Message-ID: <c4d69730912031508s42e7c8c3r8c0b5d538b67064e@mail.gmail.com>

On Thu, Dec 3, 2009 at 2:17 PM, Greg Keller <Greg at keller.net> wrote:

> This is very common with Spanning tree enabled. ?Essentially, once the port
> has a physical link light it may take a while before spanning tree allows
> traffic to actually flow through the port.

I've had to do this same thing that Greg describes with my Dell power
connect (port fast and spanning tree disable) but on a more
fundamental level:

Why is the long delay with spanning tree? And is it possible to (a)
reduce this delay (b) Increase the timeout for PXE / DHCP?

-- 
Rahul


From csamuel at vpac.org  Thu Dec  3 16:53:39 2009
From: csamuel at vpac.org (Chris Samuel)
Date: Fri, 4 Dec 2009 11:53:39 +1100 (EST)
Subject: [Beowulf] Forwarded from a long time reader having trouble posting
In-Reply-To: <4B181FFE.2070401@tamu.edu>
Message-ID: <1523002622.7329231259888019196.JavaMail.root@mail.vpac.org>


----- "Gerald Creager" <gerry.creager at tamu.edu> wrote:

> XFS repair is usually a simple manner and very fast,

You can also do online growing of XFS filesystems (if
you use LVM for instance) which is dead useful.

Added bonus - you can also do online defragmentation
of XFS filesystems with xfs_fsr (for some reason
packaged in the xfsdump .deb).

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


From csamuel at vpac.org  Thu Dec  3 17:57:07 2009
From: csamuel at vpac.org (Chris Samuel)
Date: Fri, 4 Dec 2009 12:57:07 +1100 (EST)
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <20091203190200.GA647@bx9.net>
Message-ID: <588682529.7331271259891827276.JavaMail.root@mail.vpac.org>


----- "Greg Lindahl" <lindahl at pbm.com> wrote:

> Eh? What's so hard about marking the other nodes
> as unusable in your batch system, and waiting for
> them to become free?

If you've got a job running on there for a month
or two then there's a fairly high opportunity cost
involved.

-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


From lindahl at pbm.com  Thu Dec  3 18:11:22 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Thu, 3 Dec 2009 18:11:22 -0800
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <588682529.7331271259891827276.JavaMail.root@mail.vpac.org>
References: <20091203190200.GA647@bx9.net>
	<588682529.7331271259891827276.JavaMail.root@mail.vpac.org>
Message-ID: <20091204021122.GA17286@bx9.net>

On Fri, Dec 04, 2009 at 12:57:07PM +1100, Chris Samuel wrote:

> If you've got a job running on there for a month
> or two then there's a fairly high opportunity cost
> involved.

That kind of policy has a fairly high opportunity cost, even before
you factor in linked nodes. E.g. you see a system disk going bad, but
the user will lose all their output unless the job runs for 4 more
weeks...

-- greg


From bill at cse.ucdavis.edu  Thu Dec  3 18:19:08 2009
From: bill at cse.ucdavis.edu (Bill Broadley)
Date: Thu, 03 Dec 2009 18:19:08 -0800
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <20091204021122.GA17286@bx9.net>
References: <20091203190200.GA647@bx9.net>	<588682529.7331271259891827276.JavaMail.root@mail.vpac.org>
	<20091204021122.GA17286@bx9.net>
Message-ID: <4B18719C.5080008@cse.ucdavis.edu>

Greg Lindahl wrote:
> On Fri, Dec 04, 2009 at 12:57:07PM +1100, Chris Samuel wrote:
> 
>> If you've got a job running on there for a month
>> or two then there's a fairly high opportunity cost
>> involved.
> 
> That kind of policy has a fairly high opportunity cost, even before
> you factor in linked nodes. E.g. you see a system disk going bad, but
> the user will lose all their output unless the job runs for 4 more
> weeks...

Indeed.  You'd hope that such long running jobs would checkpoint.  Seems like
the perfect place for virtualization.  Seems like for mostly CPU bound jobs
the overhead is getting pretty low.  Then you get all kinds of benefits:
* Checkpointing
* Migration
* easy backfill

Seems like it would be real popular with the admins.  Anyone doing this?


From csamuel at vpac.org  Thu Dec  3 18:32:12 2009
From: csamuel at vpac.org (Chris Samuel)
Date: Fri, 4 Dec 2009 13:32:12 +1100 (EST)
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <195152680.7331771259893408253.JavaMail.root@mail.vpac.org>
Message-ID: <1193336713.7333671259893932486.JavaMail.root@mail.vpac.org>


----- "Greg Lindahl" <lindahl at pbm.com> wrote:

> That kind of policy has a fairly high opportunity
> cost, even before you factor in linked nodes.

Well we cannot dictate to our users what they do,
we set a maximum walltime of 3 months and tell users
that they should checkpoint (if they have control of
the application and have coding skills).

> E.g. you see a system disk going bad, but the user
> will lose all their output unless the job runs for
> 4 more weeks...

We run SMART tests and the like trying to proactively
spot bad disks (and other hardware) prior to failures,
but yes, that's inevitable.

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


From richard.walsh at comcast.net  Thu Dec  3 18:34:39 2009
From: richard.walsh at comcast.net (richard.walsh at comcast.net)
Date: Fri, 4 Dec 2009 02:34:39 +0000 (UTC)
Subject: [Beowulf] Dual head or service node related question ...
Message-ID: <816343207.8941431259894079516.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>


All, 


In the typical cluster, with a single head node, that node provides 
login services, batch job submission services, and often supports 
a shared file space mounted via NFS from the head node to the 
compute nodes. This approach works reasonably well for not-too-large 
cluster systems. 


What is viewed as the best practice (or what are people doing) on 
something like an SGI ICE system with multiple service or head nodes? 
Does one service node generally assume the same role as the 
head node above (serving NFS, logins, and running services like 
PBS pro)? Or ... if NFS is used, is it perhaps served from another 
service node and mounted both on the login node and the compute 
nodes? Read-only? Is it better to support a shared file space via Lustre 
across all the nodes? 


The architecture chosen has implications ... for instance in the 
common case above PBS Pro would be installed on the head 
node, perhaps in the shared space, and its server and scheduler 
would be run by /etc/init.d/pbs off of the shared partition. The 
bin and sbin commands would shared by the compute nodes. 


In a case where the login service node and the shared file 
space, NFS service node are different, PBS installation 
must be done on the NFS service node in the case where 
the share space is mounted read-only and only the commands 
and man pages would be installed on the login node. What 
are the implications for other user applications the one would 
like to install in the share space for use from the login nodes? 
Some might have write requirements into the installation 
directory? Does this indicate that the NFS partition should 
be mounted read-write on the login node, but read-only on 
the compute nodes? 


Comments and suggestions, particularly from those that 
have set things up on SGI ICE cluster systems would be 
much appreciated. 


rbw 


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091204/b90a0e6c/attachment.html>

From csamuel at vpac.org  Thu Dec  3 18:34:42 2009
From: csamuel at vpac.org (Chris Samuel)
Date: Fri, 4 Dec 2009 13:34:42 +1100 (EST)
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <4B18719C.5080008@cse.ucdavis.edu>
Message-ID: <1587074959.7333741259894082772.JavaMail.root@mail.vpac.org>


----- "Bill Broadley" <bill at cse.ucdavis.edu> wrote:

> Indeed.  You'd hope that such long running jobs would checkpoint. 

They didn't at first, but when we asked them to they
decided they were going to checkpoint all 2GB of RAM
every minute. Over NFS.  We got them to fix that..

> Seems like the perfect place for virtualization.

..or blcr!

[...]
> Seems like it would be real popular with the admins.
> Anyone doing this?

How does it deal with pinned DMA memory on NICs ?

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


From lindahl at pbm.com  Thu Dec  3 18:41:29 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Thu, 3 Dec 2009 18:41:29 -0800
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <1193336713.7333671259893932486.JavaMail.root@mail.vpac.org>
References: <195152680.7331771259893408253.JavaMail.root@mail.vpac.org>
	<1193336713.7333671259893932486.JavaMail.root@mail.vpac.org>
Message-ID: <20091204024129.GA25462@bx9.net>

> > E.g. you see a system disk going bad, but the user
> > will lose all their output unless the job runs for
> > 4 more weeks...
> 
> We run SMART tests and the like trying to proactively
> spot bad disks (and other hardware) prior to failures,
> but yes, that's inevitable.

It's not inevitable that the policy be that 3 month jobs are allowed.

But you know me: I never saw a battle I didn't want to fight :-) Arrr,
mateys, this be the BOFH, and I'm heere to educate you about the right
way to use this here supercomputer... my way... or walk the plank!

-- greg


From hahn at mcmaster.ca  Thu Dec  3 22:30:58 2009
From: hahn at mcmaster.ca (Mark Hahn)
Date: Fri, 4 Dec 2009 01:30:58 -0500 (EST)
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <20091204024129.GA25462@bx9.net>
References: <195152680.7331771259893408253.JavaMail.root@mail.vpac.org>
	<1193336713.7333671259893932486.JavaMail.root@mail.vpac.org>
	<20091204024129.GA25462@bx9.net>
Message-ID: <Pine.LNX.4.64.0912040035010.31700@coffee.psychology.mcmaster.ca>

>>> E.g. you see a system disk going bad, but the user
>>> will lose all their output unless the job runs for
>>> 4 more weeks...

until fairly recently (sometime this year), we didn't constrain
the length of jobs.  we now have a 1 week limit - generally 
argued on the basis of expecting longer jobs to checkpoint.
we also provide blcr for serial/threaded jobs.

I have mixed feelings about this.  the purpose of organizations 
providing HPC is to _enable_, not obstruct.  in some cases, this 
could mean working with a group to find an alternative better than,
for instance, not checkpointing a resource-intensive job.

our node/power failure rates are pretty low - not enough to justify
a 1-week limit.  but to he honest, the main motive is probably to 
increase cluster churn - essentially improving scheduler fairness.

> It's not inevitable that the policy be that 3 month jobs are allowed.

if a length limit is to be justified based on probability-of-failure, 
it should be ~ 1/nnodes; if fail-cost-based, 1/ncpus.  unfortunately, the
other extreme would be a sort of "invisible hand" where users experimentally
derive the failure rate by their rate of failed jobs jobs ;(

personally, I think facilities should permit longer jobs, though perhaps
only after discussing the risks and alternatives.  an economic approach 
might reward checkpointing with a fairshare bonus - merely rewarding 
short jobs seems wrong-headed.


From h-bugge at online.no  Thu Dec  3 23:29:29 2009
From: h-bugge at online.no (=?ISO-8859-1?Q?H=E5kon_Bugge?=)
Date: Fri, 4 Dec 2009 08:29:29 +0100
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <1587074959.7333741259894082772.JavaMail.root@mail.vpac.org>
References: <1587074959.7333741259894082772.JavaMail.root@mail.vpac.org>
Message-ID: <8B9ED9B6-6D97-4F77-B8EF-A8A0A0D49131@online.no>

Hi,


On Dec 4, 2009, at 3:34 , Chris Samuel wrote:
>
> How does it deal with pinned DMA memory on NICs ?


What we did in Platform (Scali) MPI, was to drain the HPC  
interconnect, then close it down. The problem was then reduced to  
checkpoint (e.g. using BLCR) N processes. Continuing from checkpoint  
and restarting from it would both re-open the HPC fabric (could be on  
another physical medium though). You could take the checkpoint on IB  
and restart using Gbe.

Combined with an agnostic interconnect support, this feature allows  
you in the case of a failing IB HCA (or failing switch port or cable)   
to restart from last the checkpoint, runn M-1 nodes communicating with  
other M-2 IB capable nodes using IB, and the last node communicating  
with the M-1 nodes using Gbe.

Traditional checkpointing requires snap-shot of the file-system in the  
general case (and restore of the correct snap-shot at restart),  
whereas checkpoint-and-kill (for migration or preemptive batch  
scheduling) does not require integration with file-systems.


H?kon


From john.hearns at mclaren.com  Fri Dec  4 01:24:23 2009
From: john.hearns at mclaren.com (Hearns, John)
Date: Fri, 4 Dec 2009 09:24:23 -0000
Subject: [Beowulf] Dual head or service node related question ...
In-Reply-To: <816343207.8941431259894079516.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>
References: <816343207.8941431259894079516.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>
Message-ID: <68A57CCFD4005646957BD2D18E60667B0E7D0152@milexchmb1.mil.tagmclarengroup.com>

 
What is viewed as the best practice (or what are people doing) on

something like an SGI ICE system with multiple service or head nodes?

Does one service node generally assume the same role as the

head node above (serving NFS, logins, and running services like

PBS pro)?  Or ... if NFS is used, is it perhaps served from another

service node and mounted both on the login node and  the compute

nodes?

 
Two service nodes which act as login/batch submission nodes.

PBSpro configured to fail over between them (ie one is the PBS primary server).

Separate server for storage ? SGI connect these storage servers via the Infiniband fabric,

and use multiple Infiniband ports to spread the load ? you can easily configure this at cluster install time,

ie. every nth node connects to a different Infiniband port on the storage server.

 
The contents of this email are confidential and for the exclusive use of the intended recipient.  If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091204/2aef8b1e/attachment.html>

From john.hearns at mclaren.com  Fri Dec  4 01:45:43 2009
From: john.hearns at mclaren.com (Hearns, John)
Date: Fri, 4 Dec 2009 09:45:43 -0000
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <20091204024129.GA25462@bx9.net>
References: <195152680.7331771259893408253.JavaMail.root@mail.vpac.org><1193336713.7333671259893932486.JavaMail.root@mail.vpac.org>
	<20091204024129.GA25462@bx9.net>
Message-ID: <68A57CCFD4005646957BD2D18E60667B0E7D01AE@milexchmb1.mil.tagmclarengroup.com>


It's not inevitable that the policy be that 3 month jobs are allowed.


Three MONTHS. Some celebrities careers are shorter than that these days.

If people running jobs like this don't checkpoint, they deserve
everything they get.


The contents of this email are confidential and for the exclusive use of the intended recipient.  If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy.


From reuti at staff.uni-marburg.de  Fri Dec  4 03:13:08 2009
From: reuti at staff.uni-marburg.de (Reuti)
Date: Fri, 4 Dec 2009 12:13:08 +0100
Subject: [Beowulf] Dual head or service node related question ...
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0E7D0152@milexchmb1.mil.tagmclarengroup.com>
References: <816343207.8941431259894079516.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>
	<68A57CCFD4005646957BD2D18E60667B0E7D0152@milexchmb1.mil.tagmclarengroup.com>
Message-ID: <8A56A1CE-7199-4EB9-BC3B-76BCF5ABB8D3@staff.uni-marburg.de>

Hi,

Am 04.12.2009 um 10:24 schrieb Hearns, John:

> What is viewed as the best practice (or what are people doing) on
>
> something like an SGI ICE system with multiple service or head nodes?
>
> Does one service node generally assume the same role as the
>
> head node above (serving NFS, logins, and running services like
>
> PBS pro)?  Or ... if NFS is used, is it perhaps served from another
>
> service node and mounted both on the login node and  the compute
>
> nodes?
>

I don't know for the original system you mentioned. We use SGE (not  
PBSpro) and I prefer putting it's qmaster also on the fileserver (the  
additional load by the fileserver is easier to predict than the  
varying work of interactive users). Then you can have as many login/ 
submission machines as you like or need - there is no daemon running  
at all on them (though it might be different for PBSpro). The  
submission machines just need read access to /usr/sge or whereever  
it's installed to source the settings file and have access to the  
commands. Nevertheless it could be installed w/o NFS access at all -  
even the nodes could spare NFS, but you would lose some fucntionality  
and need some kind of file-staging for the jobs files.

SGE's options regarding NFS are explained here: http:// 
gridengine.sunsource.net/howto/nfsreduce.html The options having just  
local spool directories fits my needs best. Maybe PBSpro has similar  
possibilities.

How is PBSpro doing its spooling - do they have some kind of database  
like SGE?

Is anyone putting the qmaster(s) in separate virtual machine(s) on  
the file server for failover - I got this idea recently?

-- Reuti

>
>
> Two service nodes which act as login/batch submission nodes.
>
> PBSpro configured to fail over between them (ie one is the PBS  
> primary server).
>
> Separate server for storage ? SGI connect these storage servers via  
> the Infiniband fabric,
>
> and use multiple Infiniband ports to spread the load ? you can  
> easily configure this at cluster install time,
>
> ie. every nth node connects to a different Infiniband port on the  
> storage server.
>


From christiansuhendra at gmail.com  Fri Dec  4 06:47:37 2009
From: christiansuhendra at gmail.com (christian suhendra)
Date: Fri, 4 Dec 2009 22:47:37 +0800
Subject: [Beowulf] mpirun command
Message-ID: <c761caee0912040647y657a752k787ec03104499f44@mail.gmail.com>

hello guys...
im using mpich-1.2.7p1 installed on my PC
but when i run mpirun
i've got this error :
suhendra18 at cluster2:/mirror/mpich-1.2.7p1/examples$ mpirun -np 1 cpi
-----------------------------------------------------------------------------

It seems that there is no lamd running on the host cluster2.

This indicates that the LAM/MPI runtime environment is not operating.
The LAM/MPI runtime environment is necessary for MPI programs to run
(the MPI program tired to invoke the "MPI_Init" function).

Please run the "lamboot" command the start the LAM/MPI runtime
environment.  See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------
did my mpich switch to LAM/MPI?? so what should i do??


regards
christian
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091204/642d9157/attachment.html>

From gus at ldeo.columbia.edu  Fri Dec  4 13:04:51 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Fri, 04 Dec 2009 16:04:51 -0500
Subject: [Beowulf] mpirun command
In-Reply-To: <c761caee0912040647y657a752k787ec03104499f44@mail.gmail.com>
References: <c761caee0912040647y657a752k787ec03104499f44@mail.gmail.com>
Message-ID: <4B197973.5000604@ldeo.columbia.edu>

Hi Christian

Your default mpirun seems to be the old LAM MPI.
Do "which mpirun", "mpirun --showme".

You can use the full path name to your MPICH mpirun.
You should also use the full path name to MPICH mpicc to compile
cpi.c, for compatibility.
It is likely that both are somewhere in your /mirror/mpich-1.2.7p1/
directory tree.
I would guess  /mirror/mpich-1.2.7p1/mpicc
and /mirror/mpich-1.2.7p1/mpirun, but you need to check.

Well, MPICH-1 is also old, and often problematic.
Better upgrade from MPICH1 to MPICH2 and/or from LAM/MPI to OpenMPI,
which are open source and relatively easy to build using
the free Gnu compilers (gcc, gfortran):

http://www.mcs.anl.gov/research/projects/mpich2/
http://www.open-mpi.org/

You may also take a look at their mailing lists,
for specific help w.r.t. MPI.

If you are on Linux,
there are also MPICH2 RPMs for some Linux distributions:

http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads


My $0.02
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

christian suhendra wrote:
> hello guys...
> im using mpich-1.2.7p1 installed on my PC
> but when i run mpirun
> i've got this error :
> suhendra18 at cluster2:/mirror/mpich-1.2.7p1/examples$ mpirun -np 1 cpi
> -----------------------------------------------------------------------------
> 
> It seems that there is no lamd running on the host cluster2.
> 
> This indicates that the LAM/MPI runtime environment is not operating.
> The LAM/MPI runtime environment is necessary for MPI programs to run
> (the MPI program tired to invoke the "MPI_Init" function).
> 
> Please run the "lamboot" command the start the LAM/MPI runtime
> environment.  See the LAM/MPI documentation for how to invoke
> "lamboot" across multiple machines.
> -----------------------------------------------------------------------------
> did my mpich switch to LAM/MPI?? so what should i do??
> 
> 
> regards
> christian
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


From djholm at fnal.gov  Fri Dec  4 13:22:09 2009
From: djholm at fnal.gov (Don Holmgren)
Date: Fri, 04 Dec 2009 15:22:09 -0600 (CST)
Subject: [Beowulf] mpirun command
In-Reply-To: <c761caee0912040647y657a752k787ec03104499f44@mail.gmail.com>
References: <c761caee0912040647y657a752k787ec03104499f44@mail.gmail.com>
Message-ID: <Pine.LNX.4.62.0912041520050.26624@ccfsrv2.fnal.gov>


Your PC is likely running a Linux distribution that has LAM installed
by default, and "mpirun" is in your path ahead of mpich's mpirun.  You
can confirm this with `which mpirun`.

Try
   /mirror/mpich-1.2.7p1/bin/mpirun -np 1 cpi
instead.  Or make sure that /mirror/mpich-1.2.7p1/bin is at the front of
your path.

Don


On Fri, 4 Dec 2009, christian suhendra wrote:

> hello guys...
> im using mpich-1.2.7p1 installed on my PC
> but when i run mpirun
> i've got this error :
> suhendra18 at cluster2:/mirror/mpich-1.2.7p1/examples$ mpirun -np 1 cpi
> -----------------------------------------------------------------------------
>
> It seems that there is no lamd running on the host cluster2.
>
> This indicates that the LAM/MPI runtime environment is not operating.
> The LAM/MPI runtime environment is necessary for MPI programs to run
> (the MPI program tired to invoke the "MPI_Init" function).
>
> Please run the "lamboot" command the start the LAM/MPI runtime
> environment.  See the LAM/MPI documentation for how to invoke
> "lamboot" across multiple machines.
> -----------------------------------------------------------------------------
> did my mpich switch to LAM/MPI?? so what should i do??
>
>
> regards
> christian


From bcostescu at gmail.com  Fri Dec  4 13:23:06 2009
From: bcostescu at gmail.com (Bogdan Costescu)
Date: Fri, 4 Dec 2009 22:23:06 +0100
Subject: [Beowulf] Re: cluster fails to boot with managed switch, but 
	5-port switch works OK
In-Reply-To: <481727C4-110B-4D18-8361-DA12469D70FF@Keller.net>
References: <200912031846.nB3Ikesg029202@bluewest.scyld.com>
	<481727C4-110B-4D18-8361-DA12469D70FF@Keller.net>
Message-ID: <c609bc800912041323x27130f8aq2381046393333cf0@mail.gmail.com>

On Thu, Dec 3, 2009 at 9:17 PM, Greg Keller <Greg at keller.net> wrote:
> Essentially, once the port
> has a physical link light it may take a while before spanning tree allows
> traffic to actually flow through the port. ?Longer than a typical timeout.

The time taken to activate the link is around 60s, but I've been told
that it can be even higher. I've seen many times laptops randomly not
getting addresses via DHCP because the DHCP timeout and the STP time
on a Cisco switch were both around 60s - makes for very frustrating
network diagnostic.

> ?When loading/reloading the driver there seems to be an instantaneous drop
> of the link that forces a new delay cycle.

Most likely the PXE stack doesn't reset the link; the link is up soon
after the computer is powered on so, by the time the POST has
finished, the link is active. Again most likely, the Linux driver does
a link reset as part of the initialization; I remember that the 3c59x
driver was changed ~6years ago to not do this anymore (at Don Becker's
suggestion, IIRC) and it would allow the established link to remain
active, making DHCP succeed all the time.

Bogdan


From Greg at keller.net  Fri Dec  4 14:36:00 2009
From: Greg at keller.net (Greg Keller)
Date: Fri, 4 Dec 2009 16:36:00 -0600
Subject: [Beowulf] Re: cluster fails to boot with managed switch,
	but  5-port switch works OK
In-Reply-To: <c609bc800912041323x27130f8aq2381046393333cf0@mail.gmail.com>
References: <200912031846.nB3Ikesg029202@bluewest.scyld.com>
	<481727C4-110B-4D18-8361-DA12469D70FF@Keller.net>
	<c609bc800912041323x27130f8aq2381046393333cf0@mail.gmail.com>
Message-ID: <77FC36D1-4E0C-49DA-BA31-8ECB88318434@keller.net>


On Dec 4, 2009, at 3:23 PM, Bogdan Costescu wrote:

>> When loading/reloading the driver there seems to be an  
>> instantaneous drop
>> of the link that forces a new delay cycle.
> Most likely the PXE stack doesn't reset the link; the link is up soon
> after the computer is powered on so, by the time the POST has
> finished, the link is active. Again most likely, the Linux driver does
> a link reset as part of the initialization; I remember that the 3c59x
> driver was changed ~6years ago to not do this anymore (at Don Becker's
> suggestion, IIRC) and it would allow the established link to remain
> active, making DHCP succeed all the time.

That's true for some ports.  Most IPMI (duplex'd) ports seem to come  
up at 100Mb and then switch to 1Gb at some point in the Post.   
Normally PXE seems to work but later in the boot it fails to get a  
DHCP the address, so I suspect you are correct for many cases where  
the System brings up the 1Gb Link early in the post before the PXE.

I like the fix you mention where 3com based cards don't reset the  
link.  Most Lan On Motherboards seem to be Broadcom or Intel e1000  
based in my world...  but it would be kuel if "they" figured out the  
same magic for those drivers.  Ultimately I think it's a workaround  
for overly cautious defaults on switches, but some times it's easier  
to drive around the pothole than fix it.

Cheers!
Greg


From christiansuhendra at gmail.com  Fri Dec  4 15:57:15 2009
From: christiansuhendra at gmail.com (christian suhendra)
Date: Fri, 4 Dec 2009 11:57:15 -1200
Subject: [Beowulf] Re: mpicc error
In-Reply-To: <c761caee0912040609l1a8961a9u546155d292025c69@mail.gmail.com>
References: <c761caee0912040609l1a8961a9u546155d292025c69@mail.gmail.com>
Message-ID: <c761caee0912041557n7eb9ae0cyff11013070d84aef@mail.gmail.com>

2009/12/4 christian suhendra <christiansuhendra at gmail.com>

> hello..may ask your favor??
> when i run mpicc i've got error..(error attached)
> i need your help..would you check my listing program if its true or false
> i wanna make matrix operation and the input takes from file txt,, i'm using
> C in my matrix and use canon algorithm..
> but i'm not sure of what i've done..
> please..my deadline almost 3 week..
> thank you for your advice..God Bless..
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091204/14072cea/attachment.html>

From Michael.Frese at NumerEx-LLC.com  Sun Dec  6 12:12:18 2009
From: Michael.Frese at NumerEx-LLC.com (Michael H. Frese)
Date: Sun, 06 Dec 2009 13:12:18 -0700
Subject: [Beowulf] Cluster Communications Test App
Message-ID: <6.2.5.6.2.20091206102842.0c040d70@NumerEx-LLC.com>

We'd like to test our cluster communications hardware and software -- 
MPI/TCP over GigE -- with something besides our favorite -- and 
failing -- parallel application.

Is there some sort of parallel NetPipe out there that sends and 
receives scads of messages and keeps track of integrity and 
performance results?


Mike


From cap at nsc.liu.se  Mon Dec  7 02:38:40 2009
From: cap at nsc.liu.se (Peter Kjellstrom)
Date: Mon, 7 Dec 2009 11:38:40 +0100
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <4B17DBDC.5060106@ias.edu>
References: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>
	<68A57CCFD4005646957BD2D18E60667B0E75A71F@milexchmb1.mil.tagmclarengroup.com>
	<4B17DBDC.5060106@ias.edu>
Message-ID: <200912071138.44455.cap@nsc.liu.se>

On Thursday 03 December 2009, Prentice Bisbal wrote:
> Hearns, John wrote:
> > I was wondering if anyone has actual experience with running more than
> > one node from a single power supply. Even just two boards on one PSU
> > would be nice. We will be using barely 200W per node for 50 nodes and it
> > just seems like a big waste to buy 50 power supply units. I have read
> > the old posts but did not see any reports of success.
> >
> > Look at the Supermicro twin systems, they have two motherboards in 1U or
> > four motherboards in 2U.
> >
> > I believe HP have similar.
>
> What I learned at SC09:
>
> HP does make twin nodes similar to SuperMicro, but the HP nodes are not
> hot-swappable,

Almost true, the DL1000 is not hot-swapable, the SL6000 kind of is (you can 
pull a 1U sub-unit which can be either one or two nodes).

> if a single node goes down, you need to take down all the 
> nodes in the chassis before you can remove the dead node. Not very
> practical. The SuperMicro nodes are definitely hot-swappable.

Almost true, Supermicro has (or at least had) both hot-swap and non-hot-swap.

/Peter
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part.
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091207/e1065bc0/attachment.sig>

From brockp at umich.edu  Mon Dec  7 07:05:51 2009
From: brockp at umich.edu (Brock Palen)
Date: Mon, 7 Dec 2009 10:05:51 -0500
Subject: [Beowulf] MPI-3 Podcast, 
Message-ID: <5D32BF5F-FB5D-4D6D-BCED-A8F887F9F01A@umich.edu>

Of interest for members of the list is our recent show about MPI-3 and  
the MPI standards process:

http://www.rce-cast.com/index.php/Podcast/rce-22-mpi-3-forum.html

RSS, and iTunes subscribe are available on that page,
Brock Palen and Jeff Squyres speak with Dr. Bill Gropp of University  
of Illinois Urbana-Champaign
Urbana and Dr. Richard Graham of ORNL (Oak Ridge National Lab) on the  
MPI Forum, MPI-2.2 and the upcoming MPI-3 standards for parallel  
programming with MPI.

William Gropp is the Paul and Cynthia Saylor Professor in the  
Department of Computer Science and Deputy Directory for Research for  
the Institute of Advanced Computing Applications and Technologies at  
the University of Illinois in Urbana-Champaign. He received his Ph.D.  
in Computer Science from Stanford University in 1982 and worked at  
Yale University and Argonne National Laboratory. His research  
interests are in parallel computing, software for scientific  
computing, and numerical methods for partial differential equations,  
and he is well known for the MPICH2 and PETSc libraries.

Richard Graham has been at ONRL since Jan, 2007 and is the Group  
Leader for the Application Performance Tools group in the Computer  
Science and Mathematics division at ONRL, and is a Distinguished  
member of the Research Staff. Prior to joining ORNL he spent eight  
years at ORNL serving in a range of technical and managerial roles,  
leaving as the acting group leader for the Advanced Computing  
Laboratory. He is currently chairman of the MPI Forum, and is leading  
the MPI-3 effort. He led the LA-MPI development effort, and is one of  
three founders of the Open MPI project. Dr. Graham received his PhD in  
Theoretical Chemistry from Texas A&M University in 1990, and a BS in  
Chemistry from Seattle Pacific University in 1983.


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985


From csamuel at vpac.org  Mon Dec  7 16:51:27 2009
From: csamuel at vpac.org (Chris Samuel)
Date: Tue, 8 Dec 2009 11:51:27 +1100 (EST)
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <20091204024129.GA25462@bx9.net>
Message-ID: <767373014.7554451260233487577.JavaMail.root@mail.vpac.org>


----- "Greg Lindahl" <lindahl at pbm.com> wrote:

> It's not inevitable that the policy be that 3
> month jobs are allowed.

Our users own our organisation (we are a partnership
of the 8 universities in our state), generally they get
what they ask for (within budget constraints). :-)

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


From csamuel at vpac.org  Mon Dec  7 16:55:09 2009
From: csamuel at vpac.org (Chris Samuel)
Date: Tue, 8 Dec 2009 11:55:09 +1100 (EST)
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <1671342945.7554501260233643669.JavaMail.root@mail.vpac.org>
Message-ID: <954667269.7554531260233709381.JavaMail.root@mail.vpac.org>


----- "H?kon Bugge" <h-bugge at online.no> wrote:

> What we did in Platform (Scali) MPI, was to drain
> the HPC interconnect, then close it down. The problem
> was then reduced to checkpoint (e.g. using BLCR)
> N processes.

I suspect this is what Open-MPI does too, but I
don't know if the VM based systems can migrate
such jobs without this application layer support.

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


From prentice at ias.edu  Tue Dec  8 07:50:28 2009
From: prentice at ias.edu (Prentice Bisbal)
Date: Tue, 08 Dec 2009 10:50:28 -0500
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <4B18719C.5080008@cse.ucdavis.edu>
References: <20091203190200.GA647@bx9.net>	<588682529.7331271259891827276.JavaMail.root@mail.vpac.org>	<20091204021122.GA17286@bx9.net>
	<4B18719C.5080008@cse.ucdavis.edu>
Message-ID: <4B1E75C4.6090707@ias.edu>


Prentice Bisbal
Linux Software Support Specialist/System Administrator
School of Natural Sciences
Institute for Advanced Study
Princeton, NJ


Bill Broadley wrote:
> Greg Lindahl wrote:
>> On Fri, Dec 04, 2009 at 12:57:07PM +1100, Chris Samuel wrote:
>>
>>> If you've got a job running on there for a month
>>> or two then there's a fairly high opportunity cost
>>> involved.
>> That kind of policy has a fairly high opportunity cost, even before
>> you factor in linked nodes. E.g. you see a system disk going bad, but
>> the user will lose all their output unless the job runs for 4 more
>> weeks...
> 
> Indeed.  You'd hope that such long running jobs would checkpoint.  

You'd hope that. Most of my current clusters users are scientific
researchers in academia, not computer scientists. While some are
extremely computer savvy, others have learned just enough about
programming to do their calculations. Expecting the latter to write code
with checkpointing is unrealistic, and working in academia, I can't
force them to. Which is why taking down 4 nodes instead of just one is
less than ideal.

--
Prentice


From jbardin at bu.edu  Tue Dec  8 09:22:27 2009
From: jbardin at bu.edu (james bardin)
Date: Tue, 8 Dec 2009 12:22:27 -0500
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <4B1E75C4.6090707@ias.edu>
References: <20091203190200.GA647@bx9.net>
	<588682529.7331271259891827276.JavaMail.root@mail.vpac.org>
	<20091204021122.GA17286@bx9.net> <4B18719C.5080008@cse.ucdavis.edu>
	<4B1E75C4.6090707@ias.edu>
Message-ID: <a3b675320912080922q22099987s3995ab4ac9ed08d@mail.gmail.com>

On Tue, Dec 8, 2009 at 10:50 AM, Prentice Bisbal <prentice at ias.edu> wrote:

> You'd hope that. Most of my current clusters users are scientific
> researchers in academia, not computer scientists. While some are
> extremely computer savvy, others have learned just enough about
> programming to do their calculations. Expecting the latter to write code
> with checkpointing is unrealistic, and working in academia, I can't
> force them to. Which is why taking down 4 nodes instead of just one is
> less than ideal.
>

I find it's still advantageous to push them to learn it. A researcher
working with a tight deadline for a grant will often see the light
when a hardware failure loses them a month or more of data processing.
It really is in their own best interests to learn about their tools.


From h-bugge at online.no  Tue Dec  8 09:26:50 2009
From: h-bugge at online.no (=?ISO-8859-1?Q?H=E5kon_Bugge?=)
Date: Tue, 8 Dec 2009 18:26:50 +0100
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <954667269.7554531260233709381.JavaMail.root@mail.vpac.org>
References: <954667269.7554531260233709381.JavaMail.root@mail.vpac.org>
Message-ID: <B3770E35-4AA6-4C8B-8B06-67B56B6031D7@online.no>

On Dec 8, 2009, at 1:55 , Chris Samuel wrote:

> I suspect this is what Open-MPI does too, but I
> don't know if the VM based systems can migrate
> such jobs without this application layer support.

Anyone who knows about migration of VMs where the MPI processes use  
ibverbs (i.e. user space access to  the HCAs)?


H?kon


From james.p.lux at jpl.nasa.gov  Tue Dec  8 09:56:49 2009
From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C))
Date: Tue, 8 Dec 2009 09:56:49 -0800
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <a3b675320912080922q22099987s3995ab4ac9ed08d@mail.gmail.com>
Message-ID: <C743D361.C51B%James.P.Lux@jpl.nasa.gov>


On 12/8/09 9:22 AM, "james bardin" <jbardin at bu.edu> wrote:

> On Tue, Dec 8, 2009 at 10:50 AM, Prentice Bisbal <prentice at ias.edu> wrote:
> 
>> You'd hope that. Most of my current clusters users are scientific
>> researchers in academia, not computer scientists. While some are
>> extremely computer savvy, others have learned just enough about
>> programming to do their calculations. Expecting the latter to write code
>> with checkpointing is unrealistic, and working in academia, I can't
>> force them to. Which is why taking down 4 nodes instead of just one is
>> less than ideal.
>> 
> 
> I find it's still advantageous to push them to learn it. A researcher
> working with a tight deadline for a grant will often see the light
> when a hardware failure loses them a month or more of data processing.
> It really is in their own best interests to learn about their tools.


What about some form of "image checkpoint" like "hibernation"... Should be
application unaware, just snapshots memory.


From prentice at ias.edu  Tue Dec  8 10:52:28 2009
From: prentice at ias.edu (Prentice Bisbal)
Date: Tue, 08 Dec 2009 13:52:28 -0500
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
In-Reply-To: <C743D361.C51B%James.P.Lux@jpl.nasa.gov>
References: <C743D361.C51B%James.P.Lux@jpl.nasa.gov>
Message-ID: <4B1EA06C.3070307@ias.edu>

Lux, Jim (337C) wrote:
> 
> 
> On 12/8/09 9:22 AM, "james bardin" <jbardin at bu.edu> wrote:
> 
>> On Tue, Dec 8, 2009 at 10:50 AM, Prentice Bisbal <prentice at ias.edu> wrote:
>>
>>> You'd hope that. Most of my current clusters users are scientific
>>> researchers in academia, not computer scientists. While some are
>>> extremely computer savvy, others have learned just enough about
>>> programming to do their calculations. Expecting the latter to write code
>>> with checkpointing is unrealistic, and working in academia, I can't
>>> force them to. Which is why taking down 4 nodes instead of just one is
>>> less than ideal.
>>>
>> I find it's still advantageous to push them to learn it. A researcher
>> working with a tight deadline for a grant will often see the light
>> when a hardware failure loses them a month or more of data processing.
>> It really is in their own best interests to learn about their tools.
> 
> 
> What about some form of "image checkpoint" like "hibernation"... Should be
> application unaware, just snapshots memory.

That's fine when the problem is on one system and there's only one
system image to worry about check pointing once you start spreading the
job around to multiple systems, things get complicated, especially if
your node is heterogeneous w.r.t hardware.

I fear we're straying off the topic of the original post...

--
Prentice


From jeff.johnson at aeoncomputing.com  Mon Dec  7 12:11:44 2009
From: jeff.johnson at aeoncomputing.com (Jeff Johnson)
Date: Mon, 07 Dec 2009 12:11:44 -0800
Subject: [Beowulf] Re: Beowulf Digest, Vol 70, Issue 17
In-Reply-To: <200912072000.nB7K0BYf028155@bluewest.scyld.com>
References: <200912072000.nB7K0BYf028155@bluewest.scyld.com>
Message-ID: <4B1D6180.2070604@aeoncomputing.com>

On 12/7/09 12:00 PM, beowulf-request at beowulf.org wrote:
> Date: Sun, 06 Dec 2009 13:12:18 -0700
> From: "Michael H. Frese"<Michael.Frese at NumerEx-LLC.com>
> Subject: [Beowulf] Cluster Communications Test App
>
>    
[...]
> Is there some sort of parallel NetPipe out there that sends and
> receives scads of messages and keeps track of integrity and
> performance results?
>    
What used to be Pallas, now Intel's IMB (v3.2) is what you are looking for:
http://software.intel.com/en-us/articles/intel-mpi-benchmarks/

Compile against your MPI environment and run. Very detailed instructions 
are provided. I would suggest the Alltoall test method within IMB, it is 
a good way to stress your topology.

--Jeff

-- 
------------------------------
Jeff Johnson
Manager
Aeon Computing

jeff.johnson at aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810   f: 858-412-3845
m: 619-204-9061

4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117


From jellogum at gmail.com  Wed Dec  9 10:19:30 2009
From: jellogum at gmail.com (Jeremy Baker)
Date: Wed, 9 Dec 2009 10:19:30 -0800
Subject: [Beowulf] Sony PS3, random news
Message-ID: <d2300d1c0912091019h10654b02wf2e847afbdc39a6c@mail.gmail.com>

DoD buys PS3 for HPC.

CNN brief at

http://scitech.blogs.cnn.com/2009/12/09/military-purchases-2200-ps3s/

Clip from report:

 "Though a single 3.2 GHz cell processor can deliver over 200 GFLOPS,
whereas the Sony PS3 configuration delivers approximately 150 GFLOPS, the
approximately tenfold cost difference per GFLOP makes the Sony PS3 the only
viable technology for HPC applications."


-- 
Jeremy Baker
PO 297
Johnson, VT
05656
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091209/ad746115/attachment.html>

From amjad11 at gmail.com  Wed Dec  9 17:22:44 2009
From: amjad11 at gmail.com (amjad ali)
Date: Wed, 9 Dec 2009 20:22:44 -0500
Subject: [Beowulf] scalability
Message-ID: <428810f20912091722lca5d633g45b3feee7543bbf@mail.gmail.com>

Hi all,

I have, with my group, a small cluster of about 16 nodes (each one with
single socket Xeon 3085 or 3110; And I face problem of poor scalability. Its
network is quite ordinary GiGE (perhaps DLink DGS-1024D 24-Port
10/100/1000), store and forward switch, of price about $250 only.
ftp://ftp10.dlink.com/pdfs/products/DGS-1024D/DGS-1024D_ds.pdf

How should I work on that for better scalability?

What could be better affordable options of fast switches? (Myrinet,
Infiniband are quite costly).

When buying a switch what should we see in it? What latency?


Thank you very much.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091209/d552f9b5/attachment.html>

From amjad11 at gmail.com  Wed Dec  9 17:46:25 2009
From: amjad11 at gmail.com (amjad ali)
Date: Wed, 9 Dec 2009 20:46:25 -0500
Subject: [Beowulf] cluster sharing
Message-ID: <428810f20912091746k74184ef4n125c2b5ada94606a@mail.gmail.com>

Hi all,

I am usually running my parallel jobs on the university cluster of several
hundred nodes (accessible to all). So often I observe very different total
MPI_Wtime value when I run my program (even of same problem size) on
different occasion.

How should I make a reasonable performance measurement in such a case? Is it
fine to get speedup/measurements on such a shared cluster?


Thanks.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091209/7c378e78/attachment.html>

From gus at ldeo.columbia.edu  Wed Dec  9 18:11:44 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Wed, 09 Dec 2009 21:11:44 -0500
Subject: [Beowulf] scalability
In-Reply-To: <428810f20912091722lca5d633g45b3feee7543bbf@mail.gmail.com>
References: <428810f20912091722lca5d633g45b3feee7543bbf@mail.gmail.com>
Message-ID: <4B2058E0.9050602@ldeo.columbia.edu>

Hi Amjad

There is relatively inexpensive Infiniband SDR:
http://www.colfaxdirect.com/store/pc/showsearchresults.asp?customfield=5&SearchValues=65
http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=12
http://www.colfaxdirect.com/store/pc/viewCategories.asp?SFID=12&SFNAME=Brand&SFVID=50&SFVALUE=Mellanox&SFCount=0&page=0&pageStyle=m&idcategory=2&VS12=0&VS9=0&VS10=0&VS4=0&VS3=0&VS11=0
Not the latest greatest, but faster than Gigabit Ethernet.
A better Gigabit Ethernet switch may help also,
but I wonder if the impact will be as big as expected.

However, are you sure the scalability problems you see are
due to poor network connection?
Could it be perhaps related to the code itself,
or maybe to the processors' memory bandwidth?

You could test if it is network running the program inside a node
(say on 4 cores) and across 4 nodes with
one core in use on each node, or other combinations
(2 cores on 2 nodes).

You could have an indication of the processors' scalability
by timing program runs inside a single node using 1,2,3,4 cores.

My experience with dual socket dual core Xeons vs.
dual socket dual core Opterons,
with the type of code we run here (ocean,atmosphere,climate models,
which are not totally far from your CFD) is that Opterons
scale close to linear, but Xeons get nearly stuck in terms of scaling
when there are more than 2 processes (3 or 4) running in a single node.

My two cents.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


amjad ali wrote:
> Hi all,
> 
> I have, with my group, a small cluster of about 16 nodes (each one with 
> single socket Xeon 3085 or 3110; And I face problem of poor scalability. 
> Its network is quite ordinary GiGE (perhaps DLink DGS-1024D 24-Port 
> 10/100/1000), store and forward switch, of price about $250 only.
> ftp://ftp10.dlink.com/pdfs/products/DGS-1024D/DGS-1024D_ds.pdf
> 
> How should I work on that for better scalability?
> 
> What could be better affordable options of fast switches? (Myrinet, 
> Infiniband are quite costly).
> 
> When buying a switch what should we see in it? What latency?
> 
> 
> Thank you very much.
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


From amjad11 at gmail.com  Wed Dec  9 19:14:23 2009
From: amjad11 at gmail.com (amjad ali)
Date: Wed, 9 Dec 2009 22:14:23 -0500
Subject: [Beowulf] scalability
In-Reply-To: <4B2058E0.9050602@ldeo.columbia.edu>
References: <428810f20912091722lca5d633g45b3feee7543bbf@mail.gmail.com>
	<4B2058E0.9050602@ldeo.columbia.edu>
Message-ID: <428810f20912091914r4955515wa53f12611a848af6@mail.gmail.com>

Hi Gus,
your nice reply; as usual.

I ran my code on single socket xeon node having two cores; It ran linear
97+% efficient.

Then I ran my code on single socket xeon node having four cores ( Xeon 3220
-which really not a good quad core) I got the efficiency of around 85%.

But on four single socket nodes I ran 4 processes  (1 process on each node);
I got the efficiency of around 62%.

Yes, CFD codes are memory bandwidth bound usually.

Thank you very much.


run with 2core
On Wed, Dec 9, 2009 at 9:11 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:

> Hi Amjad
>
> There is relatively inexpensive Infiniband SDR:
>
> http://www.colfaxdirect.com/store/pc/showsearchresults.asp?customfield=5&SearchValues=65
> http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=12
>
> http://www.colfaxdirect.com/store/pc/viewCategories.asp?SFID=12&SFNAME=Brand&SFVID=50&SFVALUE=Mellanox&SFCount=0&page=0&pageStyle=m&idcategory=2&VS12=0&VS9=0&VS10=0&VS4=0&VS3=0&VS11=0
> Not the latest greatest, but faster than Gigabit Ethernet.
> A better Gigabit Ethernet switch may help also,
> but I wonder if the impact will be as big as expected.
>
> However, are you sure the scalability problems you see are
> due to poor network connection?
> Could it be perhaps related to the code itself,
> or maybe to the processors' memory bandwidth?
>
> You could test if it is network running the program inside a node
> (say on 4 cores) and across 4 nodes with
> one core in use on each node, or other combinations
> (2 cores on 2 nodes).
>
> You could have an indication of the processors' scalability
> by timing program runs inside a single node using 1,2,3,4 cores.
>
> My experience with dual socket dual core Xeons vs.
> dual socket dual core Opterons,
> with the type of code we run here (ocean,atmosphere,climate models,
> which are not totally far from your CFD) is that Opterons
> scale close to linear, but Xeons get nearly stuck in terms of scaling
> when there are more than 2 processes (3 or 4) running in a single node.
>
> My two cents.
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
>
> amjad ali wrote:
>
>> Hi all,
>>
>> I have, with my group, a small cluster of about 16 nodes (each one with
>> single socket Xeon 3085 or 3110; And I face problem of poor scalability. Its
>> network is quite ordinary GiGE (perhaps DLink DGS-1024D 24-Port
>> 10/100/1000), store and forward switch, of price about $250 only.
>> ftp://ftp10.dlink.com/pdfs/products/DGS-1024D/DGS-1024D_ds.pdf
>>
>> How should I work on that for better scalability?
>>
>> What could be better affordable options of fast switches? (Myrinet,
>> Infiniband are quite costly).
>>
>> When buying a switch what should we see in it? What latency?
>>
>>
>> Thank you very much.
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091209/d1c71bd8/attachment.html>

From toon.knapen at gmail.com  Wed Dec  9 23:20:34 2009
From: toon.knapen at gmail.com (Toon Knapen)
Date: Thu, 10 Dec 2009 08:20:34 +0100
Subject: [Beowulf] cluster sharing
In-Reply-To: <428810f20912091746k74184ef4n125c2b5ada94606a@mail.gmail.com>
References: <428810f20912091746k74184ef4n125c2b5ada94606a@mail.gmail.com>
Message-ID: <d5bdff000912092320m724930d1xf581b6f6198b5cc1@mail.gmail.com>

Hello Amjad,

You need exclusive access to (part of) the cluster. Generally a batch
scheduler is available that will schedule the jobs of multiple users on
generally a FIFO basis, guaranteeing exclusive access and thus optimal
performance.


On Thu, Dec 10, 2009 at 2:46 AM, amjad ali <amjad11 at gmail.com> wrote:

> Hi all,
>
> I am usually running my parallel jobs on the university cluster of several
> hundred nodes (accessible to all). So often I observe very different total
> MPI_Wtime value when I run my program (even of same problem size) on
> different occasion.
>
> How should I make a reasonable performance measurement in such a case? Is
> it fine to get speedup/measurements on such a shared cluster?
>
>
> Thanks.
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091210/0076305e/attachment.html>

From deadline at eadline.org  Thu Dec 10 05:45:30 2009
From: deadline at eadline.org (Douglas Eadline)
Date: Thu, 10 Dec 2009 08:45:30 -0500 (EST)
Subject: [Beowulf] scalability
In-Reply-To: <428810f20912091722lca5d633g45b3feee7543bbf@mail.gmail.com>
References: <428810f20912091722lca5d633g45b3feee7543bbf@mail.gmail.com>
Message-ID: <51346.192.168.1.1.1260452730.squirrel@mail.eadline.org>


The performance of GigE can vary widely based on several
issues:

 - chipset
 - driver
 - driver settings (interrupt coalescing etc.)
 - switch performance under load
 - switch manufacturer

A well tuned GigE network is never as good as
IB or Myrinet, but it can work well for some codes.
You can also try using Open-MX instead of TCP
for MPI communications.

--
Doug

> Hi all,
>
> I have, with my group, a small cluster of about 16 nodes (each one with
> single socket Xeon 3085 or 3110; And I face problem of poor scalability.
> Its
> network is quite ordinary GiGE (perhaps DLink DGS-1024D 24-Port
> 10/100/1000), store and forward switch, of price about $250 only.
> ftp://ftp10.dlink.com/pdfs/products/DGS-1024D/DGS-1024D_ds.pdf
>
> How should I work on that for better scalability?
>
> What could be better affordable options of fast switches? (Myrinet,
> Infiniband are quite costly).
>
> When buying a switch what should we see in it? What latency?
>
>
> Thank you very much.
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>


--
Doug


From gus at ldeo.columbia.edu  Thu Dec 10 10:43:07 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Thu, 10 Dec 2009 13:43:07 -0500
Subject: [Beowulf] scalability
In-Reply-To: <428810f20912091914r4955515wa53f12611a848af6@mail.gmail.com>
References: <428810f20912091722lca5d633g45b3feee7543bbf@mail.gmail.com>	
	<4B2058E0.9050602@ldeo.columbia.edu>
	<428810f20912091914r4955515wa53f12611a848af6@mail.gmail.com>
Message-ID: <4B21413B.7020000@ldeo.columbia.edu>

Hi Amjad

amjad ali wrote:
> Hi Gus,
> your nice reply; as usual.
> 
> I ran my code on single socket xeon node having two cores; It ran linear 
> 97+% efficient.
> 
> Then I ran my code on single socket xeon node having four cores ( Xeon 
> 3220 -which really not a good quad core) I got the efficiency of around 85%.
> 
> But on four single socket nodes I ran 4 processes  (1 process on each 
> node); I got the efficiency of around 62%.
> 

This is about the same number I've got for an atmospheric model
in a dual-socket dual-core Xeon computer.
Somehow the memory path/bus on these systems is not very efficient,
and saturates when more than two processes do
intensive work concurrently.
A similar computer configuration with dual- dual-
AMD Opterons performed significantly better on the same atmospheric
code (efficiency close to 90%).

I was told that some people used to run two processes only on
dual-socket dual-core Xeon nodes , leaving the other two cores idle.
Although it is an apparent waste, the argument was that it paid
off in terms of overall efficiency.

Have you tried to run your programs this way on your cluster?
Say, with one process only per node, and N nodes,
then with two processes per node, and N/2 nodes,
then with four processes per node, and N/4 nodes.
This may tell what is optimal for the hardware you have.
With OpenMPI you can use the "mpiexec" flags
"-bynode" and "-byslot" to control this behavior.
"man mpiexec" is your friend!  :)

> Yes, CFD codes are memory bandwidth bound usually.
>

Indeed, and so is most of our atmosphere/ocean/climate codes,
which has a lot of CFD, but also radiative processes, mixing,
thermodynamics, etc.
However, most of our models use fixed grids, and I suppose
some of your aerodynamics may use adaptive meshes, right?
I guess you are doing aerodynamics, right?

> Thank you very much.
> 

My pleasure.


I hope this helps,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


> 
> 
> 
> 
> run with 2core
> On Wed, Dec 9, 2009 at 9:11 PM, Gus Correa <gus at ldeo.columbia.edu 
> <mailto:gus at ldeo.columbia.edu>> wrote:
> 
>     Hi Amjad
> 
>     There is relatively inexpensive Infiniband SDR:
>     http://www.colfaxdirect.com/store/pc/showsearchresults.asp?customfield=5&SearchValues=65
>     <http://www.colfaxdirect.com/store/pc/showsearchresults.asp?customfield=5&SearchValues=65>
>     http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=12
>     http://www.colfaxdirect.com/store/pc/viewCategories.asp?SFID=12&SFNAME=Brand&SFVID=50&SFVALUE=Mellanox&SFCount=0&page=0&pageStyle=m&idcategory=2&VS12=0&VS9=0&VS10=0&VS4=0&VS3=0&VS11=0
>     <http://www.colfaxdirect.com/store/pc/viewCategories.asp?SFID=12&SFNAME=Brand&SFVID=50&SFVALUE=Mellanox&SFCount=0&page=0&pageStyle=m&idcategory=2&VS12=0&VS9=0&VS10=0&VS4=0&VS3=0&VS11=0>
>     Not the latest greatest, but faster than Gigabit Ethernet.
>     A better Gigabit Ethernet switch may help also,
>     but I wonder if the impact will be as big as expected.
> 
>     However, are you sure the scalability problems you see are
>     due to poor network connection?
>     Could it be perhaps related to the code itself,
>     or maybe to the processors' memory bandwidth?
> 
>     You could test if it is network running the program inside a node
>     (say on 4 cores) and across 4 nodes with
>     one core in use on each node, or other combinations
>     (2 cores on 2 nodes).
> 
>     You could have an indication of the processors' scalability
>     by timing program runs inside a single node using 1,2,3,4 cores.
> 
>     My experience with dual socket dual core Xeons vs.
>     dual socket dual core Opterons,
>     with the type of code we run here (ocean,atmosphere,climate models,
>     which are not totally far from your CFD) is that Opterons
>     scale close to linear, but Xeons get nearly stuck in terms of scaling
>     when there are more than 2 processes (3 or 4) running in a single node.
> 
>     My two cents.
>     Gus Correa
>     ---------------------------------------------------------------------
>     Gustavo Correa
>     Lamont-Doherty Earth Observatory - Columbia University
>     Palisades, NY, 10964-8000 - USA
>     ---------------------------------------------------------------------
> 
> 
>     amjad ali wrote:
> 
>         Hi all,
> 
>         I have, with my group, a small cluster of about 16 nodes (each
>         one with single socket Xeon 3085 or 3110; And I face problem of
>         poor scalability. Its network is quite ordinary GiGE (perhaps
>         DLink DGS-1024D 24-Port 10/100/1000), store and forward switch,
>         of price about $250 only.
>         ftp://ftp10.dlink.com/pdfs/products/DGS-1024D/DGS-1024D_ds.pdf
> 
>         How should I work on that for better scalability?
> 
>         What could be better affordable options of fast switches?
>         (Myrinet, Infiniband are quite costly).
> 
>         When buying a switch what should we see in it? What latency?
> 
> 
>         Thank you very much.
> 
> 
>         ------------------------------------------------------------------------
> 
>         _______________________________________________
>         Beowulf mailing list, Beowulf at beowulf.org
>         <mailto:Beowulf at beowulf.org> sponsored by Penguin Computing
>         To change your subscription (digest mode or unsubscribe) visit
>         http://www.beowulf.org/mailman/listinfo/beowulf
> 
> 
>     _______________________________________________
>     Beowulf mailing list, Beowulf at beowulf.org
>     <mailto:Beowulf at beowulf.org> sponsored by Penguin Computing
>     To change your subscription (digest mode or unsubscribe) visit
>     http://www.beowulf.org/mailman/listinfo/beowulf
> 
> 


From jorg.sassmannshausen at strath.ac.uk  Thu Dec 10 06:56:29 2009
From: jorg.sassmannshausen at strath.ac.uk (=?iso-8859-1?q?J=F6rg_Sa=DFmannshausen?=)
Date: Thu, 10 Dec 2009 14:56:29 +0000
Subject: [Beowulf] Re: scalability
In-Reply-To: <200912101350.nBADnxv1012984@bluewest.scyld.com>
References: <200912101350.nBADnxv1012984@bluewest.scyld.com>
Message-ID: <200912101456.29738.jorg.sassmannshausen@strath.ac.uk>

Hi Doug,

I have heard of Open-MX before, do you need special hardware for that? We are 
currently using one GB network here on the cluster (for everything: NFS, 
MPI...) and I would like to increase the performance for the parallel codes I 
am using (NWChem, cp2k, GAMESS) without dishing out too much money for IB or 
Myrinet as the cluster is too small for them (22 nodes right now).

All the best

J?rg

On Thursday 10 December 2009 13:50:51 beowulf-request at beowulf.org wrote:
> Date: Thu, 10 Dec 2009 08:45:30 -0500 (EST)
> From: "Douglas Eadline" <deadline at eadline.org>
> Subject: Re: [Beowulf] scalability
> To: "amjad ali" <amjad11 at gmail.com>
> Cc: Beowulf Mailing List <beowulf at beowulf.org>
> Message-ID: <51346.192.168.1.1.1260452730.squirrel at mail.eadline.org>
> Content-Type: text/plain;charset=iso-8859-1
>
>
> The performance of GigE can vary widely based on several
> issues:
>
> ?- chipset
> ?- driver
> ?- driver settings (interrupt coalescing etc.)
> ?- switch performance under load
> ?- switch manufacturer
>
> A well tuned GigE network is never as good as
> IB or Myrinet, but it can work well for some codes.
> You can also try using Open-MX instead of TCP
> for MPI communications.
>
> --
> Doug

-- 
*************************************************************
J?rg Sa?mannshausen
Research Fellow
University of Strathclyde
Department of Pure and Applied Chemistry
295 Cathedral St.
Glasgow
G1 1XL

email: jorg.sassmannshausen at strath.ac.uk
web: http://sassy.formativ.net

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html


From atchley at myri.com  Thu Dec 10 11:47:06 2009
From: atchley at myri.com (Scott Atchley)
Date: Thu, 10 Dec 2009 14:47:06 -0500
Subject: [Beowulf] Re: scalability
In-Reply-To: <200912101456.29738.jorg.sassmannshausen@strath.ac.uk>
References: <200912101350.nBADnxv1012984@bluewest.scyld.com>
	<200912101456.29738.jorg.sassmannshausen@strath.ac.uk>
Message-ID: <4881CFFF-C36E-4AD6-BEDC-300474AD44C9@myri.com>

On Dec 10, 2009, at 9:56 AM, J?rg Sa?mannshausen wrote:

> I have heard of Open-MX before, do you need special hardware for that?

No, any Ethernet driver on Linux.

http://open-mx.org

Scott


From deadline at eadline.org  Thu Dec 10 11:53:58 2009
From: deadline at eadline.org (Douglas Eadline)
Date: Thu, 10 Dec 2009 14:53:58 -0500 (EST)
Subject: [Beowulf] Re: scalability
In-Reply-To: <200912101456.29738.jorg.sassmannshausen@strath.ac.uk>
References: <200912101350.nBADnxv1012984@bluewest.scyld.com>
	<200912101456.29738.jorg.sassmannshausen@strath.ac.uk>
Message-ID: <54300.192.168.1.1.1260474838.squirrel@mail.eadline.org>


> Hi Doug,
>
> I have heard of Open-MX before, do you need special hardware for that? We
> are
> currently using one GB network here on the cluster (for everything: NFS,
> MPI...) and I would like to increase the performance for the parallel
> codes I
> am using (NWChem, cp2k, GAMESS) without dishing out too much money for IB
> or
> Myrinet as the cluster is too small for them (22 nodes right now).
>
> All the best

Open-MX will work over any Ethernet connection. If you have
a recent kernel, it should work in an incremental fashion,
that is, you can install it, build any MPI that supports
MX and run it. It will co-exist just fine with TCP/IP
traffic.

Check the home page: http://open-mx.gforge.inria.fr/

I think the authors read this list.

--
Doug

>
> J?rg
>
> On Thursday 10 December 2009 13:50:51 beowulf-request at beowulf.org wrote:
>> Date: Thu, 10 Dec 2009 08:45:30 -0500 (EST)
>> From: "Douglas Eadline" <deadline at eadline.org>
>> Subject: Re: [Beowulf] scalability
>> To: "amjad ali" <amjad11 at gmail.com>
>> Cc: Beowulf Mailing List <beowulf at beowulf.org>
>> Message-ID: <51346.192.168.1.1.1260452730.squirrel at mail.eadline.org>
>> Content-Type: text/plain;charset=iso-8859-1
>>
>>
>> The performance of GigE can vary widely based on several
>> issues:
>>
>> ?- chipset
>> ?- driver
>> ?- driver settings (interrupt coalescing etc.)
>> ?- switch performance under load
>> ?- switch manufacturer
>>
>> A well tuned GigE network is never as good as
>> IB or Myrinet, but it can work well for some codes.
>> You can also try using Open-MX instead of TCP
>> for MPI communications.
>>
>> --
>> Doug
>
> --
> *************************************************************
> J?rg Sa?mannshausen
> Research Fellow
> University of Strathclyde
> Department of Pure and Applied Chemistry
> 295 Cathedral St.
> Glasgow
> G1 1XL
>
> email: jorg.sassmannshausen at strath.ac.uk
> web: http://sassy.formativ.net
>
> Please avoid sending me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.html
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>


--
Doug


From bernard at vanhpc.org  Thu Dec 10 12:20:21 2009
From: bernard at vanhpc.org (Bernard Li)
Date: Thu, 10 Dec 2009 12:20:21 -0800
Subject: [Beowulf] Sony PS3, random news
In-Reply-To: <d2300d1c0912091019h10654b02wf2e847afbdc39a6c@mail.gmail.com>
References: <d2300d1c0912091019h10654b02wf2e847afbdc39a6c@mail.gmail.com>
Message-ID: <d4c731da0912101220w6def1dd3oa2c31efd0f3dbeaf@mail.gmail.com>

Curious what software they use to provision these PS3s ;-)

Cheers,

Bernard

On Wed, Dec 9, 2009 at 10:19 AM, Jeremy Baker <jellogum at gmail.com> wrote:
> DoD buys PS3 for HPC.
>
> CNN brief at
>
> http://scitech.blogs.cnn.com/2009/12/09/military-purchases-2200-ps3s/
>
> Clip from report:
>
> ?"Though a single 3.2 GHz cell processor can deliver over 200 GFLOPS,
> whereas the Sony PS3 configuration delivers approximately 150 GFLOPS, the
> approximately tenfold cost difference per GFLOP makes the Sony PS3 the only
> viable technology for HPC applications."
>
>
> --
> Jeremy Baker
> PO 297
> Johnson, VT
> 05656
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>


From gerry.creager at tamu.edu  Thu Dec 10 13:22:10 2009
From: gerry.creager at tamu.edu (Gerald Creager)
Date: Thu, 10 Dec 2009 15:22:10 -0600
Subject: [Beowulf] Sony PS3, random news
In-Reply-To: <d4c731da0912101220w6def1dd3oa2c31efd0f3dbeaf@mail.gmail.com>
References: <d2300d1c0912091019h10654b02wf2e847afbdc39a6c@mail.gmail.com>
	<d4c731da0912101220w6def1dd3oa2c31efd0f3dbeaf@mail.gmail.com>
Message-ID: <4B216682.7060506@tamu.edu>

Grand Theft Auto?  More'n'likely, YellowDog Linux, which is what the 
folks in Colorado adapted, under contract with Sony, to generate for the 
PS3.

gerry

Bernard Li wrote:
> Curious what software they use to provision these PS3s ;-)
> 
> Cheers,
> 
> Bernard
> 
> On Wed, Dec 9, 2009 at 10:19 AM, Jeremy Baker <jellogum at gmail.com> wrote:
>> DoD buys PS3 for HPC.
>>
>> CNN brief at
>>
>> http://scitech.blogs.cnn.com/2009/12/09/military-purchases-2200-ps3s/
>>
>> Clip from report:
>>
>>  "Though a single 3.2 GHz cell processor can deliver over 200 GFLOPS,
>> whereas the Sony PS3 configuration delivers approximately 150 GFLOPS, the
>> approximately tenfold cost difference per GFLOP makes the Sony PS3 the only
>> viable technology for HPC applications."
>>
>>
>> --
>> Jeremy Baker
>> PO 297
>> Johnson, VT
>> 05656
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>>
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


From csamuel at vpac.org  Thu Dec 10 16:28:05 2009
From: csamuel at vpac.org (Chris Samuel)
Date: Fri, 11 Dec 2009 11:28:05 +1100 (EST)
Subject: [Beowulf] cluster sharing
In-Reply-To: <428810f20912091746k74184ef4n125c2b5ada94606a@mail.gmail.com>
Message-ID: <1433546203.7750671260491285873.JavaMail.root@mail.vpac.org>


----- "amjad ali" <amjad11 at gmail.com> wrote:

> How should I make a reasonable performance measurement
> in such a case?

I would suggest that you request entire nodes
through the batch scheduler. So if you are using
Torque (PBS) for instance and wanted to run an
80 CPU job on dual socket quad core nodes you
would request:

#PBS -l nodes=10:ppn=8

The scheduler would then only allocate you nodes
that were not used by other people.

All the best,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


From csamuel at vpac.org  Thu Dec 10 16:33:46 2009
From: csamuel at vpac.org (Chris Samuel)
Date: Fri, 11 Dec 2009 11:33:46 +1100 (EST)
Subject: [Beowulf] scalability
In-Reply-To: <4B21413B.7020000@ldeo.columbia.edu>
Message-ID: <236624678.7750761260491626840.JavaMail.root@mail.vpac.org>


----- "Gus Correa" <gus at ldeo.columbia.edu> wrote:

> This is about the same number I've got for an atmospheric model
> in a dual-socket dual-core Xeon computer.
> Somehow the memory path/bus on these systems is not very efficient,
> and saturates when more than two processes do
> intensive work concurrently.
> A similar computer configuration with dual- dual-
> AMD Opterons performed significantly better on the same atmospheric
> code (efficiency close to 90%).

The issue is that there are Xeon's and then there
are Xeon's.

The older Woodcrest/Clovertown type CPUs had the
standard Intel bottleneck of a single memory
controller for both sockets.

The newer Nehalem Xeon's have HyperTransport^W QPI
which involves each socket having its own memory
controller with connections to local RAM.

This is essentially what AMD have been doing with
Opteron for years and why they've traditionally
done better than Intel with memory intensive codes.

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


From bernard at vanhpc.org  Fri Dec 11 09:26:41 2009
From: bernard at vanhpc.org (Bernard Li)
Date: Fri, 11 Dec 2009 09:26:41 -0800
Subject: [Beowulf] Sony PS3, random news
In-Reply-To: <4B216682.7060506@tamu.edu>
References: <d2300d1c0912091019h10654b02wf2e847afbdc39a6c@mail.gmail.com>
	<d4c731da0912101220w6def1dd3oa2c31efd0f3dbeaf@mail.gmail.com>
	<4B216682.7060506@tamu.edu>
Message-ID: <d4c731da0912110926w21e7f052h81028062ecdcf58c@mail.gmail.com>

Hi Gerry:

On Thu, Dec 10, 2009 at 1:22 PM, Gerald Creager <gerry.creager at tamu.edu> wrote:

> Grand Theft Auto? ?More'n'likely, YellowDog Linux, which is what the folks
> in Colorado adapted, under contract with Sony, to generate for the PS3.

:-)

Actually by now you could install pretty much any Linux flavour on the
PS3.  I was talking specifically about what system software they used
to provision the 2000+ PS3s.  You would still need to manually switch
each PS3 to boot from "otheros", but beyond that it would be nice to
have an automated system of provisioning the OS...

Cheers,

Bernard


From gus at ldeo.columbia.edu  Fri Dec 11 10:25:34 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Fri, 11 Dec 2009 13:25:34 -0500
Subject: [Beowulf] scalability
In-Reply-To: <236624678.7750761260491626840.JavaMail.root@mail.vpac.org>
References: <236624678.7750761260491626840.JavaMail.root@mail.vpac.org>
Message-ID: <4B228E9E.8030407@ldeo.columbia.edu>

Hi Chris

Chris Samuel wrote:
> ----- "Gus Correa" <gus at ldeo.columbia.edu> wrote:
> 
>> This is about the same number I've got for an atmospheric model
>> in a dual-socket dual-core Xeon computer.
>> Somehow the memory path/bus on these systems is not very efficient,
>> and saturates when more than two processes do
>> intensive work concurrently.
>> A similar computer configuration with dual- dual-
>> AMD Opterons performed significantly better on the same atmospheric
>> code (efficiency close to 90%).
> 
> The issue is that there are Xeon's and then there
> are Xeon's.
> 
> The older Woodcrest/Clovertown type CPUs had the
> standard Intel bottleneck of a single memory
> controller for both sockets.
> 

Yes, that is for fact, but didn't the
Harpertown generation still have a similar problem?

Amjad's Xeon small cluster machines are dual socket dual core,
perhaps a bit older than the type I had used here
(Intel Xeon 5160 3.00GHz) in standalone workstations
and tested an atmosphere model with the efficiency
numbers I mentioned above.
According to Amjad:

"I have, with my group, a small cluster of about 16 nodes
(each one with single socket Xeon 3085 or 3110;
And I face problem of poor scalability. "

I lost track of the Intel number/naming convention.
Are Amjad's and mine Woodcrest?
Clovertown?
Harpertown?

> The newer Nehalem Xeon's have HyperTransport^W QPI
> which involves each socket having its own memory
> controller with connections to local RAM.

That has been widely reported, at least in SPEC2000 type of tests.
Unfortunately I don't have any Nehalem to play with our codes.
However, please take a look at the ongoing discussion on the OpenMPI
list about memory issues with Nehalem
(perhaps combined with later versions of GCC) on MPI programs:

http://www.open-mpi.org/community/lists/users/2009/12/11462.php
http://www.open-mpi.org/community/lists/users/2009/12/11499.php
http://www.open-mpi.org/community/lists/users/2009/12/11500.php
http://www.open-mpi.org/community/lists/users/2009/12/11516.php
http://www.open-mpi.org/community/lists/users/2009/12/11515.php

> 
> This is essentially what AMD have been doing with
> Opteron for years and why they've traditionally
> done better than Intel with memory intensive codes.
> 

Yes, and we're happy with their performance, memory bandwidth
and scalability on the codes we run (mostly ocean/atmosphere/climate).
Steady workhorses.

Not advocating any manufacturer's cause,
just telling our experience.

> cheers,
> Chris

Cheers,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


From tom.elken at qlogic.com  Fri Dec 11 11:18:11 2009
From: tom.elken at qlogic.com (Tom Elken)
Date: Fri, 11 Dec 2009 11:18:11 -0800
Subject: [Beowulf] scalability
In-Reply-To: <4B228E9E.8030407@ldeo.columbia.edu>
References: <236624678.7750761260491626840.JavaMail.root@mail.vpac.org>
	<4B228E9E.8030407@ldeo.columbia.edu>
Message-ID: <35AAF1E4A771E142979F27B51793A48887030F1FE7@AVEXMB1.qlogic.org>

> > The older Woodcrest/Clovertown type CPUs had the
> > standard Intel bottleneck of a single memory
> > controller for both sockets.
> 
> Yes, that is for fact, but didn't the
> Harpertown generation still have a similar problem?

Yes.

> 
> Amjad's Xeon small cluster machines are dual socket dual core,
> perhaps a bit older than the type I had used here
> (Intel Xeon 5160 3.00GHz) in standalone workstations
...
> According to Amjad:
> "I have, with my group, a small cluster of about 16 nodes
> (each one with single socket Xeon 3085 or 3110;
> And I face problem of poor scalability. "
> 
> I lost track of the Intel number/naming convention.
> Are Amjad's and mine Woodcrest?

Yours (Xeon 5160) are Woodcrest.
Amjad's are Conroe (Xeon 3085) and  Wolfdale (Xeon 3110).  

But they all appear to be very similar.  A good reference is:
http://en.wikipedia.org/wiki/Xeon#3000-series_.22Conroe.22
and further down the page.
Microarchitecturally they are probably virtually identical.  Main differences  differences in 
- L2 size (Wolfdale (Xeon 3110) had 6MB, and the others 4MB, all shared between 2 cores),
- power dissipation, 
- process size, and 
- max # of CPUs in a system.

-Tom

> Clovertown?
> Harpertown?
> 
> > The newer Nehalem Xeon's have HyperTransport^W QPI
> > which involves each socket having its own memory
> > controller with connections to local RAM.
> 


From gerry.creager at tamu.edu  Fri Dec 11 11:28:40 2009
From: gerry.creager at tamu.edu (Gerald Creager)
Date: Fri, 11 Dec 2009 13:28:40 -0600
Subject: [Beowulf] scalability
In-Reply-To: <4B228E9E.8030407@ldeo.columbia.edu>
References: <236624678.7750761260491626840.JavaMail.root@mail.vpac.org>
	<4B228E9E.8030407@ldeo.columbia.edu>
Message-ID: <4B229D68.4060204@tamu.edu>

Howdy!

Gus Correa wrote:
> Hi Chris
> 
> Chris Samuel wrote:
>> ----- "Gus Correa" <gus at ldeo.columbia.edu> wrote:
>>
>>> This is about the same number I've got for an atmospheric model
>>> in a dual-socket dual-core Xeon computer.
>>> Somehow the memory path/bus on these systems is not very efficient,
>>> and saturates when more than two processes do
>>> intensive work concurrently.
>>> A similar computer configuration with dual- dual-
>>> AMD Opterons performed significantly better on the same atmospheric
>>> code (efficiency close to 90%).
>>
>> The issue is that there are Xeon's and then there
>> are Xeon's.
>>
>> The older Woodcrest/Clovertown type CPUs had the
>> standard Intel bottleneck of a single memory
>> controller for both sockets.
>>
> 
> Yes, that is for fact, but didn't the
> Harpertown generation still have a similar problem?

Yes.

> Amjad's Xeon small cluster machines are dual socket dual core,
> perhaps a bit older than the type I had used here
> (Intel Xeon 5160 3.00GHz) in standalone workstations
> and tested an atmosphere model with the efficiency
> numbers I mentioned above.
> According to Amjad:
> 
> "I have, with my group, a small cluster of about 16 nodes
> (each one with single socket Xeon 3085 or 3110;
> And I face problem of poor scalability. "

What's the application?  WRF may fail to scale even at modest numbers of 
cores if the domain size is sufficiently large.

IS this a NWP code? (sorry for coming in late on the discussion... I've 
been hacking on RWFv3.1.1 with openMPI and PGI and seeing some 
interesting problems.

gerry


> I lost track of the Intel number/naming convention.
> Are Amjad's and mine Woodcrest?
> Clovertown?
> Harpertown?
> 
>> The newer Nehalem Xeon's have HyperTransport^W QPI
>> which involves each socket having its own memory
>> controller with connections to local RAM.
> 
> That has been widely reported, at least in SPEC2000 type of tests.
> Unfortunately I don't have any Nehalem to play with our codes.
> However, please take a look at the ongoing discussion on the OpenMPI
> list about memory issues with Nehalem
> (perhaps combined with later versions of GCC) on MPI programs:
> 
> http://www.open-mpi.org/community/lists/users/2009/12/11462.php
> http://www.open-mpi.org/community/lists/users/2009/12/11499.php
> http://www.open-mpi.org/community/lists/users/2009/12/11500.php
> http://www.open-mpi.org/community/lists/users/2009/12/11516.php
> http://www.open-mpi.org/community/lists/users/2009/12/11515.php
> 
>>
>> This is essentially what AMD have been doing with
>> Opteron for years and why they've traditionally
>> done better than Intel with memory intensive codes.
>>
> 
> Yes, and we're happy with their performance, memory bandwidth
> and scalability on the codes we run (mostly ocean/atmosphere/climate).
> Steady workhorses.
> 
> Not advocating any manufacturer's cause,
> just telling our experience.
> 
>> cheers,
>> Chris
> 
> Cheers,
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit 
> http://www.beowulf.org/mailman/listinfo/beowulf


From gus at ldeo.columbia.edu  Fri Dec 11 15:34:45 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Fri, 11 Dec 2009 18:34:45 -0500
Subject: [Beowulf] Re: scalability
In-Reply-To: <428810f20912101944w30e9fe36r64216533ac218d7d@mail.gmail.com>
References: <428810f20912091722lca5d633g45b3feee7543bbf@mail.gmail.com>	
	<4B2058E0.9050602@ldeo.columbia.edu>	
	<428810f20912091914r4955515wa53f12611a848af6@mail.gmail.com>	
	<4B21413B.7020000@ldeo.columbia.edu>
	<428810f20912101944w30e9fe36r64216533ac218d7d@mail.gmail.com>
Message-ID: <4B22D715.3050809@ldeo.columbia.edu>

Hi Amjad

amjad ali wrote:
> Hi Gus,
> 
>     I was told that some people used to run two processes only on
>     dual-socket dual-core Xeon nodes , leaving the other two cores idle.
>     Although it is an apparent waste, the argument was that it paid
>     off in terms of overall efficiency.
> 
> 
> I guess I fully agree with this.
> 
> 
>     Have you tried to run your programs this way on your cluster?
>     Say, with one process only per node, and N nodes,
>     then with two processes per node, and N/2 nodes,
>     then with four processes per node, and N/4 nodes.
>     This may tell what is optimal for the hardware you have.
>     With OpenMPI you can use the "mpiexec" flags
>     "-bynode" and "-byslot" to control this behavior.
>     "man mpiexec" is your friend!  :)
> 
> 
> does mpich also provide this? 
 > or it will be controlled by the scheduler??

A mixed answer.

I think you can do this with MPICH2, but it is not so
easy as it is with OpenMPI, depending on other things,
particularly the mpiexec that you use.

1) You can use Torque/Maui and request full nodes,
as Chris Samuel suggested to you in another thread.
E.g.:
#PBS -l nodes=10:ppn=8

This doesn't guarantee or requires that your
job will run on a single core per node,
but it ensures that nobody else
will be running anything there but you.
Hence, this is kind of a preliminary step.

2) If you use mpd and the native MPICH2 mpiexec to launch programs
compiled with MPICH2, you could control where the processes run
by using the "-machinefile" or the "-configfile"
option.
To take effect, you also need to tweak with the
contents of your "machinefile"/"configfile".
For instance, you could write a script to run inside the
PBS script, but before the mpiexec command, to read the PBS_NODES
file, and build the "machinefile"/"configfile"
with selected nodes.
That is more involved than with OpenMPI, but not too hard to do.
You must read "man mpiexec" to do this right.

3) If you use the OSC mpiexec 
(http://www.osc.edu/~djohnson/mpiexec/index.php)
to launch programs compiled with MPICH2,
you can use the "-pernode" option to run in a single core per node,
which is similar to OpenMPI "-bynode", and easy to do.

4) If you still use MPICH1, which is too old, unmaintained,
and troublesome, then upgrade to OpenMPI or to MPICH2,
and use the solutions proposed here and in previous emails.

> 
> But still if it is a shared cluster (as in my case) then the cores you
> left unbusy may be allocated to another process of another user by the 
> Batch scheduler. Right??

Unless you request full nodes, as Chris Samuel suggested:
#PBS -l nodes=10:ppn=8

However, beware that this greedy and wasteful  behavior
may drive your system administrator and
the other cluster users mad at you!        :)
Well, you can always justify it in the name of science, of course. ;)

> 
> 
> 
>         Yes, CFD codes are memory bandwidth bound usually.
> 
> 
>     Indeed, and so is most of our atmosphere/ocean/climate codes,
>     which has a lot of CFD, but also radiative processes, mixing,
>     thermodynamics, etc.
>     However, most of our models use fixed grids, and I suppose
>     some of your aerodynamics may use adaptive meshes, right?
>     I guess you are doing aerodynamics, right?
> 
> 
> Amazing!!
> but I would really love to know (infact, learn) which 
> factors/indications made you to guess so correctly.
> 

Google is not only your friend.
It is also *my* friend!  :)
Is the Amjad Ali Pasha listed here yourself or somebody else?

http://www.aero.iitb.ac.in/aero/people/students/phd.html

Dialog here is a two way road, a cooperative and open exchange.
My identity is stamped on my signature block in all messages,
no secret about it.
Why not yours?

> 
> I would offer you 6 cents.
> 2 cents --- you missed below.
> 2 extra.
> 2 cents for next email.
> 
> 
>         Thank you very much.
> 

My two Rupees. :)

Best,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

> 
>     My pleasure.
> 
> 
>     I hope this helps,
> 
>     Gus Correa
>     ---------------------------------------------------------------------
>     Gustavo Correa
>     Lamont-Doherty Earth Observatory - Columbia University
>     Palisades, NY, 10964-8000 - USA
>     ---------------------------------------------------------------------
> 


From rpnabar at gmail.com  Fri Dec 11 22:59:43 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Sat, 12 Dec 2009 00:59:43 -0600
Subject: [Beowulf] Performance tuning for Jumbo Frames
Message-ID: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>

I have seen a considerable performance boost for my codes by using
Jumbo Frames. But are there any systematic tools or strategies to
select the optimum MTU size? I have it set as 9000. (Of course, all
switiching hardware supports jumbo frames and no talking to the
external world required of the interfaces) Have you guys found
performance to be MTU sensitive?

Also, are there any switch side parameters that can affect the
performance of HPC codes? Specifically I was trying to run VASP which
is known to be latency sensitive. I have a 10 Gig E network with a
RDMA offload card and am getting average latencies (ping pong) using
rping of around 14 microsecs in the MPI tests. Is there a way to
figure out what percentage of this latency is in the switch and what
%age in the stack, cards and cables? Just trying to figure out which
are the battles one picks to fight.

Any tips?

-- 
Rahu


From hearnsj at googlemail.com  Sat Dec 12 00:45:56 2009
From: hearnsj at googlemail.com (John Hearns)
Date: Sat, 12 Dec 2009 08:45:56 +0000
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
Message-ID: <9f8092cc0912120045lb9b415fn9f8b3d9f6edca586@mail.gmail.com>

2009/12/12 Rahul Nabar <rpnabar at gmail.com>:
>  Is there a way to
> figure out what percentage of this latency is in the switch and what
> %age in the stack, cards and cables? Just trying to figure out which
> are the battles one picks to fight.

I would say take the switch out and do a direct point-to-point link
between two systems.
Is this possible with 10gig ethernet?


From rpnabar at gmail.com  Sat Dec 12 01:02:03 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Sat, 12 Dec 2009 03:02:03 -0600
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <9f8092cc0912120045lb9b415fn9f8b3d9f6edca586@mail.gmail.com>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
	<9f8092cc0912120045lb9b415fn9f8b3d9f6edca586@mail.gmail.com>
Message-ID: <c4d69730912120102x255f883fm4e12a8a2c38e958a@mail.gmail.com>

On Sat, Dec 12, 2009 at 2:45 AM, John Hearns <hearnsj at googlemail.com> wrote:
> 2009/12/12 Rahul Nabar <rpnabar at gmail.com>:
>> ?Is there a way to
>> figure out what percentage of this latency is in the switch and what
>> %age in the stack, cards and cables? Just trying to figure out which
>> are the battles one picks to fight.
>
> I would say take the switch out and do a direct point-to-point link
> between two systems.
> Is this possible with 10gig ethernet?

Yes! Great idea. I never thought of that. I'll try that.

Yes, I'm no expert, but as far I know a direct link is possible.

-- 
Rahul


From h-bugge at online.no  Sat Dec 12 03:27:48 2009
From: h-bugge at online.no (=?ISO-8859-1?Q?H=E5kon_Bugge?=)
Date: Sat, 12 Dec 2009 12:27:48 +0100
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
Message-ID: <83F599A7-0A85-4887-9178-0D98E31EB0F0@online.no>


On Dec 12, 2009, at 7:59 , Rahul Nabar wrote:

> I have seen a considerable performance boost for my codes by using
> Jumbo Frames. But are there any systematic tools or strategies to
> select the optimum MTU size? I have it set as 9000. (Of course, all
> switiching hardware supports jumbo frames and no talking to the
> external world required of the interfaces) Have you guys found
> performance to be MTU sensitive?

Once (i.e. several years ago) I tested an application on Gbe and found  
that around 1/3rd of the MTU was the optimum. But I guess YMMV.

H?kon


From richard.walsh at comcast.net  Sat Dec 12 05:39:30 2009
From: richard.walsh at comcast.net (richard.walsh at comcast.net)
Date: Sat, 12 Dec 2009 13:39:30 +0000 (UTC)
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <83F599A7-0A85-4887-9178-0D98E31EB0F0@online.no>
Message-ID: <1399280631.2109511260625170185.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>


>On Dec 12, 2009 H?kon Bugge wrote: 
> 
>On Dec 12, 2009, at 7:59 , Rahul Nabar wrote: 
> 
>> I have seen a considerable performance boost for my codes by using 
>> Jumbo Frames. But are there any systematic tools or strategies to 
>> select the optimum MTU size? I have it set as 9000. (Of course, all 
>> switiching hardware supports jumbo frames and no talking to the 
>> external world required of the interfaces) Have you guys found 
>> performance to be MTU sensitive? 
> 
>Once (i.e. several years ago) I tested an application on Gbe and found 
>that around 1/3rd of the MTU was the optimum. But I guess YMMV. 


I would seem that a larger MTU would help in at least two situations, 
clearly applications with very large messages, but also those that 
have transmission bursts of messages below the MTU that could 
take advantage of hardware coalescing. The common MTU of 1500 was 
not chosen arbitrarily, but was probably not tuned to serve the "average" 
HPC application. If one is going to run one application predominately, 
then that application might prefer the default or not. If one is running 
a suite of codes, the it will probably help some and hurt others and 
should be "weighted average" tuned. 


I think the first-order focal point for performance should be on writing 
good code while choosing the "right" MTU size is a third order concern. 
Here is a stupid question ... it is the >>Maximum<< Transmission Unit ... 
Right? You don't get it all the time ... how is the run-time value set 
and adjusted based on the quality of message traffic? Experts ... ?? 


rbw 

_______________________________________________ 
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing 
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091212/68cf0646/attachment.html>

From rpnabar at gmail.com  Sat Dec 12 07:54:23 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Sat, 12 Dec 2009 09:54:23 -0600
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <83F599A7-0A85-4887-9178-0D98E31EB0F0@online.no>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
	<83F599A7-0A85-4887-9178-0D98E31EB0F0@online.no>
Message-ID: <c4d69730912120754r6a85471em34f522cd72206e12@mail.gmail.com>

On Sat, Dec 12, 2009 at 5:27 AM, H?kon Bugge <h-bugge at online.no> wrote:
>

> Once (i.e. several years ago) I tested an application on Gbe and found that
> around 1/3rd of the MTU was the optimum. But I guess YMMV.

i.e MTU = 0.3 * 9000?

or 0.3 * 1500

Sorry, I was confused.

  --
Rahul


From patrick at myri.com  Sat Dec 12 08:40:49 2009
From: patrick at myri.com (Patrick Geoffray)
Date: Sat, 12 Dec 2009 11:40:49 -0500
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
Message-ID: <4B23C791.5070701@myri.com>

Rahul,

Rahul Nabar wrote:
> I have seen a considerable performance boost for my codes by using
> Jumbo Frames. But are there any systematic tools or strategies to
> select the optimum MTU size?

There is no optimal MTU size. This is the maximum payload you can fit in 
one packet, so there is no drawback to a bigger MTU. Actually, there is 
one in terms of wormhole switching, but switch contention is an issue 
happily ignored by most HPC users.

> external world required of the interfaces) Have you guys found
> performance to be MTU sensitive?

A large MTU means fewer packets for the same amount of data transfered.
In all stack processing, there is a per-packet overhead (decoding 
header, integrity, sequence number, etc) and a per-byte overhead (copy). 
A large MTU reduces the total per-packet overhead because there are less 
packets to process.

Most 10GE NIC have no problems reaching line rate at 1500 Bytes (the 
standard Ethernet MTU), the problem is the host OS stack (mainly TCP) 
where the per-packet overhead is important. One trick that all 10GE NICs 
worth their salt are doing these days is to fake a large MTU at the OS 
level, while keeping the wire MTU at 1500 Bytes (for compatibility). 
This is called TSO (Transmit Send Offload) and LRO (Large Receive 
Offload). The OS stack is using a virtual MTU of 64K and the NIC does 
segmentation/reassembly in hardware, sort of.

> Also, are there any switch side parameters that can affect the
> performance of HPC codes? Specifically I was trying to run VASP which
> is known to be latency sensitive.

A large MTU has little to no impact on latency.

> I have a 10 Gig E network with a
> RDMA offload card and am getting average latencies (ping pong) using
> rping of around 14 microsecs in the MPI tests.

It is most likely due to the switch. Try back-to-back to measure without 
it. I don't know what hardware you are using, but you can get close to 
10us latency over TCP with a standard 10GE NIC and interrupt coalescing 
disabled. With a NIC supporting OS-bypass (RDMA only make sense for 
bandwidth), you should get at least half that, ideally below 3us.

Patrick


From patrick at myri.com  Sat Dec 12 08:41:48 2009
From: patrick at myri.com (Patrick Geoffray)
Date: Sat, 12 Dec 2009 11:41:48 -0500
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <9f8092cc0912120045lb9b415fn9f8b3d9f6edca586@mail.gmail.com>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
	<9f8092cc0912120045lb9b415fn9f8b3d9f6edca586@mail.gmail.com>
Message-ID: <4B23C7CC.4060004@myri.com>

John Hearns wrote:
> I would say take the switch out and do a direct point-to-point link
> between two systems.
> Is this possible with 10gig ethernet?

Yes, no need for crossover cables with 10GE.

Patrick


From patrick at myri.com  Sat Dec 12 10:45:35 2009
From: patrick at myri.com (Patrick Geoffray)
Date: Sat, 12 Dec 2009 13:45:35 -0500
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <1399280631.2109511260625170185.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>
References: <1399280631.2109511260625170185.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>
Message-ID: <4B23E4CF.6010306@myri.com>

Hi Richard,

richard.walsh at comcast.net wrote:
> I would seem that a larger MTU would help in at least two situations,
> clearly applications with very large messages, but also those that
> have transmission bursts of messages below the MTU that could
> take advantage of hardware coalescing.

Such coalescing is typically not done in hardware. TCP with Nagle will 
coalesce small messages going to the same destination. However, Nagle 
trades latency for bandwidth, so latency junkies such as HPC apps often 
disable Nagle when using TCP. Some MPI implementations do coalesce small 
messages, mainly to look good on packet rate benchmarks.

> The common MTU of 1500 was not chosen arbitrarily,

Never underestimate what a standard committee can do.

Patrick


From hearnsj at googlemail.com  Sun Dec 13 02:23:42 2009
From: hearnsj at googlemail.com (John Hearns)
Date: Sun, 13 Dec 2009 10:23:42 +0000
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <4B240378.6020101@earlham.edu>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
	<9f8092cc0912120045lb9b415fn9f8b3d9f6edca586@mail.gmail.com>
	<4B23C7CC.4060004@myri.com> <4B240378.6020101@earlham.edu>
Message-ID: <9f8092cc0912130223r69c820bw60b2f47ca3281895@mail.gmail.com>

2009/12/12 Kevin Hunter <hunteke at earlham.edu>:

>
> I don't have access to all hardware (obviously), but it's been my
> experience that *any* NIC made in the last 6- years (10G, 1G, 100M) no
> longer needs crossover cables to do direct NIC-to-NIC communication.

10gig uses fibre cables, CX-4 copper or only very recently 1000-GBASE-T
so 'crossover' cables only applicable (or not) in the last case.


From csamuel at vpac.org  Sun Dec 13 18:26:04 2009
From: csamuel at vpac.org (Chris Samuel)
Date: Mon, 14 Dec 2009 13:26:04 +1100 (EST)
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <1269675451.7906811260757522831.JavaMail.root@mail.vpac.org>
Message-ID: <553360800.7906831260757564672.JavaMail.root@mail.vpac.org>


----- "Patrick Geoffray" <patrick at myri.com> wrote:

> richard.walsh at comcast.net wrote:
>
> > The common MTU of 1500 was not chosen arbitrarily,
> 
> Never underestimate what a standard committee can do.

Viz the compromise of a 48 byte payload (53 byte cell)
for ATM between the US and Europe..

http://en.wikipedia.org/wiki/Asynchronous_transfer_mode#Why_cells.3F

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


From rpnabar at gmail.com  Sun Dec 13 19:26:52 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Sun, 13 Dec 2009 21:26:52 -0600
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <4B23C791.5070701@myri.com>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
	<4B23C791.5070701@myri.com>
Message-ID: <c4d69730912131926g5c1f912at16471233b5f7b525@mail.gmail.com>

On Sat, Dec 12, 2009 at 10:40 AM, Patrick Geoffray <patrick at myri.com> wrote:
> Rahul,
>
> Rahul Nabar wrote:
>>
>> I have seen a considerable performance boost for my codes by using
>> Jumbo Frames. But are there any systematic tools or strategies to
>> select the optimum MTU size?
>
> There is no optimal MTU size. This is the maximum payload you can fit in one
> packet, so there is no drawback to a bigger MTU.

Thanks! So I could push it beyond 9000 as well? Reason is I've seen a
steady boost in performance so far. 1500 < 4000 < 9000.

Maybe my performance continues to increase beyond 9000 too?

-- 
Rahul


From bcostescu at gmail.com  Mon Dec 14 02:59:21 2009
From: bcostescu at gmail.com (Bogdan Costescu)
Date: Mon, 14 Dec 2009 11:59:21 +0100
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
Message-ID: <c609bc800912140259q1b2c92f2v41dcb4a017406889@mail.gmail.com>

On Sat, Dec 12, 2009 at 7:59 AM, Rahul Nabar <rpnabar at gmail.com> wrote:
> I have seen a considerable performance boost for my codes by using
> Jumbo Frames. But are there any systematic tools or strategies to
> select the optimum MTU size? I have it set as 9000.

I played with this as well several times and found variable results.
At one time, the memory allocation proved to be the limiting factor:
because the page size was 4K, a packet with a MTU smaller than that
would fit into one page, while a packet of 9000bytes would require 3
contiguous pages, making the search more time consuming; when plotting
the bandwidth vs. MTU it peaked at just below 4K, so an increased MTU
was beneficial compared with the default 1500bytes one, but only as
long as it fits in one page. At another time, the switch was more
likely to drop large frames under high load (maybe something to do
with internal memory management), so the 9000bytes frames worked most
of the time while the 1500bytes ones worked all the time... At yet
another time, the high interrupt load generated by the 1500bytes
fragments would make the computer unstable (probably an Athlon MP
based system, but memory is fuzzy), so larger frames and/or interrupt
coalescing was the only way to actually use that computer.

The MTU can be set to higher values than 9000bytes if all the
components involved support it - switch, network cards and driver. I
remember seeing 2-3 years ago for some network equipment a MTU of 16K
- but again the memory is fuzzy on what equipment that was - so it's
definitely possible to have it higher than 9000bytes. Usually setting
a too large MTU would be seen in bandwidth testing - if fragments
above a certain MTU are dropped or only partly transferred, there will
be retransmissions and the useful bandwidth will drop significantly
(and for a trained eye, the statistics of the network driver and stack
will provide clues as well).

> Also, are there any switch side parameters that can affect the
> performance of HPC codes?

Most (all ?) switches do their job in hardware, to arrive at wire
speed. There is usually nothing that can be set to affect the way the
engine works.

Cheers,
Bogdan


From Greg at keller.net  Mon Dec 14 07:15:56 2009
From: Greg at keller.net (Greg Keller)
Date: Mon, 14 Dec 2009 09:15:56 -0600
Subject: [Beowulf] Re: scalability 
In-Reply-To: <200912121641.nBCGfhxi010197@bluewest.scyld.com>
References: <200912121641.nBCGfhxi010197@bluewest.scyld.com>
Message-ID: <22C7676A-4693-4BA5-9348-653AC8E65935@Keller.net>

On Dec 12, 2009, at 10:41 AM, beowulf-request at beowulf.org wrote:

>
> From: Gus Correa <gus at ldeo.columbia.edu>
>
> Hi Amjad
>
> amjad ali wrote:
>> Hi Gus,
>>
>>    I was told that some people used to run two processes only on
>>    dual-socket dual-core Xeon nodes , leaving the other two cores  
>> idle.
>>    Although it is an apparent waste, the argument was that it paid
>>    off in terms of overall efficiency.
>>
>>
>> I guess I fully agree with this.
>>
The other reason I see folks choose this is licensing.  In some cases  
the "cost" of the license tokens to use the extra cores is too  
expensive given the minimal benefit since they still compete for  
Memory Bandwidth, CPU Cache, or less commonly Network Bandwidth/Latency.
...
>>
>> But still if it is a shared cluster (as in my case) then the cores  
>> you
>> left unbusy may be allocated to another process of another user by  
>> the
>> Batch scheduler. Right??
>
> Unless you request full nodes, as Chris Samuel suggested:
> #PBS -l nodes=10:ppn=8
>
> However, beware that this greedy and wasteful  behavior
> may drive your system administrator and
> the other cluster users mad at you!        :)
> Well, you can always justify it in the name of science, of course. ;)
>
They may thank you for not sharing a "maxed out" node.  If someone is  
doing benchmarking or running a similarly sensitive code you're saving  
everyone a lot of head scratching.  In my world users sharing nodes is  
almost always trouble looking for a way to ruin a weekend.

Cheers!
Greg
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091214/566153f6/attachment.html>

From rpnabar at gmail.com  Mon Dec 14 10:20:20 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Mon, 14 Dec 2009 12:20:20 -0600
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <20091214125720.218a9397.chekh@pcbi.upenn.edu>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
	<4B23C791.5070701@myri.com>
	<c4d69730912131926g5c1f912at16471233b5f7b525@mail.gmail.com>
	<20091214125720.218a9397.chekh@pcbi.upenn.edu>
Message-ID: <c4d69730912141020u1a2d8d1bm8465ca46caa2991d@mail.gmail.com>

On Mon, Dec 14, 2009 at 11:57 AM, Alex Chekholko <chekh at pcbi.upenn.edu> wrote:
> On Sun, 13 Dec 2009 21:26:52 -0600

> Well, remember, your hardware has to support it, first.

Right. I am checking with the Switch and eth Card specs to make sure now.

-- 
Rahul


From cap at nsc.liu.se  Mon Dec 14 10:35:26 2009
From: cap at nsc.liu.se (Peter Kjellstrom)
Date: Mon, 14 Dec 2009 19:35:26 +0100
Subject: [Beowulf] Sony PS3, random news
In-Reply-To: <d2300d1c0912091019h10654b02wf2e847afbdc39a6c@mail.gmail.com>
References: <d2300d1c0912091019h10654b02wf2e847afbdc39a6c@mail.gmail.com>
Message-ID: <200912141935.26586.cap@nsc.liu.se>

On Wednesday 09 December 2009, Jeremy Baker wrote:
> DoD buys PS3 for HPC.
>
> CNN brief at
>
> http://scitech.blogs.cnn.com/2009/12/09/military-purchases-2200-ps3s/
>
> Clip from report:
>
>  "Though a single 3.2 GHz cell processor can deliver over 200 GFLOPS,
> whereas the Sony PS3 configuration delivers approximately 150 GFLOPS, the
> approximately tenfold cost difference per GFLOP makes the Sony PS3 the only
> viable technology for HPC applications."

Given that the "new" PS3s does not support linux (or any "other OS" for that 
matter) this price/performance sweet spot may be going away...

/Peter
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part.
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091214/4b000b9e/attachment.sig>

From rpnabar at gmail.com  Mon Dec 14 11:35:14 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Mon, 14 Dec 2009 13:35:14 -0600
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <4B23C791.5070701@myri.com>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
	<4B23C791.5070701@myri.com>
Message-ID: <c4d69730912141135m1ea166dfieaa9eff7bdd8893d@mail.gmail.com>

On Sat, Dec 12, 2009 at 10:40 AM, Patrick Geoffray <patrick at myri.com> wrote:
>
> Most 10GE NIC have no problems reaching line rate at 1500 Bytes (the
> standard Ethernet MTU), the problem is the host OS stack (mainly TCP) where
> the per-packet overhead is important. One trick that all 10GE NICs worth
> their salt are doing these days is to fake a large MTU at the OS level,
> while keeping the wire MTU at 1500 Bytes (for compatibility). This is called
> TSO (Transmit Send Offload) and LRO (Large Receive Offload). The OS stack is
> using a virtual MTU of 64K and the NIC does segmentation/reassembly in
> hardware, sort of.

The TSO and LRO are only relevant to TCP though, aren't they? I am
using RDMA  so that shouldn't matter. Maybe I am wrong.

-- 
Rahul


From rpnabar at gmail.com  Mon Dec 14 11:54:11 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Mon, 14 Dec 2009 13:54:11 -0600
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <4B23C791.5070701@myri.com>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
	<4B23C791.5070701@myri.com>
Message-ID: <c4d69730912141154r7e54d0beo2b73c32b8188c690@mail.gmail.com>

On Sat, Dec 12, 2009 at 10:40 AM, Patrick Geoffray <patrick at myri.com> wrote:


>
> It is most likely due to the switch. Try back-to-back to measure without it.
> I don't know what hardware you are using, but you can get close to 10us
> latency over TCP with a standard 10GE NIC and interrupt coalescing disabled.
> With a NIC supporting OS-bypass (RDMA only make sense for bandwidth), you
> should get at least half that, ideally below 3us.

What was your tool to measure this latency? Just curious.

-- 
Rahul


From eugen at leitl.org  Tue Dec 15 06:39:42 2009
From: eugen at leitl.org (Eugen Leitl)
Date: Tue, 15 Dec 2009 15:39:42 +0100
Subject: [Beowulf] Ich bitte um Feedback bzgl. MPI / please I need your
	feedback
Message-ID: <20091215143942.GG17686@leitl.org>

----- Forwarded message from Rolf Rabenseifner <rabenseifner at hlrs.de> -----

From: Rolf Rabenseifner <rabenseifner at hlrs.de>
Date: Tue, 15 Dec 2009 15:42:50 +0100 (CET)
To: eugen at leitl.org
Subject: Ich bitte um Feedback bzgl. MPI / please I need your feedback

Sehr geehrte Damen und Herren,

zur Weiterentwicklung des Message Passing Interface (MPI)
Standards moechten wir, das MPI-3 Forum, Sie bitten uns ein paar 
wichtige Fragen zu beantworten.

Ich bitte Sie, mir 10 Minuten zu schenken und den kurzen Fragen-Katalog
jetzt gleich zu beantworten. Bitte verwenden Sie nicht zu viel
Zeit auf die einzelnen Fragen, wenn Sie Probleme bei der Beantwortung
einer Frage haben sollten. 

Koennten Sie bitte diese Mail auch an Kollegen weiterleiten, die MPI nutzen. 

Hier die URL und das Passwort zu der Umfrage:
 
   URL: http://mpi-forum.questionpro.com/
   Password: mpi3
 
Herzlichen Dank im Voraus - 
eine frohe Weihnachtszeit wuenscht Ihnen
Ihr Rolf Rabenseifner

-------------------------------------
Dear Madam, dear Sir,

for improving the Message Passing Interface (MPI) standard,
we (the MPI-3 Forum) kindly ask you to answer a few important 
questions.

I ask you to spend 10 minutes of your time to answer the short 
questionnaire.  Please do not spend too much time on a single 
question if there occur problems with one question.

Please can you also forward this email to colleagues who use MPI.

Here the URL and the password of the questionnaire. 
 
   URL: http://mpi-forum.questionpro.com/
   Password: mpi3

Thank you in advance and a Merry Christmas,
Rolf Rabenseifner 
 
---------------------------------------------------------------------
Dr. Rolf Rabenseifner .. . . . . . . . . . email rabenseifner at hlrs.de
High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
University of Stuttgart .. . . . . . . . . fax : ++49(0)711/685-65832
Head of Dpmt Parallel Computing .. .. www.hlrs.de/people/rabenseifner
Nobelstr. 19, D-70550 Stuttgart, Germany . . (Office: Allmandring 30)
---------------------------------------------------------------------

----- End forwarded message -----
-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a> http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE


From jorg.sassmannshausen at strath.ac.uk  Mon Dec 14 07:17:15 2009
From: jorg.sassmannshausen at strath.ac.uk (=?iso-8859-15?q?J=F6rg_Sa=DFmannshausen?=)
Date: Mon, 14 Dec 2009 15:17:15 +0000
Subject: [Beowulf] Performance degrading
Message-ID: <200912141517.15432.jorg.sassmannshausen@strath.ac.uk>

Dear all,

I am scratching my head but apart from getting splinters into my fingers I 
cannot find a good answer for the following problem:
I am running a DFT program (NWChem) in parallel on our cluster (AMD Opterons, 
single quad cores in the node, 12 GB of RAM, Gigabit network) and at certain 
stages of the run top is presenting me with that:

top - 15:10:48 up 13 days, 22:20,  1 user,  load average: 0.26, 0.24, 0.19
Tasks: 106 total,   1 running, 105 sleeping,   0 stopped,   0 zombie
Cpu0  :  8.0% us,  2.7% sy,  0.0% ni, 82.7% id,  0.0% wa,  1.3% hi,  5.3% si
Cpu1  :  4.1% us,  1.4% sy,  0.0% ni, 94.6% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu2  :  2.7% us,  0.0% sy,  0.0% ni, 97.3% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu3  :  0.0% us,  0.0% sy,  0.0% ni, 100.0% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:  12250540k total,  5581756k used,  6668784k free,   273396k buffers
Swap: 16779884k total,        0k used, 16779884k free,  3841688k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
16885 sassy     15   0 3928m 1.7g 1.4g S    4 14.4 312:19.92 nwchem
16886 sassy     15   0 3928m 1.7g 1.4g S    4 14.5 313:08.77 nwchem
16887 sassy     15   0 3920m 1.7g 1.4g S    3 14.4 316:18.24 nwchem
16888 sassy     15   0 3923m 1.6g 1.3g S    3 13.3 316:13.55 nwchem
16890 sassy     15   0 2943m 1.7g 1.7g S    3 14.8 104:32.33 nwchem

It is not a few seconds it does it, it appears to be for a prolonged period of 
time. I checked it randomly for say 1 min and the performance is well below 
50 % (most of the time around 20 %). I have not noticed that when I am 
running the job within one node.

I have the suspicion that the Gigabit network is the problem, but I really 
would like to pinpoint that so I can get my boss to upgrade to a better 
network for parallel computing (hence my previous question about Open-MX). 
Now how, as I am not an admin of that cluster, would I be able to do that?

Thanks for your comments.

Best wishes from Glasgow!

J?rg

-- 
*************************************************************
J?rg Sa?mannshausen
Research Fellow
University of Strathclyde
Department of Pure and Applied Chemistry
295 Cathedral St.
Glasgow
G1 1XL

email: jorg.sassmannshausen at strath.ac.uk
web: http://sassy.formativ.net

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html


From chekh at pcbi.upenn.edu  Mon Dec 14 09:57:20 2009
From: chekh at pcbi.upenn.edu (Alex Chekholko)
Date: Mon, 14 Dec 2009 12:57:20 -0500
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <c4d69730912131926g5c1f912at16471233b5f7b525@mail.gmail.com>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
	<4B23C791.5070701@myri.com>
	<c4d69730912131926g5c1f912at16471233b5f7b525@mail.gmail.com>
Message-ID: <20091214125720.218a9397.chekh@pcbi.upenn.edu>

On Sun, 13 Dec 2009 21:26:52 -0600
Rahul Nabar <rpnabar at gmail.com> wrote:

> On Sat, Dec 12, 2009 at 10:40 AM, Patrick Geoffray <patrick at myri.com> wrote:
> > Rahul,
> >
> > Rahul Nabar wrote:
> >>
> >> I have seen a considerable performance boost for my codes by using
> >> Jumbo Frames. But are there any systematic tools or strategies to
> >> select the optimum MTU size?
> >
> > There is no optimal MTU size. This is the maximum payload you can fit in one
> > packet, so there is no drawback to a bigger MTU.
> 
> Thanks! So I could push it beyond 9000 as well? Reason is I've seen a
> steady boost in performance so far. 1500 < 4000 < 9000.
> 
> Maybe my performance continues to increase beyond 9000 too?

Well, remember, your hardware has to support it, first.

I have a Foundry FGS648P switch which lists in the specs: "Jumbo Frames
up to 10,240 bytes for 10/100/1000 and 10GbE ports".  I turn that on by
issuing the command "jumbo frames" (then saving to flash, etc).

The 10GigE NICs that I have are the HP NC510C NetXen-based cards.  I
use the driver from the HP support site, and that driver only supports
up to 8000 MTU.  So I use 8000 bytes as my MTU. 

Set it as high as you can; there is no downside except ensuring all
your devices are set to handle that large unit size.  Typically, if the
device doesn't support jumbo frames, it just drops the jumbo frames
silently, which can result in odd intermittent problems.

-- 
Alex Chekholko   chekh at pcbi.upenn.edu


From atchley at myri.com  Tue Dec 15 09:49:23 2009
From: atchley at myri.com (Scott Atchley)
Date: Tue, 15 Dec 2009 12:49:23 -0500
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <20091214125720.218a9397.chekh@pcbi.upenn.edu>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
	<4B23C791.5070701@myri.com>
	<c4d69730912131926g5c1f912at16471233b5f7b525@mail.gmail.com>
	<20091214125720.218a9397.chekh@pcbi.upenn.edu>
Message-ID: <437A0D5B-D5F4-4F7A-B8C6-9B09EC723F47@myri.com>

On Dec 14, 2009, at 12:57 PM, Alex Chekholko wrote:

> Set it as high as you can; there is no downside except ensuring all
> your devices are set to handle that large unit size.  Typically, if  
> the
> device doesn't support jumbo frames, it just drops the jumbo frames
> silently, which can result in odd intermittent problems.

You can test it by using the size parameter with ping:

$ ping -s <size_in_bytes> <host>

If they all drop, then you have exceeded the MTU of some device.

Scott


From chenyon1 at iit.edu  Wed Dec  9 18:09:16 2009
From: chenyon1 at iit.edu (Yong Chen)
Date: Wed, 09 Dec 2009 20:09:16 -0600
Subject: [Beowulf] [hpc-announce] Call For Papers: Intl. Workshop on Parallel
 Programming Models and Systems Software for HEC (P2S2)
Message-ID: <fb0496c17dee.4b2003ec@iit.edu>

CALL FOR PAPERS
===============

Third International Workshop on Parallel Programming Models 
and Systems Software for High-end Computing (P2S2)
Sept. 13th, 2010

To be held in conjunction with ICPP-2010: The 39th International 
Conference on Parallel Processing, Sept. 13-16, 2010, San Diego, CA, USA 

Website: http://www.mcs.anl.gov/events/workshops/p2s2

SCOPE
-----
The goal of this workshop is to bring together researchers and
practitioners in parallel programming models and systems software for
high-end computing systems. Please join us in a discussion of new ideas,
experiences, and the latest trends in these areas at the workshop.


TOPICS OF INTEREST
------------------
The focus areas for this workshop include, but are not limited to:

    *  Systems software for high-end scientific and enterprise computing architectures
          o Communication sub-subsystems for high-end computing
          o High-performance file and storage systems
          o Fault-tolerance techniques and implementations
          o Efficient and high-performance virtualization and other management mechanisms for high-end computing

    * Programming models and their high-performance implementations
          o MPI, Sockets, OpenMP, Global Arrays, X10, UPC, Chapel, Fortress and others
          o Hybrid Programming Models

    * Tools for Management, Maintenance, Coordination and Synchronization
          o Software for Enterprise Data-centers using Modern Architectures
          o Job scheduling libraries
          o Management libraries for large-scale system
          o Toolkits for process and task coordination on modern platforms

    * Performance evaluation, analysis and modeling of emerging computing platforms


PROCEEDINGS
-----------
Proceedings of this workshop will be published in CD format and will be available 
at the conference (together with the ICPP conference proceedings) .


SUBMISSION INSTRUCTIONS
-----------------------
Submissions should be in PDF format in U.S. Letter size paper. They
should not exceed 8 pages (all inclusive). Submissions will be judged
based on relevance, significance, originality, correctness and clarity.
Please visit workshop website at: http://www.mcs.anl.gov/events/workshops/p2s2/
for the submission link.


JOURNAL SPECIAL ISSUE
---------------------
The best papers of P2S2'10 will be included in a special issue of the International 
Journal of High Performance Computing Applications (IJHPCA) on Programming Models, 
Software and Tools for High-End Computing. 


IMPORTANT DATES
---------------
Paper Submission: March 3rd, 2010
Author Notification: May 3rd, 2010 
Camera Ready: June 14th, 2010


PROGRAM CHAIRS
--------------
  * Pavan Balaji, Argonne National Laboratory
  * Abhinav Vishnu, Pacific Northwest National Laboratory


PUBLICITY CHAIR
---------------
  * Yong Chen, Illinois Institute of Technology


STEERING COMMITTEE
------------------
  * William D. Gropp, University of Illinois Urbana-Champaign
  * Dhabaleswar K. Panda, Ohio State University
  * Vijay Saraswat, IBM Research


PROGRAM COMMITTEE
-----------------
  * Ahmad Afsahi, Queen's University
  * George Almasi, IBM Research 
  * Taisuke Boku, Tsukuba University
  * Ron Brightwell, Sandia National Laboratory
  * Franck Cappello, INRIA, France
  * Yong Chen, Illinois Institute of Technology
  * Ada Gavrilovska, Georgia Tech
  * Torsten Hoefler, Indiana University
  * Zhiyi Huang, University of Otago, New Zealand
  * Hyun-Wook Jin, Konkuk University, Korea
  * Zhiling Lan, Illinois Institute of Technology
  * Doug Lea, State University of New York at Oswego
  * Jiuxing Liu, IBM Research
  * Guillaume Mercier, INRIA, France
  * Scott Pakin, Los Alamos National Laboratory
  * Fabrizio Petrini, IBM Research
  * Bronis de Supinksi, Lawrence Livermore National Laboratory
  * Sayantan Sur, IBM Research
  * Rajeev Thakur, Argonne National Laboratory
  * Vinod Tipparaju, Oak Ridge National Laboratory
  * Jesper Traff, NEC, Europe
  * Weikuan Yu, Auburn University


If you have any questions, please contact us at p2s2-chairs at mcs.anl.gov

========================================================================
If you do not want to receive any more announcements regarding the
P2S2 workshop, please unsubscribe here:
https://lists.mcs.anl.gov/mailman/listinfo/hpc-announce
========================================================================


From djk at lanl.gov  Mon Dec 14 06:48:20 2009
From: djk at lanl.gov (Darren)
Date: Mon, 14 Dec 2009 07:48:20 -0700
Subject: [Beowulf] [hpc-announce] CFP: 24th International Conference on
 Supercomputing (ICS'10) - 4 weeks remaining
Message-ID: <4B265034.6090806@lanl.gov>

[Please accept our apologies if you have received this announcement 
multiple times]

***************************************************************************** 


            CALL FOR PAPERS - Submission deadline in 4 weeks

         24th International Conference on Supercomputing (ICS'10)
 
                     http://www.ics-conference.org

                            June 1-4, 2010

         Epochal Tsukuba (Tsukuba International Congress Center)
                            Tsukuba, Japan
                     http://www.epochal.or.jp/eng/

                        Sponsored by ACM/SIGARCH

***************************************************************************** 


ICS is the premier international forum for the presentation of
research results in high-performance computing systems. In 2010 the
conference will be held at the Epochal Tsukuba (Tsukuba International
Congress Center) in Tsukuba City, the largest high-tech and academic
city in Japan.

Papers are solicited on all aspects of research, development, and
application of high-performance experimental and commercial systems.
Special emphasis will be given to work that leads to better
understanding of the implications of the new era of million-scale
parallelism and Exa-scale performance; including (but not limited to):

* Computationally challenging scientific and commercial applications:
studies and experiences to exploit ultra large scale parallelism, a
large number of accelerators, and/or cloud computing paradigm.

* High-performance computational and programming models: studies and
proposals of new models, paradigms and languages for scalable
application development, seamless exploitation of accelerators, and
grid/cloud computing.

* Architecture and hardware aspects: processor, accelerator, memory,
interconnection network, storage and I/O architecture to make future
systems scalable, reliable and power efficient.

* Software aspects: compilers and runtime systems, programming and
development tools, middleware and operating systems to enable us to
scale applications and systems easily, efficiently and reliably.

* Performance evaluation studies and theoretical underpinnings of any
of the above topics, especially those giving us perspective toward
future generation high-performance computing.

* Large scale installations in the Petaflop era: design, scaling,
power, and reliability, including case studies and experience reports,
to show the baselines for future systems.


In order to encourage open discussion on future directions, the
program committee will provide higher priority for papers that present
highly innovative and challenging ideas.

Papers should not exceed 6,000 words, and should be submitted
electronically, in PDF format using the ICS'10 submission web
site. Submissions should be blind.  The review process will include a
rebuttal period. Please refer to the ICS'10 web site for detailed
instructions.

Workshop and tutorial proposals are also be solicited and due by
January 18, 2010.  For further information and future updates, refer
to the ICS'10 web site at http://www.ics-conference.org or contact the
General Chair (ics10-chair at hpcs.cs.tsukuba.ac.jp) or Program Co-Chairs
(ics10-chairs at ac.upc.edu).

Important Dates
Abstract submission:  January 11, 2010
Paper submission:     January 18, 2010
Author notification:  March 22, 2010
Final papers:         April 15, 2010

For more information, please visit the conference web site at
http://www.ics-conference.org

[ICS 2010 Committee Members]
GENRAL CHAIR
    Taisuke Boku, U. Tsukuba
PROGRAM CO-CHAIRS
    Hiroshi Nakashima, Kyoto U.
    Avi Mendelson, Microsoft
FINANCE CHAIR
    Kazuki Joe, Nara Women's U.
PUBLICATION CHAIR
    Osamu Tatebe, U. Tsukuba
PUBLICITY CO-CHAIRS
    Darren Kerbyson, LANL
    Hironori Nakajo, Tokyo U. Agric. & Tech.
    Serge Petiton, CNRS/LIFL
WORKSHOP & TUTORIAL CHAIR
    Koji Inoue, Kyushu U.
POSTER CHAIR
    Masahiro Goshima, U. Tokyo
WEB & SUBMISSION CO-CHAIRS
    Eduard Ayguade, BSC/UPC
    Alex Ramirez, BSC/UPC
LOCAL ARRANGEMENT CHAIR
    Daisuke Takahashi, U. Tsukuba

PROGRAM COMMITTEE
    Jung Ho Ahn, Seoul NU.
    Eduard Ayguade, BSC/UPC
    Carl Beckmann, Intel
    Muli Ben-Yehuda, IBM
    Gianfranco Bilardi, U. Padova
    Greg Byrd, NCSU
    Franck Cappello, INRIA/UIUC
    Marcelo Cintra, U. Edinburgh
    Luiz De Rose, Cray
    Bronis De Supinski, LLNL/CASC
    Jack Dongarra, UTenn/ORNL
    Eitan Frachtenberg, Facebook
    Kyle Gallivan, FSU
    Stratis Gallopoulos, ,U. Patras
    Milind Girkar, Intel
    Bill Gropp, UIUC
    Mike Heroux, SNL
    Adolfy Hoisie, LANL
    Koh Hotta, Fujitsu
    Yutaka Ishikawa, U. Tokyo
    Takeshi Iwashita, Kyoto U.
    Kazuki Joe, Nara Woman's U.
    Hironori Kasahara, U. Waseda
    Arun Kejariwal, Yahoo
    Darren Kerbyson, LANL
    Moe Khaleel, PNNL
    Bill Kramer, NCSA
    Andrew Lewis, Griffith U.
    Jose Moreira, IBM
    Walid Najjar, U.C. Riverside
    Kengo Nakajima, U. Tokyo
    Hironori Nakajo, Tokyo U. Agric. & Tech.
    Hiroshi Nakamura, U. Tokyo
    Toshio Nakatani, IBM Research Tokyo
    Michael O'Boyle, U. Edinburgh
    Lenny Oliker, LBNL
    Theodore Papatheodoro, U. Patras
    Miquel Pericas, BSC
    Keshav Pingali, U. Texas
    Depei Qian, Beihang U.
    Alex Ramirez, BSC/UPC
    Valentina Salapura, IBM
    Mitsuhisa Sato, U. Tsukuba
    John Shalf, LBNL
    Takeshi Shimizu, Fujitsu
    Joshua Simons, Sun Microsystems
    Shinji Sumimoto, Fujitsu
    Makoto Taiji, Riken
    Toshikazu Takada, Riken
    Daisuke Takahashi, U. Tsukuba
    Guangming Tan, ICT
    Osamu Tatebe, U. Tsukuba
    Kenjiro Taura, U. Tokyo
    Rajeev Thakur, ANL
    Rong Tian, NCIC
    Robert Van Engelen, FSU
    Harry Wijshoff, Leiden
    Mitsuo Yokokawa, Riken
    Ayal Zaks, IBM
    Yunquan Zhang, ISCAS


From djk at lanl.gov  Mon Dec 14 07:32:46 2009
From: djk at lanl.gov (Darren)
Date: Mon, 14 Dec 2009 08:32:46 -0700
Subject: [Beowulf] [hpc-announce] Extended Deadline: Workshop on Large-Scale
	Parallel Processing (LSPP'10)
Message-ID: <4B265A9E.5010609@lanl.gov>

[Please accept our apologies if you receive multiple copies]

-----------------------------------------------------------------
Call for papers:      Workshop on LARGE-SCALE PARALLEL PROCESSING

               to be held in conjunction with
IEEE International Parallel and Distributed Processing Symposium    
                      Atlanta, Georgia
                      April 23rd, 2010

EXTENDED DEADLINE:  December 18th  2009 (Final)

Selected work presented at the workshop will be published in a
special issue of Parallel Processing Letters.
-----------------------------------------------------------------

The workshop on Large-Scale Parallel Processing is a forum that
focuses on computer systems that utilize thousands of processors
and beyond. This is a very active area given the goals of many
researchers world-wide to enhance science-by-simulation through
installing large-scale multi-petaflop systems at the start of
the next decade. Large-scale systems, referred to by some as
extreme-scale and Ultra-scale, have many important research
aspects that need detailed examination in order for their
effective design, deployment, and utilization to take place.
These include handling the substantial increase in multi-core on
a chip, the ensuing interconnection hierarchy, communication, and
synchronization mechanisms. The workshop aims to bring together
researchers from different communities working on challenging
problems in this area for a dynamic exchange of ideas. Work at
early stages of development as well as work that has been
demonstrated in practice is equally welcome.

Of particular interest are papers that identify and analyze novel
ideas rather than providing incremental advances in the following
areas:

- LARGE-SCALE SYSTEMS : exploiting parallelism at large-scale,
  the coordination of large numbers of processing elements,
  synchronization and communication at large-scale, programming
  models and productivity

- MULTI-CORE : utilization of increased parallelism on a single
  chip (MPP on a chip such as the Cell and GPUs), the possible
  integration of these into large-scale systems, and dealing with
  the resulting hierarchical connectivity.

- NOVEL ARCHITECTURES AND EXPERIMENTAL SYSTEMS : the design of
  novel systems, the use of processors in memory (PIMS),
  parallelism in emerging technologies, future trends.

- APPLICATIONS : novel algorithmic and application methods,
  experiences in the design and use of applications that scale to
  large-scales, overcoming of limitations, performance analysis
  and insights gained.

Results of both theoretical and practical significance will be
considered, as well as work that has demonstrated impact at
small-scale that will also affect large-scale systems. Work may
involve algorithms, languages, various types of models, or
hardware.

-----------------------------------------------------------------
SUBMISSION GUIDELINES

Papers should not exceed eight single-space pages (including
figures, tables and references) using a 12-point font on 8.5x11
inch pages. Submissions in PostScript or PDF should be made
using EDAS (www.edas.info). Informal enquiries can be made to
djk at lanl.gov. Submissions will be judged on correctness,
originality, technical strength, significance, presentation
quality and appropriateness. Submitted papers should not have
appeared in or under consideration for another venue.

IMPORTANT DATES

Submission deadline:         December 18th  2009 (Final)
Notification of acceptance:  January  15th  2010
Camera-Ready Papers due:     February  1st  2010

-----------------------------------------------------------------
WORKSHOP CO-CHAIRS

Darren J. Kerbyson      Los Alamos National Laboratory
Ram Rajamony            IBM Austin Research Lab
Charles Weems           University of Massachusetts

STEERING COMMITTEE

Johnnie Baker           Kent State University
Alex Jones              University of Pittsburgh
H.J. Siegel             Colorado State University

PROGRAM COMMITTEE

Ghoerge Almasi          IBM T.J. Watson Research Lab
Taisuke Boku            University of Tsukuba, Japan
Marco Daneluto          University of Pisa
Martin Herbordt         Boston University
Lei Huang               University of Houston
Daniel Katz             University of Chicago
Jesus Labarta           Barcelona Supercomputer Center, Spain
John Michalakes         NCAR, Boulder
Celso Mendes            University of Illinois Urbana-Champagne
Bernd Mohr              Forschungszentrum Juelich, Germany
Stathis Papaefstathiou  Microsoft
Michael Scherger        Texas A&M University-Corpus Christi
Harvey Wasserman        NERSC/LBNL
Gerhard Wellein         University of Erlangen, Germany
Pat Worley              Oak Ridge National Laboratory

Workshop Webpage: http://www.ccs3.lanl.gov/LSPP


From sbyna at nec-labs.com  Tue Dec 15 08:59:30 2009
From: sbyna at nec-labs.com (Surendra Byna)
Date: Tue, 15 Dec 2009 11:59:30 -0500
Subject: [Beowulf] [hpc-announce] CfP: Special Issue of JPDC on "Data
	Intensive Computing", Submission: One month from Today
Message-ID: <951A499AA688EF47A898B45F25BD8EE807039EEC@mailer.nec-labs.com>

Dear Colleagues:

 
The paper submission deadline for the Special Issue of Journal of
Parallel and Distributed Computing (JPDC) on "Data Intensive Computing"
is a month from Today (January 15th 2010). We welcome your submissions.
We appreciate sharing this announcement with anyone who might be
interested. 

Thank you.

 
Suren Byna 

NEC Labs America, Inc. 

4 Independence Way, Suite 200

Princeton, NJ.

 
Xian-He Sun

Department of Computer Science

Illinois Institute of Technology

Chicago, IL.

 
======================================================================

           Our apologies for duplicated copies for this CfP

======================================================================

 
Call for Papers:

 
Special Issue of Journal of Parallel and Distributed Computing on "Data
Intensive Computing"

------------------------------------------------------------------------
---

 
Data intensive computing is posing many challenges in exploiting
parallelism of current and upcoming computer architectures. Data volumes
of applications in the fields of sciences and engineering, finance,
media, online information resources, etc. are expected to double every
two years over the next decade and further. With this continuing data
explosion, it is necessary to store and process data efficiently by
utilizing enormous computing power that is available in the form of
multicore/manycore platforms. There is no doubt in the industry and
research community that the importance of data intensive computing has
been raising and will continue to be the foremost fields of research.
This raise brings up many research issues, in forms of capturing and
accessing data effectively and fast, processing it while still achieving
high performance and high throughput, and storing it efficiently for
future use. Programming for high performance yielding data intensive
computing is an important challenging issue. Expressing data access
requirements of applications and designing programming language
abstractions to exploit parallelism are at immediate need. 

Application and domain specific optimizations are also parts of a viable
solution in data intensive computing. While these are a few examples of
issues, research in data intensive computing has become quite intense
during the last few years yielding strong results.

 
This special issue of the Journal Parallel and Distributed Computing

(JPDC) is seeking original unpublished research articles that describe
recent advances and efforts in the design and development of data
intensive computing, functionalities and capabilities that will benefit
many applications.

 
Topics of interest include (but are not limited to):

 
*  Data-intensive applications and their challenges

*  Storage and file systems

*  High performance data access toolkits

*  Fault tolerance, reliability, and availability

*  Meta-data management

*  Remote data access

*  Programming models, abstractions for data intensive computing

*  Compiler and runtime support

*  Data capturing, management, and scheduling techniques

*  Future research challenges of data intensive computing

*  Performance optimization techniques

*  Replication, archiving, preservation strategies

*  Real-time data intensive computing

*  Network support for data intensive computing

*  Challenges and solutions in the era of multi/many-core platforms

*  Stream computing

*  Green (Power efficient) data intensive computing

*  Security and protection of sensitive data in collaborative
environments

* Data intensive computing on accelerators and GPUs

 
Guide for Authors

 
Papers need not be solely abstract or conceptual in nature: proofs and
experimental results can be included as appropriate.

 
Authors should follow the JPDC manuscript format as described in the
"Information for Authors" at the end of each issue of JPDC or at
http://ees.elsevier.com/jpdc/ . The journal version will be reviewed as
per JPDC review process for special issues.

 
Important Dates:

 
Paper Submission                                            :  January
15, 2010

Notification of Acceptance/Rejection    :  May 31, 2010

Final Version of the Paper                            :  September 15,
2010

 
Submission Guidelines

 
All manuscripts and any supplementary material should be submitted
through Elsevier Editorial System (EES) at http://ees.elsevier.com/jpdc.
Authors must select "Special Issue: Data Intensive Computing" when they
reach the "Article Type" step in the submission process. First time
users must register themselves as Author. For the latest details of the
JPDC special issue see  http://www.cs.iit.edu/~suren/jpdc. 

 
Guest Editors:

Dr. Surendra Byna

NEC Labs America

E-mail: sbyna at nec-labs.com

 
Prof. Xian-He Sun

Illinois Institute of Technology

E-mail: sun at cs.iit.edu

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091215/55c057cf/attachment.html>

From rpnabar at gmail.com  Tue Dec 15 11:24:27 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Tue, 15 Dec 2009 13:24:27 -0600
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <437A0D5B-D5F4-4F7A-B8C6-9B09EC723F47@myri.com>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
	<4B23C791.5070701@myri.com>
	<c4d69730912131926g5c1f912at16471233b5f7b525@mail.gmail.com>
	<20091214125720.218a9397.chekh@pcbi.upenn.edu>
	<437A0D5B-D5F4-4F7A-B8C6-9B09EC723F47@myri.com>
Message-ID: <c4d69730912151124n538f8d9ao7ed486dd3adf9828@mail.gmail.com>

On Tue, Dec 15, 2009 at 11:49 AM, Scott Atchley <atchley at myri.com> wrote:
> On Dec 14, 2009, at 12:57 PM, Alex Chekholko wrote:
>
>> Set it as high as you can; there is no downside except ensuring all
>> your devices are set to handle that large unit size. ?Typically, if the
>> device doesn't support jumbo frames, it just drops the jumbo frames
>> silently, which can result in odd intermittent problems.
>
> You can test it by using the size parameter with ping:
>
> $ ping -s <size_in_bytes> <host>
>
> If they all drop, then you have exceeded the MTU of some device.

Thanks Scott. 9000 seems the max. Neither my Switch  nor my eth
adapter like higher values.

-- 
Rahul


From gus at ldeo.columbia.edu  Tue Dec 15 11:36:51 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Tue, 15 Dec 2009 14:36:51 -0500
Subject: [Beowulf] Performance degrading
In-Reply-To: <200912141517.15432.jorg.sassmannshausen@strath.ac.uk>
References: <200912141517.15432.jorg.sassmannshausen@strath.ac.uk>
Message-ID: <4B27E553.2020702@ldeo.columbia.edu>

Hi Jorg

If you have single quad core nodes as you said,
then top shows that you are oversubscribing the cores.
There are five nwchem processes are running.

In my experience, oversubscription only works in relatively
light MPI programs (say the example programs that come with OpenMPI or
MPICH).
Real world applications tend to be very inefficient,
and can even hang on oversubscribed CPUs.

What happens when you launch four or less processes
on a node instead of five?

My $0.02.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


J?rg Sa?mannshausen wrote:
> Dear all,
> 
> I am scratching my head but apart from getting splinters into my fingers I 
> cannot find a good answer for the following problem:
> I am running a DFT program (NWChem) in parallel on our cluster (AMD Opterons, 
> single quad cores in the node, 12 GB of RAM, Gigabit network) and at certain 
> stages of the run top is presenting me with that:
> 
> top - 15:10:48 up 13 days, 22:20,  1 user,  load average: 0.26, 0.24, 0.19
> Tasks: 106 total,   1 running, 105 sleeping,   0 stopped,   0 zombie
> Cpu0  :  8.0% us,  2.7% sy,  0.0% ni, 82.7% id,  0.0% wa,  1.3% hi,  5.3% si
> Cpu1  :  4.1% us,  1.4% sy,  0.0% ni, 94.6% id,  0.0% wa,  0.0% hi,  0.0% si
> Cpu2  :  2.7% us,  0.0% sy,  0.0% ni, 97.3% id,  0.0% wa,  0.0% hi,  0.0% si
> Cpu3  :  0.0% us,  0.0% sy,  0.0% ni, 100.0% id,  0.0% wa,  0.0% hi,  0.0% si
> Mem:  12250540k total,  5581756k used,  6668784k free,   273396k buffers
> Swap: 16779884k total,        0k used, 16779884k free,  3841688k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 16885 sassy     15   0 3928m 1.7g 1.4g S    4 14.4 312:19.92 nwchem
> 16886 sassy     15   0 3928m 1.7g 1.4g S    4 14.5 313:08.77 nwchem
> 16887 sassy     15   0 3920m 1.7g 1.4g S    3 14.4 316:18.24 nwchem
> 16888 sassy     15   0 3923m 1.6g 1.3g S    3 13.3 316:13.55 nwchem
> 16890 sassy     15   0 2943m 1.7g 1.7g S    3 14.8 104:32.33 nwchem
> 
> It is not a few seconds it does it, it appears to be for a prolonged period of 
> time. I checked it randomly for say 1 min and the performance is well below 
> 50 % (most of the time around 20 %). I have not noticed that when I am 
> running the job within one node.
> 
> I have the suspicion that the Gigabit network is the problem, but I really 
> would like to pinpoint that so I can get my boss to upgrade to a better 
> network for parallel computing (hence my previous question about Open-MX). 
> Now how, as I am not an admin of that cluster, would I be able to do that?
> 
> Thanks for your comments.
> 
> Best wishes from Glasgow!
> 
> J?rg
> 


From Glen.Beane at jax.org  Tue Dec 15 12:10:47 2009
From: Glen.Beane at jax.org (Glen Beane)
Date: Tue, 15 Dec 2009 15:10:47 -0500
Subject: [Beowulf] Performance degrading
In-Reply-To: <4B27E553.2020702@ldeo.columbia.edu>
Message-ID: <C74D5777.593C%glen.beane@jax.org>


On 12/15/09 2:36 PM, "Gus Correa" <gus at ldeo.columbia.edu> wrote:

If you have single quad core nodes as you said,
then top shows that you are oversubscribing the cores.
There are five nwchem processes are running.


It has been a very long time,  but wasn't that normal behavior for mpich under certain instances?  If I recall correctly it had an extra process that was required by the implementation. I don't think it returned from MPI_Init, so you'd have a bunch of processes consuming nearly a full CPU and then one that was mostly idle doing something behind the scenes.  I don't remember if this was for mpich/p4 (with or without -with-comm=shared) or for mpich-gm.


--
Glen L. Beane
Software Engineer
The Jackson Laboratory
Phone (207) 288-6153

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091215/0bb7b89a/attachment.html>

From lindahl at pbm.com  Tue Dec 15 13:22:56 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Tue, 15 Dec 2009 13:22:56 -0800
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <437A0D5B-D5F4-4F7A-B8C6-9B09EC723F47@myri.com>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
	<4B23C791.5070701@myri.com>
	<c4d69730912131926g5c1f912at16471233b5f7b525@mail.gmail.com>
	<20091214125720.218a9397.chekh@pcbi.upenn.edu>
	<437A0D5B-D5F4-4F7A-B8C6-9B09EC723F47@myri.com>
Message-ID: <20091215212256.GC28010@bx9.net>

On Tue, Dec 15, 2009 at 12:49:23PM -0500, Scott Atchley wrote:

> You can test it by using the size parameter with ping:
>
> $ ping -s <size_in_bytes> <host>
>
> If they all drop, then you have exceeded the MTU of some device.

[lindahl at greg-desk b]$ ping -s 60000 rich-desk
PING rich-desk (64.13.159.69) 60000(60028) bytes of data.
60008 bytes from rich-desk (64.13.159.69): icmp_seq=1 ttl=64 time=1.36 ms
60008 bytes from rich-desk (64.13.159.69): icmp_seq=2 ttl=64 time=1.32 ms

I never knew I bought such advanced network gear!

-- greg


From chekh at pcbi.upenn.edu  Tue Dec 15 13:43:49 2009
From: chekh at pcbi.upenn.edu (Alex Chekholko)
Date: Tue, 15 Dec 2009 16:43:49 -0500
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <20091215212256.GC28010@bx9.net>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
	<4B23C791.5070701@myri.com>
	<c4d69730912131926g5c1f912at16471233b5f7b525@mail.gmail.com>
	<20091214125720.218a9397.chekh@pcbi.upenn.edu>
	<437A0D5B-D5F4-4F7A-B8C6-9B09EC723F47@myri.com>
	<20091215212256.GC28010@bx9.net>
Message-ID: <20091215164349.6d413f8d.chekh@pcbi.upenn.edu>

On Tue, 15 Dec 2009 13:22:56 -0800
Greg Lindahl <lindahl at pbm.com> wrote:

> On Tue, Dec 15, 2009 at 12:49:23PM -0500, Scott Atchley wrote:
> 
> > You can test it by using the size parameter with ping:
> >
> > $ ping -s <size_in_bytes> <host>
> >
> > If they all drop, then you have exceeded the MTU of some device.
> 
> [lindahl at greg-desk b]$ ping -s 60000 rich-desk
> PING rich-desk (64.13.159.69) 60000(60028) bytes of data.
> 60008 bytes from rich-desk (64.13.159.69): icmp_seq=1 ttl=64 time=1.36 ms
> 60008 bytes from rich-desk (64.13.159.69): icmp_seq=2 ttl=64 time=1.32 ms
> 
> I never knew I bought such advanced network gear!

Not sure if you're joking, but yes, you also have to tell ping to set
the "don't fragment" bit.

So on Ubuntu 9.04 it would be:

ping -M do -s SIZE whatever.host.com

Regards,
-- 
Alex Chekholko   chekh at pcbi.upenn.edu


From jac67 at georgetown.edu  Tue Dec 15 14:22:24 2009
From: jac67 at georgetown.edu (Jess Cannata)
Date: Tue, 15 Dec 2009 17:22:24 -0500
Subject: [Beowulf] PXE/TFTP and Xen Kernel Issues
Message-ID: <4B280C20.5060700@georgetown.edu>

I'm having a problem booting Xen kernels via PXE. I want to boot a 
machine via PXE that will then host Xen virtual machines. The client 
machine PXE boots, receives the pxelinux.0 file, and then grabs the Xen 
kernel (vmlinuz-2.6.18-164.6.1.el5xen). However, it can never load the 
Xen kernel. On the client, I get the following error:

Invalid or corrupt kernel image.

I have tried the following three kernels (two stock Centos kernels and 
one custom compiled kernel) and only the Xen kernel fails:

-rw-r--r-- 1 root root 2030154 Dec 10 15:28 vmlinuz-2.6.18-164.6.1.el5xen
-rw-r--r-- 1 root root 1932284 Sep 25 16:17 vmlinuz-2.6.18-164.el5
-rw-r--r-- 1 root root 3277584 Dec 10 15:29 vmlinuz-2.6.27.15-jw-node

The others load without error. I have checked multiple times that the 
Xen kernel is not corrupt via md5sums and by booting it via grub. It 
just seems not to like the PXE system. Here is a snippet of the dnsmasq 
log to show that the file is sent correctly to the client:

Dec 11 04:12:57 julie dnsmasq[9117]: TFTP sent /tftpboot/pxelinux.0 to 
192.168.0.6
Dec 11 04:12:57 julie dnsmasq[9117]: TFTP sent 
/tftpboot/pxelinux.cfg/default to 192.168.0.6
Dec 11 04:12:57 julie dnsmasq[9117]: TFTP sent 
/tftpboot/vmlinuz-2.6.18-164.6.1.el5xen to 192.168.0.6

I have tried three different systems for the DHCP, TFTP, and PXE Servers 
(using stock RHEL/Centos packages). Here are the specs:

System 1
Centos 5.4 (64-bit) with nvidia Ethernet adapters
dnsmasq for both DHCP and TFTP Servers
syslinux for PXE

System 2
Centos 5.4 (64-bit) with e1000 Ethernet adapters
dnsmasq for both DHCP and TFTP Servers
syslinux for PXE

System 3
Centos 5.3 (32-bit) with e1000 Ethernet adapters (trying 32-bit version 
of the Xen kernel)
Config One:
dnsmasq for both DHCP and TFTP Servers
syslinux for PXE

Config Two:
dnsmasq for DHCP Server
tftp-server for TFTP Server
syslinux for PXE

The client machines use the same hardware as the servers. I haven't seen 
anything about Xen kernels having issues with PXE. Before I start trying 
different flavors of Linux, I'm curious if anyone else has seen or heard 
of this problem.

Many thanks in advance.

Jess


From gus at ldeo.columbia.edu  Tue Dec 15 17:04:10 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Tue, 15 Dec 2009 20:04:10 -0500
Subject: [Beowulf] Performance degrading
In-Reply-To: <C74D5777.593C%glen.beane@jax.org>
References: <C74D5777.593C%glen.beane@jax.org>
Message-ID: <4B28320A.7040604@ldeo.columbia.edu>

Hi Glen, Jorg

Glen: Yes, you are right about MPICH1/P4 starting extra processes.
However, I wonder if that is what is happening to Jorg,
of if what he reported is just plain CPU oversubscription.

Jorg:  Do you use MPICH1/P4?
How many processes did you launch on a single node, four or five?

Glen:  Out of curiosity, I dug out the MPICH1/P4 I still have on an
old system, compiled and ran "cpi.c".
Indeed there are extra processes there, besides the ones that
I intentionally started in the mpirun command line.
When I launch two processes on a two-single-core-CPU machine,
I also get two (not only one) extra processes, in a total of four.

However, as you mentioned,
the extra processes do not seem to use any significant CPU.
Top shows the two actual processes close to 100% and the
extra ones close to zero.
Furthermore, the extra processes don't use any
significant memory either.

Anyway, in Jorg's case all processes consumed about
the same (low) amount of CPU, but ~15% memory each,
and there were 5 processes (only one "extra"?, is it one per CPU socket?
is it one per core? one per node?).
Hence, I would guess Jorg's context is different.
But ... who knows ... only Jorg can clarify.

These extra processes seem to be related to the
mechanism used by MPICH1/P4 to launch MPI programs.
They don't seem to appear in recent OpenMPI or MPICH2,
which have other launching mechanisms.
Hence my guess that Jorg had an oversubscription problem.

Considering that MPICH1/P4 is old, no longer maintained,
and seems to cause more distress than joy in current kernels,
I would not recommend it to Jorg or to anybody anyway.

Thank you,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Glen Beane wrote:
> 
> 
> 
> On 12/15/09 2:36 PM, "Gus Correa" <gus at ldeo.columbia.edu> wrote:
> 
>     If you have single quad core nodes as you said,
>     then top shows that you are oversubscribing the cores.
>     There are five nwchem processes are running.
> 
> 
> 
> It has been a very long time,  but wasn?t that normal behavior for mpich 
> under certain instances?  If I recall correctly it had an extra process 
> that was required by the implementation. I don?t think it returned from 
> MPI_Init, so you?d have a bunch of processes consuming nearly a full CPU 
> and then one that was mostly idle doing something behind the scenes.  I 
> don?t remember if this was for mpich/p4 (with or without 
> ?with-comm=shared) or for mpich-gm.
> 
> 
> 
> 
> -- 
> Glen L. Beane
> Software Engineer
> The Jackson Laboratory
> Phone (207) 288-6153
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


From patrick at myri.com  Tue Dec 15 20:05:26 2009
From: patrick at myri.com (Patrick Geoffray)
Date: Tue, 15 Dec 2009 23:05:26 -0500
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <c4d69730912131926g5c1f912at16471233b5f7b525@mail.gmail.com>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>	
	<4B23C791.5070701@myri.com>
	<c4d69730912131926g5c1f912at16471233b5f7b525@mail.gmail.com>
Message-ID: <4B285C86.10902@myri.com>

Rahul Nabar wrote:
> Thanks! So I could push it beyond 9000 as well?

1500 Bytes is the standard MTU for Ethernet, anything larger is out of 
spec. The convention for a larger MTU is Jumbo Frames at 9000 Bytes, and 
most switches support it these days. Some hardware even support Super 
Jumbo Frames at 64K but it's rare (and useless IMHO).

Since Jumbo Frames are out of spec, they are typically not enabled by 
default in switches.

Patrick


From patrick at myri.com  Tue Dec 15 20:26:23 2009
From: patrick at myri.com (Patrick Geoffray)
Date: Tue, 15 Dec 2009 23:26:23 -0500
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <c4d69730912141135m1ea166dfieaa9eff7bdd8893d@mail.gmail.com>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>	
	<4B23C791.5070701@myri.com>
	<c4d69730912141135m1ea166dfieaa9eff7bdd8893d@mail.gmail.com>
Message-ID: <4B28616F.3000600@myri.com>

Rahul Nabar wrote:
> The TSO and LRO are only relevant to TCP though, aren't they? I am
> using RDMA  so that shouldn't matter. Maybe I am wrong.

TSO/LRO applies to TCP, but you can have the same technique with 
different protocol, USO for UDP Send Offload for example.

RDMA is everything you want it to be, but it is not a wire protocol. 
Anyway, NICs that implement zero-copy communication are in a way 
offloading segmentation and reassembly too.

BTW, if your RDMA-capable hardware is running iWarp, it is using TCP on 
the wire.

Patrick


From patrick at myri.com  Tue Dec 15 20:32:38 2009
From: patrick at myri.com (Patrick Geoffray)
Date: Tue, 15 Dec 2009 23:32:38 -0500
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <c4d69730912141154r7e54d0beo2b73c32b8188c690@mail.gmail.com>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>	
	<4B23C791.5070701@myri.com>
	<c4d69730912141154r7e54d0beo2b73c32b8188c690@mail.gmail.com>
Message-ID: <4B2862E6.2030501@myri.com>

Rahul Nabar wrote:
> What was your tool to measure this latency? Just curious.

I like to use netperf to measure performance over Sockets, including 
latency (it's there but not obvious). For OS-bypass interfaces, your 
favorite MPI benchmark is fine.

Patrick


From patrick at myri.com  Tue Dec 15 20:41:59 2009
From: patrick at myri.com (Patrick Geoffray)
Date: Tue, 15 Dec 2009 23:41:59 -0500
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <c609bc800912140259q1b2c92f2v41dcb4a017406889@mail.gmail.com>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
	<c609bc800912140259q1b2c92f2v41dcb4a017406889@mail.gmail.com>
Message-ID: <4B286517.10009@myri.com>

Bogdan Costescu wrote:
> long as it fits in one page. At another time, the switch was more
> likely to drop large frames under high load (maybe something to do
> with internal memory management), so the 9000bytes frames worked most
> of the time while the 1500bytes ones worked all the time...

This is an important point. The way hardware flow-control works in 
Ethernet, a switch has to be able to buffer two full frames plus the 
time on the wire for the round-trip. For the curious, the PAUSEs packets 
are sent in-band and you cannot send or receive partial frames.
So, instead of requiring ~4K per port minimum, you need about ~20K per 
port. Add to that up to 8 priorities with DCB and the buffering 
requirement are quickly getting out of hand. That's one big drawbacks of 
large MTUs, along with contention with wormhole switching.

Patrick


From lindahl at pbm.com  Wed Dec 16 00:01:18 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Wed, 16 Dec 2009 00:01:18 -0800
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <4B286517.10009@myri.com>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
	<c609bc800912140259q1b2c92f2v41dcb4a017406889@mail.gmail.com>
	<4B286517.10009@myri.com>
Message-ID: <20091216080118.GB8679@bx9.net>

On Tue, Dec 15, 2009 at 11:41:59PM -0500, Patrick Geoffray wrote:

> So, instead of requiring ~4K per port minimum, you need about ~20K per  
> port. Add to that up to 8 priorities with DCB and the buffering  
> requirement are quickly getting out of hand.

Don't worry, switch vendors will simply implement it all poorly, just
like InfiniBand. That's what always happens with overly-complicated
QOS schemes.

-- greg


From rpnabar at gmail.com  Wed Dec 16 08:02:29 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Wed, 16 Dec 2009 10:02:29 -0600
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <4B286517.10009@myri.com>
References: <c4d69730912112259g61030249vccb772d967f84985@mail.gmail.com>
	<c609bc800912140259q1b2c92f2v41dcb4a017406889@mail.gmail.com>
	<4B286517.10009@myri.com>
Message-ID: <c4d69730912160802v4da9f164xa6faa6cabf60ed0e@mail.gmail.com>

On Tue, Dec 15, 2009 at 10:41 PM, Patrick Geoffray <patrick at myri.com> wrote:
> Bogdan Costescu wrote:

> So, instead of requiring ~4K per port minimum, you need about ~20K per port.
> Add to that up to 8 priorities with DCB and the buffering requirement are
> quickly getting out of hand. That's one big drawbacks of large MTUs, along
> with contention with wormhole switching.


On closer investigation I am seeing TXPause frames and dropped
packets. Have to dig deeper into this.Gotta figure  out how much RAM
per port this switch has.

-- 
Rahul


From lindahl at pbm.com  Wed Dec 16 09:33:05 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Wed, 16 Dec 2009 09:33:05 -0800
Subject: [Beowulf] Intel compiler part of the anti-trust lawsuit
Message-ID: <20091216173305.GA4233@bx9.net>

You folks will recall that Intel, a while ago, stopped enabling their
compiler's highest optimization levels for chips that weren't "Genuine
Intel(tm)". Well, that's part of the new FTC complaint against Intel:

 Intel secretly redesigned key software, known as a compiler, in a way
 that deliberately stunted the performance of competitors? CPU
 chips. Intel told its customers and the public that software performed
 better on Intel CPUs than on competitors? CPUs, but the company
 deceived them by failing to disclose that these differences were due
 largely or entirely to Intel?s compiler design.

PathScale was subpoenaed a long time ago by both AMD and Intel about
this issue for the AMD/Intel lawsuit, recently settled.

The bundling of chipsets with Atom processors (it's cheaper to buy
both than a naked cpu) seems to also be part of the suit.

-- greg


From gus at ldeo.columbia.edu  Wed Dec 16 12:03:34 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Wed, 16 Dec 2009 15:03:34 -0500
Subject: [Beowulf] A question about antique hardware
Message-ID: <4B293D16.5090704@ldeo.columbia.edu>

Dear Beowulfers

Did anybody ever get Gigabit Ethernet NICs to work on
the Tyan Tiger S2466-4M motherboards under Linux?

If so, I would appreciate any words of wisdom about which
NICs work, the appropriate BIOS settings,
which PCI slots to use, etc.

***

I flashed the Tyan S2466-4M BIOS to the latest version,
V4.06 (super, final 2003 edition).

I need to set this head node up with two GigE ports.
I have two Intel 82543 Fiber Gigabit Ethernet PCI adapters,
which use the e1000 driver.
However, I would happily use other NICs and drivers,
anything that works, including copper based GigE.

***

I googled up to find tips and solutions,
and I tried  a number of different combinations:
disabling the onboard 3Com Ethernet 100 port with a jumper;
placing the NICs on the PCI-64 and on the PCI-32 slots;
disabling the BIOS "option RAM scan" on the NICs' PCI slots;
disabling USB on BIOS;
trying one NIC at a time; etc.

However, so far no game.
The NICs are recognized,
link LEDs light up,
ping works,
but the system seems to be unstable,
ifdown/ifup hangs,
hence the system hangs when it tries to
take down the GigE ports during shutdown.

Moreover, I get many of this kernel message on dmesg:

Warning: kfree_skb on hard IRQ f88e47b2

***

This is the head node of our old Linux NetworX cluster.
The original head node motherboard, ASUS A7M266,
supported the aforementioned Intel NICs.
Unfortunately it seems to have died.

I bought an used-but-functional Tyan S2466-4M board
on E-Bay as a replacement.
These S2466-4M boards seem to have been very popular
on servers and Beowulfs.
It sounded to me as a good choice.
After all, we have this board on all compute nodes.
The compute nodes don't have GigE,
only the onboard 3Com Ethernet 100 for service and I/O,
plus Myrinet-2000 for MPI.
They have been working fine for 8 years now.

***

Thank you.

Happy Holidays!

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


From gus at ldeo.columbia.edu  Wed Dec 16 12:32:41 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Wed, 16 Dec 2009 15:32:41 -0500
Subject: [Beowulf] Intel compiler part of the anti-trust lawsuit
In-Reply-To: <20091216173305.GA4233@bx9.net>
References: <20091216173305.GA4233@bx9.net>
Message-ID: <4B2943E9.8080202@ldeo.columbia.edu>

Incidentally, have you watched the "cannonbells" movie?

http://www.reghardware.co.uk/2009/12/16/intel_chime_stunt/print.html
http://www.theregister.co.uk/2009/12/16/intel_ftc/

Gus Correa

Greg Lindahl wrote:
> You folks will recall that Intel, a while ago, stopped enabling their
> compiler's highest optimization levels for chips that weren't "Genuine
> Intel(tm)". Well, that's part of the new FTC complaint against Intel:
> 
>  Intel secretly redesigned key software, known as a compiler, in a way
>  that deliberately stunted the performance of competitors? CPU
>  chips. Intel told its customers and the public that software performed
>  better on Intel CPUs than on competitors? CPUs, but the company
>  deceived them by failing to disclose that these differences were due
>  largely or entirely to Intel?s compiler design.
> 
> PathScale was subpoenaed a long time ago by both AMD and Intel about
> this issue for the AMD/Intel lawsuit, recently settled.
> 
> The bundling of chipsets with Atom processors (it's cheaper to buy
> both than a naked cpu) seems to also be part of the suit.
> 
> -- greg
> 
> 
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


From lindahl at pbm.com  Wed Dec 16 13:26:34 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Wed, 16 Dec 2009 13:26:34 -0800
Subject: [Beowulf] A question about antique hardware
In-Reply-To: <4B293D16.5090704@ldeo.columbia.edu>
References: <4B293D16.5090704@ldeo.columbia.edu>
Message-ID: <20091216212634.GB29173@bx9.net>

> Did anybody ever get Gigabit Ethernet NICs to work on
> the Tyan Tiger S2466-4M motherboards under Linux?

You talked a lot about the BIOS but didn't say if it was a 6 year old
Linux version. Presumably your old mobos worked fine with the same
version of Linux that this guy is failing with, but still, it could
be a Linux-related problem, and not the motherboard or BIOS.

-- greg


From mathog at caltech.edu  Wed Dec 16 14:27:29 2009
From: mathog at caltech.edu (David Mathog)
Date: Wed, 16 Dec 2009 14:27:29 -0800
Subject: [Beowulf] Geriatric computer does not stay up
Message-ID: <E1NL2LB-0006Oa-8A@mendel.bio.caltech.edu>

So we have a cluster of Tyan S2466 nodes and one of them has failed in
an odd way. (Yes, these are very old, and they would be gone if we had a
replacment.)  On applying power the system boots normally and gets far
into the boot sequence, sometimes to the login prompt, then it locks up.
 If booted failsafe it will stay up for tens of minutes before locking.
 It locked once on "man smartctl" and once on "service network start". 
However, on the next reboot, it didn't lock with another "man smartctl",
so it isn't like it hit a bad part of the disk and died.  Smartctl test
has not been run, but "smartctl -a /dev/hda" on the one disk shows it as
healthy with no blocks swapped out.  Power stays on when it locks, and
the display remains as it was just before the lock.  When it locks it
will not respond to either the keyboard or the network.  (The network
interface light still flashes.)  There is nothing in any of the logs to
indicate the nature of the problem.

The odd thing is that the system is remarkably stable in some ways.  For
instance, the PS tests good and heat isn't the issue: after running
sensors in a tight loop to a log file, waiting for it to lock up, then
looking at the log on the next failsafe boot, there were negligible
fluctuation on any of the voltages, fan speeds, or temperatures.  It
will happily sit for 30 minutes in the BIOS, or hours running memtest86
(without errors).  The motherboard battery is good, and the inside of
the case is very clean, with no dust visible at all.  Reset the BIOS but
it didn't change anything.

Here are my current hypotheses for what's wrong with this beast:

1. The drive is failing electrically, puts voltage spikes out on some
operations, and these crash the system.
2. The motherboard capacitors are failing and letting too much noise in.
 The noise which is fatal is only seen on an active system, so sitting
in the BIOS or in Memtest86 does not do it. (But the caps all look good,
no swelling, no leaks.)  It will run memtest86 overnight though, just in
case.
3. The PS capacitors are failing, so that when loaded there is enough
voltage fluctuation to crash the system.  (Does not agree very well with
the sensors measurements, but it could be really high frequency noise
superimposed on a steady base voltage.)
4. Evil Djinn ;-(

Any thoughts on what else this might be? 

Thanks. 

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From prentice at ias.edu  Wed Dec 16 14:39:06 2009
From: prentice at ias.edu (Prentice Bisbal)
Date: Wed, 16 Dec 2009 17:39:06 -0500
Subject: [Beowulf] Intel Cannonbells
In-Reply-To: <4B2943E9.8080202@ldeo.columbia.edu>
References: <20091216173305.GA4233@bx9.net>
	<4B2943E9.8080202@ldeo.columbia.edu>
Message-ID: <4B29618A.3080808@ias.edu>

Gus Correa wrote:
> Incidentally, have you watched the "cannonbells" movie?
> 
> http://www.reghardware.co.uk/2009/12/16/intel_chime_stunt/print.html
> http://www.theregister.co.uk/2009/12/16/intel_ftc/
> 
>

I'm calling BS on that one. While possible, I suspect the human
projectiles were CG'ed in.

--
Prentice


From gerry.creager at tamu.edu  Wed Dec 16 15:46:54 2009
From: gerry.creager at tamu.edu (Gerald Creager)
Date: Wed, 16 Dec 2009 17:46:54 -0600
Subject: [Beowulf] Geriatric computer does not stay up
In-Reply-To: <E1NL2LB-0006Oa-8A@mendel.bio.caltech.edu>
References: <E1NL2LB-0006Oa-8A@mendel.bio.caltech.edu>
Message-ID: <4B29716E.4090600@tamu.edu>

David Mathog wrote:
> So we have a cluster of Tyan S2466 nodes and one of them has failed in
> an odd way. (Yes, these are very old, and they would be gone if we had a
> replacment.)  On applying power the system boots normally and gets far
> into the boot sequence, sometimes to the login prompt, then it locks up.
>  If booted failsafe it will stay up for tens of minutes before locking.
>  It locked once on "man smartctl" and once on "service network start". 
> However, on the next reboot, it didn't lock with another "man smartctl",
> so it isn't like it hit a bad part of the disk and died.  Smartctl test
> has not been run, but "smartctl -a /dev/hda" on the one disk shows it as
> healthy with no blocks swapped out.  Power stays on when it locks, and
> the display remains as it was just before the lock.  When it locks it
> will not respond to either the keyboard or the network.  (The network
> interface light still flashes.)  There is nothing in any of the logs to
> indicate the nature of the problem.
> 
> The odd thing is that the system is remarkably stable in some ways.  For
> instance, the PS tests good and heat isn't the issue: after running
> sensors in a tight loop to a log file, waiting for it to lock up, then
> looking at the log on the next failsafe boot, there were negligible
> fluctuation on any of the voltages, fan speeds, or temperatures.  It
> will happily sit for 30 minutes in the BIOS, or hours running memtest86
> (without errors).  The motherboard battery is good, and the inside of
> the case is very clean, with no dust visible at all.  Reset the BIOS but
> it didn't change anything.
> 
> Here are my current hypotheses for what's wrong with this beast:
> 
> 1. The drive is failing electrically, puts voltage spikes out on some
> operations, and these crash the system.
> 2. The motherboard capacitors are failing and letting too much noise in.
>  The noise which is fatal is only seen on an active system, so sitting
> in the BIOS or in Memtest86 does not do it. (But the caps all look good,
> no swelling, no leaks.)  It will run memtest86 overnight though, just in
> case.
> 3. The PS capacitors are failing, so that when loaded there is enough
> voltage fluctuation to crash the system.  (Does not agree very well with
> the sensors measurements, but it could be really high frequency noise
> superimposed on a steady base voltage.)
> 4. Evil Djinn ;-(
> 
> Any thoughts on what else this might be? 


I'd also be suspicious of memory failures.  We have had DIMM failures 
that were unseen on repeated MemTest86 runs until they failed hard, 
hard, HARD.  While they were still trying to decide, they'd pass MemTest 
and we'd try using them.

Capacitor failures are a potential problem but if the systems have been 
in a stable environment and not subject to a lot of thermal stressors, 
they should be fine.  Especially the power supply caps shouldn't decide 
to get old and fail (I'm assuming you're talking electrolytics).  The 
old paper electrolytics might have exhibited this behavior, but not even 
tantalums will do this.  And, if tantalum caps go, they tend to be more 
spectacular and take lots of other parts with them.

More to the point, (ceramic) chip caps that haven't been in a 
wet/moist/temp-varying, humid environment shouldn't crack and fail.

Option 4 has potential, though.

gc


From bernard at vanhpc.org  Wed Dec 16 16:13:08 2009
From: bernard at vanhpc.org (Bernard Li)
Date: Wed, 16 Dec 2009 16:13:08 -0800
Subject: [Beowulf] PXE/TFTP and Xen Kernel Issues
In-Reply-To: <4B280C20.5060700@georgetown.edu>
References: <4B280C20.5060700@georgetown.edu>
Message-ID: <d4c731da0912161613y39d6e5e8had796385cd4fff41@mail.gmail.com>

Hi Jess:

With Xen-based kernels, you should be using the xen.gz "kernel"
instead of vmlinuz.  Here's what a grub entry looks like for booting
Xen-based kernels:

title CentOS (2.6.18-164.el5xen)
	root (hd0,0)
	kernel /xen.gz-3.4.0
	module /vmlinuz-2.6.18-164.el5xen ro root=/dev/VolGroup00/LogVol00
	module /initrd-2.6.18-164.el5xen.img

Good luck!

Cheers,

Bernard

On Tue, Dec 15, 2009 at 2:22 PM, Jess Cannata <jac67 at georgetown.edu> wrote:
> I'm having a problem booting Xen kernels via PXE. I want to boot a machine
> via PXE that will then host Xen virtual machines. The client machine PXE
> boots, receives the pxelinux.0 file, and then grabs the Xen kernel
> (vmlinuz-2.6.18-164.6.1.el5xen). However, it can never load the Xen kernel.
> On the client, I get the following error:
>
> Invalid or corrupt kernel image.
>
> I have tried the following three kernels (two stock Centos kernels and one
> custom compiled kernel) and only the Xen kernel fails:
>
> -rw-r--r-- 1 root root 2030154 Dec 10 15:28 vmlinuz-2.6.18-164.6.1.el5xen
> -rw-r--r-- 1 root root 1932284 Sep 25 16:17 vmlinuz-2.6.18-164.el5
> -rw-r--r-- 1 root root 3277584 Dec 10 15:29 vmlinuz-2.6.27.15-jw-node
>
> The others load without error. I have checked multiple times that the Xen
> kernel is not corrupt via md5sums and by booting it via grub. It just seems
> not to like the PXE system. Here is a snippet of the dnsmasq log to show
> that the file is sent correctly to the client:
>
> Dec 11 04:12:57 julie dnsmasq[9117]: TFTP sent /tftpboot/pxelinux.0 to
> 192.168.0.6
> Dec 11 04:12:57 julie dnsmasq[9117]: TFTP sent
> /tftpboot/pxelinux.cfg/default to 192.168.0.6
> Dec 11 04:12:57 julie dnsmasq[9117]: TFTP sent
> /tftpboot/vmlinuz-2.6.18-164.6.1.el5xen to 192.168.0.6
>
> I have tried three different systems for the DHCP, TFTP, and PXE Servers
> (using stock RHEL/Centos packages). Here are the specs:
>
> System 1
> Centos 5.4 (64-bit) with nvidia Ethernet adapters
> dnsmasq for both DHCP and TFTP Servers
> syslinux for PXE
>
> System 2
> Centos 5.4 (64-bit) with e1000 Ethernet adapters
> dnsmasq for both DHCP and TFTP Servers
> syslinux for PXE
>
> System 3
> Centos 5.3 (32-bit) with e1000 Ethernet adapters (trying 32-bit version of
> the Xen kernel)
> Config One:
> dnsmasq for both DHCP and TFTP Servers
> syslinux for PXE
>
> Config Two:
> dnsmasq for DHCP Server
> tftp-server for TFTP Server
> syslinux for PXE
>
> The client machines use the same hardware as the servers. I haven't seen
> anything about Xen kernels having issues with PXE. Before I start trying
> different flavors of Linux, I'm curious if anyone else has seen or heard of
> this problem.
>
> Many thanks in advance.
>
> Jess
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>


From gus at ldeo.columbia.edu  Wed Dec 16 17:28:04 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Wed, 16 Dec 2009 20:28:04 -0500
Subject: [Beowulf] Geriatric computer does not stay up
In-Reply-To: <4B29716E.4090600@tamu.edu>
References: <E1NL2LB-0006Oa-8A@mendel.bio.caltech.edu>
	<4B29716E.4090600@tamu.edu>
Message-ID: <4B298924.3090805@ldeo.columbia.edu>

Hi David

Some of the built-in 3Com Ethernet 100 interfaces on
Tyan S2466[-4M] motherboards we have here became flaky/failed
after many years of use.
Those are main boards in in several standalone workstations/PCs.
I don't administer those systems, but I believe the symptoms
were somewhat random, as those you describe.

Disabling the onboard Ethernet (by jumper), and replacing them by
PCI Ethernet 100 cards, gave those systems additional lifetime.
Would this be the case of your cluster node?

Interesting that I also posted today a note asking for help with
Gigabit Ethernet on these very same motherboards!
We also have them in an old workhorse cluster.

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Gerald Creager wrote:
> David Mathog wrote:
>> So we have a cluster of Tyan S2466 nodes and one of them has failed in
>> an odd way. (Yes, these are very old, and they would be gone if we had a
>> replacment.)  On applying power the system boots normally and gets far
>> into the boot sequence, sometimes to the login prompt, then it locks up.
>>  If booted failsafe it will stay up for tens of minutes before locking.
>>  It locked once on "man smartctl" and once on "service network start". 
>> However, on the next reboot, it didn't lock with another "man smartctl",
>> so it isn't like it hit a bad part of the disk and died.  Smartctl test
>> has not been run, but "smartctl -a /dev/hda" on the one disk shows it as
>> healthy with no blocks swapped out.  Power stays on when it locks, and
>> the display remains as it was just before the lock.  When it locks it
>> will not respond to either the keyboard or the network.  (The network
>> interface light still flashes.)  There is nothing in any of the logs to
>> indicate the nature of the problem.
>>
>> The odd thing is that the system is remarkably stable in some ways.  For
>> instance, the PS tests good and heat isn't the issue: after running
>> sensors in a tight loop to a log file, waiting for it to lock up, then
>> looking at the log on the next failsafe boot, there were negligible
>> fluctuation on any of the voltages, fan speeds, or temperatures.  It
>> will happily sit for 30 minutes in the BIOS, or hours running memtest86
>> (without errors).  The motherboard battery is good, and the inside of
>> the case is very clean, with no dust visible at all.  Reset the BIOS but
>> it didn't change anything.
>>
>> Here are my current hypotheses for what's wrong with this beast:
>>
>> 1. The drive is failing electrically, puts voltage spikes out on some
>> operations, and these crash the system.
>> 2. The motherboard capacitors are failing and letting too much noise in.
>>  The noise which is fatal is only seen on an active system, so sitting
>> in the BIOS or in Memtest86 does not do it. (But the caps all look good,
>> no swelling, no leaks.)  It will run memtest86 overnight though, just in
>> case.
>> 3. The PS capacitors are failing, so that when loaded there is enough
>> voltage fluctuation to crash the system.  (Does not agree very well with
>> the sensors measurements, but it could be really high frequency noise
>> superimposed on a steady base voltage.)
>> 4. Evil Djinn ;-(
>>
>> Any thoughts on what else this might be? 
> 
> 
> I'd also be suspicious of memory failures.  We have had DIMM failures 
> that were unseen on repeated MemTest86 runs until they failed hard, 
> hard, HARD.  While they were still trying to decide, they'd pass MemTest 
> and we'd try using them.
> 
> Capacitor failures are a potential problem but if the systems have been 
> in a stable environment and not subject to a lot of thermal stressors, 
> they should be fine.  Especially the power supply caps shouldn't decide 
> to get old and fail (I'm assuming you're talking electrolytics).  The 
> old paper electrolytics might have exhibited this behavior, but not even 
> tantalums will do this.  And, if tantalum caps go, they tend to be more 
> spectacular and take lots of other parts with them.
> 
> More to the point, (ceramic) chip caps that haven't been in a 
> wet/moist/temp-varying, humid environment shouldn't crack and fail.
> 
> Option 4 has potential, though.
> 
> gc
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit 
> http://www.beowulf.org/mailman/listinfo/beowulf


From lindahl at pbm.com  Wed Dec 16 18:05:48 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Wed, 16 Dec 2009 18:05:48 -0800
Subject: [Beowulf] Kernel action relevant to us
Message-ID: <20091217020548.GC19867@bx9.net>

The following patch, not yet accepted into the kernel, should allow
local TCP connections to start up faster, while remote ones keep the
same behavior of slow start.

----- Forwarded message from chavey at google.com -----

From: chavey at google.com
Date: Tue, 15 Dec 2009 13:15:28 -0800
To: davem at davemloft.net
CC: netdev at vger.kernel.org, therbert at google.com, chavey at google.com,
	eric.dumazet at gmail.com
Subject: [PATCH] Add rtnetlink init_rcvwnd to set the TCP initial receive window
X-Mailing-List: netdev at vger.kernel.org

Add rtnetlink init_rcvwnd to set the TCP initial receive window size
advertised by passive and active TCP connections.
The current Linux TCP implementation limits the advertised TCP initial
receive window to the one prescribed by slow start. For short lived
TCP connections used for transaction type of traffic (i.e. http
requests), bounding the advertised TCP initial receive window results
in increased latency to complete the transaction.
Support for setting initial congestion window is already supported
using rtnetlink init_cwnd, but the feature is useless without the
ability to set a larger TCP initial receive window.
The rtnetlink init_rcvwnd allows increasing the TCP initial receive
window, allowing TCP connection to advertise larger TCP receive window
than the ones bounded by slow start.

Signed-off-by: Laurent Chavey <chavey at google.com>
---
 include/linux/rtnetlink.h |    2 ++
 include/net/dst.h         |    2 --
 include/net/tcp.h         |    3 ++-
 net/ipv4/syncookies.c     |    3 ++-
 net/ipv4/tcp_output.c     |   17 +++++++++++++----
 net/ipv6/syncookies.c     |    3 ++-
 6 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index adf2068..db6f614 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -371,6 +371,8 @@ enum
 #define RTAX_FEATURES RTAX_FEATURES
 	RTAX_RTO_MIN,
 #define RTAX_RTO_MIN RTAX_RTO_MIN
+	RTAX_INITRWND,
+#define RTAX_INITRWND RTAX_INITRWND
 	__RTAX_MAX
 };
 
diff --git a/include/net/dst.h b/include/net/dst.h
index 5a900dd..6ef812a 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -84,8 +84,6 @@ struct dst_entry
 	 * (L1_CACHE_SIZE would be too much)
 	 */
 #ifdef CONFIG_64BIT
-	long			__pad_to_align_refcnt[2];
-#else
 	long			__pad_to_align_refcnt[1];
 #endif
 	/*
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 03a49c7..6f95d32 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -972,7 +972,8 @@ static inline void tcp_sack_reset(struct tcp_options_received *rx_opt)
 /* Determine a window scaling and initial window to offer. */
 extern void tcp_select_initial_window(int __space, __u32 mss,
 				      __u32 *rcv_wnd, __u32 *window_clamp,
-				      int wscale_ok, __u8 *rcv_wscale);
+				      int wscale_ok, __u8 *rcv_wscale,
+				      __u32 init_rcv_wnd);
 
 static inline int tcp_win_from_space(int space)
 {
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index a6e0e07..d43173c 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -356,7 +356,8 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 
 	tcp_select_initial_window(tcp_full_space(sk), req->mss,
 				  &req->rcv_wnd, &req->window_clamp,
-				  ireq->wscale_ok, &rcv_wscale);
+				  ireq->wscale_ok, &rcv_wscale,
+				  dst_metric(&rt->u.dst, RTAX_INITRWND));
 
 	ireq->rcv_wscale  = rcv_wscale;
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fcd278a..ee42c75 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -179,7 +179,8 @@ static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts)
  */
 void tcp_select_initial_window(int __space, __u32 mss,
 			       __u32 *rcv_wnd, __u32 *window_clamp,
-			       int wscale_ok, __u8 *rcv_wscale)
+			       int wscale_ok, __u8 *rcv_wscale,
+			       __u32 init_rcv_wnd)
 {
 	unsigned int space = (__space < 0 ? 0 : __space);
 
@@ -228,7 +229,13 @@ void tcp_select_initial_window(int __space, __u32 mss,
 			init_cwnd = 2;
 		else if (mss > 1460)
 			init_cwnd = 3;
-		if (*rcv_wnd > init_cwnd * mss)
+		/* when initializing use the value from init_rcv_wnd
+		 * rather than the default from above
+		 */
+		if (init_rcv_wnd &&
+		    (*rcv_wnd > init_rcv_wnd * mss))
+			*rcv_wnd = init_rcv_wnd * mss;
+		else if (*rcv_wnd > init_cwnd * mss)
 			*rcv_wnd = init_cwnd * mss;
 	}
 
@@ -2254,7 +2261,8 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 			&req->rcv_wnd,
 			&req->window_clamp,
 			ireq->wscale_ok,
-			&rcv_wscale);
+			&rcv_wscale,
+			dst_metric(dst, RTAX_INITRWND));
 		ireq->rcv_wscale = rcv_wscale;
 	}
 
@@ -2342,7 +2350,8 @@ static void tcp_connect_init(struct sock *sk)
 				  &tp->rcv_wnd,
 				  &tp->window_clamp,
 				  sysctl_tcp_window_scaling,
-				  &rcv_wscale);
+				  &rcv_wscale,
+				  dst_metric(dst, RTAX_INITRWND));
 
 	tp->rx_opt.rcv_wscale = rcv_wscale;
 	tp->rcv_ssthresh = tp->rcv_wnd;
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 6b6ae91..c8982aa 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -267,7 +267,8 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	req->window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW);
 	tcp_select_initial_window(tcp_full_space(sk), req->mss,
 				  &req->rcv_wnd, &req->window_clamp,
-				  ireq->wscale_ok, &rcv_wscale);
+				  ireq->wscale_ok, &rcv_wscale,
+				  dst_metric(dst, RTAX_INITRWND));
 
 	ireq->rcv_wscale = rcv_wscale;
 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

----- End forwarded message -----


From h-bugge at online.no  Thu Dec 17 00:33:36 2009
From: h-bugge at online.no (=?ISO-8859-1?Q?H=E5kon_Bugge?=)
Date: Thu, 17 Dec 2009 09:33:36 +0100
Subject: [Beowulf] FTC and Intel
Message-ID: <1E5EA6E5-75A5-486C-9B7E-E98B78BF5DD6@online.no>

In http://www.ftc.gov/os/adjpro/d9341/091216intelcmpt.pdf, there are  
allegations against Intel, such as "20. Intel?s efforts to deny  
interoperability between competitors? (e.g., Nvidia, AMD, and Via)
GPUs and Intel?s newest CPUs".

I was unaware of this. Anyone know what kind of interoperability we  
are talking about here?


H?kon


From bcostescu at gmail.com  Thu Dec 17 10:07:42 2009
From: bcostescu at gmail.com (Bogdan Costescu)
Date: Thu, 17 Dec 2009 19:07:42 +0100
Subject: [Beowulf] A question about antique hardware
In-Reply-To: <4B293D16.5090704@ldeo.columbia.edu>
References: <4B293D16.5090704@ldeo.columbia.edu>
Message-ID: <c609bc800912171007v5fbfcd67k54d94e6ed160cdfc@mail.gmail.com>

On Wed, Dec 16, 2009 at 9:03 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
> Did anybody ever get Gigabit Ethernet NICs to work on
> the Tyan Tiger S2466-4M motherboards under Linux?

I had a few nodes with this mainboard and used one Intel 82541 (not
sure about the number though, it's a short card which supports both
PCI-32 and PCI-64, Gbit over copper) in each. The nodes were usually
stable under load-moderate load, but would not survive a combined
net+disk load if I was to use them f.e. as NFS servers. Generally
speaking, I had the impression that the interrupt handling on these
mainboards was faulty; plus maybe also the power regulation... Nodes
with these mainboards were the most unstable that I've ever had so,
for reduced headache, I would suggest just going with something more
recent.

Cheers,
Bogdan


From mathog at caltech.edu  Thu Dec 17 10:32:15 2009
From: mathog at caltech.edu (David Mathog)
Date: Thu, 17 Dec 2009 10:32:15 -0800
Subject: [Beowulf] Re: Geriatric computer does not stay up
Message-ID: <E1NLL95-0006za-RK@mendel.bio.caltech.edu>

Gus Correa <gus at ldeo.columbia.edu> wrote

> Some of the built-in 3Com Ethernet 100 interfaces on
> Tyan S2466[-4M] motherboards we have here became flaky/failed
> after many years of use.
> Those are main boards in in several standalone workstations/PCs.
> I don't administer those systems, but I believe the symptoms
> were somewhat random, as those you describe.
> 
> Disabling the onboard Ethernet (by jumper), and replacing them by
> PCI Ethernet 100 cards, gave those systems additional lifetime.
> Would this be the case of your cluster node?

Tyan S2466 MPX does not seem to have such a jumper.  Possibly it can be
disabled in the BIOS. Oddly, the system is fine PXE booting over that
interface, but every attempt at:

  service network start

hangs instantly.  Tried booting with a serial console like this from
pxelinux.cfg:

LABEL serial
  KERNEL vmlinuz-2.6.24.7-desktop-2mnb
  APPEND initrd=initrd-2.6.24.7-desktop-2mnb.img root=/dev/hda3 failsafe
 console=ttyS0,38400

which uses the initrd and vmlinuz downloaded from the server, and the
disk from the iffy machine for the programs.  That booted fine, but the
kernel emitted nothing on the serial line when the machine hung.

Running smartctl now, after that will boot a rescue linux and see if
that too has network issues.  

Ran memory tests for over 20 hours without a single hiccup.

I'll keep looking.  Thanks all.

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From gus at ldeo.columbia.edu  Thu Dec 17 11:39:48 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Thu, 17 Dec 2009 14:39:48 -0500
Subject: [Beowulf] Re: Geriatric computer does not stay up
In-Reply-To: <E1NLL95-0006za-RK@mendel.bio.caltech.edu>
References: <E1NLL95-0006za-RK@mendel.bio.caltech.edu>
Message-ID: <4B2A8904.3010002@ldeo.columbia.edu>

Hi David

There is more than one version of this motherboard,
hence I may be dead wrong about yours.

In any case, I have the mobo user manual here.
You can download it from Tyan also (with an appendix):

http://www.tyan.com/archive/products/html/tigermpx.html
ftp://ftp.tyan.com/manuals/m_s2466_120.pdf
ftp://ftp.tyan.com/manuals/a_s2466_110.pdf

The manual says that jumper J86 disables/enables
the onboard "LAN" (3Com 3C905c).
Onboard Ethernet is *disabled* when the jumper is *closed*.
(It is probably open now, no jumper,
assuming onboard Ethernet is currently enabled.)

J86 is next to the motherboard edge, near PCI slot #4,
probably near the back of the node case.

Also, it is unclear whether PXE boot will work with a PCI NIC,
but it may.

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

David Mathog wrote:
> Gus Correa <gus at ldeo.columbia.edu> wrote
> 
>> Some of the built-in 3Com Ethernet 100 interfaces on
>> Tyan S2466[-4M] motherboards we have here became flaky/failed
>> after many years of use.
>> Those are main boards in in several standalone workstations/PCs.
>> I don't administer those systems, but I believe the symptoms
>> were somewhat random, as those you describe.
>>
>> Disabling the onboard Ethernet (by jumper), and replacing them by
>> PCI Ethernet 100 cards, gave those systems additional lifetime.
>> Would this be the case of your cluster node?
> 
> Tyan S2466 MPX does not seem to have such a jumper.  Possibly it can be
> disabled in the BIOS. Oddly, the system is fine PXE booting over that
> interface, but every attempt at:
> 
>   service network start
> 
> hangs instantly.  Tried booting with a serial console like this from
> pxelinux.cfg:
> 
> LABEL serial
>   KERNEL vmlinuz-2.6.24.7-desktop-2mnb
>   APPEND initrd=initrd-2.6.24.7-desktop-2mnb.img root=/dev/hda3 failsafe
>  console=ttyS0,38400
> 
> which uses the initrd and vmlinuz downloaded from the server, and the
> disk from the iffy machine for the programs.  That booted fine, but the
> kernel emitted nothing on the serial line when the machine hung.
> 
> Running smartctl now, after that will boot a rescue linux and see if
> that too has network issues.  
> 
> Ran memory tests for over 20 hours without a single hiccup.
> 
> I'll keep looking.  Thanks all.
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


From cousins at umit.maine.edu  Thu Dec 17 11:55:52 2009
From: cousins at umit.maine.edu (Steve Cousins)
Date: Thu, 17 Dec 2009 14:55:52 -0500 (EST)
Subject: [Beowulf] Re: A question about antique hardware
In-Reply-To: <200912170130.nBH1Uvvr016863@bluewest.scyld.com>
References: <200912170130.nBH1Uvvr016863@bluewest.scyld.com>
Message-ID: <alpine.LFD.2.00.0912171443520.24375@razzo.umeoce.maine.edu>


Gus Correa wrote:

> Dear Beowulfers
>
> Did anybody ever get Gigabit Ethernet NICs to work on
> the Tyan Tiger S2466-4M motherboards under Linux?

Hi Gus,

I have a S2466 Tiger MPX board that has been running for years with a 
copper Intel Pro/1000 MT NIC (82540EM) without trouble. The onboard 3Com 
3c905C is running too and I've never had any problems with it (that I can 
remember!). The Gigabit card is using the e1000 driver and the 3Com is 
using 3c59x with a 2.4.20-20.7smp kernel with Redhat 7.3. (yikes!) As you 
can see it is a machine that I have pretty much forgotten about but it is 
closed off to the internet and it still serves a purpose.

I hope this helps.

Steve


> If so, I would appreciate any words of wisdom about which
> NICs work, the appropriate BIOS settings,
> which PCI slots to use, etc.
>
> ***
>
> I flashed the Tyan S2466-4M BIOS to the latest version,
> V4.06 (super, final 2003 edition).
>
> I need to set this head node up with two GigE ports.
> I have two Intel 82543 Fiber Gigabit Ethernet PCI adapters,
> which use the e1000 driver.
> However, I would happily use other NICs and drivers,
> anything that works, including copper based GigE.
>
> ***
>
> I googled up to find tips and solutions,
> and I tried  a number of different combinations:
> disabling the onboard 3Com Ethernet 100 port with a jumper;
> placing the NICs on the PCI-64 and on the PCI-32 slots;
> disabling the BIOS "option RAM scan" on the NICs' PCI slots;
> disabling USB on BIOS;
> trying one NIC at a time; etc.
>
> However, so far no game.
> The NICs are recognized,
> link LEDs light up,
> ping works,
> but the system seems to be unstable,
> ifdown/ifup hangs,
> hence the system hangs when it tries to
> take down the GigE ports during shutdown.
>
> Moreover, I get many of this kernel message on dmesg:
>
> Warning: kfree_skb on hard IRQ f88e47b2
>
> ***
>
> This is the head node of our old Linux NetworX cluster.
> The original head node motherboard, ASUS A7M266,
> supported the aforementioned Intel NICs.
> Unfortunately it seems to have died.
>
> I bought an used-but-functional Tyan S2466-4M board
> on E-Bay as a replacement.
> These S2466-4M boards seem to have been very popular
> on servers and Beowulfs.
> It sounded to me as a good choice.
> After all, we have this board on all compute nodes.
> The compute nodes don't have GigE,
> only the onboard 3Com Ethernet 100 for service and I/O,
> plus Myrinet-2000 for MPI.
> They have been working fine for 8 years now.
>
> ***
>
> Thank you.
>
> Happy Holidays!
>
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------


From jac67 at georgetown.edu  Thu Dec 17 13:05:46 2009
From: jac67 at georgetown.edu (Jess Cannata)
Date: Thu, 17 Dec 2009 16:05:46 -0500
Subject: [Beowulf] PXE/TFTP and Xen Kernel Issues
In-Reply-To: <d4c731da0912161613y39d6e5e8had796385cd4fff41@mail.gmail.com>
References: <4B280C20.5060700@georgetown.edu>
	<d4c731da0912161613y39d6e5e8had796385cd4fff41@mail.gmail.com>
Message-ID: <4B2A9D2A.2060104@georgetown.edu>

Thanks Bernard! This is what I was looking for.

Jess

On 12/16/2009 07:13 PM, Bernard Li wrote:
> Hi Jess:
>
> With Xen-based kernels, you should be using the xen.gz "kernel"
> instead of vmlinuz.  Here's what a grub entry looks like for booting
> Xen-based kernels:
>
> title CentOS (2.6.18-164.el5xen)
> 	root (hd0,0)
> 	kernel /xen.gz-3.4.0
> 	module /vmlinuz-2.6.18-164.el5xen ro root=/dev/VolGroup00/LogVol00
> 	module /initrd-2.6.18-164.el5xen.img
>
> Good luck!
>
> Cheers,
>
> Bernard
>
> On Tue, Dec 15, 2009 at 2:22 PM, Jess Cannata<jac67 at georgetown.edu>  wrote:
>    
>> I'm having a problem booting Xen kernels via PXE. I want to boot a machine
>> via PXE that will then host Xen virtual machines. The client machine PXE
>> boots, receives the pxelinux.0 file, and then grabs the Xen kernel
>> (vmlinuz-2.6.18-164.6.1.el5xen). However, it can never load the Xen kernel.
>> On the client, I get the following error:
>>
>> Invalid or corrupt kernel image.
>>
>> I have tried the following three kernels (two stock Centos kernels and one
>> custom compiled kernel) and only the Xen kernel fails:
>>
>> -rw-r--r-- 1 root root 2030154 Dec 10 15:28 vmlinuz-2.6.18-164.6.1.el5xen
>> -rw-r--r-- 1 root root 1932284 Sep 25 16:17 vmlinuz-2.6.18-164.el5
>> -rw-r--r-- 1 root root 3277584 Dec 10 15:29 vmlinuz-2.6.27.15-jw-node
>>
>> The others load without error. I have checked multiple times that the Xen
>> kernel is not corrupt via md5sums and by booting it via grub. It just seems
>> not to like the PXE system. Here is a snippet of the dnsmasq log to show
>> that the file is sent correctly to the client:
>>
>> Dec 11 04:12:57 julie dnsmasq[9117]: TFTP sent /tftpboot/pxelinux.0 to
>> 192.168.0.6
>> Dec 11 04:12:57 julie dnsmasq[9117]: TFTP sent
>> /tftpboot/pxelinux.cfg/default to 192.168.0.6
>> Dec 11 04:12:57 julie dnsmasq[9117]: TFTP sent
>> /tftpboot/vmlinuz-2.6.18-164.6.1.el5xen to 192.168.0.6
>>
>> I have tried three different systems for the DHCP, TFTP, and PXE Servers
>> (using stock RHEL/Centos packages). Here are the specs:
>>
>> System 1
>> Centos 5.4 (64-bit) with nvidia Ethernet adapters
>> dnsmasq for both DHCP and TFTP Servers
>> syslinux for PXE
>>
>> System 2
>> Centos 5.4 (64-bit) with e1000 Ethernet adapters
>> dnsmasq for both DHCP and TFTP Servers
>> syslinux for PXE
>>
>> System 3
>> Centos 5.3 (32-bit) with e1000 Ethernet adapters (trying 32-bit version of
>> the Xen kernel)
>> Config One:
>> dnsmasq for both DHCP and TFTP Servers
>> syslinux for PXE
>>
>> Config Two:
>> dnsmasq for DHCP Server
>> tftp-server for TFTP Server
>> syslinux for PXE
>>
>> The client machines use the same hardware as the servers. I haven't seen
>> anything about Xen kernels having issues with PXE. Before I start trying
>> different flavors of Linux, I'm curious if anyone else has seen or heard of
>> this problem.
>>
>> Many thanks in advance.
>>
>> Jess
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>>      


From gus at ldeo.columbia.edu  Thu Dec 17 13:17:31 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Thu, 17 Dec 2009 16:17:31 -0500
Subject: [Beowulf] Re: Geriatric computer does not stay up
In-Reply-To: <E1NLMsV-000730-Kn@mendel.bio.caltech.edu>
References: <E1NLMsV-000730-Kn@mendel.bio.caltech.edu>
Message-ID: <4B2A9FEB.4080103@ldeo.columbia.edu>

Hi David

A PCI riser card may help:

http://www.logicsupply.com/categories/accessories/pci_riser_cards
http://www.plinkusa.net/riser.htm

We have a single riser card on our old cluster compute nodes for the
bulky Myrinet-2000 cards (same S2466 motherboard, 2U chassis).

However, we don't have a graphics card on the compute nodes.
I wonder how you can fit both (if graphics is needed).
Maybe with two riser cards of different heights.

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

David Mathog wrote:
>> The manual says that jumper J86 disables/enables
>> the onboard "LAN" (3Com 3C905c).
>> Onboard Ethernet is *disabled* when the jumper is *closed*.
>> (It is probably open now, no jumper,
>> assuming onboard Ethernet is currently enabled.)
> 
> You are right.  The jumper was under the graphics card, which was
> mounted sideways in an AGP to PCI adapter.  (No idea why they didn't
> just put it in a PCI slot in the first place.)  Hopefully disabling the
> onboard NIC will do it, because the PS and disk have been eliminated as
> possible trouble spots.
> 
> 
>> Also, it is unclear whether PXE boot will work with a PCI NIC,
>> but it may.
> 
> First got to find a half height NIC to fit in the case, then I'll let
> you know.  
> 
> Putting that jumper on did eliminate the onboard PXE part of the boot
> sequence.
> 
> Thanks,
> 
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech


From mdidomenico4 at gmail.com  Fri Dec 18 05:40:47 2009
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Fri, 18 Dec 2009 08:40:47 -0500
Subject: [Beowulf] tg3 driver and rx dropped packets
Message-ID: <e75d22a90912180540qc37650dw3bbd8691e1d56de8@mail.gmail.com>

Perhaps my brain is already checking out for the holidays, perhaps
someone might be able to shed some light...

I have several Dell t3500 workstations which we've installed PCI-X
BroadCom cards specifically the BCM5703 Fiber cards.

We're also using RedHat v5.4 2.6.18-164.6.1 kernel with all the stock drivers.

For some reason which i cannot determine, when we read a large amount
of data into the workstation we see the RX dropped counter steadily
(rapidly) increase, eventually locking the TCP transmissions, which
results in an aborted file operation.

This only happens on reads, if we do a write operation we do not see TX drops.

We've tested this between all types of devices (nfs, http) servers at
various points in the network and the only common thing is the NIC.

I'd think it was just a bad nic, but we have several nics across t3500
and t3400's doing this.

Is anyone aware of such an issue?  Can anyone recommend some steps i
can take to isolate why the packets are being dropped?

I hooked up wireshark on one of the servers while we were running the
test and i see a lot of Duplicate ACK and TCP Checksum errors, in the
communications between the two hosts.  But im not sure that actually
points to anything.

Thanks


From bcostescu at gmail.com  Fri Dec 18 06:00:34 2009
From: bcostescu at gmail.com (Bogdan Costescu)
Date: Fri, 18 Dec 2009 15:00:34 +0100
Subject: [Beowulf] tg3 driver and rx dropped packets
In-Reply-To: <e75d22a90912180540qc37650dw3bbd8691e1d56de8@mail.gmail.com>
References: <e75d22a90912180540qc37650dw3bbd8691e1d56de8@mail.gmail.com>
Message-ID: <c609bc800912180600j4dd4e779n14cb513abf061e18@mail.gmail.com>

On Fri, Dec 18, 2009 at 2:40 PM, Michael Di Domenico
<mdidomenico4 at gmail.com> wrote:
> For some reason which i cannot determine, when we read a large amount
> of data into the workstation we see the RX dropped counter steadily
> (rapidly) increase, eventually locking the TCP transmissions, which
> results in an aborted file operation.

Try turning off rx-checksumming and/or TSO. You can find it you have
it enabled with:

ethtool -k eth0

and turn it on/off with:

ethtool -K eth0 tso off

Cheers,
Bogdan


From lindahl at pbm.com  Fri Dec 18 11:36:35 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Fri, 18 Dec 2009 11:36:35 -0800
Subject: [Beowulf] tg3 driver and rx dropped packets
In-Reply-To: <e75d22a90912180540qc37650dw3bbd8691e1d56de8@mail.gmail.com>
References: <e75d22a90912180540qc37650dw3bbd8691e1d56de8@mail.gmail.com>
Message-ID: <20091218193635.GA14097@bx9.net>

On Fri, Dec 18, 2009 at 08:40:47AM -0500, Michael Di Domenico wrote:

> I hooked up wireshark on one of the servers while we were running the
> test and i see a lot of Duplicate ACK and TCP Checksum errors, in the
> communications between the two hosts.  But im not sure that actually
> points to anything.

Well, it points to there being a significant problem. Packets are
protected on the wire by a strong checksum, and so if there's
corruption, it should be detected there. If that checksum is correct
but the weak TCP checksum is wrong, that means something corrupted the
packet in the host, for example a bad PCI card.

The TCP checksum is so weak that if you see a lot of errors detected,
you probably have some undetected errors sneaking through.

-- greg


From mathog at caltech.edu  Fri Dec 18 16:12:03 2009
From: mathog at caltech.edu (David Mathog)
Date: Fri, 18 Dec 2009 16:12:03 -0800
Subject: [Beowulf] Re: Geriatric computer does not stay up
Message-ID: <E1NLmvT-00004W-II@mendel.bio.caltech.edu>

No Joy.

Disabled the on board NIC and plugged in a couple of different NICs in
different PCI slots - and they were all just as bad as the one on the
motherboard.  Let it rest unplugged overnight (just in case it was a
stuck bit) and it didn't recover.  Loading the PCI graphics card also
crashed it.  Seems like a failing circuit in the chipset in or around
the PCI bridge.  First time I've seen that failure.  Replaced that
motherboard with an even older one, a tiny bit slower, but still working.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From peter.st.john at gmail.com  Sat Dec 19 13:20:48 2009
From: peter.st.john at gmail.com (Peter St. John)
Date: Sat, 19 Dec 2009 16:20:48 -0500
Subject: [Beowulf] FTC and Intel
In-Reply-To: <1E5EA6E5-75A5-486C-9B7E-E98B78BF5DD6@online.no>
References: <1E5EA6E5-75A5-486C-9B7E-E98B78BF5DD6@online.no>
Message-ID: <e4d4fd070912191320h1883f69bv3cc8e934ae81fcc4@mail.gmail.com>

Haakon,
What I saw (in a recent Slashdot, which I ought to be able to find) was the
idea that an Intel compiler disables it's own highest levels of
optimizations if it detects that the host processor is not Intel. The
complaint is based on that I believe.

FWIW, I imagine that if an automobile engine detected poor octane in the
fuel, it might throttle down the maximum speed of the car; but if it did so
after detecting a competitor's brand of gasoline, it could be considered
anti-competitive. But of course IMNAL or however we announce we ain't
lawyers so CGS (cum grano salis).
Peter

On Thu, Dec 17, 2009 at 3:33 AM, H?kon Bugge <h-bugge at online.no> wrote:

> In http://www.ftc.gov/os/adjpro/d9341/091216intelcmpt.pdf, there are
> allegations against Intel, such as "20. Intel?s efforts to deny
> interoperability between competitors? (e.g., Nvidia, AMD, and Via)
> GPUs and Intel?s newest CPUs".
>
> I was unaware of this. Anyone know what kind of interoperability we are
> talking about here?
>
>
> H?kon
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091219/1805a443/attachment.html>

From sassy1 at gmx.de  Tue Dec 15 14:22:11 2009
From: sassy1 at gmx.de (=?iso-8859-1?q?J=F6rg_Sa=DFmannshausen?=)
Date: Tue, 15 Dec 2009 22:22:11 +0000
Subject: [Beowulf] Performance degrading
In-Reply-To: <200912152000.nBFK05ZA010546@bluewest.scyld.com>
References: <200912152000.nBFK05ZA010546@bluewest.scyld.com>
Message-ID: <200912152222.11820.sassy1@gmx.de>

Hi Gus,

thanks for your comments. The problem is not that there are 5 NWChem running. 
I am only starting 4 processes and the additional one is the master which 
does nothing more but coordinating the slaves. 

Other parts of the program behaving more as you expect it (again parallel 
between nodes, taken from one node):
14902 sassy     25   0 2161m 325m 124m R  100  2.8  14258:15 nwchem
14903 sassy     25   0 2169m 335m 128m R  100  2.9  14231:15 nwchem
14901 sassy     25   0 2177m 338m 133m R  100  2.9  14277:23 nwchem
14904 sassy     25   0 2161m 333m 132m R   97  2.9  14213:44 nwchem
14906 sassy     15   0  978m  71m  69m S    3  0.6 582:57.22 nwchem

As you can see, there are 5 NWChem running but the fifth one does very little. 
So for me it looks like that the internode communication is a problem here and 
I would like to pin that down.

For example, on the new dual quadcore I can get:
13555 sassy     20   0 2073m 212m 113m R  100  0.9 367:57.27 nwchem
13556 sassy     20   0 2074m 209m 109m R  100  0.9 369:11.21 nwchem
13557 sassy     20   0 2074m 206m 107m R  100  0.9 369:13.76 nwchem
13558 sassy     20   0 2072m 203m 103m R  100  0.8 368:18.53 nwchem
13559 sassy     20   0 2072m 178m  78m R  100  0.7 369:11.49 nwchem
13560 sassy     20   0 2072m 172m  73m R  100  0.7 369:14.35 nwchem
13561 sassy     20   0 2074m 171m  72m R  100  0.7 369:12.34 nwchem
13562 sassy     20   0 2072m 170m  72m R  100  0.7 368:56.30 nwchem
So here there is no internode communication and hence I get the performance I 
would expect.

The main problem is I am no longer the administrator of that cluster so 
anything which requires root access is not possible for me :-(

But thanks for your 2 cent! :-)

All the best

J?rg

Am Dienstag 15 Dezember 2009 schrieb beowulf-request at beowulf.org:
> Hi Jorg
>
> If you have single quad core nodes as you said,
> then top shows that you are oversubscribing the cores.
> There are five nwchem processes are running.
>
> In my experience, oversubscription only works in relatively
> light MPI programs (say the example programs that come with OpenMPI or
> MPICH).
> Real world applications tend to be very inefficient,
> and can even hang on oversubscribed CPUs.
>
> What happens when you launch four or less processes
> on a node instead of five?
>
> My $0.02.
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------


From jorg.sassmannshausen at strath.ac.uk  Tue Dec 15 14:31:36 2009
From: jorg.sassmannshausen at strath.ac.uk (=?iso-8859-1?q?J=F6rg_Sa=DFmannshausen?=)
Date: Tue, 15 Dec 2009 22:31:36 +0000
Subject: [Beowulf] Performance degrading
Message-ID: <200912152231.36348.jorg.sassmannshausen@strath.ac.uk>

Hi Gus,

thanks for your comments. The problem is not that there are 5 NWChem running. 
I am only starting 4 processes and the additional one is the master which 
does nothing more but coordinating the slaves. 

Other parts of the program behaving more as you expect it (again parallel 
between nodes, taken from one node):
14902 sassy     25   0 2161m 325m 124m R  100  2.8  14258:15 nwchem
14903 sassy     25   0 2169m 335m 128m R  100  2.9  14231:15 nwchem
14901 sassy     25   0 2177m 338m 133m R  100  2.9  14277:23 nwchem
14904 sassy     25   0 2161m 333m 132m R   97  2.9  14213:44 nwchem
14906 sassy     15   0  978m  71m  69m S    3  0.6 582:57.22 nwchem

As you can see, there are 5 NWChem running but the fifth one does very little. 
So for me it looks like that the internode communication is a problem here and 
I would like to pin that down.

For example, on the new dual quadcore I can get:
13555 sassy     20   0 2073m 212m 113m R  100  0.9 367:57.27 nwchem
13556 sassy     20   0 2074m 209m 109m R  100  0.9 369:11.21 nwchem
13557 sassy     20   0 2074m 206m 107m R  100  0.9 369:13.76 nwchem
13558 sassy     20   0 2072m 203m 103m R  100  0.8 368:18.53 nwchem
13559 sassy     20   0 2072m 178m  78m R  100  0.7 369:11.49 nwchem
13560 sassy     20   0 2072m 172m  73m R  100  0.7 369:14.35 nwchem
13561 sassy     20   0 2074m 171m  72m R  100  0.7 369:12.34 nwchem
13562 sassy     20   0 2072m 170m  72m R  100  0.7 368:56.30 nwchem
So here there is no internode communication and hence I get the performance I 
would expect.

The main problem is I am no longer the administrator of that cluster so 
anything which requires root access is not possible for me :-(

But thanks for your 2 cent! :-)

All the best

J?rg

Am Dienstag 15 Dezember 2009 schrieb beowulf-request at beowulf.org:
> Hi Jorg
>
> If you have single quad core nodes as you said,
> then top shows that you are oversubscribing the cores.
> There are five nwchem processes are running.
>
> In my experience, oversubscription only works in relatively
> light MPI programs (say the example programs that come with OpenMPI or
> MPICH).
> Real world applications tend to be very inefficient,
> and can even hang on oversubscribed CPUs.
>
> What happens when you launch four or less processes
> on a node instead of five?
>
> My $0.02.
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
-- 
*************************************************************
J?rg Sa?mannshausen
Research Fellow
University of Strathclyde
Department of Pure and Applied Chemistry
295 Cathedral St.
Glasgow
G1 1XL

email: jorg.sassmannshausen at strath.ac.uk
web: http://sassy.formativ.net

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html


From jorg.sassmannshausen at strath.ac.uk  Wed Dec 16 01:41:39 2009
From: jorg.sassmannshausen at strath.ac.uk (=?iso-8859-1?q?J=F6rg_Sa=DFmannshausen?=)
Date: Wed, 16 Dec 2009 09:41:39 +0000
Subject: [Beowulf] Re:  Performance degrading
In-Reply-To: <200912160442.nBG4gYni023476@bluewest.scyld.com>
References: <200912160442.nBG4gYni023476@bluewest.scyld.com>
Message-ID: <200912160941.39449.jorg.sassmannshausen@strath.ac.uk>

Hi guys,

ok, some more information. 
I am using OpenMPI-1.2.8 and I only start 4 processes per node. So my hostfile 
looks like that:
comp12 slots=4
comp18 slots=4
comp08 slots=4

And yes, one process is the idle one which does things in the background. I 
have observed similar degradions before with a different program (GAMESS) 
where in the end, running a job on one node was _faster_ then running it on 
more than one nodes. Clearly, there is a problem here.

Interesting to note that the fith process is consuming memory as well, I did 
not see that at the time when I posted it. That is somehow odd as well, as a 
different calculation (same program) does not show that behaviour. I assume 
it is one extra process per job-group which will act as a master or shepherd 
for the slave processes. I know that GAMESS (which does not use MPI but ddi) 
has one additional process as data-server.

IIRC, the extra process does come from NWChem, but I doubt I am 
oversubscribing the node as it usually should not do much, as mentioned 
before. 

I am still wondering whether that could be a network issue?

Thanks for your comments!

All the best

Jorg


On Wednesday 16 December 2009 04:42:59 beowulf-request at beowulf.org wrote:
> Hi Glen, Jorg
>
> Glen: Yes, you are right about MPICH1/P4 starting extra processes.
> However, I wonder if that is what is happening to Jorg,
> of if what he reported is just plain CPU oversubscription.
>
> Jorg: ?Do you use MPICH1/P4?
> How many processes did you launch on a single node, four or five?
>
> Glen: ?Out of curiosity, I dug out the MPICH1/P4 I still have on an
> old system, compiled and ran "cpi.c".
> Indeed there are extra processes there, besides the ones that
> I intentionally started in the mpirun command line.
> When I launch two processes on a two-single-core-CPU machine,
> I also get two (not only one) extra processes, in a total of four.
>
> However, as you mentioned,
> the extra processes do not seem to use any significant CPU.
> Top shows the two actual processes close to 100% and the
> extra ones close to zero.
> Furthermore, the extra processes don't use any
> significant memory either.
>
> Anyway, in Jorg's case all processes consumed about
> the same (low) amount of CPU, but ~15% memory each,
> and there were 5 processes (only one "extra"?, is it one per CPU socket?
> is it one per core? one per node?).
> Hence, I would guess Jorg's context is different.
> But ... who knows ... only Jorg can clarify.
>
> These extra processes seem to be related to the
> mechanism used by MPICH1/P4 to launch MPI programs.
> They don't seem to appear in recent OpenMPI or MPICH2,
> which have other launching mechanisms.
> Hence my guess that Jorg had an oversubscription problem.
>
> Considering that MPICH1/P4 is old, no longer maintained,
> and seems to cause more distress than joy in current kernels,
> I would not recommend it to Jorg or to anybody anyway.
>
> Thank you,
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------

-- 
*************************************************************
J?rg Sa?mannshausen
Research Fellow
University of Strathclyde
Department of Pure and Applied Chemistry
295 Cathedral St.
Glasgow
G1 1XL

email: jorg.sassmannshausen at strath.ac.uk
web: http://sassy.formativ.net

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html


From jack at crepinc.com  Wed Dec 16 14:36:05 2009
From: jack at crepinc.com (Jack Carrozzo)
Date: Wed, 16 Dec 2009 17:36:05 -0500
Subject: [Beowulf] Geriatric computer does not stay up
In-Reply-To: <E1NL2LB-0006Oa-8A@mendel.bio.caltech.edu>
References: <E1NL2LB-0006Oa-8A@mendel.bio.caltech.edu>
Message-ID: <2ad0f9f60912161436w522c1e53n124a720c6031f7f3@mail.gmail.com>

I assume you've done this but forgot to mention it in the email - did
you test the RAM?

-Jack Carrozzo

On Wed, Dec 16, 2009 at 5:27 PM, David Mathog <mathog at caltech.edu> wrote:
> So we have a cluster of Tyan S2466 nodes and one of them has failed in
> an odd way. (Yes, these are very old, and they would be gone if we had a
> replacment.) ?On applying power the system boots normally and gets far
> into the boot sequence, sometimes to the login prompt, then it locks up.
> ?If booted failsafe it will stay up for tens of minutes before locking.
> ?It locked once on "man smartctl" and once on "service network start".
> However, on the next reboot, it didn't lock with another "man smartctl",
> so it isn't like it hit a bad part of the disk and died. ?Smartctl test
> has not been run, but "smartctl -a /dev/hda" on the one disk shows it as
> healthy with no blocks swapped out. ?Power stays on when it locks, and
> the display remains as it was just before the lock. ?When it locks it
> will not respond to either the keyboard or the network. ?(The network
> interface light still flashes.) ?There is nothing in any of the logs to
> indicate the nature of the problem.
>
> The odd thing is that the system is remarkably stable in some ways. ?For
> instance, the PS tests good and heat isn't the issue: after running
> sensors in a tight loop to a log file, waiting for it to lock up, then
> looking at the log on the next failsafe boot, there were negligible
> fluctuation on any of the voltages, fan speeds, or temperatures. ?It
> will happily sit for 30 minutes in the BIOS, or hours running memtest86
> (without errors). ?The motherboard battery is good, and the inside of
> the case is very clean, with no dust visible at all. ?Reset the BIOS but
> it didn't change anything.
>
> Here are my current hypotheses for what's wrong with this beast:
>
> 1. The drive is failing electrically, puts voltage spikes out on some
> operations, and these crash the system.
> 2. The motherboard capacitors are failing and letting too much noise in.
> ?The noise which is fatal is only seen on an active system, so sitting
> in the BIOS or in Memtest86 does not do it. (But the caps all look good,
> no swelling, no leaks.) ?It will run memtest86 overnight though, just in
> case.
> 3. The PS capacitors are failing, so that when loaded there is enough
> voltage fluctuation to crash the system. ?(Does not agree very well with
> the sensors measurements, but it could be really high frequency noise
> superimposed on a steady base voltage.)
> 4. Evil Djinn ;-(
>
> Any thoughts on what else this might be?
>
> Thanks.
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>


From pauljohn32 at gmail.com  Sun Dec 20 12:59:08 2009
From: pauljohn32 at gmail.com (Paul Johnson)
Date: Sun, 20 Dec 2009 14:59:08 -0600
Subject: [Beowulf] if you had 2 switches and a bunch of cable,
	what would you do?
Message-ID: <13e802630912201259k72971fa9x2627fb9ae4f68174@mail.gmail.com>

I inherited a few big racks of unused Dell PowerEdge servers.  There
are 2 1GB switches in each rack, and each node in the cluster has at
least two ethernet connections.

I have a head node and some compute nodes up and running in Rocks
Cluster 5.2, but in my haste to see that the hardware actually works,
I've only only cut and patched enough cables to use one switch for the
internal network.

What to do with the other switch? As far as I understand it, I have 2
options (aside from selling the extra switch on Ebay :))

Option 1. Create 2 separate internal networks. In wiring drawings for
clusters, I often see one administrative network and one for
computations (mpi, and so forth).

The downside for that is that I don't yet understand how user programs
are supposed to differentiate the 2 internal networks and send
messages through the computation network. I've not had the luxury of 2
ethernet cards before.

Option 2. Try "channel bonding" to try to increase the throughput on a
single ethernet node.  That has some appeal because the users just see
one network.

I'm not aiming for the "high availability" approach of using the
second switch as a fallback.  Rather, I'm aiming for the fastest &
widest connection possible in and out of each compute node.

My head node still has just one 1GB ethernet connection going into it,
and if I could believe that channel bonding would actually improve
throughput, I suppose I could get another line (or even 2 more) going
into the head node.  The headnode has 4 ethernet jacks, so I suppose I
could double up inside and outside.

I'd be glad to hear your thoughts.


-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas


From hackercasta at esdebian.org  Sat Dec 19 13:13:44 2009
From: hackercasta at esdebian.org (=?ISO-8859-1?Q?_=09Iker_Casta=F1os_Chavarri_?=)
Date: Sat, 19 Dec 2009 22:13:44 +0100
Subject: [Beowulf] new distribution to create a beowulf cluster:
	ABC(Automated Beowulf Cluster) GNU/Linux
In-Reply-To: <90202de50912180405sc9bd612t67f2fdc17da285b@mail.gmail.com>
References: <90202de50912180405sc9bd612t67f2fdc17da285b@mail.gmail.com>
Message-ID: <90202de50912191313h34a77c09md2625fe63001f6ba@mail.gmail.com>

Good night,

This Ubuntu GNU/Linux based distribution alaws to automatically build
Beowulf clusters either live or installing the software in the
frontend. All nodes run diskless. Connect your computers with a switch
and insert the DVD on one of them. It has the ganglia monitor, which
is nice for seeing the details of the operation of the cluser. The
system is beeing suported sporadically and you may contact technical
support at abclinuxsupport at gmail.com ABC has been presented at the
symposium ICAT2009 and published a research paper in the IEEE. The
developer is Iker Casta?os at the EUITI of Bilbao, University of the
Basque Country (Spain). The proyect website:
http://www.ehu.es/AC/ABC.htm

This is a demostration video:

http://www.youtube.com/watch?v=Xn2M1SoVg6U


Best regards,

Iker Casta?os


From h-bugge at online.no  Mon Dec 21 01:05:08 2009
From: h-bugge at online.no (=?WINDOWS-1252?Q?H=E5kon_Bugge?=)
Date: Mon, 21 Dec 2009 10:05:08 +0100
Subject: [Beowulf] FTC and Intel
In-Reply-To: <e4d4fd070912191320h1883f69bv3cc8e934ae81fcc4@mail.gmail.com>
References: <1E5EA6E5-75A5-486C-9B7E-E98B78BF5DD6@online.no>
	<e4d4fd070912191320h1883f69bv3cc8e934ae81fcc4@mail.gmail.com>
Message-ID: <EE78268C-C514-4569-BF33-3B85EE0CF0D0@online.no>

Peter,

On Dec 19, 2009, at 22:20 , Peter St. John wrote:

> Haakon,
> What I saw (in a recent Slashdot, which I ought to be able to find)  
> was the idea that an Intel compiler disables it's own highest levels  
> of optimizations if it detects that the host processor is not Intel.  
> The complaint is based on that I believe.
>
> FWIW, I imagine that if an automobile engine detected poor octane in  
> the fuel, it might throttle down the maximum speed of the car; but  
> if it did so after detecting a competitor's brand of gasoline, it  
> could be considered anti-competitive. But of course IMNAL or however  
> we announce we ain't lawyers so CGS (cum grano salis).
> Peter

Intel's compiler generates code that will not run on AMD CPUs (i.e.  
non GenuineIntel) if an instruction-set higher than SSE2 is selected.  
This issue is covered by the referenced complaints elsewhere. I know  
this issue pretty well, as I wrote a piece of software used by Scali/ 
Platform MPI which reverted the Intel compiler's check for  
GenuineIntel, which transparently allowed MPI programs to run on AMD  
CPUs with SSE3/SSSE3 instruction set enabled.

Claim 20 in the complaints is not related to this, but  
_interoperability_ between GPUs and Intel CPUs. That is what I tried  
to get a better understanding of. I have received insightful comments  
around this issue off-list.


Thanks anyway, H?kon

> On Thu, Dec 17, 2009 at 3:33 AM, H?kon Bugge <h-bugge at online.no>  
> wrote:
> In http://www.ftc.gov/os/adjpro/d9341/091216intelcmpt.pdf, there are  
> allegations against Intel, such as "20. Intel?s efforts to deny  
> interoperability between competitors? (e.g., Nvidia, AMD, and Via)
> GPUs and Intel?s newest CPUs".
>
> I was unaware of this. Anyone know what kind of interoperability we  
> are talking about here?
>
>
> H?kon
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin  
> Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

Mvh.,

H?kon Bugge
h-bugge at online.no
+47 924 84 514


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091221/dedeb1c9/attachment.html>

From john.hearns at mclaren.com  Mon Dec 21 01:54:05 2009
From: john.hearns at mclaren.com (Hearns, John)
Date: Mon, 21 Dec 2009 09:54:05 -0000
Subject: [Beowulf] if you had 2 switches and a bunch of cable,what would
	you do?
In-Reply-To: <13e802630912201259k72971fa9x2627fb9ae4f68174@mail.gmail.com>
References: <13e802630912201259k72971fa9x2627fb9ae4f68174@mail.gmail.com>
Message-ID: <68A57CCFD4005646957BD2D18E60667B0EA7BDEE@milexchmb1.mil.tagmclarengroup.com>

Option 1. Create 2 separate internal networks. In wiring drawings for
clusters, I often see one administrative network and one for
computations (mpi, and so forth).


Paul, definitely recommend option 1.
Use the second switch for MPI traffic.

The way you achieve this is to use your batch scheduling system and run
a script
which takes the machines list provided by the batch system and
translates it into one
which is fed to to the mpirun utility, ie the hostnames are turned into
something like 
hostname-eth1


The contents of this email are confidential and for the exclusive use of the intended recipient.  If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy.


From john.hearns at mclaren.com  Mon Dec 21 02:04:37 2009
From: john.hearns at mclaren.com (Hearns, John)
Date: Mon, 21 Dec 2009 10:04:37 -0000
Subject: [Beowulf] Performance degrading
In-Reply-To: <200912152231.36348.jorg.sassmannshausen@strath.ac.uk>
References: <200912152231.36348.jorg.sassmannshausen@strath.ac.uk>
Message-ID: <68A57CCFD4005646957BD2D18E60667B0EA7BE0D@milexchmb1.mil.tagmclarengroup.com>


The main problem is I am no longer the administrator of that cluster so 
anything which requires root access is not possible for me :-(

But thanks for your 2 cent! :-)


Jorg, you live in Glasgow.
The solution is simple.
Proceed without delay to the Scotch Whisky Shop in the Buchanan
Galleries (just down the hill).
Buy one bottle of single malt whisky - you may like to taste a few just
to get the right one.
Present this bottle as a Christmas gift to your BOFH (ahem, sorry,
friendly systems admin).


The contents of this email are confidential and for the exclusive use of the intended recipient.  If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy.


From csamuel at vpac.org  Mon Dec 21 03:57:10 2009
From: csamuel at vpac.org (Chris Samuel)
Date: Mon, 21 Dec 2009 22:57:10 +1100 (EST)
Subject: [Beowulf] if you had 2 switches and a bunch of cable, what
	would you do?
In-Reply-To: <777881520.276391261396089675.JavaMail.root@mail.vpac.org>
Message-ID: <60238355.276411261396630847.JavaMail.root@mail.vpac.org>


----- "Paul Johnson" <pauljohn32 at gmail.com> wrote:

> What to do with the other switch?

We use one ethernet switch for management, one ethernet
switch (with jumbo frames) for our NFS storage network
and IB/Myrinet for MPI.

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


From reuti at staff.uni-marburg.de  Mon Dec 21 04:38:06 2009
From: reuti at staff.uni-marburg.de (Reuti)
Date: Mon, 21 Dec 2009 13:38:06 +0100
Subject: [Beowulf] if you had 2 switches and a bunch of cable,
	what would you do?
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0EA7BDEE@milexchmb1.mil.tagmclarengroup.com>
References: <13e802630912201259k72971fa9x2627fb9ae4f68174@mail.gmail.com>
	<68A57CCFD4005646957BD2D18E60667B0EA7BDEE@milexchmb1.mil.tagmclarengroup.com>
Message-ID: <6C5F76ED-A3F0-4601-97FA-3F6FB670A225@staff.uni-marburg.de>

Hi,

Am 21.12.2009 um 10:54 schrieb Hearns, John:

> Option 1. Create 2 separate internal networks. In wiring drawings for
> clusters, I often see one administrative network and one for
> computations (mpi, and so forth).

first a matter of definition: what's "administrative network"? Just  
the option to ssh to a node & SGE, or to have access to some  
facitility of a dedicated service processor like "Lights Out"  
management?

In the former case, I would do it the other way round: use the  
primary one for ssh, SGE and MPI, and the second one for NFS. Simply  
because then there is no need to alter the generated list of hosts  
(ssh to the nodes is in my case only for admin staff anyway), and  
SGE's is communication to the nodes is not so high (When local spool  
directories on the nodes are used, there is no further communication  
needed to store this informaton. Otherwise it will go via NFS to the  
central spool directory, but this would be the second network then.)

With some MPI implementations it can be tricky (but possible) to  
force them to use the secondary interface, especially for both  
directions. In MPICH(1) (old, but sometimes still used) also the  
environment variable MPI_HOST must be set to have the name of the  
secondary interface.

Well, you need a second (or third) network for the headnode then: one  
for NFS and maybe one for going to the outside world (this way all  
internal traffic can use private addresses and are invisible from the  
outside).

==

If the administrative network is "Lights Out" management, I would  
look for a switch with less performance laying around. If your  
servers have it built-in, I would use it. If you have enough ports on  
the switches, you can also connect it to the first network from above.

-- Reuti


> Paul, definitely recommend option 1.
> Use the second switch for MPI traffic.
>
> The way you achieve this is to use your batch scheduling system and  
> run
> a script
> which takes the machines list provided by the batch system and
> translates it into one
> which is fed to to the mpirun utility, ie the hostnames are turned  
> into
> something like
> hostname-eth1
>
> The contents of this email are confidential and for the exclusive  
> use of the intended recipient.  If you receive this email in error  
> you should not copy it, retransmit it, use it or disclose its  
> contents but should return it to the sender immediately and delete  
> your copy.
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin  
> Computing
> To change your subscription (digest mode or unsubscribe) visit  
> http://www.beowulf.org/mailman/listinfo/beowulf


From peter.st.john at gmail.com  Mon Dec 21 09:41:38 2009
From: peter.st.john at gmail.com (Peter St. John)
Date: Mon, 21 Dec 2009 12:41:38 -0500
Subject: [Beowulf] FTC and Intel
In-Reply-To: <EE78268C-C514-4569-BF33-3B85EE0CF0D0@online.no>
References: <1E5EA6E5-75A5-486C-9B7E-E98B78BF5DD6@online.no>
	<e4d4fd070912191320h1883f69bv3cc8e934ae81fcc4@mail.gmail.com>
	<EE78268C-C514-4569-BF33-3B85EE0CF0D0@online.no>
Message-ID: <e4d4fd070912210941s47444af8xd706e85f4fc819c6@mail.gmail.com>

Haakon,
And thanks for correcting me. I had been surprised your question went
unanswered so long as it did; I should always be suspicious that means I had
superficially misconstrued the question.
Peter

On Mon, Dec 21, 2009 at 4:05 AM, H?kon Bugge <h-bugge at online.no> wrote:

> Peter,
>
> On Dec 19, 2009, at 22:20 , Peter St. John wrote:
>
> Haakon,
> What I saw (in a recent Slashdot, which I ought to be able to find) was the
> idea that an Intel compiler disables it's own highest levels of
> optimizations if it detects that the host processor is not Intel. The
> complaint is based on that I believe.
>
> FWIW, I imagine that if an automobile engine detected poor octane in the
> fuel, it might throttle down the maximum speed of the car; but if it did so
> after detecting a competitor's brand of gasoline, it could be considered
> anti-competitive. But of course IMNAL or however we announce we ain't
> lawyers so CGS (cum grano salis).
> Peter
>
>
> Intel's compiler generates code that will not run on AMD CPUs (i.e. non
> GenuineIntel) if an instruction-set higher than SSE2 is selected. This issue
> is covered by the referenced complaints elsewhere. I know this issue pretty
> well, as I wrote a piece of software used by Scali/Platform MPI which
> reverted the Intel compiler's check for GenuineIntel, which transparently
> allowed MPI programs to run on AMD CPUs with SSE3/SSSE3 instruction set
> enabled.
>
> Claim 20 in the complaints is not related to this, but _interoperability_
> between GPUs and Intel CPUs. That is what I tried to get a better
> understanding of. I have received insightful comments around this issue
> off-list.
>
>
> Thanks anyway, H?kon
>
> On Thu, Dec 17, 2009 at 3:33 AM, H?kon Bugge <h-bugge at online.no> wrote:
>
>> In http://www.ftc.gov/os/adjpro/d9341/091216intelcmpt.pdf, there are
>> allegations against Intel, such as "20. Intel?s efforts to deny
>> interoperability between competitors? (e.g., Nvidia, AMD, and Via)
>> GPUs and Intel?s newest CPUs".
>>
>> I was unaware of this. Anyone know what kind of interoperability we are
>> talking about here?
>>
>>
>> H?kon
>>
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>
>
> Mvh.,
>
> H?kon Bugge
> h-bugge at online.no
> +47 924 84 514
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091221/ed5abffe/attachment.html>

From gus at ldeo.columbia.edu  Mon Dec 21 10:06:04 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Mon, 21 Dec 2009 13:06:04 -0500
Subject: [Beowulf] Re:  Performance degrading
In-Reply-To: <200912160941.39449.jorg.sassmannshausen@strath.ac.uk>
References: <200912160442.nBG4gYni023476@bluewest.scyld.com>
	<200912160941.39449.jorg.sassmannshausen@strath.ac.uk>
Message-ID: <4B2FB90C.5040109@ldeo.columbia.edu>

Hi Jorg

To clarify what is going on,
I would try the cpi.c (comes with MPICH2), or the "connectivity_c.c",
"ring_c.c" (come with OpenMPI) programs.
Get them with the source code in the MPICH2 and OpenMPI sites.
These programs are in the "examples" directories.
Compiliation is straightforward with mpicc.
Run these programs on one node (4 processes) first,
then on several nodes (say -np 8, -np 12, etc).
Remember OpenMPI mpiexec has the "-byslot" and "-bynode" options that
allow you to experiment with different process vs.
core/node configurations (see "man mpiexec").

If cpi.c runs, then it should not be an OpenMPI problem,
or with your network, etc.
This will narrow your investigation also.

I know nothing about your programs, but I find strange that
it starts more processes than you request.
As noted by Glen this used to be the case with the old
MPICH1, which you are not using.
Hence, your program seems to be doing stuff under the hood,
and beyond mpiexec.

In any case, have you tried "mpiexec -np 3 newchem" (if 3 is an
acceptable number for newchem)?

Also, not being the administrator doesn't prevent you from installing
a newer OpenMPI  (current version is 1.4)
from source code in your area and using it.
You just need to set the right PATH, LD_LIBRARY_PATH, and MANPATH
to your own OpenMPI on your .bashrc/.cshrc file.

I am no expert, just a user, but here is what I think may be happening.
Oversubscribing processors/cores leads to context switching
across the processes, which is a killer for MPI performance.
Oversubscribing memory (e.g. total of all user processes memory
above 80% or so), leads to memory paging, another performance killer.
I would guess both situations open plenty of opportunity for
gridlocks, one process trying to communicate with another that is
on hold, and when the other that is on hold becomes active,
the one that was trying to talk goes on hold, and so on.
Sometimes the programs just hang, sometimes one MPI process goes astray 
losing communication with the others.
Something like this may be happening to you.
I think the message is: MPI and oversubscription
(of processors or memory) don't mix well.

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


J?rg Sa?mannshausen wrote:
> Hi guys,
> 
> ok, some more information. 
> I am using OpenMPI-1.2.8 and I only start 4 processes per node. So my hostfile 
> looks like that:
> comp12 slots=4
> comp18 slots=4
> comp08 slots=4
> 
> And yes, one process is the idle one which does things in the background. I 
> have observed similar degradions before with a different program (GAMESS) 
> where in the end, running a job on one node was _faster_ then running it on 
> more than one nodes. Clearly, there is a problem here.
> 
> Interesting to note that the fith process is consuming memory as well, I did 
> not see that at the time when I posted it. That is somehow odd as well, as a 
> different calculation (same program) does not show that behaviour. I assume 
> it is one extra process per job-group which will act as a master or shepherd 
> for the slave processes. I know that GAMESS (which does not use MPI but ddi) 
> has one additional process as data-server.
> 
> IIRC, the extra process does come from NWChem, but I doubt I am 
> oversubscribing the node as it usually should not do much, as mentioned 
> before. 
> 
> I am still wondering whether that could be a network issue?
> 
> Thanks for your comments!
> 
> All the best
> 
> Jorg
> 
> 
> On Wednesday 16 December 2009 04:42:59 beowulf-request at beowulf.org wrote:
>> Hi Glen, Jorg
>>
>> Glen: Yes, you are right about MPICH1/P4 starting extra processes.
>> However, I wonder if that is what is happening to Jorg,
>> of if what he reported is just plain CPU oversubscription.
>>
>> Jorg:  Do you use MPICH1/P4?
>> How many processes did you launch on a single node, four or five?
>>
>> Glen:  Out of curiosity, I dug out the MPICH1/P4 I still have on an
>> old system, compiled and ran "cpi.c".
>> Indeed there are extra processes there, besides the ones that
>> I intentionally started in the mpirun command line.
>> When I launch two processes on a two-single-core-CPU machine,
>> I also get two (not only one) extra processes, in a total of four.
>>
>> However, as you mentioned,
>> the extra processes do not seem to use any significant CPU.
>> Top shows the two actual processes close to 100% and the
>> extra ones close to zero.
>> Furthermore, the extra processes don't use any
>> significant memory either.
>>
>> Anyway, in Jorg's case all processes consumed about
>> the same (low) amount of CPU, but ~15% memory each,
>> and there were 5 processes (only one "extra"?, is it one per CPU socket?
>> is it one per core? one per node?).
>> Hence, I would guess Jorg's context is different.
>> But ... who knows ... only Jorg can clarify.
>>
>> These extra processes seem to be related to the
>> mechanism used by MPICH1/P4 to launch MPI programs.
>> They don't seem to appear in recent OpenMPI or MPICH2,
>> which have other launching mechanisms.
>> Hence my guess that Jorg had an oversubscription problem.
>>
>> Considering that MPICH1/P4 is old, no longer maintained,
>> and seems to cause more distress than joy in current kernels,
>> I would not recommend it to Jorg or to anybody anyway.
>>
>> Thank you,
>> Gus Correa
>> ---------------------------------------------------------------------
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
> 


From atp at piskorski.com  Mon Dec 21 11:56:10 2009
From: atp at piskorski.com (Andrew Piskorski)
Date: Mon, 21 Dec 2009 14:56:10 -0500
Subject: [Beowulf] new distribution to create a beowulf cluster:
	ABC(Automated Beowulf Cluster) GNU/Linux
In-Reply-To: <90202de50912191313h34a77c09md2625fe63001f6ba@mail.gmail.com>
References: <90202de50912191313h34a77c09md2625fe63001f6ba@mail.gmail.com>
Message-ID: <20091221195610.GA59020@piskorski.com>

On Sat, Dec 19, 2009 at 10:13:44PM +0100, Iker Casta?os Chavarri wrote:

> This Ubuntu GNU/Linux based distribution alaws to automatically build
> Beowulf clusters either live or installing the software in the
> frontend. All nodes run diskless.

> http://www.ehu.es/AC/ABC.htm

How is this similar/different to Perceus/Warewulf, xCAT, Scyld, etc.?
What use cases drove you to build your own cluster provisioning
toolkit rather than use an existing one?  Is there a document
somewhere explaining your overall system design and rationale?

-- 
Andrew Piskorski <atp at piskorski.com>
http://www.piskorski.com/


From kyron at neuralbs.com  Mon Dec 21 11:05:45 2009
From: kyron at neuralbs.com (Eric Thibodeau)
Date: Mon, 21 Dec 2009 14:05:45 -0500
Subject: [Beowulf] Geriatric computer does not stay up
In-Reply-To: <2ad0f9f60912161436w522c1e53n124a720c6031f7f3@mail.gmail.com>
References: <E1NL2LB-0006Oa-8A@mendel.bio.caltech.edu>
	<2ad0f9f60912161436w522c1e53n124a720c6031f7f3@mail.gmail.com>
Message-ID: <B529E850-5FC8-49E1-BE90-67B185A090C8@neuralbs.com>

This smells like the hell I went through when one of the CPUs needed to be changed in our dep's Tyan VX50... Try swapping CPUs if you have spares.

ET
On 2009-12-16, at 5:36 PM, Jack Carrozzo wrote:

> I assume you've done this but forgot to mention it in the email - did
> you test the RAM?
> 
> -Jack Carrozzo
> 
> On Wed, Dec 16, 2009 at 5:27 PM, David Mathog <mathog at caltech.edu> wrote:
>> So we have a cluster of Tyan S2466 nodes and one of them has failed in
>> an odd way. (Yes, these are very old, and they would be gone if we had a
>> replacment.)  On applying power the system boots normally and gets far
>> into the boot sequence, sometimes to the login prompt, then it locks up.
>>  If booted failsafe it will stay up for tens of minutes before locking.
>>  It locked once on "man smartctl" and once on "service network start".
>> However, on the next reboot, it didn't lock with another "man smartctl",
>> so it isn't like it hit a bad part of the disk and died.  Smartctl test
>> has not been run, but "smartctl -a /dev/hda" on the one disk shows it as
>> healthy with no blocks swapped out.  Power stays on when it locks, and
>> the display remains as it was just before the lock.  When it locks it
>> will not respond to either the keyboard or the network.  (The network
>> interface light still flashes.)  There is nothing in any of the logs to
>> indicate the nature of the problem.
>> 
>> The odd thing is that the system is remarkably stable in some ways.  For
>> instance, the PS tests good and heat isn't the issue: after running
>> sensors in a tight loop to a log file, waiting for it to lock up, then
>> looking at the log on the next failsafe boot, there were negligible
>> fluctuation on any of the voltages, fan speeds, or temperatures.  It
>> will happily sit for 30 minutes in the BIOS, or hours running memtest86
>> (without errors).  The motherboard battery is good, and the inside of
>> the case is very clean, with no dust visible at all.  Reset the BIOS but
>> it didn't change anything.
>> 
>> Here are my current hypotheses for what's wrong with this beast:
>> 
>> 1. The drive is failing electrically, puts voltage spikes out on some
>> operations, and these crash the system.
>> 2. The motherboard capacitors are failing and letting too much noise in.
>>  The noise which is fatal is only seen on an active system, so sitting
>> in the BIOS or in Memtest86 does not do it. (But the caps all look good,
>> no swelling, no leaks.)  It will run memtest86 overnight though, just in
>> case.
>> 3. The PS capacitors are failing, so that when loaded there is enough
>> voltage fluctuation to crash the system.  (Does not agree very well with
>> the sensors measurements, but it could be really high frequency noise
>> superimposed on a steady base voltage.)
>> 4. Evil Djinn ;-(
>> 
>> Any thoughts on what else this might be?
>> 
>> Thanks.
>> 
>> David Mathog
>> mathog at caltech.edu
>> Manager, Sequence Analysis Facility, Biology Division, Caltech
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>> 
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


From reuti at staff.uni-marburg.de  Mon Dec 21 16:01:57 2009
From: reuti at staff.uni-marburg.de (Reuti)
Date: Tue, 22 Dec 2009 01:01:57 +0100
Subject: [Beowulf] Performance degrading
In-Reply-To: <200912152222.11820.sassy1@gmx.de>
References: <200912152000.nBFK05ZA010546@bluewest.scyld.com>
	<200912152222.11820.sassy1@gmx.de>
Message-ID: <D6AB1AD7-3377-4E32-B3E4-D0C6644C96E8@staff.uni-marburg.de>

Hi,

Am 15.12.2009 um 23:22 schrieb J?rg Sa?mannshausen:

> Hi Gus,
>
> thanks for your comments. The problem is not that there are 5  
> NWChem running.
> I am only starting 4 processes and the additional one is the master  
> which
> does nothing more but coordinating the slaves.

isn't NWChem using Global Arrays internally, and Open MPI is only  
used for communication? Which version of GA is included with your  
current NWChem?

-- Reuti


> Other parts of the program behaving more as you expect it (again  
> parallel
> between nodes, taken from one node):
> 14902 sassy     25   0 2161m 325m 124m R  100  2.8  14258:15 nwchem
> 14903 sassy     25   0 2169m 335m 128m R  100  2.9  14231:15 nwchem
> 14901 sassy     25   0 2177m 338m 133m R  100  2.9  14277:23 nwchem
> 14904 sassy     25   0 2161m 333m 132m R   97  2.9  14213:44 nwchem
> 14906 sassy     15   0  978m  71m  69m S    3  0.6 582:57.22 nwchem
>
> As you can see, there are 5 NWChem running but the fifth one does  
> very little.
> So for me it looks like that the internode communication is a  
> problem here and
> I would like to pin that down.
>
> For example, on the new dual quadcore I can get:
> 13555 sassy     20   0 2073m 212m 113m R  100  0.9 367:57.27 nwchem
> 13556 sassy     20   0 2074m 209m 109m R  100  0.9 369:11.21 nwchem
> 13557 sassy     20   0 2074m 206m 107m R  100  0.9 369:13.76 nwchem
> 13558 sassy     20   0 2072m 203m 103m R  100  0.8 368:18.53 nwchem
> 13559 sassy     20   0 2072m 178m  78m R  100  0.7 369:11.49 nwchem
> 13560 sassy     20   0 2072m 172m  73m R  100  0.7 369:14.35 nwchem
> 13561 sassy     20   0 2074m 171m  72m R  100  0.7 369:12.34 nwchem
> 13562 sassy     20   0 2072m 170m  72m R  100  0.7 368:56.30 nwchem
> So here there is no internode communication and hence I get the  
> performance I
> would expect.
>
> The main problem is I am no longer the administrator of that  
> cluster so
> anything which requires root access is not possible for me :-(
>
> But thanks for your 2 cent! :-)
>
> All the best
>
> J?rg
>
> Am Dienstag 15 Dezember 2009 schrieb beowulf-request at beowulf.org:
>> Hi Jorg
>>
>> If you have single quad core nodes as you said,
>> then top shows that you are oversubscribing the cores.
>> There are five nwchem processes are running.
>>
>> In my experience, oversubscription only works in relatively
>> light MPI programs (say the example programs that come with  
>> OpenMPI or
>> MPICH).
>> Real world applications tend to be very inefficient,
>> and can even hang on oversubscribed CPUs.
>>
>> What happens when you launch four or less processes
>> on a node instead of five?
>>
>> My $0.02.
>> Gus Correa
>> ---------------------------------------------------------------------
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin  
> Computing
> To change your subscription (digest mode or unsubscribe) visit  
> http://www.beowulf.org/mailman/listinfo/beowulf


From jorg.sassmannshausen at strath.ac.uk  Tue Dec 22 02:30:54 2009
From: jorg.sassmannshausen at strath.ac.uk (=?iso-8859-15?q?J=F6rg_Sa=DFmannshausen?=)
Date: Tue, 22 Dec 2009 10:30:54 +0000
Subject: [Beowulf] Re: Performance degrading
Message-ID: <200912221030.54293.jorg.sassmannshausen@strath.ac.uk>

Hi all,

right, following the various suggestion and the idea about oversubscribing the 
node, I have started the same calculation again on 3 nodes but this time only 
with 3 processes on the node (started), so that will leave room for the 4th 
process which I believe will be started by NWChem.
Has anything changed? No.

top - 10:24:06 up 21 days, 17:29,  1 user,  load average: 0.25, 0.26, 0.26
Tasks: 131 total,   1 running, 130 sleeping,   0 stopped,   0 zombie
Cpu0  :  4.3% us,  1.0% sy,  0.0% ni, 93.7% id,  0.0% wa,  0.0% hi,  1.0% si
Cpu1  :  2.7% us,  0.0% sy,  0.0% ni, 97.3% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu2  :  0.7% us,  0.0% sy,  0.0% ni, 99.3% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu3  :  0.0% us,  0.0% sy,  0.0% ni, 99.0% id,  1.0% wa,  0.0% hi,  0.0% si
Mem:  12308356k total,  5251744k used,  7056612k free,   377052k buffers
Swap: 24619604k total,        0k used, 24619604k free,  3647568k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
29017 sassy     15   0 3921m 1.7g 1.4g S    3 14.8 384:50.34 nwchem
29019 sassy     15   0 3920m 1.8g 1.5g S    2 15.4 379:26.55 nwchem
29018 sassy     15   0 3920m 1.8g 1.5g S    1 15.5 380:31.15 nwchem
29021 sassy     15   0 2943m 1.7g 1.7g S    1 14.9  42:33.81 nwchem   <- 
process started by NWChem I suppose

As Reuti pointed out to me, NWChem is using Global Arrays internally and only 
MPI for communication. I don't think the problem is the OpenMPI I have. I 
could upgrade to the latest, but that means I have to re-link all the 
programs I am using.

Could the problem be the GA?

All the best

J?rg


-- 
*************************************************************
J?rg Sa?mannshausen
Research Fellow
University of Strathclyde
Department of Pure and Applied Chemistry
295 Cathedral St.
Glasgow
G1 1XL

email: jorg.sassmannshausen at strath.ac.uk
web: http://sassy.formativ.net

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html


From gus at ldeo.columbia.edu  Tue Dec 22 16:35:49 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Tue, 22 Dec 2009 19:35:49 -0500
Subject: [Beowulf] Re: Performance degrading
In-Reply-To: <200912221030.54293.jorg.sassmannshausen@strath.ac.uk>
References: <200912221030.54293.jorg.sassmannshausen@strath.ac.uk>
Message-ID: <4B3165E5.1000403@ldeo.columbia.edu>

Hi Jorg

I agree your old OpenMPI  1.2.8 should not be the problem,
and upgrading now will only add confusion.
I only suggested running simple test programs (cpi.c, connectivity_c.c)
to make sure all works right, including your network setup.
However, you or somebody else may already have done this in the past.

Hybrid communication schemes have to be handled with care.
We have plenty of MPI+OpenMP programs here.
Normally we request one processor per OpenMP thread for each
MPI task. For instance, if each MPI task opens four threads,
and you want three MPI tasks, then a total of 12 processors,
is requested from the batch system (e.g. #PBS -l nodes=3:ppn=4).
However, only 3 MPI processes are launched in mpiexec
(mpiexec -bynode -np 3 executable_name).
The "-bynode" option will put one MPI task on each of three nodes,
and each of these three MPI tasks will launch 4 OpenMP threads,
using the 4 local processors.
However, GA is not the same as OpenMP,
and another scheme may apply.

I don't know about GA.
I looked up their web page (at PNL),
but I didn't find a direct answer to your problem.
I would have to read more to learn about GA.
See if this GA support page may shed some light on how you
configured it (or how it is configured inside NWChem):
http://www.emsl.pnl.gov/docs/global/support.html
See their comments about the SHMMAX (shared memory segment size)
in Linux Kernels, as this may perhaps be the problem.
Here, on 32-bit Linux machines I have a number
smaller than they recommend (134217728 bytes, 128MB):

cat /proc/sys/kernel/shmmax
33554432

But on 64-bit machines it is much larger:

cat /proc/sys/kernel/shmmax
68719476736

You may check this out on your nodes, and if you have a low number
(say in a 32-bit node), perhaps try their suggestion,
and ask the system administrator to change this kernel parameter
on the nodes by doing:

echo "134217728" >/proc/sys/kernel/shmmax

GA seems to be a heavy user of shared memory,
hence it is likely to require more shared memory
resources than normal programs do.
Therefore, there is a flimsy chance that increasing SHMMAX may help.

I also found the NWChem web site (also at PNL).
You may know well all about this, so forgive me any silly suggestions.
I am not a Chemist, computational or otherwise.
I still need to understand PH and Hydrogen bridges right.
The NWChem User Guide, Appendix D (around page 401, big guide!)
has suggestions on how to run in different machines,
including Linux clusters with MPI (section D.3).
http://www.emsl.pnl.gov/capabilities/computing/nwchem/docs/usermanual.pdf

They have also FAQ about Linux clusters:
http://www.emsl.pnl.gov/capabilities/computing/nwchem/support/faq.jsp#Linux

They also have a "known bugs" list:
http://www.emsl.pnl.gov/root/capabilities/computing/nwchem/support/knownbugs/
Somehow they seem to talk only about MPICH
(not sure if MPICH1 or MPICH2),
but not about OpenMPI.
Likewise, browsing through the GA stuff I could find no direct reference
to OpenMPI, only to MPICH,
although theoretically MPI is supposed to be a standard,
and portable programs ans libraries should work right with any MPI 
flavor (in practice I am not so sure this is true).

Also, GA seems to have specific instructions for Infiniband
(but not for Ethernet), see the web link on GA above.
What do you have IB or Ethernet?
If you have both you can select one (say, IB)
with the OpenMPI mca parameters
in the mpiexec command line.

I know it won't help ... but I tried ...  :(

Good luck, and Happy Holidays!

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


J?rg Sa?mannshausen wrote:
> Hi all,
> 
> right, following the various suggestion and the idea about oversubscribing the 
> node, I have started the same calculation again on 3 nodes but this time only 
> with 3 processes on the node (started), so that will leave room for the 4th 
> process which I believe will be started by NWChem.
> Has anything changed? No.
> 
> top - 10:24:06 up 21 days, 17:29,  1 user,  load average: 0.25, 0.26, 0.26
> Tasks: 131 total,   1 running, 130 sleeping,   0 stopped,   0 zombie
> Cpu0  :  4.3% us,  1.0% sy,  0.0% ni, 93.7% id,  0.0% wa,  0.0% hi,  1.0% si
> Cpu1  :  2.7% us,  0.0% sy,  0.0% ni, 97.3% id,  0.0% wa,  0.0% hi,  0.0% si
> Cpu2  :  0.7% us,  0.0% sy,  0.0% ni, 99.3% id,  0.0% wa,  0.0% hi,  0.0% si
> Cpu3  :  0.0% us,  0.0% sy,  0.0% ni, 99.0% id,  1.0% wa,  0.0% hi,  0.0% si
> Mem:  12308356k total,  5251744k used,  7056612k free,   377052k buffers
> Swap: 24619604k total,        0k used, 24619604k free,  3647568k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 29017 sassy     15   0 3921m 1.7g 1.4g S    3 14.8 384:50.34 nwchem
> 29019 sassy     15   0 3920m 1.8g 1.5g S    2 15.4 379:26.55 nwchem
> 29018 sassy     15   0 3920m 1.8g 1.5g S    1 15.5 380:31.15 nwchem
> 29021 sassy     15   0 2943m 1.7g 1.7g S    1 14.9  42:33.81 nwchem   <- 
> process started by NWChem I suppose
> 
> As Reuti pointed out to me, NWChem is using Global Arrays internally and only 
> MPI for communication. I don't think the problem is the OpenMPI I have. I 
> could upgrade to the latest, but that means I have to re-link all the 
> programs I am using.
> 
> Could the problem be the GA?
> 
> All the best
> 
> J?rg
> 
> 


From reuti at staff.uni-marburg.de  Tue Dec 22 17:39:32 2009
From: reuti at staff.uni-marburg.de (Reuti)
Date: Wed, 23 Dec 2009 02:39:32 +0100
Subject: [Beowulf] Re: Performance degrading
In-Reply-To: <4B3165E5.1000403@ldeo.columbia.edu>
References: <200912221030.54293.jorg.sassmannshausen@strath.ac.uk>
	<4B3165E5.1000403@ldeo.columbia.edu>
Message-ID: <5B9409EF-5C77-4D11-A867-14C9EBCB406E@staff.uni-marburg.de>

Hi,

Am 23.12.2009 um 01:35 schrieb Gus Correa:

> <snip>
> Likewise, browsing through the GA stuff I could find no direct  
> reference
> to OpenMPI, only to MPICH,
> although theoretically MPI is supposed to be a standard,
> and portable programs ans libraries should work right with any MPI  
> flavor (in practice I am not so sure this is true).

GA can be used with Open MPI. I just did this for Molpro. Side note:  
Molpro supports now also a pure MPI-2 compilation besides the former  
via TCGMSG over MPI. The TCGMSG-MPI version is faster than the plain  
MPI-2 compilation. Okay, one extra step, and you get the time back.

-- Reuti


From hackercasta at esdebian.org  Mon Dec 21 12:12:23 2009
From: hackercasta at esdebian.org (=?ISO-8859-1?Q?_=09Iker_Casta=F1os_Chavarri_?=)
Date: Mon, 21 Dec 2009 21:12:23 +0100
Subject: [Beowulf] new distribution to create a beowulf cluster: 
	ABC(Automated Beowulf Cluster) GNU/Linux
In-Reply-To: <20091221195610.GA59020@piskorski.com>
References: <90202de50912191313h34a77c09md2625fe63001f6ba@mail.gmail.com>
	<20091221195610.GA59020@piskorski.com>
Message-ID: <90202de50912211212v7f95e608kc793c32e283fe119@mail.gmail.com>

Excuse me english please. This is similiar to PelicanHPC but this
distributions alaws the instalation of the operating system in the
frontend using "ubiquity" and is very easy the instalation. I'm
writing a quickstart guide but you have an article published y the
IEEE. I'm the author.

http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?tp=&arnumber=5348420&isnumber=5348395

If you go to the pelicanHPC website you can see that Michael Creel
recomend to give a try ABC GNU/Linux

http://pareto.uab.es/mcreel/PelicanHPC/

Best regards.

2009/12/21 Andrew Piskorski <atp at piskorski.com>:
> On Sat, Dec 19, 2009 at 10:13:44PM +0100, Iker Casta?os Chavarri wrote:
>
>> This Ubuntu GNU/Linux based distribution alaws to automatically build
>> Beowulf clusters either live or installing the software in the
>> frontend. All nodes run diskless.
>
>> http://www.ehu.es/AC/ABC.htm
>
> How is this similar/different to Perceus/Warewulf, xCAT, Scyld, etc.?
> What use cases drove you to build your own cluster provisioning
> toolkit rather than use an existing one? ?Is there a document
> somewhere explaining your overall system design and rationale?
>
> --
> Andrew Piskorski <atp at piskorski.com>
> http://www.piskorski.com/
>


-- 
Iker Casta?os


From csamuel at vpac.org  Tue Dec 29 17:28:18 2009
From: csamuel at vpac.org (Chris Samuel)
Date: Wed, 30 Dec 2009 12:28:18 +1100 (EST)
Subject: [Beowulf] A change of address..
In-Reply-To: <1799055416.758081262136340551.JavaMail.root@mail.vpac.org>
Message-ID: <1413881953.758131262136498659.JavaMail.root@mail.vpac.org>

Hi all,

As some of you already know I'm leaving VPAC on the 8th
January to take up a post at the University of Melbourne
running some large HPC clusters for the Victorian Life
Sciences Computational Initiative (VLSCI) project.

 http://www.vlsci.unimelb.edu.au/overview.html

Yes, I'm upgrading from a 4 to a 5 letter acronym. ;-)

In preparation for that I'm going to move my many mailing
list memberships from my VPAC address to my home email
address (chris at csamuel.org) as I don't know how well the
UniMelb mail system will cope with mailing lists..

cheers!
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


From vlad at geociencias.unam.mx  Sat Dec 26 20:04:43 2009
From: vlad at geociencias.unam.mx (Vlad Manea)
Date: Sat, 26 Dec 2009 22:04:43 -0600
Subject: [Beowulf] PERC 5/E problems
Message-ID: <4B36DCDB.2030802@geociencias.unam.mx>

Hi,

I have a PERC 5/E card installed on my frontend (Dell PE 2970) that will be
used to connect a MD1000 from Dell.
I have a problem: PERC 5/E is not showing in BIOS.  When the server is 
starting up, I cannot
press Ctrl-R to launch the PowerEdge Expandable RAID Controller BIOS
(the card does not display it's bios boot message saying "hit ctrl-r for 
perc 5...").

I tried a different PCI slot and riser but with no luck.

Is out there anybody that might give a hand fixing this?

Thanks,
V.


From skylar at cs.earlham.edu  Thu Dec 31 09:02:09 2009
From: skylar at cs.earlham.edu (Skylar Thompson)
Date: Thu, 31 Dec 2009 11:02:09 -0600
Subject: [Beowulf] PERC 5/E problems
In-Reply-To: <4B36DCDB.2030802@geociencias.unam.mx>
References: <4B36DCDB.2030802@geociencias.unam.mx>
Message-ID: <4B3CD911.2030705@cs.earlham.edu>

Vlad Manea wrote:
> Hi,
>
> I have a PERC 5/E card installed on my frontend (Dell PE 2970) that
> will be
> used to connect a MD1000 from Dell.
> I have a problem: PERC 5/E is not showing in BIOS.  When the server is
> starting up, I cannot
> press Ctrl-R to launch the PowerEdge Expandable RAID Controller BIOS
> (the card does not display it's bios boot message saying "hit ctrl-r
> for perc 5...").
>
> I tried a different PCI slot and riser but with no luck.
>
> Is out there anybody that might give a hand fixing this?

I can't remember if the Dell BIOS has this option, but some BIOSs allow
you to clear the PCI bus cache. That will trigger a full rescan of all
the cards that are attached and could get it listed in the boot process
again. If the BIOS doesn't have that option, you could try setting the
BIOS clear jumper.

-- 
-- Skylar Thompson (skylar at cs.earlham.edu)
-- http://www.cs.earlham.edu/~skylar/


-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 251 bytes
Desc: OpenPGP digital signature
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091231/38d66cdb/attachment.sig>

From skylar at cs.earlham.edu  Thu Dec 31 09:22:08 2009
From: skylar at cs.earlham.edu (Skylar Thompson)
Date: Thu, 31 Dec 2009 11:22:08 -0600
Subject: [Beowulf] PERC 5/E problems
In-Reply-To: <4B3CDBD4.6000106@geociencias.unam.mx>
References: <4B36DCDB.2030802@geociencias.unam.mx>
	<4B3CD911.2030705@cs.earlham.edu>
	<4B3CDBD4.6000106@geociencias.unam.mx>
Message-ID: <4B3CDDC0.2080305@cs.earlham.edu>

Vlad Manea wrote:
> Thanks all for your replays,
>
> In the end I think I found the problem: It looks like I have the PERC
> model M778G which apparently
> does NOT do RAID (maybe some of you can confirm that :-) ). I was
> thinking (wrongly maybe...) that
> all PERC cards do RAID...
I can't speak to that card specifically, but Dell in the past did sneaky
things like calling a system "RAID-capable", but in order to make it
actually do RAID you'd have to buy a hardware key or daughter card at
some inflated price.

-- 
-- Skylar Thompson (skylar at cs.earlham.edu)
-- http://www.cs.earlham.edu/~skylar/


-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 251 bytes
Desc: OpenPGP digital signature
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091231/65d85f6f/attachment.sig>

From vlad at geociencias.unam.mx  Thu Dec 31 09:13:56 2009
From: vlad at geociencias.unam.mx (Vlad Manea)
Date: Thu, 31 Dec 2009 11:13:56 -0600
Subject: [Beowulf] PERC 5/E problems
In-Reply-To: <4B3CD911.2030705@cs.earlham.edu>
References: <4B36DCDB.2030802@geociencias.unam.mx>
	<4B3CD911.2030705@cs.earlham.edu>
Message-ID: <4B3CDBD4.6000106@geociencias.unam.mx>

Thanks all for your replays,

In the end I think I found the problem: It looks like I have the PERC 
model M778G which apparently
does NOT do RAID (maybe some of you can confirm that :-) ). I was 
thinking (wrongly maybe...) that
all PERC cards do RAID...

Cheers,
Vlad


Skylar Thompson escribi?:
> Vlad Manea wrote:
>   
>> Hi,
>>
>> I have a PERC 5/E card installed on my frontend (Dell PE 2970) that
>> will be
>> used to connect a MD1000 from Dell.
>> I have a problem: PERC 5/E is not showing in BIOS.  When the server is
>> starting up, I cannot
>> press Ctrl-R to launch the PowerEdge Expandable RAID Controller BIOS
>> (the card does not display it's bios boot message saying "hit ctrl-r
>> for perc 5...").
>>
>> I tried a different PCI slot and riser but with no luck.
>>
>> Is out there anybody that might give a hand fixing this?
>>     
>
> I can't remember if the Dell BIOS has this option, but some BIOSs allow
> you to clear the PCI bus cache. That will trigger a full rescan of all
> the cards that are attached and could get it listed in the boot process
> again. If the BIOS doesn't have that option, you could try setting the
> BIOS clear jumper.
>
>   

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091231/94cbf6a9/attachment.html>