From brockp at umich.edu  Mon Mar  1 07:19:43 2010
From: brockp at umich.edu (Brock Palen)
Date: Mon, 1 Mar 2010 10:19:43 -0500
Subject: [Beowulf] which mpi library should I focus on?
In-Reply-To: <cb60cbc41002232140qe5ce76ehfdfe531c186a805f@mail.gmail.com>
References: <13e802631002201049j59e06a9vd8e1e7a05e8a47e5@mail.gmail.com>
	<Pine.LNX.4.64.1002230907330.13403@coffee.psychology.mcmaster.ca>
	<EF026256-2212-408D-BBCE-1B6A26CDE3E1@umich.edu>
	<20100223154639.GB695@sopalepc>
	<cb60cbc41002232140qe5ce76ehfdfe531c186a805f@mail.gmail.com>
Message-ID: <1BB43403-C149-49CE-85D6-33131ED70677@umich.edu>

Just as a follow up to this message the MPICH2 show is now out, if you  
want to hear Bill and Rusty talk about MPICH2, what it does and where  
it came from:
http://www.rce-cast.com/index.php/Podcast/rce-28-mpich2.html

Thanks

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985


On Feb 24, 2010, at 12:40 AM, Sangamesh B wrote:

> Hi,
>
>   I hope you are developing MPI codes and wants to run in cluster  
> environment. If so, I prefer you to use Open MPI. Because,
>
>   Open MPI is well developed and its stable
>   Has a very good FAQ section, where you will get clear your doubts  
> easily.
>   It has a in-built tight-integration method with cluster  
> schedulers- SGE, PBS, LSF etc.
>   It has an option to choose ETHERNET or INFINIBAND network  
> connectivity during run-time.
>
> Thanks,
> Sangamesh
>
> On Tue, Feb 23, 2010 at 9:16 PM, Douglas Guptill <douglas.guptill at dal.ca 
> > wrote:
> On Tue, Feb 23, 2010 at 09:25:45AM -0500, Brock Palen wrote:
>
> > (shameless plug) if you want, listen to our podcast on OpenMPI
> > http://www.rce-cast.com/index.php/Podcast/rce01-openmpi.html
> >
> > The MPICH2 show is recorded (edited it last night, almost done!),  
> and
> > will be released this Saturday Midnight Eastern.
> > If you want to hear the rough cut, to compare to OpenMPI, email me  
> and I
> > will send you the unfinished mp3.
>
> That sounds like a nice pair.  OpenMPI vs MPICH2.
>
> Douglas.
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin  
> Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>


From ljdursi at scinet.utoronto.ca  Mon Mar  1 08:29:49 2010
From: ljdursi at scinet.utoronto.ca (Jonathan Dursi)
Date: Mon, 01 Mar 2010 11:29:49 -0500
Subject: [Beowulf] confidential data on public HPC cluster
Message-ID: <4B8BEB7D.9050904@scinet.utoronto.ca>

Hi;

We're a fairly typical academic HPC centre, and we're starting to have 
users talk to us about using our new clusters for projects that have 
various requirements for keeping data confidential.    We expect these 
to be the first of many requests, so we want to think now about how we 
can and can't help such users.   We have people here quite familiar in 
general cluster security issues, but as is usually the case in academia, 
we're normally concerned about hardening the cluster from the outside, 
and less about protecting the users from each other.   We've started 
doing some research, but presumably people on this list have run into 
these issues in the past and can give us some guidance.

Obviously, the degree to which we and our clusters can be of use to 
these users depend on the details and stringency of their legal, 
contractual, or other requirements.   If even having small fractions of 
the data unencrypted in memory on a node that someone else could login 
to (even if only as root) is not allowed, then I imagine it's going to 
be hard for them to use any machine they don't physically control.   But 
presumably many other users will have less strict conditions on what is 
and isn't allowed.

Are there good discussions of this somewhere?  What resources do you 
point users to when they have such requirements, and what sorts of 
things can we put in place on our end to make life easier for such users 
without imposing new requirements on the rest of our user base?

	- Jonathan

-- 
Jonathan Dursi     <ljdursi at scinet.utoronto.ca>


From jlforrest at berkeley.edu  Mon Mar  1 08:51:37 2010
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Mon, 01 Mar 2010 08:51:37 -0800
Subject: [Beowulf] confidential data on public HPC cluster
In-Reply-To: <4B8BEB7D.9050904@scinet.utoronto.ca>
References: <4B8BEB7D.9050904@scinet.utoronto.ca>
Message-ID: <4B8BF099.6050709@berkeley.edu>

On 3/1/2010 8:29 AM, Jonathan Dursi wrote:

> Are there good discussions of this somewhere? What resources do you
> point users to when they have such requirements, and what sorts of
> things can we put in place on our end to make life easier for such users
> without imposing new requirements on the rest of our user base?

My suggestion is to follow the wit and wisdom of
Nancy Reagan, and "just say no". That is getting
intimate knowledge of all the cracks and crevices
of sensitive/confidential data rules will be a huge
time sink, and will probably take your attention
away from the presumably more enjoyable benefits
of a modern HPC cluster.

Everytime I read about some embarrassing break-in
to a confidential data storage environment, I count
my lucky stars that I don't have any such data.
I know that I couldn't do any better than the
people whose names are now on the front page of
the paper.

Cordially,

-- 
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest at berkeley.edu


From hahn at mcmaster.ca  Mon Mar  1 09:35:07 2010
From: hahn at mcmaster.ca (Mark Hahn)
Date: Mon, 1 Mar 2010 12:35:07 -0500 (EST)
Subject: [Beowulf] confidential data on public HPC cluster
In-Reply-To: <4B8BEB7D.9050904@scinet.utoronto.ca>
References: <4B8BEB7D.9050904@scinet.utoronto.ca>
Message-ID: <Pine.LNX.4.64.1003011210000.31772@coffee.psychology.mcmaster.ca>

> requirements for keeping data confidential.    We expect these to be the

it's critically important to pin down exactly what they mean by that.
for instance, anything involving human subjects, not limited to clinical
data, needs to be blinded.  that's a standard requirement from any 
research-ethics board.

it's also worth going over the basics of permissions, since researchers 
often don't understand what rwxrx-- means ;)

> other requirements.   If even having small fractions of the data unencrypted 
> in memory on a node that someone else could login to (even if only as root) 
> is not allowed, then I imagine it's going to be hard for them to use any 
> machine they don't physically control.   But presumably many other users will 
> have less strict conditions on what is and isn't allowed.

researchers also don't think like a security person: there's no way
someone can expect confidentiality from root unless the machine is 
completely under their control (bare machine + install media, etc).
we have a facility on campus that has data from StatsCan, and does 
indeed go to these sort of lengths.  but that's completely incompatible
with any sort of shared facility.

it's easy to imagine "security theater" which might make people feel better
though.  for instance, one might offer them VM hosting, instead of the 
traditional just-another-unix-user approach.  or even a deal to wipe the 
machine and install from scratch at the begining of the job - reboot when
you're done!  but these are simply making it harder to compromise, and 
IMO would just lead to a tar pit of obfuscation, not real security.
(for instance, compromising a running VM is probably not hard, but tweaking
the image before it runs would be easier.  does the occupant then try to 
validate the integrity of the VM?  how hard is it to intercept that check?
can they then detect the interception?  this applies to installing a node
from media, as well.)

ultimately, someone somewhere needs admin access, so it's not really a
question of whether disclosure is possible, but rather who you trust.
as a sysadmin, I wouldn't be upset about being asked to go through 
a background check, and my employer could obtain bonding for me.

an audit-trail is "post-coital", but may still make sensitive clients more
comfortable (though it's likely to be security theatre as well...)
consider, for instance, if a group's storage is on a separate server,
whose access is limited to specific admins, and whose mountd logs 
are available for the group's perusal.  even setting up jobs to use sshfs
back to the group's own server may make them feel better because they'll
be able to look at the logs (again, not impregnable, just harder to get.)

regards, mark hahn
PS: my apology to anyone allergic to innuendo!


From john.hearns at mclaren.com  Mon Mar  1 10:30:08 2010
From: john.hearns at mclaren.com (Hearns, John)
Date: Mon, 1 Mar 2010 18:30:08 -0000
Subject: [Beowulf] confidential data on public HPC cluster
In-Reply-To: <4B8BEB7D.9050904@scinet.utoronto.ca>
References: <4B8BEB7D.9050904@scinet.utoronto.ca>
Message-ID: <68A57CCFD4005646957BD2D18E60667B0F89B9DF@milexchmb1.mil.tagmclarengroup.com>

I think Mark Hahn has given a lot of good advice here.

It depends on the nature of the data. Is it:

a) industrially confidential

b) clinically confidential (ie patient identifiable data)

c) government confidential (ie government department stats)

d) Nuclear eyes-only

If (d) you're on your own.


Actually, the way you should look at this is 
"what happens if this data does leak out" and this depends on who gets
hold of it

Data from (c) leaking would probably cause little lasting or real damage
- but the headlines in the press
along the lines of "Government cannot keep data safe" are pretty
embarrassing.

I think the response is to demonstrate you took reasonable steps to
secure the data.

The audit trail concept is good.


The contents of this email are confidential and for the exclusive use of the intended recipient.  If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy.


From james.p.lux at jpl.nasa.gov  Mon Mar  1 10:57:45 2010
From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C))
Date: Mon, 1 Mar 2010 10:57:45 -0800
Subject: [Beowulf] confidential data on public HPC cluster
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0F89B9DF@milexchmb1.mil.tagmclarengroup.com>
References: <4B8BEB7D.9050904@scinet.utoronto.ca>
	<68A57CCFD4005646957BD2D18E60667B0F89B9DF@milexchmb1.mil.tagmclarengroup.com>
Message-ID: <ECE7A93BD093E1439C20020FBE87C47FED2B93EDEA@ALTPHYEMBEVSP20.RES.AD.JPL>

Don't forget export controls, too. (both ITAR, internationally, and also (at least for US) Commerce department

> -----Original Message-----
> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Hearns, John
> Sent: Monday, March 01, 2010 10:30 AM
> To: beowulf at beowulf.org
> Subject: RE: [Beowulf] confidential data on public HPC cluster
> 
> I think Mark Hahn has given a lot of good advice here.
> 
> It depends on the nature of the data. Is it:
> 
> a) industrially confidential
> 
> b) clinically confidential (ie patient identifiable data)
> 
> c) government confidential (ie government department stats)
> 
> d) Nuclear eyes-only
> 
> If (d) you're on your own.
> 
> 
> Actually, the way you should look at this is
> "what happens if this data does leak out" and this depends on who gets
> hold of it
> 
> Data from (c) leaking would probably cause little lasting or real damage
> - but the headlines in the press
> along the lines of "Government cannot keep data safe" are pretty
> embarrassing.
> 
> I think the response is to demonstrate you took reasonable steps to
> secure the data.
> 
> The audit trail concept is good.
> 
> 
> The contents of this email are confidential and for the exclusive use of the intended recipient.  If
> you receive this email in error you should not copy it, retransmit it, use it or disclose its contents
> but should return it to the sender immediately and delete your copy.
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf


From ljdursi at scinet.utoronto.ca  Mon Mar  1 11:08:43 2010
From: ljdursi at scinet.utoronto.ca (Jonathan Dursi)
Date: Mon, 01 Mar 2010 14:08:43 -0500
Subject: [Beowulf] confidential data on public HPC cluster
In-Reply-To: <ECE7A93BD093E1439C20020FBE87C47FED2B93EDEA@ALTPHYEMBEVSP20.RES.AD.JPL>
References: <4B8BEB7D.9050904@scinet.utoronto.ca>	<68A57CCFD4005646957BD2D18E60667B0F89B9DF@milexchmb1.mil.tagmclarengroup.com>
	<ECE7A93BD093E1439C20020FBE87C47FED2B93EDEA@ALTPHYEMBEVSP20.RES.AD.JPL>
Message-ID: <4B8C10BB.1080907@scinet.utoronto.ca>

These are all good things to keep in mind.

There must be people out there with users who do biomed work with its 
attendant confidentiality issues, or users who work on commercial 
confidential data sets -- engineering or otherwise.   What do those 
users do on your systems, and have you had to implement things on the 
system side to help them?

	- Jonathan
-- 
Jonathan Dursi     <ljdursi at scinet.utoronto.ca>


From ashley at pittman.co.uk  Mon Mar  1 15:24:11 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Mon, 1 Mar 2010 23:24:11 +0000
Subject: [Beowulf] confidential data on public HPC cluster
In-Reply-To: <4B8C10BB.1080907@scinet.utoronto.ca>
References: <4B8BEB7D.9050904@scinet.utoronto.ca>	<68A57CCFD4005646957BD2D18E60667B0F89B9DF@milexchmb1.mil.tagmclarengroup.com>
	<ECE7A93BD093E1439C20020FBE87C47FED2B93EDEA@ALTPHYEMBEVSP20.RES.AD.JPL>
	<4B8C10BB.1080907@scinet.utoronto.ca>
Message-ID: <9175CD08-82D5-4848-88B4-DBF253BE9380@pittman.co.uk>


On 1 Mar 2010, at 19:08, Jonathan Dursi wrote:

> These are all good things to keep in mind.
> 
> There must be people out there with users who do biomed work with its attendant confidentiality issues, or users who work on commercial confidential data sets -- engineering or otherwise.

When we put this very question to the medical ethics board the conclusion was that it was ok to send patient data (3d scans in this case) over the wider academic network as long as the data was not traceable to an individual patient.  I don't think confidentially of data was ever assumed as it's very difficult to do with shared resources, merely care was taken that the data would not be of potential use to 3rd parties.  As I recall the concern was about the data initially leaving the hospital, once it had done that there was little distinction between it being on a cluster or traveling over the wire somewhere to get there.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk


From gmpc at sanger.ac.uk  Tue Mar  2 02:25:56 2010
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Tue, 02 Mar 2010 10:25:56 +0000
Subject: [Beowulf] confidential data on public HPC cluster
In-Reply-To: <9175CD08-82D5-4848-88B4-DBF253BE9380@pittman.co.uk>
References: <4B8BEB7D.9050904@scinet.utoronto.ca>	<68A57CCFD4005646957BD2D18E60667B0F89B9DF@milexchmb1.mil.tagmclarengroup.com>	<ECE7A93BD093E1439C20020FBE87C47FED2B93EDEA@ALTPHYEMBEVSP20.RES.AD.JPL>	<4B8C10BB.1080907@scinet.utoronto.ca>
	<9175CD08-82D5-4848-88B4-DBF253BE9380@pittman.co.uk>
Message-ID: <4B8CE7B4.6010304@sanger.ac.uk>

Ashley Pittman wrote:
> On 1 Mar 2010, at 19:08, Jonathan Dursi wrote:
> 
>> These are all good things to keep in mind.
>>
>> There must be people out there with users who do biomed work with its attendant confidentiality issues, 
> or users who work on commercial confidential data sets -- engineering or otherwise.

Hi all,


The usual answer you will get from lawyers and compliance officers is
that:

"You should take reasonable care to ensure that data is kept appropriately."


However, most (all?) biomedical projects should have some sort of
data-access agreement (DAA). That document states what  patients have
given consent for, who should have access to the data and under what
conditions. That should give you a good starting point for working out
what your security policy should be. (If you are going to be doing
systems stuff for the group, you should also have signed the agreement.)

Generally speaking, the greater the chance of being to trace data back
to a specific individual, then the more paranoid you have to be about
the data. It is up to the primary investigators, lawyers, compliance
officers and sys-admins to turn that into a security policy.

At Sanger, we run through the whole range of security policies. We have
projects that deal routinely with full medical histories. They run on a
set of machines physically separated from the rest of our datacentre
infrastructure, with data held in encrypted databases with 2 factor
logins. Data is not allowed to be removed from that setting.

We have other projects that are using anonymised datasets, and that data
can be held on our main cluster with the appropriate unix access controls.

In the future we will probably have projects whose security requirements
would be somewhere in the middle of those two extremes. The key do
dealing with those projects are the words "reasonable care".

Would we worry about data being kept un-encrypted in memory? Probably
not.  Would we put in place an automated audit process to ensure data
kept on filesystems have appropriate ACLs set? Probably yes.

And remember, if someone goes out of their way to get access to data
that they should not, then that is a contravention of the AUP and/or
local computer crime laws.

(You do make your users sign an AUP, right...?)


There are some example DAAs below.

https://www.wtccc.org.uk/info/access_to_data_samples.shtml

https://www.wtccc.org.uk/docs/Data_Access_Agreement_v17.pdf

http://www.icgc.org/icgc_document/policies_and_guidelines/informed_consent_access_and_ethical_oversight

Cheers,

Guy

-- 
Dr. Guy Coates,  Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 496802


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


From rigved.sharma123 at gmail.com  Mon Mar  1 02:54:33 2010
From: rigved.sharma123 at gmail.com (rigved sharma)
Date: Mon, 1 Mar 2010 16:24:33 +0530
Subject: [Beowulf] error while make mpijava on amd_64
Message-ID: <e3ccefa01003010254p60f77076t749cfc3bda092da2@mail.gmail.com>

hi,
i am getting this error when i do make for mpijava:

make[2]: Leaving directory `/misc/local/mpiJAVA/mpiJava/src/Java'
--- Making C
make[2]: Entering directory `/misc/local/mpiJAVA/mpiJava/src/C'
/usr/local/mpich-1.2.6/bin/mpicc -c   -I/usr/java/j2sdk1.4.2/include
-I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include  -o
mpi_MPI.o mpi_MPI.c
/usr/local/mpich-1.2.6/bin/mpicc -c   -I/usr/java/j2sdk1.4.2/include
-I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include  -o
mpi_Comm.o mpi_Comm
.c
/usr/local/mpich-1.2.6/bin/mpicc -c   -I/usr/java/j2sdk1.4.2/include
-I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include  -o
mpi_Op.o mpi_Op.c
/usr/local/mpich-1.2.6/bin/mpicc -c   -I/usr/java/j2sdk1.4.2/include
-I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include  -o
mpi_Datatype.o mpi_
Datatype.c
/usr/local/mpich-1.2.6/bin/mpicc -c   -I/usr/java/j2sdk1.4.2/include
-I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include  -o
mpi_Intracomm.o mpi
_Intracomm.c
/usr/local/mpich-1.2.6/bin/mpicc -c   -I/usr/java/j2sdk1.4.2/include
-I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include  -o
mpi_Intercomm.o mpi
_Intercomm.c
/usr/local/mpich-1.2.6/bin/mpicc -c   -I/usr/java/j2sdk1.4.2/include
-I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include  -o
mpi_Cartcomm.o mpi_
Cartcomm.c
/usr/local/mpich-1.2.6/bin/mpicc -c   -I/usr/java/j2sdk1.4.2/include
-I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include  -o
mpi_Graphcomm.o mpi
_Graphcomm.c
/usr/local/mpich-1.2.6/bin/mpicc -c   -I/usr/java/j2sdk1.4.2/include
-I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include  -o
mpi_Group.o mpi_Gro
up.c
/usr/local/mpich-1.2.6/bin/mpicc -c   -I/usr/java/j2sdk1.4.2/include
-I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include  -o
mpi_Status.o mpi_St
atus.c
mpi_Status.c:244:8: warning: extra tokens at end of #endif directive
/usr/local/mpich-1.2.6/bin/mpicc -c   -I/usr/java/j2sdk1.4.2/include
-I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include  -o
mpi_Request.o mpi_R
equest.c
/usr/local/mpich-1.2.6/bin/mpicc -c   -I/usr/java/j2sdk1.4.2/include
-I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include  -o
mpi_Errhandler.o mp
i_Errhandler.c
rm -f ../../lib/libmpijava.so
/usr/local/mpich-1.2.6/bin/mpicc  -o ../../lib/libmpijava.so \
                -L/usr/local/mpich-1.2.6/lib mpi_MPI.o       mpi_Comm.o
mpi_Op.o        mpi_Datatype.o mpi_Intracomm.o mpi_Intercomm.o
mpi_Cartcomm.o  mpi_Gr
aphcomm.o mpi_Group.o     mpi_Status.o mpi_Request.o mpi_Errhandler.o ;
/usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../lib64/crt1.o(.text+0x21):
In function `_start':
: undefined reference to `main'
collect2: ld returned 1 exit status
make[2]: *** [../../lib/libmpi.so] Error 1
make[2]: Leaving directory `/misc/local/mpiJAVA/mpiJava/src/C'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/misc/local/mpiJAVA/mpiJava/src'
make: *** [all] Error 2
-----------------------------------


uname -a : Linux,testmc,2.6.9-42.0.2.EL_lustre.1.4.7.3smp #1 SMP 2006 x86_64
x86_64 x86_64 GNU/Linux,

mpich :/usr/local/mpich-1.2.6
java : /usr/java/j2sdk1.4.2

both are part of path variable...what is wrong?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20100301/10ab0a62/attachment.html>

From chenyon1 at iit.edu  Mon Mar  1 12:38:38 2010
From: chenyon1 at iit.edu (Yong Chen)
Date: Mon, 01 Mar 2010 14:38:38 -0600
Subject: [Beowulf] [hpc-announce] Submission deadline of P2S2-2010 extended
	to 3/10/2010
Message-ID: <fb19f4d419cc.4b8bd16e@iit.edu>

[Apologies if you got multiple copies of this email. If you'd like to
opt out of these announcements, information on how to unsubscribe is
available at the bottom of this email.]

Dear Colleague:

We would like to inform you that the paper submission deadline of the Third International Workshop 
on Parallel Programming Models and Systems Software for High-end Computing (P2S2)
has been extended to March 10th, 2010.

A full CFP can be found below. 

Thank you.


CALL FOR PAPERS
===============

Third International Workshop on Parallel Programming Models 
and Systems Software for High-end Computing (P2S2)
Sept. 13th, 2010

To be held in conjunction with ICPP-2010: The 39th International 
Conference on Parallel Processing, Sept. 13-16, 2010, San Diego, CA, USA 

Website: http://www.mcs.anl.gov/events/workshops/p2s2

SCOPE
-----
The goal of this workshop is to bring together researchers and
practitioners in parallel programming models and systems software for
high-end computing systems. Please join us in a discussion of new ideas,
experiences, and the latest trends in these areas at the workshop.


TOPICS OF INTEREST
------------------
The focus areas for this workshop include, but are not limited to:

    *  Systems software for high-end scientific and enterprise computing architectures
          o Communication sub-subsystems for high-end computing
          o High-performance file and storage systems
          o Fault-tolerance techniques and implementations
          o Efficient and high-performance virtualization and other management 
            mechanisms for high-end computing

    * Programming models and their high-performance implementations
          o MPI, Sockets, OpenMP, Global Arrays, X10, UPC, Chapel, Fortress and others
          o Hybrid Programming Models

    * Tools for Management, Maintenance, Coordination and Synchronization
          o Software for Enterprise Data-centers using Modern Architectures
          o Job scheduling libraries
          o Management libraries for large-scale system
          o Toolkits for process and task coordination on modern platforms

    * Performance evaluation, analysis and modeling of emerging computing platforms


PROCEEDINGS
-----------
Proceedings of this workshop will be published in CD format and will be available 
at the conference (together with the ICPP conference proceedings) .


SUBMISSION INSTRUCTIONS
-----------------------
Submissions should be in PDF format in U.S. Letter size paper. They
should not exceed 8 pages (all inclusive). Submissions will be judged
based on relevance, significance, originality, correctness and clarity.
Please visit workshop website at: http://www.mcs.anl.gov/events/workshops/p2s2/
for the submission link.


JOURNAL SPECIAL ISSUE
---------------------
The best papers of P2S2'10 will be included in a special issue of the International 
Journal of High Performance Computing Applications (IJHPCA) on Programming Models, 
Software and Tools for High-End Computing. 


IMPORTANT DATES
---------------
Paper Submission: March 10th, 2010
Author Notification: May 3rd, 2010 
Camera Ready: June 14th, 2010


PROGRAM CHAIRS
--------------
  * Pavan Balaji, Argonne National Laboratory
  * Abhinav Vishnu, Pacific Northwest National Laboratory


PUBLICITY CHAIR
---------------
  * Yong Chen, Illinois Institute of Technology


STEERING COMMITTEE
------------------
  * William D. Gropp, University of Illinois Urbana-Champaign
  * Dhabaleswar K. Panda, Ohio State University
  * Vijay Saraswat, IBM Research


PROGRAM COMMITTEE
-----------------
  * Ahmad Afsahi, Queen's University
  * George Almasi, IBM Research 
  * Taisuke Boku, Tsukuba University
  * Ron Brightwell, Sandia National Laboratory
  * Franck Cappello, INRIA, France
  * Yong Chen, Illinois Institute of Technology
  * Ada Gavrilovska, Georgia Tech
  * Torsten Hoefler, Indiana University
  * Zhiyi Huang, University of Otago, New Zealand
  * Hyun-Wook Jin, Konkuk University, Korea
  * Zhiling Lan, Illinois Institute of Technology
  * Doug Lea, State University of New York at Oswego
  * Jiuxing Liu, IBM Research
  * Heshan Lin, Virginia Tech
  * Guillaume Mercier, INRIA, France
  * Scott Pakin, Los Alamos National Laboratory
  * Fabrizio Petrini, IBM Research
  * Bronis de Supinksi, Lawrence Livermore National Laboratory
  * Sayantan Sur, Ohio State University
  * Rajeev Thakur, Argonne National Laboratory
  * Vinod Tipparaju, Oak Ridge National Laboratory
  * Jesper Traff, NEC, Europe
  * Weikuan Yu, Auburn University


If you have any questions, please contact us at p2s2-chairs at mcs.anl.gov

========================================================================
You can unsubscribe from the hpc-announce mailing list here:
https://lists.mcs.anl.gov/mailman/listinfo/hpc-announce
========================================================================


From hahn at mcmaster.ca  Tue Mar  2 20:37:06 2010
From: hahn at mcmaster.ca (Mark Hahn)
Date: Tue, 2 Mar 2010 23:37:06 -0500 (EST)
Subject: [Beowulf] error while make mpijava on amd_64
In-Reply-To: <e3ccefa01003010254p60f77076t749cfc3bda092da2@mail.gmail.com>
References: <e3ccefa01003010254p60f77076t749cfc3bda092da2@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.1003022312210.9588@coffee.psychology.mcmaster.ca>

> i am getting this error when i do make for mpijava:

isn't mpijava very old?

> /usr/local/mpich-1.2.6/bin/mpicc  -o ../../lib/libmpijava.so \
>                -L/usr/local/mpich-1.2.6/lib mpi_MPI.o       mpi_Comm.o
> mpi_Op.o        mpi_Datatype.o mpi_Intracomm.o mpi_Intercomm.o
> mpi_Cartcomm.o  mpi_Gr
> aphcomm.o mpi_Group.o     mpi_Status.o mpi_Request.o mpi_Errhandler.o ;
> /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../lib64/crt1.o(.text+0x21):
> In function `_start':
> : undefined reference to `main'

which is gcc's way of saying "I'm trying to link an executable not a shared
library."  it needs -shared in there.  likely it also needs -fPIC when 
compiling the .o files.  or maybe just stick to static archives, which 
are generally simpler...


From hahn at mcmaster.ca  Wed Mar  3 12:05:01 2010
From: hahn at mcmaster.ca (Mark Hahn)
Date: Wed, 3 Mar 2010 15:05:01 -0500 (EST)
Subject: [Beowulf] error while make mpijava on amd_64
In-Reply-To: <e3ccefa01003031149u4c08aa57le42357939bc9dde0@mail.gmail.com>
References: <e3ccefa01003010254p60f77076t749cfc3bda092da2@mail.gmail.com> 
	<Pine.LNX.4.64.1003022312210.9588@coffee.psychology.mcmaster.ca>
	<e3ccefa01003031149u4c08aa57le42357939bc9dde0@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.1003031457160.27681@coffee.psychology.mcmaster.ca>

>    we r not getting latest free download version for mpijava for linux for

this is version 1.2.5 circa jan 2003, right?  right away this should 
set off some alarms, since any maintained package would have had some 
updates since then.  it's slightly unreasonable to expect such an old
package, especially one which is inherently "glueware" to be viable 
after sitting so long.

> AMD 64 bit. Also can u suggest the solution n for the error i forwarded.ur
> explaination is not very clear 2 me..:(

using a static library would be easy ("ar r mpijava.a *.o"), but now 
that I give it another thought, you probably want this to act as a 
java extension, which probably requires being a shared library.

in principle, to get the .so to work, you need to add -fPIC to each 
of the component compiles (which produce .o files), then add -shared
to the last link-like stage which combines the .o files into a .so file.
you'll have to look at the Makefile to find out where to add these 
flags.  offhand, I'd think you should add -fPIC to CFLAGS and -static 
to LDFLAGS (but I don't have a copy of the Makefile.)

as a testament to the whimsical nature of getting 2003-vintage software
to compile, I can't even find a working download link for mpijava.
web sites also wither and die if not cared for...

-mark


From mm at yuhu.biz  Thu Mar  4 04:27:38 2010
From: mm at yuhu.biz (Marian Marinov)
Date: Thu, 4 Mar 2010 14:27:38 +0200
Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping
In-Reply-To: <c4d69731001191506y3a13ec6ch63d473bb2a0b42b6@mail.gmail.com>
References: <c4d69731001171307h41c84743geb0b33aff038680b@mail.gmail.com>
	<4B56246E.2050505@abdn.ac.uk>
	<c4d69731001191506y3a13ec6ch63d473bb2a0b42b6@mail.gmail.com>
Message-ID: <201003041427.47481.mm@yuhu.biz>

On Wednesday 20 January 2010 01:06:27 Rahul Nabar wrote:
> On Tue, Jan 19, 2010 at 3:30 PM, Tony Travis <a.travis at abdn.ac.uk> wrote:
> > I responded to Rahul who started this thread because his requirements
> > seemed to be similar to mine: i.e. a small-scale DIY Beowulf cluster. In
> > this context, every penny counts and we do not throw things away until
> > they are actually dead: Old servers become new compute nodes, and so on.
> > I think that lot of people reading this list are interested in running
> > small Beowulf clusters for relatively small projects, like me. I've found
> > the Beowulf list to be a mine of useful information, but we are not all
> > running huge Beowulf clusters or supporting them commerically.
> 
> I don't know about the others on the list, but you describe my
> situation pretty accurately Tony! :) Small budget, primitive hardware
> that's rarely retired etc. Sounds familiar.
> 

Linux-Mag had a very good article about Software RAID0 vs LVM Stripe 
performance:

http://www.linux-mag.com/cache/7582/1.html

You should read it.

-- 
Best regards,
Marian Marinov
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part.
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20100304/0628e80c/attachment.sig>

From rigved.sharma123 at gmail.com  Wed Mar  3 11:49:42 2010
From: rigved.sharma123 at gmail.com (rigved sharma)
Date: Thu, 4 Mar 2010 01:19:42 +0530
Subject: [Beowulf] error while make mpijava on amd_64
In-Reply-To: <Pine.LNX.4.64.1003022312210.9588@coffee.psychology.mcmaster.ca>
References: <e3ccefa01003010254p60f77076t749cfc3bda092da2@mail.gmail.com>
	<Pine.LNX.4.64.1003022312210.9588@coffee.psychology.mcmaster.ca>
Message-ID: <e3ccefa01003031149u4c08aa57le42357939bc9dde0@mail.gmail.com>

Hi Mark/Friends,

    we r not getting latest free download version for mpijava for linux for
AMD 64 bit. Also can u suggest the solution n for the error i forwarded.ur
explaination is not very clear 2 me..:(

On Wed, Mar 3, 2010 at 10:07 AM, Mark Hahn <hahn at mcmaster.ca> wrote:

> i am getting this error when i do make for mpijava:
>>
>
> isn't mpijava very old?
>
>
>  /usr/local/mpich-1.2.6/bin/mpicc  -o ../../lib/libmpijava.so \
>>               -L/usr/local/mpich-1.2.6/lib mpi_MPI.o       mpi_Comm.o
>> mpi_Op.o        mpi_Datatype.o mpi_Intracomm.o mpi_Intercomm.o
>> mpi_Cartcomm.o  mpi_Gr
>> aphcomm.o mpi_Group.o     mpi_Status.o mpi_Request.o mpi_Errhandler.o ;
>>
>> /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../lib64/crt1.o(.text+0x21):
>> In function `_start':
>> : undefined reference to `main'
>>
>
> which is gcc's way of saying "I'm trying to link an executable not a shared
> library."  it needs -shared in there.  likely it also needs -fPIC when
> compiling the .o files.  or maybe just stick to static archives, which are
> generally simpler...
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20100304/1ae89c02/attachment.html>

From mdidomenico4 at gmail.com  Fri Mar  5 07:23:47 2010
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Fri, 5 Mar 2010 10:23:47 -0500
Subject: [Beowulf] copying data between clusters
Message-ID: <e75d22a91003050723n678ba3dcj8028ba427ebdb677@mail.gmail.com>

How does one copy large (20TB) amounts of data from one cluster to another?

Assuming that each node in the cluster can only do about 30MB/sec
between clusters and i want to preserve the uid/gid/timestamps, etc

I know how i do it, but i'm curious what methods other people use...

Just a general survey...


From landman at scalableinformatics.com  Fri Mar  5 08:00:03 2010
From: landman at scalableinformatics.com (Joe Landman)
Date: Fri, 05 Mar 2010 11:00:03 -0500
Subject: [Beowulf] copying data between clusters
In-Reply-To: <e75d22a91003050723n678ba3dcj8028ba427ebdb677@mail.gmail.com>
References: <e75d22a91003050723n678ba3dcj8028ba427ebdb677@mail.gmail.com>
Message-ID: <4B912A83.5090703@scalableinformatics.com>

Michael Di Domenico wrote:
> How does one copy large (20TB) amounts of data from one cluster to another?
> 
> Assuming that each node in the cluster can only do about 30MB/sec
> between clusters and i want to preserve the uid/gid/timestamps, etc
> 
> I know how i do it, but i'm curious what methods other people use...

I am biased of course, but Fedex-net with one of these: 
http://scalableinformatics.com/jackrabbit

1GB @ 30 MB/s is about 33s.  1TB @ 30 MB/s is about 33000s.  Or more 
than 1/3 of a day.  20TB @ 30 MB/s ... you are looking at ~7 days to write.

If you have a 1GB/s disk write speed (less than the above unit can do), 
1TB takes ~1000s, 20TB takes 20000s, about 1/4 of a day.

If the clusters are close enough (same data center) this could be a 
shared storage but you will need a fast network between them.  If the 
clusters are far enough to avoid direct connection, chances are 30 MB/s 
may be optimistic on getting data between them.

BTW: 30 MB/s sounds suspiciously like either a) 1GbE sustained NFS speed 
for some nodes or b) the speed of an IDE drive.

> 
> Just a general survey...
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


From kyron at neuralbs.com  Fri Mar  5 08:18:54 2010
From: kyron at neuralbs.com (kyron)
Date: Fri, 05 Mar 2010 11:18:54 -0500
Subject: [Beowulf] copying data between clusters
In-Reply-To: <4B912A83.5090703@scalableinformatics.com>
References: <e75d22a91003050723n678ba3dcj8028ba427ebdb677@mail.gmail.com>
	<4B912A83.5090703@scalableinformatics.com>
Message-ID: <58a7213ba2ad240c2da27b8d33311f57@localhost>

On Fri, 05 Mar 2010 11:00:03 -0500, Joe Landman
<landman at scalableinformatics.com> wrote:
> Michael Di Domenico wrote:
>> How does one copy large (20TB) amounts of data from one cluster to
>> another?
>> 
>> Assuming that each node in the cluster can only do about 30MB/sec
>> between clusters and i want to preserve the uid/gid/timestamps, etc
>> 
>> I know how i do it, but i'm curious what methods other people use...

Could you clarify? Are-you actually sending from NodeXX-clusterA to
NodeXX-ClusterB ? Are-we to assume aggregate bandwidth of Node*BW (as long
as you don't saturate the switch fabric)? Also, given my comment below, I
am assuming the 20TB of data is actually segmented (20TB/NodeCount) across
the nodes and not 20TB*NodeCount.

> I am biased of course, but Fedex-net with one of these: 
> http://scalableinformatics.com/jackrabbit
> 
> 1GB @ 30 MB/s is about 33s.  1TB @ 30 MB/s is about 33000s.  Or more 
> than 1/3 of a day.  20TB @ 30 MB/s ... you are looking at ~7 days to
write.
> 
> If you have a 1GB/s disk write speed (less than the above unit can do), 
> 1TB takes ~1000s, 20TB takes 20000s, about 1/4 of a day.
> 
> If the clusters are close enough (same data center) this could be a 
> shared storage but you will need a fast network between them.  If the 
> clusters are far enough to avoid direct connection, chances are 30 MB/s 
> may be optimistic on getting data between them.
> 
> BTW: 30 MB/s sounds suspiciously like either a) 1GbE sustained NFS speed

> for some nodes or b) the speed of an IDE drive.

Given I haven't seen single 20TB drives out there yet, I doubt it to be
the case. I wouldn't throw in NFS as a limiting factor (just yet) as I have
been able to have sustained 250MB/s data transfer rates (2xGigE using
channel bonding). And this figure is without jumbo frames so I do have some
protocol overhead loss. The sending server is a PERC 5/i raid with
4*300G*15kRPM drives while the receiving well...was loading onto RAM ;)


Eric Thibodeau


From jmdavis1 at vcu.edu  Fri Mar  5 08:22:14 2010
From: jmdavis1 at vcu.edu (Mike Davis)
Date: Fri, 05 Mar 2010 11:22:14 -0500
Subject: [Beowulf] copying data between clusters
In-Reply-To: <e75d22a91003050723n678ba3dcj8028ba427ebdb677@mail.gmail.com>
References: <e75d22a91003050723n678ba3dcj8028ba427ebdb677@mail.gmail.com>
Message-ID: <4B912FB6.5010109@vcu.edu>

Michael Di Domenico wrote:
> How does one copy large (20TB) amounts of data from one cluster to another?
>
> Assuming that each node in the cluster can only do about 30MB/sec
> between clusters and i want to preserve the uid/gid/timestamps, etc
>   
If the clusters are co-lo I wouldn't copy I would use shared storage. If 
they are not co-located I would use patience.

Seriously though, for a one time copy, I would consider copying to an 
external system and then physically moving that system. To do this and 
preserve ownerships you will need to duplicate accounts and groups.


From landman at scalableinformatics.com  Fri Mar  5 08:27:22 2010
From: landman at scalableinformatics.com (Joe Landman)
Date: Fri, 05 Mar 2010 11:27:22 -0500
Subject: [Beowulf] copying data between clusters
In-Reply-To: <58a7213ba2ad240c2da27b8d33311f57@localhost>
References: <e75d22a91003050723n678ba3dcj8028ba427ebdb677@mail.gmail.com>
	<4B912A83.5090703@scalableinformatics.com>
	<58a7213ba2ad240c2da27b8d33311f57@localhost>
Message-ID: <4B9130EA.8000100@scalableinformatics.com>

kyron wrote:

> Given I haven't seen single 20TB drives out there yet, I doubt it to be
> the case. I wouldn't throw in NFS as a limiting factor (just yet) as I have

I was commenting on the 30 MB/s figure.  Not whether or not he had 20TB 
attached to it (though if he did ... that would be painful).

> been able to have sustained 250MB/s data transfer rates (2xGigE using
> channel bonding). And this figure is without jumbo frames so I do have some
> protocol overhead loss. The sending server is a PERC 5/i raid with
> 4*300G*15kRPM drives while the receiving well...was loading onto RAM ;)

We are getting sustained 1+GB/s over 10GbE with NFS on a per unit basis. 
  For IB its somewhat faster.  Backing store is able to handle this 
easily.

I think Michael may be thinking about the performance of a single node 
GbE or IDE rather than the necessary r/w performance to populate 20+ TB 
of data for data motion.

> 
> 
> Eric Thibodeau


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


From kyron at neuralbs.com  Fri Mar  5 08:30:44 2010
From: kyron at neuralbs.com (kyron)
Date: Fri, 05 Mar 2010 11:30:44 -0500
Subject: [Beowulf] copying data between clusters
In-Reply-To: <4B912FB6.5010109@vcu.edu>
References: <e75d22a91003050723n678ba3dcj8028ba427ebdb677@mail.gmail.com>
	<4B912FB6.5010109@vcu.edu>
Message-ID: <dce74932fd32499ef42147e4b066dfc4@localhost>

On Fri, 05 Mar 2010 11:22:14 -0500, Mike Davis <jmdavis1 at vcu.edu> wrote:
> Michael Di Domenico wrote:
>> How does one copy large (20TB) amounts of data from one cluster to
>> another?
>>
>> Assuming that each node in the cluster can only do about 30MB/sec
>> between clusters and i want to preserve the uid/gid/timestamps, etc
>>   
> If the clusters are co-lo I wouldn't copy I would use shared storage. If

> they are not co-located I would use patience.
> 
> Seriously though, for a one time copy, I would consider copying to an 
> external system and then physically moving that system. To do this and 
> preserve ownerships you will need to duplicate accounts and groups.


...and we are all assuming non-compressibility; otherwise, use pbzip2 ;) 


From akshar.bhosale at gmail.com  Thu Mar  4 11:14:22 2010
From: akshar.bhosale at gmail.com (akshar bhosale)
Date: Fri, 5 Mar 2010 00:44:22 +0530
Subject: [Beowulf] error while make mpijava on amd_64
In-Reply-To: <Pine.LNX.4.64.1003031457160.27681@coffee.psychology.mcmaster.ca>
References: <e3ccefa01003010254p60f77076t749cfc3bda092da2@mail.gmail.com>
	<Pine.LNX.4.64.1003022312210.9588@coffee.psychology.mcmaster.ca>
	<e3ccefa01003031149u4c08aa57le42357939bc9dde0@mail.gmail.com>
	<Pine.LNX.4.64.1003031457160.27681@coffee.psychology.mcmaster.ca>
Message-ID: <bf0758a31003041114x59830b98gf97efb84db3c4d4c@mail.gmail.com>

Hi Mark,

Many thanks 2 u..

Regards,
Rigved

On Thu, Mar 4, 2010 at 1:35 AM, Mark Hahn <hahn at mcmaster.ca> wrote:

>   we r not getting latest free download version for mpijava for linux for
>>
>
> this is version 1.2.5 circa jan 2003, right?  right away this should set
> off some alarms, since any maintained package would have had some updates
> since then.  it's slightly unreasonable to expect such an old
> package, especially one which is inherently "glueware" to be viable after
> sitting so long.
>
>
>  AMD 64 bit. Also can u suggest the solution n for the error i forwarded.ur
>> explaination is not very clear 2 me..:(
>>
>
> using a static library would be easy ("ar r mpijava.a *.o"), but now that I
> give it another thought, you probably want this to act as a java extension,
> which probably requires being a shared library.
>
> in principle, to get the .so to work, you need to add -fPIC to each of the
> component compiles (which produce .o files), then add -shared
> to the last link-like stage which combines the .o files into a .so file.
> you'll have to look at the Makefile to find out where to add these flags.
>  offhand, I'd think you should add -fPIC to CFLAGS and -static to LDFLAGS
> (but I don't have a copy of the Makefile.)
>
> as a testament to the whimsical nature of getting 2003-vintage software
> to compile, I can't even find a working download link for mpijava.
> web sites also wither and die if not cared for...
>
> -mark
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20100305/f4c10eaf/attachment.html>

From wrankin.1 at gmail.com  Fri Mar  5 07:59:08 2010
From: wrankin.1 at gmail.com (Bill Rankin)
Date: Fri, 5 Mar 2010 10:59:08 -0500
Subject: [Beowulf] copying data between clusters
In-Reply-To: <ea17197c1003050758x405fed3bkac14ca700f558d76@mail.gmail.com>
References: <e75d22a91003050723n678ba3dcj8028ba427ebdb677@mail.gmail.com>
	<ea17197c1003050758x405fed3bkac14ca700f558d76@mail.gmail.com>
Message-ID: <ea17197c1003050759y465c893ckb218fd6a16723843@mail.gmail.com>

Umm, you have your network guys pull a fiber run (or two) from your
cluster's file server over to the other cluster's core network switch?

Alternately, you unbolt and pull the shelf of FC disks out of the rack, put
them on a cart and wheel them over to the other cluster's filer.
(1/2 :-)

It's an ill defined problem.

What's your network topology?  Per-node bandwidth is pretty meaningless if
you are oversubscribed (and most clusters are).  What's the biggest pipe
between cluster A and cluster B?

-bill


On Fri, Mar 5, 2010 at 10:23 AM, Michael Di Domenico <mdidomenico4 at gmail.com
> wrote:

> How does one copy large (20TB) amounts of data from one cluster to another?
>
> Assuming that each node in the cluster can only do about 30MB/sec
> between clusters and i want to preserve the uid/gid/timestamps, etc
>
> I know how i do it, but i'm curious what methods other people use...
>
> Just a general survey...
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>


-- 
Bill Rankin
wrankin1 at gmail.com


-- 
Bill Rankin
wrankin1 at gmail.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20100305/b1a51193/attachment.html>

From mdidomenico4 at gmail.com  Fri Mar  5 09:32:37 2010
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Fri, 5 Mar 2010 12:32:37 -0500
Subject: [Beowulf] copying data between clusters
In-Reply-To: <dce74932fd32499ef42147e4b066dfc4@localhost>
References: <e75d22a91003050723n678ba3dcj8028ba427ebdb677@mail.gmail.com>
	<4B912FB6.5010109@vcu.edu>
	<dce74932fd32499ef42147e4b066dfc4@localhost>
Message-ID: <e75d22a91003050932h2153287bvdef3e3df8accfac@mail.gmail.com>

As i expect from the smartest sysadmins on the planet, everyone has
over analyzed the issue... :)

lets see if i can clarify

assuming there are two clusters - clusterA and clusterB

Each cluster is 32nodes and has 50TB of storage attached

the aggregate network bandwidth between the clusters is 800MB/sec

the problem is the per-node bandwidth on clusterB is 30MB/sec

so i use a single node to copy the 20TB of data from clusterB, yes
it's going to take me 7days to copy everything

I'd like to paralyze that across multiple nodes to drive the aggregate up

I was hoping someone would pop up say, hey use this magical piece of
software. (of which im unable to locate)..


On Fri, Mar 5, 2010 at 11:30 AM, kyron <kyron at neuralbs.com> wrote:
> On Fri, 05 Mar 2010 11:22:14 -0500, Mike Davis <jmdavis1 at vcu.edu> wrote:
>> Michael Di Domenico wrote:
>>> How does one copy large (20TB) amounts of data from one cluster to
>>> another?
>>>
>>> Assuming that each node in the cluster can only do about 30MB/sec
>>> between clusters and i want to preserve the uid/gid/timestamps, etc
>>>
>> If the clusters are co-lo I wouldn't copy I would use shared storage. If
>
>> they are not co-located I would use patience.
>>
>> Seriously though, for a one time copy, I would consider copying to an
>> external system and then physically moving that system. To do this and
>> preserve ownerships you will need to duplicate accounts and groups.
>
>
> ...and we are all assuming non-compressibility; otherwise, use pbzip2 ;)
>


From john.hearns at mclaren.com  Fri Mar  5 10:05:38 2010
From: john.hearns at mclaren.com (Hearns, John)
Date: Fri, 5 Mar 2010 18:05:38 -0000
Subject: [Beowulf] copying data between clusters
In-Reply-To: <e75d22a91003050932h2153287bvdef3e3df8accfac@mail.gmail.com>
References: <e75d22a91003050723n678ba3dcj8028ba427ebdb677@mail.gmail.com><4B912FB6.5010109@vcu.edu><dce74932fd32499ef42147e4b066dfc4@localhost>
	<e75d22a91003050932h2153287bvdef3e3df8accfac@mail.gmail.com>
Message-ID: <68A57CCFD4005646957BD2D18E60667B0F9E7F4C@milexchmb1.mil.tagmclarengroup.com>


> 
> I'd like to paralyze that across multiple nodes to drive the aggregate
> up
> 
> I was hoping someone would pop up say, hey use this magical piece of
> software. (of which im unable to locate)..
> 
My recommendation also would be to use an external storage device - a
USB drive would be useful, and I have been involved in a couple of
industrial projects where data has been brought to a cluster on an
external USB drive. It is as people say quite an efficient way to
transfer the data.

I gather that for high def digital cinema a RAID array is physically
shipped to the cinema - I guess that also helps with data security, as
you could do some sort of encryption on the drives, though I might be
wrong.
In the digital media world, there are some fast parallel SCP boxes which
are an industry standard - I gather they cost $$$$ but do make transfers
faster.
I forget the name, and if they don't really do parallel SCP forgive me -
its something along those lines.


Re. moving data to/from a cluster over a WAN link, I did look at this
recently.
You can set up a fuse filesystem running over SSH. This actually works
quite well from the point of view of ease of setting up and usability,
but I didn't try any serious data transfer over it - and of course it
cannot be faster than ssh anyway!

I did also have a look at the types of tools used by grids for bulk data
transfer, but not much more than looking.
Here's an interesting link I found:  http://fasterdata.es.net/tools.html


ps. you don't say how you are transferring the data - if via rsync you
have looked at the encryption options you are using?


John Hearns


The contents of this email are confidential and for the exclusive use of the intended recipient.  If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy.


From john.hearns at mclaren.com  Fri Mar  5 10:07:14 2010
From: john.hearns at mclaren.com (Hearns, John)
Date: Fri, 5 Mar 2010 18:07:14 -0000
Subject: [Beowulf] copying data between clusters
In-Reply-To: <e75d22a91003050932h2153287bvdef3e3df8accfac@mail.gmail.com>
References: <e75d22a91003050723n678ba3dcj8028ba427ebdb677@mail.gmail.com><4B912FB6.5010109@vcu.edu><dce74932fd32499ef42147e4b066dfc4@localhost>
	<e75d22a91003050932h2153287bvdef3e3df8accfac@mail.gmail.com>
Message-ID: <68A57CCFD4005646957BD2D18E60667B0F9E7F51@milexchmb1.mil.tagmclarengroup.com>

These might be the boxes used by post production/animation:
http://www.rocketstream.com/company/overview/default.aspx


The contents of this email are confidential and for the exclusive use of the intended recipient.  If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy.


From bill at cse.ucdavis.edu  Fri Mar  5 10:10:36 2010
From: bill at cse.ucdavis.edu (Bill Broadley)
Date: Fri, 05 Mar 2010 10:10:36 -0800
Subject: [Beowulf] copying data between clusters
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0F9E7F51@milexchmb1.mil.tagmclarengroup.com>
References: <e75d22a91003050723n678ba3dcj8028ba427ebdb677@mail.gmail.com><4B912FB6.5010109@vcu.edu><dce74932fd32499ef42147e4b066dfc4@localhost>	<e75d22a91003050932h2153287bvdef3e3df8accfac@mail.gmail.com>
	<68A57CCFD4005646957BD2D18E60667B0F9E7F51@milexchmb1.mil.tagmclarengroup.com>
Message-ID: <4B91491C.6040604@cse.ucdavis.edu>

Grid-ftp?  http://www.globus.org/toolkit/docs/3.2/gridftp/key/index.html


From jlforrest at berkeley.edu  Fri Mar  5 10:34:55 2010
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Fri, 05 Mar 2010 10:34:55 -0800
Subject: [Beowulf] copying data between clusters
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0F9E7F4C@milexchmb1.mil.tagmclarengroup.com>
References: <e75d22a91003050723n678ba3dcj8028ba427ebdb677@mail.gmail.com><4B912FB6.5010109@vcu.edu><dce74932fd32499ef42147e4b066dfc4@localhost>	<e75d22a91003050932h2153287bvdef3e3df8accfac@mail.gmail.com>
	<68A57CCFD4005646957BD2D18E60667B0F9E7F4C@milexchmb1.mil.tagmclarengroup.com>
Message-ID: <4B914ECF.60102@berkeley.edu>

On 3/5/2010 10:05 AM, Hearns, John wrote:
>

> My recommendation also would be to use an external storage device - a
> USB drive would be useful, and I have been involved in a couple of
> industrial projects where data has been brought to a cluster on an
> external USB drive. It is as people say quite an efficient way to
> transfer the data.

Yes, except the speed of even USB 2.0 would make
this an unpleasant experience. These days many external
drives support eSATA, which runs at regular SATA
speeds so you're not facing the USB bottleneck.

If your host doesn't have an eSATA connector
you can buy a PCI card for not much money.

Once USB 3 is ubiquitous this problem (e.g USB 2.0 vs eSATA)
will go away.

Cordially,
-- 
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest at berkeley.edu


From ljdursi at scinet.utoronto.ca  Fri Mar  5 11:16:12 2010
From: ljdursi at scinet.utoronto.ca (Jonathan Dursi)
Date: Fri, 05 Mar 2010 14:16:12 -0500
Subject: [Beowulf] copying data between clusters
In-Reply-To: <4B91491C.6040604@cse.ucdavis.edu>
References: <e75d22a91003050723n678ba3dcj8028ba427ebdb677@mail.gmail.com><4B912FB6.5010109@vcu.edu><dce74932fd32499ef42147e4b066dfc4@localhost>	<e75d22a91003050932h2153287bvdef3e3df8accfac@mail.gmail.com>	<68A57CCFD4005646957BD2D18E60667B0F9E7F51@milexchmb1.mil.tagmclarengroup.com>
	<4B91491C.6040604@cse.ucdavis.edu>
Message-ID: <4B91587C.2070004@scinet.utoronto.ca>

On 03/05/2010 01:10 PM, Bill Broadley wrote:
> Grid-ftp?  http://www.globus.org/toolkit/docs/3.2/gridftp/key/index.html

If you don't already have the globus framework set up on both ends, 
getting it installed just for gridftp is a huge amount of work; 
especially since the advantage of gridftp doesn't derive from its 
grid-nature at all, it's just multi-channel and the protocol has big 
windows.

We've had good luck with rsync over hpn-ssh
  http://www.psc.edu/networking/projects/hpn-ssh/

There's a java package out of CERN called FDT
  http://monalisa.cern.ch/FDT/

which looks promising but we've not had much luck getting it to be 
particularly fast; but maybe we're doing something wrong.

	- Jonathan
-- 
Jonathan Dursi     <ljdursi at scinet.utoronto.ca>


From richard.walsh at comcast.net  Fri Mar  5 11:16:56 2010
From: richard.walsh at comcast.net (richard.walsh at comcast.net)
Date: Fri, 5 Mar 2010 19:16:56 +0000 (UTC)
Subject: [Beowulf] Configuring PBS for a mixed CPU-GPU and QDR-DDR cluster
	...
Message-ID: <316095273.10450501267816616006.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>


All, 


I am augmenting a DDR switched SGI ICE system with 
one that largely network-separate (a few 4x DDR links 
connect them) and QDR switched. The QDR "half" also 
includes GPUs (one per socket). Has anyone configured 
PBS to manage these kinds of natural divisions as a single 
cluster. Some part of the QDR-GPU "half" will be dedicated 
to GPU work, but I would like the rest of that part of the system 
to run either category of work. 


rbw 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20100305/a5ba8cd3/attachment.html>

From mathog at caltech.edu  Fri Mar  5 14:14:25 2010
From: mathog at caltech.edu (David Mathog)
Date: Fri, 05 Mar 2010 14:14:25 -0800
Subject: [Beowulf] Re: copying data between clusters
Message-ID: <E1Nnfmr-0003FL-KK@mendel.bio.caltech.edu>

Michael Di Domenico wrote:

> lets see if i can clarify
> 
> assuming there are two clusters - clusterA and clusterB
> 
> Each cluster is 32nodes and has 50TB of storage attached

Attached how?  Is the 50TB sitting on one file server on each cluster,
or is it distributed across the cluster?  We need more details.

> 
> the aggregate network bandwidth between the clusters is 800MB/sec
> 
> the problem is the per-node bandwidth on clusterB is 30MB/sec

Is there a switch on each cluster so that each node can write directly
to the interconnect between clusters?   Specifically, can node A12 write
to node B12?  Sounds like there might be, and since you seem to care
about the per-node bandwidth on the target it sounds like you have a
situation where the data is distributed on A and will again be
distributed across nodes on B.  If that's what you mean, then you just
need to queue up a job on each node to do something like:

 (cd $DATADIRECTORY ; tar -cf - . ) \
   | ssh matching_target_node 'cd $DATADIRECTORY; tar -xf - )  

It will run in parallel using up all of your interconnect bandwidth.
If on the other hand, the only per node rate you care about is the one
fileserver on B, then it is a different problem.  On the other, other
hand, if you can temporarily store the data on each node of B, and the
cumulative bandwidth that way is 800MB/s you could conceivably transfer
it in parallel from A to all 32 destinations in B, and put the mess back
together in B later.  However, if you are still rate limited to 30Mb/sec
on a single B fileserver then the total time to complete this operation
will not change, only the time the data is in transit between the
clusters will be reduced.


Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From ntmoore at gmail.com  Sat Mar  6 08:27:01 2010
From: ntmoore at gmail.com (Nathan Moore)
Date: Sat, 6 Mar 2010 10:27:01 -0600
Subject: [Beowulf] best sophomore-level FPGA reference?
Message-ID: <6009416b1003060827s39e717dbve97b8de1c559492a@mail.gmail.com>

Hi All,

I regularly teach a sophomore/junior level course on digital circuits.
 I've recently started paying attention to PLD/FPGA hardware,
particularly Actel's bargain-basement igloo-nano development board,
http://www.actel.com/products/hardware/devkits_boards/igloonano_starter.aspx,
which is offered at a price that my students could conceivably buy for
the course.

I'd like to spend some time talking about FPGA's but have run into two
small problems - asking you-all seems like the easiest solution:
  - What's your favorite text that serves as a technical introduction
to FPGA's?  (I'm thinking of something parallel to Essick's
"Introduction to LabVIEW")  My main text for the course is Floyd's
"Digital Fundametals" which I think is mediocre overall.
  - What's your favorite class project involving FPGA programming?
The obvios targets for a project right now seem like they should
feature VHDL (obviously), and tend towards the architectural
"sweet-spot" of FPGA computation, which to me seems like massively
parallel computation of something.  I've been thinking about a cartoon
of the features in a digital camera, eg zoom in on an image (bitmap),
rotate an image, etc.  If you have something built that you're willing
to share I'd be most grateful.

I hope this isn't too off-topic. I didn't think of FPGA's in the same
terms as beowulfs until I started reading the amazon reviews for
http://www.amazon.com/gp/product/0471687839/ref=oss_product

Nathan Moore


From lindahl at pbm.com  Sat Mar  6 15:36:07 2010
From: lindahl at pbm.com (Greg Lindahl)
Date: Sat, 6 Mar 2010 15:36:07 -0800
Subject: [Beowulf] Q: IB message rate & large core counts (per node) ?
In-Reply-To: <4A1DA2D8-E75F-46C0-9CDA-64BD204A0CCA@gmail.com>
References: <265537950.7749891267205793764.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>
	<4A1DA2D8-E75F-46C0-9CDA-64BD204A0CCA@gmail.com>
Message-ID: <20100306233607.GA5410@bx9.net>

On Fri, Feb 26, 2010 at 01:20:49PM -0500, Lawrence Stewart wrote:

> Personally, I believe our thinking about interconnects has been
> poisoned by thinking that NICs are I/O devices.  We would be better
> off if they were coprocessors.  Threads should be able to send
> messages by writing to registers, and arriving packets should
> activate a hyperthread that has full core capabilities for acting on
> them, and with the ability to interact coherently with the memory
> hierarchy from the same end as other processors.

I'm up for dedicating 1+ normal processor cores to doing the special
stuff.  Nodes have a lot of cores these days, and all-2-sided programs
don't have to dedicate a core & thus would pay nothing. In the MPI
1-sided model, you'd probably want to run all the cores on separate
programs and have the dedicated core get access to the appropriate
process' address space.

-- greg


From fitz at cs.earlham.edu  Fri Mar  5 10:00:49 2010
From: fitz at cs.earlham.edu (Andrew Fitz Gibbon)
Date: Fri, 5 Mar 2010 12:00:49 -0600
Subject: [Beowulf] copying data between clusters
In-Reply-To: <e75d22a91003050932h2153287bvdef3e3df8accfac@mail.gmail.com>
References: <e75d22a91003050723n678ba3dcj8028ba427ebdb677@mail.gmail.com>
	<4B912FB6.5010109@vcu.edu>
	<dce74932fd32499ef42147e4b066dfc4@localhost>
	<e75d22a91003050932h2153287bvdef3e3df8accfac@mail.gmail.com>
Message-ID: <853EAB81-DB3D-42F2-8467-F6EC5F5A8B21@cs.earlham.edu>


On Mar 5, 2010, at 11:32 AM, Michael Di Domenico wrote:

> I was hoping someone would pop up say, hey use this magical piece of
> software. (of which im unable to locate)..

You might want to take a look at GridFTP from Globus (http:// 
globus.org). Among other things, it has support for parallel data  
streams and is specifically designed for transferring lots of data  
between clusters. It's distributed as part of the Toolkit, and it's  
not too hard to build /just/ GridFTP.

As with any recommended software, YMMV.

----------------
Andrew Fitz Gibbon
fitz at cs.earlham.edu


From scott at cse.ucdavis.edu  Fri Mar  5 10:21:01 2010
From: scott at cse.ucdavis.edu (Scott Beardsley)
Date: Fri, 05 Mar 2010 10:21:01 -0800
Subject: [Beowulf] copying data between clusters
In-Reply-To: <e75d22a91003050932h2153287bvdef3e3df8accfac@mail.gmail.com>
References: <e75d22a91003050723n678ba3dcj8028ba427ebdb677@mail.gmail.com>	<4B912FB6.5010109@vcu.edu>	<dce74932fd32499ef42147e4b066dfc4@localhost>
	<e75d22a91003050932h2153287bvdef3e3df8accfac@mail.gmail.com>
Message-ID: <4B914B8D.8090206@cse.ucdavis.edu>


> I'd like to paralyze that across multiple nodes to drive the aggregate up
> 
> I was hoping someone would pop up say, hey use this magical piece of
> software. (of which im unable to locate)..

Sounds like what we are doing with hadoop and gridftp-hdfs for our LHC
cluster. Basically you would run N gridftp servers where N is the number
of nodes (and preferably uncontended uplinks as well) on the destination
cluster. Then run something like bestman[1] (it'll act as a director).
It is a non-trivial stack of software but it should get the job done.

Scott
-----------
[1] https://sdm.lbl.gov/bestman/


From dgs at slac.stanford.edu  Fri Mar  5 14:54:48 2010
From: dgs at slac.stanford.edu (David Simas)
Date: Fri, 5 Mar 2010 14:54:48 -0800
Subject: [Beowulf] copying data between clusters
In-Reply-To: <e75d22a91003050932h2153287bvdef3e3df8accfac@mail.gmail.com>
References: <e75d22a91003050723n678ba3dcj8028ba427ebdb677@mail.gmail.com>
	<4B912FB6.5010109@vcu.edu>
	<dce74932fd32499ef42147e4b066dfc4@localhost>
	<e75d22a91003050932h2153287bvdef3e3df8accfac@mail.gmail.com>
Message-ID: <20100305225448.GB31413@horus.slac.stanford.edu>

On Fri, Mar 05, 2010 at 12:32:37PM -0500, Michael Di Domenico wrote:
> As i expect from the smartest sysadmins on the planet, everyone has
> over analyzed the issue... :)
> 
> lets see if i can clarify
> 
> assuming there are two clusters - clusterA and clusterB
> 
> Each cluster is 32nodes and has 50TB of storage attached
> 
> the aggregate network bandwidth between the clusters is 800MB/sec
> 
> the problem is the per-node bandwidth on clusterB is 30MB/sec
> 
> so i use a single node to copy the 20TB of data from clusterB, yes
> it's going to take me 7days to copy everything
> 
> I'd like to paralyze that across multiple nodes to drive the aggregate up
> 
> I was hoping someone would pop up say, hey use this magical piece of
> software. (of which im unable to locate)..

You might be able to use "dar" for this:

	http://dar.linux.free.fr/

Dar will let you slice up your 20 TB of data into even sized pieces
that you can transfer in parallel, than re-construct on the receiving
side.

David S.


> 
> 
> 
> On Fri, Mar 5, 2010 at 11:30 AM, kyron <kyron at neuralbs.com> wrote:
> > On Fri, 05 Mar 2010 11:22:14 -0500, Mike Davis <jmdavis1 at vcu.edu> wrote:
> >> Michael Di Domenico wrote:
> >>> How does one copy large (20TB) amounts of data from one cluster to
> >>> another?
> >>>
> >>> Assuming that each node in the cluster can only do about 30MB/sec
> >>> between clusters and i want to preserve the uid/gid/timestamps, etc
> >>>
> >> If the clusters are co-lo I wouldn't copy I would use shared storage. If
> >
> >> they are not co-located I would use patience.
> >>
> >> Seriously though, for a one time copy, I would consider copying to an
> >> external system and then physically moving that system. To do this and
> >> preserve ownerships you will need to duplicate accounts and groups.
> >
> >
> > ...and we are all assuming non-compressibility; otherwise, use pbzip2 ;)
> >
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


From markcf at rocketmail.com  Sun Mar  7 07:08:27 2010
From: markcf at rocketmail.com (MArk Fennema)
Date: Sun, 7 Mar 2010 07:08:27 -0800 (PST)
Subject: [Beowulf] Windows Master, Linux Slaves
Message-ID: <901979.89098.qm@web65303.mail.ac2.yahoo.com>

I'm not sure at all if this would be at all beowulfy, or even if it would be possible. That's why I'm asking you. What I want to set up is a cluster computer that can run standard windows applications. Random download games, Microsoft office, etc. So I was wondering, is it at all possible to run a windows master computer that's controlling Linux slaves, and if I did, would it improve the performance of usual applications (or make it possible to run more of them at the same time). I know this isn't the most useful or the cheapest way to make a computer like this, but it's kind of an experiment.


      __________________________________________________________________
Yahoo! Canada Toolbar: Search from anywhere on the web, and bookmark your favourite sites. Download it now
http://ca.toolbar.yahoo.com.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20100307/669a8006/attachment.html>

From hahn at mcmaster.ca  Sun Mar  7 12:05:52 2010
From: hahn at mcmaster.ca (Mark Hahn)
Date: Sun, 7 Mar 2010 15:05:52 -0500 (EST)
Subject: [Beowulf] Windows Master, Linux Slaves
In-Reply-To: <901979.89098.qm@web65303.mail.ac2.yahoo.com>
References: <901979.89098.qm@web65303.mail.ac2.yahoo.com>
Message-ID: <Pine.LNX.4.64.1003071453350.17117@coffee.psychology.mcmaster.ca>

> I'm not sure at all if this would be at all beowulfy, or even if it would
>be possible. That's why I'm asking you. What I want to set up is a cluster
>computer that can run standard windows applications. Random download games,
>Microsoft office, etc. So I was wondering, is it at all possible to run a
>windows master computer that's controlling Linux slaves, and if I did, would
>it improve the performance of usual applications (or make it possible to run
>more of them at the same time). I know this isn't the most useful or the
>cheapest way to make a computer like this, but it's kind of an experiment.

beowulf is mainly about leveraging (commodity hardware, open software),
but it's not explicitly _anti_ windows.  the problem here is that to gain
an advantage from parallelism (shared or distributed memory), an app 
almost certainly needs to be parallelism-aware, indeed, _designed_ for it.

if you wanted to run a bunch of non-interacting windows programs,
and wanted to distribute them across cluster nodes (with, for instance,
a shared filesystem), this could certainly be done.  personally, I'd 
start out with a linux cluster and run VMs so that the whole would be 
managable and secure.  each windows instance would see itself alone in
the VM, of course (separte desktops, registries, etc)  as far as I know, 
MSFT will want their pound of flesh for each instance, though.

in general, there are easy hardware fixes to speed up your programs.
SSDs, lots of ram, 4-socket, 48-core motherboards, etc.  even things like
software-based shared memory single-image systems like ScaleMP.
they cost money, but so does person-time.

the main thing about beowulf is that there's a LOT of programs already
written for MPI clusters, so the task is "just" tuning (for some combination
of performance, greenness, price, convenience, etc).


From eagles051387 at gmail.com  Sun Mar  7 21:37:32 2010
From: eagles051387 at gmail.com (Jonathan Aquilina)
Date: Mon, 8 Mar 2010 06:37:32 +0100
Subject: [Beowulf] Windows Master, Linux Slaves
In-Reply-To: <Pine.LNX.4.64.1003071453350.17117@coffee.psychology.mcmaster.ca>
References: <901979.89098.qm@web65303.mail.ac2.yahoo.com>
	<Pine.LNX.4.64.1003071453350.17117@coffee.psychology.mcmaster.ca>
Message-ID: <a31cd3861003072137v6a1dbb67s9a1fcc23c5531931@mail.gmail.com>

@Mark F i am not sure if it can be done with xp. i know windows have a HPC
version and server 2003 include clustering capabilities. I agree with mark H
that you should go with linux and run windows in a vm.


-- 
Jonathan Aquilina
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20100308/776446ae/attachment.html>

From michf at post.tau.ac.il  Mon Mar  8 07:14:00 2010
From: michf at post.tau.ac.il (Micha Feigin)
Date: Mon, 8 Mar 2010 17:14:00 +0200
Subject: [Beowulf] assigning cores to queues with torque
Message-ID: <20100308171400.5128b849@vivalunalitshi.luna.local>

I have a small local cluster in our lab that I'm trying to setup with minimum
hustle to support both cpu and gpu processing where only some of the nodes have
a gpu and those have only two gpu for four cores.

It is currently setup using torque from ubuntu (2.3.6) with the torque supplied
scheduler (set it up with maui initially but it was a bit of a pain for such a
small cluster so I switched)

This cluster is used by very few people in a very controlled environment so I
don't really need any protection from each other, the queues are just for
convenience to allow remote execution

The problem:

I want to allow gpu related jobs to run only on the gpu equiped nodes (i.e more jobs then GPUs will be queued), I want to run other jobs on all nodes with either
1. a priority to use the gpu equiped nodes last
2. or better, use only two out of four cores on the gpu equiped nodes

It doesn't seem though that I can map nodes or cores to queues with torque as far as I can tell
(i.e cpu queue uses 2 cores on gpu1, 2 cores on gpu2, all cores on everything else
      gpu queue uses 2 cores on gpu1, 2 cores on gpu2)

I can't seem to set user defined resources so that I can define gpu machines as having gpu resource and schedule according to that.

Is it possible to achieve any of these two with torque, or is there any other
simple enough queue manager that can do this (preferably with a debian package
in some way to simplify maintanance). I only manage this cluster since no one
else knows how to and it's supposed to take as little of my time as possible
I'm looking for the simplest solution to implement and not the most versatile
one.

Thanks


From Glen.Beane at jax.org  Mon Mar  8 07:39:08 2010
From: Glen.Beane at jax.org (Glen Beane)
Date: Mon, 8 Mar 2010 10:39:08 -0500
Subject: [Beowulf] assigning cores to queues with torque
In-Reply-To: <20100308171400.5128b849@vivalunalitshi.luna.local>
Message-ID: <C7BA844C.619C%glen.beane@jax.org>


On 3/8/10 10:14 AM, "Micha Feigin" <michf at post.tau.ac.il> wrote:

I have a small local cluster in our lab that I'm trying to setup with minimum
hustle to support both cpu and gpu processing where only some of the nodes have
a gpu and those have only two gpu for four cores.

It is currently setup using torque from ubuntu (2.3.6) with the torque supplied
scheduler (set it up with maui initially but it was a bit of a pain for such a
small cluster so I switched)

This cluster is used by very few people in a very controlled environment so I
don't really need any protection from each other, the queues are just for
convenience to allow remote execution

The problem:

I want to allow gpu related jobs to run only on the gpu equiped nodes (i.e more jobs then GPUs will be queued), I want to run other jobs on all nodes with either
1. a priority to use the gpu equiped nodes last
2. or better, use only two out of four cores on the gpu equiped nodes

It doesn't seem though that I can map nodes or cores to queues with torque as far as I can tell
(i.e cpu queue uses 2 cores on gpu1, 2 cores on gpu2, all cores on everything else
      gpu queue uses 2 cores on gpu1, 2 cores on gpu2)

I can't seem to set user defined resources so that I can define gpu machines as having gpu resource and schedule according to that.

Is it possible to achieve any of these two with torque, or is there any other
simple enough queue manager that can do this (preferably with a debian package
in some way to simplify maintanance). I only manage this cluster since no one
else knows how to and it's supposed to take as little of my time as possible
I'm looking for the simplest solution to implement and not the most versatile
one.


you can define a resource "gpu" in your TORQUE nodes file:

hostname np=4 gpu

and then users can request -l nodes=1:ppn=4:gpu to get assigned a node with a gpu,  but to do anything more advanced you'll need Maui or Moab.   You should try the maui users mailing list, or the torque users mailing list to see if anyone else has some ideas
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20100308/a51988d2/attachment.html>

From vallard at benincosa.com  Sun Mar  7 14:42:06 2010
From: vallard at benincosa.com (Vallard Benincosa)
Date: Sun, 7 Mar 2010 14:42:06 -0800
Subject: [Beowulf] Windows Master, Linux Slaves
In-Reply-To: <901979.89098.qm@web65303.mail.ac2.yahoo.com>
References: <901979.89098.qm@web65303.mail.ac2.yahoo.com>
Message-ID: <f7572b051003071442i655ff063n2cf4beea27613d69@mail.gmail.com>

We do it the opposite way quite frequently: Linux master node with Windows
slaves.  You could at the very least install the Linux servers from Windows
over the network.  Setup a Windows DHCP/HTTP server.  Then you could at the
very least 'kickstart' or 'autoyast' the nodes.  Not sure about the software
level of controlling the apps.

On Sun, Mar 7, 2010 at 7:08 AM, MArk Fennema <markcf at rocketmail.com> wrote:

> I'm not sure at all if this would be at all beowulfy, or even if it would
> be possible. That's why I'm asking you. What I want to set up is a cluster
> computer that can run standard windows applications. Random download games,
> Microsoft office, etc. So I was wondering, is it at all possible to run a
> windows master computer that's controlling Linux slaves, and if I did, would
> it improve the performance of usual applications (or make it possible to run
> more of them at the same time). I know this isn't the most useful or the
> cheapest way to make a computer like this, but it's kind of an experiment.
>
>    ------------------------------
>
> *Yahoo! Canada Toolbar :* Search from anywhere on the web and bookmark
> your favourite sites. Download it now! <http://ca.toolbar.yahoo.com/>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20100307/59f4e7e6/attachment.html>

From richard.walsh at comcast.net  Mon Mar  8 09:20:32 2010
From: richard.walsh at comcast.net (richard.walsh at comcast.net)
Date: Mon, 8 Mar 2010 17:20:32 +0000 (UTC)
Subject: [Beowulf] assigning cores to queues with torque
In-Reply-To: <20100308171400.5128b849@vivalunalitshi.luna.local>
Message-ID: <1394126184.11221941268068832543.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>


Micha Feigin wrote: 

>The problem: 
> 
>I want to allow gpu related jobs to run only on the gpu 
>equiped nodes (i.e more jobs then GPUs will be queued), 
>I want to run other jobs on all nodes with either: 
> 
> 1. a priority to use the gpu equiped nodes last 
> 2. or better, use only two out of four cores on the gpu equiped nodes 


In PBS Pro you would do the following (torque may have something 
similar): 


1. Create a custom resource called "ngpus" in the resourcedef 
file as in: 


ngpus type=long flag=nh 


2. This resource should then be explicitly set on each node that 
includes a GPU to the number it includes: 


set node compute-0-5 resources_available.ncpus = 8 
set node compute-0-5 resources_available.ngpus = 2 


Here I have set the number of cpus per node (8) explicitly to defeat 
hyper-threading and the actual number of gpus per node (2). On the 
other nodes you might have: 


set node compute-0-5 resources_available.ncpus = 8 
set node compute-0-5 resources_available.ngpus = 0 


Indicating that there are no gpus to allocate. 


3. You would then use the '-l select' option in your job file as follows: 


#PBS -l select=4:ncpus=2:ngpus=2 


This requests 4 PBS resource chunks. Each includes 2 cpus and 2 gpus. 
Because the resource request is "chunked" these 2 cpu x 2 gpu chunks would 
be placed together on one physical node. Because you marked some 
nodes as having 2 gpus in the nodes file and some to have 0 gpus, only those 
that have them will get allocated. As a consumable resource, as soon as 2 
were allocated the total available would drop to 0. In total you would have 
asked for 4 chunks distributed to 4 physical nodes (because only one of these 
chunks can fit on a single node). This also ensures a 1:1 mapping of cpus to 
gpus, although it does nothing about tying each cpu to a different socket. You 
would to do that in the script with numactl probably. 


There are other ways to approach by tying physical nodes to queues, which you 
might wish to do to set up a dedicate slice for GPU development. You may also 
be able to do this in PBS using the v-node abstraction. There might be some 
reason to have two production routing queues that map to slight different parts 
of the system. 


Not sure how this could be approximated in Torque, but perhaps this will give you 
some leads. 


rbw 
_______________________________________________ 
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing 
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20100308/c81628d5/attachment.html>

From mathog at caltech.edu  Mon Mar  8 12:28:09 2010
From: mathog at caltech.edu (David Mathog)
Date: Mon, 08 Mar 2010 12:28:09 -0800
Subject: [Beowulf] SATA L shaped cable terminology question
Message-ID: <E1NojYf-0004F9-0q@mendel.bio.caltech.edu>

SATA cables may be purchased with an L shaped connector.  I just looked
at 10 of them and they were all like this when looking into the open end
of the angled connector (ASCII art):

  +-------------------+
  |                   |
  |  --------------+  |
  |                |  |
  +------+    +-------+
         |    |
         |    |
         |    |
         |    |


There are some cables out there labeled as "left angle" cables.  SOME of
those have associated pictures like the one above, and others like the
one below:

  +-------------------+
  |  |                |
  |  +-------------+  |
  |                   |
  +------+    +-------+
         |    |
         |    |
         |    |
         |    |

The question is, are all cables labeled as "Left angle" supposed to look
like the lower illustration?  (My guess is that this is the case and the
illustrations which aren't like this are erroneously from the "right
angle" variant of the cable.)

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From mdidomenico4 at gmail.com  Mon Mar  8 14:44:27 2010
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Mon, 8 Mar 2010 17:44:27 -0500
Subject: [Beowulf] SATA L shaped cable terminology question
In-Reply-To: <E1NojYf-0004F9-0q@mendel.bio.caltech.edu>
References: <E1NojYf-0004F9-0q@mendel.bio.caltech.edu>
Message-ID: <e75d22a91003081444l4786ce17x512f5f594753a248@mail.gmail.com>

my guess would be there are two cables because the sata cables are not
overly flexible at the joint

having the connector tail in either the up direction or down direction
could save a loop of cable in the chassis

On Mon, Mar 8, 2010 at 3:28 PM, David Mathog <mathog at caltech.edu> wrote:
> SATA cables may be purchased with an L shaped connector. ?I just looked
> at 10 of them and they were all like this when looking into the open end
> of the angled connector (ASCII art):
>
> ?+-------------------+
> ?| ? ? ? ? ? ? ? ? ? |
> ?| ?--------------+ ?|
> ?| ? ? ? ? ? ? ? ?| ?|
> ?+------+ ? ?+-------+
> ? ? ? ? | ? ?|
> ? ? ? ? | ? ?|
> ? ? ? ? | ? ?|
> ? ? ? ? | ? ?|
>
>
> There are some cables out there labeled as "left angle" cables. ?SOME of
> those have associated pictures like the one above, and others like the
> one below:
>
> ?+-------------------+
> ?| ?| ? ? ? ? ? ? ? ?|
> ?| ?+-------------+ ?|
> ?| ? ? ? ? ? ? ? ? ? |
> ?+------+ ? ?+-------+
> ? ? ? ? | ? ?|
> ? ? ? ? | ? ?|
> ? ? ? ? | ? ?|
> ? ? ? ? | ? ?|
>
> The question is, are all cables labeled as "Left angle" supposed to look
> like the lower illustration? ?(My guess is that this is the case and the
> illustrations which aren't like this are erroneously from the "right
> angle" variant of the cable.)
>
> Thanks,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>


From gdjacobs at gmail.com  Tue Mar  9 13:15:43 2010
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Tue, 09 Mar 2010 15:15:43 -0600
Subject: [Beowulf] Arima motherboards with SATA2 drives
In-Reply-To: <E1Nk5IH-0007CR-5X@mendel.bio.caltech.edu>
References: <E1Nk5IH-0007CR-5X@mendel.bio.caltech.edu>
Message-ID: <4B96BA7F.9060509@gmail.com>

David Mathog wrote:
> Have any of you seen a patched BIOS for the Arima HDAM* motherboards
> that resolves the issue of the Sil 3114 SATA controller locking up when
> it sees a SATA II disk? (Even a disk jumpered to Sata I speeds.) 
> Silicon Image released a BIOS fix for this, but since all of these
> motherboards use a Phoenix BIOS, it is not like an AMI or Award BIOS,
> where there are published methods for swapping out the broken chunk of
> BIOS (5.0.49) for the one with the fix (5.4.0.3).  Sure, one could work
> around this on a single disk system, at least, with an IDE to SATA2
> converter, or a PCI(X) Sata(2) controller, but reflashing the BIOS would
> be easier.  Or it would be if Flextronics, who bought this product line
> from Arima, would issue another BIOS update :-(.
> 
> Thanks,
>   
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech

I'm assuming that the boffin Flextronics has handling legacy support for
Arima is not being very responsive?

Well, editing the BIOS image for the mainboard seams kind of dodgy. If
chassis space isn't a problem, I would think replacing the controller
would be a better solution. I'm also unsure if Coreboot is a viable
option, although it seems the HDAMA is supported. I'm not 100% sure if
the Sil controller is, though.

Interesting problem, though.

If you want to, find a copy of BNTBTC (Bog Number Two's BIOS Tools
Collection) and install Phoenix Bios Editor. Hopefully you have no
missing VBVM 6.0 files. You'll need to find the correct module for
replacement. I was just checking the ROM image myself, but I was using
an older BIOS editor and things were a little gnarly. We'll see with the
new version...

-- 
Geoffrey D. Jacobs


From mathog at caltech.edu  Tue Mar  9 13:41:00 2010
From: mathog at caltech.edu (David Mathog)
Date: Tue, 09 Mar 2010 13:41:00 -0800
Subject: [Beowulf] cpufreq, multiple cores, load
Message-ID: <E1Np7Ai-0004mC-8Z@mendel.bio.caltech.edu>

I am currently configuring a new (for us) dual Opteron 280 system. 
cpufreq works on this system, moving each pair of cores between 1000 and
2400MHz using the "ondemand" governor.  The interesting thing, at least
from my point of view, is how rapidly the power savings degrade as CPU
load increases.  Naively one might have thought it would go up in steps
corresponding to the power difference between 2400MHz full CPU load on a
core and 1000MHz idle on the same core.  Not so. Here is the data
measured with our Kill-A-Watt, view the table with a fixed width font:

Watts   CPU0 MHz      CPU1 MHz cpuburn#  governor
       core0 core1  core0 core1
115    1000  1000   1000  1000    0      ondemand
157    2400  2400   1000  1000    1      ondemand
199    2400  2400   2400  2400    2      ondemand
214    2400  2400   2400  2400    3      ondemand
228    2400  2400   2400  2400    4      ondemand
172    2400  2400   2400  2400    0      performance
186    2400  2400   2400  2400    1      performance
199    2400  2400   2400  2400    2      performance
214    2400  2400   2400  2400    3      performance
228    2400  2400   2400  2400    4      performance

Starting one cpuburn flips both of the cores on the first CPU
to the faster clock speed, resulting in a 42W (37%) increase in power
consumption.  Starting a second cpuburn apparently schedules it
on one of the cores on the unused second processor, rather than
on the equally unused, but already sped up, second core on the first
CPU.  This flips the remaining two cores also to 2400 MHz, negating any
further benefit from "ondemand" as more cpuburn processes are added.
That is, there are 5 states for increasing cpuburn load but, only the
lowest two have different power consumption for "ondemand" than for
"performance".

I have not repeated this experiment yet with the "conservative"
governor, but since cpuburn is intentionally such a cpu hog, I expect
the results would be about the same.

Anyway, this is just 4 cores total, but it makes me wonder what happens
with a system having say, two quad core processors, where if the same
sort of scheduling/cpufreq logic apply, two CPU saturating jobs (still
only 1/4 of total available CPU capacity) will effectively negate the
energy saving modes.  For instance, imagine some future 24 core behemoth
that acts the same way.  One might almost dispense with power saving
modes altogether if one CPU intensive job is going to kick the other 23
cores into a high power state.  Or do the newer CPUs, either AMD's or
Intel's, allow different frequencies on each core of a CPU?  

(Kernel 2.6.31.12, Arima HDAMAI motherboard, Mandriva 2010.0).

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From bill at cse.ucdavis.edu  Tue Mar  9 13:50:05 2010
From: bill at cse.ucdavis.edu (Bill Broadley)
Date: Tue, 09 Mar 2010 13:50:05 -0800
Subject: [Beowulf] cpufreq, multiple cores, load
In-Reply-To: <E1Np7Ai-0004mC-8Z@mendel.bio.caltech.edu>
References: <E1Np7Ai-0004mC-8Z@mendel.bio.caltech.edu>
Message-ID: <4B96C28D.1000900@cse.ucdavis.edu>

David Mathog wrote:
> Starting a second cpuburn apparently schedules it
> on one of the cores on the unused second processor, rather than
> on the equally unused, but already sped up, second core on the first
> CPU.

Since that gives the most additional performance that seems a reasonable
default.  So you add an additional cache and memory system.  Numactl or the
related system calls would let you schedule it on the first CPU if desired.

> This flips the remaining two cores also to 2400 MHz, negating any
> further benefit from "ondemand" as more cpuburn processes are added.
> That is, there are 5 states for increasing cpuburn load but, only the
> lowest two have different power consumption for "ondemand" than for
> "performance".

Unless you choose to do otherwise.  If you are willing to ignore the 2nd chips
cache and memory system you could keep power lower until over half the system
is busy.

> Anyway, this is just 4 cores total, but it makes me wonder what happens
> with a system having say, two quad core processors, where if the same
> sort of scheduling/cpufreq logic apply, two CPU saturating jobs (still
> only 1/4 of total available CPU capacity) will effectively negate the
> energy saving modes.  For instance, imagine some future 24 core behemoth
> that acts the same way.  One might almost dispense with power saving
> modes altogether if one CPU intensive job is going to kick the other 23
> cores into a high power state.  Or do the newer CPUs, either AMD's or
> Intel's, allow different frequencies on each core of a CPU?  

I seem to recall that one of the recent AMD tweaks was to allow additional
tweaks.  I forget if it was voltage, or clock speed that can now be controller
per core.  I believe the north bridge and memory bus also can enter a lower
power state when not in use.

Next time I have a nehalem dual socket in my office I'll test it.


From gdjacobs at gmail.com  Tue Mar  9 15:17:47 2010
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Tue, 09 Mar 2010 17:17:47 -0600
Subject: [Beowulf] Arima motherboards with SATA2 drives
In-Reply-To: <E1Np7ST-0004nJ-MJ@mendel.bio.caltech.edu>
References: <E1Np7ST-0004nJ-MJ@mendel.bio.caltech.edu>
Message-ID: <4B96D71B.9080504@gmail.com>

David Mathog wrote:
>> I'm assuming that the boffin Flextronics has handling legacy support for
>> Arima is not being very responsive?
> 
> If by "very" you mean "at all", then you would be accurate.
> 
>> Well, editing the BIOS image for the mainboard seams kind of dodgy.
> 
> That's what I ended up doing though, and it worked.

By any chance is the flash ROM socketed and did you have a spare board
for hot swapping? That sort of insurance make's me breathe easier when
doing weird firmware updates.

>> If
>> chassis space isn't a problem, I would think replacing the controller
>> would be a better solution.
> 
> That's an option for the one machine I moved to a different case, the
> others I'm thinking about getting are in strange little cases, and they
> will only have room for one disk, so just getting the one controller to
> see a SATA II disk will be good enough.

Yeah, so you were stuck.

>> If you want to, find a copy of BNTBTC (Bog Number Two's BIOS Tools
>> Collection) and install Phoenix Bios Editor. Hopefully you have no
>> missing VBVM 6.0 files. You'll need to find the correct module for
>> replacement. I was just checking the ROM image myself, but I was using
>> an older BIOS editor and things were a little gnarly. We'll see with the
>> new version...

That would be Borg...

> Found this thread:
> 
> http://forums.mydigitallife.info/threads/13358-How-to-Use-New-Phoenix-Bios-Mod-Tool-to-Modify-Phoenix-Dell-Insyde-EFI-Bios-Files?s=016f5c20a0849a623a806a9a440db2fb
>
> with a link to PBE 2.1.0.0.  Used that on the 1.11 BIOS (downloaded
> from Flextronic), stuffed in the SiI stuff (5403.bin), flashed the ROM,
> and it worked, seeing the SATA II disk.  Note, there were two
> complications.  1st, PBE installs owned by the installer, and there are
> access issues, solved those by changing ownership to Everbody:FULL at
> the top level of that directory tree.  Second, replacing a module.  I
> tried that using the menu interface, and it seemed to have done it, but
> the resulting BIOS still had the old SiI section.  So used PBE to
> unpack, from another window, copied 5403.bin to ....\TEMP\OPROM3.ROM
> (where it lived in this BIOS), changed and unchanged a string to enable
> the build command, then BUILD, then save BIOS.

It seems BIOS modding has become somewhat of a cottage industry as a way
of getting around WGA and/or being able to boot Windows 7.

Examining the module binaries, I see they did us the favor of providing
copyrights in the first line. Very considerate of them. Distinguishing
between the various addon modules is easy this way.

> Tried another tool called Phoenix_Tool_1.24 (from the thread cited
> above) and it could break out the pieces of the BIOS, then you could
> replace one, and run prepare and catenate.  Except that what comes out
> won't flash with phlash.exe. See my entry near the end of the forum
> cited above.  I used a little tool of mine called "intercept" to
> intercept all program calls, and it wasn't doing anything special with
> prepare.exe, catenate.exe, or the other programs.  The only thing left
> was that PBE itself must be appending some stuff after the catenate
> output, and that seems to be metadata for phlash16.  Probably one could
> use another flash program that didn't need these, but I wasn't going to
> try that.  In any case, these appended bytes were the same when PBE
> built an unmodified and a modified BIOS, so it would apparently be safe
> to just use "dd" and hack the difference in length off the original BIOS
> image and append it to the output from catenate.  The phlash command I
> used was:
>
>   phlash16 HDAMAI.11C /C /Mode=3 /CS /EXIT /PN /V
>
> Since the legal status of these versions of PBE is somewhat dubious, I
> didn't post these last few results in public anywhere.

I won't tell if you won't.

> Regards,
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech

Hey, glad it's working. I guess the war horses will soldier on some more.

-- 
Geoffrey D. Jacobs


From hahn at mcmaster.ca  Tue Mar  9 20:14:59 2010
From: hahn at mcmaster.ca (Mark Hahn)
Date: Tue, 9 Mar 2010 23:14:59 -0500 (EST)
Subject: [Beowulf] cpufreq, multiple cores, load
In-Reply-To: <E1Np7Ai-0004mC-8Z@mendel.bio.caltech.edu>
References: <E1Np7Ai-0004mC-8Z@mendel.bio.caltech.edu>
Message-ID: <Pine.LNX.4.64.1003092113390.8204@coffee.psychology.mcmaster.ca>

> 2400MHz using the "ondemand" governor.  The interesting thing, at least
> from my point of view, is how rapidly the power savings degrade as CPU
> load increases.

well, one of the big themes in recent chip development is factoring
out pieces that can be separately clocked.  it's also true that the 
platforms themselves have improved (lower voltage ddr3, power/flop, etc)


> Watts   CPU0 MHz      CPU1 MHz cpuburn#  governor
>       core0 core1  core0 core1
> 115    1000  1000   1000  1000    0      ondemand
> 157    2400  2400   1000  1000    1      ondemand
> 172    2400  2400   2400  2400    0      performance
> 186    2400  2400   2400  2400    1      performance
> 199    2400  2400   2400  2400    2      ondemand/performance
> 214    2400  2400   2400  2400    3      ondemand/performance
> 228    2400  2400   2400  2400    4      ondemand/performance

I rearranged your table a bit.
157-115=42W ramps up a socket from 1-2.4.
186-157=29W gets the second socket going (bit surprising).
172-115=57W is the clock-related savings when idle.
13/15/14W for adding core load once the sockets are spun up.

I'm guessing it's the latter that surprised you - per core power 
savings are minor, dominated by socket-related power.  but Intel
and AMD have been talking along those lines for a few years...

> consumption.  Starting a second cpuburn apparently schedules it
> on one of the cores on the unused second processor, rather than
> on the equally unused, but already sped up, second core on the first

well, that's a kernel/scheduler choice, probably pretty hackable.
or else just use numactl to direct the second cpuburn away from
the second socket.  in fact, using numactl to control memory allocation
would probably be a good idea anyway.

> Anyway, this is just 4 cores total, but it makes me wonder what happens
> with a system having say, two quad core processors, where if the same
> sort of scheduling/cpufreq logic apply, two CPU saturating jobs (still
> only 1/4 of total available CPU capacity) will effectively negate the
> energy saving modes.

I'm guessing that we can have pretty high-res power savings if we're 
willing to do some extra things like controlling which procs use 
which cores, which memory banks, perhaps a syscall interface to 
control clock modulation...

> cores into a high power state.  Or do the newer CPUs, either AMD's or
> Intel's, allow different frequencies on each core of a CPU?

I've seen that on AMD presentations, but iirc it was an istanbul thing.

-mark


From gmpc at sanger.ac.uk  Wed Mar 10 01:20:57 2010
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Wed, 10 Mar 2010 09:20:57 +0000
Subject: [Beowulf] cpufreq, multiple cores, load
In-Reply-To: <Pine.LNX.4.64.1003092113390.8204@coffee.psychology.mcmaster.ca>
References: <E1Np7Ai-0004mC-8Z@mendel.bio.caltech.edu>
	<Pine.LNX.4.64.1003092113390.8204@coffee.psychology.mcmaster.ca>
Message-ID: <4B976479.40802@sanger.ac.uk>


>> consumption.  Starting a second cpuburn apparently schedules it
>> on one of the cores on the unused second processor, rather than
>> on the equally unused, but already sped up, second core on the first
> 
> well, that's a kernel/scheduler choice, probably pretty hackable.
> or else just use numactl to direct the second cpuburn away from
> the second socket.  in fact, using numactl to control memory allocation
> would probably be a good idea anyway.
> 
You can change that schedular behaviour by twiddling sched_mc_power_savings

echo 1 > /sys/devices/system/cpu/sched_mc_power_savings


http://www.lesswatts.org/tips/cpu.php

"'sched_mc_power_savings' tunable under /sys/devices/system/cpu/
controls the Multi-core related tunable. By default, this is set to '0'
(for optimal performance). By setting this to '1', under light load
scenarios, the process load is distributed such that all the cores in a
processor package are busy before distributing the process load to other
processor packages."


Cheers,

Guy

-- 
Dr. Guy Coates,  Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 496802


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


From john.hearns at mclaren.com  Wed Mar 10 01:46:33 2010
From: john.hearns at mclaren.com (Hearns, John)
Date: Wed, 10 Mar 2010 09:46:33 -0000
Subject: [Beowulf] cpufreq, multiple cores, load
In-Reply-To: <4B976479.40802@sanger.ac.uk>
References: <E1Np7Ai-0004mC-8Z@mendel.bio.caltech.edu><Pine.LNX.4.64.1003092113390.8204@coffee.psychology.mcmaster.ca>
	<4B976479.40802@sanger.ac.uk>
Message-ID: <68A57CCFD4005646957BD2D18E60667B0FA56FD0@milexchmb1.mil.tagmclarengroup.com>

> >
> You can change that schedular behaviour by twiddling 
> sched_mc_power_savings
> 
> echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
> 
> 
> http://www.lesswatts.org/tips/cpu.php
> 

Sorry to set off at a slight tangent - on a related matter, does anyone
have a good writeup on the CPU scaling controls in Linux for Nehalem -
ie controlling TurboBoost.
Yes, I have Googled but as above someone here normally had a blindlingly
good resource.

The reason I ask is that we have some Nehalem boxes, which will
principally run a serial application.
Could be useful to see what effect turning on turboboost has.
I also saw there is a Gnome taskbar applet for showing CPU frequencies -
anyone got recommendations for a good monitor for that sort of thing?
Gkrellem springs to mind.

John Hearns
McLaren Racing


The contents of this email are confidential and for the exclusive use of the intended recipient.  If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy.


From cap at nsc.liu.se  Wed Mar 10 04:39:41 2010
From: cap at nsc.liu.se (Peter Kjellstrom)
Date: Wed, 10 Mar 2010 13:39:41 +0100
Subject: [Beowulf] cpufreq, multiple cores, load
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0FA56FD0@milexchmb1.mil.tagmclarengroup.com>
References: <E1Np7Ai-0004mC-8Z@mendel.bio.caltech.edu>
	<4B976479.40802@sanger.ac.uk>
	<68A57CCFD4005646957BD2D18E60667B0FA56FD0@milexchmb1.mil.tagmclarengroup.com>
Message-ID: <201003101339.45496.cap@nsc.liu.se>

On Wednesday 10 March 2010, Hearns, John wrote:
...
> Sorry to set off at a slight tangent - on a related matter, does anyone
> have a good writeup on the CPU scaling controls in Linux for Nehalem -
> ie controlling TurboBoost.
> Yes, I have Googled but as above someone here normally had a blindlingly
> good resource.

Turbo mode shows up as its own frequency step (one Mhz over normal max):

 cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
 2268000 2267000 2133000 2000000 1867000 1733000 1600000

2267(Mhz) is the highest "normal" setting on this E5520. 2268(Mhz) is the 
turbo mode setting.

Linux will treat 2268 as a normal available frequency (and since it's the 
highest this is what will get chosen under load).

If you wan't to run without turbo mode just set the frequency statically to 
2267 (of course this will be true even at idle then).

When the system is running at 2268 it's using turbo mode, the actual frequency 
will be somewhere between 2267 and MAX_TURBO. Where it will end up between 
these two depends on power/thermal head room and what MAX_TURBO is depends on 
the specific cpu model.

For E55* MAX_TURBO is "one step up", that is a 2.267 E5520 can go to 2.4 
(normal max of E5530). For X55* it's two steps up (a 2.667 can go to 2.93 
etc.).

> The reason I ask is that we have some Nehalem boxes, which will
> principally run a serial application.
> Could be useful to see what effect turning on turboboost has.
> I also saw there is a Gnome taskbar applet for showing CPU frequencies -
> anyone got recommendations for a good monitor for that sort of thing?
> Gkrellem springs to mind.

Note that when running turbo mode linux will think it's running a normal 
frequency mode one Mhz up (2268 Mhz for the E5520) not at the true frequency.

/Peter

> John Hearns
> McLaren Racing
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part.
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20100310/87ba25d2/attachment.sig>

From mathog at caltech.edu  Wed Mar 10 13:10:09 2010
From: mathog at caltech.edu (David Mathog)
Date: Wed, 10 Mar 2010 13:10:09 -0800
Subject: [Beowulf] cpufreq, multiple cores, load
Message-ID: <E1NpTAP-0005GF-TM@mendel.bio.caltech.edu>

Guy Coates wrote:
> You can change that schedular behaviour by twiddling
sched_mc_power_savings
> 
> echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
> 
> 
> http://www.lesswatts.org/tips/cpu.php

Good link.  

Setting this option made a slight difference when 2 cpuburn processes
were running, and not for any other number.  Instead of 199W power
consumption was reduced to 171W, and two cores showed 2400 MHz and two
1000 MHz, instead of all 4 at 2400 MHz.  Hopefully AMD has addressed
this in later processors, because the 40W hit induced by rev'ing up all
cores on a processor is much bigger than the ~16W hit associated with
actually running the program.

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From tom.ammon at utah.edu  Wed Mar 10 14:14:26 2010
From: tom.ammon at utah.edu (Tom Ammon)
Date: Wed, 10 Mar 2010 15:14:26 -0700
Subject: [Beowulf] ARP timers on RHEL4 vs. RHEL5
Message-ID: <4B9819C2.10100@utah.edu>

Hi,

I've been trying to figure out how to adjust the ARP timeout on kernel
2.6.9 and I found the following in /proc/sys/net/ipv4/neigh/ib0 (its an
IB interface I am interested in changing) with the following values.
This is on kernel 2.6.9-89ELsmp (RHEL4) :

[root at up255 ib0]# cat anycast_delay
99
[root at up255 ib0]# cat app_solicit
0
[root at up255 ib0]# cat base_reachable_time
30
[root at up255 ib0]# cat delay_first_probe_time
5
[root at up255 ib0]# cat gc_stale_time
60
[root at up255 ib0]# cat locktime
99
[root at up255 ib0]# cat mcast_solicit
3
[root at up255 ib0]# cat proxy_delay
79
[root at up255 ib0]# cat proxy_qlen
64
[root at up255 ib0]# cat retrans_time
99
[root at up255 ib0]# cat ucast_solicit
3
[root at up255 ib0]# cat unres_qlen
3

When I test this, along with per-flow ECMP (using the iproute2 utils), I
see that the ARP cache is timing out about every 10 minutes (I observe
this by load balancing an iperf flow between two different gateway
machines and then graphing the interface traffic)

On a newer kernel, 2.6.18-164.11.1.el5 (RHEL5), I see mostly the same
parms available, but a few new ones have been added. However, all of the
parms that are the same name between the two kernels are the same values:

[root at gateway2 ib0]# cat anycast_delay
99
[root at gateway2 ib0]# cat app_solicit
0
[root at gateway2 ib0]# cat base_reachable_time
30
[root at gateway2 ib0]# cat base_reachable_time_ms
30000
[root at gateway2 ib0]# cat delay_first_probe_time
5
[root at gateway2 ib0]# cat gc_stale_time
60
[root at gateway2 ib0]# cat locktime
99
[root at gateway2 ib0]# cat mcast_solicit
3
[root at gateway2 ib0]# cat proxy_delay
79
[root at gateway2 ib0]# cat proxy_qlen
64
[root at gateway2 ib0]# cat retrans_time
99
[root at gateway2 ib0]# cat retrans_time_ms
1000
[root at gateway2 ib0]# cat ucast_solicit
3
[root at gateway2 ib0]# cat unres_qlen
3

Yet when I observe the same traffic flow with this machine, the ARP
cache times out about once per minute.

Is there another set of parameters somewhere that govern how often the
kernel times out the ARP cache? If so, where might I find that? Is there
any kernel documentation that talks about changing ARP timers on the
linux kernel?

Tom Ammon

-- 
--------------------------------------------------------------------
Tom Ammon
Network Engineer
Office: 801.587.0976
Mobile: 801.674.9273

Center for High Performance Computing
University of Utah
http://www.chpc.utah.edu


From gdjacobs at gmail.com  Thu Mar 11 10:13:49 2010
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Thu, 11 Mar 2010 12:13:49 -0600
Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how	large
	of an installation have people used NFS, with?
In-Reply-To: <4AA8F107.7010805@avalon.umaryland.edu>
References: <200909091900.n89J07U5031683@bluewest.scyld.com>	<4AA800B2.7060607@avalon.umaryland.edu>	<20090910073957.GA8487@gretchen.aei.mpg.de>
	<4AA8F107.7010805@avalon.umaryland.edu>
Message-ID: <4B9932DD.4050502@gmail.com>

psc wrote:
> Thank you all for the answers.  Would you guys please share with me some
> good brands of those
> 200+  1GB Ethernet switches? I think I'll leave our current clusters
> alone , but the new cluster I
> will design for about 500 to 1000 nodes --- I don't think that we will
> go much above since for big jobs
> our scientists using outside resources. We do all our calculations and
> analysis on the nodes and only the final produce
> we sent to the frontend , also we don't run jobs across the nodes , so I
> don't need to get too much creative with the network
> beside being sure that I can expand the cluster without having the
> switches as a limitation (our current situation)
> 
> thank you again!

In alphabetical order...
Alcatel, Cisco, Extreme, Force10, Foundry, HP, Juniper, Nortel (although
they're not terribly stable, financially). All these companies make the
requisite hardware in a variety of configurations, from small switches,
to big modular chassis. I have only personally used Procurves.

> Henning Fehrmann wrote:
>> Hi
>>
>> On Wed, Sep 09, 2009 at 03:23:30PM -0400, psc wrote:
>>   
>>> I wonder what would be the sensible biggest cluster possible based on
>>> 1GB Ethernet network .
>>>     
>> Hmmm, may I cheat and use a 10Gb core switch?
>>
>> If you setup a cluster with few thousand nodes you have to ask yourself
>> whether this network should be non-blocking or not.
>>
>> For a non blocking network you need the right core-switch technology.
>> Unfortunately, there are not many vendors out there which provide
>> non-blocking Ethernet based core switches but I am aware of at least
>> two. One provides or will provide 144 10Gb Ethernet ports. Another one
>> sells switches with more than 1000 1 GB ports.
>> You could buy edge-switches with 4 10Gb uplinks and 48 1GB ports. If
>> you just use 40 of them you end up with a 1440 non-blocking 1Gb ports.
>>
>> It might be also possible to cross connect two of these core-switches
>> with the help of some smaller switches so that one ends up with 288
>> 10Gb ports and, in principle, one might connect 2880 nodes in a 
>> non-blocking way, but we did not have the possibility to test it
>> successfully yet. One of problems is that the internal hash table can
>> not store that many mac addresses. Anyway, one probably needs to change
>> the mac addresses of the nodes to avoid an overflow of the hash tables.
>> An overflow might cause arp storms.
>>
>> Once this works one runs into some smaller problems. One of them is the arp
>> cache of the nodes. It should be adjusted to hold as many mac addresses
>> as you have nodes in the cluster.
>>
>>
>>   
>>> And especially how would you connect those 1GB
>>> switches together -- now we have (on one of our four clusters) Two 48
>>> ports gigabit switches connected together with 6 patch cables and I just
>>> ran out of ports for expansion and wonder where to go from here as we
>>> already have four clusters and it would be great to stop adding cluster
>>> and start expending them beyond number of outlets on the switch/s ....
>>> NFS and 1GB Ethernet works great for us and we want to stick with it ,
>>> but we would love to find a way how to overcome the current "switch
>>> limitation".   
>>>     
>> With NFS you can nicely test the setup. Use one NFS server and let all
>> nodes write different files into it and look what happens.
>>
>> Cheers,
>> Henning
>>   
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


-- 
Geoffrey D. Jacobs


From gdjacobs at gmail.com  Thu Mar 11 10:16:29 2010
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Thu, 11 Mar 2010 12:16:29 -0600
Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how	large
	of an installation have people used NFS, with?
In-Reply-To: <4AA8F107.7010805@avalon.umaryland.edu>
References: <200909091900.n89J07U5031683@bluewest.scyld.com>	<4AA800B2.7060607@avalon.umaryland.edu>	<20090910073957.GA8487@gretchen.aei.mpg.de>
	<4AA8F107.7010805@avalon.umaryland.edu>
Message-ID: <4B99337D.8070203@gmail.com>

psc wrote:
> Thank you all for the answers.  Would you guys please share with me some
> good brands of those
> 200+  1GB Ethernet switches? I think I'll leave our current clusters
> alone , but the new cluster I
> will design for about 500 to 1000 nodes --- I don't think that we will
> go much above since for big jobs
> our scientists using outside resources. We do all our calculations and
> analysis on the nodes and only the final produce
> we sent to the frontend , also we don't run jobs across the nodes , so I
> don't need to get too much creative with the network
> beside being sure that I can expand the cluster without having the
> switches as a limitation (our current situation)
> 
> thank you again!
> 
> 
> Henning Fehrmann wrote:
>> Hi
>>
>> On Wed, Sep 09, 2009 at 03:23:30PM -0400, psc wrote:
>>   
>>> I wonder what would be the sensible biggest cluster possible based on
>>> 1GB Ethernet network .
>>>     
>> Hmmm, may I cheat and use a 10Gb core switch?
>>
>> If you setup a cluster with few thousand nodes you have to ask yourself
>> whether this network should be non-blocking or not.
>>
>> For a non blocking network you need the right core-switch technology.
>> Unfortunately, there are not many vendors out there which provide
>> non-blocking Ethernet based core switches but I am aware of at least
>> two. One provides or will provide 144 10Gb Ethernet ports. Another one
>> sells switches with more than 1000 1 GB ports.
>> You could buy edge-switches with 4 10Gb uplinks and 48 1GB ports. If
>> you just use 40 of them you end up with a 1440 non-blocking 1Gb ports.
>>
>> It might be also possible to cross connect two of these core-switches
>> with the help of some smaller switches so that one ends up with 288
>> 10Gb ports and, in principle, one might connect 2880 nodes in a 
>> non-blocking way, but we did not have the possibility to test it
>> successfully yet. One of problems is that the internal hash table can
>> not store that many mac addresses. Anyway, one probably needs to change
>> the mac addresses of the nodes to avoid an overflow of the hash tables.
>> An overflow might cause arp storms.
>>
>> Once this works one runs into some smaller problems. One of them is the arp
>> cache of the nodes. It should be adjusted to hold as many mac addresses
>> as you have nodes in the cluster.
>>
>>
>>   
>>> And especially how would you connect those 1GB
>>> switches together -- now we have (on one of our four clusters) Two 48
>>> ports gigabit switches connected together with 6 patch cables and I just
>>> ran out of ports for expansion and wonder where to go from here as we
>>> already have four clusters and it would be great to stop adding cluster
>>> and start expending them beyond number of outlets on the switch/s ....
>>> NFS and 1GB Ethernet works great for us and we want to stick with it ,
>>> but we would love to find a way how to overcome the current "switch
>>> limitation".   
>>>     
>> With NFS you can nicely test the setup. Use one NFS server and let all
>> nodes write different files into it and look what happens.
>>
>> Cheers,
>> Henning
>>   
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

It looks like Allied Telesis makes chassis switches now too.

-- 
Geoffrey D. Jacobs


From carsten.aulbert at aei.mpg.de  Thu Mar 11 22:41:18 2010
From: carsten.aulbert at aei.mpg.de (Carsten Aulbert)
Date: Fri, 12 Mar 2010 07:41:18 +0100
Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: 
	=?utf-8?q?how=09large_of_an_installation_have_people_used?= NFS, with?
In-Reply-To: <4B99337D.8070203@gmail.com>
References: <200909091900.n89J07U5031683@bluewest.scyld.com>
	<4AA8F107.7010805@avalon.umaryland.edu>
	<4B99337D.8070203@gmail.com>
Message-ID: <201003120741.20840.carsten.aulbert@aei.mpg.de>

On Thursday 11 March 2010 19:16:29 Geoff Jacobs wrote:

> 
> It looks like Allied Telesis makes chassis switches now too.
> 

as well as Fortinet (I don't think Henning named them), they took over the 
WovenSystems stuff after the latter went under. 

Cheers

CArsten
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 1871 bytes
Desc: not available
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20100312/01fbdc92/attachment.bin>

From hahn at mcmaster.ca  Fri Mar 12 07:21:38 2010
From: hahn at mcmaster.ca (Mark Hahn)
Date: Fri, 12 Mar 2010 10:21:38 -0500 (EST)
Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: 
	=?utf-8?q?how=09large_of_an_installation_have_people_used?= NFS, with?
In-Reply-To: <201003120741.20840.carsten.aulbert@aei.mpg.de>
References: <200909091900.n89J07U5031683@bluewest.scyld.com>
	<4AA8F107.7010805@avalon.umaryland.edu> <4B99337D.8070203@gmail.com>
	<201003120741.20840.carsten.aulbert@aei.mpg.de>
Message-ID: <Pine.LNX.4.64.1003121014031.22923@coffee.psychology.mcmaster.ca>

>> It looks like Allied Telesis makes chassis switches now too.
>
> as well as Fortinet (I don't think Henning named them), they took over the
> WovenSystems stuff after the latter went under.

interesting - I wondered what had happened to Woven.  but Woven 
reminds me of Gnodal, which also aims to produce a smarter fabric
that can scalably switch ethernet.  the latter seems promising to me,
since they're leveraging lessons learned from Quadrics.


From michf at post.tau.ac.il  Wed Mar 10 13:20:56 2010
From: michf at post.tau.ac.il (Micha Feigin)
Date: Wed, 10 Mar 2010 23:20:56 +0200
Subject: [Beowulf] assigning cores to queues with torque
In-Reply-To: <C7BA844C.619C%glen.beane@jax.org>
References: <20100308171400.5128b849@vivalunalitshi.luna.local>
	<C7BA844C.619C%glen.beane@jax.org>
Message-ID: <20100310232056.65e553b9@vivalunalitshi.luna.local>

On Mon, 8 Mar 2010 10:39:08 -0500
Glen Beane <Glen.Beane at jax.org> wrote:

> 
> 
> 
> On 3/8/10 10:14 AM, "Micha Feigin" <michf at post.tau.ac.il> wrote:
> 
> I have a small local cluster in our lab that I'm trying to setup with minimum
> hustle to support both cpu and gpu processing where only some of the nodes have
> a gpu and those have only two gpu for four cores.
> 
> It is currently setup using torque from ubuntu (2.3.6) with the torque supplied
> scheduler (set it up with maui initially but it was a bit of a pain for such a
> small cluster so I switched)
> 
> This cluster is used by very few people in a very controlled environment so I
> don't really need any protection from each other, the queues are just for
> convenience to allow remote execution
> 
> The problem:
> 
> I want to allow gpu related jobs to run only on the gpu equiped nodes (i.e more jobs then GPUs will be queued), I want to run other jobs on all nodes with either
> 1. a priority to use the gpu equiped nodes last
> 2. or better, use only two out of four cores on the gpu equiped nodes
> 
> It doesn't seem though that I can map nodes or cores to queues with torque as far as I can tell
> (i.e cpu queue uses 2 cores on gpu1, 2 cores on gpu2, all cores on everything else
>       gpu queue uses 2 cores on gpu1, 2 cores on gpu2)
> 
> I can't seem to set user defined resources so that I can define gpu machines as having gpu resource and schedule according to that.
> 
> Is it possible to achieve any of these two with torque, or is there any other
> simple enough queue manager that can do this (preferably with a debian package
> in some way to simplify maintanance). I only manage this cluster since no one
> else knows how to and it's supposed to take as little of my time as possible
> I'm looking for the simplest solution to implement and not the most versatile
> one.
> 
> 
> you can define a resource "gpu" in your TORQUE nodes file:
> 
> hostname np=4 gpu
> 
> and then users can request -l nodes=1:ppn=4:gpu to get assigned a node with a gpu,  but to do anything more advanced you'll need Maui or Moab.   You should try the maui users mailing list, or the torque users mailing list to see if anyone else has some ideas

Thanks, almost perfect. It would have been a complete solution if there was a
way to define how many such resources there are as there are 4 cores and 2 GPUs
per node. Its good enough for now though as it works perfect when asking for
nodes=1:ppn=2 to make sure that I don't get too many GPU jobs. This is a
cluster that is used by 3 people that are cooperating at the moment so I can
waste the extra core for now to spare man hours for the setup of maui.


From chenyon1 at iit.edu  Wed Mar 10 20:27:28 2010
From: chenyon1 at iit.edu (Yong Chen)
Date: Wed, 10 Mar 2010 22:27:28 -0600
Subject: [Beowulf] [hpc-announce] P2S2-2010 submission still open: deadline
	extended to 3/17/2010
Message-ID: <fb1ac394cb56.4b981cd0@iit.edu>

[Apologies if you got multiple copies of this email. If you'd like to
opt out of these announcements, information on how to unsubscribe is
available at the bottom of this email.]

Dear Colleague:

We would like to inform you that the paper submission deadline of the Third International Workshop 
on Parallel Programming Models and Systems Software for High-end Computing (P2S2)
has been extended to March 17th, 2010.

A full CFP can be found below. 

Thank you.


CALL FOR PAPERS
===============

Third International Workshop on Parallel Programming Models 
and Systems Software for High-end Computing (P2S2)
Sept. 13th, 2010

To be held in conjunction with ICPP-2010: The 39th International 
Conference on Parallel Processing, Sept. 13-16, 2010, San Diego, CA, USA 

Website: http://www.mcs.anl.gov/events/workshops/p2s2

SCOPE
-----
The goal of this workshop is to bring together researchers and
practitioners in parallel programming models and systems software for
high-end computing systems. Please join us in a discussion of new ideas,
experiences, and the latest trends in these areas at the workshop.


TOPICS OF INTEREST
------------------
The focus areas for this workshop include, but are not limited to:

    *  Systems software for high-end scientific and enterprise computing architectures
          o Communication sub-subsystems for high-end computing
          o High-performance file and storage systems
          o Fault-tolerance techniques and implementations
          o Efficient and high-performance virtualization and other management 
            mechanisms for high-end computing

    * Programming models and their high-performance implementations
          o MPI, Sockets, OpenMP, Global Arrays, X10, UPC, Chapel, Fortress and others
          o Hybrid Programming Models

    * Tools for Management, Maintenance, Coordination and Synchronization
          o Software for Enterprise Data-centers using Modern Architectures
          o Job scheduling libraries
          o Management libraries for large-scale system
          o Toolkits for process and task coordination on modern platforms

    * Performance evaluation, analysis and modeling of emerging computing platforms


PROCEEDINGS
-----------
Proceedings of this workshop will be published in CD format and will be available 
at the conference (together with the ICPP conference proceedings) .


SUBMISSION INSTRUCTIONS
-----------------------
Submissions should be in PDF format in U.S. Letter size paper. They
should not exceed 8 pages (all inclusive). Submissions will be judged
based on relevance, significance, originality, correctness and clarity.
Please visit workshop website at: http://www.mcs.anl.gov/events/workshops/p2s2/
for the submission link.


JOURNAL SPECIAL ISSUE
---------------------
The best papers of P2S2'10 will be included in a special issue of the International 
Journal of High Performance Computing Applications (IJHPCA) on Programming Models, 
Software and Tools for High-End Computing. 


IMPORTANT DATES
---------------
Paper Submission: March 17th, 2010
Author Notification: May 3rd, 2010 
Camera Ready: June 14th, 2010


PROGRAM CHAIRS
--------------
  * Pavan Balaji, Argonne National Laboratory
  * Abhinav Vishnu, Pacific Northwest National Laboratory


PUBLICITY CHAIR
---------------
  * Yong Chen, Illinois Institute of Technology


STEERING COMMITTEE
------------------
  * William D. Gropp, University of Illinois Urbana-Champaign
  * Dhabaleswar K. Panda, Ohio State University
  * Vijay Saraswat, IBM Research


PROGRAM COMMITTEE
-----------------
  * Ahmad Afsahi, Queen's University
  * George Almasi, IBM Research 
  * Taisuke Boku, Tsukuba University
  * Ron Brightwell, Sandia National Laboratory
  * Franck Cappello, INRIA, France
  * Yong Chen, Illinois Institute of Technology
  * Ada Gavrilovska, Georgia Tech
  * Torsten Hoefler, Indiana University
  * Zhiyi Huang, University of Otago, New Zealand
  * Hyun-Wook Jin, Konkuk University, Korea
  * Zhiling Lan, Illinois Institute of Technology
  * Doug Lea, State University of New York at Oswego
  * Jiuxing Liu, IBM Research
  * Heshan Lin, Virginia Tech
  * Guillaume Mercier, INRIA, France
  * Scott Pakin, Los Alamos National Laboratory
  * Fabrizio Petrini, IBM Research
  * Bronis de Supinksi, Lawrence Livermore National Laboratory
  * Sayantan Sur, Ohio State University
  * Rajeev Thakur, Argonne National Laboratory
  * Vinod Tipparaju, Oak Ridge National Laboratory
  * Jesper Traff, NEC, Europe
  * Weikuan Yu, Auburn University


If you have any questions, please contact us at p2s2-chairs at mcs.anl.gov

========================================================================
You can unsubscribe from the hpc-announce mailing list here:
https://lists.mcs.anl.gov/mailman/listinfo/hpc-announce
========================================================================


From akshar.bhosale at gmail.com  Fri Mar 12 10:08:56 2010
From: akshar.bhosale at gmail.com (akshar bhosale)
Date: Fri, 12 Mar 2010 23:38:56 +0530
Subject: [Beowulf] error while using mpirun
Message-ID: <bf0758a31003121008s4ea3454waff26ccb281ae1af@mail.gmail.com>

i have installed mpich 1.2 6 on my desktop (core 2 duo)

my test file is :

#include<stdio.h>
#include<mpi.h>

int main(int argc,char *argv[])
{
        int rank=0;

        MPI_Init(&argc,&argv);

        MPI_Comm_rank(MPI_COMM_WORLD,&rank);
        printf("my second program rank is %d \n",rank);

        MPI_Finalize();
        return;
}

---------
when i do
/usr/local/mpich-1.2.6/bin/mpicc -o test test.c ,i get test ;but when i do
/usr/local/mpich-1.2.6/bin/mpirun -np 4 test,i get

p0_31341:  p4_error: Path to program is invalid while starting
/home/npsf/last with rsh on dragon: -1
    p4_error: latest msg from perror: No such file or directory
error.
please suggest the solution.


From gdjacobs at gmail.com  Fri Mar 12 17:04:26 2010
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Fri, 12 Mar 2010 19:04:26 -0600
Subject: [Beowulf] Cluster of Linux and Windows
In-Reply-To: <Pine.LNX.4.64.0911131222400.23373@coffee.psychology.mcmaster.ca>
References: <4788ffe70911121401mcd5deaj52b9036197b4f619@mail.gmail.com>
	<Pine.LNX.4.64.0911131222400.23373@coffee.psychology.mcmaster.ca>
Message-ID: <4B9AE49A.1040004@gmail.com>

Mark Hahn wrote:
>> I am used to work with Arch Linux. What do you think about it?
> 
> the distro is basically irrelevant.  clustering is just a matter of your
> apps, middleware like mpi (may or may not be provided by the cluster),
> probably a shared filesystem, working kernel, network stack,
> job-launch mechanism.  distros are mainly about desktop gunk that is
> completely irrelevant to clusters.

Let's not start the great distro debate again! Suffice it to say that
most distros will work as long as they satisfy your ISV and hardware
requirements.

>> And finnaly, I would like to know if Is it possible to get a Cluster
>> Working
>> with a Server on Arch Linux and the nodes Windows.

File server? Sure, Samba is a breeze, not sure if the performance
scaling is there. NFS works, I guess. Windows comes with a client for
it. Authentication? Again, no problem.

I don't mean to be a snob, but this is basic stuff. I'm guessing you're
requirements are more esoteric so please try to be more specific about
them.

> sure, but why?  windows is generally inferior as an OS platform,
> so I would stay away unless you actually require your apps to run
> under windows.  (remember that linux can use windows storage and
> authentication just fine.)

If you look back in the archives of this ML, you'll find a thread from
when Microsoft released their compute cluster product. It covers quite
effectively the positives and negatives of Windows in HPC. Start with
this posting by Jon Forrest.
http://www.beowulf.org/archive/2008-April/021006.html

Do you have an application which can only be run on Windows?

>> Or even better the nodes without a defined SO.
> 
> SO=Significant Other?  oh, maybe "OS".  generally, you want to minimize
> the number of things that can go wrong in your system.  using uniform OS
> on nodes/servers is a good start.  but sure, there's no reason you can't
> run a cluster where every node is a different OS.  they simply need to
> agree on the network protocol (which doesn't have to be MPI - in fact,
> using something more SOA-like might help if the nodes are heterogenous)

Having as much homogeneity in your cluster as possible will help you
administer it increasingly as a single resource. As Mark said, it's
easier to debug a single set of problems. It's also easier in terms of
infrastructure to maintain a minimal set of images for your clients.

Perhaps you meant something different?

-- 
Geoffrey D. Jacobs


From gdjacobs at gmail.com  Fri Mar 12 17:59:59 2010
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Fri, 12 Mar 2010 19:59:59 -0600
Subject: [Beowulf] Cluster of Linux and Windows
In-Reply-To: <4788ffe70911130534x1ffc7bbdna3ce1adddf54c7c@mail.gmail.com>
References: <4788ffe70911121401mcd5deaj52b9036197b4f619@mail.gmail.com>
	<a8d96dec0911130520l8bcaab2q74399f3b48ca98a9@mail.gmail.com>
	<4788ffe70911130534x1ffc7bbdna3ce1adddf54c7c@mail.gmail.com>
Message-ID: <4B9AF19F.3070904@gmail.com>

Leonardo Machado Moreira wrote:
> Basicaly, Is a Cluster Implementation just based on these two libraries
> MPI on the Server and SSH on the clients??

Technically you don't need a server as long as all your clients have a
copy of your application and are able to talk to each other. File
servers and authentication servers just make things easier.

MPI is a type of library which allows your application to talk with it's
sisters on other computers without you having to do sockets programming
or other things just as distasteful. SSH allows you to sign in on one
computer and launch jobs on one or more other computers, so you can
start your application on all the computers (not technically accurate,
but good enough for an overview).

> And a program on tcl/tk for example on server to watch the cluster?

TCL/TK is a programming language and widget library. Monitoring is done
by applications written in TCL/TK and other languages, but it's not a
requirement. Unless you use it to write your application, of course.

Programs for monitoring hardware, batch scheduling, revision management,
and so on are used because they make it easier to maintain and use the
cluster optimally.

It's best to just start reading here...
http://www.clustermonkey.net//content/category/5/14/32/

> Thanks a lot.
> 
> Leonardo Machado Moreira

-- 
Geoffrey D. Jacobs


From gdjacobs at gmail.com  Fri Mar 12 18:06:06 2010
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Fri, 12 Mar 2010 20:06:06 -0600
Subject: [Beowulf] Cluster of Linux and Windows
In-Reply-To: <4788ffe70911130534x1ffc7bbdna3ce1adddf54c7c@mail.gmail.com>
References: <4788ffe70911121401mcd5deaj52b9036197b4f619@mail.gmail.com>
	<a8d96dec0911130520l8bcaab2q74399f3b48ca98a9@mail.gmail.com>
	<4788ffe70911130534x1ffc7bbdna3ce1adddf54c7c@mail.gmail.com>
Message-ID: <4B9AF30E.5090204@gmail.com>

Leonardo Machado Moreira wrote:
> Basicaly, Is a Cluster Implementation just based on these two libraries
> MPI on the Server and SSH on the clients??

Technically you don't need a server as long as all your clients have a
copy of your application and are able to talk to each other. File
servers and authentication servers just make things easier.

MPI is a type of library which allows your application to talk with it's
sisters on other computers without you having to do sockets programming
or other things just as distasteful. SSH allows you to sign in on one
computer and launch jobs on one or more other computers, so you can
start your application on all the computers (not technically accurate,
but good enough for an overview).

> And a program on tcl/tk for example on server to watch the cluster?

TCL/TK is a programming language and widget library. Monitoring is done
by applications written in TCL/TK and other languages, but it's not a
requirement. Unless you use it to write your application, of course.

Programs for monitoring hardware, batch scheduling, revision management,
and so on are used because they make it easier to maintain and use the
cluster optimally.

It's best to just start reading here...
http://www.clustermonkey.net//content/category/5/14/32/

> Thanks a lot.
> 
> Leonardo Machado Moreira

-- 
Geoffrey D. Jacobs


From richard.walsh at comcast.net  Fri Mar 12 20:11:28 2010
From: richard.walsh at comcast.net (richard.walsh at comcast.net)
Date: Sat, 13 Mar 2010 04:11:28 +0000 (UTC)
Subject: [Beowulf] error while using mpirun
In-Reply-To: <bf0758a31003121008s4ea3454waff26ccb281ae1af@mail.gmail.com>
Message-ID: <573935419.327311268453488264.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>


Akshar bhosale wrote: 

>When i do: 
> 
>/usr/local/mpich-1.2.6/bin/mpicc -o test test.c ,i get test ;but when i do 
>/usr/local/mpich-1.2.6/bin/mpirun -np 4 test,i get 
> 
>p0_31341: p4_error: Path to program is invalid while starting 
>/home/npsf/last with rsh on dragon: -1 
>p4_error: latest msg from perror: No such file or directory error. 
> 
>please suggest the solution. 


Looks like the directory that your MPI executable 'test' is in is: 


/home/npsf/last 


Correct? This directory needs to be visible on each node used 
by MPI to run your program. You might also need to put a ./ in 
front of the name of the executable, as in ./test . You also need 
be able the 'rsh' to each of those nodes. Because you have not 
specified a 'machines' file, MPI is using the default file in the install 
tree which normally lists the nodes in simple sequence. Still, I 
think the problem option 1 or 2 above. 


rbw 
_______________________________________________ 
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing 
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20100313/5f203a1b/attachment.html>

From gdjacobs at gmail.com  Fri Mar 12 21:43:16 2010
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Fri, 12 Mar 2010 23:43:16 -0600
Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping
In-Reply-To: <4B539883.8040003@abdn.ac.uk>
References: <c4d69731001171307h41c84743geb0b33aff038680b@mail.gmail.com>
	<4B539883.8040003@abdn.ac.uk>
Message-ID: <4B9B25F4.7070901@gmail.com>

Tony Travis wrote:
> Rahul Nabar wrote:
>> If I have a option between doing Hardware RAID versus having software
>> raid via mdadm is there a clear winner in terms of performance? Or is
>> the answer only resolvable by actual testing? I have a fairly fast
>> machine (Nehalem 2.26 GHz 8 cores) and 48 gigs of RAM.
> 
> Hello, Rahul.
> 
> It depends which level of RAID you want to use, and if you want hot-swap
> capability. I use inexpensive 3ware 8006-2 RAID1 controllers and stripe
> them using "md" software RAID0 to make RAID10 arrays. This gives me good
> performance and hot-swap capability (the production md RAID driver does
> not support hot-swap). However, where "md" really scores is portability.
> My RAID's can only be read by 3ware controllers - I made a considered
> descision about this: The 3ware controllers are well-supported by Linux
> kernels, but it makes me uneasy using a proprietary RAID format. I do
> also use "md" RAID5 which is more space efficient, but read this:
> 
>   http://www.baarf.com/

Hot swap is at least partially dependent on the controller. Even most of
the built-in controllers now support hot swap. I'm not aware of md
hardcoding anything on boot which would prevent a change on the fly,
except that you would manually have to initiate a rebuild after swapping.

Could you be more specific about what wasn't working as far as hot
swapping? Was this with current controllers? Which ones?

>> Should I be using the vendor's hardware RAID or mdadm? In case a
>> generic answer is not possible, what might be a good way to test the
>> two options? Any other implications that I should be thinking about?
> 
> In fact, "mdadm" is just the user-space command for controlling the "md"
> driver. The problem with using an on-board RAID controller is that many
> of these are 'host' RAID (i.e. need a Windows driver to do the RAID) in
> which case you are using the CPU anyway, and they also use proprietary
> formats. Generally, I just use SATA mode on the on-board RAID controller
> and create an "md" RAID. This means that I can replace a motherboard
> withour worrying if it has the same type of RAID controller on-board.

Yes, it's pedigreed software instead of mystery meat firmware.

>> Finally, there;s always hybrid approaches. I could have several small
>> RAID5's  at the hardware level (RIAD5 seems ok since I have smaller
>> disks ~300 GB so not really in the domain where the RAID6 arguments
>> kick in, I think) Then using LVM I can integrate storage while asking
>> LVM to stripe across these RAID5's. Thus I'd get striping at two
>> levels: LVM (software) and RAID5 (hardware).
> 
> Yes, I think a hybrid approach is good because that's what I use ;-)
> 
> However, I would avoid relying on LVM mirroring for data protection. It
> is much safer to stripe a set of RAID1's using LVM. I don't think LVM is
> useful unless you are managing a disk farm. The commonest issue in disk
> perfomance is decoupling seeks between different spindles, so I put the
> system files on a different RAID1-set to /export (or /home) filesystems.

LVM is there as a management convenience. It allows you to grow your
disk pool more-or-less on demand. Where it's really beautiful, though,
is when you want to migrate data

-- 
Geoffrey D. Jacobs


From deadline at eadline.org  Mon Mar 15 11:24:56 2010
From: deadline at eadline.org (Douglas Eadline)
Date: Mon, 15 Mar 2010 14:24:56 -0400 (EDT)
Subject: [Beowulf] Q: IB message rate & large core counts (per node)?
In-Reply-To: <2043745298.8209961267330637770.JavaMail.root@sz0135a.emeryville.ca.ma
	il.comcast.net>
References: <2043745298.8209961267330637770.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>
Message-ID: <33623.192.168.1.213.1268677496.squirrel@mail.eadline.org>

I have placed a copy of Richard's table on ClusterMonkey
in case you want an html view.

  http://www.clustermonkey.net//content/view/275/33/

--
Doug


>
> All,
>
>
> In case anyone else has trouble keeping the numbers
> straight between IB (SDR, DDR, QDR, EDR) and PCI-Express,
> (1.0, 2.0, 30) here are a couple of tables in Excel I just worked
> up to help me remember.
>
>
> If anyone finds errors in it please let me know so that I can fix
> them.
>
>
> Regards,
>
>
> rbw
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>


-- 
Doug


From patrick at myri.com  Mon Mar 15 13:27:23 2010
From: patrick at myri.com (Patrick Geoffray)
Date: Mon, 15 Mar 2010 16:27:23 -0400
Subject: [Beowulf] Q: IB message rate & large core counts (per node)?
In-Reply-To: <2043745298.8209961267330637770.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>
References: <2043745298.8209961267330637770.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>
Message-ID: <4B9E982B.9000002@myri.com>

Hi Richard,

I meant to reply earlier but got busy.

On 2/27/2010 11:17 PM, richard.walsh at comcast.net wrote:
> If anyone finds errors in it please let me know so that I can fix
> them.

You don't consider the protocol efficiency, and this is a major issue on 
PCIe.

First of all, I would change the labels "Raw" and "Effective" to 
"Signal" and "Raw". Then, I would add a third column "Effective" which 
consider the protocol overhead. The protocol overhead is the amount of 
raw bandwidth that is not used for useful payload. On PCIe, on the Read 
side, the data comes in small packets with a 20 Bytes header (could be 
24 with optional ECRC) for a 64, 128 or 256 Bytes payload. Most PCIe 
chipsets only support 64 Bytes Read Completions MTU, and even the ones 
that support larger sizes would still use a majority of 64 Bytes 
completions because it maps well to the transaction size on the memory 
bus (HT, QPI). With 64 Bytes Read Completions, the PCIe efficiency is 
64/84 = 76%, so 32 Gb/s becomes 24 Gb/s, which correspond to the hero 
number quoted by MVAPICH for example (3 GB/s unidirectional). 
Bidirectional efficiency is a bit worse because PCIe Acks take some raw 
bandwidth too. They are coalesced but the pipeline is not very deep, so 
you end up with roughly 20+20 Gb/s bidirectional.

There is a similar protocol efficiency at the IB or Ethernet level, but 
the MTU is large enough that it's much smaller compared to PCIe.

Now, all of this does not matter because Marketers will keep using 
useless Signal rates. They will even have the balls to (try to) rewrite 
history about packet rate benchmarks...

Patrick


From mathog at caltech.edu  Mon Mar 15 13:47:09 2010
From: mathog at caltech.edu (David Mathog)
Date: Mon, 15 Mar 2010 13:47:09 -0700
Subject: [Beowulf] 1000baseT NIC and PXE?
Message-ID: <E1NrHBt-000758-0L@mendel.bio.caltech.edu>

Sorry if this is a silly question, but do any of the inexpensive
1000baseT NICs support PXE boot?  I just finished looking through the
offerings on newegg and while a couple of the really really cheap ones
had an empty socket for a boot rom, none of the ones without such a
socket said definitively in its specs that it could PXE boot (or for
that matter, that it couldn't).  The older machines are all 100baseT, be
nice to give them a network speed boost by dropping in an inexpensive
1000baseT NIC, but not if the new cards won't be able to PXE boot.

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From mdidomenico4 at gmail.com  Mon Mar 15 14:00:40 2010
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Mon, 15 Mar 2010 17:00:40 -0400
Subject: [Beowulf] 1000baseT NIC and PXE?
In-Reply-To: <E1NrHBt-000758-0L@mendel.bio.caltech.edu>
References: <E1NrHBt-000758-0L@mendel.bio.caltech.edu>
Message-ID: <e75d22a91003151400t6b62b08av13a04ccf8a1bed23@mail.gmail.com>

surprisingly enough there are still cards that don't come with PXE
built into the embedded rom.  you'll have to check the specs on the
card you're interested in from the mfg website.

one thing that bit me in the past, even though the card had pxe, the
bios of the machine i was working on had no mechanism to load the
option rom (it was a desktop),


On Mon, Mar 15, 2010 at 4:47 PM, David Mathog <mathog at caltech.edu> wrote:
> Sorry if this is a silly question, but do any of the inexpensive
> 1000baseT NICs support PXE boot? ?I just finished looking through the
> offerings on newegg and while a couple of the really really cheap ones
> had an empty socket for a boot rom, none of the ones without such a
> socket said definitively in its specs that it could PXE boot (or for
> that matter, that it couldn't). ?The older machines are all 100baseT, be
> nice to give them a network speed boost by dropping in an inexpensive
> 1000baseT NIC, but not if the new cards won't be able to PXE boot.
>
> Thanks,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>


From richard.walsh at comcast.net  Mon Mar 15 14:24:52 2010
From: richard.walsh at comcast.net (richard.walsh at comcast.net)
Date: Mon, 15 Mar 2010 21:24:52 +0000 (UTC)
Subject: [Beowulf] Q: IB message rate & large core counts (per node)?
In-Reply-To: <1392498409.1069101268686455945.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>
Message-ID: <512379913.1083501268688292427.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>


On Monday, March 15, 2010 1:27:23 PM GMT Patrick Geoffray wrote: 


>I meant to respond to this, but got busy. You don't consider the protocol 
>efficiency, and this is a major issue on PCIe. 


Yes, I forgot that there is more to the protocol than the 8B/10B encoding, 
but I am glad to get your input to improve the table (late or otherwise). 

>First of all, I would change the labels "Raw" and "Effective" to 
>"Signal" and "Raw". Then, I would add a third column "Effective" which 
>consider the protocol overhead. The protocol overhead is the amount of 


I think adding another column for protocol inefficiency column makes 
some sense. Not sure I know enough to chose the right protocol performance 
loss multipliers or what the common case values would be (as opposed 
to best and worst case). It would be good to add Ethernet to the mix 
(1Gb, 10Gb, and 40Gb) as well. Sounds like the 76% multiplier is 
reasonable for PCI-E (with a "your mileage may vary" footnote). The table 
cannot perfectly reflect every contributing variable without getting very large. 
Perhaps, you could provide a table with the Ethernet numbers, and I will do 
some more research to make estimates for IB? Then I will get a draft to Doug 
at Cluster Monkey. One more iteration only ... to improve things, but avoid 
a "protocol holy war" ... ;-) ... 


>raw bandwidth that is not used for useful payload. On PCIe, on the Read 
>side, the data comes in small packets with a 20 Bytes header (could be 
>24 with optional ECRC) for a 64, 128 or 256 Bytes payload. Most PCIe 
>chipsets only support 64 Bytes Read Completions MTU, and even the ones 
>that support larger sizes would still use a majority of 64 Bytes 
>completions because it maps well to the transaction size on the memory 
>bus (HT, QPI). With 64 Bytes Read Completions, the PCIe efficiency is 
>64/84 = 76%, so 32 Gb/s becomes 24 Gb/s, which correspond to the hero 
>number quoted by MVAPICH for example (3 GB/s unidirectional). 
>Bidirectional efficiency is a bit worse because PCIe Acks take some raw 
>bandwidth too. They are coalesced but the pipeline is not very deep, so 
>you end up with roughly 20+20 Gb/s bidirectional. 


Thanks for the clear and detailed description. 

>There is a similar protocol efficiency at the IB or Ethernet level, but 
>the MTU is large enough that it's much smaller compared to PCIe. 


Would you estimate less than 1%, 2%, 4% ... ?? 

>Now, all of this does not matter because Marketers will keep using 
>useless Signal rates. They will even have the balls to (try to) rewrite 
>history about packet rate benchmarks... 


I am hoping the table increases the number of fully informed decisions on 
these questions. 


rbw 
_______________________________________________ 
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing 
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20100315/655c611c/attachment.html>

From Shainer at mellanox.com  Mon Mar 15 14:33:07 2010
From: Shainer at mellanox.com (Gilad Shainer)
Date: Mon, 15 Mar 2010 14:33:07 -0700
Subject: [Beowulf] Q: IB message rate & large core counts (per node)?
References: <512379913.1083501268688292427.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>
Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F02761204@mtiexch01.mti.com>

To make it more accurate, most PCIe chipsets supports 256B reads, and the data bandwidth is 26Gb/s, which makes it 26+26, not 20+20. 

 
Gilad

 
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of richard.walsh at comcast.net
Sent: Monday, March 15, 2010 2:25 PM
To: beowulf at beowulf.org
Subject: Re: [Beowulf] Q: IB message rate & large core counts (per node)?

 
On Monday, March 15, 2010 1:27:23 PM GMT Patrick Geoffray wrote: 

 
>I meant to respond to this, but got busy. You don't consider the protocol

>efficiency, and this is a major issue on PCIe.

 
Yes, I forgot that there is more to the protocol than the 8B/10B encoding,

but I am glad to get your input to improve the table (late or otherwise).


>First of all, I would change the labels "Raw" and "Effective" to 
>"Signal" and "Raw". Then, I would add a third column "Effective" which 
>consider the protocol overhead. The protocol overhead is the amount of 

 
I think adding another column for protocol inefficiency column makes

some sense.   Not sure I know enough to chose the right protocol performance

loss multipliers or what the common case values would be (as opposed

to best and worst case).  It would be good to add Ethernet to the mix

(1Gb, 10Gb, and 40Gb) as well.  Sounds like the 76% multiplier is 

reasonable for PCI-E (with a "your mileage may vary" footnote).  The table

cannot perfectly reflect every contributing variable without getting very large. 

Perhaps, you could provide a table with the Ethernet numbers, and I will do

some more research to make estimates for IB?  Then I will get a draft to Doug

at Cluster Monkey.  One more iteration only ... to improve things, but avoid

a "protocol holy war" ... ;-) ... 

 
>raw bandwidth that is not used for useful payload. On PCIe, on the Read 
>side, the data comes in small packets with a 20 Bytes header (could be 
>24 with optional ECRC) for a 64, 128 or 256 Bytes payload. Most PCIe 
>chipsets only support 64 Bytes Read Completions MTU, and even the ones 
>that support larger sizes would still use a majority of 64 Bytes 
>completions because it maps well to the transaction size on the memory 
>bus (HT, QPI). With 64 Bytes Read Completions, the PCIe efficiency is 
>64/84 = 76%, so 32 Gb/s becomes 24 Gb/s, which correspond to the hero 
>number quoted by MVAPICH for example (3 GB/s unidirectional). 
>Bidirectional efficiency is a bit worse because PCIe Acks take some raw 
>bandwidth too. They are coalesced but the pipeline is not very deep, so 
>you end up with roughly 20+20 Gb/s bidirectional.

 
Thanks for the clear and detailed description.


>There is a similar protocol efficiency at the IB or Ethernet level, but 
>the MTU is large enough that it's much smaller compared to PCIe.

 
Would you estimate less than 1%, 2%, 4% ... ??

>Now, all of this does not matter because Marketers will keep using 
>useless Signal rates. They will even have the balls to (try to) rewrite 
>history about packet rate benchmarks...

I am hoping the table increases the number of fully informed decisions on

these questions.

 
rbw
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20100315/941b967c/attachment.html>

From fitz at cs.earlham.edu  Fri Mar 12 17:05:41 2010
From: fitz at cs.earlham.edu (Andrew Fitz Gibbon)
Date: Fri, 12 Mar 2010 19:05:41 -0600
Subject: [Beowulf] error while using mpirun
In-Reply-To: <bf0758a31003121008s4ea3454waff26ccb281ae1af@mail.gmail.com>
References: <bf0758a31003121008s4ea3454waff26ccb281ae1af@mail.gmail.com>
Message-ID: <63ECD1AF-EB0B-46AB-BDB7-878853519DE6@cs.earlham.edu>

On Mar 12, 2010, at 12:08 PM, akshar bhosale wrote:

> when i do
> /usr/local/mpich-1.2.6/bin/mpicc -o test test.c ,i get test ;but  
> when i do
> /usr/local/mpich-1.2.6/bin/mpirun -np 4 test,i get
>
> p0_31341:  p4_error: Path to program is invalid while starting
> /home/npsf/last with rsh on dragon: -1
>    p4_error: latest msg from perror: No such file or directory
> error.
> please suggest the solution.


My suggestion would be to specify something closer to the absolute  
path to your binary. For example:

$ /usr/local/mpich-1.2.6/bin/mpirun -np 4 $HOME/test

----------------
Andrew Fitz Gibbon
fitz at cs.earlham.edu


From rigved.sharma123 at gmail.com  Fri Mar 12 22:41:06 2010
From: rigved.sharma123 at gmail.com (rigved sharma)
Date: Sat, 13 Mar 2010 12:11:06 +0530
Subject: [Beowulf] error while using mpirun
In-Reply-To: <573935419.327311268453488264.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>
References: <bf0758a31003121008s4ea3454waff26ccb281ae1af@mail.gmail.com>
	<573935419.327311268453488264.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>
Message-ID: <e3ccefa01003122241q79e5b842ye43e1568a49e87cd@mail.gmail.com>

hi,
thanks for your solutions.
i tried all solutions given by you ..still same error. can ypu please
suggest any other solution?
regards,
akshar

On Sat, Mar 13, 2010 at 9:41 AM, <richard.walsh at comcast.net> wrote:

>
> Akshar bhosale wrote:
>
> >When i do:
> >
> >/usr/local/mpich-1.2.6/bin/mpicc -o test test.c ,i get test ;but when i do
> >/usr/local/mpich-1.2.6/bin/mpirun -np 4 test,i get
> >
> >p0_31341:  p4_error: Path to program is invalid while starting
> >/home/npsf/last with rsh on dragon: -1
> >p4_error: latest msg from perror: No such file or directory error.
> >
> >please suggest the solution.
>
> Looks like the directory that your MPI executable 'test' is in is:
>
> /home/npsf/last
>
> Correct?  This directory needs to be visible on each node used
> by MPI to run your program.  You might also need to put a ./ in
> front of the name of the executable, as in ./test . You also need
> be able the 'rsh' to each of those nodes.  Because you have not
> specified a 'machines' file, MPI is using the default file in the install
> tree which normally lists the nodes in simple sequence. Still, I
> think the problem option 1 or 2 above.
>
> rbw
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20100313/10fb59bc/attachment.html>

From erik at contica.com  Sat Mar 13 07:48:26 2010
From: erik at contica.com (Erik Andresen)
Date: Sat, 13 Mar 2010 16:48:26 +0100
Subject: [Beowulf] error while using mpirun
In-Reply-To: <201003130209.o2D28ttl025995@bluewest.scyld.com>
References: <201003130209.o2D28ttl025995@bluewest.scyld.com>
Message-ID: <4B9BB3CA.1050208@contica.com>


> Date: Fri, 12 Mar 2010 23:38:56 +0530
> From: akshar bhosale <akshar.bhosale at gmail.com>
> Subject: [Beowulf] error while using mpirun
> To: beowulf at beowulf.org, torqueusers at supercluster.org
> when i do
> /usr/local/mpich-1.2.6/bin/mpicc -o test test.c ,i get test ;but when i do
> /usr/local/mpich-1.2.6/bin/mpirun -np 4 test,i get
>
> p0_31341:  p4_error: Path to program is invalid while starting
> /home/npsf/last with rsh on dragon: -1
>     p4_error: latest msg from perror: No such file or directory
> error.
> please suggest the solution.
>
>   
Classic mistake. Try to use

mpirun -np 4 ./test


On my system 'which test' returns /usr/bin/test
so I guess mpirun tries to run another 'test' than
the one you made.

Erik Andresen


From patrick at myri.com  Mon Mar 15 15:47:51 2010
From: patrick at myri.com (Patrick Geoffray)
Date: Mon, 15 Mar 2010 18:47:51 -0400
Subject: [Beowulf] Q: IB message rate & large core counts (per node)?
In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784F02761204@mtiexch01.mti.com>
References: <512379913.1083501268688292427.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>
	<9FA59C95FFCBB34EA5E42C1A8573784F02761204@mtiexch01.mti.com>
Message-ID: <4B9EB917.8090201@myri.com>

On 3/15/2010 5:33 PM, Gilad Shainer wrote:
> To make it more accurate, most PCIe chipsets supports 256B reads, and
> the data bandwidth is 26Gb/s, which makes it 26+26, not 20+20.

I know Marketers lives in their own universe, but here are a few nuts 
for you to crack:

* If most PCIe chipsets would effectively do 256B Completions, why is 
the max unidirectional bandwidth for QDR/Nehalem is 3026 MB/s (24.2 
GB/s) as reported in the latest MVAPICH announcement ?
3026 MB/s is 73.4% efficiency compared to raw bandwidth of 4 GB for Gen2 
8x. With 256B Completions, the PCIe efficiency would be 92.7%, so 
someone would be losing 19.3% ? Would that be your silicon ?

* For 64B Completions: 64/84 is 0.7619, and 0.7619 * 32 = 24.38 Gb/s. 
How do you get 26 Gb/s again ?

* PCIe is a reliable protocol, there are Acks in the other direction. If 
you claim that one way is 26 GB/s and two-way is 26+26 Gb/s, does that 
mean you have invented a reliable protocol that does not need acks ?

* If bidirectional is 26+26, why is the max bidirectional bandwidth 
reported by MVAPICH is 5858 MB/s, ie 46.8 Gb/s  or 23.4+23.4 Gb/s ? 
Granted, it's more than 20+20, but it depends a lot on the 
chipset-dependent pipeline depth.


BTW, Greg's offer is still pending...

Patrick


From Shainer at mellanox.com  Mon Mar 15 16:09:53 2010
From: Shainer at mellanox.com (Gilad Shainer)
Date: Mon, 15 Mar 2010 16:09:53 -0700
Subject: [Beowulf] Q: IB message rate & large core counts (per node)?
References: <512379913.1083501268688292427.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net><9FA59C95FFCBB34EA5E42C1A8573784F02761204@mtiexch01.mti.com>
	<4B9EB917.8090201@myri.com>
Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F0276123F@mtiexch01.mti.com>

I don?t appreciate those kind of responses and it is not appropriate for this mailing list. Please fix in future emails. I am standing behind any info I put out, and definitely don?t do down estimations as you do. It was nice to see that you fixed your 20+20 numbers to 24+23 (that was marketing that you did?), but I suggest you do a better search to look on numbers of recent systems, with decent Bios setting. Gen2 system can provide 3300MB/s uni or >6500MB bi dir. Of course you can find versions that gives lower performance, and I can send you some instruction to get the PCIe BW even lower than 20 for your own performance testing if you want to. It still will be much higher than what you can do with Myri10G... 

Gilad


-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Patrick Geoffray
Sent: Monday, March 15, 2010 3:48 PM
To: beowulf at beowulf.org
Subject: Re: [Beowulf] Q: IB message rate & large core counts (per node)?

On 3/15/2010 5:33 PM, Gilad Shainer wrote:
> To make it more accurate, most PCIe chipsets supports 256B reads, and
> the data bandwidth is 26Gb/s, which makes it 26+26, not 20+20.

I know Marketers lives in their own universe, but here are a few nuts 
for you to crack:

* If most PCIe chipsets would effectively do 256B Completions, why is 
the max unidirectional bandwidth for QDR/Nehalem is 3026 MB/s (24.2 
GB/s) as reported in the latest MVAPICH announcement ?
3026 MB/s is 73.4% efficiency compared to raw bandwidth of 4 GB for Gen2 
8x. With 256B Completions, the PCIe efficiency would be 92.7%, so 
someone would be losing 19.3% ? Would that be your silicon ?

* For 64B Completions: 64/84 is 0.7619, and 0.7619 * 32 = 24.38 Gb/s. 
How do you get 26 Gb/s again ?

* PCIe is a reliable protocol, there are Acks in the other direction. If 
you claim that one way is 26 GB/s and two-way is 26+26 Gb/s, does that 
mean you have invented a reliable protocol that does not need acks ?

* If bidirectional is 26+26, why is the max bidirectional bandwidth 
reported by MVAPICH is 5858 MB/s, ie 46.8 Gb/s  or 23.4+23.4 Gb/s ? 
Granted, it's more than 20+20, but it depends a lot on the 
chipset-dependent pipeline depth.


BTW, Greg's offer is still pending...

Patrick
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


From patrick at myri.com  Mon Mar 15 16:30:15 2010
From: patrick at myri.com (Patrick Geoffray)
Date: Mon, 15 Mar 2010 19:30:15 -0400
Subject: [Beowulf] Q: IB message rate & large core counts (per node)?
In-Reply-To: <512379913.1083501268688292427.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>
References: <512379913.1083501268688292427.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>
Message-ID: <4B9EC307.5070009@myri.com>

On 3/15/2010 5:24 PM, richard.walsh at comcast.net wrote:
> to best and worst case). It would be good to add Ethernet to the mix
> (1Gb, 10Gb, and 40Gb) as well.

10 Gb Ethernet uses 8b/10b with a signal rate of 12.5 Gb/s, for a raw 
bandwidth of 10 Gb/s. I don't know how 1Gb is encoded and 40 Gb/s is 
still in draft. Last time I looked at 40 Gb/s, it was pretty much four 
10 Gb links put together, so I would say 8b/10b with 50 Gb/s signal rate.

>  >There is a similar protocol efficiency at the IB or Ethernet level, but
>  >the MTU is large enough that it's much smaller compared to PCIe.
>
> Would you estimate less than 1%, 2%, 4% ... ??

It depends on the packet size. For example, 14 Bytes Ethernet header on 
1500 Bytes MTU, that's 1%. For Jumbo frames at 9000B MTU, it's much less 
than that. I don't know the header size in IB, but with an MTU of 2K or 
4K, it's negligible.

However, things are different for tiny packets. The minimum packet size 
on Ethernet is 60 Bytes. The maximum packet rate (not coalesced !) is 
14.88 Mpps on a 10GE link, counting everything (inter-packet gap, CRC, 
etc). If you do the math, that's 14.88*60 = 892 MB/s on the link, or 684 
MB/s if you remove the 14B Ethernet header (54% efficiency).

I don't think you can put all that on an Excel sheet :-)

Patrick


From tom.ammon at utah.edu  Mon Mar 15 16:45:44 2010
From: tom.ammon at utah.edu (Tom Ammon)
Date: Mon, 15 Mar 2010 17:45:44 -0600
Subject: [Beowulf] Q: IB message rate & large core counts (per node)?
In-Reply-To: <4B9EC307.5070009@myri.com>
References: <512379913.1083501268688292427.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>
	<4B9EC307.5070009@myri.com>
Message-ID: <4B9EC6A8.2000800@utah.edu>

If I understand correctly, 40GbE is 64/66 encoded.

Tom


On 3/15/2010 5:30 PM, Patrick Geoffray wrote:
> On 3/15/2010 5:24 PM, richard.walsh at comcast.net wrote:
>    
>> to best and worst case). It would be good to add Ethernet to the mix
>> (1Gb, 10Gb, and 40Gb) as well.
>>      
> 10 Gb Ethernet uses 8b/10b with a signal rate of 12.5 Gb/s, for a raw
> bandwidth of 10 Gb/s. I don't know how 1Gb is encoded and 40 Gb/s is
> still in draft. Last time I looked at 40 Gb/s, it was pretty much four
> 10 Gb links put together, so I would say 8b/10b with 50 Gb/s signal rate.
>
>    
>>   >There is a similar protocol efficiency at the IB or Ethernet level, but
>>   >the MTU is large enough that it's much smaller compared to PCIe.
>>
>> Would you estimate less than 1%, 2%, 4% ... ??
>>      
> It depends on the packet size. For example, 14 Bytes Ethernet header on
> 1500 Bytes MTU, that's 1%. For Jumbo frames at 9000B MTU, it's much less
> than that. I don't know the header size in IB, but with an MTU of 2K or
> 4K, it's negligible.
>
> However, things are different for tiny packets. The minimum packet size
> on Ethernet is 60 Bytes. The maximum packet rate (not coalesced !) is
> 14.88 Mpps on a 10GE link, counting everything (inter-packet gap, CRC,
> etc). If you do the math, that's 14.88*60 = 892 MB/s on the link, or 684
> MB/s if you remove the 14B Ethernet header (54% efficiency).
>
> I don't think you can put all that on an Excel sheet :-)
>
> Patrick
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>    


-- 
--------------------------------------------------------------------
Tom Ammon
Network Engineer
Office: 801.587.0976
Mobile: 801.674.9273

Center for High Performance Computing
University of Utah
http://www.chpc.utah.edu


From Shainer at mellanox.com  Mon Mar 15 16:54:08 2010
From: Shainer at mellanox.com (Gilad Shainer)
Date: Mon, 15 Mar 2010 16:54:08 -0700
Subject: [Beowulf] Q: IB message rate & large core counts (per node)?
Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F01E1558F@mtiexch01.mti.com>

That's correct


----- Original Message -----
From: beowulf-bounces at beowulf.org <beowulf-bounces at beowulf.org>
To: Patrick Geoffray <patrick at myri.com>
Cc: beowulf at beowulf.org <beowulf at beowulf.org>
Sent: Mon Mar 15 16:45:44 2010
Subject: Re: [Beowulf] Q: IB message rate & large core counts (per node)?

If I understand correctly, 40GbE is 64/66 encoded.

Tom


On 3/15/2010 5:30 PM, Patrick Geoffray wrote:
> On 3/15/2010 5:24 PM, richard.walsh at comcast.net wrote:
>    
>> to best and worst case). It would be good to add Ethernet to the mix
>> (1Gb, 10Gb, and 40Gb) as well.
>>      
> 10 Gb Ethernet uses 8b/10b with a signal rate of 12.5 Gb/s, for a raw
> bandwidth of 10 Gb/s. I don't know how 1Gb is encoded and 40 Gb/s is
> still in draft. Last time I looked at 40 Gb/s, it was pretty much four
> 10 Gb links put together, so I would say 8b/10b with 50 Gb/s signal rate.
>
>    
>>   >There is a similar protocol efficiency at the IB or Ethernet level, but
>>   >the MTU is large enough that it's much smaller compared to PCIe.
>>
>> Would you estimate less than 1%, 2%, 4% ... ??
>>      
> It depends on the packet size. For example, 14 Bytes Ethernet header on
> 1500 Bytes MTU, that's 1%. For Jumbo frames at 9000B MTU, it's much less
> than that. I don't know the header size in IB, but with an MTU of 2K or
> 4K, it's negligible.
>
> However, things are different for tiny packets. The minimum packet size
> on Ethernet is 60 Bytes. The maximum packet rate (not coalesced !) is
> 14.88 Mpps on a 10GE link, counting everything (inter-packet gap, CRC,
> etc). If you do the math, that's 14.88*60 = 892 MB/s on the link, or 684
> MB/s if you remove the 14B Ethernet header (54% efficiency).
>
> I don't think you can put all that on an Excel sheet :-)
>
> Patrick
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>    


-- 
--------------------------------------------------------------------
Tom Ammon
Network Engineer
Office: 801.587.0976
Mobile: 801.674.9273

Center for High Performance Computing
University of Utah
http://www.chpc.utah.edu

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20100315/d2b06ad1/attachment.html>

From tom.elken at qlogic.com  Mon Mar 15 17:03:14 2010
From: tom.elken at qlogic.com (Tom Elken)
Date: Mon, 15 Mar 2010 17:03:14 -0700
Subject: [Beowulf] Q: IB message rate & large core counts (per node)?
In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784F02662CA3@mtiexch01.mti.com>
References: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com><e75d22a91002191249w52f76004n926cc6b25deb03d5@mail.gmail.com><9FA59C95FFCBB34EA5E42C1A8573784F02662C76@mtiexch01.mti.com>
	<20100219220538.GL2857@bx9.net>
	<9FA59C95FFCBB34EA5E42C1A8573784F02662CA3@mtiexch01.mti.com>
Message-ID: <35AAF1E4A771E142979F27B51793A48887348B241D@AVEXMB1.qlogic.org>

> On Behalf Of Gilad Shainer
> 
> ... OSU has different benchmarks
> so you can measure message coalescing or real message rate. 

[ As a refresher for the wider audience , as Gilad defined earlier: " Message coalescing is when you incorporate multiple MPI messages in a single network packet."  And I agree with this definition :) ]

Gilad,  

Sorry for the delayed QLogic response on this.  I was on vacation when this thread started up.  But now that it has been revived, ... 

Which OSU benchmarks have message-coalescing built into the source?  

> Nowadays it seems that QLogic
> promotes the message rate as non coalescing data and I almost got
> bought
> by their marketing machine till I looked on at the data on the wire...
> interesting what the bits and bytes and symbols can tell you...

Message-coalescing has been done in benchmark source code, such as HPC Challenge's MPI RandomAccess benchmark.  In that case, coalescing is performed when the SANDIA_OPT2 define is turned on during the build.

More typically message coalescing is a feature of some MPIs and they use various heuristics for when it is active.

MVAPICH has an environment variable -- VIADEV_USE_COALESCE -- which can turn this feature on or off.  HP-MPI has coalescing heuristics on by default when using IB-Verbs, off by default when using QLogic's PSM.  Open MPI has enabled message-coalescing heuristics for more recent versions when running over IB verbs.

There is nothing wrong with message coalescing features in the MPI.  Only when you are trying to measure the raw message rate of the network adapter, it is best to not use message coalescing feature so you can measure what you set out to measure.  QLogic MPI does not have a message coalescing feature, and that is what we use to measure MPI message rate on our IB adapters.  We also measure using MVAPICH with it's message coalescing feature turned off, and get virtually identical message rate performance to that with QLogic MPI. 

I don't know what you were measuring on the wire, but with the osu_mbw_mr benchmark and QLogic MPI, for the small 1 to 8 byte message sizes where we achieve maximum message rate, each message is in its own 56 byte packet with
no coalescing.  I asked a couple of our engineers who have looked at a lot of PCIe traces to make sure of this.

Regards,
-Tom


> 
> 
> 
> -----Original Message-----
> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org]
> On Behalf Of Greg Lindahl
> Sent: Friday, February 19, 2010 2:06 PM
> To: beowulf at beowulf.org
> Subject: Re: [Beowulf] Q: IB message rate & large core counts (per
> node)?
> 
> > Mellanox latest message rate numbers with ConnectX-2 more than
> > doubled versus the old cards, and are for real message rate -
> > separate messages on the wire. The competitor numbers are with using
> > message coalescing, so it is not real separate messages on the wire,
> > or not really message rate.
> 
> Gilad,
> 
> I think you forgot which side you're supposed to be supporting.
> 
> The only people I have ever seen publish message rate with coalesced
> messages are DK Panda (with Mellanox cards) and Mellanox.
> 
> QLogic always hated coalesced messages, and if you look back in the
> archive for this mailing list, you'll see me denouncing coalesced
> messages as meanless about 1 microsecond after the first result was
> published by Prof. Panda.
> 
> Looking around the Internet, I don't see any numbers ever published by
> PathScale/QLogic using coalesced messages.
> 
> At the end of the day, the only reason microbenchmarks are useful is
> when they help explain why one interconnect does better than another
> on real applications. No customer should ever choose which adapter to
> buy based on microbenchmarks.
> 
> -- greg
> (formerly employed by QLogic)
> 
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
> Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
> Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf


From lindahl at pbm.com  Mon Mar 15 17:06:01 2010
From: lindahl at pbm.com (Greg Lindahl)
Date: Mon, 15 Mar 2010 17:06:01 -0700
Subject: [Beowulf] Q: IB message rate & large core counts (per node)?
In-Reply-To: <4B9EC307.5070009@myri.com>
References: <512379913.1083501268688292427.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>
	<4B9EC307.5070009@myri.com>
Message-ID: <20100316000601.GA8362@bx9.net>

On Mon, Mar 15, 2010 at 07:30:15PM -0400, Patrick Geoffray wrote:

> However, things are different for tiny packets. The minimum packet size  
> on Ethernet is 60 Bytes. The maximum packet rate (not coalesced !) is  
> 14.88 Mpps on a 10GE link, counting everything (inter-packet gap, CRC,  
> etc). If you do the math, that's 14.88*60 = 892 MB/s on the link, or 684  
> MB/s if you remove the 14B Ethernet header (54% efficiency).

There's the additional complexity, for tiny packets, that different
cards will have different outgoing inter-packet gaps, usually greater
than the minimum.  Switches can merge streams from multiple hosts and
reduce that inter-packet gap on the receiving side, if multiple hosts
talk to one.  This is true for both Ethernet and IB.

And if we're talking MPI, different MPIs have different header sizes.
That's where a graph of non-coalesced MPI bandwidth (not including all
the overheads) as a function of message size and core counts is
interesting.

The spreadsheet can kinda be useful for large messages, but not for
data sizes < 1/2 MTU. But even for large data sizes, there are plenty
of other factors which your real application can trip over.

-- greg


From hahn at mcmaster.ca  Mon Mar 15 19:20:54 2010
From: hahn at mcmaster.ca (Mark Hahn)
Date: Mon, 15 Mar 2010 22:20:54 -0400 (EDT)
Subject: [Beowulf] Q: IB message rate & large core counts (per node)?
In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784F0276123F@mtiexch01.mti.com>
References: <512379913.1083501268688292427.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net><9FA59C95FFCBB34EA5E42C1A8573784F02761204@mtiexch01.mti.com>
	<4B9EB917.8090201@myri.com>
	<9FA59C95FFCBB34EA5E42C1A8573784F0276123F@mtiexch01.mti.com>
Message-ID: <Pine.LNX.4.64.1003152135000.2105@coffee.psychology.mcmaster.ca>

> I don???t appreciate those kind of responses and it is not
>appropriate for this mailing list. Please fix in future emails. I am

your assert some numbers, perhaps correct, but Patrick provides 
useful explanations.  I prefer the latter.

>system can provide 3300MB/s uni or >6500MB bi dir. Of course you can find

your numbers are 10% different from Patrick's.  wow.

could we all just drop the silly personal byplay?

From lindahl at pbm.com  Mon Mar 15 21:48:07 2010
From: lindahl at pbm.com (Greg Lindahl)
Date: Mon, 15 Mar 2010 21:48:07 -0700
Subject: [Beowulf] Q: IB message rate & large core counts (per node)?
In-Reply-To: <35AAF1E4A771E142979F27B51793A48887348B241D@AVEXMB1.qlogic.org>
References: <20100219220538.GL2857@bx9.net>
	<9FA59C95FFCBB34EA5E42C1A8573784F02662CA3@mtiexch01.mti.com>
	<35AAF1E4A771E142979F27B51793A48887348B241D@AVEXMB1.qlogic.org>
Message-ID: <20100316044807.GB27065@bx9.net>

On Mon, Mar 15, 2010 at 05:03:14PM -0700, Tom Elken wrote:

> QLogic MPI does not have a message coalescing feature, and that is
> what we use to measure MPI message rate on our IB adapters.

Thank you for making that clear, Tom.

-- greg


From niftyompi at niftyegg.com  Mon Mar 15 21:58:36 2010
From: niftyompi at niftyegg.com (Nifty Tom Mitchell)
Date: Mon, 15 Mar 2010 21:58:36 -0700
Subject: [Beowulf] confidential data on public HPC cluster
In-Reply-To: <4B8BEB7D.9050904@scinet.utoronto.ca>
References: <4B8BEB7D.9050904@scinet.utoronto.ca>
Message-ID: <20100316045836.GA3252@compegg>

On Mon, Mar 01, 2010 at 11:29:49AM -0500, Jonathan Dursi wrote:
> 
> Hi;
> 
> We're a fairly typical academic HPC centre, and we're starting to
> have users talk to us about using our new clusters for projects that
> have various requirements for keeping data confidential. 

"Various requirements" should spell it out for you.
The requirements result in consequences and a price.
Multiple groups may have conflicting requirements
and cannot play together.

If they want timeshare does that desire argue with 
their requirements?

In one senario you can isolate storage and kickstart (clean load) all
the compute hosts between project access.  i.e. It is possible for each group to
have its own "Head Node" with a dedicated NFS resource and allow only one
"Head Node" to be physically connected to cluster at a time.

Requirements should specify staff requirements and more
including physical access. 

The cost of a breach can dwarf the cost of dedicated individual disk 
farms and clusters.    If their requirements cost you then they
need to put skin in the game.

Your best solution might be to turn it back at them and make 
"various requirements" of them that you can live with!  This
might require a legal review as well.


-- 
	T o m  M i t c h e l l 
	Found me a new hat, now what?


From carsten.aulbert at aei.mpg.de  Tue Mar 16 08:27:30 2010
From: carsten.aulbert at aei.mpg.de (Carsten Aulbert)
Date: Tue, 16 Mar 2010 16:27:30 +0100
Subject: [Beowulf] HPL as a learning experience
Message-ID: <201003161627.32012.carsten.aulbert@aei.mpg.de>

Hi all,

I wanted to run high performance linpack mostly for fun (and of course to 
learn more about it and stress test a couple of machines). However, so far 
I've had very mixed results.

I downloaded the 2.0 version released in September 2008 and managed it to 
compile with mpich 1.2.7 on Debian Lenny. The resulting xhpl file is 
dynamically linked like this:

        linux-vdso.so.1 =>  (0x00007fffca372000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x00007fb47bca8000)
        librt.so.1 => /lib/librt.so.1 (0x00007fb47ba9f000)
        libgfortran.so.3 => /usr/lib/libgfortran.so.3 (0x00007fb47b7c4000)
        libm.so.6 => /lib/libm.so.6 (0x00007fb47b541000)
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fb47b32a000)
        libc.so.6 => /lib/libc.so.6 (0x00007fb47afd7000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fb47bec4000)

Then I wanted to run a couple of tests on a single quad-CPU node (with 12 GB 
physical RAM), I used

http://www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html

to generate files for a single and a dual core test [1] and [2].

Starting the single core run does not pose any problem:
/usr/bin/mpirun.mpich -np 1  -machinefile machines /nfs/xhpl

where machines is just a simple file containing 4 times the name of this host. 
So far so good. 
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR11C2R4       14592   128     1     1             407.94          5.078e+00
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0087653 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0209927 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0045327 ...... PASSED
============================================================================

When starting the two core run, I receive the following error message after a 
couple of seconds (after RSS hits the VIRT RAM value in top):

/usr/bin/mpirun.mpich -np 2  -machinefile machines /nfs/xhpl
p0_20535:  p4_error: interrupt SIGSEGV: 11
rm_l_1_20540: (1.804688) net_send: could not write to fd=5, errno = 32

SIGSEGV with p4_error indicates a seg fault within hpl - that's as far as I've 
come with google, but right now I have no idea how to proceed. I somehow doubt 
that this venerable program is so buggy that I'd hit it on my first day ;)

Any ideas where I might do something wrong?

Cheers

Carsten

[1]
single core test
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any) 
8            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
14592         Ns
1            # of NBs
128           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0                               Number of additional problem sizes for PTRANS
1200 10000 30000                values of N
0                               number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64        values of NB

[2]
dual core setup
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any) 
8            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
14592         Ns
1            # of NBs
128           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0                               Number of additional problem sizes for PTRANS
1200 10000 30000                values of N
0                               number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64        values of NB


From gus at ldeo.columbia.edu  Tue Mar 16 10:00:07 2010
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Tue, 16 Mar 2010 13:00:07 -0400
Subject: [Beowulf] HPL as a learning experience
In-Reply-To: <201003161627.32012.carsten.aulbert@aei.mpg.de>
References: <201003161627.32012.carsten.aulbert@aei.mpg.de>
Message-ID: <4B9FB917.1070905@ldeo.columbia.edu>

Hi Carsten

The problem is most likely mpich 1.2.7.
MPICH-1 is old and no longer maintained.
It is based on the P4 lower level libraries, which don't
seem to talk properly to current Linux kernels and/or
to current Ethernet card drivers.

There were several postings on this list, on the ROCKS Clusters list,
on the MPICH list, etc, reporting errors very similar to yours:
a p4 error followed by a segmentation fault.
The MPICH developers recommend upgrading to MPICH2 because of
these problems, besides performance, ease of use, etc.

The easy fix is to use another MPI, say, OpenMPI or MPICH2.
I would guess they are available as packages for Debian.

However, you can build both very easily
from source using just gcc/g++/gfortran.
Get the source code and documentation,
then read the README files, FAQ (OpenMPI),
and Install Guide, User Guide (MPICH2) for details:

OpenMPI
http://www.open-mpi.org/
http://www.open-mpi.org/software/ompi/v1.4/
http://www.open-mpi.org/faq/
http://www.open-mpi.org/faq/?category=building

MPICH2:
http://www.mcs.anl.gov/research/projects/mpich2/
http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads
http://www.mcs.anl.gov/research/projects/mpich2/documentation/index.php?s=docs

I compiled and ran HPL here with both OpenMPI and MPICH2
(and MVAPICH2 as well), and it works just fine,
over Ethernet and over Infiniband.

I hope this helps.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Carsten Aulbert wrote:
> Hi all,
> 
> I wanted to run high performance linpack mostly for fun (and of course to 
> learn more about it and stress test a couple of machines). However, so far 
> I've had very mixed results.
> 
> I downloaded the 2.0 version released in September 2008 and managed it to 
> compile with mpich 1.2.7 on Debian Lenny. The resulting xhpl file is 
> dynamically linked like this:
> 
>         linux-vdso.so.1 =>  (0x00007fffca372000)
>         libpthread.so.0 => /lib/libpthread.so.0 (0x00007fb47bca8000)
>         librt.so.1 => /lib/librt.so.1 (0x00007fb47ba9f000)
>         libgfortran.so.3 => /usr/lib/libgfortran.so.3 (0x00007fb47b7c4000)
>         libm.so.6 => /lib/libm.so.6 (0x00007fb47b541000)
>         libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fb47b32a000)
>         libc.so.6 => /lib/libc.so.6 (0x00007fb47afd7000)
>         /lib64/ld-linux-x86-64.so.2 (0x00007fb47bec4000)
> 
> Then I wanted to run a couple of tests on a single quad-CPU node (with 12 GB 
> physical RAM), I used
> 
> http://www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html
> 
> to generate files for a single and a dual core test [1] and [2].
> 
> Starting the single core run does not pose any problem:
> /usr/bin/mpirun.mpich -np 1  -machinefile machines /nfs/xhpl
> 
> where machines is just a simple file containing 4 times the name of this host. 
> So far so good. 
> ============================================================================
> T/V                N    NB     P     Q               Time             Gflops
> ----------------------------------------------------------------------------
> WR11C2R4       14592   128     1     1             407.94          5.078e+00
> ----------------------------------------------------------------------------
> ||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0087653 ...... PASSED
> ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0209927 ...... PASSED
> ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0045327 ...... PASSED
> ============================================================================
> 
> When starting the two core run, I receive the following error message after a 
> couple of seconds (after RSS hits the VIRT RAM value in top):
> 
> /usr/bin/mpirun.mpich -np 2  -machinefile machines /nfs/xhpl
> p0_20535:  p4_error: interrupt SIGSEGV: 11
> rm_l_1_20540: (1.804688) net_send: could not write to fd=5, errno = 32
> 
> SIGSEGV with p4_error indicates a seg fault within hpl - that's as far as I've 
> come with google, but right now I have no idea how to proceed. I somehow doubt 
> that this venerable program is so buggy that I'd hit it on my first day ;)
> 
> Any ideas where I might do something wrong?
> 
> Cheers
> 
> Carsten
> 
> [1]
> single core test
> HPLinpack benchmark input file
> Innovative Computing Laboratory, University of Tennessee
> HPL.out      output file name (if any) 
> 8            device out (6=stdout,7=stderr,file)
> 1            # of problems sizes (N)
> 14592         Ns
> 1            # of NBs
> 128           NBs
> 0            PMAP process mapping (0=Row-,1=Column-major)
> 1            # of process grids (P x Q)
> 1            Ps
> 1            Qs
> 16.0         threshold
> 1            # of panel fact
> 2            PFACTs (0=left, 1=Crout, 2=Right)
> 1            # of recursive stopping criterium
> 4            NBMINs (>= 1)
> 1            # of panels in recursion
> 2            NDIVs
> 1            # of recursive panel fact.
> 1            RFACTs (0=left, 1=Crout, 2=Right)
> 1            # of broadcast
> 1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
> 1            # of lookahead depth
> 1            DEPTHs (>=0)
> 2            SWAP (0=bin-exch,1=long,2=mix)
> 64           swapping threshold
> 0            L1 in (0=transposed,1=no-transposed) form
> 0            U  in (0=transposed,1=no-transposed) form
> 1            Equilibration (0=no,1=yes)
> 8            memory alignment in double (> 0)
> ##### This line (no. 32) is ignored (it serves as a separator). ######
> 0                               Number of additional problem sizes for PTRANS
> 1200 10000 30000                values of N
> 0                               number of additional blocking sizes for PTRANS
> 40 9 8 13 13 20 16 32 64        values of NB
> 
> [2]
> dual core setup
> HPLinpack benchmark input file
> Innovative Computing Laboratory, University of Tennessee
> HPL.out      output file name (if any) 
> 8            device out (6=stdout,7=stderr,file)
> 1            # of problems sizes (N)
> 14592         Ns
> 1            # of NBs
> 128           NBs
> 0            PMAP process mapping (0=Row-,1=Column-major)
> 1            # of process grids (P x Q)
> 1            Ps
> 2            Qs
> 16.0         threshold
> 1            # of panel fact
> 2            PFACTs (0=left, 1=Crout, 2=Right)
> 1            # of recursive stopping criterium
> 4            NBMINs (>= 1)
> 1            # of panels in recursion
> 2            NDIVs
> 1            # of recursive panel fact.
> 1            RFACTs (0=left, 1=Crout, 2=Right)
> 1            # of broadcast
> 1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
> 1            # of lookahead depth
> 1            DEPTHs (>=0)
> 2            SWAP (0=bin-exch,1=long,2=mix)
> 64           swapping threshold
> 0            L1 in (0=transposed,1=no-transposed) form
> 0            U  in (0=transposed,1=no-transposed) form
> 1            Equilibration (0=no,1=yes)
> 8            memory alignment in double (> 0)
> ##### This line (no. 32) is ignored (it serves as a separator). ######
> 0                               Number of additional problem sizes for PTRANS
> 1200 10000 30000                values of N
> 0                               number of additional blocking sizes for PTRANS
> 40 9 8 13 13 20 16 32 64        values of NB
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


From mathog at caltech.edu  Tue Mar 16 10:38:07 2010
From: mathog at caltech.edu (David Mathog)
Date: Tue, 16 Mar 2010 10:38:07 -0700
Subject: [Beowulf] 1000baseT NIC and PXE?
Message-ID: <E1NraiV-0007RF-Q9@mendel.bio.caltech.edu>

Michael Di Domenico wrote:
> 
> surprisingly enough there are still cards that don't come with PXE
> built into the embedded rom.  you'll have to check the specs on the
> card you're interested in from the mfg website.

Here are the docs for a typical inexpensive 1000baseT card:

ftp://ftp10.dlink.com/pdfs/products/DGE-530T/DGE-530T_ds.pdf

There is no empty socket on it for a boot rom, yet the specs say nothing
about whether or not it will PXE boot.

Let's turn this question around a bit.  Can anyone suggest a specific
inexpensive 1000baseT card which provides PXE and is otherwise reliable
and fast?  That is, one that you have personally used to boot a machine
into your cluster.  Similarly, the names of any models that should be
avoided would also be useful information.

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From mdidomenico4 at gmail.com  Tue Mar 16 14:52:30 2010
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Tue, 16 Mar 2010 17:52:30 -0400
Subject: [Beowulf] 1000baseT NIC and PXE?
In-Reply-To: <E1NraiV-0007RF-Q9@mendel.bio.caltech.edu>
References: <E1NraiV-0007RF-Q9@mendel.bio.caltech.edu>
Message-ID: <e75d22a91003161452y3e340400k733f7e2b66c3c095@mail.gmail.com>

On Tue, Mar 16, 2010 at 1:38 PM, David Mathog <mathog at caltech.edu> wrote:
> Michael Di Domenico wrote:
>>
>> surprisingly enough there are still cards that don't come with PXE
>> built into the embedded rom. ?you'll have to check the specs on the
>> card you're interested in from the mfg website.
>
> Here are the docs for a typical inexpensive 1000baseT card:
>
> ftp://ftp10.dlink.com/pdfs/products/DGE-530T/DGE-530T_ds.pdf
>
> There is no empty socket on it for a boot rom, yet the specs say nothing
> about whether or not it will PXE boot.
>
> Let's turn this question around a bit. ?Can anyone suggest a specific
> inexpensive 1000baseT card which provides PXE and is otherwise reliable
> and fast? ?That is, one that you have personally used to boot a machine
> into your cluster. ?Similarly, the names of any models that should be
> avoided would also be useful information.

I can't speak for the dlink card, but i primarily choose Intel NIC's,
something like this

http://www.newegg.com/Product/Product.aspx?Item=N82E16833106121

Any of the intel NIC's that support your slot specification and
support "Intel Boot Agent" should support PXE


From greg.matthews at diamond.ac.uk  Wed Mar 17 02:54:05 2010
From: greg.matthews at diamond.ac.uk (Gregory Matthews)
Date: Wed, 17 Mar 2010 09:54:05 +0000
Subject: [Beowulf] 1000baseT NIC and PXE?
In-Reply-To: <E1NraiV-0007RF-Q9@mendel.bio.caltech.edu>
References: <E1NraiV-0007RF-Q9@mendel.bio.caltech.edu>
Message-ID: <4BA0A6BD.90208@diamond.ac.uk>

David Mathog wrote:
> Let's turn this question around a bit.  Can anyone suggest a specific
> inexpensive 1000baseT card which provides PXE and is otherwise reliable
> and fast?  That is, one that you have personally used to boot a machine
> into your cluster.  Similarly, the names of any models that should be
> avoided would also be useful information.

we have had many problems with the Intel NICs that come embedded on the 
supermicro twin boards when paired with nortel switches. Setting the MTU 
to anything other than 1500 results in cards coming up in a strange 
state 50% of the time. Also, PXE has been a problem with certain 
versions of the driver not to mention the extra confusion over e1000 and 
e1000e.

These report themselves as:

Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet 
Controller (Copper) (rev 01)

GREG

> 
> Thanks,
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> 


-- 
Greg Matthews            01235 778658
Senior Computer Systems Administrator
Diamond Light Source, Oxfordshire, UK


From andrew.robbie at gmail.com  Wed Mar 17 06:06:28 2010
From: andrew.robbie at gmail.com (Andrew Robbie (Gmail))
Date: Thu, 18 Mar 2010 00:06:28 +1100
Subject: [Beowulf] 1000baseT NIC and PXE?
In-Reply-To: <E1NrHBt-000758-0L@mendel.bio.caltech.edu>
References: <E1NrHBt-000758-0L@mendel.bio.caltech.edu>
Message-ID: <92BA00FE-E346-4ABA-82B5-D96F4E8D4F55@gmail.com>

On 16/03/2010, at 7:47 AM, David Mathog wrote:

> Sorry if this is a silly question, but do any of the inexpensive
> 1000baseT NICs support PXE boot?

Can I suggest you consult the rom-o-matic database maintained by the  
etherboot/gPXE project? The raw list is at:
http://rom-o-matic.net/gpxe/gpxe-git/gpxe.git/src/bin/NIC

For example, it shows that the DGE-530T is supported with the skge  
driver.

It can be problematic mapping brand names/part numbers to chipsets  
though. If you can buy one to test that helps... Do you mean PCI,  
PCI-64, PCI-X or PCIe cards? Standard PCI is really too slow for GigE.  
Can you say which your motherboard supports? It is a good idea to make  
sure the NIC can work at 3.3v or 5v for flexability.

Regards,
Andrew


From dnlombar at ichips.intel.com  Wed Mar 17 09:24:25 2010
From: dnlombar at ichips.intel.com (David N. Lombard)
Date: Wed, 17 Mar 2010 09:24:25 -0700
Subject: [Beowulf] Q: IB message rate & large core counts (per node)?
In-Reply-To: <33623.192.168.1.213.1268677496.squirrel@mail.eadline.org>
References: <2043745298.8209961267330637770.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>
	<33623.192.168.1.213.1268677496.squirrel@mail.eadline.org>
Message-ID: <20100317162425.GA19667@nlxcldnl2.cl.intel.com>

On Mon, Mar 15, 2010 at 11:24:56AM -0700, Douglas Eadline wrote:
> I have placed a copy of Richard's table on ClusterMonkey
> in case you want an html view.
> 
>   http://www.clustermonkey.net//content/view/275/33/
> 
IBTA shows 20Gb/s for EDR:
<http://www.infinibandta.org/content/pages.php?pg=technology_overview>

-- 
David N. Lombard, Intel, Irvine, CA
I do not speak for Intel Corporation; all comments are strictly my own.


From Shainer at mellanox.com  Wed Mar 17 10:37:54 2010
From: Shainer at mellanox.com (Gilad Shainer)
Date: Wed, 17 Mar 2010 10:37:54 -0700
Subject: [Beowulf] Q: IB message rate & large core counts (per node)?
Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F01E155B9@mtiexch01.mti.com>

The EDR speed will be 25.78Gb/s per lane or 100Gb/s data rate for 4x port. It was not made public on the IBTA web site, probably will be updated in the comming days.

Gilad


----- Original Message -----
From: beowulf-bounces at beowulf.org <beowulf-bounces at beowulf.org>
To: Douglas Eadline <deadline at eadline.org>
Cc: richard.walsh at comcast.net <richard.walsh at comcast.net>; beowulf at beowulf.org <beowulf at beowulf.org>
Sent: Wed Mar 17 09:24:25 2010
Subject: Re: [Beowulf] Q: IB message rate & large core counts (per node)?

On Mon, Mar 15, 2010 at 11:24:56AM -0700, Douglas Eadline wrote:
> I have placed a copy of Richard's table on ClusterMonkey
> in case you want an html view.
> 
>   http://www.clustermonkey.net//content/view/275/33/
> 
IBTA shows 20Gb/s for EDR:
<http://www.infinibandta.org/content/pages.php?pg=technology_overview>

-- 
David N. Lombard, Intel, Irvine, CA
I do not speak for Intel Corporation; all comments are strictly my own.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20100317/404f63c8/attachment.html>

From mathog at caltech.edu  Wed Mar 17 12:55:42 2010
From: mathog at caltech.edu (David Mathog)
Date: Wed, 17 Mar 2010 12:55:42 -0700
Subject: [Beowulf] Re: 1000baseT NIC and PXE? 
Message-ID: <E1NrzLC-00007G-LR@mendel.bio.caltech.edu>

Andrew Robbie wrote:
> Can I suggest you consult the rom-o-matic database maintained by the  
> etherboot/gPXE project? The raw list is at:
> http://rom-o-matic.net/gpxe/gpxe-git/gpxe.git/src/bin/NIC
> 
> For example, it shows that the DGE-530T is supported with the skge  
> driver.

I did visit that site, but the documentation was not exactly "make PXE
work when it didn't before" newbie friendly.  It looked very helpful if
one already knows what it means!  

Yesterday I called a bunch of tech support lines for the various cheap
NIC manufacturers and was that ever a miserable experience.  D-link and
netsys both sent me to Indian phone support hell, in one instance so bad
the questioner got into a loop on her script and started asking the same
questions again, at which point I threw an exception and hung up.  For
those cards which had an empty socket to hold an EEPROM, nobody could
(or would) tell me what type of chip to put in there, or where to get
the software to load on it.  The most honest answer of the day was (not
an exact quote) "we just make the things, ask Realtek if it can do
that".  (Still waiting for Realtek to reply.)

Gus Correa sent me the simplest solution - PXE boot using the existing
100baseT on the mobo and use the new gigabit card once the system comes up.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From carsten.aulbert at aei.mpg.de  Wed Mar 17 13:07:39 2010
From: carsten.aulbert at aei.mpg.de (Carsten Aulbert)
Date: Wed, 17 Mar 2010 21:07:39 +0100
Subject: [Beowulf] HPL as a learning experience
In-Reply-To: <4B9FB917.1070905@ldeo.columbia.edu>
References: <201003161627.32012.carsten.aulbert@aei.mpg.de>
	<4B9FB917.1070905@ldeo.columbia.edu>
Message-ID: <201003172107.47989.carsten.aulbert@aei.mpg.de>

Hi

On Tuesday 16 March 2010 18:00:07 Gus Correa wrote:
> The problem is most likely mpich 1.2.7.
> MPICH-1 is old and no longer maintained.
> It is based on the P4 lower level libraries, which don't
> seem to talk properly to current Linux kernels and/or
> to current Ethernet card drivers.
[...]
> The easy fix is to use another MPI, say, OpenMPI or MPICH2.
> I would guess they are available as packages for Debian.
> 

You were right, switching over to openmpi solved this issue at once. Sorry for 
not trying that before causing noise here :)

Now, the hard part might begin to narrow down the parameter space...

Cheers

Carsten


From landman at scalableinformatics.com  Wed Mar 17 15:54:29 2010
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed, 17 Mar 2010 18:54:29 -0400
Subject: [Beowulf] Re: 1000baseT NIC and PXE? 
In-Reply-To: <E1NrzLC-00007G-LR@mendel.bio.caltech.edu>
References: <E1NrzLC-00007G-LR@mendel.bio.caltech.edu>
Message-ID: <0C96F3A2-F1D4-4281-A7AC-BFB9B0765418@scalableinformatics.com>

Why not use a USB stick with gpxe.USB?

This would provide the greatest flexibility.

Please pardon brevity and typos ... Sent from my iPhone

On Mar 17, 2010, at 3:55 PM, "David Mathog" <mathog at caltech.edu> wrote:

> Andrew Robbie wrote:
>> Can I suggest you consult the rom-o-matic database maintained by the
>> etherboot/gPXE project? The raw list is at:
>> http://rom-o-matic.net/gpxe/gpxe-git/gpxe.git/src/bin/NIC
>>
>> For example, it shows that the DGE-530T is supported with the skge
>> driver.
>
> I did visit that site, but the documentation was not exactly "make PXE
> work when it didn't before" newbie friendly.  It looked very helpful  
> if
> one already knows what it means!
>
> Yesterday I called a bunch of tech support lines for the various cheap
> NIC manufacturers and was that ever a miserable experience.  D-link  
> and
> netsys both sent me to Indian phone support hell, in one instance so  
> bad
> the questioner got into a loop on her script and started asking the  
> same
> questions again, at which point I threw an exception and hung up.  For
> those cards which had an empty socket to hold an EEPROM, nobody could
> (or would) tell me what type of chip to put in there, or where to get
> the software to load on it.  The most honest answer of the day was  
> (not
> an exact quote) "we just make the things, ask Realtek if it can do
> that".  (Still waiting for Realtek to reply.)
>
> Gus Correa sent me the simplest solution - PXE boot using the existing
> 100baseT on the mobo and use the new gigabit card once the system  
> comes up.
>
> Regards,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin  
> Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


From mathog at caltech.edu  Wed Mar 17 16:11:26 2010
From: mathog at caltech.edu (David Mathog)
Date: Wed, 17 Mar 2010 16:11:26 -0700
Subject: [Beowulf] Re: 1000baseT NIC and PXE? 
Message-ID: <E1Ns2Oc-0000BG-79@mendel.bio.caltech.edu>

Joe Landman wrote:

> Why not use a USB stick with gpxe.USB?
> 
> This would provide the greatest flexibility.

Not sure if these machines will boot from a USB key, never tried it. 
They are old enough that they might not.  If it works, then yes, this
would be a good option.

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From mathog at caltech.edu  Thu Mar 18 14:14:04 2010
From: mathog at caltech.edu (David Mathog)
Date: Thu, 18 Mar 2010 14:14:04 -0700
Subject: [Beowulf] Re: 1000baseT NIC and PXE? 
Message-ID: <E1NsN2a-0000ax-3m@mendel.bio.caltech.edu>

Joe Landman wrote:
> Why not use a USB stick with gpxe.USB?

Gave gPXE.USB a try using the motherboard's 100baseT NIC.  gPXE started
after the right BIOS settings were entered and the USB showed up in the
boot list. Unfortunately it used MAC "ad:ad:ad:ad:00:00" instead of the
actual hardware MAC "00:e0:81:22:cc:3d", so DHCP wasn't set up to send
it anything other than an address.  ^B to get into the gPXE command
line, but it wasn't accepting or echoing keyboard input.  (Possibly
terminal output was going out the serial port, anyway, it seemed to be
locked up at the command line.)

How does one make gPXE use the MAC it finds on the NIC instead of 
ad:ad:ad:ad:00:00?  The nodes on this system are not interchangeable,
node1 has data that node2 doesn't, and so forth.  The cluster is soon to
become heterogeneous. So the master does need to know who and what it is
responding to. 

If there are multiple nics how is gPXE configured to use a particular
one?  (If they have different hardware I guess just include that one
driver, but what if there are two the same?)

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From david.ritch at gmail.com  Thu Mar 18 05:57:39 2010
From: david.ritch at gmail.com (David B. Ritch)
Date: Thu, 18 Mar 2010 08:57:39 -0400
Subject: [Beowulf] PXE Booting and Interface Bonding
Message-ID: <4BA22343.8030802@gmail.com>

What is the best practice for the use of multiple NICs on cluster nodes?

I've found that when I enable Etherchannel bonding in out network
equipment, PXE booting does not work.  It breaks the initial DHCP
discover request, presumably because the response may not return to the
same NIC.  However, bonding the interfaces is a clear win for node
availability and for performance.

The best solution that I have is to turn off port channels in our
network equipment, and use Linux kernel bonding, in balance-alb mode. 
This provides adaptive load balancing by ARP manipulation.

Thanks in advance!

David


From ebiederm at xmission.com  Sat Mar 20 12:20:49 2010
From: ebiederm at xmission.com (Eric W. Biederman)
Date: Sat, 20 Mar 2010 12:20:49 -0700
Subject: [Beowulf] PXE Booting and Interface Bonding
In-Reply-To: <4BA22343.8030802@gmail.com> (David B. Ritch's message of "Thu\,
	18 Mar 2010 08\:57\:39 -0400")
References: <4BA22343.8030802@gmail.com>
Message-ID: <m1fx3u94vi.fsf@fess.ebiederm.org>

"David B. Ritch" <david.ritch at gmail.com> writes:

> What is the best practice for the use of multiple NICs on cluster nodes?
>
> I've found that when I enable Etherchannel bonding in out network
> equipment, PXE booting does not work.  It breaks the initial DHCP
> discover request, presumably because the response may not return to the
> same NIC.  However, bonding the interfaces is a clear win for node
> availability and for performance.
>
> The best solution that I have is to turn off port channels in our
> network equipment, and use Linux kernel bonding, in balance-alb mode. 
> This provides adaptive load balancing by ARP manipulation.

A few years ago I added 802.3ad LAG support to etherboot now gpxe.

When used it negotiates tells the LAG that there is only one member.
You might want to try that. 

Eric


From ebiederm at xmission.com  Sat Mar 20 16:06:08 2010
From: ebiederm at xmission.com (Eric W. Biederman)
Date: Sat, 20 Mar 2010 16:06:08 -0700
Subject: [Beowulf] PXE Booting and Interface Bonding
In-Reply-To: <4BA53056.6010507@gmail.com> (David B. Ritch's message of "Sat\,
	20 Mar 2010 16\:30\:14 -0400")
References: <4BA22343.8030802@gmail.com> <m1fx3u94vi.fsf@fess.ebiederm.org>
	<4BA53056.6010507@gmail.com>
Message-ID: <m18w9m8ufz.fsf@fess.ebiederm.org>

"David B. Ritch" <david.ritch.lists at gmail.com> writes:

> Eric,
>
> Thank you - that sounds like a good idea.  However, I'm not sure that
> we'll have an opportunity to replace the bootloader on our
> motherboards.  I'd love to see that become standard!
>
> Is gPXE widely used?  How else would one approach this?

Yes.  At least I have seen it as the boot firmware on several
10Gig nics.

You can also get gPXE to boot off of just about anything, so
you should be able to at least try it out.

Eric


From gdjacobs at gmail.com  Sat Mar 20 23:57:51 2010
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Sun, 21 Mar 2010 01:57:51 -0500
Subject: [Beowulf] Re: Third-party drives not permitted on new Dell
	servers?
In-Reply-To: <ECE7A93BD093E1439C20020FBE87C47FED2B93E464@ALTPHYEMBEVSP20.RES.AD.JPL>
References: <4B7AD159.50000@scalableinformatics.com>	<C7A01B42.E21E%James.P.Lux@jpl.nasa.gov>
	<hlenln$1s2$1@ger.gmane.org>
	<ECE7A93BD093E1439C20020FBE87C47FED2B93E464@ALTPHYEMBEVSP20.RES.AD.JPL>
Message-ID: <4BA5C36F.9010609@gmail.com>

Lux, Jim (337C) wrote:
> 
> James Lux, P.E. Task Manager, SOMD Software Defined Radios Flight 
> Communications Systems Section Jet Propulsion Laboratory 4800 Oak 
> Grove Drive, Mail Stop 161-213 Pasadena, CA, 91109 +1(818)354-2075 
> phone +1(818)393-6875 fax
> 
>> -----Original Message----- From: beowulf-bounces at beowulf.org 
>> [mailto:beowulf-bounces at beowulf.org] On Behalf Of Doug O'Neal Sent:
>>  Tuesday, February 16, 2010 10:21 AM To: beowulf at beowulf.org 
>> Subject: [Beowulf] Re: Third-party drives not permitted on new Dell
>>  servers?
>> 
>> On 02/16/2010 12:52 PM, Lux, Jim (337C) wrote:
>>> 
>>> On 2/16/10 9:09 AM, "Joe Landman" 
>>> <landman at scalableinformatics.com> wrote:
>>>> 5X markup?  We must be doing something wrong :/
>>>> 
>>> 
>>> Depends on what the price includes.  I could easily see a 
>>> commodity drive in a case lot being dropped on the loading dock 
>>> at, say, $100 each, and the drive with installation, system 
>>> integrator testing, downstream support, etc. being $500. Doesn't 
>>> take many hours on the phone tracking down an idiosyncracy or 
>>> setup to cost $500 in labor.
>> But when you're installing anywhere from eight to forty-eight 
>> drives in a single system the required hours to make up that 
>> $400/drive overhead does get larger.  And if you spread the system 
>> integrator testing over eight drives per unit and hundreds to 
>> thousands of units the cost per drive shouldn't be measured in 
>> hundreds of dollars.
>> 
> 
> True, IFF the costing strategy is based on that sort of approach. 
> Various companies can and do price the NRE and support tail cost in a
>  variety of ways.   They might have a "notional" system size and base
>  the pricing model on that: Say they, through research, find that
> most customers are buying, say, 32 systems at a crack.  Now the
> support tail (which is basically "per system") is spread across only
> 32 drives, not thousands. If you happen to buy 64 systems, then you 
> basically are paying twice.   Most companies don't have infinite 
> granularity in this sort of thing, and try to pick a few breakpoints 
> that make sense.

But in this case, they're selling not 32 controllers, or whatever.
They're selling thousands or tens of thousands of controllers and tens
or hundreds of thousands of drives across the entire product line. Do
they qualify drives per system, or across the line (perhaps per
controller model)?

> (NRE = non recurring engineering) As far as the NRE goes, say they 
> get a batch of a dozen drives each of half a dozen kinds. They have 
> to set up half a dozen test systems (either in parallel or 
> sequentially), run the tests on all of them, and wind up with maybe 2
>  or 3 leading candidates that they decide to list on their "approved 
> disk" list.  The cost of testing the disks that didn't make the cut 
> has to be added to the cost of the disks that did.
>
> There's a lot that goes into pricing that isn't obvious at first 
> glance, or even second glance, especially if you're looking at a 
> single instance (your own purchase) and trying to work backwards from
>  there.  There are weird anomalies that crop up in supposedly 
> commodity items from things like fuel prices (e.g. you happened to 
> buy that container load of disks when fuel prices were high, so 
> shipping cost more). A couple years ago, there were huge fluctuations
>  in the price of copper, so there would be 2:1 differences in the 
> retail cost of copper wire and tubing at the local Home Depot and 
> Lowes, basically depending on when they happened to have bought the 
> stuff wholesale. (this is the kind of thing that arbitrageurs look 
> for, of course)

Logically, one would not see consistency in the markup in such a case.
Nor would the tier one vendors be consistently marked up at similar amounts.

> Some of it is "paying for convenience", too.  Rather than do all the 
> testing yourself, or writing a detailed requirements and procurement 
> document for a third party, both of which cost you some non-zero 
> amount of time and money, you just pay the increased price to a 
> vendor who's done it for you.  It's like eating sausage.  You can buy
>  already made sausage, and the sausage maker has done the 
> experimenting with seasoning and process controls to come out with 
> something that people like.  Or, you can spend the time to make it 
> yourself, potentially saving some money and getting a more customized
>  sausage taste, BUT, you're most likely going to have some 
> less-than-ideal sausage in the process.

They should be willing and able to convince people of the value of each
and every product they sell, and that includes justifying the
non-interoperability of their disk controllers with 3rd party HDDs.

> The more computers or sausage you're consuming, the more likely it is
>  that you could do better with a customized approach, but, even
> there, you may be faced with resource limits (e.g. you could spend
> your time getting a better deal on the disks or you could spend your
> time doing research with the disks.  Ultimately, the research MUST
> get done, so you have to trade off how much you're willing to spend.)
> 

Jim and Joe both are likely to have more of an idea of the realities
going on inside Dell than I. Michael Will likely does as well as some
others on list. However, it's up to Dell to justify their decisions to
those on list who have concerns of this nature either now  or when asked
to in the bidding process. Just like how Joe was able to explain one of
the subtle, relevant problems in system integration, all in one email!

It's as simple as that. They should be able to justify their position,
without sounding like they're high on Prozac.

Whenever there's a discussion on vendor markup, I always think on the
Audiophile scene. In particular:
http://www.usa.denon.com/ProductDetails/3429.asp
http://www.positive-feedback.com/Issue32/anjou.htm

I think I can speak on behalf of everyone that we do not want computer
hardware vendors to degrade to this level.

-- 
Geoffrey D. Jacobs


From hearnsj at googlemail.com  Sun Mar 21 01:52:59 2010
From: hearnsj at googlemail.com (John Hearns)
Date: Sun, 21 Mar 2010 08:52:59 +0000
Subject: [Beowulf] Third-party drives not permitted on new Dell servers?
In-Reply-To: <Pine.LNX.4.64.1002160045570.23371@coffee.psychology.mcmaster.ca>
References: <Pine.LNX.4.64.1002131332430.24032@coffee.psychology.mcmaster.ca>
	<web-2808025@free.net>
	<c4d69731002150951k23cda6fcs4796f0de96168792@mail.gmail.com>
	<4B79F7B4.9020808@scalableinformatics.com>
	<c4d69731002151812ka70daf6y8cc6f9f943d91df9@mail.gmail.com>
	<4B7A032C.2080207@scalableinformatics.com>
	<Pine.LNX.4.64.1002160045570.23371@coffee.psychology.mcmaster.ca>
Message-ID: <9f8092cc1003210152p2425a946s129e040aedc22e2b@mail.gmail.com>

On 16 February 2010 07:08, Mark Hahn <hahn at mcmaster.ca> wrote:

>
> I think the real paradigm shift is that disks have become a consumable
> which you want to be able to replace in 1-2 product generations (2-3 years).
> along with this, disks just aren't that important, individually - even
> something _huge_ like seagate's firmware problem, for instance, only drove
> up random failures, no?


You have just hit a very big nail on the head.

Let's think about current RAID arrays - you have to replace a drive
with the same type - take Fibrechannel arrays for instance - they have
different drive speeds, and sizes of course.  ot FC, but once when
doing support I replaced a SATA drive by one of the same size. But not
the same manufacturer - and it had just a couple of sectors less, so
was not accepted in as a spare drive.
We could go on - but the point being that once you select a storage
array you are bound into that type of disk.
I'm now still getting speedy and good service on a FC array which is
rather elderly - replacment drives have been on the shelf for years.

Anyway, Mark prompts me to think back to the IBM Storage Tank concept
- drive goes bad and it is popped out of a hatch like a vending
machine.

Remember, this is the Beowulf list and Beowulf is about applying COTS
technology. We're in the Web 2.0 age, with Google, Microsoft et. al.
deploying containerised data centres - and somehow I don't reckon they
keep all their data on some huge EMC fibrechannel array with a dual FC
fabric and a live mirror to another lockstep duplicate array in
another building, via dark fibre, with endless discussions on going to
8Gbit FC  (yadda yadda, you get the point).

As Mark says - storage is storage. It should be bought by the pallet
load, and deployed like Lego bricks.


From james.p.lux at jpl.nasa.gov  Sun Mar 21 08:20:51 2010
From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C))
Date: Sun, 21 Mar 2010 08:20:51 -0700
Subject: [Beowulf] Re: Third-party drives not permitted on new Dell
	servers?
In-Reply-To: <4BA5C36F.9010609@gmail.com>
Message-ID: <C7CB8763.F2F6%James.P.Lux@jpl.nasa.gov>


On 3/20/10 11:57 PM, "Geoff Jacobs" <gdjacobs at gmail.com> wrote:

> 
> Whenever there's a discussion on vendor markup, I always think on the
> Audiophile scene. In particular:
> http://www.usa.denon.com/ProductDetails/3429.asp
> 
> 
?Additionally, signal directional markings are provided for optimum signal
transfer.?
You mean you don?t carefully align your ethernet cables so that they are
oriented in the direction of predominant data flow?  I'm sure all the hot
stuff cluster folks take a look at the data flow among nodes before each job
and have the techs go reorient the cables.  Or provide dual cables and
paths, at the very least.

> 
> 
> http://www.positive-feedback.com/Issue32/anjou.htm
> 
> I think I can speak on behalf of everyone that we do not want computer
> hardware vendors to degrade to this level.

Aughhh! And I just spent the last year in my garage perfecting my Beophile
(tm pending<grin>) interconnect cables.  I have them carefully aligned to
magnetic north to cure, waiting for the residual stresses to decay.

> 
> --
> Geoffrey D. Jacobs
> 


From james.p.lux at jpl.nasa.gov  Sun Mar 21 08:41:55 2010
From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C))
Date: Sun, 21 Mar 2010 08:41:55 -0700
Subject: [Beowulf] Third-party drives not permitted on new Dell servers?
In-Reply-To: <9f8092cc1003210152p2425a946s129e040aedc22e2b@mail.gmail.com>
Message-ID: <C7CB8C53.F2F8%James.P.Lux@jpl.nasa.gov>


On 3/21/10 1:52 AM, "John Hearns" <hearnsj at googlemail.com> wrote:

> On 16 February 2010 07:08, Mark Hahn <hahn at mcmaster.ca> wrote:
> 
>> 
>> I think the real paradigm shift is that disks have become a consumable
>> which you want to be able to replace in 1-2 product generations (2-3 years).
>> along with this, disks just aren't that important, individually - even
>> something _huge_ like seagate's firmware problem, for instance, only drove
>> up random failures, no?
> 
> 
> 
> You have just hit a very big nail on the head.
> 
> Remember, this is the Beowulf list and Beowulf is about applying COTS
> technology.

To a certain extent Beowulfery has strayed from its roots.  Originally, it
was:Hey, I can do supercomputing that's competitive (or almost as good) as
the big iron with cheap consumer gear.

But now it has succeeded to the point that it's the dominant way of
supercomputing, and the emphasis is on optimizing performance, almost to the
point of CDC carefully trimming the wire lengths on the fast big vector
machines.  Clusters these days leverage commodity, but the nodes tend to be
more specialized, compared the run of the mill desktops that we started
with.

Would you see anything like the StoneSouperComputer today?
Talk about a heterogenous cluster.


 We're in the Web 2.0 age, with Google, Microsoft et. al.
> deploying containerised data centres - and somehow I don't reckon they
> keep all their data on some huge EMC fibrechannel array with a dual FC
> fabric and a live mirror to another lockstep duplicate array in
> another building, via dark fibre, with endless discussions on going to
> 8Gbit FC  (yadda yadda, you get the point).
> 
> As Mark says - storage is storage. It should be bought by the pallet
> load, and deployed like Lego bricks.


Yes. And the true direction of classic Beowulfery should be to deal with the
non-ideal/heterogenous nature of this approach.


Jim


From david.ritch.lists at gmail.com  Sat Mar 20 13:30:14 2010
From: david.ritch.lists at gmail.com (David B. Ritch)
Date: Sat, 20 Mar 2010 16:30:14 -0400
Subject: [Beowulf] PXE Booting and Interface Bonding
In-Reply-To: <m1fx3u94vi.fsf@fess.ebiederm.org>
References: <4BA22343.8030802@gmail.com> <m1fx3u94vi.fsf@fess.ebiederm.org>
Message-ID: <4BA53056.6010507@gmail.com>

Eric,

Thank you - that sounds like a good idea.  However, I'm not sure that
we'll have an opportunity to replace the bootloader on our
motherboards.  I'd love to see that become standard!

Is gPXE widely used?  How else would one approach this?

Thanks!

dbr

On 3/20/2010 3:20 PM, Eric W. Biederman wrote:
> "David B. Ritch" <david.ritch at gmail.com> writes:
>
>   
>> What is the best practice for the use of multiple NICs on cluster nodes?
>>
>> I've found that when I enable Etherchannel bonding in out network
>> equipment, PXE booting does not work.  It breaks the initial DHCP
>> discover request, presumably because the response may not return to the
>> same NIC.  However, bonding the interfaces is a clear win for node
>> availability and for performance.
>>
>> The best solution that I have is to turn off port channels in our
>> network equipment, and use Linux kernel bonding, in balance-alb mode. 
>> This provides adaptive load balancing by ARP manipulation.
>>     
> A few years ago I added 802.3ad LAG support to etherboot now gpxe.
>
> When used it negotiates tells the LAG that there is only one member.
> You might want to try that. 
>
> Eric
>
>
>   


From thakur at mcs.anl.gov  Wed Mar 24 06:46:06 2010
From: thakur at mcs.anl.gov (Rajeev Thakur)
Date: Wed, 24 Mar 2010 08:46:06 -0500
Subject: [Beowulf] [hpc-announce] FW: EuroMPI 2010 Call for Papers (extended
	deadline April 19th)
Message-ID: <5874958B3C0140A6BEBA541CDC14ED06@thakurlaptop>

 
-----Original Message-----
From: Rolf Rabenseifner
Sent: Monday, March 22, 2010 11:05 AM
To: thakur at mcs.anl.gov
Subject: EuroMPI 2010 Call for Papers (extended deadline)


 Please excuse the cross-posting.

 Due to many requests, we have postponed the submission deadline by 2
weeks.

------------------------------------------------------------------------
--------

               CALL FOR PAPERS -- Extension of Deadlines

               17th European MPI Users' Group Meeting
                        (EuroMPI 2010)
                 http://www.eurompi2010.org

            Stuttgart, Germany, September 12th-15th 2010

            Extended submission deadline: April 19th 2010

------------------------------------------------------------------------
--------

MPI (Message Passing Interface) has evolved into the standard interfaces
for
high-performance parallel programming in the message-passing paradigm.
EuroMPI
is the most prominent meeting dedicated to the latest developments of
MPI, its
use, including support tools, and implementation, and to applications
using
these interfaces.

The 17th European MPI Users' Group Meeting will be a forum for users and
developers of MPI and other message-passing programming environments.
Through
the presentation of contributed papers, poster presentations and invited
talks,
attendees will have the opportunity to share ideas and experiences to
contribute
to the improvement and furthering of message-passing and related
parallel
programming paradigms.

Topics of interest for the meeting include, but are not limited to:
 - MPI implementation issues and improvements
 - Latest extensions to MPI
 - MPI for high-performance computing, clusters and grid environments
 - New message-passing and hybrid parallel programming paradigms
 - Interaction between message-passing software and hardware
 - Fault tolerance in message-passing programs
 - Performance evaluation of MPI applications
 - Tools and environments for MPI
 - Algorithms using the message-passing paradigm
 - Applications in science and engineering based on message-passing

Submissions on applications demonstrating both the potential and
shortcomings of
message passing programming and specifically MPI are particularly
welcome.

SUBMISSION INFORMATION
Contributors are invited to submit a full paper as a PDF document not
exceeding 8 pages in English (2 pages for poster abstracts). The title
page should contain an abstract of at most 100 words and five specific
keywords. The conference proceedings consisting of abstracts of invited
talks, full papers, and two page abstracts for the posters will be
published by Springer in the LNCS series. Papers need to be formatted
according to the Springer LNCS guidelines. The usage of LaTeX for
preparation of the contribution as well as the submission in camera
ready format is strongly recommended. Style files can be found at
http://www.springer.de/comp/lncs/authors.html . Papers are to be
submitted electronically via the online submission system at
http://www.easychair.org/conferences/?conf=eurompi2010
Submissions to the ParSim2010 session are handled and reviewed by the
respective session chairs. For more information please refer to the
ParSim website http://www.lrr.in.tum.de/~trinitic/parsim10/ .
All accepted submissions are expected to be presented at the conference
by one of the authors, which requires registration for the conference. 
 
IMPORTANT DATES
  EuroMPI Conference         September 12-15th, 2010
  Submission of full papers  April 19th, 2010  (extended deadline)
  Notification of authors    May 17th, 2010
  Camera ready papers        June 10th, 2010

As in the previous years, the special session 'ParSim' will focus on
numerical
simulation for parallel engineering environments. EuroMPI 2010 will also
hold
the 'Outstanding Papers' session, where the best papers selected by the
program
committee will be presented.

For further Information please see the conference website:
  http://www.eurompi2010.org


General Chair:
  Jack Dongarra, University of Tennessee, USA

Program Chair:
  Michael Resch, HLRS, University of Stuttgart, Germany

Program Co-Chairs:
  Rainer Keller, ORNL, USA and HLRS, Germany
  Edgar Gabriel, University of Houston, USA 

Program Committee:
  Richard Barrett, Oak Ridge National Laboratory, USA
  Gil Bloch, Mellanox, Israel
  George Bosilca, University of Tennessee, USA
  Ron Brightwell, Sandia National Laboratories, New Mexico
  Franck Cappello, University of Illinois, USA / INRIA, France
  Barbara Chapman, University of Houston, USA
  Yiannis Cotronis, University of Athens
  Erik D.'Hollander, Ghent University, Belgium
  Jean-Christophe Desplat, ICHEC, Ireland
  Frederic Desprez, INRIA, France
  Jack Dongarra, University of Tennessee, USA
  Edgar Gabriel, University of Houston, USA
  Javier Garcia-Blas, Universidad Carlos III de Madrid, Spain
  Al Geist, Oak Ridge National Laboratory, USA
  Michael Gerndt, Technical University Muenchen, Germany
  Ganesh Gopalakrishnan, University of Utah, USA
  Sergei Gorlatch, University of Muenster, Germany
  Andrzej Goscinski, Deakin University, Australia
  Richard L. Graham, Oak Ridge National Laboratory, USA
  William Gropp, University of Illinois Urbana-Champaign, USA
  Thomas Herault, INRIA/LRI, France
  Torsten Hoefler, Indiana University, USA
  Josh Hursey, Indiana University, USA
  Yutaka Ishikawa, University of Tokyo, Japan
  Tahar Kechadi, University College Dublin, Ireland
  Rainer Keller, Oak Ridge National Laboratory, USA
  Stefan Lankes, RWTH Aachen, Germany
  Jesper Larsson-Traeff, University of Vienna, Austria
  Alexey Lastovetsky, University College Dublin, Ireland
  Andrew Lumsdaine, Indiana University, USA
  Ewing Rusty Lusk, Argonne National Laboratory, USA
  Thomas Margalef, Universitat Autonoma de Barcelona, Spain
  Jean-Francois Mehaut, IMAG, France
  Bernd Mohr, Forschungszentrum Juelich, Germany
  Raymond Namyst, University of Bordeaux, France
  Rolf Rabenseifner, HLRS, University of Stuttgart, Germany
  Michael Resch, HLRS, University of Stuttgart, Germany
  Casiano Rodriguez-Leon, Universidad de la Laguna, Spain
  Robert Ross, Argonne National Laboratory, USA
  Martin Schulz, Lawrence Livermore National Laboratory, USA
  Stephen F. Siegel, University of Delaware, USA
  Jeffrey Squyres, Cisco, Inc., USA
  Bronis R. de Supinski, Lawrence Livermore National Laboratory, USA
  Rajeev Thakur, Argonne National Laboratory, USA
  Carsten Trinitis, Technische Universitaet Muenchen, Germany
  Jan Westerholm, Abo Akademi University, Finland
  Roland Wismueller, Universitaet Siegen, Germany
  Joachim Worringen, International Algorithmic Trading GmbH, Germany


From hugo.hernandez at nih.gov  Fri Mar 26 09:32:05 2010
From: hugo.hernandez at nih.gov (Hernandez, Hugo (NIH/NIAID) [C])
Date: Fri, 26 Mar 2010 12:32:05 -0400
Subject: [Beowulf] Problems installing HPL 2.0
Message-ID: <C7D259C5.6274%hugo.hernandez@nih.gov>

Hello there,
Can somebody help me on a problem I am experiencing when trying to install HPL 2.0 in our system?  The error comes as the HPL_dlamch.c isn?t working because it can?t find the hpl.h include file.   The file already exists in $(TOPdir)/include/hpl.h. Do I am missing something?

I have added the directory /myApps/hpl-2.0/include into my LD_LIBRARY_PATH without any result.

Could you please let me know how to work on this problem?  All help will be really appreciated!

Many thanks,
-Hugo

My system configuration:
RHEL 5.4
Linux myMachine.com 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
openmpi-1.3.2-2.el5
/usr/lib64/openmpi/1.3.2-gcc/bin/mpicc
/usr/lib64/openmpi/1.3.2-gcc/bin/mpif90
Linear Algebra library: GotoBLAS2

Here is my make.arch file:

SHELL = /bin/sh
#
CD = cd
CP = cp
LN_S = ln -s
MKDIR = mkdir
RM = /bin/rm -f
TOUCH = touch
#
ARCH = Linux_x86_64
# - HPL Directory Structure / HPL library ------------------------------
TOPdir = /niaidAdmin/apps/hpl-2.0
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
#
HPLlib = $(LIBdir)/libhpl.a
# - Message Passing library (MPI) --------------------------------------
MPdir = /usr
MPinc =-I$(MPdir)/include/openmpi
MPlib = $(MPdir)/lib64/openmpi/libmpi.so
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
LAdir = /niaidAdmin/apps/GotoBLAS2
LAinc =-I$(MPdir)/include
LAlib = $(LAdir)/libblas.so.3 $(LAdir)/atlas/libblas.so.3
#
F2CDEFS =
#
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib)
#
HPL_OPTS = -DHPL_CALL_CBLAS
# ----------------------------------------------------------------------
HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
# - Compilers / linkers - Optimization flags ---------------------------
CC = /usr/lib64/openmpi/1.3.2-gcc/bin/mpicc
CCNOOPT =
CCFLAGS = $(HPL_DEFS) -pipe -O3 -funroll-loops
#
LINKER = /usr/lib64/openmpi/1.3.2-gcc/bin/mpif90
LINKFLAGS = $(CCFLAGS)
#
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo

And here is the error message:


[root at test hpl-2.0]# make arch=niaid
make -f Make.top startup_dir     arch=niaid
make[1]: Entering directory `/niaidAdmin/apps/hpl-2.0'
mkdir include/niaid
mkdir: cannot create directory `include/niaid': File exists
make[1]: [startup_dir] Error 1 (ignored)
mkdir lib
mkdir: cannot create directory `lib': File exists
make[1]: [startup_dir] Error 1 (ignored)
mkdir lib/niaid
mkdir: cannot create directory `lib/niaid': File exists
make[1]: [startup_dir] Error 1 (ignored)
mkdir bin
mkdir: cannot create directory `bin': File exists
make[1]: [startup_dir] Error 1 (ignored)
mkdir bin/niaid
mkdir: cannot create directory `bin/niaid': File exists
make[1]: [startup_dir] Error 1 (ignored)
make[1]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make -f Make.top startup_src     arch=niaid
make[1]: Entering directory `/niaidAdmin/apps/hpl-2.0'
make -f Make.top leaf le=src/auxil       arch=niaid
make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0'
( cd src/auxil ; mkdir niaid )
mkdir: cannot create directory `niaid': File exists
make[2]: [leaf] Error 1 (ignored)
( cd src/auxil/niaid ; \
            ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc )
ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists
make[2]: [leaf] Error 1 (ignored)
make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make -f Make.top leaf le=src/blas        arch=niaid
make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0'
( cd src/blas ; mkdir niaid )
mkdir: cannot create directory `niaid': File exists
make[2]: [leaf] Error 1 (ignored)
( cd src/blas/niaid ; \
            ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc )
ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists
make[2]: [leaf] Error 1 (ignored)
make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make -f Make.top leaf le=src/comm        arch=niaid
make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0'
( cd src/comm ; mkdir niaid )
mkdir: cannot create directory `niaid': File exists
make[2]: [leaf] Error 1 (ignored)
( cd src/comm/niaid ; \
            ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc )
ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists
make[2]: [leaf] Error 1 (ignored)
make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make -f Make.top leaf le=src/grid        arch=niaid
make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0'
( cd src/grid ; mkdir niaid )
mkdir: cannot create directory `niaid': File exists
make[2]: [leaf] Error 1 (ignored)
( cd src/grid/niaid ; \
            ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc )
ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists
make[2]: [leaf] Error 1 (ignored)
make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make -f Make.top leaf le=src/panel       arch=niaid
make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0'
( cd src/panel ; mkdir niaid )
mkdir: cannot create directory `niaid': File exists
make[2]: [leaf] Error 1 (ignored)
( cd src/panel/niaid ; \
            ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc )
ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists
make[2]: [leaf] Error 1 (ignored)
make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make -f Make.top leaf le=src/pauxil      arch=niaid
make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0'
( cd src/pauxil ; mkdir niaid )
mkdir: cannot create directory `niaid': File exists
make[2]: [leaf] Error 1 (ignored)
( cd src/pauxil/niaid ; \
            ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc )
ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists
make[2]: [leaf] Error 1 (ignored)
make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make -f Make.top leaf le=src/pfact       arch=niaid
make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0'
( cd src/pfact ; mkdir niaid )
mkdir: cannot create directory `niaid': File exists
make[2]: [leaf] Error 1 (ignored)
( cd src/pfact/niaid ; \
            ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc )
ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists
make[2]: [leaf] Error 1 (ignored)
make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make -f Make.top leaf le=src/pgesv       arch=niaid
make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0'
( cd src/pgesv ; mkdir niaid )
mkdir: cannot create directory `niaid': File exists
make[2]: [leaf] Error 1 (ignored)
( cd src/pgesv/niaid ; \
            ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc )
ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists
make[2]: [leaf] Error 1 (ignored)
make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make[1]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make -f Make.top startup_tst     arch=niaid
make[1]: Entering directory `/niaidAdmin/apps/hpl-2.0'
make -f Make.top leaf le=testing/matgen  arch=niaid
make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0'
( cd testing/matgen ; mkdir niaid )
mkdir: cannot create directory `niaid': File exists
make[2]: [leaf] Error 1 (ignored)
( cd testing/matgen/niaid ; \
            ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc )
ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists
make[2]: [leaf] Error 1 (ignored)
make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make -f Make.top leaf le=testing/timer   arch=niaid
make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0'
( cd testing/timer ; mkdir niaid )
mkdir: cannot create directory `niaid': File exists
make[2]: [leaf] Error 1 (ignored)
( cd testing/timer/niaid ; \
            ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc )
ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists
make[2]: [leaf] Error 1 (ignored)
make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make -f Make.top leaf le=testing/pmatgen arch=niaid
make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0'
( cd testing/pmatgen ; mkdir niaid )
mkdir: cannot create directory `niaid': File exists
make[2]: [leaf] Error 1 (ignored)
( cd testing/pmatgen/niaid ; \
            ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc )
ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists
make[2]: [leaf] Error 1 (ignored)
make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make -f Make.top leaf le=testing/ptimer  arch=niaid
make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0'
( cd testing/ptimer ; mkdir niaid )
mkdir: cannot create directory `niaid': File exists
make[2]: [leaf] Error 1 (ignored)
( cd testing/ptimer/niaid ; \
            ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc )
ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists
make[2]: [leaf] Error 1 (ignored)
make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make -f Make.top leaf le=testing/ptest   arch=niaid
make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0'
( cd testing/ptest ; mkdir niaid )
mkdir: cannot create directory `niaid': File exists
make[2]: [leaf] Error 1 (ignored)
( cd testing/ptest/niaid ; \
            ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc )
ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists
make[2]: [leaf] Error 1 (ignored)
make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make[1]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make -f Make.top refresh_src     arch=niaid
make[1]: Entering directory `/niaidAdmin/apps/hpl-2.0'
cp makes/Make.auxil    src/auxil/niaid/Makefile
cp makes/Make.blas     src/blas/niaid/Makefile
cp makes/Make.comm     src/comm/niaid/Makefile
cp makes/Make.grid     src/grid/niaid/Makefile
cp makes/Make.panel    src/panel/niaid/Makefile
cp makes/Make.pauxil   src/pauxil/niaid/Makefile
cp makes/Make.pfact    src/pfact/niaid/Makefile
cp makes/Make.pgesv    src/pgesv/niaid/Makefile
make[1]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make -f Make.top refresh_tst     arch=niaid
make[1]: Entering directory `/niaidAdmin/apps/hpl-2.0'
cp makes/Make.matgen   testing/matgen/niaid/Makefile
cp makes/Make.timer    testing/timer/niaid/Makefile
cp makes/Make.pmatgen  testing/pmatgen/niaid/Makefile
cp makes/Make.ptimer   testing/ptimer/niaid/Makefile
cp makes/Make.ptest    testing/ptest/niaid/Makefile
make[1]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make -f Make.top refresh_src     arch=niaid
make[1]: Entering directory `/niaidAdmin/apps/hpl-2.0'
cp makes/Make.auxil    src/auxil/niaid/Makefile
cp makes/Make.blas     src/blas/niaid/Makefile
cp makes/Make.comm     src/comm/niaid/Makefile
cp makes/Make.grid     src/grid/niaid/Makefile
cp makes/Make.panel    src/panel/niaid/Makefile
cp makes/Make.pauxil   src/pauxil/niaid/Makefile
cp makes/Make.pfact    src/pfact/niaid/Makefile
cp makes/Make.pgesv    src/pgesv/niaid/Makefile
make[1]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make -f Make.top refresh_tst     arch=niaid
make[1]: Entering directory `/niaidAdmin/apps/hpl-2.0'
cp makes/Make.matgen   testing/matgen/niaid/Makefile
cp makes/Make.timer    testing/timer/niaid/Makefile
cp makes/Make.pmatgen  testing/pmatgen/niaid/Makefile
cp makes/Make.ptimer   testing/ptimer/niaid/Makefile
cp makes/Make.ptest    testing/ptest/niaid/Makefile
make[1]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make -f Make.top build_src       arch=niaid
make[1]: Entering directory `/niaidAdmin/apps/hpl-2.0'
( cd src/auxil/niaid;         make )
make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0/src/auxil/niaid'
/usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_dlacpy.o -c  -DHPL_CALL_CBLAS -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include/Linux_x86_64 -I/usr/include -I/usr/include/openmpi -pipe -O3 -funroll-loops  ../HPL_dlacpy.c
/usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_dlatcpy.o -c  -DHPL_CALL_CBLAS -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include/Linux_x86_64 -I/usr/include -I/usr/include/openmpi -pipe -O3 -funroll-loops  ../HPL_dlatcpy.c
/usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_fprintf.o -c  -DHPL_CALL_CBLAS -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include/Linux_x86_64 -I/usr/include -I/usr/include/openmpi -pipe -O3 -funroll-loops  ../HPL_fprintf.c
/usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_warn.o -c  -DHPL_CALL_CBLAS -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include/Linux_x86_64 -I/usr/include -I/usr/include/openmpi -pipe -O3 -funroll-loops  ../HPL_warn.c
/usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_abort.o -c  -DHPL_CALL_CBLAS -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include/Linux_x86_64 -I/usr/include -I/usr/include/openmpi -pipe -O3 -funroll-loops  ../HPL_abort.c
/usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_dlaprnt.o -c  -DHPL_CALL_CBLAS -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include/Linux_x86_64 -I/usr/include -I/usr/include/openmpi -pipe -O3 -funroll-loops  ../HPL_dlaprnt.c
/usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_dlange.o -c  -DHPL_CALL_CBLAS -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include/Linux_x86_64 -I/usr/include -I/usr/include/openmpi -pipe -O3 -funroll-loops  ../HPL_dlange.c
/usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_dlamch.o -c   ../HPL_dlamch.c
../HPL_dlamch.c:50:17: error: hpl.h: No such file or directory
../HPL_dlamch.c:57: error: expected ?=?, ?,?, ?;?, ?asm? or ?__attribute__? before ?STDC_ARGS?
../HPL_dlamch.c:60: error: expected ?=?, ?,?, ?;?, ?asm? or ?__attribute__? before ?STDC_ARGS?
../HPL_dlamch.c:64: error: expected ?=?, ?,?, ?;?, ?asm? or ?__attribute__? before ?STDC_ARGS?
../HPL_dlamch.c:67: error: expected ?=?, ?,?, ?;?, ?asm? or ?__attribute__? before ?STDC_ARGS?
../HPL_dlamch.c:70: error: expected ?=?, ?,?, ?;?, ?asm? or ?__attribute__? before ?STDC_ARGS?
../HPL_dlamch.c:74: error: expected ?=?, ?,?, ?;?, ?asm? or ?__attribute__? before ?STDC_ARGS?
../HPL_dlamch.c: In function ?HPL_dlamch?:
../HPL_dlamch.c:85: error: expected ?=?, ?,?, ?;?, ?asm? or ?__attribute__? before ?CMACH?
../HPL_dlamch.c:164: error: ?HPL_rone? undeclared (first use in this function)
../HPL_dlamch.c:164: error: (Each undeclared identifier is reported only once
../HPL_dlamch.c:164: error: for each function it appears in.)
../HPL_dlamch.c:164: error: ?HPL_rtwo? undeclared (first use in this function)
../HPL_dlamch.c:166: error: ?HPL_rzero? undeclared (first use in this function)
../HPL_dlamch.c:176: error: ?HPL_MACH_EPS? undeclared (first use in this function)
../HPL_dlamch.c:177: error: ?HPL_MACH_SFMIN? undeclared (first use in this function)
../HPL_dlamch.c:178: error: ?HPL_MACH_BASE? undeclared (first use in this function)
../HPL_dlamch.c:179: error: ?HPL_MACH_PREC? undeclared (first use in this function)
../HPL_dlamch.c:180: error: ?HPL_MACH_MLEN? undeclared (first use in this function)
../HPL_dlamch.c:181: error: ?HPL_MACH_RND? undeclared (first use in this function)
../HPL_dlamch.c:182: error: ?HPL_MACH_EMIN? undeclared (first use in this function)
../HPL_dlamch.c:183: error: ?HPL_MACH_RMIN? undeclared (first use in this function)
../HPL_dlamch.c:184: error: ?HPL_MACH_EMAX? undeclared (first use in this function)
../HPL_dlamch.c:185: error: ?HPL_MACH_RMAX? undeclared (first use in this function)
../HPL_dlamch.c: In function ?HPL_dlamc1?:
../HPL_dlamch.c:262: error: ?HPL_rone? undeclared (first use in this function)
../HPL_dlamch.c:274: error: ?HPL_rtwo? undeclared (first use in this function)
../HPL_dlamch.c: At top level:
../HPL_dlamch.c:345: warning: conflicting types for ?HPL_dlamc2?
../HPL_dlamch.c:345: error: static declaration of ?HPL_dlamc2? follows non-static declaration
../HPL_dlamch.c:161: error: previous implicit declaration of ?HPL_dlamc2? was here
../HPL_dlamch.c: In function ?HPL_dlamc2?:
../HPL_dlamch.c:416: error: ?HPL_rzero? undeclared (first use in this function)
../HPL_dlamch.c:416: error: ?HPL_rone? undeclared (first use in this function)
../HPL_dlamch.c:416: error: ?HPL_rtwo? undeclared (first use in this function)
../HPL_dlamch.c:545: error: ?stderr? undeclared (first use in this function)
../HPL_dlamch.c: At top level:
../HPL_dlamch.c:583: error: conflicting types for ?HPL_dlamc3?
../HPL_dlamch.c:274: error: previous implicit declaration of ?HPL_dlamc3? was here
../HPL_dlamch.c:626: warning: conflicting types for ?HPL_dlamc4?
../HPL_dlamch.c:626: error: static declaration of ?HPL_dlamc4? follows non-static declaration
../HPL_dlamch.c:464: error: previous implicit declaration of ?HPL_dlamc4? was here
../HPL_dlamch.c: In function ?HPL_dlamc4?:
../HPL_dlamch.c:667: error: ?HPL_rone? undeclared (first use in this function)
../HPL_dlamch.c:668: error: ?HPL_rzero? undeclared (first use in this function)
../HPL_dlamch.c: At top level:
../HPL_dlamch.c:698: warning: conflicting types for ?HPL_dlamc5?
../HPL_dlamch.c:698: error: static declaration of ?HPL_dlamc5? follows non-static declaration
../HPL_dlamch.c:570: error: previous implicit declaration of ?HPL_dlamc5? was here
../HPL_dlamch.c: In function ?HPL_dlamc5?:
../HPL_dlamch.c:748: error: ?HPL_rzero? undeclared (first use in this function)
../HPL_dlamch.c:812: error: ?HPL_rone? undeclared (first use in this function)
../HPL_dlamch.c: At top level:
../HPL_dlamch.c:842: error: conflicting types for ?HPL_dipow?
../HPL_dlamch.c:164: error: previous implicit declaration of ?HPL_dipow? was here
../HPL_dlamch.c: In function ?HPL_dipow?:
../HPL_dlamch.c:866: error: ?HPL_rone? undeclared (first use in this function)
../HPL_dlamch.c:871: error: ?HPL_rzero? undeclared (first use in this function)
make[2]: *** [HPL_dlamch.o] Error 1
make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0/src/auxil/niaid'
make[1]: *** [build_src] Error 2
make[1]: Leaving directory `/niaidAdmin/apps/hpl-2.0'
make: *** [build] Error 2


--
Hugo R. Hernandez, Contractor
Dell Perot Systems
Sr. Systems Administrator
Mac & Linux Server Team, OCICB/OEB
National Institutes of Health
National Institute of Allergy & Infectious Diseases
10401 Fernwood Drive
Fernwood West - Rm. 2009
Bethesda, MD 20817

Phone: 301-841-4203
Cell: 240-479-1888
Fax: 301-480-0784
www.dell.com/perotsystems

--
"Si seus esfor?os, foram vistos com indefren?a, n?o desanime, que o sol faze un espectacolo maravilhoso todas as manh?s cuando a maior parte das pessoas, ainda estam durmindo"

- An?nimo brasileiro

Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives.


From hugo.hernandez at nih.gov  Fri Mar 26 10:31:08 2010
From: hugo.hernandez at nih.gov (Hernandez, Hugo (NIH/NIAID) [C])
Date: Fri, 26 Mar 2010 13:31:08 -0400
Subject: [Beowulf] Problems installing HPL 2.0
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0FDB66F2@milexchmb1.mil.tagmclarengroup.com>
Message-ID: <C7D2679C.627E%hugo.hernandez@nih.gov>

Hello John,
Thanks for your answer.   I have set correctly my TOPdir and INCdir.   I did
change /myApps for /niaidAdmin/apps but I continue having the same problem.
I will try to use your suggestion on last line (...leave it as '.').
-Hugo

# - HPL Directory Structure / HPL library ------------------------------
TOPdir = /niaidAdmin/apps/hpl-2.0
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
#
HPLlib = $(LIBdir)/libhpl.a
# - Message Passing library (MPI) --------------------------------------
MPdir = /usr
MPinc =-I$(MPdir)/include/openmpi
MPlib = $(MPdir)/lib64/openmpi/libmpi.so
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
LAdir = /niaidAdmin/apps/GotoBLAS2
LAinc =-I$(MPdir)/include
LAlib = $(LAdir)/libblas.so.3 $(LAdir)/atlas/libblas.so.3
#
F2CDEFS =
#
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)


On 3/26/10 1:21 PM, "Hearns, John" <john.hearns at mclaren.com> wrote:

> Ahem.
>  In your makefile  TOPdir is set to TOPdir = /niaidAdmin/apps/hpl-2.0
> 
> then INCdir is $(TOPdir)/include
> 
> You need to set TOPdir explicitly as /myApps/hpl-2.0
> Or even better I would leave it as '.'
> 
> 
> The contents of this email are confidential and for the exclusive use of the
> intended recipient.  If you receive this email in error you should not copy
> it, retransmit it, use it or disclose its contents but should return it to the
> sender immediately and delete your copy.

--
Hugo R. Hernandez, Contractor
Dell Perot Systems
Sr. Systems Administrator
Mac & Linux Server Team, OCICB/OEB
National Institutes of Health
National Institute of Allergy & Infectious Diseases
10401 Fernwood Drive
Fernwood West - Rm. 2009
Bethesda, MD 20817

Phone: 301-841-4203
Cell: 240-479-1888
Fax: 301-480-0784
www.dell.com/perotsystems
 
--
"Si seus esfor?os, foram vistos com indefren?a, n?o desanime, que o sol faze
un espectacolo maravilhoso todas as manh?s cuando a maior parte das pessoas,
ainda estam durmindo"

- An?nimo brasileiro

Disclaimer: The information in this e-mail and any of its attachments is
confidential and may contain sensitive information. It should not be used by
anyone who is not the original intended recipient. If you have received this
e-mail in error please inform the sender and delete it from your mailbox or
any other storage devices. National Institute of Allergy and Infectious
Diseases shall not accept liability for any statements made that are
sender's own and not expressly made on behalf of the NIAID by one of its
representatives.


From mathog at caltech.edu  Fri Mar 26 15:48:50 2010
From: mathog at caltech.edu (David Mathog)
Date: Fri, 26 Mar 2010 15:48:50 -0700
Subject: [Beowulf] mysterious slow SATA on one machine
Message-ID: <E1NvIKg-0003hb-7p@mendel.bio.caltech.edu>

I'm hoping somebody has seen this before and can suggest what might be
going on.  

One machine (Arima HDAMA-I board, dual Opteron 280, 4GB RAM,
Sil 3114 Sata controller, Sil 5.4.03 firmware) has mysteriously slow
SATA IO.  This is the case for two different disks (WD10EARS and
ST340014AS), two different disk schedulers, and two different OS's
(Mandriva 2010.0 and PLD 2.97 rescue linux.)  Using a different brand of
cable, and plugging into a different SATA port didn't help either. 
However, move those disks to another machine (Asus A8N5X, Nvidia CK804
SATA controller, single core, 1 GB RAM, Knoppix) and they are both much
faster.  Raw results from various experiments here:

  http://saf.bio.caltech.edu/pub/pickup/bonnie++.rtf
  http://saf.bio.caltech.edu/pub/pickup/sustained_write.rtf

For the sustained write test both disks on the slow system take about
102s to write 4GB to disk, or around 41.3GB/s.  That isn't horrible
horrible, but it isn't great either.  On the faster machine the WD10EARS
does the job in 39 seconds, and even the old Seagate is done in 74s.  It
strikes me that something must be rate limiting both disks to about the
same throughput.  The Sil 3114 chip is somehow interfaced through the
PCI bus, but even if that is only 33MHz it is still 4 bytes wide and
should be able to handle around 132 MB/s, 3X what I'm seeing.  All of
the PCI and PCI-X slots are unoccupied. I have no previous experience
with the Sil 3114 or the Arima board, so don't know if this is typical
for either.

Perhaps the oddest part of this is that during these tests the disk
light on the slow system blinks but is often off for long periods. 
Conversely, on the faster system the disk light stays on pretty
steadily.  As if on the slower system it is doing something else when it
should be doing disk IO.

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From coutinho at dcc.ufmg.br  Fri Mar 26 17:30:08 2010
From: coutinho at dcc.ufmg.br (Bruno Coutinho)
Date: Fri, 26 Mar 2010 21:30:08 -0300
Subject: [Beowulf] mysterious slow SATA on one machine
In-Reply-To: <E1NvIKg-0003hb-7p@mendel.bio.caltech.edu>
References: <E1NvIKg-0003hb-7p@mendel.bio.caltech.edu>
Message-ID: <a8d96dec1003261730v7ad3e322gc6e6f60eefd40e84@mail.gmail.com>

AFAIK all your disks and the nvidia CK804 supprt NCQ but the sil 3114 don't.
This could explain the lower drive throughput undel sil 3114 controller.


2010/3/26 David Mathog <mathog at caltech.edu>

> I'm hoping somebody has seen this before and can suggest what might be
> going on.
>
> One machine (Arima HDAMA-I board, dual Opteron 280, 4GB RAM,
> Sil 3114 Sata controller, Sil 5.4.03 firmware) has mysteriously slow
> SATA IO.  This is the case for two different disks (WD10EARS and
> ST340014AS), two different disk schedulers, and two different OS's
> (Mandriva 2010.0 and PLD 2.97 rescue linux.)  Using a different brand of
> cable, and plugging into a different SATA port didn't help either.
> However, move those disks to another machine (Asus A8N5X, Nvidia CK804
> SATA controller, single core, 1 GB RAM, Knoppix) and they are both much
> faster.  Raw results from various experiments here:
>
>  http://saf.bio.caltech.edu/pub/pickup/bonnie++.rtf
>  http://saf.bio.caltech.edu/pub/pickup/sustained_write.rtf
>
> For the sustained write test both disks on the slow system take about
> 102s to write 4GB to disk, or around 41.3GB/s.  That isn't horrible
> horrible, but it isn't great either.  On the faster machine the WD10EARS
> does the job in 39 seconds, and even the old Seagate is done in 74s.  It
> strikes me that something must be rate limiting both disks to about the
> same throughput.  The Sil 3114 chip is somehow interfaced through the
> PCI bus, but even if that is only 33MHz it is still 4 bytes wide and
> should be able to handle around 132 MB/s, 3X what I'm seeing.  All of
> the PCI and PCI-X slots are unoccupied. I have no previous experience
> with the Sil 3114 or the Arima board, so don't know if this is typical
> for either.
>
> Perhaps the oddest part of this is that during these tests the disk
> light on the slow system blinks but is often off for long periods.
> Conversely, on the faster system the disk light stays on pretty
> steadily.  As if on the slower system it is doing something else when it
> should be doing disk IO.
>
> Thanks,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20100326/c0e3cbff/attachment.html>

From gdjacobs at gmail.com  Fri Mar 26 18:18:25 2010
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Fri, 26 Mar 2010 20:18:25 -0500
Subject: [Beowulf] mysterious slow SATA on one machine
In-Reply-To: <E1NvIKg-0003hb-7p@mendel.bio.caltech.edu>
References: <E1NvIKg-0003hb-7p@mendel.bio.caltech.edu>
Message-ID: <4BAD5CE1.9030802@gmail.com>

David Mathog wrote:
> I'm hoping somebody has seen this before and can suggest what might be
> going on.  
> 
> One machine (Arima HDAMA-I board, dual Opteron 280, 4GB RAM,
> Sil 3114 Sata controller, Sil 5.4.03 firmware) has mysteriously slow
> SATA IO.  This is the case for two different disks (WD10EARS and
> ST340014AS), two different disk schedulers, and two different OS's
> (Mandriva 2010.0 and PLD 2.97 rescue linux.)  Using a different brand of
> cable, and plugging into a different SATA port didn't help either. 
> However, move those disks to another machine (Asus A8N5X, Nvidia CK804
> SATA controller, single core, 1 GB RAM, Knoppix) and they are both much
> faster.  Raw results from various experiments here:
> 
>   http://saf.bio.caltech.edu/pub/pickup/bonnie++.rtf
>   http://saf.bio.caltech.edu/pub/pickup/sustained_write.rtf
> 
> For the sustained write test both disks on the slow system take about
> 102s to write 4GB to disk, or around 41.3GB/s.  That isn't horrible
> horrible, but it isn't great either.  On the faster machine the WD10EARS
> does the job in 39 seconds, and even the old Seagate is done in 74s.  It
> strikes me that something must be rate limiting both disks to about the
> same throughput.  The Sil 3114 chip is somehow interfaced through the
> PCI bus, but even if that is only 33MHz it is still 4 bytes wide and
> should be able to handle around 132 MB/s, 3X what I'm seeing.  All of
> the PCI and PCI-X slots are unoccupied. I have no previous experience
> with the Sil 3114 or the Arima board, so don't know if this is typical
> for either.
> 
> Perhaps the oddest part of this is that during these tests the disk
> light on the slow system blinks but is often off for long periods. 
> Conversely, on the faster system the disk light stays on pretty
> steadily.  As if on the slower system it is doing something else when it
> should be doing disk IO.

As mentioned to David in a separate post, I see similar (worse)
performance deltas using an S-I controller. I see the same delta using
sata_sil driving an ATI SB4xx south bridge.

It might be kernel related, as escalated here:
https://bugzilla.redhat.com/show_bug.cgi?id=502499

-- 
Geoffrey D. Jacobs


From hahn at mcmaster.ca  Fri Mar 26 19:40:06 2010
From: hahn at mcmaster.ca (Mark Hahn)
Date: Fri, 26 Mar 2010 22:40:06 -0400 (EDT)
Subject: [Beowulf] Problems installing HPL 2.0
In-Reply-To: <C7D259C5.6274%hugo.hernandez@nih.gov>
References: <C7D259C5.6274%hugo.hernandez@nih.gov>
Message-ID: <Pine.LNX.4.64.1003262233230.20825@coffee.psychology.mcmaster.ca>

> Can somebody help me on a problem I am experiencing when trying to install
>HPL 2.0 in our system?  The error comes as the HPL_dlamch.c isn?t working
>because it can?t find the hpl.h include file.   The file already exists in
>$(TOPdir)/include/hpl.h. Do I am missing something?

the makefile isn't passing the -I when compiling that file:

> /usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_dlange.o -c  -DHPL_CALL_CBLAS -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include/Linux_x86_64 -I/usr/include -I/usr/include/openmpi -pipe -O3 -funroll-loops  ../HPL_dlange.c
> /usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_dlamch.o -c   ../HPL_dlamch.c
> ../HPL_dlamch.c:50:17: error: hpl.h: No such file or directory

everything between the -DHPL_CALL... and -funroll-loops is missing.
you need to look at the makefile in src/auxil/niaid...

> I have added the directory /myApps/hpl-2.0/include into my LD_LIBRARY_PATH
> without any result.

thank goodness!  ;)  that would make no sense, since LD_LIBRARY_PATH is 
strictly a runtime/library thing, nothing to do with compile-time/headers.

> Disclaimer: The information in this e-mail and any of its attachments is
>confidential and may contain sensitive information. It should not be used by
>anyone who is not the original intended recipient. If you have received this
>e-mail in error please inform the sender and delete it from your mailbox or
>any other storage devices. National Institute of Allergy and Infectious
>Diseases shall not accept liability for any statements made that are
>sender's own and not expressly made on behalf of the NIAID by one of its
>representatives.

I assume you know that this boilerplate is completely meaningless...

regards, mark hahn.

From mathog at caltech.edu  Mon Mar 29 10:55:29 2010
From: mathog at caltech.edu (David Mathog)
Date: Mon, 29 Mar 2010 10:55:29 -0700
Subject: [Beowulf] mysterious slow SATA on one machine
Message-ID: <E1NwJBR-0004X6-NA@mendel.bio.caltech.edu>

> David Mathog wrote:
>Raw results from various experiments here:
> 
>   http://saf.bio.caltech.edu/pub/pickup/bonnie++.rtf
>   http://saf.bio.caltech.edu/pub/pickup/sustained_write.rtf
> 

Some progress, see updated files above.

lspci showed there were two devices on the bus where the Sil 3114 was
located, it and the ATI Rage VGA controller.  It also showed the pci
latency for the former was 32 and the latter 66.  

The VGA controller is currently running in text mode, without the atyfb
module loaded, and with nothing happening on it (it just has the text
login prompt dislayed), and changing its latency up or down makes no
difference to disk speed (not shown in the files cited above).  The
machine was started once with the VGA jumpered off, but it didn't boot,
so it wasn't possible to completely remove it from the equation.  

However, increasing the pci_latency on the Sil 3114 as far as it would
go (to 144 as shown in lspci)  with 

 setpci -s '01:07.0' latency_timer=99

made a considerable difference.  For the older disk it brought it up to
approximately the same speed as on the nvidia ck804 controller.  For the
WD10EARS it sped things up about 50%, but didn't manage to match the
nvidia controller, possibly because of the absence of the ncq mentioned
previously in this thread.  Also on the 3114 it topped out at around
106MB/sec for the fastest bonnie++ applications, and that is pretty
close to the 132MB/s limit on the 32 bit PCI bus, whereas on the ck804
peaks were 140MB/s, which would be more than the bus can carry.  
Assuming the PCI on the Arima board is really 33 MHz, like the manual
says, and not 66 MHz, as lspci reports.

Vibration may still be playing some role, but apparently that wasn't the
primary problem.

I may still put in a PCI-X SATA controller though, as there is still
another 40% performance to go on the WD drive, and that will provide
enough bus bandwidth to support that.

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From cousins at umit.maine.edu  Mon Mar 29 12:51:42 2010
From: cousins at umit.maine.edu (Steve Cousins)
Date: Mon, 29 Mar 2010 15:51:42 -0400 (EDT)
Subject: [Beowulf] HP 10 GbE card use/warranty
Message-ID: <alpine.LFD.2.00.1003291431280.14085@razzo.umeoce.maine.edu>


I have a couple of 10 GbE cards from HP (NC510F, NetXen/QLogic) and I have 
been trying to get them to work in non-HP Linux systems. After failing to 
be able to comile/install the nx_nic drivers on Fedora 9 and 12 (it checks 
to see if kernel >= 2.6.27 and if so looks for net/8021q/vlan.h but can't 
find it) I installed one of the supported distributions (or close enough): 
CentOS 5. Driver installation went fine and I was able to get one of the 
cards to work. The other one seems to be bad. The nx_nic driver loads but 
no eth2 device shows up. Also, the Activity LED is constant but no Link 
LED lights up. Since I can't get the eth device loaded I can't update the 
firmware.

So, I have one good one and one bad one. Pretty clear-cut. I've tried to 
get technical support from HP to RMA it and I end up in no-mans land. It 
seems that they do not have support for peripherals like this. They 
consider this part of a system and the only systems they seem to think 
these go in are their Proliant servers so I end up at Proliant tech 
support. But I have no serial number for a Proliant server so it is a dead 
end.

I tried "Customer Satisfaction" and got none.

They insist that these cards are *only* for HP Proliant servers but I have 
not seen any indication of this at the sites I go to to buy this type of 
card. I bought this from a reseller when we bought a bunch of Procurve 
equipment with some 10 GbE modules for a switch. The reseller is seeing 
what he can do but in the mean time I thought I'd check here to see if 
anyone has run into this sort of thing from HP. I've always had good luck 
with HP but it has mainly been from the Procurve section and they won't 
touch this either.

Anyone running HP cards in non-HP equipment? Any tips on getting an RMA 
for this? I'm thinking of trying to track down an HP server on campus to 
work with just to get the RMA but I doubt if there are too many of these 
that aren't in use and still under warranty.

Thanks,

Steve
______________________________________________________________________
  Steve Cousins - Supercomputer Engineer/Administrator - Univ of Maine
  Marine Sciences, 452 Aubert Hall       Target Tech, 20 Godfrey Drive
  Orono, ME 04469    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~     Orono, ME 04473
  (207) 581-4302     ~ cousins at umit.maine.edu ~      (207) 866-6552


From chekh at pcbi.upenn.edu  Mon Mar 29 13:28:53 2010
From: chekh at pcbi.upenn.edu (Alex Chekholko)
Date: Mon, 29 Mar 2010 16:28:53 -0400
Subject: [Beowulf] HP 10 GbE card use/warranty
In-Reply-To: <alpine.LFD.2.00.1003291431280.14085@razzo.umeoce.maine.edu>
References: <alpine.LFD.2.00.1003291431280.14085@razzo.umeoce.maine.edu>
Message-ID: <20100329162853.aa18be0b.chekh@pcbi.upenn.edu>

Hi Steve,

I can report that I had the same problem.  I have a bunch of NC510c
cards in IBM X3650 servers.  One broke, with the symptom that it would
just hang any machine I put it into.  Luckily (or with great
foresight), I had previously ordered a spare to have on hand.  At one
point, the HP tech on the line was able to find the serial number in
the HP system somewhere but only confirm that my card was out of
warranty (it was manufactured long before we purchased it).  In the
end, our VAR was able to convince their distributor to mail me a new
working card, free.  Took a couple of weeks.

Here, I can even copy in our e-mail chain, somewhat sanitized:


*****

From: helpful VAR Jeff
To: Alex Chekholko <chekh at pcbi.upenn.edu>
Subject: RE: FW: bad network card
Date: Thu, 23 Oct 2008 08:46:25 -0400

Good morning Alex,

We finally got the distributor to ship you a new card - it should be
shipping today.  Sorry for all the trouble.

________________________________________
From: Alex Chekholko [chekh at pcbi.upenn.edu]
Sent: Monday, October 20, 2008 12:15 PM
To: helpful VAR Jeff
Cc: manager at genomics.upenn.edu
Subject: Re: FW: bad network card

Hi Jeff,

Tech support line that I used is 1-800-474-6836

The last time I spoke to HP on Oct 8th, I ended up in the queue
Tech Support -> Networking -> ProCurve

This time I spoke to HP, I ended up in the queue
Tech Support -> Servers -> Proliant

They couldn't find any record of that card anywhere in their system and
said I have to get a replacement through the reseller the card was
purchased from and recommended I call 1-888-943-8476 (turned out to be
a "customer satisfaction" line).

I called back again and went to
Tech Support -> Networking -> ProCurve

They initially couldn't find it in the system, until I recollected that
they called it a "server adapter" last time; that allowed them to find
it, and then they dumped me back to the initial menu.  Apparently, the
serial number doesn't help, and the part number doesn't show up in the
system (for the ProCurve folks).

Then I called the "customer satisfaction" line, who suggested the HP
Parts Store (1-800-227-8164 option 2) and transferred me there.  They
said that Tech Support is the only place that could authorize warranty
replacement, and were unable to look up the order numbers listed below,
but connected me to "HW Support Orders" department (1-800-525-7104) who
then forwarded me to the front menu of Tech Support.

Total time, 2hrs.

Good luck.

On the UPenn side, the PO number is 2010394.

Regards,
Alex

On Mon, 13 Oct 2008 17:33:53 -0400
helpful VAR wrote:

> Alex,
>
> Per note below can you try HP support one more time?  Sorry for the trouble! Thanks.
>
> -----Original Message-----
> From: distributor "arrow" On Behalf Of ISS Team
> Sent: Monday, October 13, 2008 5:18 PM
> To: helpful VAR Jeff
> Cc: other arrow folks
> Subject: RE: bad network card
>
> Hi Jeff.  We came up with, we hope, information that will prove to HP
> that the card is under warranty.  It was ordered through Synnex and the
> order numbers are 24927045 and 26715052.  They shipped from Synnex on
> March 20, 2008.
>
> Part# 414129-B21 (they ordered a qty of 3)
> HP NC510C PCIE 10 GIGABIT SERVER ADAPTER
> Reseller Info:
> helpful VAR...
>
> End-User Info:
>
> our address
> PHILADELPHIA, PA 19104
>
> Please have the end user call HP and give them this information.  Please
> get back with me if HP still won't replace it.
>
> Thanks.
>
>

*****
 
On Mon, 29 Mar 2010 15:51:42 -0400 (EDT)
Steve Cousins <cousins at umit.maine.edu> wrote:

> 
> I have a couple of 10 GbE cards from HP (NC510F, NetXen/QLogic) and I have 
> been trying to get them to work in non-HP Linux systems. After failing to 
> be able to comile/install the nx_nic drivers on Fedora 9 and 12 (it checks 
> to see if kernel >= 2.6.27 and if so looks for net/8021q/vlan.h but can't 
> find it) I installed one of the supported distributions (or close enough): 
> CentOS 5. Driver installation went fine and I was able to get one of the 
> cards to work. The other one seems to be bad. The nx_nic driver loads but 
> no eth2 device shows up. Also, the Activity LED is constant but no Link 
> LED lights up. Since I can't get the eth device loaded I can't update the 
> firmware.
> 
> So, I have one good one and one bad one. Pretty clear-cut. I've tried to 
> get technical support from HP to RMA it and I end up in no-mans land. It 
> seems that they do not have support for peripherals like this. They 
> consider this part of a system and the only systems they seem to think 
> these go in are their Proliant servers so I end up at Proliant tech 
> support. But I have no serial number for a Proliant server so it is a dead 
> end.
> 
> I tried "Customer Satisfaction" and got none.
>
> They insist that these cards are *only* for HP Proliant servers but I have 
> not seen any indication of this at the sites I go to to buy this type of 
> card. I bought this from a reseller when we bought a bunch of Procurve 
> equipment with some 10 GbE modules for a switch. The reseller is seeing 
> what he can do but in the mean time I thought I'd check here to see if 
> anyone has run into this sort of thing from HP. I've always had good luck 
> with HP but it has mainly been from the Procurve section and they won't 
> touch this either.
> 
> Anyone running HP cards in non-HP equipment? Any tips on getting an RMA 
> for this? I'm thinking of trying to track down an HP server on campus to 
> work with just to get the RMA but I doubt if there are too many of these 
> that aren't in use and still under warranty.
> 
> Thanks,
> 
> Steve
> ______________________________________________________________________
>   Steve Cousins - Supercomputer Engineer/Administrator - Univ of Maine
>   Marine Sciences, 452 Aubert Hall       Target Tech, 20 Godfrey Drive
>   Orono, ME 04469    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~     Orono, ME 04473
>   (207) 581-4302     ~ cousins at umit.maine.edu ~      (207) 866-6552
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


-- 
Alex Chekholko   chekh at pcbi.upenn.edu   


From mathog at caltech.edu  Mon Mar 29 16:33:18 2010
From: mathog at caltech.edu (David Mathog)
Date: Mon, 29 Mar 2010 16:33:18 -0700
Subject: [Beowulf] mysterious slow SATA on one machine
Message-ID: <E1NwOSM-0004gs-Fk@mendel.bio.caltech.edu>

> > David Mathog wrote:
> >Raw results from various experiments here:
> > 
> >   http://saf.bio.caltech.edu/pub/pickup/bonnie++.rtf
> >   http://saf.bio.caltech.edu/pub/pickup/sustained_write.rtf
> > 
> 
> Some progress, see updated files above.

And a step back...

With the latency set to 22 on the VGA, and 144 on the 
Sil 3114, three consecutive boots varying (and this is probably a red
herring) only the type of partition 5 (swap partition, which is the
first logical partition in the one extended partition, following the
first partition, which is both real and the boot partition)

Boot    Type    bonnie++ (-Per Chr- --Block-- -Rewrite- 
                   -Per Chr- --Block-- --Seeks-- line)
A       83      48104  91 65832  32 
                33587  13 48852  89 106061  16 219.9   1
B       82      47572  89 47176  18
                19601   6 35147  79 59726   8 188.2   1
C       83      48662  92 65424  26
                32901  12 48648  89 105560  15 214.9   1

Conversely, the sustained write test was about the same for all 3 boots,
although slightly (5%) faster for A,C than B.   Run the bonnie++ test
over and over during each uptime and the results come out more or less
the same.  Reboot, and they changed. Could the partition type really
matter?  Change it back, reboot, giving D:

D        82     48188  90 63984  27
                33609  13 47391  89 106234  16 217.9   1

So no, the partition type isn't the story, or at least not the whole
story. Something else must be going on. 

If feels like there is a bit somewhere that is blowing in the wind...

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From ebiederm at xmission.com  Tue Mar 30 00:00:51 2010
From: ebiederm at xmission.com (Eric W. Biederman)
Date: Tue, 30 Mar 2010 00:00:51 -0700
Subject: [Beowulf] HP 10 GbE card use/warranty
In-Reply-To: <alpine.LFD.2.00.1003291431280.14085@razzo.umeoce.maine.edu>
	(Steve Cousins's message of "Mon\,
	29 Mar 2010 15\:51\:42 -0400 \(EDT\)")
References: <alpine.LFD.2.00.1003291431280.14085@razzo.umeoce.maine.edu>
Message-ID: <m1aatqthss.fsf@fess.ebiederm.org>

Steve Cousins <cousins at umit.maine.edu> writes:

> I have a couple of 10 GbE cards from HP (NC510F, NetXen/QLogic) and I have been
> trying to get them to work in non-HP Linux systems. After failing to be able to
> comile/install the nx_nic drivers on Fedora 9 and 12 (it checks to see if kernel
>>= 2.6.27 and if so looks for net/8021q/vlan.h but can't find it)

I suspect that header was simply not packaged in the appropriate rpm.
You might be able to get away with commenting out that include.

> I installed
> one of the supported distributions (or close enough): CentOS 5. Driver
> installation went fine and I was able to get one of the cards to work. The other
> one seems to be bad. The nx_nic driver loads but no eth2 device shows up. Also,
> the Activity LED is constant but no Link LED lights up. Since I can't get the
> eth device loaded I can't update the firmware.

Does fedora not build the in kernel driver?  They should.

Except for occasionally having to flash the firmware up to the latest
image I have had good luck with the netxen nics.  I don't know anything
about the your weird purchase/support situation.

Eric


From cousins at umit.maine.edu  Tue Mar 30 06:24:48 2010
From: cousins at umit.maine.edu (Steve Cousins)
Date: Tue, 30 Mar 2010 09:24:48 -0400 (EDT)
Subject: [Beowulf] HP 10 GbE card use/warranty
In-Reply-To: <fc.004c4d194207fd5b3b9aca001e9c6136.4207fd5c@umit.maine.edu>
References: <alpine.LFD.2.00.1003291431280.14085@razzo.umeoce.maine.edu>
	<fc.004c4d194207fd5b3b9aca001e9c6136.4207fd5c@umit.maine.edu>
Message-ID: <alpine.LFD.2.00.1003300919490.20058@razzo.umeoce.maine.edu>

On Tue, 30 Mar 2010, Eric W. Biederman wrote:

> Steve Cousins <cousins at umit.maine.edu> writes:
>
>>> I have a couple of 10 GbE cards from HP (NC510F, NetXen/QLogic) and I have been
>>> trying to get them to work in non-HP Linux systems. After failing to be able to
>>> comile/install the nx_nic drivers on Fedora 9 and 12 (it checks to see if kernel
>>>> = 2.6.27 and if so looks for net/8021q/vlan.h but can't find it)
>>
> I suspect that header was simply not packaged in the appropriate rpm.
> You might be able to get away with commenting out that include.

That was the first thing I tried. It lead to a path that got wider as I 
went.

>>> I installed
>>> one of the supported distributions (or close enough): CentOS 5. Driver
>>> installation went fine and I was able to get one of the cards to work. The other
>>> one seems to be bad. The nx_nic driver loads but no eth2 device shows up. Also,
>>> the Activity LED is constant but no Link LED lights up. Since I can't get the
>>> eth device loaded I can't update the firmware.
>>
> Does fedora not build the in kernel driver?  They should.


Yes it does. I got basic functionality with it but under large transfers 
it locks up. No lockups with the nx_nic driver.


> Except for occasionally having to flash the firmware up to the latest
> image I have had good luck with the netxen nics.  I don't know anything
> about the your weird purchase/support situation.


> Eric
>

______________________________________________________________________
  Steve Cousins - Supercomputer Engineer/Administrator - Univ of Maine
  Marine Sciences, 452 Aubert Hall       Target Tech, 20 Godfrey Drive
  Orono, ME 04469    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~     Orono, ME 04473
  (207) 581-4302     ~ cousins at umit.maine.edu ~      (207) 866-6552


From cousins at umit.maine.edu  Tue Mar 30 11:47:59 2010
From: cousins at umit.maine.edu (Steve Cousins)
Date: Tue, 30 Mar 2010 14:47:59 -0400
Subject: [Beowulf] HP 10 GbE card use/warranty
In-Reply-To: <20100329162853.aa18be0b.chekh@pcbi.upenn.edu>
References: <alpine.LFD.2.00.1003291431280.14085@razzo.umeoce.maine.edu>
	<20100329162853.aa18be0b.chekh@pcbi.upenn.edu>
Message-ID: <fc.004c4d19420af1ff3b9aca001e9c6136.420b261c@umit.maine.edu>

Hi Alex,

Thanks a lot.  Sounds very similar to what I've been doing. I too *luckily* ordered a spare thinking that some day we'd be able to use it in another system. 

Thanks to some people from the list I've got a lead on some help. Our VAR is still working on it too. I'll let you know either way. I hope this gets the word out that the HP NICs are definitely for HP equipment only. Maybe not functionally but as
far as warranty goes keep it in mind.

Steve 

Alex Chekholko <chekh at pcbi.upenn.edu> writes:
>Hi Steve,
>
>I can report that I had the same problem.  I have a bunch of NC510c
>cards in IBM X3650 servers.  One broke, with the symptom that it would
>just hang any machine I put it into.  Luckily (or with great
>foresight), I had previously ordered a spare to have on hand.  At one
>point, the HP tech on the line was able to find the serial number in
>the HP system somewhere but only confirm that my card was out of
>warranty (it was manufactured long before we purchased it).  In the
>end, our VAR was able to convince their distributor to mail me a new
>working card, free.  Took a couple of weeks.
>
>Here, I can even copy in our e-mail chain, somewhat sanitized:
>
>
>*****
>
>From: helpful VAR Jeff
>To: Alex Chekholko <chekh at pcbi.upenn.edu>
>Subject: RE: FW: bad network card
>Date: Thu, 23 Oct 2008 08:46:25 -0400
>
>Good morning Alex,
>
>We finally got the distributor to ship you a new card - it should be
>shipping today.  Sorry for all the trouble.
>
>________________________________________
>From: Alex Chekholko [chekh at pcbi.upenn.edu]
>Sent: Monday, October 20, 2008 12:15 PM
>To: helpful VAR Jeff
>Cc: manager at genomics.upenn.edu
>Subject: Re: FW: bad network card
>
>Hi Jeff,
>
>Tech support line that I used is 1-800-474-6836
>
>The last time I spoke to HP on Oct 8th, I ended up in the queue
>Tech Support -> Networking -> ProCurve
>
>This time I spoke to HP, I ended up in the queue
>Tech Support -> Servers -> Proliant
>
>They couldn't find any record of that card anywhere in their system and
>said I have to get a replacement through the reseller the card was
>purchased from and recommended I call 1-888-943-8476 (turned out to be
>a "customer satisfaction" line).
>
>I called back again and went to
>Tech Support -> Networking -> ProCurve
>
>They initially couldn't find it in the system, until I recollected that
>they called it a "server adapter" last time; that allowed them to find
>it, and then they dumped me back to the initial menu.  Apparently, the
>serial number doesn't help, and the part number doesn't show up in the
>system (for the ProCurve folks).
>
>Then I called the "customer satisfaction" line, who suggested the HP
>Parts Store (1-800-227-8164 option 2) and transferred me there.  They
>said that Tech Support is the only place that could authorize warranty
>replacement, and were unable to look up the order numbers listed below,
>but connected me to "HW Support Orders" department (1-800-525-7104) who
>then forwarded me to the front menu of Tech Support.
>
>Total time, 2hrs.
>
>Good luck.
>
>On the UPenn side, the PO number is 2010394.
>
>Regards,
>Alex
>
>On Mon, 13 Oct 2008 17:33:53 -0400
>helpful VAR wrote:
>
>> Alex,
>>
>> Per note below can you try HP support one more time?  Sorry for the trouble! Thanks.
>>
>> -----Original Message-----
>> From: distributor "arrow" On Behalf Of ISS Team
>> Sent: Monday, October 13, 2008 5:18 PM
>> To: helpful VAR Jeff
>> Cc: other arrow folks
>> Subject: RE: bad network card
>>
>> Hi Jeff.  We came up with, we hope, information that will prove to HP
>> that the card is under warranty.  It was ordered through Synnex and the
>> order numbers are 24927045 and 26715052.  They shipped from Synnex on
>> March 20, 2008.
>>
>> Part# 414129-B21 (they ordered a qty of 3)
>> HP NC510C PCIE 10 GIGABIT SERVER ADAPTER
>> Reseller Info:
>> helpful VAR...
>>
>> End-User Info:
>>
>> our address
>> PHILADELPHIA, PA 19104
>>
>> Please have the end user call HP and give them this information.  Please
>> get back with me if HP still won't replace it.
>>
>> Thanks.
>>
>>
>
>*****
> 
>On Mon, 29 Mar 2010 15:51:42 -0400 (EDT)
>Steve Cousins <cousins at umit.maine.edu> wrote:
>
>> 
>> I have a couple of 10 GbE cards from HP (NC510F, NetXen/QLogic) and I have 
>> been trying to get them to work in non-HP Linux systems. After failing to 
>> be able to comile/install the nx_nic drivers on Fedora 9 and 12 (it checks 
>> to see if kernel >= 2.6.27 and if so looks for net/8021q/vlan.h but can't 
>> find it) I installed one of the supported distributions (or close enough): 
>> CentOS 5. Driver installation went fine and I was able to get one of the 
>> cards to work. The other one seems to be bad. The nx_nic driver loads but 
>> no eth2 device shows up. Also, the Activity LED is constant but no Link 
>> LED lights up. Since I can't get the eth device loaded I can't update the 
>> firmware.
>> 
>> So, I have one good one and one bad one. Pretty clear-cut. I've tried to 
>> get technical support from HP to RMA it and I end up in no-mans land. It 
>> seems that they do not have support for peripherals like this. They 
>> consider this part of a system and the only systems they seem to think 
>> these go in are their Proliant servers so I end up at Proliant tech 
>> support. But I have no serial number for a Proliant server so it is a dead 
>> end.
>> 
>> I tried "Customer Satisfaction" and got none.
>>
>> They insist that these cards are *only* for HP Proliant servers but I have 
>> not seen any indication of this at the sites I go to to buy this type of 
>> card. I bought this from a reseller when we bought a bunch of Procurve 
>> equipment with some 10 GbE modules for a switch. The reseller is seeing 
>> what he can do but in the mean time I thought I'd check here to see if 
>> anyone has run into this sort of thing from HP. I've always had good luck 
>> with HP but it has mainly been from the Procurve section and they won't 
>> touch this either.
>> 
>> Anyone running HP cards in non-HP equipment? Any tips on getting an RMA 
>> for this? I'm thinking of trying to track down an HP server on campus to 
>> work with just to get the RMA but I doubt if there are too many of these 
>> that aren't in use and still under warranty.
>> 
>> Thanks,
>> 
>> Steve
>> ______________________________________________________________________
>>   Steve Cousins - Supercomputer Engineer/Administrator - Univ of Maine
>>   Marine Sciences, 452 Aubert Hall       Target Tech, 20 Godfrey Drive
>>   Orono, ME 04469    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~     Orono, ME 04473
>>   (207) 581-4302     ~ cousins at umit.maine.edu ~      (207) 866-6552
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
>
>-- 
>Alex Chekholko   chekh at pcbi.upenn.edu   


______________________________________________________________________
 Steve Cousins, Ocean Modeling Group    Email: cousins at umit.maine.edu
 Marine Sciences, 452 Aubert Hall       http://rocky.umeoce.maine.edu
 Univ. of Maine, Orono, ME 04469        Phone: (207) 581-4302


From ebiederm at xmission.com  Wed Mar 31 05:23:27 2010
From: ebiederm at xmission.com (Eric W. Biederman)
Date: Wed, 31 Mar 2010 05:23:27 -0700
Subject: [Beowulf] HP 10 GbE card use/warranty
In-Reply-To: <alpine.LFD.2.00.1003300919490.20058@razzo.umeoce.maine.edu>
	(Steve Cousins's message of "Tue\,
	30 Mar 2010 09\:24\:48 -0400 \(EDT\)")
References: <alpine.LFD.2.00.1003291431280.14085@razzo.umeoce.maine.edu>
	<fc.004c4d194207fd5b3b9aca001e9c6136.4207fd5c@umit.maine.edu>
	<alpine.LFD.2.00.1003300919490.20058@razzo.umeoce.maine.edu>
Message-ID: <m11vf0znls.fsf@fess.ebiederm.org>

Steve Cousins <cousins at umit.maine.edu> writes:

>
>>>> I installed
>>>> one of the supported distributions (or close enough): CentOS 5. Driver
>>>> installation went fine and I was able to get one of the cards to work. The other
>>>> one seems to be bad. The nx_nic driver loads but no eth2 device shows up. Also,
>>>> the Activity LED is constant but no Link LED lights up. Since I can't get the
>>>> eth device loaded I can't update the firmware.
>>>
>> Does fedora not build the in kernel driver?  They should.
>
>
> Yes it does. I got basic functionality with it but under large transfers it
> locks up. No lockups with the nx_nic driver.

You might want to work with the maintainers of the netxen driver in
the kernel.  That have been fairly responsive when I have worked with
them.

Eric


From orion at cora.nwra.com  Wed Mar 31 08:27:29 2010
From: orion at cora.nwra.com (Orion Poplawski)
Date: Wed, 31 Mar 2010 09:27:29 -0600
Subject: [Beowulf] AMD 6100 vs Intel 5600
Message-ID: <4BB369E1.60906@cora.nwra.com>

Looks like it's time to start evaluating the AMD 6100 (magny-cours) 
offerings versus the Intel 5600 (Nehalem-EX?) offerings.  Any 
suggestions for resources?

-- 
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA/CoRA Division                    FAX: 303-415-9702
3380 Mitchell Lane                  orion at cora.nwra.com
Boulder, CO 80301              http://www.cora.nwra.com


From smulcahy at atlanticlinux.ie  Wed Mar 31 08:36:46 2010
From: smulcahy at atlanticlinux.ie (stephen mulcahy)
Date: Wed, 31 Mar 2010 16:36:46 +0100
Subject: [Beowulf] AMD 6100 vs Intel 5600
In-Reply-To: <4BB369E1.60906@cora.nwra.com>
References: <4BB369E1.60906@cora.nwra.com>
Message-ID: <4BB36C0E.1010502@atlanticlinux.ie>

Orion Poplawski wrote:
> Looks like it's time to start evaluating the AMD 6100 (magny-cours) 
> offerings versus the Intel 5600 (Nehalem-EX?) offerings.  Any 
> suggestions for resources?
> 

http://www.anandtech.com/show/2978/amd-s-12-core-magny-cours-opteron-6174-vs-intel-s-6-core-xeon/10

has some benchmarks as a starting point. Would be interested to hear 
from others with more HPC-oriented benchmark results.

-stephen

-- 
Stephen Mulcahy     Atlantic Linux         http://www.atlanticlinux.ie
Registered in Ireland, no. 376591 (144 Ros Caoin, Roscam, Galway)


From kilian.cavalotti.work at gmail.com  Wed Mar 31 10:37:01 2010
From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI)
Date: Wed, 31 Mar 2010 19:37:01 +0200
Subject: [Beowulf] AMD 6100 vs Intel 5600
In-Reply-To: <4BB369E1.60906@cora.nwra.com>
References: <4BB369E1.60906@cora.nwra.com>
Message-ID: <r2tcda054991003311037g93c36d64gb5895e2444b016f8@mail.gmail.com>

On Wed, Mar 31, 2010 at 5:27 PM, Orion Poplawski <orion at cora.nwra.com> wrote:
> Looks like it's time to start evaluating the AMD 6100 (magny-cours)
> offerings versus the Intel 5600 (Nehalem-EX?) offerings. ?Any suggestions
> for resources?

Just for the sake of precision, Intel 5600 series was codenamed
Westmere (dual-socket, 32nm, 6-cores, 3 memory channels). Intel 7500
series was codenamed Beckton, aka Nehalem-EX (quad-socket and beyond,
45nm, 8-cores, 4 memory-channels).

I would say that the 2x6-cores Magny-Cours probably has to be compared
to Nehalem-EX.

Some SPEC results are being posted on
http://www.spec.org/cpu2006/results/res2010q1/

Cheers,
-- 
Kilian


From bill at cse.ucdavis.edu  Wed Mar 31 13:51:18 2010
From: bill at cse.ucdavis.edu (Bill Broadley)
Date: Wed, 31 Mar 2010 13:51:18 -0700
Subject: [Beowulf] AMD 6100 vs Intel 5600
In-Reply-To: <r2tcda054991003311037g93c36d64gb5895e2444b016f8@mail.gmail.com>
References: <4BB369E1.60906@cora.nwra.com>
	<r2tcda054991003311037g93c36d64gb5895e2444b016f8@mail.gmail.com>
Message-ID: <4BB3B5C6.8060001@cse.ucdavis.edu>

On 03/31/2010 10:37 AM, Kilian CAVALOTTI wrote:
> On Wed, Mar 31, 2010 at 5:27 PM, Orion Poplawski <orion at cora.nwra.com> wrote:
>> Looks like it's time to start evaluating the AMD 6100 (magny-cours)
>> offerings versus the Intel 5600 (Nehalem-EX?) offerings.  Any suggestions
>> for resources?
> 
> Just for the sake of precision, Intel 5600 series was codenamed
> Westmere (dual-socket, 32nm, 6-cores, 3 memory channels). Intel 7500
> series was codenamed Beckton, aka Nehalem-EX (quad-socket and beyond,
> 45nm, 8-cores, 4 memory-channels).
> 
> I would say that the 2x6-cores Magny-Cours probably has to be compared
> to Nehalem-EX.

Why?  Various vendors try various strategies to differentiate products based
on features.  For the most part HPC types care about performance per $,
performance per watt, and reliability.  I'd be pretty surprised to see large
HPC cluster built out of Nehalem-EX chips.  Sure, large NUMA machines from
SGI, or HA clusters for running oracle and related business critical applications.

The best price/perf from intel looks to be the 5600, and the best from AMD is
the Magny-Cours.

Granted these are from AMD but:
http://www.amd.com/us/products/server/benchmarks/Pages/memory-bandwidth-stream-two-socket-servers.aspx

http://www.amd.com/us/products/server/processors/six-core-opteron/Pages/SPECfp-rate2006-two-socket-servers.aspx
http://www.amd.com/us/products/server/processors/six-core-opteron/Pages/SPECint-rate-2006-two-socket-servers.aspx

Of course this is all hand waving without system prices though.  I have to say
I've been pleasantly surprised.  At a Supermicro reseller I configured 2
reasonable compute nodes for a particular application and came up with:

1U dual Opteron 6128 (8 corex2.0 GHz), 32GB DDR3-1333, 2x1TB, IPMI 2.0 = $3802
1U dual Xeon X5650   (6 corex2.6 GHz), 24GB DDR3-1333, 2x1TB, IPMI 2.0 = $4639

Granted IPC isn't the same, but I was amused to see AMD offering 16 2.0 GHz
cores = 32 GHz, and the Intel config had 12 2.6 GHz cores = 31.2 GHz.

I've yet to get an account on the new AMD chips to measure our actual
application performance, but I have to say AMD looks pretty good at the
moment.   I figured maybe the 6 core intel is priced artificially high so I
tried a 4 core:
1U Xeon 5620    (4 core x 2.4 GHz), 24GB DDR3-1333, 2x1TB, IPMI 2.0 = $3293

So $3,293 for the cheaper Intel, or pay $509 to upgrade to the AMD and get
another 8GB ram and double the cores.  Granted they are at 2.0 GHz instead of 2.4.

Seems like AMDs offering more memory memory bandwidth and specFP rate per
dollar.  Certainly enough to have me looking for an account to measure
performance on our codes.


From cbergstrom at pathscale.com  Wed Mar 31 14:16:30 2010
From: cbergstrom at pathscale.com (=?UTF-8?B?IkMuIEJlcmdzdHLDtm0i?=)
Date: Thu, 01 Apr 2010 04:16:30 +0700
Subject: [Beowulf] AMD 6100 vs Intel 5600
In-Reply-To: <4BB3B5C6.8060001@cse.ucdavis.edu>
References: <4BB369E1.60906@cora.nwra.com>	<r2tcda054991003311037g93c36d64gb5895e2444b016f8@mail.gmail.com>
	<4BB3B5C6.8060001@cse.ucdavis.edu>
Message-ID: <4BB3BBAE.5070802@pathscale.com>

Bill Broadley wrote:
> ...
> Seems like AMDs offering more memory memory bandwidth and specFP rate per
> dollar.  Certainly enough to have me looking for an account to measure
> performance on our codes.
>   
While for my own selfish reasons I'm happy AMD may have some chance at a 
comeback, but...  I caution everyone to please ignore SPEC* as any 
indicator of performance..  This will be especially true for any 
benchmarks based on AMD's compiler.  Your code will always be the best 
benchmark and I'm happy to assist anyone offlist that needs help getting 
unbiased numbers.

Best,

./C


#pathscale - irc.freenode.net
CTOPathScale - twitter


From malexand at scaledinfra.com  Tue Mar 30 13:30:20 2010
From: malexand at scaledinfra.com (Michael Alexander)
Date: Tue, 30 Mar 2010 22:30:20 +0200
Subject: [Beowulf] CfP with Extended Deadline 5th Workshop on Virtualization
	in High-Performance Cloud Computing (VHPC'10)
Message-ID: <D3FCE619-E9ED-42A7-940F-E0AF1A4360C7@scaledinfra.com>

Apologies if you received multiple copies of this message.


=================================================================

CALL FOR PAPERS

5th Workshop on

Virtualization in High-Performance Cloud Computing

VHPC'10

as part of Euro-Par 2010, Island of Ischia-Naples, Italy

=================================================================

Date: August 31, 2010

Euro-Par 2009: http://www.europar2010.org/

Workshop URL: http://vhpc.org

SUBMISSION DEADLINE:

Abstracts: April 4, 2010 (extended)
Full Paper: June 19, 2010 (extended) 


Scope:

Virtualization has become a common abstraction layer in modern data
centers, enabling resource owners to manage complex infrastructure
independently of their applications. Conjointly virtualization is
becoming a driving technology for a manifold of industry grade IT
services. Piloted by the Amazon Elastic Computing Cloud services, the
cloud concept includes the notion of a separation between resource
owners and users, adding services such as hosted application
frameworks and queuing. Utilizing the same infrastructure, clouds
carry significant potential for use in high-performance scientific
computing. The ability of clouds to provide for requests and releases
of vast computing resource dynamically and close to the marginal cost
of providing the services is unprecedented in the history of
scientific and commercial computing.

Distributed computing concepts that leverage federated resource access
are popular within the grid community, but have not seen previously
desired deployed levels so far. Also, many of the scientific
datacenters have not adopted virtualization or cloud concepts yet.

This workshop aims to bring together industrial providers with the
scientific community in order to foster discussion, collaboration and
mutual exchange of knowledge and experience.

The workshop will be one day in length, composed of 20 min paper
presentations, each followed by 10 min discussion sections.
Presentations may be accompanied by interactive demonstrations. It
concludes with a 30 min panel discussion by presenters.

TOPICS

Topics include, but are not limited to, the following subjects:

- Virtualization in cloud, cluster and grid HPC environments
- VM cloud, cluster load distribution algorithms
- Cloud, cluster and grid filesystems
- QoS and and service level guarantees
- Cloud programming models, APIs and databases
- Software as a service (SaaS)
- Cloud provisioning
- Virtualized I/O
- VMMs and storage virtualization
- MPI, PVM on virtual machines
- High-performance network virtualization
- High-speed interconnects
- Hypervisor extensions
- Tools for cluster and grid computing
- Xen/other VMM cloud/cluster/grid tools
- Raw device access from VMs
- Cloud reliability, fault-tolerance, and security
- Cloud load balancing
- VMs - power efficiency
- Network architectures for VM-based environments
- VMMs/Hypervisors
- Hardware support for virtualization
- Fault tolerant VM environments
- Workload characterizations for VM-based environments
- Bottleneck management
- Metering
- VM-based cloud performance modeling
- Cloud security, access control and data integrity
- Performance management and tuning hosts and guest VMs
- VMM performance tuning on various load types
- Research and education use cases
- Cloud use cases
- Management of VM environments and clouds
- Deployment of VM-based environments


PAPER SUBMISSION

Papers submitted to the workshop will be reviewed by at least two
members of the program committee and external reviewers. Submissions
should include abstract, key words, the e-mail address of the
corresponding author, and must not exceed 10 pages, including tables
and figures at a main font size no smaller than 11 point. Submission
of a paper should be regarded as a commitment that, should the paper
be accepted, at least one of the authors will register and attend the
conference to present the work.

Accepted papers will be published in the Springer LNCS series - the
format must be according to the Springer LNCS Style. Initial
submissions are in PDF, accepted papers will be requested to provided
source files.

Format Guidelines: http://www.springer.de/comp/lncs/authors.html

Submission Link: http://edas.info/newPaper.php?c=8553


IMPORTANT DATES

April 4 - Abstract submission due (extended)
May 19 - Full paper submission (extended)
July 14 - Acceptance notification
August 3 - Camera-ready version due
August 31 - September 3 - conference


CHAIR

Michael Alexander (chair), scaledinfra technologies GmbH, Austria
Gianluigi Zanetti (co-chair), CRS4, Italy


PROGRAM COMMITTEE

Padmashree Apparao, Intel Corp., USA
Volker Buege, University of Karlsruhe, Germany
Roberto Canonico, University of Napoli Federico II, Italy
Tommaso Cucinotta, Scuola Superiore Sant'Anna, Italy
Werner Fischer, Thomas Krenn AG, Germany
William Gardner, University of Guelph, Canada
Wolfgang Gentzsch, DEISA. Max Planck Gesellschaft, Germany
Derek Groen, UVA, The Netherlands
Marcus Hardt, Forschungszentrum Karlsruhe, Germany
Sverre Jarp, CERN, Switzerland
Shantenu Jha, Louisiana State University, USA
Xuxian Jiang, NC State, USA
Kenji Kaneda, Google, Japan
Yves Kemp, DESY Hamburg, Germany
Ignacio Llorente, Universidad Complutense de Madrid, Spain
Naoya Maruyama, Tokyo Institute of Technology, Japan
Jean-Marc Menaud, Ecole des Mines de Nantes, France
Anastassios Nano, National Technical University of Athens, Greece
Oliver Oberst, Karlsruhe Institute of Technology, Germany
Jose Renato Santos, HP Labs, USA
Borja Sotomayor, University of Chicago, USA
Yoshio Turner, HP Labs, USA
Kurt Tuschku, University of Vienna, Austria
Lizhe Wang, Indiana University, USA
Chao-Tung Yang, Tunghai University, Taiwan


DURATION: Workshop Duration is one day.

GENERAL INFORMATION

The workshop will be held as part of Euro-Par 2010,
Island of Ischia-Naples, Italy.
Euro-Par 2010: http://www.europar2010.org/


From kilian.cavalotti.work at gmail.com  Wed Mar 31 23:36:37 2010
From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI)
Date: Thu, 1 Apr 2010 08:36:37 +0200
Subject: [Beowulf] AMD 6100 vs Intel 5600
In-Reply-To: <4BB3B5C6.8060001@cse.ucdavis.edu>
References: <4BB369E1.60906@cora.nwra.com>
	<r2tcda054991003311037g93c36d64gb5895e2444b016f8@mail.gmail.com>
	<4BB3B5C6.8060001@cse.ucdavis.edu>
Message-ID: <s2ycda054991003312336k4f38ae70q667eaaadffa6d5d5@mail.gmail.com>

On Wed, Mar 31, 2010 at 10:51 PM, Bill Broadley <bill at cse.ucdavis.edu> wrote:
>> I would say that the 2x6-cores Magny-Cours probably has to be compared
>> to Nehalem-EX.
>
> Why?

Maybe first because that's where the core spaces from AMD and Intel
intersect (8-cores Beckton and 8-cores Magny-Cours). I'm not sure it's
really significant to compare performance between a 6-cores Westmere
and a 12-cores Magny-Cours. I feel it makes more sense to compare
apples to apples, ie. same core count.

And then, also maybe because they are the same MP class, not
dual-socket only. Meaning there are similarly equipped in terms of
memory channels and inter-CPU links (QPI or HT), to be associated
in platforms of 4 or more.

> Various vendors try various strategies to differentiate products based
> on features. ?For the most part HPC types care about performance per $,
> performance per watt, and reliability. ?I'd be pretty surprised to see large
> HPC cluster built out of Nehalem-EX chips.

Not entirely built out of Nehalem-EX, probably, but including a fair
share of this newly coming (again) SMP machines, I have no doubt.
Users, both academic and from the industry, have more and more needs
for huge amounts of memory, that can not easily be met using the
distributed memory approach. Nehalem-EX and Magny-Cours offers just
that, hundreds of GB or RAM. I know some people drooling right now at
the idea of putting their hands on a 1TB machine.

I'm totally in line with you on the price/perf points you made, though.

Cheers,
-- 
Kilian