From bob at drzyzgula.org Thu Jul 1 03:10:08 2010 From: bob at drzyzgula.org (Bob Drzyzgula) Date: Thu, 1 Jul 2010 06:10:08 -0400 Subject: [Beowulf] Re: Multiple FlexLM lmgrd services on a single Linux machine? In-Reply-To: References: Message-ID: <20100701101008.GA21330@mx1.drzyzgula.org> One could also, clearly, set up multiple KVM- or Xen-based virtual machine images on which to run lmgrd. But one might then ask why one would want to do this, given that part of the point of multiple lmgrds is to provide physical server redundancy, unless, as Mark appears to be thinking, you simply believe you need one lmgrd for each vendor... On 01/07/10 01:21 -0400, Mark Hahn wrote: >> Linux OS support multiple lmgrd services or not? If its not directly, is >> there a way to do it? > > I don't really understand what you're asking. yes, linux provides fully > functional TCP/IP. yes, flexlm can run either with a merged license file > (single base port, multiple vendor ports), or with multiple completely > separate instances (listening on say, ports 27000+27001 and 28000+28001). > the latter is often more convenient, since it means you can adjust one > instance without affecting the other. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lynesh at Cardiff.ac.uk Thu Jul 1 03:38:31 2010 From: lynesh at Cardiff.ac.uk (Huw Lynes) Date: Thu, 01 Jul 2010 11:38:31 +0100 Subject: [Beowulf] Re: Multiple FlexLM lmgrd services on a single Linux machine? In-Reply-To: <20100701101008.GA21330@mx1.drzyzgula.org> References: <20100701101008.GA21330@mx1.drzyzgula.org> Message-ID: <1277980711.2155.16.camel@w1181.insrv.cf.ac.uk> On Thu, 2010-07-01 at 06:10 -0400, Bob Drzyzgula wrote: > One could also, clearly, set up multiple KVM- or Xen-based > virtual machine images on which to run lmgrd. But one > might then ask why one would want to do this, given > that part of the point of multiple lmgrds is to provide > physical server redundancy, unless, as Mark appears to > be thinking, you simply believe you need one lmgrd for > each vendor... > While it is possible to run licenses from multiple vendors under a single LMGRD, I would advise against it on the basis that 99.9% of vendors don't understand how flexlm works. The worst case I've seen of this is PGI, where we have to run two separate virtual machines for the Windows and Linux licenses. The way they set up their license files makes them impossible to merge. And to add insult to injury, the vendor daemon has paths to lock files hard-coded, so you can't run two PGI license servers on the same box. It's possible that there is some magic environment variable that would fix the lock-file issue. On the issue of VMs to run license servers: the reason we do it is to provide physical server redundancy, since the VMs reside in an ESX HA cluster. Cheers, Huw -- Huw Lynes | Advanced Research Computing HEC Sysadmin | Cardiff University | Redwood Building, Tel: +44 (0) 29208 70626 | King Edward VII Avenue, CF10 3NB From john.hearns at mclaren.com Thu Jul 1 03:46:37 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Thu, 1 Jul 2010 11:46:37 +0100 Subject: [Beowulf] Multiple FlexLM lmgrd services on a single Linux machine?
In-Reply-To: References: Message-ID: <68A57CCFD4005646957BD2D18E60667B10F7567C@milexchmb1.mil.tagmclarengroup.com> We're in a process of implementing a centralized FlexLM license server for multiple commercial applications. Can some one tell us, whether Linux OS support multiple lmgrd services or not? If its not directly, is there a way to do it? For example, can we install FlexLM license servers of both ANSYS and STAR CD on a single linux server? You can run multiple lmgrd on a single machine. As Mark says, use different ports. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ntmoore at gmail.com Thu Jul 1 06:25:37 2010 From: ntmoore at gmail.com (Nathan Moore) Date: Thu, 1 Jul 2010 08:25:37 -0500 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: Message-ID: I spent a summer working at IBM (porting applications) when LLNL's Blue Gene system was being installed/finalized. After spending 10 years in college/grad school with no real outside experience, it was an interesting time. A few observations might be relevant to the discussion. (1) to a certain extent, intellectual/scientific prestige was very important to the culture of the place. Promotions were/are based in part on how many patents you generate (not dissimilar to a publications count), but at least superficially, patents don't seem like a major revenue stream. Another data point, the company has a few internal research journals, http://domino.research.ibm.com/tchjr/journalindex.nsf/Home?OpenForm . (2) about once a week, my supervisor (a very skilled applications programmer) would ask, "So, have you figured out how to sell a million Blue Gene's yet?". Once the design was finalized/produced, the clear goal was to sell lots of them. (Fastest/best/national lab etc only really matter for a short time - people have to be paid...). (3) The local view seemed to be that the interconnect fabric (really fast and high-bandwith, ideal for finite-element calculations, and actually somewhat difficult to implement (well) in Molecular Dynamics) in the BGL was included because of LLNL's application needs, and the machine was accordingly hard to sell to "regular customers." (Something akin to selling a fleet of porsche's to a Taxi Company). (3.5) a little more. 0.8GHz cpus, minimum allocation is 512/1024 CPU's at a time. Not really an architecture that the guys at Citibank are used to writing for... As I recall, this was a result of the design requirements from LLNL. Its an amazing system to look at though - its just one big board with a bunch of chips (CPU+memory) plugged in. The system density and low power consumption was the most impressive thing to me. (4) From my experience, it seems like one of the roots of IBM's success was taking a computer that you have to replace every two years (or can build from parts on NewEgg) and turning it into an industrial appliance (like a hobart mixer or a drill-press) that you service regularly and can get 10 or more years of life out of. This seemed like the essence of the "i-Series, and earlier "System-360" machines. 
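To make the FlexLM sub-thread above concrete (Mark Hahn's merged license file versus completely separate lmgrd instances, and John Hearns's advice to use different ports), here is a minimal sketch of the separate-instance approach using standard FlexLM license-file syntax. The host name, hostid, paths and vendor daemon names below are placeholders rather than values from any vendor's documentation, so substitute whatever your vendors actually ship:

    # /opt/licenses/ansys/license.dat -- first, independent lmgrd instance
    SERVER licserver1 001122334455 27000
    VENDOR ansyslmd PORT=27001
    # FEATURE/INCREMENT lines as supplied by the vendor ...

    # /opt/licenses/starcd/license.dat -- second, completely separate instance
    SERVER licserver1 001122334455 28000
    VENDOR cdlmd PORT=28001
    # FEATURE/INCREMENT lines as supplied by the vendor ...

    # start each instance with its own debug log, e.g. from an init script
    lmgrd -c /opt/licenses/ansys/license.dat -l /var/log/lmgrd-ansys.log
    lmgrd -c /opt/licenses/starcd/license.dat -l /var/log/lmgrd-starcd.log

Clients then select an instance with port@host (27000@licserver1 or 28000@licserver1), via LM_LICENSE_FILE or the vendor-specific equivalent. The merged-file alternative is the same idea with a single SERVER line and one VENDOR line per daemon, each daemon on its own port.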
From hearnsj at googlemail.com Thu Jul 1 06:47:17 2010 From: hearnsj at googlemail.com (John Hearns) Date: Thu, 1 Jul 2010 14:47:17 +0100 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: Message-ID: On 1 July 2010 14:25, Nathan Moore wrote: >> (1) to a certain extent, intellectual/scientific prestige was very > important to the culture of the place. Promotions were/are based in > part on how many patents you generate (not dissimilar to a > publications count), but at least superficially, patents don't seem like > a major revenue stream. You should go to a Richard M Stallman talk on software patents sometime. (They're always on software patents). Software patents are used by companies as leverage in legal disputes between themselves - therefore a company which has many patents can 'defend' itself against others by threatening to counter-sue for infringement of their patents. The more patents you have the better. RMS's argument of course is that software patents (he deals with software - don't extrapolate to other types of patent) should be ended. From bcostescu at gmail.com Thu Jul 1 07:14:21 2010 From: bcostescu at gmail.com (Bogdan Costescu) Date: Thu, 1 Jul 2010 16:14:21 +0200 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2C076A.5010703@scalableinformatics.com> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> <4C2B6C69.60309@scalableinformatics.com> <4C2B9E5F.2010602@ias.edu> <4C2C076A.5010703@scalableinformatics.com> Message-ID: On Thu, Jul 1, 2010 at 5:11 AM, Joe Landman wrote: > At the end of the day, the fundamental question we are debating is, does > the "prestige" of working with a top university/national lab have any > real tangible value that you can ascribe to the bottom line, does it > actually impact sales. > > I posit that the answer to this is a resounding "no". You obviously > disagree. I also disagree, but I have another point of view: the fact of working with a top university/national lab can be important for the development of the product or line of products. A top university/national lab is considered top because it has clever people who are renowned for their way of thinking and/or published results; given a new (type of) parallel machine, they might come up with amazing results and/or might allow them to become even more famous - their publications will mention the (type of) parallel machine on which their results were obtained and other people looking to obtain similar results or looking for even better results (=competitors :-)) will become interested. This doesn't necessarily mean that they will buy the same (type of) parallel machines now but, if the results were amazing enough, the _next_ generation of parallel machines from this or other vendor will be able to achieve the same amazing results because, by then, buyers will ask for it. So it effectively becomes an investment in the future. Bogdan
From landman at scalableinformatics.com Thu Jul 1 07:51:11 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 01 Jul 2010 10:51:11 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> <4C2B6C69.60309@scalableinformatics.com> <4C2B9E5F.2010602@ias.edu> <4C2C076A.5010703@scalableinformatics.com> Message-ID: <4C2CAB5F.208@scalableinformatics.com> Bogdan Costescu wrote: > On Thu, Jul 1, 2010 at 5:11 AM, Joe Landman > wrote: >> At the end of the day, the fundamental question we are debating is, does >> the "prestige" of working with a top university/national lab have any >> real tangible value that you can ascribe to the bottom line, does it >> actually impact sales. >> >> I posit that the answer to this is a resounding "no". You obviously >> disagree. > > I also disagree, but I have another point of view: the fact of working > with a top university/national lab can be important for the > development of the product or line of products. A top This isn't the issue. The issue is, will a discount which amounts to you as a vendor paying your customer to taking your product, in order to garner prestige ... will this prestige translate to the bottom line in the near term ... will it positively impact sales. > university/national lab is considered top because it has clever people > who are renowned for their way of thinking and/or published results; This of course impugns the quality of people at "not-top" sites, places which may produce excellent science, but don't have the name brand of the "top" folks. My experience is that innovation happens where bright people are, whom are motivated to innovate. Excellent science happens many places which are not "top" sites.
> given a new (type of) parallel machine, they might come up with > amazing results and/or might allow them to become even more famous - > their publications will mention the (type of) parallel machine on > which their results were obtained and other people looking to obtain > similar results or looking for even better results (=competitors :-)) "Might" "might allow" "will mention" Which one of these directly impacts against the bottom line? Which one of these actively increases sales and revenues? > will become interested. This doesn't necessarily mean that they will > buy the same (type of) parallel machines now but, if the results were Exactly. They won't necessarily buy. That does impact the bottom line. > amazing enough, the _next_ generation of parallel machines from this > or other vendor will be able to achieve the same amazing results > because, by then, buyers will ask for it. So it effectively becomes an > investment in the future. ... but this doesn't matter, unless you are quantifying this return on investment by classifying the discount as an investment. The danger in doing this, and there is a profound danger in doing this, is that *everyone* will want you as the vendor to "invest" in them. This does happen, and that is why the NDA is such an important tool for vendors to control the distribution of the deal details. Morever, these investments, as I have indicated, and again, I haven't seen a single indication in email (private or public) otherwise, simply don't have a meaningful ROI. They aren't accretive to the bottom line. Which means, if you start giving everyone a discount and couch this as an investment, all you have done is lowered your margins to close to zero or below. You have no real expectation of a return on this "investment". Again, I am not bashing anyone. I do think people need to think these arguments over very carefully before they present this stuff to their vendor of choice. Most I have spoken to over the last year have told me that any discount comes with a significant quid pro quo, something that will help offset other costs elsewhere, e.g. have a measurable impact upon the bottom line. Prestige adds nothing to be bottom line, it gives you talking points. It won't steer a detectable/measurable number of customers your way. Investment in product development happens generally independently of the sales efforts. That is a real cost to the bottom line. If you couch this as product investment, then you have a cost center. Which negatively impacts bottom line. > Bogdan Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From james.p.lux at jpl.nasa.gov Thu Jul 1 08:29:45 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Thu, 1 Jul 2010 08:29:45 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: Message-ID: On 7/1/10 6:47 AM, "John Hearns" wrote: > On 1 July 2010 14:25, Nathan Moore wrote: >>> (1) to a certain extent, intellectual/scientific prestige was very >> important to the culture of the place. ?Promotions were/are based in >> part on how many patents you generate (not dissimilar to a >> publications count), but at least superficially, patents don't seem like >> a major revenue stream. > > You should go to a Richard M Stallman talk on software patents > sometime. 
(They're always on software patents). > Software patents are used by companies as leverage in legal disputes > between themselves - therefore a company > which has many patents can 'defend' itself against others by > threatening to counter-sue for infringment of their patents. > The more patents you have the better. > > RMS's argument of course is that software patents (he deals with > software - don't extrapolate to other types of patent) > should be ended. > > RMS is an interesting guy. He does tend to stake out one extreme on the intellectual property rights spectrum. I've had many a stimulating(?) discussion with folks who advocate a form of Marxism for software: that is, all software should be freely available to all, and magic elves/the state/some entity will ensure that the writers of such software will have a roof over their heads and food to eat, because it's a sharing of a societal good thing. Sadly, even in the halls of academe such a situation doesn't really exist. RMS knows this and has a finely nuanced way to deal with it. The same cannot be said of his philosophical adherents. Such as it is, the IP world we live in is the one we have to work in. We can't rely on the Renaissance patronage approach, or the aristocratic gentleman of independent means approach for support. Government support is sort of the new "patronage", but it is a fickle master (although perhaps, now that I think about it, not any more fickle than Prince Ludivico of Milan, etc.). For the rest of us, the enormous number of technology and science developers in the private world, the value of IP ("goodwill" on the books) is that with which a profit making entity justifies the original investment, and with which they pay your salary, which lets you put that roof over your head etc. With respect to software patents, IBM gets bunches o' patents on hardware too, and has always done so. The modern trend to building a portfolio as a strategic weapon against other portfolios (e.g. So you have something to cross license with) is somewhat worrying, because it tends to favor the big boys. That is, if you have 100 patents in your quiver and someone knocks a few out, you've still got 90+ to beat them over the head with. If you have 1 patent.... It's an interesting topic... From james.p.lux at jpl.nasa.gov Thu Jul 1 08:46:01 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Thu, 1 Jul 2010 08:46:01 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: Message-ID: On 7/1/10 7:14 AM, "Bogdan Costescu" wrote: > On Thu, Jul 1, 2010 at 5:11 AM, Joe Landman > wrote: >> At the end of the day, the fundamental question we are debating is, does >> the "prestige" of working with a top university/national lab have any >> real tangible value that you can ascribe to the bottom line, does it >> actually impact sales. >> >> I posit that the answer to this is a resounding "no". ?You obviously >> disagree. > > I also disagree, but I have another point of view: the fact of working > with a top university/national lab can be important for the > development of the product or line of products. This manifests itself in other ways, too. For instance, during the .com bubble, there were joking comments about being paid in "space dollars", typically in reference to the disparity between wages paid to develop research satellite equipment and in the wireless industry. 
There are people at JPL who maintain that we shouldn't be worrying about paying competitive salaries, because if you don't want to work on "space stuff" as a personal goal, regardless of compensation, you shouldn't be working there. I view this as unrealistic and comparable memes like the "starving in a garrett leading to true art" or the "if you truly cared, you'd do it for free and live in tent with the other homeless people along the arroyo". While the latter might have been acceptable to me in my 20s, now that I've passed the half century mark, I'm a bit more inured to creature comforts and, more to the point, so are my wife and children. I have talked with senior managers at technology companies about why they would contemplate being a vendor for JPL vs the commercial world (JPL tends to be in the category of a "high maintenance" customer who asks a lot of questions, wants to peer into every aspect of your processes, and asks a lot of their vendors, and because we're the government, you're not going to make huge profits).. Their response is often that it allows them to attract top talent, who can then benefit the company in other ways. In the case of NASA work, too, it's public, unlike other high tech work for the defense type markets, which is often classified. If you're recruiting smart people out of school, telling them that they can work on a radio for a Mars probe is much sexier than telling them that they will be the third assistant door latch controller developer on the automotive products team. So, given that someone wants to hire top people, and there are lots of potential employers for top people, you need something to distinguish yourself beyond the hygiene issue of salary. (A hygiene issue is one that has a sort of threshold effect.. Nobody cares much about the details of having it, but not having it is a big negative. ) Working on "the worlds fastest computer" is one of those things that gets you in the door. And, as a "top person" you might find that, though skilled, that's not your thing, and that you really have a talent for something else that IBM does, so IBM benefits, in many ways. From jlforrest at berkeley.edu Thu Jul 1 09:03:29 2010 From: jlforrest at berkeley.edu (Jon Forrest) Date: Thu, 01 Jul 2010 09:03:29 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: Message-ID: <4C2CBC51.6090704@berkeley.edu> Another reason why some vendors are willing to sell stuff at reduced prices to universities is for visibility. The thinking is that when grad students (finally) graduate and go off into industry, they'll want to buy the same stuff they used when they were students. I'm not sure how valid this approach is, but I don't argue with it. As the #1 public university in many (most?) scientific fields, UC Berkeley gets approached with all kinds of deals. During the boom times we sometimes had to turn down such deals because we simply didn't have the space and/or the people to allow us to accept the equipment. Things are different now, but space and people are still more expensive than most equipment. Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From landman at scalableinformatics.com Thu Jul 1 09:24:17 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 01 Jul 2010 12:24:17 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? 
In-Reply-To: <4C2CBC51.6090704@berkeley.edu> References: <4C2CBC51.6090704@berkeley.edu> Message-ID: <4C2CC131.4010403@scalableinformatics.com> Jon Forrest wrote: > Another reason why some vendors are willing to > sell stuff at reduced prices to universities is > for visibility. The thinking is that when grad > students (finally) graduate and go off into > industry, they'll want to buy the same stuff > they used when they were students. I'm not > sure how valid this approach is, but I don't > argue with it. Heh ... I deleted that section of one of my responses. There isn't a lot of data to support the notion that they will go out into the world and buy the same stuff. They will often go for the best deals. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From james.p.lux at jpl.nasa.gov Thu Jul 1 09:31:51 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Thu, 1 Jul 2010 09:31:51 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2CAB5F.208@scalableinformatics.com> Message-ID: On 7/1/10 7:51 AM, "Joe Landman" wrote: > Bogdan Costescu wrote: >> On Thu, Jul 1, 2010 at 5:11 AM, Joe Landman >> wrote: >>> At the end of the day, the fundamental question we are debating is, does >>> the "prestige" of working with a top university/national lab have any >>> real tangible value that you can ascribe to the bottom line, does it >>> actually impact sales. >>> >>> I posit that the answer to this is a resounding "no". You obviously >>> disagree. >> >> I also disagree, but I have another point of view: the fact of working >> with a top university/national lab can be important for the >> development of the product or line of products. A top > > This isn't the issue. The issue is, will a discount which amounts to > you as a vendor paying your customer to taking your product, in order to > garner prestige ... will this prestige translate to the bottom line in > the near term ... will it positively impact sales. I think it depends on the organization deriving the putative benefit from prestige. For some organizations, there is none.. It doesn't directly translate to increased profits in the near term. For others, it might be less tangible, but as real: attracting better job candidates increases the value of "workforce capital". As you've ably described, there's a huge tension at the top of most corporations between things that can be accurately valued in cash terms and those that are intangible. And, while lots of companies may say "our most valuable asset walks out the door each evening" they may not actually act that way in real life, particularly if their board feels that they have to responsible to "shareholder value" concerns. It would be interesting to compare whether more technology companies have gone broke because they over valued "prestige" or under valued it. As Joe has pointed out, more than one company has gotten into trouble for the "discounts for sexy customers" policy. On the other hand, if all your people walk out the door, permanently, because it's no fun working on the latest mid range commodity server, you also die, just slower. This is the classic "let's do away with R&D, because they're just a cost center" problem ( Chainsaw Al and his M&A ilk). This is much like when I worked in the entertainment industry. 
If you have some unique skill or capability, everyone comes to you with offers to work for free/minimal pay, "because it's a great opportunity, and you'll get to work with X, and the next job will pay better", and because without you, their grand idea will never work. (They do this with people who have commodity skills, too, but those have very depressed prices: e.g. Actors basically work for free and hope to get lucky. ) You pretty quickly learn to be pretty hard nosed about this: "Is this an investment opportunity or a work opportunity, and if it's the former, what's my fraction of the gross revenue" (Everybody knows you never, never, never take "net points") Sometimes, though, the "you'll work with X" *is* worth taking the job, either because it's something you'd literally pay for if offered the chance (hey, people bid in charity auctions for a dinner with Y or Z, it's not that different), or because the potential downstream reward has a high enough expected value (Probability(event)*Revenue(event) > reduced income from this one job). It's when P(event) is down in the 0.01 range and the R(event) is in the <1 years income category you say, "gosh that sounds nice, but I'm sort of busy right now, and can I refer you to someone else) The actor model is slightly different.. P(event) where event is "getting a real paying gig" is very small, but the R(event) is fairly high (that gig gets you into the union, for life, which helps improve the P(future event)) AND most important, the "foregone revenue" is basically zero. Actors KNOW that very few make any money, so they get jobs to feed and house themselves that are flexibly scheduled (wait staff, casual day labor, construction) so they don't lose revenue by taking the opportunity presented. I can't see anybody in the HPC business taking the 'actor' approach as a business plan, though. From john.hearns at mclaren.com Thu Jul 1 09:38:38 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Thu, 1 Jul 2010 17:38:38 +0100 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: Message-ID: <68A57CCFD4005646957BD2D18E60667B10F75B4E@milexchmb1.mil.tagmclarengroup.com> > > RMS is an interesting guy. He does tend to stake out one extreme on > the > intellectual property rights spectrum. I've had many a stimulating(?) > discussion with folks who advocate a form of Marxism for software: that > is, > all software should be freely available to all, and magic elves/the > state/some entity will ensure that the writers of such software will > have a > roof over their heads and food to eat, because it's a sharing of a > societal > good thing. Sadly, even in the halls of academe such a situation > doesn't > really exist. RMS knows this and has a finely nuanced way to deal with > it. Jim, RMS makes a distinction between patents and copyright. Remember that the GNU Copyleft is a copyright, used to defend free software. There is nothing wrong with copyrighting your work, and being paid for it, and indeed making money from it. It is the patents which lead to the absurdities of companies scrambling for patents for commonsense techniques - in order to build up that portfolio to brandish at other companies. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. 
From gus at ldeo.columbia.edu Thu Jul 1 09:41:38 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 01 Jul 2010 12:41:38 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2CBC51.6090704@berkeley.edu> References: <4C2CBC51.6090704@berkeley.edu> Message-ID: <4C2CC542.9060007@ldeo.columbia.edu> Hi Jon, Rahul, list Jon Forrest wrote: > Another reason why some vendors are willing to > sell stuff at reduced prices to universities is > for visibility. The thinking is that when grad > students (finally) graduate and go off into > industry, they'll want to buy the same stuff > they used when they were students. I'm not > sure how valid this approach is, but I don't > argue with it. > I agree this may not work with high end HPC. However, this is to some extent what Microsoft does very effectively with kids from pre-kindergarten to graduate school worldwide, including part of their charity donations to schools, etc. It creates a culture, a habit, a dependency, in what is a much bigger market, but yet a market with much less choices than HPC - surprisingly or not. (This is despite the inroads that Ubuntu and others may have created lately.) Hey Rahul: Your original question went a long way, didn't it? :) Somehow you always ask questions that trigger these interesting debates. Cheers, Gus Correa > As the #1 public university in many (most?) > scientific fields, UC Berkeley gets approached > with all kinds of deals. During the boom times > we sometimes had to turn down such deals because > we simply didn't have the space and/or the people > to allow us to accept the equipment. > > Things are different now, but space and people > are still more expensive than most equipment. > > Cordially, From landman at scalableinformatics.com Thu Jul 1 10:00:30 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 01 Jul 2010 13:00:30 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> Message-ID: <4C2CC9AE.7030202@scalableinformatics.com> Greg Rubino wrote: > I have to say I partially agree with Prentice. I don't know if > prestige directly translates into revenue, but if your a huge company Thats the thesis that I am saying I do not believe to be the case, and Prentis is (as far as I understand it) indicating that he believes this to be the case. If a discount doesn't translate into revenue (e.g. no return on "investment"), then what does it translate into? The accountants will tell you. > and your platform is the first one upon which some new innovation in > HPC is implemented (cutthroat or not), you have a huge opportunity on > your hands. I guess it depends upon the terms under which you took > that initial "loss" (s/loss/risk/g). Thats marketing. Few companies list marketing as a profit center. It is an expense. It reduces the bottom line. No one is denying the marketing cache' of a nice PR win. However ... marketing cache' doesn't often turn into cash (or, more accurately, profitable revenue). Which is the basis of what I am arguing. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. 
email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From jlforrest at berkeley.edu Thu Jul 1 10:10:31 2010 From: jlforrest at berkeley.edu (Jon Forrest) Date: Thu, 01 Jul 2010 10:10:31 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: Message-ID: <4C2CCC07.1020004@berkeley.edu> On 7/1/2010 9:47 AM, Lux, Jim (337C) wrote: > Giving it away for free to educational institutions worked for Unix, eh? (at > least in the long run) Maybe so, for some definition of "worked". I don't know how it was at other places, but at Berkeley, especially in the Computer Science Dept. where I worked, the presence of certain brands of equipment was like age rings in trees. By this I mean that 1991-1993 were the DEC years, 1994-1996 were the HP years, 1997-2000 were the Intel years (I'm making up these years and vendors). During these periods the vendors made their equipment available to us at extremely good prices. Then, something would happen that caused the vendors to loose interest, and another vendor would gain interest. In our case, I think part of the reason why this happened is because the vendors wanted access to the professors and grad students involved in the research projects seemed like they would have promising commercial value, such as RAID, RISC, Postgres, and NOW. -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From gus at ldeo.columbia.edu Thu Jul 1 10:29:07 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 01 Jul 2010 13:29:07 -0400 Subject: [Beowulf] guide for pbs/torque and mpi In-Reply-To: References: Message-ID: <4C2CD063.9020606@ldeo.columbia.edu> Hi Akshar akshar bhosale wrote: > hi, > we want to have a good reference guide for torque(pbs),maui and mpi > > akshar Torque and Maui guides are available from ClusterResources: http://www.clusterresources.com/pages/products/torque-resource-manager.php http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php LLNL has good MPI (and other) tutorials: https://computing.llnl.gov/?set=training&page=index https://computing.llnl.gov/tutorials/mpi/ I hope it helps, Gus Correa > > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From james.p.lux at jpl.nasa.gov Thu Jul 1 09:47:53 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Thu, 1 Jul 2010 09:47:53 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2CBC51.6090704@berkeley.edu> Message-ID: On 7/1/10 9:03 AM, "Jon Forrest" wrote: > Another reason why some vendors are willing to > sell stuff at reduced prices to universities is > for visibility. The thinking is that when grad > students (finally) graduate and go off into > industry, they'll want to buy the same stuff > they used when they were students. I'm not > sure how valid this approach is, but I don't > argue with it. > Giving it away for free to educational institutions worked for Unix, eh? 
(at least in the long run) And IBM used to give very attractive lease rates to universities (in the good old days, you couldn't BUY an IBM mainframe, only lease them) An interesting question is whether this strategy would be viable today, given the more short term return orientation of the capital markets. From rpnabar at gmail.com Thu Jul 1 12:39:07 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 1 Jul 2010 14:39:07 -0500 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2CC542.9060007@ldeo.columbia.edu> References: <4C2CBC51.6090704@berkeley.edu> <4C2CC542.9060007@ldeo.columbia.edu> Message-ID: On Thu, Jul 1, 2010 at 11:41 AM, Gus Correa wrote: > Hey Rahul: > > Your original question went a long way, didn't it? :) > Somehow you always ask questions that trigger these > interesting debates. Yes it did. I am not sure if that is good or bad though. :) I hope it doesn't make me look like a "troll". I just had a genuine curiosity to know more about the economic side of HPC, economies of scale etc.; this is the stuff often in the shadows and I don't see much coverage in the typical material I read. Best, -- Rahul From rpnabar at gmail.com Thu Jul 1 12:46:34 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 1 Jul 2010 14:46:34 -0500 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2CC131.4010403@scalableinformatics.com> References: <4C2CBC51.6090704@berkeley.edu> <4C2CC131.4010403@scalableinformatics.com> Message-ID: On Thu, Jul 1, 2010 at 11:24 AM, Joe Landman wrote: > Heh ... I deleted that section of one of my responses. There isn't a lot of > data to support the notion that they will go out into the world and buy the > same stuff. They will often go for the best deals. > People will only have a limited number of vendors on their list when they get quotes and compare specs, etc. And the sort of prestige branding does make you more likely to get on the list of vendors people will consider for a particular project. e.g. I imagine there are at least 20 different HPC vendors (if not more) that would sell the kind of equipment that went into our latest cluster upgrade. How many did I actually talk to? Maybe 7. (Maybe others are more diligent than I am) And this is the sort of subjective decision that will get made on the basis of reputation and familiarity. Word-of-mouth counts a lot too. If there is a Vendor-X that nobody on campus has ever worked with, there is a far more uphill battle to convince the decision makers to give them a chance. The reasons have to be pretty compelling. -- Rahul From gus at ldeo.columbia.edu Thu Jul 1 14:51:32 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 01 Jul 2010 17:51:32 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: <4C2CBC51.6090704@berkeley.edu> <4C2CC542.9060007@ldeo.columbia.edu> Message-ID: <4C2D0DE4.1070908@ldeo.columbia.edu> Hi Rahul, list So far we have three data points, if I counted right: 1) Yours: US$35k/TFlop, 100-node Beowulf cluster. 2) Dmitry Chubarov's US$158k/TFlop, SKIF/Cyberia cluster, Feb/2007, Russia 3) Mark Hahn's $CAD 30k/TFlop (approx US$28k/TFlop), in Canada, a Beowulf cluster, I presume, size not specified. Not sure if all these numbers correspond to nominal TFlops (Rpeak), or to the actual HPL benchmark result (Rmax), or to some estimate, say, (85% Rmax/Rpeak HPL efficiency)*(nominal TFlops), or to some other performance metric. I can add one point to the data table.
19 months ago we paid about US$37k/Tflop (nominal), or US$43k/Tflop (actual HPL). (Small Beowulf cluster w/ IB, IPMI, GigE, not counting UPS, storage, and head node.) I guess if you buy the latest greatest fastest processors they add a quite a bit on the $ side. The prices seem to be significantly higher if you buy outside the USA, Canada, maybe the EU also, as Dmitry's number suggests. Cray recently sold a 244Tflop XT6 to Brazil for weather forecast for US$20M (US$82k/Tflop). A bigger Petaflop version was sold to Los Alamos Natl. Lab for US$45M (US$45k/Tflop): http://www.theregister.co.uk/2010/04/29/brazil_buys_cray_xt6/ http://insidehpc.com/2010/04/21/cray-wins-big-deal-in-brazil/ http://www.hpcwire.com/offthewire/Cray-Wins-20M-Contract-with-Brazils-National-Institute-for-Space-Research-91710269.html http://www.networkworld.com/news/2010/040210-cray-supercomputer.html http://investors.cray.com/phoenix.zhtml?c=98390&p=irol-newsArticle&ID=1415533&highlight= http://investors.cray.com/phoenix.zhtml?c=98390&p=irol-newsArticle&ID=1409130&highlight Interesting that the $/Tflop ratio seems to (still) be significantly lower in the Beowulfs, even though commodity processors, RAM, etc, are used more and more in the brand name machines. Mark got the best deal. Maybe we should go buy in Canada, as people do with medicine. :) Cheers, Gus Correa Rahul Nabar wrote: > On Thu, Jul 1, 2010 at 11:41 AM, Gus Correa wrote: >> Hey Rahul: >> >> Your original question went a long way, didn't it? :) >> Somehow you always ask questions that trigger these >> interesting debates. > > Yes it did. I am not sure if that is good or bad though. :) I hope it > doesn't make me look like a "troll". I just had a genuine curiosity to > know more about the economic side of HPC, economies of scale etc.; > this is the stuff often in the shadows and I don't see much coverage > in the typical material I read. > > Best, > From rpnabar at gmail.com Thu Jul 1 14:58:30 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 1 Jul 2010 16:58:30 -0500 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2D0DE4.1070908@ldeo.columbia.edu> References: <4C2CBC51.6090704@berkeley.edu> <4C2CC542.9060007@ldeo.columbia.edu> <4C2D0DE4.1070908@ldeo.columbia.edu> Message-ID: On Thu, Jul 1, 2010 at 4:51 PM, Gus Correa wrote: > Hi Rahul, list > > Not sure if all these numbers correspond to nominal Tflops (Rmax), > or, to actual HPL benchmark (Rpeak), > or to some estimate, say,(85% Rpeak/Rmax HPL)*(nominal Tflops), > or to some other performance metric. Mine were nominal Tflops. One more data point for Gus' list: $27,000 per Teraflop (nominal) if I read this article about Crays currently. US deployment and seems to be sort of an averaged out number over several Cray systems circa Feb 2010 http://www.theregister.co.uk/2010/02/24/cray_dod_deals/ -- Rahul From mathog at caltech.edu Thu Jul 1 15:54:03 2010 From: mathog at caltech.edu (David Mathog) Date: Thu, 01 Jul 2010 15:54:03 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? Message-ID: This thread brings to mind the punch line from the second of the SNL "Citiwide Change Bank" ads. 
Here is a link to the transcript: http://snltranscripts.jt.org/88/88achangebank2.phtml Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From samuel at unimelb.edu.au Thu Jul 1 21:02:06 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Fri, 02 Jul 2010 14:02:06 +1000 Subject: [Beowulf] Multiple FlexLM lmgrd services on a single Linux machine? In-Reply-To: References: Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 01/07/10 14:51, Sangamesh B wrote: > We're in a process of implementing a > centralized FlexLM license server for multiple > commercial applications. Can some one tell us, > whether Linux OS support multiple lmgrd services > or not? Yes you can, when I was at VPAC we were doing it (and they still are) and now I'm here we're doing it for our cluster (though only for 2 at present). What we do is have a single FlexLM install and then a directory per application under /opt/licenses/. In that directory we keep the license.dat file along with the vendor daemon for the application and the (optional) license.opt file. cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkwtZL4ACgkQO2KABBYQAh+cawCeO5iH3wJaJMHS8iTOwrCVRXt3 Uq8An0DQvevclbyfcETh9WPHNjfQaxdd =qayK -----END PGP SIGNATURE----- -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.hearns at mclaren.com Fri Jul 2 01:59:05 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Fri, 2 Jul 2010 09:59:05 +0100 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: Message-ID: <68A57CCFD4005646957BD2D18E60667B10F75E1F@milexchmb1.mil.tagmclarengroup.com> > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] > On Behalf Of David Mathog > Sent: 01 July 2010 23:54 > To: beowulf at beowulf.org > Subject: Re: [Beowulf] dollars-per-teraflop : any lists like the > Top500? > > This thread brings to mind the punch line from the second of the SNL > "Citiwide Change Bank" ads. Here is a link to the transcript: > > http://snltranscripts.jt.org/88/88achangebank2.phtml Hmmmm. I'd just returned from a business trip to London, and all the cash I had was a five-pound note. Citiwide wasn't able to convert it to dollars, but they did give me four guineas, two crowns, four shillings, and ten pence. http://home.clara.net/brianp/money.html One guinea = 21 shillings = 105p in decimal money Crown = five shillings 25p in decimal money One shilling = 5p in decimal money One pence old money = 0.417p decimal I make that 479.17pence decimal in exchange for five pounds (500p). Yup, that's how Citibank make their money The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From Bill.Rankin at sas.com Fri Jul 2 07:37:23 2010 From: Bill.Rankin at sas.com (Bill Rankin) Date: Fri, 2 Jul 2010 14:37:23 +0000 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? 
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B10F75E1F@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B10F75E1F@milexchmb1.mil.tagmclarengroup.com> Message-ID: <76097BB0C025054786EFAB631C4A2E3C093322A0@MERCMBX04R.na.SAS.com> David: > > This thread brings to mind the punch line from the second of the SNL > > "Citiwide Change Bank" ads. Here is a link to the transcript: > > > > http://snltranscripts.jt.org/88/88achangebank2.phtml Thank you, now my keyboard is covered in coffee. ;-) > Hmmmm. > > I'd just returned from a business trip to London, and all the cash I > had was a five-pound note. > Citiwide wasn't able to convert it to dollars, but they did give me > four guineas, two crowns, four shillings, and ten pence. [...] > I make that 479.17pence decimal in exchange for five pounds (500p). > Yup, that's how Citibank make their money Well John, that's American math for you. I blame our educational system. (You would have to go add that up, wouldn't you ;-). Going completely off topic, most US banks will not convert anything other than paper currency and I have a baggie full of Euro and Pound coinage to show for it. So the fact that "Citiwide" converted the note to coins would be another win for them. Have a good holiday weekend, for those of us on this side of the pond(*). To everyone else, have a nice normal weekend. -b (*) and north of Mexico and south of Canada, unless of course you are in Alaska or Hawaii. From scrusan at UR.Rochester.edu Thu Jul 1 09:53:51 2010 From: scrusan at UR.Rochester.edu (Steve Crusan) Date: Thu, 01 Jul 2010 12:53:51 -0400 Subject: [Beowulf] guide for pbs/torque and mpi In-Reply-To: Message-ID: For Toreuq/Maui, go here: http://www.clusterresources.com/products.php There should be manuals for each product. On 6/30/10 3:32 PM, "akshar bhosale" wrote: > hi, > ?we want to have a good reference guide for torque(pbs),maui and mpi > > akshar > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf ---------------------- Steve Crusan System Administrator Center for Research Computing University of Rochester (585) 276-5599 https://www.crc.rochester.edu/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From Nico.Mittenzwey at informatik.tu-chemnitz.de Fri Jul 2 04:11:03 2010 From: Nico.Mittenzwey at informatik.tu-chemnitz.de (Nico Mittenzwey) Date: Fri, 02 Jul 2010 13:11:03 +0200 Subject: [Beowulf] PBS/Maui delay starting of several jobs for certain user Message-ID: <4C2DC947.8000903@informatik.tu-chemnitz.de> Hi all, we have a user with jobs that create heavy I/O load on our network file system every x minutes. If I block her jobs and start them manually with a delay of about 10 minutes, she can run about 60 jobs simultaneously. However, if I don't do that and the jobs are started at more or less the same time (for example after a big job from another user finished), she can run about 30 jobs without stressing our file system so hard that "ls" takes 30 seconds to display any home directory. So I would like to tell PBS/Maui to wait x seconds before starting another job of that particular user. Do you know of any means to accomplish that (even if I have to change the source)? 
Since I may need this for some other users too I would prefer using PBS/Maui directly rather then blocking all jobs of this users and starting the jobs using a script. Cheers, Nico From Bill.Rankin at sas.com Fri Jul 2 08:06:38 2010 From: Bill.Rankin at sas.com (Bill Rankin) Date: Fri, 2 Jul 2010 15:06:38 +0000 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: <4C2CBC51.6090704@berkeley.edu> Message-ID: <76097BB0C025054786EFAB631C4A2E3C093322CD@MERCMBX04R.na.SAS.com> > > Another reason why some vendors are willing to > > sell stuff at reduced prices to universities is > > for visibility. The thinking is that when grad > > students (finally) graduate and go off into > > industry, they'll want to buy the same stuff > > they used when they were students. I'm not > > sure how valid this approach is, but I don't > > argue with it. > > > > Giving it away for free to educational institutions worked for Unix, > eh? (at least in the long run) Well, SW != HW. Giving away the former is a matter of opportunity cost, whereas you have capital tied up in the latter. But there are also (as I understand it) significant tax write-offs in the discounts given to universities. Sun used to have heavy subsidies for their hardware sold to educational institutions, as did Dell and IBM. I suspect that additional money on the balance sheet made the "visibility" argument a lot more palatable. Microsoft once made a huge software "donation" to Duke Univ. Many copies of Windows, Office, Visio, et al., values at millions of $. But instead of shipping a few copies of the distribution media and a big list of individual software keys (which would have made storage and installation easier), they received pallets upon pallets containing individual boxed sets of all the software. I expect that this was required in order for them to write-off the donation in their accounting books. > And IBM used to give very attractive lease rates to universities (in > the good old days, you couldn't BUY an IBM mainframe, only lease them) Yup. They used to do the same thing on their SP-X systems (do they still do this with the BlueGene's?) I remember when the North Carolina Supercomputing Center defaulted after the third year of their five year SP3 lease due to state budget cuts. True to form, IBM sent the trucks in and repossessed a 3-year old "supercomputer". I think that could have made for great Reality TV. ;-) > An interesting question is whether this strategy would be viable today, > given the more short term return orientation of the capital markets. As well as the fact that (at least for the HPC arena) the useful lifespan of a system is a lot less than that most mainframes. The technology front is just moving too fast. -bill From john.hearns at mclaren.com Fri Jul 2 08:30:29 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Fri, 2 Jul 2010 16:30:29 +0100 Subject: [Beowulf] PBS/Maui delay starting of several jobs for certain user In-Reply-To: <4C2DC947.8000903@informatik.tu-chemnitz.de> References: <4C2DC947.8000903@informatik.tu-chemnitz.de> Message-ID: <68A57CCFD4005646957BD2D18E60667B10FDC232@milexchmb1.mil.tagmclarengroup.com> > Hi all, > we have a user with jobs that create heavy I/O load on our network file > system every x minutes. If I block her jobs and start them manually > with > a delay of about 10 minutes, she can run about 60 jobs simultaneously. 
> However, if I don't do that and the jobs are started at more or less > the > same time (for example after a big job from another user finished), she > can run about 30 jobs without stressing our file system so hard that > "ls" takes 30 seconds to display any home directory. Nico, I have the perfect tool for this job: http://www.bofhcam.org/co-larters/lart-reference/ I did have a quick look in the PBSpro admin manual - I can't see anything at first glance which limits number of jobs per scheduler cycle. I could very well be wrong. You can increase the intervals between scheduler runs - but that will not do exactly what you want. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From james.p.lux at jpl.nasa.gov Fri Jul 2 08:47:18 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 2 Jul 2010 08:47:18 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <76097BB0C025054786EFAB631C4A2E3C093322A0@MERCMBX04R.na.SAS.com> Message-ID: On 7/2/10 7:37 AM, "Bill Rankin" wrote: > > > Have a good holiday weekend, for those of us on this side of the pond(*). To > everyone else, have a nice normal weekend. > > -b > > (*) and north of Mexico and south of Canada, unless of course you are in > Alaska or Hawaii. > Yesterday was Canada Day/F?te du Canada... (and, according to NPR, it was also the national day in Somalia, and given that in Muslim countries the weekend is often Thursday/Friday, it was probably a holiday weekend for them too) From eugen at leitl.org Sun Jul 4 09:05:01 2010 From: eugen at leitl.org (Eugen Leitl) Date: Sun, 4 Jul 2010 18:05:01 +0200 Subject: [Beowulf] 6 TFlops, 450 MFlops/W watercooled IBM @ ETH Message-ID: <20100704160501.GQ31956@leitl.org> http://www.physorg.com/news197295578.html IBM Hot Water-Cooled Supercomputer Goes Live at ETH Zurich July 2, 2010 (PhysOrg.com) -- IBM has delivered a first-of-a-kind hot water-cooled supercomputer to the Swiss Federal Institute of Technology Zurich (ETH Zurich), marking a new era in energy-aware computing. The innovative system, dubbed Aquasar, consumes up to 40 percent less energy than a comparable air-cooled machine. Through the direct use of waste heat to provide warmth to university buildings, Aquasar's carbon footprint is reduced by up to 85 percent. Building energy efficient computing systems and data centers is a staggering undertaking. In fact, up to 50 percent of an average air-cooled data center's energy consumption and carbon footprint today is not caused by computing but by powering the necessary cooling systems to keep the processors from overheating - a situation that is far from optimal when looking at energy efficiency from a holistic perspective. The development of Aquasar began one year ago as part of IBM's First-Of-A-Kind (FOAK) program, which engages IBM scientists with clients to explore and pilot emerging technologies that address business problems. The supercomputer consists of special water-cooled IBM BladeCenter Servers, which were designed and manufactured by IBM scientists in Zurich and Boblingen, Germany. For direct comparison with traditional systems, Aquasar also holds additional air-cooled IBM BladeCenter servers. In total, the system achieves a performance of six Teraflops and has an energy efficiency of about 450 megaflops per watt. 
In addition, nine kilowatts of thermal power are fed into the ETH Zurich's building heating system. With its innovative water-cooling system and direct utilization of waste heat, Aquasar is now fully-operational at the Department of Mechanical and Process Engineering at ETH Zurich. "With Aquasar, we make an important contribution to the development of sustainable high performance computers and computer system. In the future it will be important to measure how efficiently a computer is per watt and per gram of equivalent CO2 production," said Prof. Dimos Poulikakos, head of the Laboratory of Thermodynamics in New Technologies, ETH Zurich. Innovative water-cooling system The processors and numerous other components in the new high performance computer are cooled with up to 60 degrees C warm water. This is made possible by an innovative cooling system that comprises micro-channel liquid coolers which are attached directly to the processors, where most heat is generated. With this chip-level cooling the thermal resistance between the processor and the water is reduced to the extent that even cooling water temperatures of up to 60 degrees C ensure that the operating temperatures of the processors remain well below the maximally allowed 85 degrees C. The high input temperature of the coolant results in an even higher-grade heat at the output, which in this case is up to 65 degrees C. Overall, water removes heat 4,000 times more efficiently than air. "With Aquasar we achieved an important milestone on the way to CO2-neutral data centers," said Dr. Bruno Michel, manager of Advanced Thermal Packaging at IBM Research - Zurich. "The next step in our research is to focus on the performance and characteristics of the cooling system which will be measured with an extensive system of sensors, in order to optimize it further." Aquasar is part of a three-year collaborative research program called "Direct use of waste heat from liquid-cooled supercomputers: the path to energy saving, emission-high performance computers and data centers." In addition to ETH Zurich and IBM Research - Zurich, the project also involves ETH Lausanne. It is supported by the Swiss Centre of Competence of support for Energy and Mobility (CCEM). Source: IBM From cbergstrom at pathscale.com Sun Jul 4 09:30:57 2010 From: cbergstrom at pathscale.com (=?ISO-8859-1?Q?=22C=2E_Bergstr=F6m=22?=) Date: Sun, 04 Jul 2010 23:30:57 +0700 Subject: [Beowulf] 6 TFlops, 450 MFlops/W watercooled IBM @ ETH In-Reply-To: <20100704160501.GQ31956@leitl.org> References: <20100704160501.GQ31956@leitl.org> Message-ID: <4C30B741.2070807@pathscale.com> Eugen Leitl wrote: > http://www.physorg.com/news197295578.html > > IBM Hot Water-Cooled Supercomputer Goes Live at ETH Zurich > > July 2, 2010 > > (PhysOrg.com) -- IBM has delivered a first-of-a-kind hot water-cooled > supercomputer to the Swiss Federal Institute of Technology Zurich (ETH > Zurich), marking a new era in energy-aware computing. The innovative system, > dubbed Aquasar, consumes up to 40 percent less energy than a comparable > air-cooled machine. Through the direct use of waste heat to provide warmth to > university buildings Others have already made the joke that Fermi could double as a space heater, but I wonder if that could end up being reality.. It's also not really breaking news that a water cooled system is more efficient than air, but how real world tested is this? 
From the pictures on the youtube video [1] I wonder how this could be adapted to the current trend moving away from cell and towards gpus.. (The conduit looked pretty well mounted to processors which would be much harder to do with a vertical PCIe card... Not to mention leaks..) [1] http://www.youtube.com/watch?v=FbGyAXsLzIc From samuel at unimelb.edu.au Sun Jul 4 17:11:56 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Mon, 05 Jul 2010 10:11:56 +1000 Subject: [Beowulf] PBS/Maui delay starting of several jobs for certain user In-Reply-To: <4C2DC947.8000903@informatik.tu-chemnitz.de> References: <4C2DC947.8000903@informatik.tu-chemnitz.de> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 02/07/10 21:11, Nico Mittenzwey wrote: > So I would like to tell PBS/Maui to wait x seconds before > starting another job of that particular user. Do you know > of any means to accomplish that (even if I have to change > the source)? Why not just limit the number of jobs they can run by using MAXJOB to a level that your file server can cope with ? Is your fileserver a RHEL box using ext3 by some chance ? cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkwxI0wACgkQO2KABBYQAh9F2ACfchBAZ+qGV1qwKl9vtRtRBTp6 ARoAnjTs5SqcCnz8iNCAfrM56H2H/liq =XPBD -----END PGP SIGNATURE----- -------------- next part -------------- An HTML attachment was scrubbed... URL: From hahn at mcmaster.ca Tue Jul 6 08:12:17 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue, 6 Jul 2010 11:12:17 -0400 (EDT) Subject: [Beowulf] HPL efficiency on Magny-Cours and Westmere? Message-ID: Hi all, can anyone tell me what kind of efficiency you're seeing on Magny-Cours and Westmere systems? by efficiency, I mean actual HPL performance as a fraction of cores * clock * 4 flops/cycle. I realize some of this can be drived from top500 results, but I'd also be be interested in single- socket and single-node scores for comparison. thanks, mark hahn. From joshua_mora at usa.net Tue Jul 6 08:57:48 2010 From: joshua_mora at usa.net (Joshua mora acosta) Date: Tue, 06 Jul 2010 10:57:48 -0500 Subject: [Beowulf] HPL efficiency on Magny-Cours and Westmere? Message-ID: <495ogFP5W5952S02.1278431868@web02.cms.usa.net> MC 12core at 2.2GHz: 91% on die, 86.7% on node 2 socket , above 82% on cluster. Joshua ------ Original Message ------ Received: 10:34 AM CDT, 07/06/2010 From: Mark Hahn To: Beowulf Mailing List Subject: [Beowulf] HPL efficiency on Magny-Cours and Westmere? > Hi all, > can anyone tell me what kind of efficiency you're seeing on Magny-Cours > and Westmere systems? by efficiency, I mean actual HPL performance as a > fraction of cores * clock * 4 flops/cycle. I realize some of this can > be drived from top500 results, but I'd also be be interested in single- > socket and single-node scores for comparison. > > thanks, mark hahn. 
> _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From prentice at ias.edu Tue Jul 6 10:32:23 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 06 Jul 2010 13:32:23 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2CC9AE.7030202@scalableinformatics.com> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> <4C2CC9AE.7030202@scalableinformatics.com> Message-ID: <4C3368A7.2090506@ias.edu> Joe Landman wrote: > Greg Rubino wrote: >> I have to say I partially agree with Prentice. I don't know if >> prestige directly translates into revenue, but if your a huge company > > Thats the thesis that I am saying I do not believe to be the case, and > Prentis is (as far as I understand it) indicating that he believes this > to be the case. > My original point has been quite perverted by this thread, that I'm abstaining from further comments. Except this one. -- Prentice From prentice at ias.edu Tue Jul 6 11:35:58 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 06 Jul 2010 14:35:58 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C336DDB.6050600@scalableinformatics.com> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> <4C2CC9AE.7030202@scalableinformatics.com> <4C3368A7.2090506@ias.edu> <4C336DDB.6050600@scalableinformatics.com> Message-ID: <4C33778E.3010303@ias.edu> >> Joe Landman wrote: >>> Greg Rubino wrote: >>>> I have to say I partially agree with Prentice. I don't know if >>>> prestige directly translates into revenue, but if your a huge company >>> >>> Thats the thesis that I am saying I do not believe to be the case, and >>> Prentis is (as far as I understand it) indicating that he believes this >>> to be the case. >>> I think my original point was misconstrued, and may have been completely forgotten in the subsequent conversation. Here's another attempt at conveying my original point: Using the big systems at the top of the Top500 list to get $/FLOP wouldn't be a useful exercise, because these systems are usually sold under NDA's and (probably) at a loss to the vendor. I posited that the vendors sell these systems Top500-winning systems at a loss (Roadrunner and Jaguar, in particulat) in exchange for other "intangibles": 1. Gain knowledge through the R&D that goes into building these systems. 2. Collaborating with the computer science geniuses at the customer's site (like the computer geniuses at LANL), which could lead to knowledge transfer. 3.Bragging rights (which I referred to as "prestige" in my original post, which may have lead to confusion). I further said that making it to the top of the list provides valuable media coverage, which equates to advertising for the the system vendor. This seems to be where the confusion/furor started. Other have gone on to argue whether or not that leads to a tangible return on investments or pleases shareholders, but that wasn't really my point. My main point was that it would be difficult or impossible to get the price of these systems. 
And since they account for so many of the FLOPS in the Top500, they could skew the results of the average $/FLOP in the Top500, or make such a number meaningless, since your average institution can't buy such a system under the same circumstances. Now some more analogies that could be akin to adding fuel to the fire: You could equate building such systems to making "the world's largest pizza". I'm sure the small pizza place that makes it loses significant money making it, but it will make the local papers, be in the Guinness book of world records forever (or until someone else makes a bigger one) and probably be mentioned on its signs, business cards, and menus. Clearly a publicity/advertising stunt. Can't think of any technology transfer that would make a normal sized pizza any better in this case. Car manufacturers often make exotic supercars for the same reason. Remember the Ford GT, or the Mercedes-Benz McLaren SLR? These exotics don't always make money, but they get a lot of press for the manufacturer, bring prestige to the brand, and if the car is sufficiently advanced technologically, the respect of competitors. Seldom do these cars turn a profit, but since Ford and Mercedes are large companies, their profits elsewhere subsidize these projects. And of course, the high technology in these cars usually trickles down to more proletarian models over the years. While we're on the topic of cars, here's a perfect analogy: how much $$$ did Henry Ford II (aka "The Deuce") spend to develop the GT40s, which he built solely to beat Ferrari at Le Mans? -- Prentice From deadline at eadline.org Tue Jul 6 12:49:54 2010 From: deadline at eadline.org (Douglas Eadline) Date: Tue, 6 Jul 2010 15:49:54 -0400 (EDT) Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C33778E.3010303@ias.edu> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> <4C2CC9AE.7030202@scalableinformatics.com> <4C3368A7.2090506@ias.edu> <4C336DDB.6050600@scalableinformatics.com> <4C33778E.3010303@ias.edu> Message-ID: <41790.76.98.139.137.1278445794.squirrel@mail.eadline.org> I have been indisposed, so I missed the beginning of this thread (seems like a good thing). In any case, years ago there were some people suggesting that both cost and power be added to the Top500 results. The cost issue was just as contentious as it is now. As I recall, the discussion always came down to determining the true cost to purchase/install a production cluster vs the cost to run HPL. Things like actual hardware cost, staging cost (HW and SW), optimization cost, and power/cooling are too variable to really track without a lot of effort (i.e. how do you account for a turn-key cluster vs. a student-built one?). The composite and piecemeal nature of clusters makes these types of numbers difficult to determine. As an aside, the Top500 is a great thing and I believe it is used for things for which it was never intended. It is, after all, one performance data point which has little relevance to the codes most people run or the number of cores most people use. -- Doug >>> Joe Landman wrote: >>>> Greg Rubino wrote: >>>>> I have to say I partially agree with Prentice. I don't know if >>>>> prestige directly translates into revenue, but if your a huge company >>>> >>>> Thats the thesis that I am saying I do not believe to be the case, and >>>> Prentis is (as far as I understand it) indicating that he believes >>>> this >>>> to be the case.
>>>> > > I think my original point was misconstrued, and may have been completely > forgotten in the subsequent conversation. Here's another attempt at > conveying my original point: > > Using the big systems at the top of the Top500 list to get $/FLOP > wouldn't be a useful exercise, because these systems are usually sold > under NDA's and (probably) at a loss to the vendor. > > I posited that the vendors sell these systems Top500-winning systems at > a loss (Roadrunner and Jaguar, in particulat) in exchange for other > "intangibles": > > 1. Gain knowledge through the R&D that goes into building these systems. > > 2. Collaborating with the computer science geniuses at the customer's > site (like the computer geniuses at LANL), which could lead to knowledge > transfer. > > 3.Bragging rights (which I referred to as "prestige" in my original > post, which may have lead to confusion). > > I further said that making it to the top of the list provides valuable > media coverage, which equates to advertising for the the system vendor. > This seems to be where the confusion/furor started. > > Other have gone on to argue whether or not that leads to a tangible > return on investments or pleases shareholders, but that wasn't really my > point. > > My main point was that it would be difficult or impossible to get the > price of these systems. And since they account for so many of the FLOPS > in the Top500, they could skew the results of the average $/FLOP in the > Top500, or make such a number meaningless, since your average > institution can't by such a system under the same circumstance. > > Now some more analogies that could be akin to adding fuel to the fire: > > You could equate building such systems to making "the world's largest > pizza". I'm sure the small pizza place who makes it losses significant > money making it, but it will make the local papers, be in the Guinness > book of world's records forever (or until someone else makes a bigger > one) and probably be mentioned on his signs, business cards, and menus. > Clearly a publicity/advertising stunt. Can't think of any technology > transfer that would make a normal sized pizza any better in this case. > > Car manufacturers often make exotic supercars for the same reason. > Remember the Ford GT, or the Mercedes-Benz McLaren SLR? These exotics > don't always make money, but the get a lot of press for the > manufacturer, bring prestige to the brand, and if the car is > sufficiently advanced enough technologically, the respect of > competitors. Seldom do these cars turn a profit, but since Ford and > Mercedes are large companies, their profits elsewhere subsidize these > projects. And of course, the high technology in these cars usually > trickles down to more proletarian models over the years. > > While we're on the topic of cars, here's a perfect analogy: How much $$$ > Did Henry Ford II (aka "The Deuce") spend to develop the GT40s, which he > built solely to beat Ferrari at Le Mans? > > -- > Prentice > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
From mdidomenico4 at gmail.com Tue Jul 6 19:01:08 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Tue, 6 Jul 2010 22:01:08 -0400 Subject: [Beowulf] hp dl170h g6? Message-ID: Does anyone on the list have HP DL170h G6 blade chassis on their floor? Ours came with the on-board NIC mac addresses programmed in descending order, I'm curious if this is something new, I've never seen this done before. Every machine we have on the floor now has them in ascending order. The downside to the nic enumeration is that in bios eth0 is eth0 and pxe's from eth0, however, when inside anaconda (redhat) eth0 is really eth1, and thus kickstart cant run. The only time i've seen this happen is when the nic drivers load out of order, but that's easy to fix in the initrd HP gave me a bunch of software workarounds, that I'm not overly happy with, but I'd rather not have to put in a bunch of workarounds all over the place for these specific machines. Does anyone know if this is firmware flash fixable? HP refuses to acknowledge the question... From jlb17 at duke.edu Tue Jul 6 19:24:38 2010 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Tue, 6 Jul 2010 22:24:38 -0400 (EDT) Subject: [Beowulf] hp dl170h g6? In-Reply-To: References: Message-ID: On Tue, 6 Jul 2010 at 10:01pm, Michael Di Domenico wrote > Does anyone on the list have HP DL170h G6 blade chassis on their > floor? Ours came with the on-board NIC mac addresses programmed in > descending order, I'm curious if this is something new, I've never > seen this done before. Every machine we have on the floor now has > them in ascending order. > > The downside to the nic enumeration is that in bios eth0 is eth0 and > pxe's from eth0, however, when inside anaconda (redhat) eth0 is really > eth1, and thus kickstart cant run. The only time i've seen this > happen is when the nic drivers load out of order, but that's easy to > fix in the initrd I demoed a chassis full of the HP SL2x170z G6s, and they had the same problem -- BIOS/PXE eth0 became eth1 in anaconda. I worked around it by passing anaconda "ksdevice=bootif" in the pxe config file. But, yeah, it's annoying having to work around oddness like that. I'm looking at that model or the DL160 G6 (standard 1U), and I imagine the 160 will have the same issue. > Does anyone know if this is firmware flash fixable? HP refuses to > acknowledge the question... And, if it is, how many person-years will it take to find said firmware flash file on HP's website (seriously, how broken is that site!?)? -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From beckerjes at mail.nih.gov Tue Jul 6 19:44:07 2010 From: beckerjes at mail.nih.gov (Jesse Becker) Date: Tue, 6 Jul 2010 22:44:07 -0400 Subject: [Beowulf] hp dl170h g6? In-Reply-To: References: Message-ID: <20100707024407.GN1324@mail.nih.gov> I've had all manner of enumeration problems like this with HP hardware, going back to the DL145G2 series. I've neither seen, nor tried, any firmware fixes. It's massively annoying, to be sure. I've a DL385 with 4 on-board NICs labeled as NET1 to NET4. They are correspondingly enumerated as eth2, eth3, eth0, eth1. On Tue, Jul 06, 2010 at 10:01:08PM -0400, Michael Di Domenico wrote: >Does anyone on the list have HP DL170h G6 blade chassis on their >floor? Ours came with the on-board NIC mac addresses programmed in >descending order, I'm curious if this is something new, I've never >seen this done before. Every machine we have on the floor now has >them in ascending order.
> >The downside to the nic enumeration is that in bios eth0 is eth0 and >pxe's from eth0, however, when inside anaconda (redhat) eth0 is really >eth1, and thus kickstart cant run. The only time i've seen this >happen is when the nic drivers load out of order, but that's easy to >fix in the initrd > >HP gave me a bunch of software workarounds, that I'm not overly happy >with, but I'd rather not have to put in a bunch of workarounds all >over the place for these specific machines. > >Does anyone know if this is firmware flash fixable? HP refuses to >acknowledge the question... >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Jesse Becker NHGRI Linux support (Digicon Contractor) From alscheinine at tuffmail.us Tue Jul 6 19:52:53 2010 From: alscheinine at tuffmail.us (Alan Louis Scheinine) Date: Tue, 06 Jul 2010 21:52:53 -0500 Subject: [Beowulf] hp dl170h g6? In-Reply-To: References: Message-ID: <4C33EC05.7070701@tuffmail.us> Just so I understand better, in the file /tftpboot/linux-install/pxelinux.cfg/ changing from eth0 to eth1 the line append ksdevice=eth1 [etc.] does not solve the problem? I've had similar problems but I don't remember how we solved it, we tried everything randomly and in a semi-panic. But since we are calmly discussing it in the mailing list, it would be nice to organize the question. There is ksdevice in the file described above and in addition there is "--device eth0" (or that could be "--device eth1") in the ks.cfg file. Changing neither one nor the other nor both solves the problem? Regards, Alan -- Alan Scheinine 200 Georgann Dr., Apt. E6 Vicksburg, MS 39180 Email: alscheinine at tuffmail.us Mobile phone: 225 288 4176 http://www.flickr.com/photos/ascheinine From alscheinine at tuffmail.us Tue Jul 6 21:22:42 2010 From: alscheinine at tuffmail.us (Alan Louis Scheinine) Date: Tue, 06 Jul 2010 23:22:42 -0500 Subject: [Beowulf] hp dl170h g6? In-Reply-To: References: Message-ID: <4C340112.7080007@tuffmail.us> "ksdevice=bootif" I had not previous heard of that option. Joshua Baker-LePain writes: > And, if it is, how many person-years will it take to find said > firmware flash file on HP's website (seriously, how broken is that site!?)? To find a ppd file for an HP printer, navigating their website does not work for me, whereas, using Google brings me to the right page on the HP web site. With regards to Google, it finds a suggestion from Jay Hilliard > > In your pxelinux config file: > > add ksdevice=bootif > > also add "IPAPPEND 2" to the end of the file > > In your kickstart file, don't specify a device: > "network --bootproto dhcp" There is also > http://fedoraproject.org/wiki/Anaconda/Kickstart#network Is this more complete? Or is it incorrect? I don't know, just asking. -- Alan Scheinine 200 Georgann Dr., Apt. E6 Vicksburg, MS 39180 Email: alscheinine at tuffmail.us Mobile phone: 225 288 4176 http://www.flickr.com/photos/ascheinine From mdidomenico4 at gmail.com Wed Jul 7 06:07:02 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed, 7 Jul 2010 09:07:02 -0400 Subject: [Beowulf] hp dl170h g6? In-Reply-To: <4C340112.7080007@tuffmail.us> References: <4C340112.7080007@tuffmail.us> Message-ID: fyi... The two suggestions i got from HP, might help others 1. 
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&objectID=c01430330&jumpid=reg_R1002_USEN 2. Please try renaming the interfaces using UDEV rules. Before changing the names, please note which MAC address corresponds to which NIC port and the way they are to be ordered. DMIDECODE command may be used to obtain this information. For changing the NIC's name allocation, please change the file: /etc/udev/rules.d/XX-net_persistent_names.rules inserting the correct names in reference to MAC address. Depending on the distro and version the XX could be any number and the name of the file could be persistent-net.rules. root at linux:~# vi /etc/udev/rules.d/30-net_persistent_names.rules [...] SUBSYSTEM=="net", ACTION=="add", SYSFS{address}=="00:00:00::00:01", IMPORT="/lib/udev/rename_netiface %k eth0" SUBSYSTEM=="net", ACTION=="add", SYSFS{address}=="00:00:00:00:00:02", IMPORT="/lib/udev/rename_netiface %k eth1" At the end of each line replace the name ethX with the name you want to use, ie instead of eth0 use eth2 and finally reboot the system. You may refer to the following link for more information on writing udev rules: http://www.reactivated.net/writing_udev_rules.html --- Both of which would work to workaround the problem, as well as the ksdevice options do. However, the trouble comes in that we have post install scripts inside our kickstart files, if the NIC's enumerate wrong they don't work. We expect every machine in our building to have eth0 (mgmt) and eth1 (regular) network connections and up until now 99% of them work this way. since at least one other person has gotten the systems the same way, i guess there's not much i can do. On Wed, Jul 7, 2010 at 12:22 AM, Alan Louis Scheinine wrote: > > ?"ksdevice=bootif" ?I had not previous heard of that option. > > Joshua Baker-LePain writes: >> >> And, if it is, how many person-years will it take to find said > >> firmware flash file on HP's website (seriously, how broken is that >> site!?)? > > To find a ppd file for an HP printer, navigating their website does not work > for me, whereas, using Google brings me to the right page on the HP web > site. > > With regards to Google, it finds a suggestion from Jay Hilliard >> >> In your pxelinux config file: >> >> add ksdevice=bootif >> >> also add "IPAPPEND 2" to the end of the file >> >> In your kickstart file, don't specify a device: >> ? "network --bootproto dhcp" > > There is also >> >> http://fedoraproject.org/wiki/Anaconda/Kickstart#network > > Is this more complete? ?Or is it incorrect? > I don't know, just asking. > > -- > > ?Alan Scheinine > ?200 Georgann Dr., Apt. E6 > ?Vicksburg, MS ?39180 > > ?Email: alscheinine at tuffmail.us > ?Mobile phone: 225 288 4176 > > ?http://www.flickr.com/photos/ascheinine > From holden.dapenor at gmail.com Sun Jul 4 22:36:17 2010 From: holden.dapenor at gmail.com (Holden Dapenor) Date: Mon, 5 Jul 2010 01:36:17 -0400 Subject: [Beowulf] diskless cluster questions Message-ID: How does diskless clustering work for those aspects of the OS that need to be unique for each node? For instance, network configuration and hostfiles need to be specified somewhere, but if all nodes boot the same root, then where is this information stored? -------------- next part -------------- An HTML attachment was scrubbed... URL: From roger at HPC.MsState.Edu Tue Jul 6 08:58:52 2010 From: roger at HPC.MsState.Edu (Roger L. Smith) Date: Tue, 06 Jul 2010 10:58:52 -0500 Subject: [Beowulf] HPL efficiency on Magny-Cours and Westmere? 
In-Reply-To: References: Message-ID: <4C3352BC.60505@HPC.MsState.Edu> We achieved ~87% on HPL across 3072 cores (256 nodes) with QDR IB and 2GB/core. If I remember correctly, we got about 91% on a single node (across 12 cores). We didn't run single-socket tests. This is on an IBM iDataPlex with 2.8GHz X5660 Westmeres. Mark Hahn wrote: > Hi all, > can anyone tell me what kind of efficiency you're seeing on Magny-Cours > and Westmere systems? by efficiency, I mean actual HPL performance as a > fraction of cores * clock * 4 flops/cycle. I realize some of this can > be drived from top500 results, but I'd also be be interested in single- > socket and single-node scores for comparison. > > thanks, mark hahn. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Roger L. Smith Senior Systems Administrator Mississippi State University High Performance Computing Collaboratory From per at computer.org Tue Jul 6 12:22:01 2010 From: per at computer.org (Per Jessen) Date: Tue, 06 Jul 2010 21:22:01 +0200 Subject: [Beowulf] 6 TFlops, 450 MFlops/W watercooled IBM @ ETH References: <20100704160501.GQ31956@leitl.org> <4C30B741.2070807@pathscale.com> Message-ID: "C. Bergström" wrote: > Eugen Leitl wrote: >> http://www.physorg.com/news197295578.html >> >> IBM Hot Water-Cooled Supercomputer Goes Live at ETH Zurich >> >> July 2, 2010 >> >> (PhysOrg.com) -- IBM has delivered a first-of-a-kind hot water-cooled >> supercomputer to the Swiss Federal Institute of Technology Zurich >> (ETH >> Zurich), marking a new era in energy-aware computing. The innovative >> system, dubbed Aquasar, consumes up to 40 percent less energy than a >> comparable air-cooled machine. Through the direct use of waste heat >> to provide warmth to university buildings > > Others have already made the joke that Fermi could double as a space > heater, but I wonder if that could end up being reality.. It's also > not really breaking news that a water cooled system is more efficient > than air, but how real world tested is this? The concept is very real world, but quite old. Back in the 80s I worked for a Danish bank - our IBM 3090s were water-cooled, as were the cooling-towers. Around '89 we installed heat-exchangers for re-using the hot cooling water for warming up the offices (in winter). /Per Jessen, Zürich From eagles051387 at gmail.com Wed Jul 7 07:33:30 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Wed, 7 Jul 2010 16:33:30 +0200 Subject: [Beowulf] diskless cluster questions In-Reply-To: References: Message-ID: It's actually easier if you use one OS that you know how to work with. The OS doesn't need to be unique for each node, but you do want the slave nodes to have as much RAM as possible. You will also need PXE to boot the slaves off the master node, as well as TFTP to transfer the boot image from the master to the slaves. When going diskless, all information on the slaves is stored in RAM; then, once you power them off, if I'm not mistaken, any data is sent back to the master for storage. > For instance, network configuration and hostfiles need to be specified > somewhere, but if all nodes boot the same root, then where is this > information stored?
> > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- Jonathan Aquilina -------------- next part -------------- An HTML attachment was scrubbed... URL: From ashley at pittman.co.uk Wed Jul 7 07:47:23 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Wed, 7 Jul 2010 15:47:23 +0100 Subject: [Beowulf] diskless cluster questions In-Reply-To: References: Message-ID: <83248991-E0D9-4D84-AE97-CCBA76622244@pittman.co.uk> On 5 Jul 2010, at 06:36, Holden Dapenor wrote: > How does diskless clustering work for those aspects of the OS that need to be unique for each node? Differently for each distribution. > For instance, network configuration and hostfiles need to be specified somewhere hostfiles are the same, network configuration including hostname can be done by dhcp. > but if all nodes boot the same root, then where is this information stored? This is not mandated by diskless configuration, you may choose to share / ro between clients and have a rw copy of /var for each client or you may choose to have an entire fs tree for each client. Another option might be to use fuse although I don't have much experience of that myself, it's basically the same but each client would have a copy-on-write version of /var and /etc to allow them to write to files in these directories. Ashley. -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From hearnsj at googlemail.com Wed Jul 7 08:02:19 2010 From: hearnsj at googlemail.com (John Hearns) Date: Wed, 7 Jul 2010 16:02:19 +0100 Subject: [Beowulf] diskless cluster questions In-Reply-To: References: Message-ID: On 5 July 2010 06:36, Holden Dapenor wrote: > How does diskless clustering work for those aspects of the OS that need to > be unique for each node? As Ashley says, you use DHCP for the network configuration. There is very little else you should need to configure differently on each individual host - for instance batch scheduler systems store information on batch nodes in a central place. All the node needs to do is be configured to know its batch master, and to start the batch system daemon, then wait for the jobs to come in. Any changes to (say) /etc/pbs.conf are generally made to all the cluster nodes identically. From eagles051387 at gmail.com Wed Jul 7 09:14:03 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Wed, 7 Jul 2010 18:14:03 +0200 Subject: [Beowulf] diskless cluster questions In-Reply-To: References: Message-ID: i thought any changes made get transmitted back to the master node and stored so that next time the node is powered up it can be restored to that state -------------- next part -------------- An HTML attachment was scrubbed... URL: From Nico.Mittenzwey at informatik.tu-chemnitz.de Wed Jul 7 09:15:39 2010 From: Nico.Mittenzwey at informatik.tu-chemnitz.de (Nico Mittenzwey) Date: Wed, 07 Jul 2010 18:15:39 +0200 Subject: [Beowulf] PBS/Maui delay starting of several jobs for certain user In-Reply-To: <201007051900.o65J09rf004655@bluewest.scyld.com> References: <201007051900.o65J09rf004655@bluewest.scyld.com> Message-ID: <4C34A82B.10000@informatik.tu-chemnitz.de> Christopher Samuel wrote: > Why not just limit the number of jobs they can run by > using MAXJOB to a level that your file server can cope > with ? 
Because I have free nodes and don't want them to idle while jobs are available. ;) And - as always - the users want their results asap... > Is your fileserver a RHEL box using ext3 by some chance ? No a rather big Lustre storage system - but the mentioned user alone creates an I/O load of 2.5GB/s. cheers Nico From Nico.Mittenzwey at informatik.tu-chemnitz.de Wed Jul 7 09:33:29 2010 From: Nico.Mittenzwey at informatik.tu-chemnitz.de (Nico Mittenzwey) Date: Wed, 07 Jul 2010 18:33:29 +0200 Subject: [Beowulf] diskless cluster questions In-Reply-To: References: Message-ID: <4C34AC59.4030405@informatik.tu-chemnitz.de> Holden Dapenor wrote: > How does diskless clustering work for those aspects of the OS that need > to be unique for each node? You may take a look at www.perceus.org which is a nice solution for diskless clusters and handles that aspects. cheers, Nico From rpnabar at gmail.com Wed Jul 7 09:54:32 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 7 Jul 2010 11:54:32 -0500 Subject: [Beowulf] PBS/Maui delay starting of several jobs for certain user In-Reply-To: <4C2DC947.8000903@informatik.tu-chemnitz.de> References: <4C2DC947.8000903@informatik.tu-chemnitz.de> Message-ID: On Fri, Jul 2, 2010 at 6:11 AM, Nico Mittenzwey wrote: > So I would like to tell PBS/Maui to wait x seconds before starting another > job of that particular user. Do you know of any means to accomplish that > (even if I have to change the source)? Would a modified prologue script do the job? You could have a bash script that looks at a username and adds a "wait y" delay there only for this user's jobs. y can be made some function of x and the current number of that users jobs? There could be race conditions but if you add a random delay before the logic, this might not be an issue. Yes, it's a hack but it might work? -- Rahul From mathog at caltech.edu Wed Jul 7 15:34:01 2010 From: mathog at caltech.edu (David Mathog) Date: Wed, 07 Jul 2010 15:34:01 -0700 Subject: [Beowulf] instances where a failed storage block is not all zero? Message-ID: With "modern" hardware are there currently any notable instances where a failed read of a hardware storage area block results in that missing data being filled in with something other than null bytes? For instance, if a disk swapped a bad block out of the inside of a file, or a region of a DVD goes bad. (Assuming that the software reading it can even go on beyond the failure, which is often not possible, for instance on many tapes.) I know for instance that when reading from damaged media dd conv=sync,noerror will fill in with null bytes, but there is a lot of other software out there... Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From lindahl at pbm.com Wed Jul 7 16:04:40 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 7 Jul 2010 16:04:40 -0700 Subject: [Beowulf] instances where a failed storage block is not all zero? In-Reply-To: References: Message-ID: <20100707230440.GB9218@bx9.net> On Wed, Jul 07, 2010 at 03:34:01PM -0700, David Mathog wrote: > With "modern" hardware are there currently any notable instances where a > failed read of a hardware storage area block results in that missing > data being filled in with something other than null bytes? Yes. You might get the wrong block due to a misdirected write or read, or you might get an old block because the previous write experienced "write tearing". If the OS knows it was unable to read a block and replaced it with zeros, it will throw an error. 
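(As an aside, if the goal is to know exactly which regions of a damaged medium were filled in rather than read, GNU ddrescue is probably a better tool than plain dd, since it records the bad areas in a map/log file instead of silently zero-filling. A rough sketch from memory -- the device, image and log names below are only placeholders, so check the man page before trusting the options:

    # first pass: copy what reads cleanly, don't retry failed blocks yet
    ddrescue -n /dev/sdb disk.img rescue.log
    # second pass: go back and retry the bad areas a few times
    ddrescue -r3 /dev/sdb disk.img rescue.log

The log then tells you which byte ranges never came back, so you don't have to guess from the fill pattern.)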
In Linux, the behavior depends on what you chose: panic on error, mount r/o, or continue. If the nulls are part of the filesystem metadata, all hell can break loose. The errors in the first paragraph won't be detected at all. They're rare, but... -- greg From hahn at mcmaster.ca Wed Jul 7 16:14:42 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 7 Jul 2010 19:14:42 -0400 (EDT) Subject: [Beowulf] instances where a failed storage block is not all zero? In-Reply-To: References: Message-ID: > With "modern" hardware are there currently any notable instances where a > failed read of a hardware storage area block results in that missing > data being filled in with something other than null bytes? For I'm surprised at the question: I expect failed reads to result in out-of-band errors, not zeros. a failed read on a disk, for instance, indicates that the ECC failed - considering that the ECC is quite strong, it would be surprising to encounter a failure which wasn't even detected by the ECC. on what kind of medium are you finding errors-returned-as-zero? > a region of a DVD goes bad. (Assuming that the software reading it can > even go on beyond the failure, which is often not possible, for instance > on many tapes.) I know it's sorta possible to read raw (extended) sectors from disks, but it's pretty deep voodoo. I guess I would expect the contents to not fail-to-zero on disks (which I guess use some kind of NRZ-like encoding). I wouldn't be surprised if damaged flash would read all-0 or all-1 (flash erase sets a block to all-1, right, and writing is basically clearing selective zeros?) > will fill in with null bytes, but there is a lot of other software out > there... I think it's a question of whether you're using an exotic interface or not - normal kernel block/char devices aren't ever going to do this. people in the forensics/recovery business would be the ones to ask. -mark hahn From ebiederm at xmission.com Wed Jul 7 20:12:53 2010 From: ebiederm at xmission.com (Eric W. Biederman) Date: Wed, 07 Jul 2010 20:12:53 -0700 Subject: [Beowulf] instances where a failed storage block is not all zero? In-Reply-To: (David Mathog's message of "Wed\, 07 Jul 2010 15\:34\:01 -0700") References: Message-ID: "David Mathog" writes: > With "modern" hardware are there currently any notable instances where a > failed read of a hardware storage area block results in that missing > data being filled in with something other than null bytes? For > instance, if a disk swapped a bad block out of the inside of a file, or > a region of a DVD goes bad. (Assuming that the software reading it can > even go on beyond the failure, which is often not possible, for instance > on many tapes.) I seem to remember people looking at the expected failure rates were saying that were are getting close to the point where a single pass through the disk will have a single undetected bit flip. Which is one of the reasons disk manufacturers have wanted to go to 4K blocks recently so the can get more error checking/correcting per bit. Eric From cbergstrom at pathscale.com Thu Jul 8 17:06:47 2010 From: cbergstrom at pathscale.com (=?ISO-8859-1?Q?=22C=2E_Bergstr=F6m=22?=) Date: Fri, 09 Jul 2010 07:06:47 +0700 Subject: [Beowulf] Slightly OT : GPU Optimized HPL and other benchmarks Message-ID: <4C366817.8040106@pathscale.com> Hi all We recently announced our new ENZO gpu solution for Nvidia hardware and now working on performance for various benchmarks. 
If anyone has a small gpu cluster or is planning to have one in the future we're looking for beta testers who can share kernels and give good feedback. Instead of a CUDA/OpenCL front-end we've opted for HMPP pragma approach which offers a lot of great benefits, but please contact me off list for more details on that. We're specifically interested to work with anyone or group who really wants to push the efficiency/performance of HPL (lapack). Thanks Christopher From mdidomenico4 at gmail.com Thu Jul 8 18:05:26 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Thu, 8 Jul 2010 21:05:26 -0400 Subject: [Beowulf] Slightly OT : GPU Optimized HPL and other benchmarks In-Reply-To: <4C366817.8040106@pathscale.com> References: <4C366817.8040106@pathscale.com> Message-ID: If you can point me/others towards some documentation on the system/api's, perhaps some of my/our researchers might be interested... 2010/7/8 "C. Bergstr?m" : > > Hi all > > We recently announced our new ENZO gpu solution for Nvidia hardware and now > working on performance for various benchmarks. ?If anyone has a small gpu > cluster or is planning to have one in the future we're looking for beta > testers who can share kernels and give good feedback. ?Instead of a > CUDA/OpenCL front-end we've opted for HMPP pragma approach which offers a > lot of great benefits, but please contact me off list for more details on > that. > > We're specifically interested to work with anyone or group who really wants > to push the efficiency/performance of HPL (lapack). > > Thanks > > Christopher > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From lathama at gmail.com Wed Jul 7 07:22:40 2010 From: lathama at gmail.com (Andrew Latham) Date: Wed, 7 Jul 2010 10:22:40 -0400 Subject: [Beowulf] diskless cluster questions In-Reply-To: References: Message-ID: Tools like DHCP can manage information for individual nodes as a server. The nodes can be identified by the network card MAC address. Things like IP, Netmask, Router, DNS, Hostname and other options can be set per MAC identifier. A great example is the mass deployment of VoIP Hardphones that ask a central server for a configuration based on the MAC address. ~ Andrew "lathama" Latham lathama at gmail.com * Learn more about OSS http://en.wikipedia.org/wiki/Open-source_software * Learn more about Linux http://en.wikipedia.org/wiki/Linux * Learn more about Tux http://en.wikipedia.org/wiki/Tux On Mon, Jul 5, 2010 at 1:36 AM, Holden Dapenor wrote: > How does diskless clustering work for those aspects of the OS that need to > be unique for each node? For instance, network configuration and hostfiles > need to be specified somewhere, but if all nodes boot the same root, then > where is this information stored? 
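For example, with ISC dhcpd a per-node entry might look roughly like the block below (an untested sketch -- the host name, MAC and addresses are made up, and the next-server/filename lines only matter if the node also PXE boots from that server):

    host node001 {
        hardware ethernet 00:1a:2b:3c:4d:5e;
        fixed-address 10.1.0.1;
        option host-name "node001";
        next-server 10.1.0.254;
        filename "pxelinux.0";
    }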
> > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > From scrusan at UR.Rochester.edu Wed Jul 7 08:15:46 2010 From: scrusan at UR.Rochester.edu (Steve Crusan) Date: Wed, 07 Jul 2010 11:15:46 -0400 Subject: [Beowulf] diskless cluster questions In-Reply-To: Message-ID: > then once you power them off if im not mistaken any data is sent back to the master for storage You should lose all of your changes if your OS is kept in RAM once you power off the node, reboot it, etc. On 7/7/10 10:33 AM, "Jonathan Aquilina" wrote: > > its actually easier if you use one os you know how to work with that way. the > os doesnt need to be unique for each node, but? you want the slave nodes to > have as much ram as possible. also you will need to use pxe to boot off the > master node as well as tftp to transfer the information from master to slaves. > when goign diskless all information on slaves is stored in ram. then once you > power them off if im not mistaken any data is sent back to the master for > storage > ? >> For instance, network configuration and hostfiles need to be specified >> somewhere, but if all nodes boot the same root, then where is this >> information stored? >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > > ---------------------- Steve Crusan System Administrator Center for Research Computing University of Rochester https://www.crc.rochester.edu/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From vallard at benincosa.com Wed Jul 7 09:09:59 2010 From: vallard at benincosa.com (Vallard Benincosa) Date: Wed, 7 Jul 2010 09:09:59 -0700 Subject: [Beowulf] diskless cluster questions In-Reply-To: References: Message-ID: With diskless clusters you also need to be aware of the many ways to do it: - RAM root - where all of the OS is loaded in memory - NFS root - which is what a lot of people seem to call diskless - RamRoot/NFS root hybrid - where some directories like /root live on RAM and /usr lives on NFS for example. We really like RAM root for HPC because you can make a small image (150-300MB with InfiniBand) that is portable, has great performance and is easy to reproduce and update. The disadvantages are if you run multiple applications where some library may not be in the image. In that sense, using NFS root works better for those environments that run a lot of different applications. However, many large clusters I have worked with are dedicated to one single application and ram root fits the bill perfectly. Like Ashley said all the host name info is configured via DHCP. Many people also put arguments in the PXE boot file to help specify additional parameters. I think the old Red Hat stateless did NFSROOT= for example. I have also seen many other homegrown ones where they throw everything but the kitchen sink in as arguments. In addition for configuring other devices (like InfiniBand IP addresses) instead of just IPADDR=10.3.0.201 in the config file there would be some script: IPADDR=10.3.0.$(`hostname` | sed 's/node//') These seem to be the tricks I see on doing this. You may also want to look into two projects that do stateless/diskless booting: xCAT and Perceus. 
Both of them allow for all three methods described above. There may be others as well. Hope that helps some what. On Wed, Jul 7, 2010 at 8:02 AM, John Hearns wrote: > On 5 July 2010 06:36, Holden Dapenor wrote: > > How does diskless clustering work for those aspects of the OS that need > to > > be unique for each node? > > As Ashley says, you use DHCP for the network configuration. > There is very little else you should need to configure differently on > each individual host - for instance batch scheduler systems store > information on > batch nodes in a central place. All the node needs to do is be > configured to know its batch master, and to start the batch system > daemon, then > wait for the jobs to come in. > Any changes to (say) /etc/pbs.conf are generally made to all the > cluster nodes identically. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Vallard http://sumavi.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From akshar.bhosale at gmail.com Wed Jul 7 12:01:59 2010 From: akshar.bhosale at gmail.com (akshar bhosale) Date: Thu, 8 Jul 2010 00:31:59 +0530 Subject: [Beowulf] shutting down pbs server and maui for half an hour will affect running jobs? Message-ID: hi, we have maintenance of pbs server so it is going down for half an hour ..will it affect running jobs?where is the timeout defined?can it be increased? on pbs mom side or pbs server side we need to change?any other parameter we need to check ?will it hold the already running jobs for half an hour? what care should we take in order to avoid jobs not getting killed?we have torque installed -------------- next part -------------- An HTML attachment was scrubbed... URL: From douglas.guptill at dal.ca Fri Jul 9 09:43:13 2010 From: douglas.guptill at dal.ca (Douglas Guptill) Date: Fri, 9 Jul 2010 13:43:13 -0300 Subject: [Beowulf] first cluster [was [OMPI users] trouble using openmpi under slurm] In-Reply-To: <4C35D614.1090607@ldeo.columbia.edu> References: <5CF41CDB-39F0-477C-B6D6-4F2E50BE6909@open-mpi.org> <2F04DA66-FE62-4131-8F8A-DAFB69668C46@open-mpi.org> <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> Message-ID: <20100709164313.GA25062@sopalepc> On Thu, Jul 08, 2010 at 09:43:48AM -0400, Gus Correa wrote: > Douglas Guptill wrote: >> On Wed, Jul 07, 2010 at 12:37:54PM -0600, Ralph Castain wrote: >> >>> No....afraid not. Things work pretty well, but there are places >>> where things just don't mesh. Sub-node allocation in particular is >>> an issue as it implies binding, and slurm and ompi have conflicting >>> methods. >>> >>> It all can get worked out, but we have limited time and nobody cares >>> enough to put in the effort. Slurm just isn't used enough to make it >>> worthwhile (too small an audience). >> >> I am about to get my first HPC cluster (128 nodes), and was >> considering slurm. We do use MPI. >> >> Should I be looking at Torque instead for a queue manager? >> > Hi Douglas > > Yes, works like a charm along with OpenMPI. > I also have MVAPICH2 and MPICH2, no integration w/ Torque, > but no conflicts either. Thanks, Gus. 
After some lurking and reading, I plan this: Debian (lenny) + fai - for compute-node operating system install + Torque - job scheduler/manager + MPI (Intel MPI) - for the application + MPI (OpenMP) - alternative MPI Does anyone see holes in this plan? Thanks, Douglas -- Douglas Guptill voice: 902-461-9749 Research Assistant, LSC 4640 email: douglas.guptill at dal.ca Oceanography Department fax: 902-494-3877 Dalhousie University Halifax, NS, B3H 4J1, Canada From douglas.guptill at dal.ca Fri Jul 9 13:57:26 2010 From: douglas.guptill at dal.ca (Douglas Guptill) Date: Fri, 9 Jul 2010 17:57:26 -0300 Subject: [Beowulf] first cluster In-Reply-To: References: <2F04DA66-FE62-4131-8F8A-DAFB69668C46@open-mpi.org> <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> Message-ID: <20100709205726.GA7313@sopalepc> On Fri, Jul 09, 2010 at 02:19:53PM -0400, Mark Hahn wrote: >> Debian (lenny) > > why? centos is generally considered the safest choice, > unless you're religiously committed to debian. Almost religiously. I have found it a very stable platform for everything up to clusters. If Debian fails to do the job, CentoOS is my backup plan. >> + fai - for compute-node operating system install > > do you explicitly want a diskful install? I think there's pretty wide > consensus that nfs root (or at least net-loaded ram image) clusters are > better. Thank you for that opinion, which is new to me. I believe fai can do a variety of install-types, including diskful, and nfs root. But then, I am still in the planning stage, and have no practical experience. Thanks, Douglas. -- Douglas Guptill voice: 902-461-9749 Research Assistant, LSC 4640 email: douglas.guptill at dal.ca Oceanography Department fax: 902-494-3877 Dalhousie University Halifax, NS, B3H 4J1, Canada From rpnabar at gmail.com Fri Jul 9 14:00:16 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 9 Jul 2010 16:00:16 -0500 Subject: [Beowulf] Question about maui scheduler and reservations on a node: logical AND or OR Message-ID: If there are twin reservations set for the same timespan on a certain node do they get ANDed or ORed? setres -u userfoo -s '+5' -d '10:00:00' node1 setres -u userbar -s '+5' -d '10:00:00' node1 Will userfoo have access to the node or userbar or neither? Or is it the first reservation that is always active? Actual situation: Due to the way funding and priorities work on our cluster there are weeks in which I am supposed to give exclusive access to a certain user on a certain node. But the downside is that sometimes for debugging or maintainance etc. I might have to use that same node to run system jobs from a "maintainance" user. It would be convenient of there was a way to tell the scheduler "Reserve node foo for use by either foouser or baruser" A related question: showres shows me reservations but doesn't indicate what nodes these have been made for. e.g. stotz.58 User - -7:02:35:04 54:01:21:17 61:03:56:21 1/8 Fri Jul 2 13:03:39 stotz.59 User - -7:02:35:04 54:01:21:17 61:03:56:21 1/8 Fri Jul 2 13:03:39 But is there a way to know what node stotz.58 is active for? PS. I had asked the first question a few weeks ago on the MAUI list but received no replies hence I thought I should check if anyone on this list has a tip. Sorry if someone gets the question twice! 
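(If I am reading the Maui admin docs right, setres seems to accept a colon-delimited user list for a single reservation, which might sidestep the AND/OR question entirely -- something like the line below, though I have not tested it and may be misremembering the syntax:

    setres -u userfoo:userbar -s '+5' -d '10:00:00' node1

If that works, one reservation would grant access to both users instead of stacking two on the same node.)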
-- Rahul From gus at ldeo.columbia.edu Fri Jul 9 16:06:05 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 09 Jul 2010 19:06:05 -0400 Subject: [Beowulf] first cluster [was [OMPI users] trouble using openmpi under slurm] In-Reply-To: <20100709164313.GA25062@sopalepc> References: <5CF41CDB-39F0-477C-B6D6-4F2E50BE6909@open-mpi.org> <2F04DA66-FE62-4131-8F8A-DAFB69668C46@open-mpi.org> <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> Message-ID: <4C37AB5D.9050308@ldeo.columbia.edu> Douglas Guptill wrote: > On Thu, Jul 08, 2010 at 09:43:48AM -0400, Gus Correa wrote: >> Douglas Guptill wrote: >>> On Wed, Jul 07, 2010 at 12:37:54PM -0600, Ralph Castain wrote: >>> >>>> No....afraid not. Things work pretty well, but there are places >>>> where things just don't mesh. Sub-node allocation in particular is >>>> an issue as it implies binding, and slurm and ompi have conflicting >>>> methods. >>>> >>>> It all can get worked out, but we have limited time and nobody cares >>>> enough to put in the effort. Slurm just isn't used enough to make it >>>> worthwhile (too small an audience). >>> I am about to get my first HPC cluster (128 nodes), and was >>> considering slurm. We do use MPI. >>> >>> Should I be looking at Torque instead for a queue manager? >>> >> Hi Douglas >> >> Yes, works like a charm along with OpenMPI. >> I also have MVAPICH2 and MPICH2, no integration w/ Torque, >> but no conflicts either. > > Thanks, Gus. > > After some lurking and reading, I plan this: > Debian (lenny) > + fai - for compute-node operating system install > + Torque - job scheduler/manager > + MPI (Intel MPI) - for the application > + MPI (OpenMP) - alternative MPI > > Does anyone see holes in this plan? > > Thanks, > Douglas Hi Douglas I never used Debian, fai, or Intel MPI. We have two clusters with cluster management software, i.e., mostly the operating system install stuff. I made a toy Rocks cluster out of old computers. Rocks is a minimum-hassle way to deploy and maintain a cluster. Of course you can do the same from scratch, or do more, or do better, which makes some people frown at Rocks. However, Rocks works fine, particularly if your network(s) is (are) Gigabit Ethernet, and if you don't mix different processor architectures (i.e. only i386 or only x86_64, although there is some support for mixed stuff). It is developed/maintained by UCSD under an NSF grant (I think). It's been around for quite a while too. You may want to take a look, perhaps experiment with a subset of your nodes before you commit: http://www.rocksclusters.org/wordpress/ There is a decent user guide: http://www.rocksclusters.org/roll-documentation/base/5.3/ and additional documentation/tutorials: http://www.rocksclusters.org/wordpress/?page_id=4 The basic software comes in what they call "rolls". The (default) OS is actually CentOS. They only support a few "Red-Hat-type" distributions (IIRR, RHEL and Scientific Linux), but CentOS is fine. You could use the mandatory rolls (Kernel/Boot, Core, OS disks 1,2. I would suggest installing all OS disks, so as to have any packages that you may need later on. In addition, there a roll with Torque+Maui that you can get from the Univ. of Tromso, Norway: ftp://ftp.uit.no/pub/linux/rocks/torque-roll/ If you want to install Torque, *don't install the SGE (Sun Grid Engine) roll*. 
It is either one resource manager or the other (they're incompatible). I am a big fan and old user of Torque, so my bias is to recommend Torque, but other people prefer SGE. The basic software takes care of compute node installation, administration of user accounts, etc. It can be customized in several ways (e.g. if you have two networks, one for MPI, another for cluster control and I/O, which I would recommend). It also includes a basic web page for your cluster (via Wordpress), which you can also customize, and very nice web-based monitoring of your nodes through Ganglia. It also has support for upgrades, and they tend to come up with a new release once a year or so. There is also a large user base and an active mailing list: https://lists.sdsc.edu/mailman/listinfo/npaci-rocks-discussion http://marc.info/?l=npaci-rocks-discussion You can build OpenMPI (and MPICH2) from source, with any/all your favorite compilers, and install any compilers and all external software (even Matlab, if you are so inclined, or your users demand) in a NFS mounted directory (typically /share/apps in Rocks), so as to make them accessible by the compute nodes. You could do the same for, say, NetCDF libraries and utilities (NCO, NCL), etc. What is the interconnect/network hardware you have for MPI? Gigabit Ethernet? Infiniband? Myrinet? Other? If Gigabit Ethernet Rocks won't have any problem. If Infiniband you may need to add the OFED packages, but they may come with CentOS now, I am not sure. If Myrinet, I am not sure, Myrinet provided a Rocks roll up to Rocks 5.0, but I am not sure about the current status (Rocks is now 5.3). If you are going to handle a variety of different compilers, MPI flavors, with various versions, etc, I recommend using the "Environment module" package. It is a quite convenient (and consistent) way to allow users to switch from one environment to another, change compilers, MPI, etc, allowing good flexibility. You can install "environment modules" separately (say via yum or RPM) with no compatibility issues whatsoever with Rocks: http://modules.sourceforge.net/ I hope this helps. Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- From hahn at mcmaster.ca Fri Jul 9 16:11:18 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 9 Jul 2010 19:11:18 -0400 (EDT) Subject: [Beowulf] first cluster[B In-Reply-To: <20100709205726.GA7313@sopalepc> References: <2F04DA66-FE62-4131-8F8A-DAFB69668C46@open-mpi.org> <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> Message-ID: >>> Debian (lenny) >> >> why? centos is generally considered the safest choice, >> unless you're religiously committed to debian. > > Almost religiously. I have found it a very stable platform for > everything up to clusters. OK. you should know that the stability comes from linux itself and the underlying user-level packages, which have nothing to do with the distro (any of them). > I believe fai can do a variety of install-types, including diskful, > and nfs root. But then, I am still in the planning stage, and have no > practical experience. well, the thing about nfs root is that there's almost no installation, per se. 
if you wanted, you could boot the nodes off a live master's root filesystem. normally, master and node images are kept mostly separate, though, because it's handy to avoid entangling them (ie, you may not want mysql-server installed on compute nodes, but only on the master, etc. or just different versions.) From samuel at unimelb.edu.au Sun Jul 11 22:53:26 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Mon, 12 Jul 2010 15:53:26 +1000 Subject: [Beowulf] shutting down pbs server and maui for half an hour willaffect running jobs? In-Reply-To: References: Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 08/07/10 05:01, akshar bhosale wrote: > we have maintenance of pbs server so it is going down > for half an hour ..will it affect running jobs? It shouldn't, though if they finish during that time they may not get full information logged about their state in the pbs_server logs. cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw6rdUACgkQO2KABBYQAh/SSACfQYsyhsW51WwgBMwqbk2ILEui xrIAnR/vtUWXioE1g5OgY++sPmjHaXWa =XZw+ -----END PGP SIGNATURE----- -------------- next part -------------- An HTML attachment was scrubbed... URL: From douglas.guptill at dal.ca Mon Jul 12 10:02:34 2010 From: douglas.guptill at dal.ca (Douglas Guptill) Date: Mon, 12 Jul 2010 14:02:34 -0300 Subject: [Beowulf] first cluster In-Reply-To: References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> Message-ID: <20100712170234.GB6134@sopalepc> Ah Ha. I see the point of a non-diskful, or nfs root, install for the compute nodes. One image to update/change, instead of a whole bunch. Thanks, Douglas. On Fri, Jul 09, 2010 at 07:11:18PM -0400, Mark Hahn wrote: > well, the thing about nfs root is that there's almost no installation, > per se. if you wanted, you could boot the nodes off a live master's > root filesystem. normally, master and node images are kept mostly > separate, though, because it's handy to avoid entangling them > (ie, you may not want mysql-server installed on compute nodes, but only > on the master, etc. or just different versions.) And from Steve Crusan: > As for the diskfull install, netbooting and statelite (NFS root) solutions > are very easy to scale and customize. Diskfull installs seem to be less > flexible in IMO. 
-- Douglas Guptill voice: 902-461-9749 Research Assistant, LSC 4640 email: douglas.guptill at dal.ca Oceanography Department fax: 902-494-3877 Dalhousie University Halifax, NS, B3H 4J1, Canada From gus at ldeo.columbia.edu Mon Jul 12 12:02:40 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 12 Jul 2010 15:02:40 -0400 Subject: [Beowulf] first cluster In-Reply-To: <20100712170234.GB6134@sopalepc> References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> Message-ID: <4C3B66D0.2080204@ldeo.columbia.edu> Hi Doug Consider disk for: A) swap space (say, if the user programs are large, or you can't buy a lot of RAM, etc); I wonder if swapping over NFS would be efficient for HPC. Disk may be a simple and cost effective solution. B) input/output data files that your application programs may require (if they already work in stagein-stageout mode, or if they do I/O so often that a NFS mounted file system may get overwhelmed, hence reading/writing on local disk may be preferred). C) Would diskless scaling be a real big advantage for a small/medium size cluster, say up to ~200 nodes? D) Most current node chassis have hot-swappable disks, not hard to replace, in case of failure. E) booting when the NFS root server is not reachable Disks don't prevent one to keep a single image and distribute it consistently across nodes, do they? In any case, I suppose you could both have disks and boot with NFS root. But if you have disks, is there really a point in doing so? I guess there are old threads about this in the list archives. Just some thoughts. Gus Correa Douglas Guptill wrote: > Ah Ha. I see the point of a non-diskful, or nfs root, install for the > compute nodes. One image to update/change, instead of a whole bunch. > > Thanks, > Douglas. > > On Fri, Jul 09, 2010 at 07:11:18PM -0400, Mark Hahn wrote: > >> well, the thing about nfs root is that there's almost no installation, >> per se. if you wanted, you could boot the nodes off a live master's >> root filesystem. normally, master and node images are kept mostly >> separate, though, because it's handy to avoid entangling them >> (ie, you may not want mysql-server installed on compute nodes, but only >> on the master, etc. or just different versions.) > > And from Steve Crusan: > >> As for the diskfull install, netbooting and statelite (NFS root) solutions >> are very easy to scale and customize. Diskfull installs seem to be less >> flexible in IMO. > > From bill at Princeton.EDU Mon Jul 12 13:28:29 2010 From: bill at Princeton.EDU (Bill Wichser) Date: Mon, 12 Jul 2010 16:28:29 -0400 Subject: [Beowulf] IB problem with openmpi 1.2.8 Message-ID: <4C3B7AED.9080606@princeton.edu> Machine is an older Intel Woodcrest cluster with a two tiered IB infrastructure with Topspin/Cisco 7000 switches. The core switch is a SFS-7008P with a single management module which runs the SM manager. The cluster runs RHEL4 and was upgraded last week to kernel 2.6.9-89.0.26.ELsmp. The openib-1.4 remained the same. Pretty much stock. After rebooting, the IB cards in the nodes remained in the INIT state. I rebooted the chassis IB switch as it appeared that no SM was running. No help. I manually started an opensm on a compute node telling it to ignore other masters as initially it would only come up in STANDBY. 
This turned all the nodes' IB ports to active and I thought that I was done. ibdiagnet complained that there were two masters. So I killed the opensm and now it was happy. osmtest -f c/osmtest -f a comes back with OSMTEST: TEST "All Validations" PASS. ibdiagnet -ls 2.5 -lw 4x finds all my switches and nodes with everything coming up roses. The problem is that openmpi 1.2.8 with Intel 11.1.074 fails when the node count goes over 32 (or maybe 40). This worked fine in the past, before the reboot. User apps are failing as well as IMB v3.2. I've increased the timeout using the "mpiexec -mca btl_openib_ib_timeout 20" which helped for 48 nodes but when increasing to 64 and 128 it didn't help at all. Typical error message follow. Right now I am stuck. I'm not sure what or where the problem might be. Nor where to go next. If anyone has a clue, I'd appreciate hearing it! Thanks, Bill typical error messages [0,1,33][btl_openib_component.c:1371:btl_openib_component_progress] from woodhen-050 to: woodhen-036 error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 182937592248 opcode 0 [0,1,36][btl_openib_component.c:1371:btl_openib_component_progress] from woodhen-084 to: woodhen-085 error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 5840952 opcode 0 [0,1,40][btl_openib_component.c:1371:btl_openib_component_progress] from woodhen-098 to: woodhen-096 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 182947573944 opcode 0 -------------------------------------------------------------------------- The InfiniBand retry count between two MPI processes has been exceeded. "Retry count" is defined in the InfiniBand spec 1.2 (section 12.7.38): The total number of times that the sender wishes the receiver to retry timeout, packet sequence, etc. errors before posting a completion error. This error typically means that there is something awry within the InfiniBand fabric itself. You should note the hosts on which this error has occurred; it has been observed that rebooting or removing a particular host from the job can sometimes resolve this issue. Two MCA parameters can be used to control Open MPI's behavior with respect to the retry count: * btl_openib_ib_retry_count - The number of times the sender will attempt to retry (defaulted to 7, the maximum value). * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted to 10). The actual timeout value used is calculated as: 4.096 microseconds * (2^btl_openib_ib_timeout) See the InfiniBand spec 1.2 (section 12.7.34) for more details. -------------------------------------------------------------------------- -------------------------------------------------------------------------- DIFFERENT RUN: [0,1,92][btl_openib_component.c:1371:btl_openib_component_progress] from woodhen-157 to: woodhen-081 error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 183541169080 opcode 0 ... 
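To put the formula quoted in that error text into concrete numbers, here is the spec arithmetic worked through in a few lines of C (it is nothing more than arithmetic - no Open MPI or verbs calls - and the retry count of 7 is the default named above):

#include <stdio.h>

/* Effective InfiniBand local ACK timeout, per the formula quoted in the
   error help text: 4.096 microseconds * 2^btl_openib_ib_timeout.
   btl_openib_ib_retry_count (default 7) multiplies how long a QP keeps
   trying before RETRY EXCEEDED is reported. */
int main(void)
{
    const double base_us = 4.096;   /* microseconds, from the IB spec */
    const int    retries = 7;       /* btl_openib_ib_retry_count default */
    const int    exps[3] = { 10, 14, 20 };

    for (int i = 0; i < 3; i++) {
        double per_try_s = base_us * 1e-6 * (double)(1u << exps[i]);
        printf("timeout=%2d -> %g s per attempt, about %g s total across %d retries\n",
               exps[i], per_try_s, per_try_s * retries, retries);
    }
    return 0;
}

At the default of 10 each attempt waits only about 4 ms, so seven retries are exhausted in well under a second; at 20 each attempt is allowed roughly 4.3 s. If a fabric genuinely needs seconds to deliver an ACK, raising the timeout is masking a problem (subnet manager, a marginal cable or port, a congested core link) rather than curing it - which matches the advice in the quoted help text.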
From samuel at unimelb.edu.au Mon Jul 12 18:21:26 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 13 Jul 2010 11:21:26 +1000 Subject: [Beowulf] first cluster In-Reply-To: <4C3B66D0.2080204@ldeo.columbia.edu> References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 13/07/10 05:02, Gus Correa wrote: > I wonder if swapping over NFS would be efficient for HPC. There are out of tree patches for swap over NFS (and I've seen assertions that SuSE SLES 11 includes it) which has been doing the rounds for a few years now (originally by Peter Zijlstra but now maintained by Suresh Jayaraman) and appears to have last been updated October 2009. http://www.suse.de/~sjayaraman/patches/swap-over-nfs/ The last post (I could find) for it was here, it includes a diffstat to show which parts of the kernel are touched: http://lwn.net/Articles/355350/ This posting of Peter's from 2007 explains a bit more about the patches and why it is a hard problem: http://lwn.net/Articles/256462/ My personal feeling is "here be dragons". ;-) cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw7v5YACgkQO2KABBYQAh9J8wCffmBVB8cbBRTCYSAq6XGqBEdB ngEAnjWKlxtsA9ok7YJvdtX8cTCGl6FL =3XC7 -----END PGP SIGNATURE----- -------------- next part -------------- An HTML attachment was scrubbed... URL: From samuel at unimelb.edu.au Mon Jul 12 18:43:42 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 13 Jul 2010 11:43:42 +1000 Subject: [Beowulf] shutting down pbs server and maui for half an hour will affect running jobs? In-Reply-To: References: Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 13/07/10 01:21, akshar bhosale wrote: > Thanks for your information, but do i need to change > anything for increasing timeout if i dont want to kill > running jobs.. If you have jobs that will hit their walltime whilst the server is down then they will get killed by the pbs_mom unless you extend their walltime first. cheers! Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw7xM4ACgkQO2KABBYQAh9SyACeOfjdCNER7W6GHr53Pm8JN7ks U+EAnjtF9WvQfFxDU8raIMyPDx7nmXkq =ISQy -----END PGP SIGNATURE----- -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpnabar at gmail.com Mon Jul 12 21:04:48 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 12 Jul 2010 23:04:48 -0500 Subject: [Beowulf] Network problem: Why are ARP discovery requests sent to specific addresses instead of a broadcast domain Message-ID: I am puzzled by a bunch of ARP requests on my network that I captured using tcpdump. Shouldn't ARP discovery requests always be sent to a broadcast address? 
I have requests of the type below which seemingly are addressed to a specific mAC address. 00:26:b9:58:d7:2f > 00:26:b9:58:eb:b8, ARP, length 42: arp who-has 10.0.0.36 tell 10.0.3.2 00:26:b9:58:eb:b8 > 00:26:b9:58:d7:2f, ARP, length 60: arp reply 10.0.0.36 is-at 00:26:b9:58:eb:b8 Now if mumble:d7:2f already knew that mumble:eb:b8 was 10.0.0.36 (which it indeed is) then why would it send out an ARP discovery request? I can see that something doesn't make sense here but I cannot figure out what is causing the problem. Any ideas? Has anyone seen this before? -- Rahul From patrick at myri.com Mon Jul 12 21:25:16 2010 From: patrick at myri.com (Patrick Geoffray) Date: Tue, 13 Jul 2010 00:25:16 -0400 Subject: [Beowulf] Network problem: Why are ARP discovery requests sent to specific addresses instead of a broadcast domain In-Reply-To: References: Message-ID: <4C3BEAAC.6000306@myri.com> Rahul, On 7/13/2010 12:04 AM, Rahul Nabar wrote: > I am puzzled by a bunch of ARP requests on my network that I captured > using tcpdump. Shouldn't ARP discovery requests always be sent to a > broadcast address? No, the kernel regularly refreshes the entries in the ARP cache with unicast requests. If that fails, then it sends the expensive broadcasts. Patrick From rpnabar at gmail.com Mon Jul 12 21:29:13 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 12 Jul 2010 23:29:13 -0500 Subject: [Beowulf] first cluster In-Reply-To: <4C3B66D0.2080204@ldeo.columbia.edu> References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> Message-ID: On Mon, Jul 12, 2010 at 2:02 PM, Gus Correa wrote: > Consider disk for: > > A) swap space (say, if the user programs are large, > or you can't buy a lot of RAM, etc); Out of curiosity, is there the possibility of running a "swapless" compute-node? I mean most HPC nodes already have fairly generous RAM and once swapping to disk starts performance is degraded (severely?). Are there non-problem scenarios where one does desire swapping to disks? > D) Most current node chassis have hot-swappable disks, not hard to replace, > in case of failure. Hot-swappable disks are great on head nodes but on compute-nodes whenever I hear "redundant" or "hot swappable", I see it as an inefficiency. Or a excessive feature that could be traded off for a cost saving. (of course, sometimes hands are tied if the server comes with that feature "standard") What do others think? -- Rahul From rpnabar at gmail.com Mon Jul 12 21:48:47 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 12 Jul 2010 23:48:47 -0500 Subject: [Beowulf] Network problem: Why are ARP discovery requests sent to specific addresses instead of a broadcast domain In-Reply-To: <4C3BEAAC.6000306@myri.com> References: <4C3BEAAC.6000306@myri.com> Message-ID: On Mon, Jul 12, 2010 at 11:25 PM, Patrick Geoffray wrote: > Rahul, > > On 7/13/2010 12:04 AM, Rahul Nabar wrote: >> >> I am puzzled by a bunch of ARP requests on my network that I captured >> using tcpdump. Shouldn't ARP discovery requests always be sent to a >> broadcast address? > > No, the kernel regularly refreshes the entries in the ARP cache with unicast > requests. If that fails, then it sends the expensive broadcasts. Thanks Patrick. I wasn't aware of this. 
I guess it makes sense now that I found the correct section of the RFP (http://tools.ietf.org/html/rfc1122#page-22). I see the converse situation too: Some ARP replies are being sent to a broadcast domain instead of a single MAC. Is that normal too? 00:26:b9:58:e5:9f > ff:ff:ff:ff:ff:ff, ARP, length 60: arp reply 172.16.0.29 is-at 00:26:b9:58:e5:9f 00:26:b9:56:38:71 > ff:ff:ff:ff:ff:ff, ARP, length 60: arp reply 172.16.0.14 is-at 00:26:b9:56:38:71 I'd have (naively) expected these replies to go to the specific MAC which had issued an ARP request on 172.16.0.29 or 172.16.0.14. -- Rahul From tom.ammon at utah.edu Mon Jul 12 22:04:30 2010 From: tom.ammon at utah.edu (Tom Ammon) Date: Mon, 12 Jul 2010 23:04:30 -0600 Subject: [Beowulf] Network problem: Why are ARP discovery requests sent to specific addresses instead of a broadcast domain In-Reply-To: References: <4C3BEAAC.6000306@myri.com> Message-ID: <4C3BF3DE.2030002@utah.edu> This is called a gratuitous ARP. Used to update the ARP caches of other nodes. On 07/12/2010 10:48 PM, Rahul Nabar wrote: > On Mon, Jul 12, 2010 at 11:25 PM, Patrick Geoffray wrote: > >> Rahul, >> >> On 7/13/2010 12:04 AM, Rahul Nabar wrote: >> >>> I am puzzled by a bunch of ARP requests on my network that I captured >>> using tcpdump. Shouldn't ARP discovery requests always be sent to a >>> broadcast address? >>> >> No, the kernel regularly refreshes the entries in the ARP cache with unicast >> requests. If that fails, then it sends the expensive broadcasts. >> > Thanks Patrick. I wasn't aware of this. I guess it makes sense now > that I found the correct section of the RFP > (http://tools.ietf.org/html/rfc1122#page-22). > > I see the converse situation too: Some ARP replies are being sent to a > broadcast domain instead of a single MAC. Is that normal too? > > 00:26:b9:58:e5:9f> ff:ff:ff:ff:ff:ff, ARP, length 60: arp reply > 172.16.0.29 is-at 00:26:b9:58:e5:9f > 00:26:b9:56:38:71> ff:ff:ff:ff:ff:ff, ARP, length 60: arp reply > 172.16.0.14 is-at 00:26:b9:56:38:71 > > I'd have (naively) expected these replies to go to the specific MAC > which had issued an ARP request on 172.16.0.29 or 172.16.0.14. > > -- -------------------------------------------------------------------- Tom Ammon Network Engineer Office: 801.587.0976 Mobile: 801.674.9273 Center for High Performance Computing University of Utah http://www.chpc.utah.edu From samuel at unimelb.edu.au Mon Jul 12 22:55:23 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 13 Jul 2010 15:55:23 +1000 Subject: [Beowulf] first cluster In-Reply-To: References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 13/07/10 14:29, Rahul Nabar wrote: > Out of curiosity, is there the possibility of running > a "swapless" compute-node? Yes of course, it just means that the kernel no longer has the option of paging out infrequently accessed dirty pages to free space for active processes. cheers! 
Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw7/8sACgkQO2KABBYQAh+AOQCgjIxb/CaTsElcZi0bTiKfjTns tn8AoISVbA8hwJgwFIs/rADfIJN8FBg/ =NT3M -----END PGP SIGNATURE----- -------------- next part -------------- An HTML attachment was scrubbed... URL: From samuel at unimelb.edu.au Mon Jul 12 22:56:41 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 13 Jul 2010 15:56:41 +1000 Subject: [Beowulf] shutting down pbs server and maui for half an hour will affect running jobs? In-Reply-To: References: Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 13/07/10 13:12, akshar bhosale wrote: > thanks..any other related info ? Not that comes to mind, I'm afraid! - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw8ABkACgkQO2KABBYQAh/GPACfel9pCS4/clqgwCFCDB2Fv+pS Vp4AnR9dxZHjxKpnQMaFjIZxE6t8XaQf =IvbE -----END PGP SIGNATURE----- -------------- next part -------------- An HTML attachment was scrubbed... URL: From reuti at staff.uni-marburg.de Tue Jul 13 05:07:27 2010 From: reuti at staff.uni-marburg.de (Reuti) Date: Tue, 13 Jul 2010 14:07:27 +0200 Subject: [Beowulf] first cluster In-Reply-To: References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> Message-ID: <158EF01F-2DAD-4E44-948F-6A5D4B21236E@staff.uni-marburg.de> Am 13.07.2010 um 06:29 schrieb Rahul Nabar: > On Mon, Jul 12, 2010 at 2:02 PM, Gus Correa wrote: >> Consider disk for: >> >> A) swap space (say, if the user programs are large, >> or you can't buy a lot of RAM, etc); > > Out of curiosity, is there the possibility of running a "swapless" > compute-node? I mean most HPC nodes already have fairly generous RAM > and once swapping to disk starts performance is degraded (severely?). > Are there non-problem scenarios where one does desire swapping to > disks? As already said: yes, it's possible and you can even switch swap on and off during normal operation (`swapon` and `swapoff`). Disadvantage is of course, when the system runs out of memory the oom-killer will look for an eligible process to be killed to free up some space. As you mentioned, the application should fit into the physical installed RAM, and you may just want 2 GB or so as a last resort to swap out parts of the OS which are currently not in use. You may want more swap, when you want to setup some kind of preemption using a job scheduler. E.g. GridEngine can suspend a low priority job once a urgent one comes in, but resources like memory are not freed automatically (the job is still on the node - you would need some kind of checkpointing to free the node completely). 
When you setup the queuing system that all running applications fit into physical memory, the swap of the suspended application is a one time issue and won't affect the ongoing computation. >> D) Most current node chassis have hot-swappable disks, not hard to replace, >> in case of failure. > > Hot-swappable disks are great on head nodes but on compute-nodes > whenever I hear "redundant" or "hot swappable", I see it as an > inefficiency. Or a excessive feature that could be traded off for a > cost saving. (of course, sometimes hands are tied if the server comes > with that feature "standard") What do others think? Correct. Often it's included in chassis as default, although you can't make much use of it when you use a e.g. RAID0 on a node for performance reasons and have to reinstall the node anyway. It will just avoid that you have to switch off the node completely and remove it from the rack to access the inner parts of the node. But there might been chassis, where you can access the drive from the front w/o hot-swap capability but with a big label: don't remove under operation. -- Reuti > > -- > Rahul > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Glen.Beane at jax.org Tue Jul 13 05:09:46 2010 From: Glen.Beane at jax.org (Glen Beane) Date: Tue, 13 Jul 2010 08:09:46 -0400 Subject: [Beowulf] first cluster In-Reply-To: Message-ID: On 7/13/10 12:29 AM, "Rahul Nabar" wrote: > On Mon, Jul 12, 2010 at 2:02 PM, Gus Correa wrote: >> Consider disk for: >> >> A) swap space (say, if the user programs are large, >> or you can't buy a lot of RAM, etc); > > Out of curiosity, is there the possibility of running a "swapless" > compute-node? I mean most HPC nodes already have fairly generous RAM > and once swapping to disk starts performance is degraded (severely?). at a previous job (seems like a million years ago) we had a Top-500 cluster with completely diskless compute nodes - no local disk for swap or /tmp space. My current cluster has a small amount of swap on each node (~1GB) and we avoid swapping. Our attitude still is swapping is not an option - it is an indication that we should be decomposing our problem further and distributing it across more nodes. This cluster happens to have a local OS install. We're deploying a new cluster in the next month or so based on the 8-core magny-cours with 128GB RAM per node. The nodes will network boot but we have local disks for /tmp and a very small amount of swap. -- Glen L. Beane Software Engineer The Jackson Laboratory Phone (207) 288-6153 From douglas.guptill at dal.ca Tue Jul 13 08:57:57 2010 From: douglas.guptill at dal.ca (Douglas Guptill) Date: Tue, 13 Jul 2010 12:57:57 -0300 Subject: [Beowulf] Re: Beowulf Digest, Vol 77, Issue 14 In-Reply-To: References: <201007091648.o69Glf7R016032@bluewest.scyld.com> Message-ID: <20100713155757.GA18339@sopalepc> On Tue, Jul 13, 2010 at 05:11:46PM +0200, Ivan Rossi wrote: > On Fri, 9 Jul 2010, beowulf-request at beowulf.org wrote: > >> After some lurking and reading, I plan this: >> Debian (lenny) >> + fai - for compute-node operating system install >> + Torque - job scheduler/manager > > we did a similar cluster for a client and they wanted torque. > IMHO torque sucks. pbs_moms are no-so-stable and configuration is a pain. > consider sun grid engine which is also available within debian lenny Interesting. 
Thanks for the suggestion. >> + MPI (Intel MPI) - for the application > > only if you use intel compilers, otherwise go openMPI Yes, we will be using Intel compilers. Douglas. -- Douglas Guptill voice: 902-461-9749 Research Assistant, LSC 4640 email: douglas.guptill at dal.ca Oceanography Department fax: 902-494-3877 Dalhousie University Halifax, NS, B3H 4J1, Canada From bill at Princeton.EDU Tue Jul 13 13:09:20 2010 From: bill at Princeton.EDU (Bill Wichser) Date: Tue, 13 Jul 2010 16:09:20 -0400 Subject: [Beowulf] IB problem with openmpi 1.2.8 In-Reply-To: <4C3B7AED.9080606@princeton.edu> References: <4C3B7AED.9080606@princeton.edu> Message-ID: <4C3CC7F0.1090308@princeton.edu> Just some more info. Went back to the prior kernel with no luck. Updated the firmware on the Topspin HBA cards to the latest (final) version (fw-25208-4_8_200-MHEL-CF128-T). Nothing changes. Still not sure where to look. Bill Wichser wrote: > Machine is an older Intel Woodcrest cluster with a two tiered IB > infrastructure with Topspin/Cisco 7000 switches. The core switch is a > SFS-7008P with a single management module which runs the SM manager. > The cluster runs RHEL4 and was upgraded last week to kernel > 2.6.9-89.0.26.ELsmp. The openib-1.4 remained the same. Pretty much > stock. > > After rebooting, the IB cards in the nodes remained in the INIT > state. I rebooted the chassis IB switch as it appeared that no SM was > running. No help. I manually started an opensm on a compute node > telling it to ignore other masters as initially it would only come up > in STANDBY. This turned all the nodes' IB ports to active and I > thought that I was done. > > ibdiagnet complained that there were two masters. So I killed the > opensm and now it was happy. osmtest -f c/osmtest -f a comes back > with OSMTEST: TEST "All Validations" PASS. > ibdiagnet -ls 2.5 -lw 4x finds all my switches and nodes with > everything coming up roses. > > The problem is that openmpi 1.2.8 with Intel 11.1.074 fails when the > node count goes over 32 (or maybe 40). This worked fine in the past, > before the reboot. User apps are failing as well as IMB v3.2. I've > increased the timeout using the "mpiexec -mca btl_openib_ib_timeout > 20" which helped for 48 nodes but when increasing to 64 and 128 it > didn't help at all. Typical error message follow. > > Right now I am stuck. I'm not sure what or where the problem might > be. Nor where to go next. If anyone has a clue, I'd appreciate > hearing it! > > Thanks, > Bill > > > typical error messages > > [0,1,33][btl_openib_component.c:1371:btl_openib_component_progress] > from woodhen-050 to: woodhen-036 error polling HP CQ with status RETRY > EXCEEDED ERROR status number 12 for wr_id 182937592248 opcode 0 > [0,1,36][btl_openib_component.c:1371:btl_openib_component_progress] > from woodhen-084 to: woodhen-085 error polling HP CQ with status RETRY > EXCEEDED ERROR status number 12 for wr_id 5840952 opcode 0 > [0,1,40][btl_openib_component.c:1371:btl_openib_component_progress] > from woodhen-098 to: woodhen-096 error polling LP CQ with status RETRY > EXCEEDED ERROR status number 12 for wr_id 182947573944 opcode 0 > -------------------------------------------------------------------------- > > The InfiniBand retry count between two MPI processes has been > exceeded. "Retry count" is defined in the InfiniBand spec 1.2 > (section 12.7.38): > > The total number of times that the sender wishes the receiver to > retry timeout, packet sequence, etc. errors before posting a > completion error. 
> > This error typically means that there is something awry within the > InfiniBand fabric itself. You should note the hosts on which this > error has occurred; it has been observed that rebooting or removing a > particular host from the job can sometimes resolve this issue. > > Two MCA parameters can be used to control Open MPI's behavior with > respect to the retry count: > > * btl_openib_ib_retry_count - The number of times the sender will > attempt to retry (defaulted to 7, the maximum value). > > * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted > to 10). The actual timeout value used is calculated as: > > 4.096 microseconds * (2^btl_openib_ib_timeout) > > See the InfiniBand spec 1.2 (section 12.7.34) for more details. > -------------------------------------------------------------------------- > > -------------------------------------------------------------------------- > > > DIFFERENT RUN: > > [0,1,92][btl_openib_component.c:1371:btl_openib_component_progress] > from woodhen-157 to: woodhen-081 error polling HP CQ with status RETRY > EXCEEDED ERROR status number 12 for wr_id 183541169080 opcode 0 > ... > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From rpnabar at gmail.com Tue Jul 13 13:32:30 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 13 Jul 2010 15:32:30 -0500 Subject: [Beowulf] Network problem: Why are ARP discovery requests sent to specific addresses instead of a broadcast domain In-Reply-To: <4C3BF3DE.2030002@utah.edu> References: <4C3BEAAC.6000306@myri.com> <4C3BF3DE.2030002@utah.edu> Message-ID: On Tue, Jul 13, 2010 at 12:04 AM, Tom Ammon wrote: > This is called a gratuitous ARP. Used to update the ARP caches of other > nodes. Thanks Tom. It is curious that most of my gratuitous ARP is coming from my IPMI interface and not my main eth stack. Not sure why. Maybe the Dell IPMI just is more aggressive about it. -- Rahul From prentice at ias.edu Tue Jul 13 13:50:06 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 13 Jul 2010 16:50:06 -0400 Subject: [Beowulf] IB problem with openmpi 1.2.8 In-Reply-To: <4C3CC7F0.1090308@princeton.edu> References: <4C3B7AED.9080606@princeton.edu> <4C3CC7F0.1090308@princeton.edu> Message-ID: <4C3CD17E.8060708@ias.edu> Bill, Have you checked the health of the cables themselves? It could just be dumb luck that a hardware failure coincided with a software change, didn't manifest itself until the reboot of the nodes. Did you reboot the switches, too? I would try dividing your cluster into small sections and see if the problem exists across the sections. Can you disconnect the edge switches from the core switch, so that each edge switch is it's own, isolated fabric? If so, you could then start an sm on each fabric and see if the problem is on every smaller IB fabric, or just one. The other option would be to disconnect all the nodes and add them back one by one, but that wouldn't catch a problem with a switch-to-switch connection. How big is the cluster? Would it take hours or days to test each node like this? You say the problem occurs when the node count goes over 32 (or 40) do you mean 32 physical nodes, or 32 processors. How does your scheduler assign nodes? Would those 32 nodes always be in the same rack or on the same IB switch, but not when the count increases? Prentice Bill Wichser wrote: > Just some more info. 
Went back to the prior kernel with no luck. > Updated the firmware on the Topspin HBA cards to the latest (final) > version (fw-25208-4_8_200-MHEL-CF128-T). Nothing changes. Still not > sure where to look. > > Bill Wichser wrote: >> Machine is an older Intel Woodcrest cluster with a two tiered IB >> infrastructure with Topspin/Cisco 7000 switches. The core switch is a >> SFS-7008P with a single management module which runs the SM manager. >> The cluster runs RHEL4 and was upgraded last week to kernel >> 2.6.9-89.0.26.ELsmp. The openib-1.4 remained the same. Pretty much >> stock. >> >> After rebooting, the IB cards in the nodes remained in the INIT >> state. I rebooted the chassis IB switch as it appeared that no SM was >> running. No help. I manually started an opensm on a compute node >> telling it to ignore other masters as initially it would only come up >> in STANDBY. This turned all the nodes' IB ports to active and I >> thought that I was done. >> >> ibdiagnet complained that there were two masters. So I killed the >> opensm and now it was happy. osmtest -f c/osmtest -f a comes back >> with OSMTEST: TEST "All Validations" PASS. >> ibdiagnet -ls 2.5 -lw 4x finds all my switches and nodes with >> everything coming up roses. >> >> The problem is that openmpi 1.2.8 with Intel 11.1.074 fails when the >> node count goes over 32 (or maybe 40). This worked fine in the past, >> before the reboot. User apps are failing as well as IMB v3.2. I've >> increased the timeout using the "mpiexec -mca btl_openib_ib_timeout >> 20" which helped for 48 nodes but when increasing to 64 and 128 it >> didn't help at all. Typical error message follow. >> >> Right now I am stuck. I'm not sure what or where the problem might >> be. Nor where to go next. If anyone has a clue, I'd appreciate >> hearing it! >> >> Thanks, >> Bill >> >> >> typical error messages >> >> [0,1,33][btl_openib_component.c:1371:btl_openib_component_progress] >> from woodhen-050 to: woodhen-036 error polling HP CQ with status RETRY >> EXCEEDED ERROR status number 12 for wr_id 182937592248 opcode 0 >> [0,1,36][btl_openib_component.c:1371:btl_openib_component_progress] >> from woodhen-084 to: woodhen-085 error polling HP CQ with status RETRY >> EXCEEDED ERROR status number 12 for wr_id 5840952 opcode 0 >> [0,1,40][btl_openib_component.c:1371:btl_openib_component_progress] >> from woodhen-098 to: woodhen-096 error polling LP CQ with status RETRY >> EXCEEDED ERROR status number 12 for wr_id 182947573944 opcode 0 >> -------------------------------------------------------------------------- >> >> The InfiniBand retry count between two MPI processes has been >> exceeded. "Retry count" is defined in the InfiniBand spec 1.2 >> (section 12.7.38): >> >> The total number of times that the sender wishes the receiver to >> retry timeout, packet sequence, etc. errors before posting a >> completion error. >> >> This error typically means that there is something awry within the >> InfiniBand fabric itself. You should note the hosts on which this >> error has occurred; it has been observed that rebooting or removing a >> particular host from the job can sometimes resolve this issue. >> >> Two MCA parameters can be used to control Open MPI's behavior with >> respect to the retry count: >> >> * btl_openib_ib_retry_count - The number of times the sender will >> attempt to retry (defaulted to 7, the maximum value). >> >> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted >> to 10). 
The actual timeout value used is calculated as: >> >> 4.096 microseconds * (2^btl_openib_ib_timeout) >> >> See the InfiniBand spec 1.2 (section 12.7.34) for more details. >> -------------------------------------------------------------------------- >> >> -------------------------------------------------------------------------- >> >> >> DIFFERENT RUN: >> >> [0,1,92][btl_openib_component.c:1371:btl_openib_component_progress] >> from woodhen-157 to: woodhen-081 error polling HP CQ with status RETRY >> EXCEEDED ERROR status number 12 for wr_id 183541169080 opcode 0 >> ... >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Prentice Bisbal Linux Software Support Specialist/System Administrator School of Natural Sciences Institute for Advanced Study Princeton, NJ From bill at Princeton.EDU Tue Jul 13 14:50:35 2010 From: bill at Princeton.EDU (Bill Wichser) Date: Tue, 13 Jul 2010 17:50:35 -0400 Subject: [Beowulf] IB problem with openmpi 1.2.8 In-Reply-To: <4C3CD17E.8060708@ias.edu> References: <4C3B7AED.9080606@princeton.edu> <4C3CC7F0.1090308@princeton.edu> <4C3CD17E.8060708@ias.edu> Message-ID: <4C3CDFAB.2000309@princeton.edu> On 7/13/2010 4:50 PM, Prentice Bisbal wrote: > Bill, > > Have you checked the health of the cables themselves? It could just be > dumb luck that a hardware failure coincided with a software change, > didn't manifest itself until the reboot of the nodes. Did you reboot the > switches, too? > Just looked at all the lights and they all seem fine. > I would try dividing your cluster into small sections and see if the > problem exists across the sections. > > Can you disconnect the edge switches from the core switch, so that each > edge switch is it's own, isolated fabric? If so, you could then start an > sm on each fabric and see if the problem is on every smaller IB fabric, > or just one. > I've thought about this one. Non-trivial. I have a core switch connecting 12 leaf switches. Each switch connects to 16 nodes. I need to use that core switch in order to make the problem appear. > The other option would be to disconnect all the nodes and add them back > one by one, but that wouldn't catch a problem with a switch-to-switch > connection. > > How big is the cluster? Would it take hours or days to test each node > like this? > > 192 nodes (8 cores each). > You say the problem occurs when the node count goes over 32 (or 40) do > you mean 32 physical nodes, or 32 processors. How does your scheduler > assign nodes? Would those 32 nodes always be in the same rack or on the > same IB switch, but not when the count increases? > It starts failing at 48 nodes. PBS allocates as least loaded, round robin fashion. But sequentially, minus the PVFS nodes, which are distributed throughout the cluster and allocated last in round robin. The 32 nodes definately go through the core. And it never seems to matter where. I've tried to pinpoint some nodes by keeping lists but this happens everywhere. I was hoping that some tool I'm not aware of exists but apparently not. 
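Failing a ready-made tool, a pairwise MPI exchange along the lines below can stand in for one: run a single rank per node and let every pair of ranks do a round trip while the remaining ranks wait at a barrier, so the last pair reported is the pair - and therefore the leaf/core path - that tripped RETRY EXCEEDED. This is only a sketch under a few assumptions: the 1 MB message size and the file name are arbitrary, the one-rank-per-node launch is left to a hostfile or the batch system, and nothing beyond a working MPI library is assumed.

/* pairtest.c - walk every rank pair and do a round-trip message between them.
   Build with mpicc, launch with one rank per node (e.g. a hostfile that lists
   each node once).  When a pair hangs or aborts with RETRY EXCEEDED, the last
   "ok" line printed names the suspect pair. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_BYTES (1 << 20)   /* 1 MB - arbitrary, big enough to leave the eager path */

int main(int argc, char **argv)
{
    int rank, size, hostlen;
    char host[MPI_MAX_PROCESSOR_NAME];
    char *sbuf = malloc(MSG_BYTES), *rbuf = malloc(MSG_BYTES);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &hostlen);
    memset(sbuf, rank & 0xff, MSG_BYTES);

    for (int i = 0; i < size - 1; i++) {
        for (int j = i + 1; j < size; j++) {
            if (rank == i || rank == j) {
                int peer = (rank == i) ? j : i;
                MPI_Sendrecv(sbuf, MSG_BYTES, MPI_BYTE, peer, 0,
                             rbuf, MSG_BYTES, MPI_BYTE, peer, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (rank == i) {
                    printf("ok %4d <-> %4d (reported from %s)\n", i, j, host);
                    fflush(stdout);
                }
            }
            /* everyone steps together, so the last "ok" marks the failing pair */
            MPI_Barrier(MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    free(sbuf);
    free(rbuf);
    return 0;
}

At 192 nodes that is roughly 18,000 pairs, so expect it to take a few minutes on an otherwise idle machine.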
My next attempt may be to pull the management card from the core and just run opensm on nodes themselves, like we do for other clusters. But I can test with osmtest all day and never get errors. This makes me feel very uncomfortable! Of course, nothing is under warranty anymore. Divide and conquer seems like the only solution. Thanks, Bill > Prentice > > > > Bill Wichser wrote: > >> Just some more info. Went back to the prior kernel with no luck. >> Updated the firmware on the Topspin HBA cards to the latest (final) >> version (fw-25208-4_8_200-MHEL-CF128-T). Nothing changes. Still not >> sure where to look. >> >> Bill Wichser wrote: >> >>> Machine is an older Intel Woodcrest cluster with a two tiered IB >>> infrastructure with Topspin/Cisco 7000 switches. The core switch is a >>> SFS-7008P with a single management module which runs the SM manager. >>> The cluster runs RHEL4 and was upgraded last week to kernel >>> 2.6.9-89.0.26.ELsmp. The openib-1.4 remained the same. Pretty much >>> stock. >>> >>> After rebooting, the IB cards in the nodes remained in the INIT >>> state. I rebooted the chassis IB switch as it appeared that no SM was >>> running. No help. I manually started an opensm on a compute node >>> telling it to ignore other masters as initially it would only come up >>> in STANDBY. This turned all the nodes' IB ports to active and I >>> thought that I was done. >>> >>> ibdiagnet complained that there were two masters. So I killed the >>> opensm and now it was happy. osmtest -f c/osmtest -f a comes back >>> with OSMTEST: TEST "All Validations" PASS. >>> ibdiagnet -ls 2.5 -lw 4x finds all my switches and nodes with >>> everything coming up roses. >>> >>> The problem is that openmpi 1.2.8 with Intel 11.1.074 fails when the >>> node count goes over 32 (or maybe 40). This worked fine in the past, >>> before the reboot. User apps are failing as well as IMB v3.2. I've >>> increased the timeout using the "mpiexec -mca btl_openib_ib_timeout >>> 20" which helped for 48 nodes but when increasing to 64 and 128 it >>> didn't help at all. Typical error message follow. >>> >>> Right now I am stuck. I'm not sure what or where the problem might >>> be. Nor where to go next. If anyone has a clue, I'd appreciate >>> hearing it! >>> >>> Thanks, >>> Bill >>> >>> >>> typical error messages >>> >>> [0,1,33][btl_openib_component.c:1371:btl_openib_component_progress] >>> from woodhen-050 to: woodhen-036 error polling HP CQ with status RETRY >>> EXCEEDED ERROR status number 12 for wr_id 182937592248 opcode 0 >>> [0,1,36][btl_openib_component.c:1371:btl_openib_component_progress] >>> from woodhen-084 to: woodhen-085 error polling HP CQ with status RETRY >>> EXCEEDED ERROR status number 12 for wr_id 5840952 opcode 0 >>> [0,1,40][btl_openib_component.c:1371:btl_openib_component_progress] >>> from woodhen-098 to: woodhen-096 error polling LP CQ with status RETRY >>> EXCEEDED ERROR status number 12 for wr_id 182947573944 opcode 0 >>> -------------------------------------------------------------------------- >>> >>> The InfiniBand retry count between two MPI processes has been >>> exceeded. "Retry count" is defined in the InfiniBand spec 1.2 >>> (section 12.7.38): >>> >>> The total number of times that the sender wishes the receiver to >>> retry timeout, packet sequence, etc. errors before posting a >>> completion error. >>> >>> This error typically means that there is something awry within the >>> InfiniBand fabric itself. 
You should note the hosts on which this >>> error has occurred; it has been observed that rebooting or removing a >>> particular host from the job can sometimes resolve this issue. >>> >>> Two MCA parameters can be used to control Open MPI's behavior with >>> respect to the retry count: >>> >>> * btl_openib_ib_retry_count - The number of times the sender will >>> attempt to retry (defaulted to 7, the maximum value). >>> >>> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted >>> to 10). The actual timeout value used is calculated as: >>> >>> 4.096 microseconds * (2^btl_openib_ib_timeout) >>> >>> See the InfiniBand spec 1.2 (section 12.7.34) for more details. >>> -------------------------------------------------------------------------- >>> >>> -------------------------------------------------------------------------- >>> >>> >>> DIFFERENT RUN: >>> >>> [0,1,92][btl_openib_component.c:1371:btl_openib_component_progress] >>> from woodhen-157 to: woodhen-081 error polling HP CQ with status RETRY >>> EXCEEDED ERROR status number 12 for wr_id 183541169080 opcode 0 >>> ... >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >>> To change your subscription (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >>> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> >> > From douglas.guptill at dal.ca Tue Jul 13 15:05:38 2010 From: douglas.guptill at dal.ca (Douglas Guptill) Date: Tue, 13 Jul 2010 19:05:38 -0300 Subject: [Beowulf] first cluster [was [OMPI users] trouble using openmpi under slurm] In-Reply-To: <4C37AB5D.9050308@ldeo.columbia.edu> References: <2F04DA66-FE62-4131-8F8A-DAFB69668C46@open-mpi.org> <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <4C37AB5D.9050308@ldeo.columbia.edu> Message-ID: <20100713220538.GA15163@sopalepc> Hello Gus, list: On Fri, Jul 09, 2010 at 07:06:05PM -0400, Gus Correa wrote: > Douglas Guptill wrote: >> On Thu, Jul 08, 2010 at 09:43:48AM -0400, Gus Correa wrote: >>> Douglas Guptill wrote: >>>> On Wed, Jul 07, 2010 at 12:37:54PM -0600, Ralph Castain wrote: >>>> >>>>> No....afraid not. Things work pretty well, but there are places >>>>> where things just don't mesh. Sub-node allocation in particular is >>>>> an issue as it implies binding, and slurm and ompi have conflicting >>>>> methods. >>>>> >>>>> It all can get worked out, but we have limited time and nobody cares >>>>> enough to put in the effort. Slurm just isn't used enough to make it >>>>> worthwhile (too small an audience). >>>> I am about to get my first HPC cluster (128 nodes), and was >>>> considering slurm. We do use MPI. >>>> >>>> Should I be looking at Torque instead for a queue manager? >>>> >>> Hi Douglas >>> >>> Yes, works like a charm along with OpenMPI. >>> I also have MVAPICH2 and MPICH2, no integration w/ Torque, >>> but no conflicts either. >> >> Thanks, Gus. 
>> >> After some lurking and reading, I plan this: >> Debian (lenny) >> + fai - for compute-node operating system install >> + Torque - job scheduler/manager >> + MPI (Intel MPI) - for the application >> + MPI (OpenMP) - alternative MPI >> >> Does anyone see holes in this plan? >> >> Thanks, >> Douglas > > > Hi Douglas > > I never used Debian, fai, or Intel MPI. > > We have two clusters with cluster management software, i.e., > mostly the operating system install stuff. > > I made a toy Rocks cluster out of old computers. > Rocks is a minimum-hassle way to deploy and maintain a cluster. > Of course you can do the same from scratch, or do more, or do better, > which makes some people frown at Rocks. > However, Rocks works fine, particularly if your network(s) > is (are) Gigabit Ethernet, > and if you don't mix different processor architectures (i.e. only i386 > or only x86_64, although there is some support for mixed stuff). > It is developed/maintained by UCSD under an NSF grant (I think). > It's been around for quite a while too. > > You may want to take a look, perhaps experiment with a subset of your > nodes before you commit: > > http://www.rocksclusters.org/wordpress/ I am sure Rocks suits many, but not me, at first glance. I am too much of a tinkerer. That comes, partially, from starting this business too earlier; my first computer was a Univac II - vacuum tubes, no operating system. > What is the interconnect/network hardware you have for MPI? > Gigabit Ethernet? Infiniband? Myrinet? Other? Infiniband - QLogic 12300-BS18 > If Infiniband you may need to add the OFED packages, Gotcha. Thanks. > If you are going to handle a variety of different compilers, MPI > flavors, with various versions, etc, I recommend using the > "Environment module" package. My one user has requested that. > I hope this helps. A Big help. Much appreciated. Douglas. -- Douglas Guptill voice: 902-461-9749 Research Assistant, LSC 4640 email: douglas.guptill at dal.ca Oceanography Department fax: 902-494-3877 Dalhousie University Halifax, NS, B3H 4J1, Canada From samuel at unimelb.edu.au Tue Jul 13 23:27:03 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Wed, 14 Jul 2010 16:27:03 +1000 Subject: [Beowulf] first cluster In-Reply-To: <158EF01F-2DAD-4E44-948F-6A5D4B21236E@staff.uni-marburg.de> References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> <158EF01F-2DAD-4E44-948F-6A5D4B21236E@staff.uni-marburg.de> Message-ID: <4C3D58B7.7070506@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 13/07/10 22:07, Reuti wrote: > Disadvantage is of course, when the system runs out of > memory the oom-killer will look for an eligible process > to be killed to free up some space. That assumes that you are permitting your compute nodes to overcommit their memory, if you disable overcommit I believe that you will instead just get malloc()'s failing when there is nothing for them to grab. 
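A tiny C probe makes that difference easy to see on a scratch node - with strict accounting (vm.overcommit_memory=2) malloc() starts returning NULL once the commit limit is reached, while under the default heuristic mode the same loop sails past RAM+swap and it is the OOM killer that ends things when the pages are touched. The 256 MB chunk size is an arbitrary choice, and this is something to run only on a node you do not mind upsetting:

/* overcommit_probe.c - grab memory in 256 MB chunks until malloc() says no.
   Under vm.overcommit_memory=2 malloc returns NULL once Committed_AS would
   pass CommitLimit; under the default heuristic overcommit the loop keeps
   "succeeding" and the OOM killer intervenes when the memset touches the
   pages.  The memory is deliberately never freed - that is the point. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const size_t chunk = 256UL * 1024 * 1024;
    size_t total = 0;

    for (;;) {
        void *p = malloc(chunk);
        if (p == NULL) {
            printf("malloc failed after %zu MB reserved\n", total >> 20);
            return 0;
        }
        memset(p, 1, chunk);   /* touch the pages so they really count */
        total += chunk;
        printf("%zu MB\n", total >> 20);
    }
}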
cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw9WLYACgkQO2KABBYQAh9jPwCfSFVGCEuLc9kDuNnkpeTmcL7e MfQAniejMy14z5xZO2wyiE6QAkQzfH2W =YKCa -----END PGP SIGNATURE----- From samuel at unimelb.edu.au Tue Jul 13 23:31:07 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Wed, 14 Jul 2010 16:31:07 +1000 Subject: [Beowulf] Re: Beowulf Digest, Vol 77, Issue 14 In-Reply-To: <20100713155757.GA18339@sopalepc> References: <201007091648.o69Glf7R016032@bluewest.scyld.com> <20100713155757.GA18339@sopalepc> Message-ID: <4C3D59AB.8020004@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 14/07/10 01:57, Douglas Guptill wrote: > On Tue, Jul 13, 2010 at 05:11:46PM +0200, Ivan Rossi wrote: [...] >>> >> + MPI (Intel MPI) - for the application >> > >> > only if you use intel compilers, otherwise go openMPI > > Yes, we will be using Intel compilers. Even so I'd suggest benchmarking between OMPI and Intel MPI to see which does better by your application. cheers! Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw9WasACgkQO2KABBYQAh9dUACeMesGS4FUk57nU11A8roOSye2 IpYAn2Du/FP8VgoqG6O96cywwTuH8Qk5 =YQKp -----END PGP SIGNATURE----- From beat at 0x1b.ch Tue Jul 13 22:04:48 2010 From: beat at 0x1b.ch (Beat Rubischon) Date: Wed, 14 Jul 2010 07:04:48 +0200 Subject: [Beowulf] Network problem: Why are ARP discovery requests sent to specific addresses instead of a broadcast domain In-Reply-To: Message-ID: Hello! Quoting (13.07.10 22:32): > It is curious that most of my gratuitous ARP is coming > from my IPMI interface and not my main eth stack. Not sure why. Maybe > the Dell IPMI just is more aggressive about it. This behavour could be controlled by some flags in the BMC: # ipmitool lan set 1 arp respond on # ipmitool lan set 1 arp generate on I had BMCs where gratuitous ARP was needed as the standard ARP responses were not working. To minimize the impact of the broadcasts I expanded the delay between those packages up to 127 seconds. Beat -- \|/ Beat Rubischon ( 0^0 ) http://www.0x1b.ch/~beat/ oOO--(_)--OOo--------------------------------------------------- Meine Erlebnisse, Gedanken und Traeume: http://www.0x1b.ch/blog/ From rpnabar at gmail.com Wed Jul 14 00:15:16 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 14 Jul 2010 02:15:16 -0500 Subject: [Beowulf] Network problem: Why are ARP discovery requests sent to specific addresses instead of a broadcast domain In-Reply-To: References: Message-ID: On Wed, Jul 14, 2010 at 12:04 AM, Beat Rubischon wrote: > I had BMCs where gratuitous ARP was needed as the standard ARP responses > were not working. To minimize the impact of the broadcasts I expanded the > delay between those packages up to 127 seconds. Are you with Dell servers too? I can't find any settings that control this delay. How did you do it? 
-- Rahul From beat at 0x1b.ch Wed Jul 14 00:47:17 2010 From: beat at 0x1b.ch (Beat Rubischon) Date: Wed, 14 Jul 2010 09:47:17 +0200 Subject: [Beowulf] Network problem: Why are ARP discovery requests sent to specific addresses instead of a broadcast domain In-Reply-To: Message-ID: Hi Rahul! Quoting (14.07.10 09:15): > Are you with Dell servers too? I can't find any settings that > control this delay. How did you do it? Nope. I'm working for a company selling clusters, servers and workstations. Who do you own? :-) Learn to use "ipmitool". It's generic and works with all BMCs today. Use it remotely with ipmitool -H <host> -U <user> -P <password> or locally after loading the appropriate modules modprobe ipmi_si modprobe ipmi_devintf ipmitool Beat -- \|/ Beat Rubischon ( 0^0 ) http://www.0x1b.ch/~beat/ oOO--(_)--OOo--------------------------------------------------- Meine Erlebnisse, Gedanken und Traeume: http://www.0x1b.ch/blog/ From rpnabar at gmail.com Wed Jul 14 10:05:04 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 14 Jul 2010 12:05:04 -0500 Subject: [Beowulf] Network problem: Why are ARP discovery requests sent to specific addresses instead of a broadcast domain In-Reply-To: References: Message-ID: On Wed, Jul 14, 2010 at 2:47 AM, Beat Rubischon wrote: > Learn to use "ipmitool". It's generic and works with all BMCs today. Use it > remotely with Thanks! I am already using ipmitool. Just wasn't aware this can set the ARP interval (should have read that manpage more closely). :) Found the way to do it: ipmitool -H 172.16.0.1 -U root -f ~/ipmi_pw -I lanplus lan set 1 arp interval -- Rahul From mdidomenico4 at gmail.com Thu Jul 15 17:31:13 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Thu, 15 Jul 2010 20:31:13 -0400 Subject: [Beowulf] intel mkl lapack Message-ID: Does anyone have a specific C-code example of using the zgelss function with complex numbers, which is part of the LAPACK libraries? (I'm using the Intel MKL.) I'm unable to locate a specific example for this particular function call. I followed the (terse) API docs that come with MKL, but I'm unable to figure out what I'm doing wrong in the code. From douglas.guptill at DAL.CA Thu Jul 15 17:46:28 2010 From: douglas.guptill at DAL.CA (Douglas Guptill) Date: Thu, 15 Jul 2010 21:46:28 -0300 Subject: [Beowulf] first cluster In-Reply-To: <4C3B66D0.2080204@ldeo.columbia.edu> References: <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> Message-ID: <20100716004628.GB7810@sopalepc> Hello Gus: On Mon, Jul 12, 2010 at 03:02:40PM -0400, Gus Correa wrote: > Hi Doug > > Consider disk for: > > A) swap space (say, if the user programs are large, > or you can't buy a lot of RAM, etc); > I wonder if swapping over NFS would be efficient for HPC. > Disk may be a simple and cost effective solution. We have bought enough RAM (6 GB/core) that will, I hope, prevent swapping. > B) input/output data files that your application programs may require > (if they already work in stagein-stageout mode, Now there you have me. What is stagein-stageout? > or if they do I/O so often that a NFS mounted file system > may get overwhelmed, hence reading/writing on local disk may be preferred). I am hoping to do that - write to local disk.
How to do that is still an unsolved problem at this point. The head node will have (6) 2 TB disks. > C) Would diskless scaling be a real big advantage for > a small/medium size cluster, say up to ~200 nodes? Good question. The node count is 16 (not 124, as I said previously - brain fart - 124 is the core count), and seems to me just over the border of what can be easily maintained as separate, diskful installs. Our one user has expressed a preference for "refreshing" the nodes before a job runs. By that, he means re-install the operating system. > E) booting when the NFS root server is not reachable > > Disks don't prevent one to keep a single image and distribute > it consistently across nodes, do they? I like that idea. > I guess there are old threads about this in the list archives. I looked in the beowulf archives, and only found very old (+years) articles. Is there another archive I should be looking at? > Just some thoughts. Much appreciated, Douglas. -- Douglas Guptill voice: 902-461-9749 Research Assistant, LSC 4640 email: douglas.guptill at dal.ca Oceanography Department fax: 902-494-3877 Dalhousie University Halifax, NS, B3H 4J1, Canada From hahn at mcmaster.ca Thu Jul 15 18:29:59 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 15 Jul 2010 21:29:59 -0400 (EDT) Subject: [Beowulf] first cluster In-Reply-To: <4C3D58B7.7070506@unimelb.edu.au> References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> <158EF01F-2DAD-4E44-948F-6A5D4B21236E@staff.uni-marburg.de> <4C3D58B7.7070506@unimelb.edu.au> Message-ID: >> Disadvantage is of course, when the system runs out of >> memory the oom-killer will look for an eligible process >> to be killed to free up some space. > > That assumes that you are permitting your compute nodes > to overcommit their memory, if you disable overcommit I > believe that you will instead just get malloc()'s failing > when there is nothing for them to grab. yes. actually, configuring memory and swap is an interesting topic. the feature Chris is referring to is, I think, the vm.overcommit_memory sysctl (and the associated vm.overcommit_ratio.) every distro I've seen leaves these at the default seting: vm.overcommit_memory=0. this is basically the traditional setting that tells the kernel to feel free to allocate way too much memory, and to resolve memory crunches via OOM killing. obviously, this isn't great, since it never tells apps to conserve memory (malloc returning zero), and often kills processes that you're rather not be killed (sshd, other system daemons). on clusters where a node may be shared across users/jobs, OOM can result serious collateral damage... we've used vm.overcommit_memory=2 fairly often. in this mode, the kernel limits its VM allocations to a combination of the size of ram and swap. this is reflected in /proc/meminfo:CommitLimit which will be computed as /proc/meminfo:SwapTotal + vm.overcommit_ratio * /proc/meminfo:MemTotal. /proc/meminfo:Committed_AS is the kernel's idea of total VM usage. IMO, it's essential to also run with RLIMIT_AS on all processes. this is basically a VM limit per process (not totalled across processes, though of course threads by definition share a single VM.) 
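as a concrete illustration (just a sketch; the 4 GB figure and the program name are placeholders, not recommendations):

    # with vm.overcommit_memory=2 in effect, compare commitments to the limit
    grep -E 'MemTotal|SwapTotal|CommitLimit|Committed_AS' /proc/meminfo

    # cap the address space (RLIMIT_AS) of a job from the launching shell;
    # ulimit -v takes kilobytes, so this is a 4 GB cap
    ulimit -v $((4 * 1024 * 1024))
    ./my_mpi_app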
you might be thinking that RLIMIT_RSS would be better - indeed it would, but the kernel doesn't implement it. basically, limiting RSS is a bit tricky because you have to deal with how to count shared pages, and the limiting logic is going to slow down some important hot paths. (unlike AS (vsz), which only needs logic during explicit brk/mmap/munmap ops.) of course, to be useful, this requires users to provide reasonable memory limits at job-submission time. (our user population is pretty diverse, and isn't very good at doing wallclock limits, let alone "wizardly" issues like VM footprint.) batch systems often also provide their own resource management systems. I'm not fond of putting much effort in this direction, since it's usually based on a load-balancing model (which doesn't work if job memory use fluctuates), and upon on-node daemons which are assumed to be able to stay alive long enough to kill over-large job processes. yes, one can harden such system daemons by locking them into ram, but that's not an unalloyed win: they'll probably be nontrivial in size, and such memory usage is unswapable, even if some of the pages are never used... anyway, back to the topic: it's eminently possible to run nodes without swap, and reasonably safe to do so if your user community is not totally random, and if you make smart use of vm.overcommit_memory=2 and RLIMIT_AS. 5 years ago, running swapless was somewhat risky because the kernel was dramatically better tested/tuned in a normal swap-able configuration. my guess is that the huge embedded ecosystem has made swapless more robust, especially if you take the time to configure some basic sanity limits on user processes. regards, mark hahn. From gus at ldeo.columbia.edu Thu Jul 15 20:01:03 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 15 Jul 2010 23:01:03 -0400 Subject: [Beowulf] first cluster In-Reply-To: <20100716004628.GB7810@sopalepc> References: <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> <20100716004628.GB7810@sopalepc> Message-ID: <4C3FCB6F.4040504@ldeo.columbia.edu> Hi Douglas Douglas Guptill wrote: > Hello Gus: > > On Mon, Jul 12, 2010 at 03:02:40PM -0400, Gus Correa wrote: >> Hi Doug >> >> Consider disk for: >> >> A) swap space (say, if the user programs are large, >> or you can't buy a lot of RAM, etc); >> I wonder if swapping over NFS would be efficient for HPC. >> Disk may be a simple and cost effective solution. > > We have bought enough RAM (6 GB /core) that will I hope prevent swapping. > Sure, of course swapping is a disaster for HPC, and for MPI. Your memory configuration sounds great, specially compared to my meager 2GB/core, the most we could afford. :) >> B) input/output data files that your application programs may require >> (if they already work in stagein-stageout mode, > > Now there you have me. What is stagein-stageout? > Old fashioned term for copying input files to the compute nodes before the program starts, then the output files back to the head node after the program ends. You can still find this service in Torque/PBS, maybe other resource managers, but it can also be done through scripts. >> or if they do I/O so often that a NFS mounted file system >> may get overwhelmed, hence reading/writing on local disk may be preferred). > > I am hoping to do that - write to local disk. 
Actually, we seldom do this here. Most programs we run are ocean/atmosphere/climate, with other Earth Science applications also. Since you are in oceanography (am I right?) I would guess you would be running ocean models, and they tend to do a moderate amount of I/O, or better, to have a moderate I/O-to-computation ratio. Hence, they normally don't require local disk for I/O, which can be done in a central NFS mounted directory. We don't have a big cluster, so we use Infinband (in the past it was Myrinet) for MPI and Gigabit Ethernet for control and I/O. We have a separate file server with a RAID array, where the home directories and scratch file systems live, and are NFS mounted on the nodes. I think this setup is more or less standard for small clusters. I mentioned local disk for scratch space because this was common when Ethernet 100Mb/s was the interconnect, and would barely handle MPI, so it was preferred to do I/O locally, and 'stagein/stageout' the files. On the other hand, as per several postings in this and other mailing lists, some computational chemistry and genome sequencing programs apparently do I/O so often that they cannot live without local disk, or a more expensive parallel file system. > Each node has a 1 TB > disk, which I would like to split between the OS and user space. We have much less, 80GB or 250GB disk on compute nodes, which is more than enough for the OS and the scratch space (seldom used). Somebody mentioned that you also need the local disk for /tmp, besides possible (not desirable) swap. And of course you can have local /scratch, if you want. > How > to do that is still an unsolved problem at this point. > The head node > will have (6) 2 TB disks. > Have you considered a separate storage node, NAS, whatever, with RAID, to put home directories, scratch space, and mount them on the nodes via NFS. The head node can also play this role, hosting the storage. Given your total investment, this may not be so expensive. Since you have only a few users, you could even use the head node for this, to avoid extra cost. Buy a decent Gigabit Ethernet switch (or switches), and connect this storage to it via 10Gbit Ethernet card. Most good switches have modules for that. >> C) Would diskless scaling be a real big advantage for >> a small/medium size cluster, say up to ~200 nodes? > > Good question. The node count is 16 (not 124, as I said previously - > brain fart - 124 is the core count), OK, with 16 nodes you could certainly centralize home and scratch directories in a single server (say the head node) with RAID (say, RAID6), for better performance, and mount them on the nodes via NFS, even on a Gigabit Etherenet network. (I would suggest having one network for control & I/O, another for MPI). I would rather put smaller disks on the nodes, save the money to buy a decent RAID controller, a head node chassis with hot-swappable disk bays, enterprise class SATA disks of 2TB, and you would have a central storage in the head node with, say 16-24TB (nominal), with RAID6, xfs file system, for /home, /scratch[1,2,3...], all NFS mounted on the nodes. Easier to administer than separate home directories for each user on the nodes, and probably not noticeably slower (from the user standpoint) than the local disks. I suppose this is a very common setup. You could still create local /scratch on the compute nodes, for those users that like to read/write on local disk, and perhaps have a cleanup cron script to wipe off the excess of old local /scratch files. 
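Something along these lines would do for the cleanup (a sketch only; the path, the 30-day cutoff, and the script name are arbitrary examples):

    #!/bin/sh
    # e.g. /etc/cron.daily/scratch-clean (hypothetical name): purge local
    # scratch files untouched for 30+ days, then drop any empty directories
    find /scratch -xdev -type f -atime +30 -delete
    find /scratch -xdev -mindepth 1 -type d -empty -delete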
> and seems to me just over the > border of what can be easily maintained as separate, diskful installs. > Our one user has expressed a preference for "refreshing" the nodes > before a job runs. By that, he means re-install the operating system. > Why? I reinstall when I detect a problem. Rocks (which you already declined to use :) ) reinstalls on any hard reboot or power failure, assuming that those can lead to inconsistencies across the compute nodes. This is default, but you can change that. I think that even this is too much. However, reinstalling before every new job starts sounds like washing your hands before you strike any new key on the keyboard. You can't write an email this way, and you cant extract useful work from the cluster if you have to reinstall the nodes so often. Even rebooting the node before a job starts is already too much. You can do it periodically, to refresh the system, but before every job, I never heard of anybody that does this. >> E) booting when the NFS root server is not reachable >> >> Disks don't prevent one to keep a single image and distribute >> it consistently across nodes, do they? > > I like that idea. That has been working fine here and in many many places. > >> I guess there are old threads about this in the list archives. > > I looked in the beowulf archives, and only found very old (+years) > articles. Is there another archive I should be looking at? > In general, since many discussions in this list go astray, the subject/title may have very little relation to the actual arguments in the thread. I am not criticizing this, I like it. Some of the best discussions here started with a simple question that was hijacked for a worthy cause, and turned into a completely new dimension. It is going to be hard to find anything searching the subject line. You can try to search the message bodies with keywords like "diskless", "ram disk", etc. Google advanced search may help in this regard. Unlikely that you will find much about diskless clusters in the Rocks archive, as they are diskfull clusters. However, there may have been some discussions there too. >> Just some thoughts. > > Much appreciated, > Douglas. Best of luck with your new cluster! Gus From samuel at unimelb.edu.au Thu Jul 15 21:09:36 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Fri, 16 Jul 2010 14:09:36 +1000 Subject: [Beowulf] first cluster In-Reply-To: References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> <158EF01F-2DAD-4E44-948F-6A5D4B21236E@staff.uni-marburg.de> <4C3D58B7.7070506@unimelb.edu.au> Message-ID: <4C3FDB80.5010706@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 16/07/10 11:29, Mark Hahn wrote: > every distro I've seen leaves these at the default seting: > vm.overcommit_memory=0. this is basically the traditional > setting that tells the kernel to feel free to allocate way > too much memory, and to resolve memory crunches via OOM Looking at the kernel code if you set it vm.overcommit_memory to 0 (OVERCOMMIT_GUESS) then the kernel allows *each process* to allocate up to 97% of the total of RAM+swap (the last 3% is reserved for root, or processes with CAP_SYS_ADMIN). The catch is that (as highlighted) the limit is a per process one, not a system wide one. 
With it set to 1 (OVERCOMMIT_ALWAYS) there are no checks at all, it just returns 0 (OK) so any process can allocate as much as it wants, just that you don't know who or what will get OOM'd when you want to use it.. ;-) With 2 (OVERCOMMIT_NEVER) you can never specify more than your entire RAM+swap and the limit is applied across the system. We enforce RLIMIT_AS for MPI and single CPU processes by setting pvmem limits in Torque in the default queue. That doesn't work for SMP jobs so we have an 'smp' queue for then which sets mem= instead, this means that pbs_mom monitors the children and kills them if they go over their limits. cheers! Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw/238ACgkQO2KABBYQAh88PQCfdmVZjYE2GznidzDNPOJ2zO6U DbIAnjKaviRyxIIsNVmsS3zfgbM0M7uZ =eLad -----END PGP SIGNATURE----- From rpnabar at gmail.com Thu Jul 15 21:10:31 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 15 Jul 2010 23:10:31 -0500 Subject: [Beowulf] first cluster In-Reply-To: <20100716004628.GB7810@sopalepc> References: <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> <20100716004628.GB7810@sopalepc> Message-ID: On Thu, Jul 15, 2010 at 7:46 PM, Douglas Guptill wrote: > Good question. ?The node count is 16 (not 124, as I said previously - > brain fart - 124 is the core count), and seems to me just over the > border of what can be easily maintained as separate, diskful installs. Not really the limit. We have ~300 nodes (300x8 cores) and they are all maintained "diskful" (if I understand the usage correctly). i.e. They each have a local disk for the OS and scratch space. Of course, the OS image are all essentially identical and installation is automated via PXE. -- Rahul From rpnabar at gmail.com Thu Jul 15 21:30:17 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 15 Jul 2010 23:30:17 -0500 Subject: [Beowulf] first cluster In-Reply-To: References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> <158EF01F-2DAD-4E44-948F-6A5D4B21236E@staff.uni-marburg.de> <4C3D58B7.7070506@unimelb.edu.au> Message-ID: On Thu, Jul 15, 2010 at 8:29 PM, Mark Hahn wrote: > yes. ?actually, configuring memory and swap is an interesting topic. > the feature Chris is referring to is, I think, the vm.overcommit_memory > sysctl (and the associated vm.overcommit_ratio.) ?every distro I've seen > leaves these at the default seting: vm.overcommit_memory=0. ?this is Is it possible to know how much over-committed my OS was, say in the last one day. Or at least instantaneously. I want to see how good or bad my user apps have been at requesting memory. Thus, if I were to take the strict approach of memory assignment I can know in advance if or not a lot of malloc calls are going to get a zero returned. 
> basically the traditional setting that tells the kernel to feel free > to allocate way too much memory, and to resolve memory crunches via OOM > killing. ?obviously, this isn't great, since it never tells apps to conserve > memory (malloc returning zero), and often kills processes that > you're rather not be killed (sshd, other system daemons). ?on clusters Ah! this might explain why once in a while I have a node with sshd dead. Is it possible to tell the kernel that certain processes are "privileged" and when it seeks to find random processes to kill it should not select these "privileged" processes? Some candidates that come to my mind are sshd, nagios and pbs_mom -- Rahul From samuel at unimelb.edu.au Thu Jul 15 22:31:25 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Fri, 16 Jul 2010 15:31:25 +1000 Subject: [Beowulf] first cluster In-Reply-To: References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org> <45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org> <20100707191645.GA25781@sopalepc> <4C35D614.1090607@ldeo.columbia.edu> <20100709164313.GA25062@sopalepc> <20100709205726.GA7313@sopalepc> <20100712170234.GB6134@sopalepc> <4C3B66D0.2080204@ldeo.columbia.edu> <158EF01F-2DAD-4E44-948F-6A5D4B21236E@staff.uni-marburg.de> <4C3D58B7.7070506@unimelb.edu.au> Message-ID: <4C3FEEAD.2030909@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 16/07/10 14:30, Rahul Nabar wrote: > Is it possible to know how much over-committed my OS was, > say in the last one day. Or at least instantaneously. I would suggest that you may not want to run your systems overcommitted, I feel that it's much nicer for an application to have a malloc() fail than for the OOM killer to get invoked. On the topic of memory usage, the Linux kernel has been (until fairly recently) rather bad at reporting that reliably (or at least usefully). There were some recent patches that improved its memory accounting and there's a tool called "smem" which gives an interesting way of looking at things (packaged in Debian and Ubuntu): http://www.selenic.com/smem/ Not sure if it'll work on RHEL 5 though, the kernel is likely too ancient for it. > Ah! this might explain why once in a while I have a node with sshd > dead. Is it possible to tell the kernel that certain processes are > "privileged" and when it seeks to find random processes to kill it > should not select these "privileged" processes? Some candidates that > come to my mind are sshd, nagios and pbs_mom You're in luck, there was an LWN article last year which touched on this: http://lwn.net/Articles/317814/ # Users and system administrators have often asked for ways to # control the behavior of the OOM killer. To facilitate control, # the /proc//oom_adj knob was introduced to save important # processes in the system from being killed, and define an order # of processes to be killed. The possible values of oom_adj # range from -17 to +15. The higher the score, more likely the # associated process is to be killed by OOM-killer. If oom_adj # is set to -17, the process is not considered for OOM-killing. cheers! 
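So a crude way to shield the daemons you listed would be something like this, run from cron or an init script (a sketch; oom_adj is the older knob, newer kernels use oom_score_adj, and writing it needs root):

    for daemon in sshd pbs_mom; do
        for pid in $(pgrep -x "$daemon"); do
            echo -17 > /proc/$pid/oom_adj   # -17 = exempt from the OOM killer
        done
    done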
Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw/7q0ACgkQO2KABBYQAh+5dwCdH7FvlO6Fv1XP0f58r1q+0cVC YV4AniFwSLScUnqkgmE/crX+htauzx2P =DnRX -----END PGP SIGNATURE----- From john.hearns at mclaren.com Fri Jul 16 02:02:53 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Fri, 16 Jul 2010 10:02:53 +0100 Subject: [Beowulf] first cluster In-Reply-To: References: <2E2CC0F2-9A55-42EA-9C3A-CE2A782B6B1D@open-mpi.org><45B036FD-F4A1-4499-AFC2-BE6EE20A64E1@open-mpi.org><20100707191645.GA25781@sopalepc><4C35D614.1090607@ldeo.columbia.edu><20100709164313.GA25062@sopalepc><20100709205726.GA7313@sopalepc><20100712170234.GB6134@sopalepc><4C3B66D0.2080204@ldeo.columbia.edu><158EF01F-2DAD-4E44-948F-6A5D4B21236E@staff.uni-marburg.de><4C3D58B7.7070506@unimelb.edu.au> Message-ID: <68A57CCFD4005646957BD2D18E60667B1126D59D@milexchmb1.mil.tagmclarengroup.com> > > Is it possible to know how much over-committed my OS was, say in the > last one day. Or at least instantaneously. I want to see how good or > bad my user apps have been at requesting memory. Thus, if I were to > take the strict approach of memory assignment I can know in advance if > or not a lot of malloc calls are going to get a zero returned. > I'm a bit busy this morning - Tube line was down, then my Bnew Brompton had a mechanical. Performance Copilot will give you very detailed plots of various types of memory use http://oss.sgi.com/projects/pcp/ Also worth looking at your Ganglia plots - probably easier to install. As an aside, my two pence worth on this thread. To the original poster - you have done your research on what is needed for a first cluster. Take may advice, and that of a lot of people on this list, and contact a cluster vendor in your area. You will be surprised at how competitive the price is versus sourcing the parts yourself. And remember - people who build clusters are specialists in that task, you are a specialist in oceanography. Get on with doing your science, and let the cluster people get on with building you a brilliant cluster and looking after it. John Hearns McLaren Racing The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From douglas.guptill at dal.ca Fri Jul 16 07:59:09 2010 From: douglas.guptill at dal.ca (Douglas Guptill) Date: Fri, 16 Jul 2010 11:59:09 -0300 Subject: [Beowulf] first cluster In-Reply-To: <68A57CCFD4005646957BD2D18E60667B1126D59D@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B1126D59D@milexchmb1.mil.tagmclarengroup.com> Message-ID: <20100716145909.GB9850@sopalepc> On Fri, Jul 16, 2010 at 10:02:53AM +0100, Hearns, John wrote: > As an aside, my two pence worth on this thread. I agree, the topic seems to have shifted... > To the original poster - you have done your research on what is needed > for a first cluster. > Take may advice, and that of a lot of people on this list, and contact a > cluster vendor in your area. > You will be surprised at how competitive the price is versus sourcing > the parts yourself. 
> And remember - people who build clusters are specialists in that task, > you are a specialist in oceanography. We have ordered the cluster from a local builder, and expect delivery in about 4 weeks. I am a servant of the Oceanographers; my specialty is software. Thanks for the advice. I conclude there are no magic bullet solutions. One must do research, make an educated guess, and then hold one's nose and jump in at the deep end. There is one question that perplexes me, to which I have not found an answer. How does the presence of a job scheduler interact with the ability of a user to ssh to <the head node>, ssh to <a compute node>, and then type mpirun -np 64 my_application Intuition tells me there has to be something in a cluster setup, when it has a scheduler, that prevents a user from circumventing the scheduler by doing something like the above. Any hints? > John Hearns McLaren Racing BTW, congratulations on a great season this year. Regards, Douglas. -- Douglas Guptill voice: 902-461-9749 Research Assistant, LSC 4640 email: douglas.guptill at dal.ca Oceanography Department fax: 902-494-3877 Dalhousie University Halifax, NS, B3H 4J1, Canada From dag at sonsorol.org Fri Jul 16 10:01:01 2010 From: dag at sonsorol.org (Chris Dagdigian) Date: Fri, 16 Jul 2010 13:01:01 -0400 Subject: [Beowulf] first cluster In-Reply-To: <20100716145909.GB9850@sopalepc> References: <68A57CCFD4005646957BD2D18E60667B1126D59D@milexchmb1.mil.tagmclarengroup.com> <20100716145909.GB9850@sopalepc> Message-ID: <4C40904D.2010307@sonsorol.org> You want the honest answer? There are technical things you can do to prevent users from bypassing the scheduler and resource allocation policies. One of the cooler things I've seen in Grid Engine environments was a cron job that did a "kill -9" against any user process that was not a child of a sge_shepherd daemon. Very effective. Other people play games with pam settings and the like. The honest truth is that technical countermeasures are mostly a waste of time. A motivated user always has more time and effort to spend trying to game the system than an overworked administrator. My recommendation is to subject users to a cluster acceptable use policy. Any abuses of the policy are treated as a teamwork and human resources issue. The first time you screw up you get a warning; the second time you get caught, I'll send a note to your manager. After that any abuses are treated with a loss of cluster access and a referral to human resources for further action. Simply put -- you don't have enough time in the day to deal with users who want to game/abuse the system. It's far easier for all concerned to have everyone agree on a fair use policy and treat any infractions via management rather than cluster settings. This is another reason why having a cluster governance body helps a lot. A committee of cluster power users and IT staff is a great way to get consensus on queue setup, cluster policies, disk quotas and the like. They can also come down hard with peer pressure on pissy users. my $.02 -Chris Douglas Guptill wrote: > How does the presence of a job scheduler interact with the ability of a user to > ssh to <the head node>, > ssh to <a compute node>, and then type > mpirun -np 64 my_application > > Intuition tells me there has to be something in a cluster setup, when > it has a scheduler, that prevents a user from circumventing the > scheduler by doing something like the above.
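For the record, the cron-based reaper mentioned above was conceptually something like the sketch below -- reconstructed as an illustration only, assuming regular users have UIDs >= 1000 and that all legitimate compute work is parented by sge_shepherd:

    #!/bin/bash
    # illustration only: kill -9 any regular-user process on a compute node
    # that does not have an sge_shepherd ancestor
    for pid in $(ps -eo pid=); do
        uid=$(ps -o uid= -p "$pid" 2>/dev/null | tr -d ' ')
        [ -n "$uid" ] && [ "$uid" -ge 1000 ] || continue
        p=$pid ok=0
        while [ -n "$p" ] && [ "$p" -gt 1 ]; do
            [ "$(ps -o comm= -p "$p" 2>/dev/null)" = sge_shepherd ] && { ok=1; break; }
            p=$(ps -o ppid= -p "$p" 2>/dev/null | tr -d ' ')
        done
        [ "$ok" -eq 0 ] && kill -9 "$pid"
    done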
From douglas.guptill at dal.ca Fri Jul 16 10:11:29 2010 From: douglas.guptill at dal.ca (Douglas Guptill) Date: Fri, 16 Jul 2010 14:11:29 -0300 Subject: [Beowulf] first cluster In-Reply-To: References: <20100716145909.GB9850@sopalepc> Message-ID: <20100716171129.GB10537@sopalepc> On Fri, Jul 16, 2010 at 12:51:49PM -0400, Steve Crusan wrote: > We use a PAM module (pam_torque) to stop this behavior. Basically, if you > your job isn't currently running on a node, you cannot SSH into a node. > > > http://www.rpmfind.net/linux/rpm2html/search.php?query=torque-pam > > That way one is required to use the queuing system for jobs, so the cluster > isn't like the wild wild west... Ah Ha!. The key. Thanks, Douglas. > On 7/16/10 10:59 AM, "Douglas Guptill" wrote: > > > On Fri, Jul 16, 2010 at 10:02:53AM +0100, Hearns, John wrote: > > > >> As an aside, my two pence worth on this thread. > > > > I agree, the topic seems to have shifted... > > > >> To the original poster - you have done your research on what is needed > >> for a first cluster. > >> Take may advice, and that of a lot of people on this list, and contact a > >> cluster vendor in your area. > >> You will be surprised at how competitive the price is versus sourcing > >> the parts yourself. > >> And remember - people who build clusters are specialists in that task, > >> you are a specialist in oceanography. > > > > We have ordered the cluster from a local builder, and expect delivery > > in about 4 weeks. I am a servant of the Oceanographers, my specialty > > is software. > > > > Thanks for the advice. I conclude there are no magic bullet > > solutions. One must do research, make an educated guess, and then > > hold one's nose and jump in at the deep end. > > > > There is one question that perplexes me, to which I have not found an > > answer. > > > > How does the presence of a job scheduler interact with the ability of a user > > to > > ssh to , > > ssh to , and then type > > mpirun -np 64 my_application > > > > Intuition tells me there has to be something in a cluster setup, when > > it has a scheduler, that prevents a user from circumventing the > > scheduler by doing something like the above. > > > > Any hints? > > > >> John Hearns McLaren Racing > > > > BTW, congratulations on a great season this year. > > > > Regards, > > Douglas. > > > > ---------------------- > Steve Crusan > System Administrator > Center for Research Computing > University of Rochester > https://www.crc.rochester.edu/ > > -- Douglas Guptill voice: 902-461-9749 Research Assistant, LSC 4640 email: douglas.guptill at dal.ca Oceanography Department fax: 902-494-3877 Dalhousie University Halifax, NS, B3H 4J1, Canada From prentice at ias.edu Fri Jul 16 10:13:55 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 16 Jul 2010 13:13:55 -0400 Subject: [Beowulf] first cluster In-Reply-To: <20100716145909.GB9850@sopalepc> References: <68A57CCFD4005646957BD2D18E60667B1126D59D@milexchmb1.mil.tagmclarengroup.com> <20100716145909.GB9850@sopalepc> Message-ID: <4C409353.60302@ias.edu> > > There is one question that perplexes me, to which I have not found an > answer. > > How does the presence of a job scheduler interact with the ability of a user to > ssh to , > ssh to , and then type > mpirun -np 64 my_application > > Intuition tells me there has to be something in a cluster setup, when > it has a scheduler, that prevents a user from circumventing the > scheduler by doing something like the above. That is definitely a problem that must be dealt with. 
No point in having a scheduler if everyone bypasses it. There are a few ways you could do it. Here's how I do it: My cluster mounts the same NFS file systems (/home directories, /usr/local, etc.) as all the user workstations, and our more powerful multi-user 64-bit servers with lots of RAM. We call the latter 'compute servers' (just to avoid confusion on the list with compute nodes in the cluster). The compute servers are outside the cluster network, but can communicate with the head node. I use SGE, which separates the roles of compute host, submission host, and administration host (not sure if other resource managers behave the same way). A member of an SGE cluster can be any combination of these 3 things. Our compute servers are set up as submit hosts, so users can use them to compile their programs and submit jobs, and check on the status, without ever actually logging into any cluster node at all. The SSH configuration on the head node prevents anyone other than the administrative staff from logging in, and the rest of the cluster is on a private network, so the only way to run a job is by submitting a batch job through SGE. The only drawback of this system is that users cannot request interactive jobs on my cluster, but I don't see that as a very big problem, since most cluster jobs are batch jobs anyway. Not sure if you can do the same thing with other resource managers; SGE is the only one I've used. Not sure if other resource managers still rely on rsh/ssh to start jobs. If they do, that can add some complexity in configuring the cluster nodes to allow jobs to run but disallow interactive logins. -- Prentice From john.hearns at mclaren.com Fri Jul 16 10:34:48 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Fri, 16 Jul 2010 18:34:48 +0100 Subject: [Beowulf] first cluster References: <68A57CCFD4005646957BD2D18E60667B1126D59D@milexchmb1.mil.tagmclarengroup.com> <20100716145909.GB9850@sopalepc> <4C40904D.2010307@sonsorol.org> Message-ID: <68A57CCFD4005646957BD2D18E60667B09ECFECD@milexchmb1.mil.tagmclarengroup.com> -----Original Message----- From: beowulf-bounces at beowulf.org on behalf of Chris Dagdigian This is another reason why having a cluster governance body helps a lot. A committee of cluster power users and IT staff is a great way to get consensus on queue setup, cluster policies, disk quotas and the like. Talking about disk space, 'agedu' is a great tool. http://www.chiark.greenend.org.uk/~sgtatham/agedu/ I have spent a fun day running agedu and beating users over the head. Apologies if I have already flagged this one up. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From gus at ldeo.columbia.edu Fri Jul 16 10:50:04 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 16 Jul 2010 13:50:04 -0400 Subject: [Beowulf] first cluster In-Reply-To: <4C40904D.2010307@sonsorol.org> References: <68A57CCFD4005646957BD2D18E60667B1126D59D@milexchmb1.mil.tagmclarengroup.com> <20100716145909.GB9850@sopalepc> <4C40904D.2010307@sonsorol.org> Message-ID: <4C409BCC.9040202@ldeo.columbia.edu> Chris Dagdigian wrote: > You want the honest answer? > > There are technical things you can do to prevent users from bypassing > the scheduler and resource allocation policies.
One of the cooler things > I've seen in Grid Engine environments was a cron job that did a "kill > -9" against any user process that was not a child of a sge_shepherd > daemon. Very effective. > > Other people play games with pam settings and the like. > > The honest truth is that technical countermeasures are mostly a waste of > time. A motivated user always has more time and effort to spend trying > to game the system than an overworked administrator. > > My recommendation is to subject users to a cluster acceptable use > policy. Any abuses of the policy are treated as a teamwork and human > resources issue. The first time you screw up you get a warning, the > second time you get caught I'll send a note to your manager. After that > any abuses are treated with a loss of cluster access and a referral to > human resources for further action. > > Simply put -- you don't have enough time in the day to deal with users > who want to game/abuse the system. It's far easier for all concerned to > have everyone agree on a fair use policy and treat any infractions via > management rather than cluster settings. > > This is another reason why having a cluster governance body helps a lot. > A committee of cluster power users and IT staff is a great way to get > consensus on queue setup, cluster policies, disk quotas and the like. > They can also come down hard with peer pressure on pissy users. > > my $.02 > > -Chris > > Hi Chris, Douglas, list Very wise words, and match my experience here, particularly to have a small cluster committee to share the responsibility of policies and their enforcement. As Chris said, this is a not a technical issue, this is about hacking. Resource managers rely on ssh, you can tweak with it, with IP tables, with pam, launch cron jobs to kill recalcitrant behavior, etc, to prevent some stuff, but there will always be a back door to be found by those so inclined. Also, too many restrictions on the technical side may become hurdles to legitimate use of the cluster. Since Douglas is in an university, I would suggest also that when you set up new accounts, have the user agree to the general IT policies of your university. Or, as I do, just send an email to the new user telling the account is up and adding something like this: "By accepting this account you are automatically agreeing with the general IT regulations of our university, which I encourage you to read at http://your.univ.it.regulations , and to abide by any other policies established by the cluster committee and system administrators." It is a bit like that lovely paradigm of Realpolitik: "Speak softly and carry a big stick ..." (Teddy Roosevelt) :) Gus Correa > > Douglas Guptill wrote: >> How does the presence of a job scheduler interact with the ability of >> a user to >> ssh to, >> ssh to, and then type >> mpirun -np 64 my_application >> >> Intuition tells me there has to be something in a cluster setup, when >> it has a scheduler, that prevents a user from circumventing the >> scheduler by doing something like the above. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From peter.st.john at gmail.com Fri Jul 16 11:26:41 2010 From: peter.st.john at gmail.com (Peter St. 
John) Date: Fri, 16 Jul 2010 14:26:41 -0400 Subject: [Beowulf] intel mkl lapack In-Reply-To: References: Message-ID: Michael, Hopefully someone has an example call in C, but the fortran source is pretty well commented and might be helpful: http://www.netlib.org/lapack/explore-html/a01461_source.html Peter On Thu, Jul 15, 2010 at 8:31 PM, Michael Di Domenico wrote: > Does anyone have a specific C-code example of using the zgelss > function with complex numbers, which is part of the lapack libraries. > (i'm using the intel mkl) > > I'm unable to locate a specific example for this particular function call. > > I followed the (terse) API docs that come with MKL, but I'm unable to > figure what I'm doing wrong in the code. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From samuel at unimelb.edu.au Sun Jul 18 20:27:22 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Mon, 19 Jul 2010 13:27:22 +1000 Subject: [Beowulf] first cluster In-Reply-To: <20100716171129.GB10537@sopalepc> References: <20100716145909.GB9850@sopalepc> <20100716171129.GB10537@sopalepc> Message-ID: <4C43C61A.5000800@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 17/07/10 03:11, Douglas Guptill wrote: > On Fri, Jul 16, 2010 at 12:51:49PM -0400, Steve Crusan wrote: > >> > We use a PAM module (pam_torque) to stop this behavior. Basically, if you >> > your job isn't currently running on a node, you cannot SSH into a node. >> > >> > http://www.rpmfind.net/linux/rpm2html/search.php?query=torque-pam > > Ah Ha!. The key. It's worth noting that comes with Torque, you just need to configure it with the --with-pam directive. The code is in src/pam and there's a README there too. cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkxDxhoACgkQO2KABBYQAh+3jQCeJ2VTXrFqiq3RJAShaSjG0n+d mdoAnRMlt56m/6hMdIxXabxqL+prjN5p =FNZ7 -----END PGP SIGNATURE----- From tjrc at sanger.ac.uk Mon Jul 19 01:54:28 2010 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Mon, 19 Jul 2010 09:54:28 +0100 Subject: [Beowulf] first cluster In-Reply-To: <20100716171129.GB10537@sopalepc> References: <20100716145909.GB9850@sopalepc> <20100716171129.GB10537@sopalepc> Message-ID: <553F0871-8E44-4D4C-A788-4254A5E031A5@sanger.ac.uk> On 16 Jul 2010, at 6:11 pm, Douglas Guptill wrote: > On Fri, Jul 16, 2010 at 12:51:49PM -0400, Steve Crusan wrote: >> We use a PAM module (pam_torque) to stop this behavior. Basically, if you >> your job isn't currently running on a node, you cannot SSH into a node. >> >> >> http://www.rpmfind.net/linux/rpm2html/search.php?query=torque-pam >> >> That way one is required to use the queuing system for jobs, so the cluster >> isn't like the wild wild west... > > Ah Ha!. The key. It's a very neat idea, but it has the disadvantage - unless I'm misunderstanding - that if the job fails, and leaves droppings in, say, /tmp on the cluster node, the user can't log in to diagnose things or clean up after themselves. 
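One way to take the sting out of that (a sketch, not something I'm claiming we run; the argument positions follow my reading of the Torque prologue/epilogue convention, so treat it as pseudocode) is to give every job a private scratch directory that the system itself removes:

    # prologue fragment: $1 = job id, $2 = job owner (per Torque's convention)
    mkdir -p /tmp/pbstmp.$1
    chown $2 /tmp/pbstmp.$1

    # epilogue fragment: remove it regardless of how the job ended
    rm -rf /tmp/pbstmp.$1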
Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From reuti at staff.uni-marburg.de Mon Jul 19 03:46:03 2010 From: reuti at staff.uni-marburg.de (Reuti) Date: Mon, 19 Jul 2010 12:46:03 +0200 Subject: [Beowulf] first cluster In-Reply-To: <553F0871-8E44-4D4C-A788-4254A5E031A5@sanger.ac.uk> References: <20100716145909.GB9850@sopalepc> <20100716171129.GB10537@sopalepc> <553F0871-8E44-4D4C-A788-4254A5E031A5@sanger.ac.uk> Message-ID: Am 19.07.2010 um 10:54 schrieb Tim Cutts: > > On 16 Jul 2010, at 6:11 pm, Douglas Guptill wrote: > >> On Fri, Jul 16, 2010 at 12:51:49PM -0400, Steve Crusan wrote: >>> We use a PAM module (pam_torque) to stop this behavior. Basically, if you >>> your job isn't currently running on a node, you cannot SSH into a node. >>> >>> >>> http://www.rpmfind.net/linux/rpm2html/search.php?query=torque-pam >>> >>> That way one is required to use the queuing system for jobs, so the cluster >>> isn't like the wild wild west... >> >> Ah Ha!. The key. > > It's a very neat idea, but it has the disadvantage - unless I'm misunderstanding - that if the job fails, and leaves droppings in, say, /tmp on the cluster node, the user can't log in to diagnose things or clean up after themselves. Yep. With GridEngine the $TMPDIR will be removed automatically, at least when the user honors the variable. I disable ssh and rsh in my clusters except for admin staff. Normal users can use an interactive job in SGE, which is limited to a cpu time of 60 sec., if they really want to peak on the nodes. -- Reuti > > Tim > > -- > The Wellcome Trust Sanger Institute is operated by Genome Research > Limited, a charity registered in England with number 1021457 and a > company registered in England with number 2742969, whose registered > office is 215 Euston Road, London, NW1 2BE. > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hahn at mcmaster.ca Mon Jul 19 06:47:53 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 19 Jul 2010 09:47:53 -0400 (EDT) Subject: [Beowulf] first cluster In-Reply-To: <553F0871-8E44-4D4C-A788-4254A5E031A5@sanger.ac.uk> References: <20100716145909.GB9850@sopalepc> <20100716171129.GB10537@sopalepc> <553F0871-8E44-4D4C-A788-4254A5E031A5@sanger.ac.uk> Message-ID: > It's a very neat idea, but it has the disadvantage - unless I'm >misunderstanding - that if the job fails, and leaves droppings in, say, /tmp >on the cluster node, the user can't log in to diagnose things or clean up >after themselves. my organization has ~4k users (~3-500 active at any time), and does not attempt to prevent access to compute nodes by users. it just doesn't seem like a real, worth-solving problem. heck, we have more trouble with users running jobs on _login_ nodes, rather than compute notes. (many of our systems came with a pam-slurm module which did this; we remove it.) I don't think this is at all surprising. if a user groks clusters at all, they'll know that cheating is not very effective (and not very scalable) and stands a good chance of bringing trouble. those who don't grok wind up running on the login nodes (where we have fairly tight RLIMIT_AS and CPU...) regards, mark hahn. 
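PS: the login-node limits are nothing fancy, just pam_limits, e.g. (values are made-up examples, not a recommendation):

    # /etc/security/limits.conf on a login node
    # address-space cap in KB (here 2 GB) and CPU time cap in minutes
    *    hard    as     2097152
    *    hard    cpu    60
    # wildcard entries are not applied to root; add explicit entries for
    # admin accounts if your pam_limits setup needs them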
From hahn at mcmaster.ca Tue Jul 20 09:07:32 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue, 20 Jul 2010 12:07:32 -0400 (EDT) Subject: [Beowulf] compilers vs mpi? Message-ID: Hi all, I'm interested in hearing about experiences with mixing compilers between the application and MPI. that is, I would like to be able to compile MPI (say, OpenMPI) with gcc, and expect it to work correctly with apps compiled with other compilers. I guess I'm reasoning by analogy to normal distro libs. the OpenMPI FAQ has this comment: NOTE: The Open MPI team recommends using a single compiler suite whenever possible. Unexpeced or undefined behavior can occur when you mix compiler suites in unsupported ways (e.g., mixing Fortran 77 and Fortran 90 compilers between different compiler suites is almost guaranteed not to work). and there are complaints elsewhere in the FAQ about f90 bindings. I'd appreciate it if someone could help a humble C/C++/perl hacker understand the issues here... thanks, mark hahn. PS: we have a large and diverse user base, so tend to have to support gcc, intel, pathscale and pgi. we even have people who want to use intel's damned synthetic 128b FP over MPI :( From siegert at sfu.ca Tue Jul 20 10:46:55 2010 From: siegert at sfu.ca (Martin Siegert) Date: Tue, 20 Jul 2010 10:46:55 -0700 Subject: [Beowulf] compilers vs mpi? In-Reply-To: References: Message-ID: <20100720174655.GG24917@stikine.its.sfu.ca> Hi Mark, we do exactly what you describe: compile OpenMPI with the gcc suite and then use it with gcc, intel and open64 compilers. This works out-of-the-box, almost. The problem is the f90 module mpi.mod. This is (usually) a binary file and specific to the f90 compiler that was used to compile OpenMPI. But there is a way to solve this problem: 1. compile openmpi using the gcc compilers, i.e., gfortran as the Fortran compiler and install it in /usr/local/openmpi 2. move the Fortran module to the directory /usr/local/openmpi/include/gfortran. In that directory create softlinks to the files in /usr/local/openmpi/include. 3. compile openmpi using ifort and install the Fortran module (and only the Fortran module!) in /usr/local/openmpi/include/intel. In that directory create softlinks to the files in /usr/local/openmpi/include. 4. in /usr/local/openmpi/bin create softlinks mpif90.ifort and mpif90.gfortran pointing to opal_wrapper. Remove the mpif90 softlink. 5. Move /usr/local/openmpi/share/openmpi/mpif90-wrapper-data.txt to /usr/local/openmpi/share/openmpi/mpif90.ifort-wrapper-data.txt. Change the line includedir=${includedir} to: includedir=${includedir}/intel Copy the file to /usr/local/openmpi/share/openmpi/mpif90.gfortran-wrapper-data.txt and change the line includedir=${includedir} to includedir=${includedir}/gfortran 6. Create a wrapper script /usr/local/openmpi/bin/mpif90: #!/bin/bash OMPI_WRAPPER_FC=`basename $OMPI_FC 2> /dev/null` if [ "$OMPI_WRAPPER_FC" = 'gfortran' ]; then exec $0.gfortran "$@" else exec $0.ifort "$@" fi The reason we use gfortran in step 1 is that otherwise you get those irritating error messages from the Intel libraries, cf. 
http://www.open-mpi.org/faq/?category=building#intel-compiler-wrapper-compiler-w arnings Cheers, Martin -- Martin Siegert Head, Research Computing WestGrid/ComputeCanada Site Lead IT Services phone: 778 782-4691 Simon Fraser University fax: 778 782-4242 Burnaby, British Columbia email: siegert at sfu.ca Canada V5A 1S6 On Tue, Jul 20, 2010 at 12:07:32PM -0400, Mark Hahn wrote: > Hi all, > I'm interested in hearing about experiences with mixing compilers > between the application and MPI. that is, I would like to be able > to compile MPI (say, OpenMPI) with gcc, and expect it to work correctly > with apps compiled with other compilers. I guess I'm reasoning by analogy > to normal distro libs. > > the OpenMPI FAQ has this comment: > > NOTE: The Open MPI team recommends using a single compiler suite whenever > possible. Unexpeced or undefined behavior can occur when you mix compiler > suites in unsupported ways (e.g., mixing Fortran 77 and Fortran 90 compilers > between different compiler suites is almost guaranteed not to work). > > and there are complaints elsewhere in the FAQ about f90 bindings. I'd > appreciate it if someone could help a humble C/C++/perl hacker understand > the issues here... > > thanks, mark hahn. > PS: we have a large and diverse user base, so tend to have to support gcc, > intel, pathscale and pgi. we even have people who want to use intel's > damned synthetic 128b FP over MPI :( > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From gus at ldeo.columbia.edu Tue Jul 20 10:48:11 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 20 Jul 2010 13:48:11 -0400 Subject: [Beowulf] compilers vs mpi? In-Reply-To: References: Message-ID: <4C45E15B.8020702@ldeo.columbia.edu> HI Mark Mark Hahn wrote: > Hi all, > I'm interested in hearing about experiences with mixing compilers > between the application and MPI. that is, I would like to be able > to compile MPI (say, OpenMPI) with gcc, and expect it to work correctly > with apps compiled with other compilers. I guess I'm reasoning by > analogy to normal distro libs. > I haven't built OpenMPI this way, but you may try to link statically with commercial compiler libraries (say -static-intel, -Bstatic_pgi), to avoid too much mess with the user environment, when they are use a different compiler than the one underlying the MPI wrappers. > the OpenMPI FAQ has this comment: > > NOTE: The Open MPI team recommends using a single compiler suite whenever > possible. Unexpeced or undefined behavior can occur when you mix compiler > suites in unsupported ways (e.g., mixing Fortran 77 and Fortran 90 > compilers > between different compiler suites is almost guaranteed not to work). > Yes, they do recommend compiler homogeneity. However, I have built hybrids gcc+ifort and gcc+pgf90 and both work fine. (I have the homogeneous versions also.) I do not even use a different Fortran77 compiler, it is the same as Fortran90 (F77=FC). In any case, my experience is that many applications come with such messy configure/Makefile scheme (which often times refuses to use the MPI compiler wrappers properly), that no matter what you do to provide a variety of MPI builds, there are always problems to build some applications, and you need to lend a hand to the user. > and there are complaints elsewhere in the FAQ about f90 bindings. 
In my experience they are not perfect but work reasonably well. For instance, they do not check all interfaces (or whatever F90 calls the analog of C function prototypes), but if the program is correctly written, and doesn't go OO-verboard in relying that such checks will be done, things work. Fortran77 never had these features anyway, and I guess mpif77 doesn't check if you are passing an integer where it should be a real, or if your argument list is shorter than the function requires. > I'd > appreciate it if someone could help a humble C/C++/perl hacker > understand the issues here... > > thanks, mark hahn. > PS: we have a large and diverse user base, so tend to have to support > gcc, intel, pathscale and pgi. ... and don't forget Open64! :) we even have people who want to use > intel's damned synthetic 128b FP over MPI :( It's hard to keep the customer satisfied. You give them the sky, they want the universe. I hope this helps, Gus Correa > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at pbm.com Tue Jul 20 10:52:25 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 20 Jul 2010 10:52:25 -0700 Subject: [Beowulf] compilers vs mpi? In-Reply-To: References: Message-ID: <20100720175224.GC22136@bx9.net> On Tue, Jul 20, 2010 at 12:07:32PM -0400, Mark Hahn wrote: > I'm interested in hearing about experiences with mixing compilers > between the application and MPI. that is, I would like to be able > to compile MPI (say, OpenMPI) with gcc, and expect it to work correctly > with apps compiled with other compilers. I guess I'm reasoning by > analogy to normal distro libs. Everyone's C compilers are pretty much compatible. C++ and Fortran, not at all. -- greg From prentice at ias.edu Tue Jul 20 11:05:51 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 20 Jul 2010 14:05:51 -0400 Subject: [Beowulf] compilers vs mpi? In-Reply-To: References: Message-ID: <4C45E57F.2070802@ias.edu> Mark Hahn wrote: > Hi all, > I'm interested in hearing about experiences with mixing compilers > between the application and MPI. that is, I would like to be able > to compile MPI (say, OpenMPI) with gcc, and expect it to work correctly > with apps compiled with other compilers. I guess I'm reasoning by > analogy to normal distro libs. > > the OpenMPI FAQ has this comment: > > NOTE: The Open MPI team recommends using a single compiler suite whenever > possible. Unexpeced or undefined behavior can occur when you mix compiler > suites in unsupported ways (e.g., mixing Fortran 77 and Fortran 90 > compilers > between different compiler suites is almost guaranteed not to work). > > and there are complaints elsewhere in the FAQ about f90 bindings. I'd > appreciate it if someone could help a humble C/C++/perl hacker > understand the issues here... > > thanks, mark hahn. > PS: we have a large and diverse user base, so tend to have to support > gcc, intel, pathscale and pgi. we even have people who want to use > intel's damned synthetic 128b FP over MPI :( Mark, I'm not a developer, but I do spend a lot of my time compiling codes (mostly C and Fortran) for users, and I've dealt with the problem plent of times. 
Here's the library compatibility problem as I understand it: The C programming language standard defines the syntax for symbol names in libraries, so if you have a function named printf, when compiled into a library, the symbol for it will always be printf. From what I've heard, for C++ the standard isn't as strict as C, but usually doesn't cause any problems. I rarely compile C++ code, so I have no real experience with this. For Fortran, the standard doesn't define a naming convention for library symbols, so the compilers have more freedom with symbol naming. Usually, the symbol names will have zero, one, or two underscores before/after the symbol name, and they can be in all caps or all lower case. This is why gfortran has these options: -fno-underscoring -fsecond-underscore -fcase-lower In theory, these options should work, but they only really work if you're using one compiler to link against libraries compiled by only one other compiler (linking with gfortran to ifort-compiled libraries, for example). Once you have libraries from two different compilers, the libraries from one compiler might need -fno-underscoring, while the other needs -fcase-lower and -fsecond-underscore. I highly recommend you read the man pages for g77 and/or gfortran where these switches are explained. There is a much better explanation of why they're needed there. From my experience, it's easier to just compile the libraries again using a different compiler suite, and put them in a separate location to make it clear. For example, I have compiled Open MPI with GNU, Intel, and PGI compilers:

$ pwd
/usr/local/openmpi
$ ls -ld *
lrwxrwxrwx 1 root root 9 Jul 17 2009 gcc -> gcc-4.1.2
drwxr-xr-x 3 root root 4096 Feb 10 2009 gcc-4.1.2
lrwxrwxrwx 1 root root 8 Jul 17 2009 intel -> intel-11
drwxr-xr-x 3 root root 4096 Feb 5 2009 intel-11
lrwxrwxrwx 1 root root 7 Jul 17 2009 pgi -> pgi-8.0
drwxr-xr-x 3 root root 4096 Jan 28 2009 pgi-8.0

As long as users specify the correct paths to include and library files in their compiler commands, they can compile using whatever compiler they want. To save work, I only do this for libraries that I absolutely know that users will be using with Fortran. -- Prentice

From siegert at sfu.ca Tue Jul 20 12:13:03 2010 From: siegert at sfu.ca (Martin Siegert) Date: Tue, 20 Jul 2010 12:13:03 -0700 Subject: [Beowulf] compilers vs mpi? In-Reply-To: References: <20100720174655.GG24917@stikine.its.sfu.ca> Message-ID: <20100720191303.GI24917@stikine.its.sfu.ca>

On Tue, Jul 20, 2010 at 02:25:32PM -0400, Mark Hahn wrote:
>> we do exactly what you describe: compile OpenMPI with the gcc suite
>> and then use it with gcc, intel and open64 compilers.
>
> nice.
>
>> This works out-of-the-box, almost.
>> The problem is the f90 module mpi.mod. This is (usually) a binary
>> file and specific to the f90 compiler that was used to compile
>> OpenMPI. But there is a way to solve this problem:
>
> ah, .mod files. we have for the most part ignored them entirely.
> (what are we losing by doing that?)

For starters, a program with a "use mpi" statement won't compile if you don't have mpi.mod.

>> 1. compile openmpi using the gcc compilers, i.e., gfortran as the Fortran
>> compiler and install it in /usr/local/openmpi
>
> this is perhaps a tangent, but we install everything we support
> under /opt/sharcnet/$packagebasename/$ver. for openmpi, we've had to bodge
> the compiler flavor onto that (/opt/sharcnet/openmpi/1.4.2/intel).
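To make the per-compiler install layout above concrete, here is a minimal sketch of how the two builds might be driven through OpenMPI's configure. The version number, prefixes and symlink names are made up for illustration, and the gcc+ifort "hybrid" is simply the combination mentioned earlier in the thread, not a recommendation from either poster.

#!/bin/bash
# Sketch only: build OpenMPI into versioned, per-compiler prefixes,
# mirroring the /usr/local/openmpi/{gcc,intel,...} layout shown above.
set -e
VER=1.4.2
cd openmpi-$VER

# 1) plain GNU build
./configure --prefix=/usr/local/openmpi-$VER-gcc \
            CC=gcc CXX=g++ F77=gfortran FC=gfortran
make -j4 && make install
make distclean

# 2) "hybrid" build: gcc for C/C++, ifort for Fortran
#    (the gcc+ifort combination mentioned earlier in the thread)
./configure --prefix=/usr/local/openmpi-$VER-intel \
            CC=gcc CXX=g++ F77=ifort FC=ifort
make -j4 && make install

# convenience symlinks, one per compiler flavor
mkdir -p /usr/local/openmpi
ln -sfn /usr/local/openmpi-$VER-gcc   /usr/local/openmpi/gcc
ln -sfn /usr/local/openmpi-$VER-intel /usr/local/openmpi/intel

Users then pick a flavor by putting the matching bin directory first in their PATH (or via an environment module), which is essentially what the wrapper tricks discussed next automate.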
(I simplified the description: we install in /usr/local/openmpi-version and then create a softlink /usr/local/openmpi that points to the current default version) >> 2. move the Fortran module to the directory >> /usr/local/openmpi/include/gfortran. In that directory >> create softlinks to the files in /usr/local/openmpi/include. >> 3. compile openmpi using ifort and install the Fortran module (and only >> the Fortran module!) in /usr/local/openmpi/include/intel. In that >> directory create softlinks to the files in /usr/local/openmpi/include. > > I guess I'm surprised this works - aren't you effectively assuming that the > intel and gfortran interfaces are compatible here? that is, the app > compiles with the compiler-specific module, which basically promises a > particular type-safe interface (signature) for MPI functions, but then the > linker just glues them together without any way to verify the signature > compatibility... > > am I misunderstanding? The (default) name mangling scheme of gfortran, ifort, openf90 is the same: append a single underscore. As soon as a user uses -fno-underscoring or -fsecond-underscore nothing would work anymore. So, don't do that. We have had not a single user who tried to change the default name mangling scheme, thus this is not really a problem. We no longer support g77 (which does have a different default). As far as I understand the type checking with respect to mpi.mod is done at compile time. That's why you need to have the correct fortran module when compiling. At link time the linker only needs to find the library routines like mpi_send_ which is a C wrapper routine anyway that just calls MPI_Send. >> 4. in /usr/local/openmpi/bin create softlinks mpif90.ifort >> and mpif90.gfortran pointing to opal_wrapper. Remove the >> mpif90 softlink. >> 5. Move /usr/local/openmpi/share/openmpi/mpif90-wrapper-data.txt >> to /usr/local/openmpi/share/openmpi/mpif90.ifort-wrapper-data.txt. >> Change the line includedir=${includedir} to: >> includedir=${includedir}/intel >> Copy the file to >> /usr/local/openmpi/share/openmpi/mpif90.gfortran-wrapper-data.txt >> and change the line includedir=${includedir} to >> includedir=${includedir}/gfortran >> 6. Create a wrapper script /usr/local/openmpi/bin/mpif90: >> >> #!/bin/bash >> OMPI_WRAPPER_FC=`basename $OMPI_FC 2> /dev/null` >> if [ "$OMPI_WRAPPER_FC" = 'gfortran' ]; then >> exec $0.gfortran "$@" >> else >> exec $0.ifort "$@" >> fi > > this is a tangent, but perhaps interesting. we don't use the wrappers from > the MPI package, but rather our own single wrapper which has some > built-in intelligence (augmented by info from the compiler's (environment) > module.) Again, I simplified. We use, e.g., compiler env modules to set env. variables like OMPI_FC. >> The reason we use gfortran in step 1 is that otherwise you get those >> irritating error messages from the Intel libraries, cf. >> http://www.open-mpi.org/faq/?category=building#intel-compiler-wrapper-compiler-warnings > > hmm. we work around those by manipulating the link arguments in our wrapper. Does that work when using gfortran to link with libraries compiled with icc/ifort? Anyway, it appeared to be an unnecessary complication as the fortran compiler has no affect on the performance of the MPI distribution; it is only really needed for compiling the f90 module (and for configure to determine the name mangling scheme). 
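Since everything above hinges on the compilers agreeing on that single trailing underscore, it is worth checking directly rather than assuming it. A minimal sketch, assuming gfortran and ifort are on $PATH (extend the list to whatever Fortran compilers you actually support):

#!/bin/bash
# Show the Fortran name-mangling convention each compiler uses by
# inspecting the symbol it actually emits for a test subroutine.
# With the defaults discussed above, expect a single trailing
# underscore (my_test_sub_) from gfortran, ifort, openf90, pgf90, ...
cat > mangle.f90 <<'EOF'
subroutine my_test_sub(x)
  implicit none
  real :: x
  x = 2.0 * x
end subroutine my_test_sub
EOF

for fc in gfortran ifort; do
    echo "== $fc =="
    $fc -c -o mangle_$fc.o mangle.f90 && nm mangle_$fc.o | grep -i my_test_sub
done

If a user sneaks in -fno-underscoring or -fsecond-underscore, this is where it shows up, and the final link will typically fail with unresolved MPI symbols.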
- Martin From hahn at mcmaster.ca Tue Jul 20 11:54:59 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue, 20 Jul 2010 14:54:59 -0400 (EDT) Subject: [Beowulf] compilers vs mpi? In-Reply-To: <4C45E15B.8020702@ldeo.columbia.edu> References: <4C45E15B.8020702@ldeo.columbia.edu> Message-ID: >> between the application and MPI. that is, I would like to be able >> to compile MPI (say, OpenMPI) with gcc, and expect it to work correctly >> with apps compiled with other compilers. I guess I'm reasoning by analogy >> to normal distro libs. >> > > I haven't built OpenMPI this way, > but you may try to link statically with commercial compiler libraries > (say -static-intel, -Bstatic_pgi), I'd rather build with gcc if possible. I guess I'd be surprised if there were compute-intensive-enough parts of MPI to justify using some other compiler. (please, if anyone has any quantitative observations on the quality of current compilers, let me/list know!) > Yes, they do recommend compiler homogeneity. > However, I have built hybrids gcc+ifort > and gcc+pgf90 and both work fine. > (I have the homogeneous versions also.) oh. so the idea here is that the C part of OpenMPI has an ABI which is compatible with basically all the other C compilers, such as would be used to compile app-side code. but that the fortran side has to be matched, library and app sides? if that's the case, then would it make sense to factor out the fortran interface? > Fortran77 never had these features anyway, and I guess > mpif77 doesn't check if you are passing an integer > where it should be a real, or if your argument list is shorter > than the function requires. so if I have f90 code that uses an mpi header (not .mod interface), does that mean there's no function signature checking at all? as far as I know, my organization has never done .mod-based MPI, so maybe this is why we're facing the issue now, after 10 years and 4k users ;) >> PS: we have a large and diverse user base, so tend to have to support gcc, >> intel, pathscale and pgi. > > ... and don't forget Open64! :) well, that's an interesting point. I haven't quite figured out who is doing the canonical release for Open64 nowadays (highest ver number seems to be from AMD). have you done any comparisons? >> we even have people who want to use >> intel's damned synthetic 128b FP over MPI :( > > It's hard to keep the customer satisfied. > You give them the sky, they want the universe. for me, the real problem is knowing whether the user understands that synthetic 128b FP is drastically slower than 64b hardware FP. has anyone tried to do a comparison? thanks, mark. From hahn at mcmaster.ca Tue Jul 20 11:25:32 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue, 20 Jul 2010 14:25:32 -0400 (EDT) Subject: [Beowulf] compilers vs mpi? In-Reply-To: <20100720174655.GG24917@stikine.its.sfu.ca> References: <20100720174655.GG24917@stikine.its.sfu.ca> Message-ID: > we do exactly what you describe: compile OpenMPI with the gcc suite > and then use it with gcc, intel and open64 compilers. nice. > This works out-of-the-box, almost. > The problem is the f90 module mpi.mod. This is (usually) a binary > file and specific to the f90 compiler that was used to compile > OpenMPI. But there is a way to solve this problem: ah, .mod files. we have for the most part ignored them entirely. (what are we losing by doing that?) > 1. 
compile openmpi using the gcc compilers, i.e., gfortran as the Fortran > compiler and install it in /usr/local/openmpi this is perhaps a tangent, but we install everything we support under /opt/sharcnet/$packagebasename/$ver. for openmpi, we've had to bodge the compiler flavor onto that (/opt/sharcnet/openmpi/1.4.2/intel). > 2. move the Fortran module to the directory > /usr/local/openmpi/include/gfortran. In that directory > create softlinks to the files in /usr/local/openmpi/include. > 3. compile openmpi using ifort and install the Fortran module (and only > the Fortran module!) in /usr/local/openmpi/include/intel. In that > directory create softlinks to the files in /usr/local/openmpi/include. I guess I'm surprised this works - aren't you effectively assuming that the intel and gfortran interfaces are compatible here? that is, the app compiles with the compiler-specific module, which basically promises a particular type-safe interface (signature) for MPI functions, but then the linker just glues them together without any way to verify the signature compatibility... am I misunderstanding? > 4. in /usr/local/openmpi/bin create softlinks mpif90.ifort > and mpif90.gfortran pointing to opal_wrapper. Remove the > mpif90 softlink. > 5. Move /usr/local/openmpi/share/openmpi/mpif90-wrapper-data.txt > to /usr/local/openmpi/share/openmpi/mpif90.ifort-wrapper-data.txt. > Change the line includedir=${includedir} to: > includedir=${includedir}/intel > Copy the file to > /usr/local/openmpi/share/openmpi/mpif90.gfortran-wrapper-data.txt > and change the line includedir=${includedir} to > includedir=${includedir}/gfortran > 6. Create a wrapper script /usr/local/openmpi/bin/mpif90: > > #!/bin/bash > OMPI_WRAPPER_FC=`basename $OMPI_FC 2> /dev/null` > if [ "$OMPI_WRAPPER_FC" = 'gfortran' ]; then > exec $0.gfortran "$@" > else > exec $0.ifort "$@" > fi this is a tangent, but perhaps interesting. we don't use the wrappers from the MPI package, but rather our own single wrapper which has some built-in intelligence (augmented by info from the compiler's (environment) module.) > The reason we use gfortran in step 1 is that otherwise you get those > irritating error messages from the Intel libraries, cf. > http://www.open-mpi.org/faq/?category=building#intel-compiler-wrapper-compiler-warnings hmm. we work around those by manipulating the link arguments in our wrapper. thanks! -mark From niftyompi at niftyegg.com Tue Jul 20 13:28:43 2010 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Tue, 20 Jul 2010 13:28:43 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <76097BB0C025054786EFAB631C4A2E3C09331921@MERCMBX04R.na.SAS.com> References: <386oFdNJv5776S01.1277903447@web01.cms.usa.net> <76097BB0C025054786EFAB631C4A2E3C09331921@MERCMBX04R.na.SAS.com> Message-ID: <20100720202843.GA20023@tosh2egg.ca.sanfran.comcast.net> On Wed, Jun 30, 2010 at 03:11:54PM +0000, Bill Rankin wrote: > > > I think the money part will be difficult to get (it is like a > > politically > > incorrect question). > > Joe addressed this pretty well. For the large systems, it's almost always under NDA. > And then there is always the issue of timing. Some groups or departments obtain hardware used from another project often at $1.00 and other consideration prices. Divide by zero rules and NAN specters should scare ya. Then there are the aggressive pricing curves that processors suffer. 
Six months later the same hardware can sometimes be purchased at a substantial discount, such that a cluster bought then could have a price per teraflops that is 40% of that of an identical system bought earlier. And then the question is in dollars: with international exchange rates, other computations come into play, making historic comparisons interesting. -- T o m M i t c h e l l Found me a new hat, now what?

From gus at ldeo.columbia.edu Tue Jul 20 13:43:47 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 20 Jul 2010 16:43:47 -0400 Subject: [Beowulf] compilers vs mpi? In-Reply-To: References: <4C45E15B.8020702@ldeo.columbia.edu> Message-ID: <4C460A83.4010906@ldeo.columbia.edu>

Hi Mark

Mark Hahn wrote:
>>> between the application and MPI. that is, I would like to be able
>>> to compile MPI (say, OpenMPI) with gcc, and expect it to work
>>> correctly with apps compiled with other compilers. I guess I'm
>>> reasoning by analogy to normal distro libs.
>>>
>>
>> I haven't built OpenMPI this way,
>> but you may try to link statically with commercial compiler libraries
>> (say -static-intel, -Bstatic_pgi),
>
> I'd rather build with gcc if possible. I guess I'd be surprised if
> there were compute-intensive-enough parts of MPI to justify using some
> other compiler.

You are probably right, and gcc is so germane to Linux, why bother with other C compilers unless they are significantly faster? I have builds with icc and pgcc to avoid trouble and too much digging into Makefiles, etc., to fix things when the code or the user prefers to use the commercial compiler.

> (please, if anyone has any quantitative observations on
> the quality of current compilers, let me/list know!)
>

I guess this candid question may spark yet another war. I don't have direct comparisons. I remember some discussion some time ago, I don't remember where, on how memcpy is done in gcc vs. icc, and how efficient each one is. This is presumably important for MPI, as memcpy is likely to be at the base of all "intra-node" MPI communication.

>> Yes, they do recommend compiler homogeneity.
>> However, I have built hybrids gcc+ifort
>> and gcc+pgf90 and both work fine.
>> (I have the homogeneous versions also.)
>
> oh. so the idea here is that the C part of OpenMPI has an ABI
> which is compatible with basically all the other C compilers,
> such as would be used to compile app-side code. but that the fortran
> side has to be matched, library and app sides? if that's the case,
> then would it make sense to factor out the fortran interface?

I don't know the guts of OpenMPI, but I believe the Fortran 77 and 90 interfaces build on top of the C interface. I don't really know if the OpenMPI ABI is compatible across all C compilers. Since most of the code here is Fortran, with a few tidbits of C, I try to provide a variety of MPI builds for the commercial compilers around (plus gfortran, and now openf90, which I have yet to test). Some programs just refuse to compile with one commercial compiler, but may compile with the other. Short of modifying the code (which we often have to do), this creates the need for several MPI compiler wrappers, built with different Fortran compilers. I haven't seen this happen with C programs, but as I said, there is not much C code in our area. I didn't mean to factor Fortran out, although your interpretation of it is interesting.
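Returning to the memcpy point above for a moment: one crude way to compare toolchains is simply to time a large memcpy loop built with each compiler. This is a rough sketch, not a rigorous benchmark; results depend heavily on buffer size, alignment, cache effects and the libc in use, and the compiler list is just an example.

#!/bin/bash
# Crude memcpy bandwidth test to compare toolchains; sketch only.
cat > memcpy_bw.c <<'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

int main(void)
{
    const size_t n = 64UL * 1024 * 1024;   /* 64 MB per buffer */
    const int reps = 20;
    char *src = malloc(n), *dst = malloc(n);
    if (!src || !dst) { perror("malloc"); return 1; }
    memset(src, 1, n);                      /* touch the pages first */
    memset(dst, 0, n);

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < reps; i++)
        memcpy(dst, src, n);
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + 1e-6 * (t1.tv_usec - t0.tv_usec);
    /* read dst so the copies cannot be optimized away */
    printf("%.2f GB/s (check byte: %d)\n",
           reps * (double)n / secs / 1e9, dst[n - 1]);
    free(src);
    free(dst);
    return 0;
}
EOF

for cc in gcc icc; do                # use whatever C compilers you have
    $cc -O2 -std=c99 -o memcpy_bw_$cc memcpy_bw.c &&
        { echo -n "$cc: "; ./memcpy_bw_$cc; }
done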
The gcc+some_fortran hybrids I built were mostly because: 1) we didn't have icc for a while (only funds to buy ifort), although more recently we bought the whole compiler suite; 2) pgcc for a while had trouble building OpenMPI, although the problem is now gone. Very mundane reasons, but the hybrids work. > >> Fortran77 never had these features anyway, and I guess >> mpif77 doesn't check if you are passing an integer >> where it should be a real, or if your argument list is shorter >> than the function requires. > > so if I have f90 code that uses an mpi header (not .mod interface), > does that mean there's no function signature checking at all? > as far as I know, my organization has never done .mod-based MPI, > so maybe this is why we're facing the issue now, after 10 years and 4k > users ;) > There is quite a bit of code here written with Fortran90 constructs, but that has "#include mpif.h", instead of "use mpi", and some with "use mpi". I think this is for historic reasons, because the MPI F90 interface may not have been very good in the past. The mpif90 wrapper compiles both cases. If I remember right, the mpif77 (as long as it is built with F77=FC=[ifort,pgf90,gfortran]) also compiles the first case (because the underlying compiler is actually a F90 compiler), but not the second. Of course, you need to build the MPI Fortran90 interface to do "use mpi", and you must use mpif90 in this case. If I remember right (somebody please correct me if I am wrong), the MPI subroutine/function calls are the same in F77 and F90, the main difference is that the MPI F90 bindings add type and interface checking, the OO-universe that is absent in F77. >>> PS: we have a large and diverse user base, so tend to have to support >>> gcc, intel, pathscale and pgi. >> >> ... and don't forget Open64! :) > > well, that's an interesting point. I haven't quite figured out who is > doing > the canonical release for Open64 nowadays (highest ver number seems to > be from AMD). have you done any comparisons? > Well, I built OpenMPI 1.4.2 with Open64, tested the basic functionality, but I have yet to find time to compile and run one or two atmosphere/climate/ocean codes here with the Open64 OpenMPI wrappers to see if it outperforms builds with the other compilers. I am curious about this one because we have Opteron quad-core, and I was wondering if the AMD-sponsored compiler would do better than the Intel compiler (which doesn't let me use anything beyond SSE2, -xW on the Opterons, if I remember right). Unfortunately, this type of comparison can take quite some time, if you try to tweak with optimization, check if the results are OK (in IA64 I had some bad surprises with hidden/bundled optimization flags), test also with MVAPICH2, and so on. I can't possibly test everything, I have production runs to do, I am also one of my users! >>> we even have people who want to use >>> intel's damned synthetic 128b FP over MPI :( >> >> It's hard to keep the customer satisfied. >> You give them the sky, they want the universe. > > for me, the real problem is knowing whether the user understands that > synthetic 128b FP is drastically slower than 64b hardware FP. > has anyone > tried to do a comparison? > > thanks, mark. From what I observe here, the primary level of astonishment and satisfaction for most users is: "It works!". "It runs faster than on my laptop." comes later, if ever. Only a few users try to compare. If you gave them a functional synthetic 128b FP you may have already accomplished a lot. 
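On Mark's question of how much slower synthetic 128-bit FP is than hardware doubles: a quick way to get a ballpark number is below. It uses gcc's __float128, which is software-emulated on x86-64, as a stand-in; whether that behaves like the Intel "128b FP" mentioned in the thread is an assumption, and this is a toy loop, not a real workload.

#!/bin/bash
# Ballpark comparison of hardware double vs software 128-bit FP.
# Assumes a reasonably recent gcc on x86-64 (__float128 support).
cat > fp128.c <<'EOF'
#include <stdio.h>
#include <sys/time.h>

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

#define N 20000000L

int main(void)
{
    double t, d = 1.0;
    __float128 q = 1.0;

    t = now();
    for (long i = 1; i <= N; i++)
        d = d * 1.0000001 + 1.0 / i;
    printf("double     : %6.3f s  (result %g)\n", now() - t, d);

    t = now();
    for (long i = 1; i <= N; i++)
        q = q * (__float128)1.0000001 + (__float128)1.0 / i;
    printf("__float128 : %6.3f s  (result %g)\n", now() - t, (double)q);

    return 0;
}
EOF

gcc -O2 -std=gnu99 -o fp128 fp128.c && ./fp128

Expect the 128-bit loop to be dramatically slower (often an order of magnitude or more), but run something like this on your own hardware before quoting a number to users.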
In the specific case of 128-bit arithmetic I wonder if you can make it run fast on 64-bit machines. There was a discussion, maybe here, a few weeks ago about this, right? Or was it in one of the MPI lists? It was about why one would need 128-bit arithmetic, and whether this would be more of an issue with a possibly poor/noisy algorithm/numerics. Long ago I banned Matlab from our old cluster, because of abusive behavior on the head node, etc. Recently I set it up again to run in batch mode on the compute nodes. I thought that would be a reasonable compromise, and drive some people to run their heavy Matlab calculations in the cluster, instead of on their desktops. (A lot of post-processing of climate data is done in Matlab.) Well, nobody got interested. Trying to please users beyond their strict requests, especially when this requires changing their habits, may not necessarily work. My $0.02 Gus Correa

From niftyompi at niftyegg.com Tue Jul 20 15:29:32 2010 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Tue, 20 Jul 2010 15:29:32 -0700 Subject: [Beowulf] compilers vs mpi? In-Reply-To: References: Message-ID: <20100720222932.GA22712@tosh2egg.ca.sanfran.comcast.net>

On Tue, Jul 20, 2010 at 12:07:32PM -0400, Mark Hahn wrote:
>
> Hi all,
> I'm interested in hearing about experiences with mixing compilers
> between the application and MPI. that is, I would like to be able
> to compile MPI (say, OpenMPI) with gcc, and expect it to work
> correctly with apps compiled with other compilers. I guess I'm
> reasoning by analogy to normal distro libs.
>
> the OpenMPI FAQ has this comment:
>
> NOTE: The Open MPI team recommends using a single compiler suite whenever
> possible. Unexpected or undefined behavior can occur when you mix compiler
> suites in unsupported ways (e.g., mixing Fortran 77 and Fortran 90 compilers
> between different compiler suites is almost guaranteed not to work).
>
> and there are complaints elsewhere in the FAQ about f90 bindings.
> I'd appreciate it if someone could help a humble C/C++/perl hacker
> understand the issues here...
>
> thanks, mark hahn.
> PS: we have a large and diverse user base, so tend to have to
> support gcc, Intel, pathscale and pgi. we even have people who want
> to use intel's damned synthetic 128b FP over MPI :(

Some of this is historic and has been addressed transparently in OpenMPI. OFED pulls from OpenMPI, I believe... MPICH, MPICH2 is unknown to me. Different compilers have the option of representing things differently. One example is Fortran's notion of True/False: there are two conventions, .TRUE. stored as 1 or as -1 (with .FALSE. as 0), i.e., a choice was made in the space -1:0:1; you can test your compiler set with a debugger and a short test code. Depending on some logic reductions, some things might work in code that breaks at a different optimization level. Two other places to double check are: strings and arg() handling. The older Pathscale/QLogic MPI had libs with symbol handling magic that could make most of this transparent via mpicc and friends. The OpenMPI folk did the same thing differently, if I recall. Synthetic 128b is unknown to me. C++ bindings can be more difficult, and each compiler should be used to generate bindings as needed, perhaps based on the OpenMPI source/makefiles. Some caution is justified as the link line gets longer and longer and the users pull in this GCC bit, a PGI-built lib, an Intel lib, goto-BLAS (pick one), etc... Summary: booleans, args(), and strings times a list of compilers can generate a pile of permutations....
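The "short test code" for the .TRUE./.FALSE. conventions can be as small as this. TRANSFER() shows the bit pattern of a default LOGICAL as a default INTEGER (assuming both are four bytes, which is the usual default), and the compiler list is just an example.

#!/bin/bash
# Print the internal representation of .true./.false. for each Fortran
# compiler, per the 0:1 vs 0:-1 conventions mentioned above.
cat > trueval.f90 <<'EOF'
program trueval
  implicit none
  print *, 'bit pattern of .true.  as integer:', transfer(.true.,  0)
  print *, 'bit pattern of .false. as integer:', transfer(.false., 0)
end program trueval
EOF

for fc in gfortran ifort pgf90; do   # whatever compilers you support
    echo "== $fc =="
    $fc -o trueval_$fc trueval.f90 && ./trueval_$fc
done

This matters when a library built by one compiler returns a LOGICAL that code built by another compiler then tests: a value that is "true" under one convention can be misread under the other, and optimization can change how the test is reduced, which is presumably the effect described above.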
-- T o m M i t c h e l l Found me a new hat, now what? From mathog at caltech.edu Wed Jul 21 13:26:17 2010 From: mathog at caltech.edu (David Mathog) Date: Wed, 21 Jul 2010 13:26:17 -0700 Subject: [Beowulf] Re: OT: recoverable optical media archive format? Message-ID: I wasn't thrilled with the limitations of rsbep and eventually wrote a program rsbd (Reed-Solomon for Block Devices) to do this. rsbd reuses the RS encoding/decoding routines from the rsbep distribution (written by Phil Karn) but the rest is new code. rsbd uses message digests (SHA1) that let it skip the RS decode step on data that is not corrupted, which speeds up "decode" a lot. It also keeps track of erasures and so can restore 32 erasures (rsbep is limited to 16) in a block of 255 bytes. rsbd does everything it can to verify data integrity, and either aborts on error or optionally slogs on anyway while noting the locations of bad output. I do not believe there are any cases where it will output bad data and not tell you. (Could be a bug somewhere though.) rsbd can be retrieved from rsbd.sourceforge.net. Here is an example (uses one core on a dual Opteron 280, single SATA disk on the machine) that corresponds roughly to the size of a DVD-R: % cat test | time rsbd -e >test.rsbd 90.98user 14.36system 3:04.17elapsed 57%CPU % cat test.rsbd | time rsbd -d -c >restored Output size: 4056879025 input size: 4648980480 Input blocks total: 9080040 Input blocks erased: 0 Neighborhoods processed: 4451 Sections processed: 35602 Sections Spec. Blk verified: ddgst good: 35602 Sections Spec. Blk verified: ddgst bad: 0 Sections Spec. Blk verified: RS: ddgst good: 0 Sections Spec. Blk verified: RS: ddgst bad: 0 Sections Spec. Blk reverified: ddgst good: 0 Sections Spec. Blk reverified: ddgst bad: 0 Sections Spec. Blk reverified: RS: ddgst good: 0 Sections Spec. Blk reverified: RS: ddgst bad: 0 Sections Spec. Blk corrupt: ddgst good: 0 Sections Spec. Blk corrupt: ddgst bad: 0 Sections Spec. Blk corrupt: RS: ddgst good: 0 Sections Spec. Blk corrupt: RS: ddgst bad: 0 RSblks total: 18192622 RSblks clean: 18192622 RSblks corrected: 0 RSblks excess erasures: 0 RSblks uncorrectable: 0 RSblks avg corr. bytes: 0 RSblks max corr. bytes: 0 50.24user 12.43system 2:28.60elapsed 42%CPU % cat test.rsbd | \ pockmark -bs 512 -maxgap 4000 -maxrun 40 > test.rsbd.pox % cat test.rsbd.pox | time rsbd -d -c >restored Output size: 4056879025 input size: 4648980480 Input blocks total: 9080040 Input blocks erased: 91639 Neighborhoods processed: 4451 Sections processed: 35602 Sections Spec. Blk verified: ddgst good: 12608 Sections Spec. Blk verified: ddgst bad: 17152 Sections Spec. Blk verified: RS: ddgst good: 17152 Sections Spec. Blk verified: RS: ddgst bad: 0 Sections Spec. Blk reverified: ddgst good: 0 Sections Spec. Blk reverified: ddgst bad: 5842 Sections Spec. Blk reverified: RS: ddgst good: 5842 Sections Spec. Blk reverified: RS: ddgst bad: 0 Sections Spec. Blk corrupt: ddgst good: 0 Sections Spec. Blk corrupt: ddgst bad: 0 Sections Spec. Blk corrupt: RS: ddgst good: 0 Sections Spec. Blk corrupt: RS: ddgst bad: 0 RSblks total: 18192622 RSblks clean: 6454984 RSblks corrected: 11737638 RSblks excess erasures: 0 RSblks uncorrectable: 0 RSblks avg corr. bytes: 3.7 RSblks max corr. bytes: 19 246.49user 11.73system 5:10.14elapsed 83%CPU % md5sum test restored b22a361554771045df4424e547eaa558 restored b22a361554771045df4424e547eaa558 test >From the above one can see that it doesn't waste time doing RS decoding unless it needs to. 
Consequently the decode on a file which isn't corrupt runs faster than the encode. Somewhat more on subject for this group, the current version of rsbd is completely single threaded. There is plenty of room here for parallelization. For instance, for each "neighborhood" the sha1 digests are independent and can be done 8 at a time, the RS encode/decode are performed 8*511 times (all of which are completely independent), the XOR step is performed on blocks of 512 consecutive bytes 4096*255/512 times, all independent. However, the [255,4096] <-> [4096,255] transpose of a byte array, once per neighborhood, isn't going to be as trivial to split into threads, and that could be rate limiting. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From madskaddie at gmail.com Sat Jul 24 11:18:15 2010 From: madskaddie at gmail.com (madskaddie at gmail.com) Date: Sat, 24 Jul 2010 19:18:15 +0100 Subject: [Beowulf] first cluster In-Reply-To: References: <20100716145909.GB9850@sopalepc> <20100716171129.GB10537@sopalepc> <553F0871-8E44-4D4C-A788-4254A5E031A5@sanger.ac.uk> Message-ID: I manage a small cluster with a central image for the execution hosts (fully decoupled from the master/ login nodes). To deal with direct access to nodes: - Every user has an "*" on the password field of the /etc/shadow file in the execution hosts images - Access through ssh to the exec hosts is enabled to work only with passwords (no certificate files) - Direct access to nodes: gridengine's (GE) qrsh - MPI via GE parallel environments Things to be solved: - Monitoring of the resources usage; Now is only possible to query by using GE qhost or looking at ganglia. But the latency is quite high :/ (anything above instantaneous is high latency) - Administration can be boring sometimes because I need to input the password. I'll study a bit of PAM rules to bypass or learn the tcl Expect tool (or equivalent libs in other languages) Gil Brand?o -- " It can't continue forever. The nature of exponentials is that you push them out and eventually disaster happens. " Gordon Moore? (Intel co-founder and author of the Moore's law) From john.hearns at mclaren.com Mon Jul 26 07:40:26 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Mon, 26 Jul 2010 15:40:26 +0100 Subject: [Beowulf] Top of the Green 500 Message-ID: <68A57CCFD4005646957BD2D18E60667B114134DA@milexchmb1.mil.tagmclarengroup.com> http://www.engadget.com/2010/07/13/tokyo-universitys-grape-dr-supercompu ter-is-a-tangled-green-pow/ The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From hahn at mcmaster.ca Mon Jul 26 10:24:06 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 26 Jul 2010 13:24:06 -0400 (EDT) Subject: [Beowulf] Top of the Green 500 In-Reply-To: <68A57CCFD4005646957BD2D18E60667B114134DA@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B114134DA@milexchmb1.mil.tagmclarengroup.com> Message-ID: > http://www.engadget.com/2010/07/13/tokyo-universitys-grape-dr-supercompu > ter-is-a-tangled-green-pow/ kind of weird - it would be interesting to get some first-person commentary. I've got nothing against wire racks, not even against somewhat messy cabling. (the goal of cabling is to _work_, while being reasonably maintainable, etc. 
labeling is great, but I've seen some _abominations_ that were very neat but impossible to maintain. not to mention cables so tidily tie-wrapped that they exceeded their min mend radius...) the open-air cooling is also pretty amateur-looking. I guess one benefit of going green is that you can be sloppy about airflow ;) From mathog at caltech.edu Mon Jul 26 12:46:30 2010 From: mathog at caltech.edu (David Mathog) Date: Mon, 26 Jul 2010 12:46:30 -0700 Subject: [Beowulf] Re: Top of the Green 500 Message-ID: John Hearns wrote: > > http://www.engadget.com/2010/07/13/tokyo-universitys-grape-dr-supercompu > ter-is-a-tangled-green-pow/ > Mysterious ventilation in that room. Near as I can tell the floor, ceiling, and back (left) wall are solid. The only other wall looks a bit like a shoji screen, but maybe that is filter material and air moves in or out through that wall? (Arguing against that is the book case which would then be impeding the flow, and of course also that the racks would then be parallel to the flow.) The one visible air duct is a narrow slit above the right most rack, which appears to be about 10cm. high by 1m wide. Maybe the heat is carried away by the rats nest of cables? Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From rpnabar at gmail.com Wed Jul 28 10:42:52 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 28 Jul 2010 12:42:52 -0500 Subject: [Beowulf] Any recommendations for a small (~4 port) KVM-over-IP switch Message-ID: Are there any small (~4 port) KVM-over-IP switches out there? All the one I know have at least 16 ports or so. But I just need KVM over IP ability for my 3 head nodes so that's a lot of wasted ports (and money!). I did see some 2 / 4 / 6 port KVMs but those are all without the "over IP" facility. Just curious if anyone knows of a suitable product...... -- Rahul From crhea at mayo.edu Wed Jul 28 12:12:34 2010 From: crhea at mayo.edu (Cris Rhea) Date: Wed, 28 Jul 2010 14:12:34 -0500 Subject: [Beowulf] Re: Any recommendations for a small (~4 port)... In-Reply-To: <201007281900.o6SJ0DTS028058@bluewest.scyld.com> References: <201007281900.o6SJ0DTS028058@bluewest.scyld.com> Message-ID: <20100728191234.GA1821@kaizen.mayo.edu> > Subject: [Beowulf] Any recommendations for a small (~4 port) > KVM-over-IP switch > > Are there any small (~4 port) KVM-over-IP switches out there? All the > one I know have at least 16 ports or so. But I just need KVM over IP > ability for my 3 head nodes so that's a lot of wasted ports (and > money!). > > I did see some 2 / 4 / 6 port KVMs but those are all without the "over > IP" facility. Just curious if anyone knows of a suitable > product...... > > -- > Rahul You might look at Avocent's MPU104E (4 port) or MPU108E (8 port). -- Cristopher J. Rhea Mayo Clinic - Research Computing Facility 200 First St SW, Rochester, MN 55905 crhea at Mayo.EDU (507) 284-0587 From rpnabar at gmail.com Wed Jul 28 12:25:04 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 28 Jul 2010 14:25:04 -0500 Subject: [Beowulf] Any recommendations for a small (~4 port) KVM-over-IP switch In-Reply-To: References: Message-ID: On Wed, Jul 28, 2010 at 1:51 PM, Andrew Latham wrote: > I have found the Lantronix Spider product a great solution for N+1 > issues. ?It has dropped in price a great deal. ?The benefit of the > added Serial over LAN also helps out. ?Virtual CD / Floppy has become > standard across all new IP KVM solutions... Thanks Andrew! That sounds like exactly what I need. 
But at it's ~$300 / module price tag it means that it's probably just as well to buy a conventional 16 port KVM (approx. $800 *) and waste 12 ports. I guess the small port count KVM (over IP) market just doesn't exist. [*]=http://www.belkin.com/iwcatproductpage.process?product_id=291685 -- Rahul From rpnabar at gmail.com Wed Jul 28 14:28:34 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 28 Jul 2010 16:28:34 -0500 Subject: [Beowulf] Any recommendations for a small (~4 port) KVM-over-IP switch In-Reply-To: References: Message-ID: On Wed, Jul 28, 2010 at 3:02 PM, Andrew Latham wrote: > I keep a Lantronix Spider in my laptop bag. ?I paid $550 USD when they > were new. ?I would say that at ~$270 @ Provantage that they are great > deals. ?They are also zero units (They come with a mounting kit that > can attach to most any cable management arm or rack rail). > > The Spider is not the solution for everything but when you need just > one more node it is an option. > > I was reminded by a peer to mention that the Spider can authenticate > against Radius, LDAP, and Microsoft Active Directory etc... Thanks Andrew for the additional tips! Yup, I'm impressed with it too and might buy one for my laptop bag. Especially the fact that it totally does away with the intermediate KVM-switch that's in most other alternatives is pretty nice. I'm curious, can this actually replace, say a crash cart? Could one have one of these handy and then whenever a compute-node goes bust in a faraway hot-aisle, just plug the Lantronix into a spare eth port and then go back and play with the node from a central console? [I guess the issue is does it have to be preplugged or can it be plugged into the monitor port AFTER a crash] Or will there be situations where a crash cart is still needed? -- Rahul From john.hearns at mclaren.com Fri Jul 30 05:00:01 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Fri, 30 Jul 2010 13:00:01 +0100 Subject: [Beowulf] Scale modl Cray-1 Message-ID: <68A57CCFD4005646957BD2D18E60667B1154A77D@milexchmb1.mil.tagmclarengroup.com> Enjoy. http://www.theregister.co.uk/2010/07/29/cray_1_replica/ The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From gus at ldeo.columbia.edu Fri Jul 30 08:15:55 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 30 Jul 2010 11:15:55 -0400 Subject: [Beowulf] Scale modl Cray-1 In-Reply-To: <68A57CCFD4005646957BD2D18E60667B1154A77D@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B1154A77D@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4C52ECAB.1020502@ldeo.columbia.edu> Hearns, John wrote: > Enjoy. http://www.theregister.co.uk/2010/07/29/cray_1_replica/ > > The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. 
> > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf In this age of virtualization, I was wondering if there are simulators in software (say, for Linux) of famous old computers: PDP-11, VAX, Cray-1, IBM 1130, IBM/360, CDC 6600, even the ENIAC perhaps. From instruction set, to OS, to applications. Any references? Thanks, Gus Correa From douglas.guptill at dal.ca Fri Jul 30 09:02:44 2010 From: douglas.guptill at dal.ca (Douglas Guptill) Date: Fri, 30 Jul 2010 13:02:44 -0300 Subject: [Beowulf] Scale modl Cray-1 In-Reply-To: <4C52ECAB.1020502@ldeo.columbia.edu> References: <68A57CCFD4005646957BD2D18E60667B1154A77D@milexchmb1.mil.tagmclarengroup.com> <4C52ECAB.1020502@ldeo.columbia.edu> Message-ID: <20100730160244.GA18848@sopalepc> On Fri, Jul 30, 2010 at 11:15:55AM -0400, Gus Correa wrote: > Hearns, John wrote: >> Enjoy. http://www.theregister.co.uk/2010/07/29/cray_1_replica/ >> > In this age of virtualization, > I was wondering if there are simulators in software (say, for Linux) > of famous old computers: PDP-11, VAX, Cray-1, IBM 1130, IBM/360, > CDC 6600, even the ENIAC perhaps. > From instruction set, to OS, to applications. > > Any references? I once (early 1990s) wrote an emulator for something like an H316 (Honeywell 16-bit mini-computer). In Lisp. It was a graduate course project. I expect that I still have the source, on floppies...I wonder if they are still readable? Cheers, Douglas. -- Douglas Guptill voice: 902-461-9749 Research Assistant, LSC 4640 email: douglas.guptill at dal.ca Oceanography Department fax: 902-494-3877 Dalhousie University Halifax, NS, B3H 4J1, Canada From james.p.lux at jpl.nasa.gov Fri Jul 30 09:09:15 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 30 Jul 2010 09:09:15 -0700 Subject: [Beowulf] Scale modl Cray-1 In-Reply-To: <4C52ECAB.1020502@ldeo.columbia.edu> Message-ID: Certainly, there are simulators for the 1130, PDP-11, PDP-8... They run in simulation faster on a PC than on the original machine.. But you don't have the thwap of the cards in the reader, or the whine of the chain in the printer, or need to swap disks between passes for the Fortran compiler. On 7/30/10 8:15 AM, "Gus Correa" wrote: Hearns, John wrote: > Enjoy. http://www.theregister.co.uk/2010/07/29/cray_1_replica/ > > The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf In this age of virtualization, I was wondering if there are simulators in software (say, for Linux) of famous old computers: PDP-11, VAX, Cray-1, IBM 1130, IBM/360, CDC 6600, even the ENIAC perhaps. From instruction set, to OS, to applications. Any references? 
Thanks, Gus Correa _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From dnlombar at ichips.intel.com Fri Jul 30 12:38:19 2010 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Fri, 30 Jul 2010 12:38:19 -0700 Subject: [Beowulf] Scale modl Cray-1 In-Reply-To: <4C52ECAB.1020502@ldeo.columbia.edu> References: <68A57CCFD4005646957BD2D18E60667B1154A77D@milexchmb1.mil.tagmclarengroup.com> <4C52ECAB.1020502@ldeo.columbia.edu> Message-ID: <20100730193819.GA19935@nlxcldnl2.cl.intel.com> On Fri, Jul 30, 2010 at 08:15:55AM -0700, Gus Correa wrote: > Hearns, John wrote: > > Enjoy. http://www.theregister.co.uk/2010/07/29/cray_1_replica/ > > > In this age of virtualization, > I was wondering if there are simulators in software (say, for Linux) > of famous old computers: PDP-11, VAX, Cray-1, IBM 1130, IBM/360, > CDC 6600, even the ENIAC perhaps. > From instruction set, to OS, to applications. http://www.ibm1130.org You can run DMS R2 V12 along with ASM, IBM FORTRAN (not EMU FORTRAN), APL, RPG... I play with this at home... It's a hoot! Check out http://ibm1130.org/sim/other where various other sims are listed, all but CDC and Cray from your list above. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own.
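For several of the other machines on that list, the SIMH collection (which, as far as I know, the IBM 1130 simulator above is built on) covers the PDP-11, VAX, PDP-8 and many other historic systems. A very rough sketch of a session follows; the disk image name is hypothetical, and the exact devices and OS media depend on which machine and operating system you want to bring back.

# Rough sketch of a SIMH session, using the PDP-11 simulator as an example.
# The simulator binaries are named after the machine (pdp11, vax, pdp8, ...);
# the disk image below is hypothetical -- you supply your own OS media.
pdp11
# then, at the "sim>" prompt, something along these lines:
#   sim> attach rk0 rt11-system.dsk    (hypothetical RK05 image with an OS on it)
#   sim> boot rk0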