From landman at scalableinformatics.com  Sun Nov  1 19:25:46 2009
From: landman at scalableinformatics.com (Joe Landman)
Date: Sun, 01 Nov 2009 22:25:46 -0500
Subject: [Beowulf] Storage recommendations?
In-Reply-To: <1256916391.6856.225.camel@moelwyn.maths.qmul.ac.uk>
References: <1256916391.6856.225.camel@moelwyn.maths.qmul.ac.uk>
Message-ID: <4AEE513A.6050504@scalableinformatics.com>

Robert Horton wrote:
> Hi,
> 
> I'm looking for some recommendations for a new "scratch" file server for
> our cluster. Rough requirements are:
> 
> - Around 20TB of storage
> - Good performance with multiple nfs writes (it's quite a mixed workload
> so hard to characterise further)
> - Data security not massively important as it's just for scratch /
> temporary data.
> 
> It'll just be a single server serving nfs, I'm not looking to go down
> the luster / pvfs route.


a) what network fabric (IB, 10GbE, GbE, ...)

b) roughly how many simultaneous writers ... large block streaming, or 
small block random?  How much sustained IO (MB/s) do you need to support 
your worker machines?

c) looking to build it your self or buy units that work?


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


From robh at dongle.org.uk  Mon Nov  2 06:03:35 2009
From: robh at dongle.org.uk (Robert Horton)
Date: Mon, 02 Nov 2009 14:03:35 +0000
Subject: [Beowulf] Storage recommendations?
In-Reply-To: <4AEE513A.6050504@scalableinformatics.com>
References: <1256916391.6856.225.camel@moelwyn.maths.qmul.ac.uk>
	<4AEE513A.6050504@scalableinformatics.com>
Message-ID: <1257170615.6802.51.camel@moelwyn.maths.qmul.ac.uk>

On Sun, 2009-11-01 at 22:25 -0500, Joe Landman wrote:
> Robert Horton wrote:
> > Hi,
> > 
> > I'm looking for some recommendations for a new "scratch" file server for
> > our cluster. Rough requirements are:
> > 
> > - Around 20TB of storage
> > - Good performance with multiple nfs writes (it's quite a mixed workload
> > so hard to characterise further)
> > - Data security not massively important as it's just for scratch /
> > temporary data.
> > 
> > It'll just be a single server serving nfs, I'm not looking to go down
> > the luster / pvfs route.
> 
> 
> a) what network fabric (IB, 10GbE, GbE, ...)

Currently using the GigE network for the NFS traffic. There is a DDR
Infiniband network in place for the MPI traffic which could potentially
be used for storage, however my impression is that the current
bottleneck is the disk io (or iops) rather than network, so I'm not sure
that this would be worth the mucking about.

> b) roughly how many simultaneous writers ... large block streaming, or 
> small block random?  How much sustained IO (MB/s) do you need to support 
> your worker machines?

The jobs typically use 32 nodes with 8 processes on each node, so I
guess ~250. Averaged over 5 minutes, maximum IO is around 15MB/s read
and 0.5MB/s for write, however I've seen short periods of around 30MB/s
write.

The problem, basically, is that under certain heavy IO conditions which
I haven't managed to reliably reproduce, the whole machine becomes very
slow, in some cases making interactive work pretty much impossible.
Hence the desire to move the storage off the headnode.

> 
> c) looking to build it your self or buy units that work?
> 

Preferable the second option, although I've not completely ruled out
building something.


From prentice at ias.edu  Tue Nov  3 09:09:07 2009
From: prentice at ias.edu (Prentice Bisbal)
Date: Tue, 03 Nov 2009 12:09:07 -0500
Subject: [Beowulf] Fortran Array size question
Message-ID: <4AF063B3.3050508@ias.edu>

This question is a bit off-topic, but since it involves Fortran minutia,
I figured this would be the best place to ask. This code may eventually
run on my cluster, so it's not completely off topic!

Question: What is the maximum number of elements you can have in a
double-precision array in Fortran? I have someone creating a
4-dimensional double-precision array. When they increase the dimenions
of the array to ~200 million elements, they get this error:

compilation aborted (code 1).

I'm sure they're hitting a Fortran limit, but I need to prove it. I
haven't been able to find anything using The Google.

-- 
Prentice


From eugen at leitl.org  Tue Nov  3 09:37:17 2009
From: eugen at leitl.org (Eugen Leitl)
Date: Tue, 3 Nov 2009 18:37:17 +0100
Subject: [Beowulf] A look at the 100-core Tilera Gx
Message-ID: <20091103173717.GT17686@leitl.org>


http://www.semiaccurate.com/2009/10/29/look-100-core-tilera-gx/

A look at the 100-core Tilera Gx

It's all about the network(s)

by Charlie Demerjian

October 29, 2009

TILERA IS CLAIMING to have the first commercial CPU to reach 100 cores, and
while this is true, the real interesting technology is in the interconnects.
The overall chip is quite a marvel, and it is unlike any mainstream CPU you
have ever heard of.

Making a lot of cores on a chip isn't very hard. Larrabee for example has 32
Pentium (P54) cores, heavily modified, as the basis of the GPU. If Intel
wanted to, it could put hundreds of cores on a die, that part is actually
quite easy. Keeping those cores fed is the most important problem of modern
chipmaking, and that part is not easy.

Large caches, wide memory busses, ring busses on chip, stacking, and optical
interfaces all are attempts to feed the beast. Everyone thought Intel's
Polaris, also known as the 80 core, 1 TeraFLOPS part from a few years ago,
was about packing cores onto a die. It wasn't, it was a test of routing
algorithms and structures. Routing is where the action is now, packing cores
in is not a big deal.

Routing is where Tilera shines. It has put a great deal of thought into
getting data from core to core with minimal latency and problems. Its rather
unique approach involves five different interconnect networks, programmable
partitioning, accelerators, and simply tons of I/O. Together, these allow
Tilera's third generation Tile-Gx CPUs to scale from 16 to 100 cores without
choking on congestion. They may not have the same single-threaded performance
of a Nehalem or Shanghai core, but they make up for it with volume.

100 core diagram

Tilera 100 core chip

The basic structure is a square array of small cores, 4x4, 6x6, 8x8 or 10x10,
each connected via five (5) on-chip networks, and flanked by some very
interesting accelerators. The cores themselves are a proprietary 32-bit ISA
in the first two generations of Tilera chips, and in the Gx, it is extended
to 64-bit. There are 75 new instructions in the Gx, 20 of which are SIMD, and
the memory controller now sees 64 bits as well.

In previous generations, there was no floating-point (FP) hardware in Tilera
products. The company strongly recommended against using FP code because it
had to be emulated taking hundreds or thousands of cycles. With the new Gx
series chips, FP code is still frowned upon, but there is some FP hardware to
catch the odd instruction without a huge speed hit. The 100 core part can do
50 GigaFLOPS of FP which may sound like a large number, but that is only
about 1/50th of what an ATI Cypress HD5870 chip can do.

The majority of the new instructions are aimed at what the Tilera chips do
best, integer calculations. Things like shuffle and DSP-like
multiply-and-accumulate (MAC) functions, including a quad MAC unit, are where
these new chips shine. Basically, the Gx moves information around very
quickly while twiddling bits here and there with integer functions.

While the cores might not be overly complex, the on-chip busses are. Each Gx
core has 64K of L1 cache, 32K data and 32K instruction, along with a unified
8-way 256KB L2 cache. The cache is totally non-blocking, completely coherent,
and the cache subsystem can reorder requests to other caches or DRAM. On top
of this, the core supports cache pinning to keep often used data or
instructions in cache. On the 100 core model, the Gx has 32MB of cache.

Tiles are the name Tilera uses for for a basic unit of repetition. The 16
core Gx has 16 tiles, the 64 core Gx has 64, etc. A tile consists of a core,
the L1 and L2 caches, and something Tilera calls the Terabit Switch. More
than anything, this switch is the heart of the chip.

Tile diagram

A Tilera tile

Remember when we said that cramming 100 cores on a die is not a big problem,
but feeding them is? The Terabit Switch is how Tilera solves the problem, and
it is a rather unique solution. Instead of one off-core bus, there are five.
Each of them has a dedicated purpose, and that not only gives huge bandwidth,
it also goes a fair way towards minimizing contention. Cache traffic will
never be stepped on by user data, and so on.

The five networks are called QDN, RDN, FDN, IDN and UDN. In the last two
generations of Tilera chips, all of these networks were 32 bits wide, but on
the Gx, the widths vary to give each one more or less bandwidth depending on
their functions.

QDN is called the reQuest Dynamic Network, and it is used for memory and
cache. QDN is 64 bits wide. RDN is Response Dynamic Network, and it is used
to feed memory reads back to the chips. RDN is 112 bits wide, an odd number,
64 + 48 from the look of it.

FDN is the widest at 128 bits, and it is used for cache to cache transfers
and cache coherency. Given the critical nature of cache transactions like
this, the width is no surprise. The last two IDN and UDN are both 32 bits
wide. IDN is I/O Dunamic Network, and passes data on and off the chip. With a
dedicated channel for off-chip transfers, you can see that reaching
theoretical numbers was a priority at Tilera.

The last network UDN is for User Dynamic Network, basically the one users get
to send stuff around on. QDN, RDN, FDN and IDN are basically housekeeping,
they work in the background. If you want to send things from point A to point
B, you send it across the UDN.

Although Tilera didn't explicitly state it, each hop from router to router
takes one cycle. This means that in a pathological case, corner core to
memory on the far corner, it could take 19 cycles to go from request to
memory, plus the memory round trip time, and then another 19 cycles to get
back. That is what you call a long time in computer speak. Even in an
'average' case, you have a 10 cycle latency, which is very long as well.

To be fair, the Tilera architecture is not made to run general purpose code.
As it was described when the first generation came out, workloads are meant
to be chunked up, so a single tile does a function, then the data gets passed
to the next tile for more work, and so on and so forth. If your program has
20 steps, you use 20 tiles and pipeline the work.

This solves many of the problems with variable latency and multi-hop traffic.
The other more elegant solution is the ability to section off chunks of the
chip into sub-units. There is a hypervisor that can partition each Gx chip
into programmable blocks.

Chunking tiles

Sub-sections of tiles

As you can see in the diagram above, each Gx is broken up into sub-chips in
software. You can give each process as much CPU power as it needs, and
arrange it so the output of one block feeds into the input of the next in a
single clock. This example has two Apache web server instances, an intrusion
prevention system (IPS), a secure sockets layer (SSL) stack, a network stack
and a few other processes running next to each other.

The Apache instances have their own memory controller, as do the IPS and the
SSL stack. The network stack is sitting on top of the memory controller for
decreased latency. Basically, the programmer can choose where to put each
process to minimize latency. It doesn't take much to figure out how to apply
these concepts to a database plus web server scenario, or a three-tiered
SAP-like workload.

Basically, Tilera allows you to explicitly place the data and compute
resources where, when and how you need them. The chunks are done at roughly
the same level as hardware VMs are in x86 CPUs, running below the level that
a process can affect. This creates hardware walls to segregate data
transfers, cache coherency traffic, and other tile to tile transfers. If done
correctly, it can minimize latency a lot in addition to keeping processes
from stepping on each other.

Now that you know how the cores work, talk, and are partitioned, what about
the 'uncore'? Talk about that starts with the memory controllers - four
DDR3-2133MHz banks on the 64 and 100 core Gx, two on the 16 and 36 core
models. For the keen eyed out there, this means Tilera has two different
socket configurations, one for the 64 and 100 core chips, and another one for
the 16 and 36 core chips.

DDR3-2133MHz memory is very fast, hugely fast in fact. The math says 17GBps
per contr. Basically, this chip has a lot of available bandwidth. As you
might imagine, on the 16 and 36 core variants, there are only half the
controllers, so half the bandwidth.

In addition, you have a generic controller for USB, UARTs, JTAG and I2C
controllers. Given that Tilera chips are basically embedded, these are not
likely to be used for much more than booting and diagnostics.

On the core diagram above, there are two other blocks, the orange MiCA and
mPIPE accelerators. These are where the other parts of the Tilera Gx 'magic'
happen. MiCA stands for Multistream iMesh Crypto Accelerator, while mPIPE is
short for multicore Programmable Intelligent Packet Engine. If it isn't
blindingly obvious, the MiCA does the crypto and the mPIPE speeds up I/O.

The mPIPE does a lot of interesting things, all supposedly at wire speed. It
has a programmable packet classification engine, said to be usable at 80Gbps
or 120M packets per second. It can twiddle headers and do other evil things
that would make Comcast drool with the potential for 'network management'
extortion payements.

In addition, it can also load balance across the various I/O lanes, and
redirect tile to tile 'I/O' in a somewhat intelligent fashion. On top of
that, the mPIPE manages buffer sizes, queues, and other housekeeping to keep
latencies low. Think of it as a programmable housekeeping offload engine.

The most interesting bit is that the mPIPE can tag a packet with a 32 bit
header before it sends it onto the internal network. This is where the
programmable part shines. You can set up fields in the I/O packet itself to
pass along pre-decode information and other time-saving tidbits. Since I/O is
fully virtualizable, you could theoretically tag the packets with VM data, or
just about anything else a bored programmer can think of.

The MiCA engines, two on the 64/100 core, one on 16/36 cores, are crypto
offload engines. They can work either 'inline' or as ull blown offload
engines, that is up to the programmer. The MiCA can pull data directly from
caches or main memory without CPU overhead, basically fire and forget.

If you like acronyms, the MiCA on the Gx can support AES, 3DES, ARC4, Kasumi
and Snow for crypto, SHA-1, SHA-2, MD5, HMAC and AES-GMAC for hashes, RSA,
DSA, Diffie-Hellman, and Elliptic Curve for public key work, and it has a
true random number generator (RNG). WTF, LOL, ROFL and other netspeak can be
encrypted along with any other text that uses correct grammar. RLY.

Tilera claims that the MiCA engine can do wire speed 40Gbps crypto with full
duplex on the 100 core Gx, and 1024b key RSA at 50K keys per second on the
100 core, 20K keys per second for the 36 core. Not bad at all. In addition,
the MiCA supports a hardware compression engine that uses the tried and true
Deflate algorithm.

The last piece of the puzzle is something that Tilera calls external
acceleration interfaces. This could be as simple as plugging in a PCIe card,
but that lacks elegance. The interesting part is a field programmable gate
array (FPGA) interface. You can take up to 8 lanes of PCIe and connect the
FPGA to the serial deserial unit (SerDes) to enable basically direct and low
latency 32Gbps transfers. Direct transfers to cache and multiple contexts are
supported, meaning you can do quite a bit with an FPGA and a Tilera-Gx chip.

In the end, you have a monster chip for I/O and packet processing. It doesn't
do single-threaded applications all that fast, but it really isn't meant to.
The chip itself is not out yet, nor is there even silicon yet. The first
version out will be the 36 core Gx in Q4 of 2010, followed by the 16 core
later in Q4 or possibly Q1 of 2011. These both share the same socket
configuration and a 35*35mm package.

In Q1 of 2011, the 100 core chip will come out on a new socket and in a
45*45mm package. A bit after that, the 64 core will hit the market. Power
ranges from 10W for the 16 core to 55W for the 100 core, but you can get
power optimized variants that will only suck 35W. Given the programmability
of the parts, power use is likely more dependent on the programs running on
it.

The last bit of information is clock speeds. The 64 and 100 core models will
come in versions that run at 1.25GHz and 1.5GHz, not bad considering how much
there is to synchronize and keep going. The 36 core models will come in
1.0GHz, 1.25GHz and 1.5GHz versions, and the 16 core models will only come in
1.0GHz or 1.25GHz versions. Given the core count, internal interconnections,
memory and I/O capabilities, Tilera will pack a lot of power into these small
packages.S|A


From kus at free.net  Tue Nov  3 09:41:52 2009
From: kus at free.net (Mikhail Kuzminsky)
Date: Tue, 03 Nov 2009 20:41:52 +0300
Subject: [Beowulf] Fortran Array size question
In-Reply-To: <4AF063B3.3050508@ias.edu>
Message-ID: <web-2752104@free.net>

In message from Prentice Bisbal <prentice at ias.edu> (Tue, 03 Nov 2009 
12:09:07 -0500):
>This question is a bit off-topic, but since it involves Fortran 
>minutia,
>I figured this would be the best place to ask. This code may 
>eventually
>run on my cluster, so it's not completely off topic!
>
>Question: What is the maximum number of elements you can have in a
>double-precision array in Fortran? I have someone creating a
>4-dimensional double-precision array. When they increase the 
>dimenions
>of the array to ~200 million elements, they get this error:
>
>compilation aborted (code 1).
>
>I'm sure they're hitting a Fortran limit, but I need to prove it. I
>haven't been able to find anything using The Google.

It is not Fortran restriction. It may be some compiler restriction. 
64-bit ifort for EM64t allow you to use, for example, 400 millions
elements.

Mikhail Kuzminsky
Computer Assistance to Chemical Research Center
Zelinsky Institute of Organic Chemistry RAS
Moscow 

>
>-- 
>Prentice
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin 
>Computing
>To change your subscription (digest mode or unsubscribe) visit 
>http://www.beowulf.org/mailman/listinfo/beowulf
>
>-- 
>??? ????????? ???? ????????? ?? ??????? ? ??? ???????
>? ????? ???????? ??????????? ???????????
>MailScanner, ? ?? ????????
>??? ??? ?? ???????? ???????????? ????.
>


From prentice at ias.edu  Tue Nov  3 10:17:02 2009
From: prentice at ias.edu (Prentice Bisbal)
Date: Tue, 03 Nov 2009 13:17:02 -0500
Subject: [Beowulf] Fortran Array size question
In-Reply-To: <web-2752104@free.net>
References: <web-2752104@free.net>
Message-ID: <4AF0739E.8030700@ias.edu>

Mikhail Kuzminsky wrote:
> In message from Prentice Bisbal <prentice at ias.edu> (Tue, 03 Nov 2009
> 12:09:07 -0500):
>> This question is a bit off-topic, but since it involves Fortran minutia,
>> I figured this would be the best place to ask. This code may eventually
>> run on my cluster, so it's not completely off topic!
>>
>> Question: What is the maximum number of elements you can have in a
>> double-precision array in Fortran? I have someone creating a
>> 4-dimensional double-precision array. When they increase the dimenions
>> of the array to ~200 million elements, they get this error:
>>
>> compilation aborted (code 1).
>>
>> I'm sure they're hitting a Fortran limit, but I need to prove it. I
>> haven't been able to find anything using The Google.
> 
> It is not Fortran restriction. It may be some compiler restriction.
> 64-bit ifort for EM64t allow you to use, for example, 400 millions
> elements.
> 

That's exactly the compiler I'm using, and it's failing at ~200 million
elements. I'm  digging through the Intel documentation. Haven't found an
answer yet.

--
Prentice


From prentice at ias.edu  Tue Nov  3 10:25:33 2009
From: prentice at ias.edu (Prentice Bisbal)
Date: Tue, 03 Nov 2009 13:25:33 -0500
Subject: [Beowulf] A look at the 100-core Tilera Gx
In-Reply-To: <20091103173717.GT17686@leitl.org>
References: <20091103173717.GT17686@leitl.org>
Message-ID: <4AF0759D.9070503@ias.edu>


Eugen Leitl wrote:
> http://www.semiaccurate.com/2009/10/29/look-100-core-tilera-gx/
> 
> A look at the 100-core Tilera Gx
> 
> It's all about the network(s)
> 
> by Charlie Demerjian
> 
> October 29, 2009

> In previous generations, there was no floating-point (FP) hardware in Tilera
> products. The company strongly recommended against using FP code because it
> had to be emulated taking hundreds or thousands of cycles. With the new Gx
> series chips, FP code is still frowned upon, but there is some FP hardware to
> catch the odd instruction without a huge speed hit. The 100 core part can do
> 50 GigaFLOPS of FP which may sound like a large number, but that is only
> about 1/50th of what an ATI Cypress HD5870 chip can do.

I imagine this short-coming will limit the Tilera Gx's value to most of
HPC community. This doesn't even mention DP performance.

--
Prentice


From lindahl at pbm.com  Tue Nov  3 10:39:27 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Tue, 3 Nov 2009 10:39:27 -0800
Subject: [Beowulf] Fortran Array size question
In-Reply-To: <4AF0739E.8030700@ias.edu>
References: <web-2752104@free.net> <4AF0739E.8030700@ias.edu>
Message-ID: <20091103183927.GC16399@bx9.net>

On Tue, Nov 03, 2009 at 01:17:02PM -0500, Prentice Bisbal wrote:

> That's exactly the compiler I'm using, and it's failing at ~200 million
> elements. I'm  digging through the Intel documentation. Haven't found an
> answer yet.

Your bug report was incomplete: it really matters if the array is
automatic or not, or if it's initialized.

-- greg


From prentice at ias.edu  Tue Nov  3 11:24:00 2009
From: prentice at ias.edu (Prentice Bisbal)
Date: Tue, 03 Nov 2009 14:24:00 -0500
Subject: [Beowulf] Fortran Array size question
In-Reply-To: <20091103183927.GC16399@bx9.net>
References: <web-2752104@free.net> <4AF0739E.8030700@ias.edu>
	<20091103183927.GC16399@bx9.net>
Message-ID: <4AF08350.8060107@ias.edu>

Greg Lindahl wrote:
> On Tue, Nov 03, 2009 at 01:17:02PM -0500, Prentice Bisbal wrote:
> 
>> That's exactly the compiler I'm using, and it's failing at ~200 million
>> elements. I'm  digging through the Intel documentation. Haven't found an
>> answer yet.
> 
> Your bug report was incomplete: it really matters if the array is
> automatic or not, or if it's initialized.
> 
You're right -  I should have included a code snippet. It's not my code,
so I don't know if I can share all of it. Here's the line where the
problem occurs:

dimension vstore(1:4,0:4,5000000,2),fstore(0:4,5000000,2)

If he reduces the 5000000 to a smaller number, it compiles. As shown, he
gets this error:

ifort adaptnew2.for
...
...
compilation aborted for adaptnew2.for (code 1)

The compiler is Intel's ifort 11.0.074

I'm not a Fortran programmer, so I'm a little out of my element here. If
it was bash or perl, or even C/C++, that'd be a different story.

--
Pretnice


From richard.walsh at comcast.net  Tue Nov  3 12:01:32 2009
From: richard.walsh at comcast.net (richard.walsh at comcast.net)
Date: Tue, 3 Nov 2009 20:01:32 +0000 (UTC)
Subject: [Beowulf] Fortran Array size question
In-Reply-To: <1391956104.3640921257278279065.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>
Message-ID: <1225611180.3642631257278492593.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>


>----- Original Message ----- 
>From: "Prentice Bisbal" <prentice at ias.edu> 
>To: "Beowulf Mailing List" <beowulf at beowulf.org> 
>Sent: Tuesday, November 3, 2009 1:24:00 PM GMT -06:00 US/Canada Central 
>Subject: Re: [Beowulf] Fortran Array size question 
> 
>Greg Lindahl wrote: 
>> On Tue, Nov 03, 2009 at 01:17:02PM -0500, Prentice Bisbal wrote: 
>> 
>>> That's exactly the compiler I'm using, and it's failing at ~200 million 
>>> elements. I'm digging through the Intel documentation. Haven't found an 
>>> answer yet. 
>> 
>> Your bug report was incomplete: it really matters if the array is 
>> automatic or not, or if it's initialized. 
>> 
>You're right - I should have included a code snippet. It's not my code, 
>so I don't know if I can share all of it. Here's the line where the 
>problem occurs: 
> 
>dimension vstore(1:4,0:4,5000000,2),fstore(0:4,5000000,2) 
> 
>If he reduces the 5000000 to a smaller number, it compiles. As shown, he 
>gets this error: 
> 
>ifort adaptnew2.for 
>... 
>... 
>compilation aborted for adaptnew2.for (code 1) 


Prentice, 


I do not think the Fortran standard limits the size of one dimension 
in an array, although you can have only 7 dimensions. This to me must 
be a limit internal to their compiler. There may be an environmental variable 
to reset. I would try another Fortran (maybe gfortran) to see if you get similar 
behavior or find another (different) limit. 


Limits should really be operating system imposed based on the 
size of the address space. Intel says as much on the website 
describing their compiler. 


Regards, 


rbw 
Thrashing River Computing 
_______________________________________________ 
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing 
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091103/30c7dcbb/attachment.html>

From mathog at caltech.edu  Tue Nov  3 12:05:05 2009
From: mathog at caltech.edu (David Mathog)
Date: Tue, 03 Nov 2009 12:05:05 -0800
Subject: [Beowulf] Re:Fortran Array size question 
Message-ID: <E1N5Pcn-0007Q0-MU@mendel.bio.caltech.edu>

Prentice Bisbal <prentice at ias.edu> wrote:

> Question: What is the maximum number of elements you can have in a
> double-precision array in Fortran? I have someone creating a
> 4-dimensional double-precision array. When they increase the dimenions
> of the array to ~200 million elements, they get this error:
> 
> compilation aborted (code 1).

The two things that come immediately to mind are:

1.  The compiler ran out of memory.  (In addition to the size of the
memory in the machine, check ulimit.)

2.  The compiler is trying to build the program with 32 bit pointers and
it cannot address this array, or perhaps all memory accessed, with a
pointer of that size.  If that is the issue using 64 bit pointers should
solve the problem, but I can't tell you what compiler switches are
needed to do this.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From atp at piskorski.com  Tue Nov  3 12:27:53 2009
From: atp at piskorski.com (Andrew Piskorski)
Date: Tue, 3 Nov 2009 15:27:53 -0500
Subject: [Beowulf] A look at the 100-core Tilera Gx
In-Reply-To: <20091103173717.GT17686@leitl.org>
References: <20091103173717.GT17686@leitl.org>
Message-ID: <20091103202753.GA32198@piskorski.com>

On Tue, Nov 03, 2009 at 06:37:17PM +0100, Eugen Leitl wrote:
> 
> http://www.semiaccurate.com/2009/10/29/look-100-core-tilera-gx/
> A look at the 100-core Tilera Gx
> It's all about the network(s)
> by Charlie Demerjian

It's ironic that while Tilera's own website points out their heritage
from MIT's RAW project, these external magazine articles generally
don't even mention it.  For actual understanding of the technology
it'd probably be more useful to point to and briefly summarize the
extensive and well-written MIT research papers, and then explain what
the company has actually changed since the academic work, and since
Tilera was last in the news with product announcements two years ago
(c. Oct. 2007).

Btw, is anyone commercializing the (related technology) TRIPS
Polymorphic Processor (EDGE architecture) work from the University of
Texas?  It sounded even more interesting and useful than RAW, but (not
being a chip guy at all myself) I had no idea whether that was just
hot air or not.

  http://groups.csail.mit.edu/cag/raw/
  http://www.cs.utexas.edu/~trips/
  http://www.beowulf.org/archive/2007-February/017414.html
  http://www.beowulf.org/pipermail/beowulf/2007-October/019617.html
  http://www.beowulf.org/pipermail/beowulf/2007-October/019621.html
  http://www.beowulf.org/pipermail/beowulf/2007-October/019677.html

-- 
Andrew Piskorski <atp at piskorski.com>
http://www.piskorski.com/


From atp at piskorski.com  Tue Nov  3 12:34:41 2009
From: atp at piskorski.com (Andrew Piskorski)
Date: Tue, 3 Nov 2009 15:34:41 -0500
Subject: [Beowulf] A look at the 100-core Tilera Gx
In-Reply-To: <4AF0759D.9070503@ias.edu>
References: <4AF0759D.9070503@ias.edu>
Message-ID: <20091103203441.GB32198@piskorski.com>

On Tue, Nov 03, 2009 at 01:25:33PM -0500, Prentice Bisbal wrote:

>> With the new Gx series chips, FP code is still frowned upon, but
>> there is some FP hardware to catch the odd instruction without a
>> huge speed hit.

> I imagine this short-coming will limit the Tilera Gx's value to most of
> HPC community. This doesn't even mention DP performance.

I don't remember anything from the MIT RAW papers suggesting that the
technology can't handle floating point, so I assume their integer-only
focus was a business decision.  If their business is successful, I
imagine they'll offer a product intended for floating-point work some
years down the road (if they're still around by then, of course).

-- 
Andrew Piskorski <atp at piskorski.com>
http://www.piskorski.com/


From gus at ldeo.columbia.edu  Tue Nov  3 12:38:36 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Tue, 03 Nov 2009 15:38:36 -0500
Subject: [Beowulf] Re:Fortran Array size question
In-Reply-To: <E1N5Pcn-0007Q0-MU@mendel.bio.caltech.edu>
References: <E1N5Pcn-0007Q0-MU@mendel.bio.caltech.edu>
Message-ID: <4AF094CC.40805@ldeo.columbia.edu>

Hi Prentice, list

Intel Fortran (at least the 10. and 11.something versions I have)
has different "memory models" for compilation.  The default is "small".
The PGI compiler has a similar feature, IIRR.

Have you tried -mcmodel=medium or large?
I never used large, but medium helped a few times on x86_64/i64em.

Of course your available RAM may be restriction,
as David pointed out.

An excerpt from "man ifort" is enclosed below.

I hope this helps,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

 From "man ifort":


        -mcmodel= <mem_model> (i64em only; L*X only)
               Tells the compiler to use a specific memory  model  to 
generate
               code  and store data.  This option can affect code size 
and per-
               formance.

               You can specify one of the following values for <mem_model>:

               ? small

                 Restricts code and data to the first 2GB of address 
space. All
                 accesses of code and data can be done with Instruction 
Pointer
                 (IP)-relative addressing.  This is the default.

               ? medium

                 Restricts code to the first 2GB; it places no memory 
restric-
                 tion  on  data.  Accesses of code can be done with 
IP-relative
                 addressing, but accesses of data must be  done  with 
absolute
                 addressing.

               ? large

                 Places  no memory restriction on code or data. All 
accesses of
                 code and data must be done with absolute addressing.

               If your program has COMMON blocks and local data  with  a 
  total
               size  smaller  than  2GB, -mcmodel=small is sufficient. 
COMMONs
               larger  than  2GB  require  -mcmodel=medium  or 
-mcmodel=large.
               Allocation  of  memory larger than 2GB can be done with 
any set-
               ting of -mcmodel.

               IP-relative addressing requires only 32 bits,  whereas 
absolute
               addressing requires 64-bits.  IP-relative addressing is 
somewhat
               faster. So, the small memory model has the least impact 
on  per-
               formance.

               Note:  When the medium or large memory models are 
specified, you
               must also specify option -shared-intel to ensure that 
the  cor-
               rect  dynamic versions of the Intel run-time libraries 
are used.

               When shared objects (.so files) are built, 
position-independent
               code  (PIC)  is  specified so that a single .so file can 
support
               all three memory models. The compiler driver adds  option 
  -fpic
               to implement PIC.

               However,  you must specify a memory model for code that 
is to be
               placed in a static library or code that will  be  linked 
  stati-
               cally.


David Mathog wrote:
> Prentice Bisbal <prentice at ias.edu> wrote:
> 
>> Question: What is the maximum number of elements you can have in a
>> double-precision array in Fortran? I have someone creating a
>> 4-dimensional double-precision array. When they increase the dimenions
>> of the array to ~200 million elements, they get this error:
>>
>> compilation aborted (code 1).
> 
> The two things that come immediately to mind are:
> 
> 1.  The compiler ran out of memory.  (In addition to the size of the
> memory in the machine, check ulimit.)
> 
> 2.  The compiler is trying to build the program with 32 bit pointers and
> it cannot address this array, or perhaps all memory accessed, with a
> pointer of that size.  If that is the issue using 64 bit pointers should
> solve the problem, but I can't tell you what compiler switches are
> needed to do this.
> 
> Regards,
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


From h-bugge at online.no  Tue Nov  3 13:30:05 2009
From: h-bugge at online.no (=?ISO-8859-1?Q?H=E5kon_Bugge?=)
Date: Tue, 3 Nov 2009 22:30:05 +0100
Subject: [Beowulf] Fortran Array size question
In-Reply-To: <4AF08350.8060107@ias.edu>
References: <web-2752104@free.net> <4AF0739E.8030700@ias.edu>
	<20091103183927.GC16399@bx9.net> <4AF08350.8060107@ias.edu>
Message-ID: <F7B38195-A451-4C43-946A-7F24EA361BA8@online.no>

If it takes some time before it aborts, check the available size on  
your temp directory (usually /tmp). That might be the case if you're  
using IPO.


H?kon
On Nov 3, 2009, at 20:24 , Prentice Bisbal wrote:

> Greg Lindahl wrote:
>> On Tue, Nov 03, 2009 at 01:17:02PM -0500, Prentice Bisbal wrote:
>>
>>> That's exactly the compiler I'm using, and it's failing at ~200  
>>> million
>>> elements. I'm  digging through the Intel documentation. Haven't  
>>> found an
>>> answer yet.
>>
>> Your bug report was incomplete: it really matters if the array is
>> automatic or not, or if it's initialized.
>>
> You're right -  I should have included a code snippet. It's not my  
> code,
> so I don't know if I can share all of it. Here's the line where the
> problem occurs:
>
> dimension vstore(1:4,0:4,5000000,2),fstore(0:4,5000000,2)
>
> If he reduces the 5000000 to a smaller number, it compiles. As  
> shown, he
> gets this error:
>
> ifort adaptnew2.for
> ...
> ...
> compilation aborted for adaptnew2.for (code 1)
>
> The compiler is Intel's ifort 11.0.074
>
> I'm not a Fortran programmer, so I'm a little out of my element  
> here. If
> it was bash or perl, or even C/C++, that'd be a different story.
>
> --
> Pretnice
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin  
> Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
>

Mvh.,

H?kon Bugge
h-bugge at online.no
+47 924 84 514


From Michael.Frese at NumerEx-LLC.com  Tue Nov  3 17:02:02 2009
From: Michael.Frese at NumerEx-LLC.com (Michael H. Frese)
Date: Tue, 03 Nov 2009 18:02:02 -0700
Subject: [Beowulf] Fortran Array size question
In-Reply-To: <4AF08350.8060107@ias.edu>
References: <web-2752104@free.net> <4AF0739E.8030700@ias.edu>
	<20091103183927.GC16399@bx9.net> <4AF08350.8060107@ias.edu>
Message-ID: <6.2.5.6.2.20091103175916.0973d718@NumerEx-LLC.com>

I think Gus has it right.  Your two arrays are 200 million floating 
point words each, perhaps 8 bytes per word, and therefore are 
approximately about 3  gigabytes total.

That's too big for the default 'small' memory model.

Certainly, there are no limits to array sizes in Fortran.


Mike


At 12:24 PM 11/3/2009, Prentice Bisbal wrote:
>Greg Lindahl wrote:
> > On Tue, Nov 03, 2009 at 01:17:02PM -0500, Prentice Bisbal wrote:
> >
> >> That's exactly the compiler I'm using, and it's failing at ~200 million
> >> elements. I'm  digging through the Intel documentation. Haven't found an
> >> answer yet.
> >
> > Your bug report was incomplete: it really matters if the array is
> > automatic or not, or if it's initialized.
> >
>You're right -  I should have included a code snippet. It's not my code,
>so I don't know if I can share all of it. Here's the line where the
>problem occurs:
>
>dimension vstore(1:4,0:4,5000000,2),fstore(0:4,5000000,2)
>
>If he reduces the 5000000 to a smaller number, it compiles. As shown, he
>gets this error:
>
>ifort adaptnew2.for
>...
>...
>compilation aborted for adaptnew2.for (code 1)
>
>The compiler is Intel's ifort 11.0.074
>
>I'm not a Fortran programmer, so I'm a little out of my element here. If
>it was bash or perl, or even C/C++, that'd be a different story.
>
>--
>Pretnice
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>To change your subscription (digest mode or unsubscribe) visit 
>http://www.beowulf.org/mailman/listinfo/beowulf


From gerry.creager at tamu.edu  Wed Nov  4 06:03:28 2009
From: gerry.creager at tamu.edu (Gerry Creager)
Date: Wed, 04 Nov 2009 08:03:28 -0600
Subject: [Beowulf] A look at the 100-core Tilera Gx
In-Reply-To: <20091103173717.GT17686@leitl.org>
References: <20091103173717.GT17686@leitl.org>
Message-ID: <4AF189B0.2070200@tamu.edu>

I think it was the recent IEEE Spectrum, where they talk about using the 
Tilera 100-core chips for HPC, tuned to a specific problem using FPGA 
for optimizing the chips to the problem.  The argument is to use a 
lower-power system with huge numbers of cores and efficient on-chip 
switching, to replace the xommon x86(_64) architecture we've come to 
know and hate for its energy consumption and heat generation.

Personal take: conventional systems will win this battle (ask SiCortex: 
<sigh> a great idea overwhelmed by investors who couldn't see its 
longer-term benefits), but that we just might see changes to slower but 
more efficient cores.  Via Epia-10k comes to mind, as do the Atom and 
several other variants.  A little slower switching fabric (gigabit) with 
some changes to the core thinking of integration designers, will be 
required, but I think we could make that 20kcore Atom system using 
gigabit work pretty well compared to a 4k core Nehalem with QDR.

The big thing is reworking our thinking: It costs a LOT (we've said this 
all before) to create the power and cooling infrastructure for serious 
HPC, and I'll posit now that "serious" requires at least 4k x86_64 cores 
in today's logic.  If the cost of powering and cooling all this stuff is 
considered, it's a huge expense, but then, a lot of us are at academic 
institutions, and don't have to consider infrastructure... or didn't 
until recently.  Example: I have no place to expand our HPC, since we've 
maxed out power and cooling in the machine room we're currently in. 
And, in the only reasonable space I can build out to expand into, 
power's $90K and cooling another $100K to expand, allowing an additional 
20 racks.  Of x86_64 and QDR.  In fact, while I'll gain 20 racks of 
space, I'm not sure I can get 20 racks of cooling in place for that. 
I'm reasonably sure I can power the stuff for the $90K figure and even 
add sufficient generator to keep critical elements (cooling at a reduced 
level; HPC generally has no requirement for running during a power 
failure) to continue until a clean shutdown or power's restored.

I like what I've read on the Tilera.  I think it's got some potential, 
but I think it's time we consider taking our breed of HPC toward to 
Maker side of things, and begin hacking minimalist motherboards, 
adopting low-power devices, and generally reinvent the hardware stack as 
we knew it.

gerry

Eugen Leitl wrote:
> http://www.semiaccurate.com/2009/10/29/look-100-core-tilera-gx/
> 
> A look at the 100-core Tilera Gx
> 
> It's all about the network(s)
> 
> by Charlie Demerjian
> 
> October 29, 2009
> 
> TILERA IS CLAIMING to have the first commercial CPU to reach 100 cores, and
> while this is true, the real interesting technology is in the interconnects.
> The overall chip is quite a marvel, and it is unlike any mainstream CPU you
> have ever heard of.
> 
> Making a lot of cores on a chip isn't very hard. Larrabee for example has 32
> Pentium (P54) cores, heavily modified, as the basis of the GPU. If Intel
> wanted to, it could put hundreds of cores on a die, that part is actually
> quite easy. Keeping those cores fed is the most important problem of modern
> chipmaking, and that part is not easy.
> 
> Large caches, wide memory busses, ring busses on chip, stacking, and optical
> interfaces all are attempts to feed the beast. Everyone thought Intel's
> Polaris, also known as the 80 core, 1 TeraFLOPS part from a few years ago,
> was about packing cores onto a die. It wasn't, it was a test of routing
> algorithms and structures. Routing is where the action is now, packing cores
> in is not a big deal.
> 
> Routing is where Tilera shines. It has put a great deal of thought into
> getting data from core to core with minimal latency and problems. Its rather
> unique approach involves five different interconnect networks, programmable
> partitioning, accelerators, and simply tons of I/O. Together, these allow
> Tilera's third generation Tile-Gx CPUs to scale from 16 to 100 cores without
> choking on congestion. They may not have the same single-threaded performance
> of a Nehalem or Shanghai core, but they make up for it with volume.
> 
> 100 core diagram
> 
> Tilera 100 core chip
> 
> The basic structure is a square array of small cores, 4x4, 6x6, 8x8 or 10x10,
> each connected via five (5) on-chip networks, and flanked by some very
> interesting accelerators. The cores themselves are a proprietary 32-bit ISA
> in the first two generations of Tilera chips, and in the Gx, it is extended
> to 64-bit. There are 75 new instructions in the Gx, 20 of which are SIMD, and
> the memory controller now sees 64 bits as well.
> 
> In previous generations, there was no floating-point (FP) hardware in Tilera
> products. The company strongly recommended against using FP code because it
> had to be emulated taking hundreds or thousands of cycles. With the new Gx
> series chips, FP code is still frowned upon, but there is some FP hardware to
> catch the odd instruction without a huge speed hit. The 100 core part can do
> 50 GigaFLOPS of FP which may sound like a large number, but that is only
> about 1/50th of what an ATI Cypress HD5870 chip can do.
> 
> The majority of the new instructions are aimed at what the Tilera chips do
> best, integer calculations. Things like shuffle and DSP-like
> multiply-and-accumulate (MAC) functions, including a quad MAC unit, are where
> these new chips shine. Basically, the Gx moves information around very
> quickly while twiddling bits here and there with integer functions.
> 
> While the cores might not be overly complex, the on-chip busses are. Each Gx
> core has 64K of L1 cache, 32K data and 32K instruction, along with a unified
> 8-way 256KB L2 cache. The cache is totally non-blocking, completely coherent,
> and the cache subsystem can reorder requests to other caches or DRAM. On top
> of this, the core supports cache pinning to keep often used data or
> instructions in cache. On the 100 core model, the Gx has 32MB of cache.
> 
> Tiles are the name Tilera uses for for a basic unit of repetition. The 16
> core Gx has 16 tiles, the 64 core Gx has 64, etc. A tile consists of a core,
> the L1 and L2 caches, and something Tilera calls the Terabit Switch. More
> than anything, this switch is the heart of the chip.
> 
> Tile diagram
> 
> A Tilera tile
> 
> Remember when we said that cramming 100 cores on a die is not a big problem,
> but feeding them is? The Terabit Switch is how Tilera solves the problem, and
> it is a rather unique solution. Instead of one off-core bus, there are five.
> Each of them has a dedicated purpose, and that not only gives huge bandwidth,
> it also goes a fair way towards minimizing contention. Cache traffic will
> never be stepped on by user data, and so on.
> 
> The five networks are called QDN, RDN, FDN, IDN and UDN. In the last two
> generations of Tilera chips, all of these networks were 32 bits wide, but on
> the Gx, the widths vary to give each one more or less bandwidth depending on
> their functions.
> 
> QDN is called the reQuest Dynamic Network, and it is used for memory and
> cache. QDN is 64 bits wide. RDN is Response Dynamic Network, and it is used
> to feed memory reads back to the chips. RDN is 112 bits wide, an odd number,
> 64 + 48 from the look of it.
> 
> FDN is the widest at 128 bits, and it is used for cache to cache transfers
> and cache coherency. Given the critical nature of cache transactions like
> this, the width is no surprise. The last two IDN and UDN are both 32 bits
> wide. IDN is I/O Dunamic Network, and passes data on and off the chip. With a
> dedicated channel for off-chip transfers, you can see that reaching
> theoretical numbers was a priority at Tilera.
> 
> The last network UDN is for User Dynamic Network, basically the one users get
> to send stuff around on. QDN, RDN, FDN and IDN are basically housekeeping,
> they work in the background. If you want to send things from point A to point
> B, you send it across the UDN.
> 
> Although Tilera didn't explicitly state it, each hop from router to router
> takes one cycle. This means that in a pathological case, corner core to
> memory on the far corner, it could take 19 cycles to go from request to
> memory, plus the memory round trip time, and then another 19 cycles to get
> back. That is what you call a long time in computer speak. Even in an
> 'average' case, you have a 10 cycle latency, which is very long as well.
> 
> To be fair, the Tilera architecture is not made to run general purpose code.
> As it was described when the first generation came out, workloads are meant
> to be chunked up, so a single tile does a function, then the data gets passed
> to the next tile for more work, and so on and so forth. If your program has
> 20 steps, you use 20 tiles and pipeline the work.
> 
> This solves many of the problems with variable latency and multi-hop traffic.
> The other more elegant solution is the ability to section off chunks of the
> chip into sub-units. There is a hypervisor that can partition each Gx chip
> into programmable blocks.
> 
> Chunking tiles
> 
> Sub-sections of tiles
> 
> As you can see in the diagram above, each Gx is broken up into sub-chips in
> software. You can give each process as much CPU power as it needs, and
> arrange it so the output of one block feeds into the input of the next in a
> single clock. This example has two Apache web server instances, an intrusion
> prevention system (IPS), a secure sockets layer (SSL) stack, a network stack
> and a few other processes running next to each other.
> 
> The Apache instances have their own memory controller, as do the IPS and the
> SSL stack. The network stack is sitting on top of the memory controller for
> decreased latency. Basically, the programmer can choose where to put each
> process to minimize latency. It doesn't take much to figure out how to apply
> these concepts to a database plus web server scenario, or a three-tiered
> SAP-like workload.
> 
> Basically, Tilera allows you to explicitly place the data and compute
> resources where, when and how you need them. The chunks are done at roughly
> the same level as hardware VMs are in x86 CPUs, running below the level that
> a process can affect. This creates hardware walls to segregate data
> transfers, cache coherency traffic, and other tile to tile transfers. If done
> correctly, it can minimize latency a lot in addition to keeping processes
> from stepping on each other.
> 
> Now that you know how the cores work, talk, and are partitioned, what about
> the 'uncore'? Talk about that starts with the memory controllers - four
> DDR3-2133MHz banks on the 64 and 100 core Gx, two on the 16 and 36 core
> models. For the keen eyed out there, this means Tilera has two different
> socket configurations, one for the 64 and 100 core chips, and another one for
> the 16 and 36 core chips.
> 
> DDR3-2133MHz memory is very fast, hugely fast in fact. The math says 17GBps
> per contr. Basically, this chip has a lot of available bandwidth. As you
> might imagine, on the 16 and 36 core variants, there are only half the
> controllers, so half the bandwidth.
> 
> In addition, you have a generic controller for USB, UARTs, JTAG and I2C
> controllers. Given that Tilera chips are basically embedded, these are not
> likely to be used for much more than booting and diagnostics.
> 
> On the core diagram above, there are two other blocks, the orange MiCA and
> mPIPE accelerators. These are where the other parts of the Tilera Gx 'magic'
> happen. MiCA stands for Multistream iMesh Crypto Accelerator, while mPIPE is
> short for multicore Programmable Intelligent Packet Engine. If it isn't
> blindingly obvious, the MiCA does the crypto and the mPIPE speeds up I/O.
> 
> The mPIPE does a lot of interesting things, all supposedly at wire speed. It
> has a programmable packet classification engine, said to be usable at 80Gbps
> or 120M packets per second. It can twiddle headers and do other evil things
> that would make Comcast drool with the potential for 'network management'
> extortion payements.
> 
> In addition, it can also load balance across the various I/O lanes, and
> redirect tile to tile 'I/O' in a somewhat intelligent fashion. On top of
> that, the mPIPE manages buffer sizes, queues, and other housekeeping to keep
> latencies low. Think of it as a programmable housekeeping offload engine.
> 
> The most interesting bit is that the mPIPE can tag a packet with a 32 bit
> header before it sends it onto the internal network. This is where the
> programmable part shines. You can set up fields in the I/O packet itself to
> pass along pre-decode information and other time-saving tidbits. Since I/O is
> fully virtualizable, you could theoretically tag the packets with VM data, or
> just about anything else a bored programmer can think of.
> 
> The MiCA engines, two on the 64/100 core, one on 16/36 cores, are crypto
> offload engines. They can work either 'inline' or as ull blown offload
> engines, that is up to the programmer. The MiCA can pull data directly from
> caches or main memory without CPU overhead, basically fire and forget.
> 
> If you like acronyms, the MiCA on the Gx can support AES, 3DES, ARC4, Kasumi
> and Snow for crypto, SHA-1, SHA-2, MD5, HMAC and AES-GMAC for hashes, RSA,
> DSA, Diffie-Hellman, and Elliptic Curve for public key work, and it has a
> true random number generator (RNG). WTF, LOL, ROFL and other netspeak can be
> encrypted along with any other text that uses correct grammar. RLY.
> 
> Tilera claims that the MiCA engine can do wire speed 40Gbps crypto with full
> duplex on the 100 core Gx, and 1024b key RSA at 50K keys per second on the
> 100 core, 20K keys per second for the 36 core. Not bad at all. In addition,
> the MiCA supports a hardware compression engine that uses the tried and true
> Deflate algorithm.
> 
> The last piece of the puzzle is something that Tilera calls external
> acceleration interfaces. This could be as simple as plugging in a PCIe card,
> but that lacks elegance. The interesting part is a field programmable gate
> array (FPGA) interface. You can take up to 8 lanes of PCIe and connect the
> FPGA to the serial deserial unit (SerDes) to enable basically direct and low
> latency 32Gbps transfers. Direct transfers to cache and multiple contexts are
> supported, meaning you can do quite a bit with an FPGA and a Tilera-Gx chip.
> 
> In the end, you have a monster chip for I/O and packet processing. It doesn't
> do single-threaded applications all that fast, but it really isn't meant to.
> The chip itself is not out yet, nor is there even silicon yet. The first
> version out will be the 36 core Gx in Q4 of 2010, followed by the 16 core
> later in Q4 or possibly Q1 of 2011. These both share the same socket
> configuration and a 35*35mm package.
> 
> In Q1 of 2011, the 100 core chip will come out on a new socket and in a
> 45*45mm package. A bit after that, the 64 core will hit the market. Power
> ranges from 10W for the 16 core to 55W for the 100 core, but you can get
> power optimized variants that will only suck 35W. Given the programmability
> of the parts, power use is likely more dependent on the programs running on
> it.
> 
> The last bit of information is clock speeds. The 64 and 100 core models will
> come in versions that run at 1.25GHz and 1.5GHz, not bad considering how much
> there is to synchronize and keep going. The 36 core models will come in
> 1.0GHz, 1.25GHz and 1.5GHz versions, and the 16 core models will only come in
> 1.0GHz or 1.25GHz versions. Given the core count, internal interconnections,
> memory and I/O capabilities, Tilera will pack a lot of power into these small
> packages.S|A
> 
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


From lindahl at pbm.com  Wed Nov  4 10:36:09 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Wed, 4 Nov 2009 10:36:09 -0800
Subject: [Beowulf] A look at the 100-core Tilera Gx
In-Reply-To: <4AF189B0.2070200@tamu.edu>
References: <20091103173717.GT17686@leitl.org> <4AF189B0.2070200@tamu.edu>
Message-ID: <20091104183609.GA6626@bx9.net>

On Wed, Nov 04, 2009 at 08:03:28AM -0600, Gerry Creager wrote:

> And,
> in the only reasonable space I can build out to expand into, power's $90K 
> and cooling another $100K to expand, allowing an additional 20 racks.

Uh, I'm missing how this is a big problem... $200k or $300k of capital
costs to get 20 more racks. Let's assume that you unfairly have to pay
the whole capital cost up front. How much does the equipment to fill
those racks cost? A lot more than $200k, unless you have a really low
power density, or are buying nodes with small ram, no high speed
network, etc.

So it's annoying, but not a show-stopper. If you had to build a new
building or addition, yeah, that would really hurt.

This is the fundamental problem that low power startups face. They
have a huge advantage when the machineroom is of a fixed capacity, or
if capital costs aren't accounted for properly. They have a modest
advantage for organizations that can capitalize things.

The first market isn't big enough, and the big market in which they
only have a modest advantage won't pay enough of a premium. Game over.

-- greg


From cbergstrom at pathscale.com  Tue Nov  3 17:27:22 2009
From: cbergstrom at pathscale.com (=?ISO-8859-1?Q?=22C=2E_Bergstr=F6m=22?=)
Date: Tue, 03 Nov 2009 20:27:22 -0500
Subject: [Beowulf] Fortran Array size question
In-Reply-To: <6.2.5.6.2.20091103175916.0973d718@NumerEx-LLC.com>
References: <web-2752104@free.net>
	<4AF0739E.8030700@ias.edu>	<20091103183927.GC16399@bx9.net>
	<4AF08350.8060107@ias.edu>
	<6.2.5.6.2.20091103175916.0973d718@NumerEx-LLC.com>
Message-ID: <4AF0D87A.60501@pathscale.com>

Michael H. Frese wrote:
> I think Gus has it right.  Your two arrays are 200 million floating 
> point words each, perhaps 8 bytes per word, and therefore are 
> approximately about 3  gigabytes total.
>
> That's too big for the default 'small' memory model.
>
> Certainly, there are no limits to array sizes in Fortran.
If you need a 64bit Fortran compiler ping me off list. (PathScale not 
iFort of course)  I was working with some extremely large code a few 
months ago and one of our engineers could probably polish up my patch 
and send out test binaries to anyone interested.  All we would ask for 
in this specific case is feedback.

<marketing>Our performance on floating point code should be better than 
Intel's.  If you find to the contrary that's a bug we're happy to 
address</marketing>

./C


From niftyompi at niftyegg.com  Wed Nov  4 16:36:20 2009
From: niftyompi at niftyegg.com (Nifty Tom Mitchell)
Date: Wed, 4 Nov 2009 16:36:20 -0800
Subject: [Beowulf] Fortran Array size question
In-Reply-To: <4AF0D87A.60501@pathscale.com>
References: <web-2752104@free.net> <4AF0739E.8030700@ias.edu>
	<20091103183927.GC16399@bx9.net> <4AF08350.8060107@ias.edu>
	<6.2.5.6.2.20091103175916.0973d718@NumerEx-LLC.com>
	<4AF0D87A.60501@pathscale.com>
Message-ID: <20091105003620.GA2166@hpegg.wr.niftyegg.com>

On Tue, Nov 03, 2009 at 08:27:22PM -0500, "C. Bergstr?m" wrote:
> Michael H. Frese wrote:
> >I think Gus has it right.  Your two arrays are 200 million
> >floating point words each, perhaps 8 bytes per word, and therefore
> >are approximately about 3  gigabytes total.
> >
> >That's too big for the default 'small' memory model.
> >
> >Certainly, there are no limits to array sizes in Fortran.
> If you need a 64bit Fortran compiler ping me off list. (PathScale
> not iFort of course)  I was working with some extremely large code a
> few months ago and one of our engineers could probably polish up my
> patch and send out test binaries to anyone interested.  All we would
> ask for in this specific case is feedback.
> 
> <marketing>Our performance on floating point code should be better
> than Intel's.  If you find to the contrary that's a bug we're happy
> to address</marketing>

If I recall the pathscale compiler also needs a flag to establish
the memory model at compile time.   

One thing I cannot answer off the top of my head is if there is a need
to also establish the memory model for libraries. Since all libs are not
the same some will and some will not.  If there is any doubt when debugging 
check that the compile flags are consistant.


From prentice at ias.edu  Thu Nov  5 10:33:07 2009
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu, 05 Nov 2009 13:33:07 -0500
Subject: [Beowulf] Fortran Array size question
In-Reply-To: <4AF063B3.3050508@ias.edu>
References: <4AF063B3.3050508@ias.edu>
Message-ID: <4AF31A63.7040803@ias.edu>


Prentice Bisbal wrote:
> This question is a bit off-topic, but since it involves Fortran minutia,
> I figured this would be the best place to ask. This code may eventually
> run on my cluster, so it's not completely off topic!
> 
> Question: What is the maximum number of elements you can have in a
> double-precision array in Fortran? I have someone creating a
> 4-dimensional double-precision array. When they increase the dimenions
> of the array to ~200 million elements, they get this error:
> 
> compilation aborted (code 1).
> 
> I'm sure they're hitting a Fortran limit, but I need to prove it. I
> haven't been able to find anything using The Google.
> 

Everyone - thanks for all the help. I was inundated with suggestions,
which I've passed along to the researcher I'm helping. I'm sure one of
them will help us get the code to compile and run. Thanks again.

--
Prentice


From deadline at eadline.org  Fri Nov  6 06:22:10 2009
From: deadline at eadline.org (Douglas Eadline)
Date: Fri, 6 Nov 2009 09:22:10 -0500 (EST)
Subject: [Beowulf] The 2009 Beowulf Bash
In-Reply-To: <4AF31A63.7040803@ias.edu>
References: <4AF063B3.3050508@ias.edu> <4AF31A63.7040803@ias.edu>
Message-ID: <46622.192.168.1.213.1257517330.squirrel@mail.eadline.org>


For those of you who have been waiting all year
for this event, check out the following URL

  http://www.xandmarketing.com/beobash09/

Back Story: A certain local Portland company who
must remain anonymous has donated 5 kegs of
local beer they had brewed for SC09. I'm sure if you compile
a list of possible Portland based HPC companies you will
find it is a rather small group ;-)

-- 
Doug


From amjad11 at gmail.com  Fri Nov  6 14:43:38 2009
From: amjad11 at gmail.com (amjad ali)
Date: Fri, 6 Nov 2009 17:43:38 -0500
Subject: [Beowulf] Programming Help needed
Message-ID: <428810f20911061443r456e04ccu81e80cd8eb51eab4@mail.gmail.com>

Hi all,

I need/request some help from those who have some experience in
debugging/profiling/tuning parallel scientific codes, specially for
PDEs/CFD.

I have parallelized a Fortran CFD code to run on
Ethernet-based-Linux-Cluster. Regarding MPI communication what I do is that:

Suppose that the grid/mesh is decomposed for n number of processors, such
that each processors has a number of elements that share their side/face
with different processors. What I do is that I start non blocking MPI
communication at the partition boundary faces (faces shared between any two
processors) , and then start computing values on the internal/non-shared
faces. When I complete this computation, I put WAITALL to ensure MPI
communication completion. Then I do computation on the partition boundary
faces (shared-ones). This way I try to hide the communication behind
computation. Is it correct?

IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or less elements)
with an another processor B then it sends/recvs 50 different messages. So in
general if a processors has X number of faces sharing with any number of
other processors it sends/recvs that much messages. Is this way has "very
much reduced" performance in comparison to the possibility that processor A
will send/recv a single-bundle message (containg all 50-faces-data) to
process B. Means that in general a processor will only send/recv that much
messages as the number of processors neighbour to it.  It will send a single
bundle/pack of messages to each neighbouring processor.
Is their "quite a much difference" between these two approaches?

THANK YOU VERY MUCH.
AMJAD.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091106/f5b64c9a/attachment.html>

From stewart at serissa.com  Sat Nov  7 03:11:19 2009
From: stewart at serissa.com (Larry Stewart)
Date: Sat, 7 Nov 2009 06:11:19 -0500
Subject: [Beowulf] Programming Help needed
In-Reply-To: <428810f20911061443r456e04ccu81e80cd8eb51eab4@mail.gmail.com>
References: <428810f20911061443r456e04ccu81e80cd8eb51eab4@mail.gmail.com>
Message-ID: <77b0285f0911070311y111b4bd4o1d8e49994bc017e2@mail.gmail.com>

On Fri, Nov 6, 2009 at 5:43 PM, amjad ali <amjad11 at gmail.com> wrote:

> Hi all,
>
>
> IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or less
> elements) with an another processor B then it sends/recvs 50 different
> messages. So in general if a processors has X number of faces sharing with
> any number of other processors it sends/recvs that much messages. Is this
> way has "very much reduced" performance in comparison to the possibility
> that processor A will send/recv a single-bundle message (containg all
> 50-faces-data) to process B. Means that in general a processor will only
> send/recv that much messages as the number of processors neighbour to it.
> It will send a single bundle/pack of messages to each neighbouring
> processor.
> Is their "quite a much difference" between these two approaches?
>
>
It is probably faster to send a single message with all the data, rather
than fifty messages, especially if each item is small.  However, you don't
have to guess.  Just create a small test
program and use MPI_WTIME to measure how long the two cases take.

The usual way to do timing measurements that gets decent results is to
measure the time for one iteration of the two cases, then measure the time
for two iterations, then 4, then 8, and so on
until the time for a run exceeds one second.

The issues that make it likely that one big message is faster than 50 small
ones are that copying the data into a single message on a modern processor
will be much faster than sending the bits over ethernet, and that each
message has a certain overhead, which is probably large compared to the copy
and transmission time of a small datum.

If you will be writing MPI programs for various problems, it might be useful
to download and run something like the Intel MPI Tests, that will give you
performance figures for the various MPI operations and give you a feel for
how expensive different things are on your system.

-L
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091107/4dd28168/attachment.html>

From stewart at serissa.com  Sat Nov  7 03:22:04 2009
From: stewart at serissa.com (Larry Stewart)
Date: Sat, 7 Nov 2009 06:22:04 -0500
Subject: [Beowulf] Programming Help needed
In-Reply-To: <428810f20911061443r456e04ccu81e80cd8eb51eab4@mail.gmail.com>
References: <428810f20911061443r456e04ccu81e80cd8eb51eab4@mail.gmail.com>
Message-ID: <77b0285f0911070322o5c29c8bbk13f78178b2cef8a0@mail.gmail.com>

On Fri, Nov 6, 2009 at 5:43 PM, amjad ali <amjad11 at gmail.com> wrote:

> Hi all,
>
>
> Suppose that the grid/mesh is decomposed for n number of processors, such
> that each processors has a number of elements that share their side/face
> with different processors. What I do is that I start non blocking MPI
> communication at the partition boundary faces (faces shared between any two
> processors) , and then start computing values on the internal/non-shared
> faces. When I complete this computation, I put WAITALL to ensure MPI
> communication completion. Then I do computation on the partition boundary
> faces (shared-ones). This way I try to hide the communication behind
> computation. Is it correct?
>
>
There are two issues here.  First, correctness.  The data for messages that
arrive while you are computing may be written into memory asynchronously
with respect to your program. Be sure that you are not depending on values
in memory that may be overwritten by data arriving from other
ranks.  Second, overlap is good, but whether you actually get any overlap
depends on the details.
For example, the work of communicating with other ranks and sending messages
and so forth must be done by something. For ethernet, there will be a lot of
work done by the OS kernel and in general by some core on each node. If you
expect to be using all the cores in a node to run your program, who is left
to do the communications work?  Some implementations will timeshare the
processors, giving the appearance of overlap, but not actually running
faster, while other implementations simply won't do any work until the
WAITALL that demands progress. If you have multicore nodes, and you don't
need every last core to run your program, it can help if you only allocate
some of the cores on each node to your program, leaving some "idle" to run
the OS and the communications.  The job control system should have a way to
do this.

You can test to find out if you are getting any overlap, by artificially
reducing the actual communications work to near zero and seeing if the
program runs any faster.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091107/f41bd6e5/attachment.html>

From john.hearns at mclaren.com  Mon Nov  9 10:19:39 2009
From: john.hearns at mclaren.com (Hearns, John)
Date: Mon, 9 Nov 2009 18:19:39 -0000
Subject: [Beowulf] Kingston 40Gbyte SSD
Message-ID: <68A57CCFD4005646957BD2D18E60667B0DC8CE0A@milexchmb1.mil.tagmclarengroup.com>

Might make a nice cluster node drive:
http://www.reghardware.co.uk/2009/11/09/review_storage_kingston_ssd_now_
v_40gb/

The contents of this email are confidential and for the exclusive use of the intended recipient.  If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy.


From joshua_mora at usa.net  Fri Nov  6 15:29:58 2009
From: joshua_mora at usa.net (Joshua mora acosta)
Date: Fri, 06 Nov 2009 17:29:58 -0600
Subject: [Beowulf] Programming Help needed
Message-ID: <466NkFXC78668S06.1257550198@cmsweb06.cms.usa.net>

Just try it and you'll understand what it means communication overhead....
most of these apps are network latency dominated: small messages but lots
because of i) many neighbor processors involved and iterative process.
Packing all the faces that need to be exchanges is the right way to go.
You can also think in having a dedicated thread for handling the
communications and the remaining ones for computation at the compute node
level. So you really get good overlapping of computation and commputation. 

Joshua
------ Original Message ------
Received: 04:52 PM CST, 11/06/2009
From: amjad ali <amjad11 at gmail.com>
To: Beowulf Mailing List <beowulf at beowulf.org>
Subject: [Beowulf] Programming Help needed

> Hi all,
> 
> I need/request some help from those who have some experience in
> debugging/profiling/tuning parallel scientific codes, specially for
> PDEs/CFD.
> 
> I have parallelized a Fortran CFD code to run on
> Ethernet-based-Linux-Cluster. Regarding MPI communication what I do is
that:
> 
> Suppose that the grid/mesh is decomposed for n number of processors, such
> that each processors has a number of elements that share their side/face
> with different processors. What I do is that I start non blocking MPI
> communication at the partition boundary faces (faces shared between any two
> processors) , and then start computing values on the internal/non-shared
> faces. When I complete this computation, I put WAITALL to ensure MPI
> communication completion. Then I do computation on the partition boundary
> faces (shared-ones). This way I try to hide the communication behind
> computation. Is it correct?
> 
> IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or less
elements)
> with an another processor B then it sends/recvs 50 different messages. So
in
> general if a processors has X number of faces sharing with any number of
> other processors it sends/recvs that much messages. Is this way has "very
> much reduced" performance in comparison to the possibility that processor A
> will send/recv a single-bundle message (containg all 50-faces-data) to
> process B. Means that in general a processor will only send/recv that much
> messages as the number of processors neighbour to it.  It will send a
single
> bundle/pack of messages to each neighbouring processor.
> Is their "quite a much difference" between these two approaches?
> 
> THANK YOU VERY MUCH.
> AMJAD.
> 

> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
> 


From amjad11 at gmail.com  Tue Nov 10 10:21:08 2009
From: amjad11 at gmail.com (amjad ali)
Date: Tue, 10 Nov 2009 13:21:08 -0500
Subject: [Beowulf] MPI Coding help needed
Message-ID: <428810f20911101021x6421743dx10487632a648d141@mail.gmail.com>

Hi all.
(sorry for duplication, if it is)

I have to parallelize a CFD code using domain/grid/mesh partitioning among
the processes. Before running, we do not know,
(i) How many processes we will use ( np  is unknown)
(ii) A process will have how many neighbouring processes (my_nbrs = ?)
(iii) How many entries a process need to send to a particular neighbouring
process.
But when the code run, I calculate all of this info easily.


The problem is to copy a number of entries to an array then send that array
to a destination process. The same sender has to repeat this work to send
data to all of its neighbouring processes. Is this following code fine:

DO i = 1, my_nbrs
   DO j = 1, few_entries_for_this_neighbour
       send_array(j)   =    my_array(jth_particular_entry)
   ENDDO
   CALL MPI_ISEND(send_array(1:j),j, MPI_REAL8, dest(i), tag,
MPI_COMM_WORLD, request1(i), ierr)
ENDDO

And the corresponding receives, at each process:

DO i = 1, my_nbrs
   k = few_entries_from_this_neighbour
   CALL MPI_IRECV(recv_array(1:k),k, MPI_REAL8, source(i), tag,
MPI_COMM_WORLD, request2(i), ierr)
   DO j = 1, few_from_source(i)
       received_data(j)   =    recv_array(j)
   ENDDO
ENDDO

After the above MPI_WAITALL.


I think this code will not work. Both for sending and receiving. For the
non-blocking sends we cannot use send_array to send data to other processes
like above (as we are not sure for the availability of application buffer
for reuse). Am I right?

Similar problem is with recv array; data from multiple processes cannot be
received in the same array like above. Am I right?


Target is to hide communication behind computation. So need non blocking
communication. As we do know value of np or values of my_nbrs for each
process, we cannot decide to create so many arrays. Please suggest solution.


===================
A more subtle solution that I could assume is following:

cc = 0
DO i = 1, my_nbrs
   DO j = 1, few_entries_for_this_neighbour
       send_array(cc+j)   =    my_array(jth_particular_entry)
   ENDDO
   CALL MPI_ISEND(send_array(cc:cc+j),j, MPI_REAL8, dest(i), tag,
MPI_COMM_WORLD, request1(i), ierr)
   cc = cc  + j
ENDDO

And the corresponding receives, at each process:

cc = 0
DO i = 1, my_nbrs
   k = few_entries_from_this_neighbour
   CALL MPI_IRECV(recv_array(cc+1:cc+k),k, MPI_REAL8, source(i), tag,
MPI_COMM_WORLD, request2(i), ierr)
   DO j = 1, k
       received_data(j)   =    recv_array(cc+j)
   ENDDO
   cc = cc + k
ENDDO

After the above MPI_WAITALL.

Means that,
send_array for all neighbours will have a collected shape:
send_array = [... entries for nbr 1 ..., ... entries for nbr 1 ..., ..., ...
entries for last nbr ...]
And the respective entries will be send to respective neighbours as above.


recv_array for all neighbours will have a collected shape:
recv_array = [... entries from nbr 1 ..., ... entries from nbr 1 ..., ...,
... entries from last nbr ...]
And the entries from the processes will be received at respective
locations/portion in the recv_array.


Is this scheme is quite fine and correct.

I am in search of efficient one.

Request for help.


With best regards,
Amjad Ali.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091110/adf0f1fc/attachment.html>

From a28427 at ua.pt  Tue Nov 10 10:31:41 2009
From: a28427 at ua.pt (Tiago Marques)
Date: Tue, 10 Nov 2009 18:31:41 +0000
Subject: [Beowulf] bizarre scaling behavior on a Nehalem
In-Reply-To: <c4d69730908111157q4e10abdcka6b91b1fbcfe564b@mail.gmail.com>
References: <c4d69730908100941p482606a8r91a283d4aca95aee@mail.gmail.com>
	<a8d96dec0908101048h6ecdda91h40f179725b6c6fd2@mail.gmail.com>
	<c4d69730908101528r4695a7c8nda3857afbc48530c@mail.gmail.com>
	<4A81ACF3.60802@noaa.gov>
	<c4d69730908111157q4e10abdcka6b91b1fbcfe564b@mail.gmail.com>
Message-ID: <b1335fe90911101031y46208314m81da924d4edf0e6c@mail.gmail.com>

Hi all,

Sorry to ressurect this thread after all this time but I just figured
out the problem with VASP by chance.

VASP's INCAR file accepts one parameter that both fixes scalability
problems and increases performance at the same time, even if you still
stick to 6 cores. That parameter is NPAR.
I was recommended to set NPAR=2 for most calculations and it worked
great. Still, I experimented a bit and NPAR=1 and it gave even better
results. It seems VASP, by default, is using NPAR=NCPUS, which
cripples performance if you don't use multiples of 3.

" running on    8 nodes
 distr:  one band on    8 nodes,    1 groups"

This is with NPAR=1

NPAR=2 gives something like:

" running on    8 nodes
 distr:  one band on    4 nodes,    2 groups"

Enjoy the performance increase, if you haven't still. To us it
increased around 33% in conjunction with running 8 CPUs. It seems to
me that groups may be useful to run with more nodes and not just one
machine but I haven't had the chance to test that out.


On Tue, Aug 11, 2009 at 6:57 PM, Rahul Nabar <rpnabar at gmail.com> wrote:
> On Tue, Aug 11, 2009 at 12:40 PM, Craig Tierney<Craig.Tierney at noaa.gov> wrote:
>> What are you doing to ensure that you have both memory and processor
>> affinity enabled?
>>
>
> All I was using now was the flag:
>
> --mca mpi_paffinity_alone 1

I was actually using that on the Xeons 54xx, since the processors
aren't native quad-cores, the kernel would keep threads bouncing from
core to core to achive a proper load balance. This was the best it
could do and I managed to get about 3% better performance from using
that together with disabling some kernel option I don't quite remember
right now, so the threads wouldn't jump around anymore. If you didn't
disabled the load balancing the code would inevitably mis-schedule and
the code would end up running with only 5 cores(or from start) and
calculations would take around 10x longer.
This was only useful in 6 cores per node, as then each processor would
be running precisely 3 threads. With eight I haven't tried it but I
assume the advantage would be negligible.

Best regards,
Tiago Marques

>
> Is there anything else I ought to be doing as well?
>
> --
> Rahul
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>


From spidus000 at gmail.com  Mon Nov  9 19:57:08 2009
From: spidus000 at gmail.com (SpiduS Okami)
Date: Tue, 10 Nov 2009 01:57:08 -0200
Subject: [Beowulf] Configuring HPCC
Message-ID: <ebe51f0b0911091957r7b31396bx21713ebe9f4f58e8@mail.gmail.com>

Hello To All!

This is my first post in beowulf list. I am looking for a help to
configure my HPC Challenge test. I will be graduating in Computer
science soon and,before that, I need to deliver a graduation work. I
choosed to compare a "supercomputer" with a cluster of old machines,
trying to proove that a cluster of old machines can cost less and
process as much information as a supercomputer. For that I must
maximize the capacity of the information processed by the
supercomputer and the cluster.

My configuration of "supercomputer" is:

AMD Phenom 9600 Quad-Core
4 GB DDR2 800mhz RAM
500 GB HD - 40 GB for UBUNTU 9.10
HPCC 1.3.1 installed by deb package.


Thak you!
Att.,
Victor Bruno Alexander.


From amjad11 at gmail.com  Tue Nov 10 17:30:30 2009
From: amjad11 at gmail.com (amjad ali)
Date: Tue, 10 Nov 2009 20:30:30 -0500
Subject: [Beowulf] array shape difference
Message-ID: <428810f20911101730t45284baalcc0b5824091d8d7c@mail.gmail.com>

HI,

suppose we have four arrays with same number of elements say 60000., but
different dimensions like:

array1(1:60000)
array2(1:2, 1:30000)
array3(1:2, 1:300, 1:100)
array4(1:4, 1:15, 1:10,  1:100)


Does each of these arrays in fortran will occupy same amount of memory?

For sending/receiving each of these, Does MPI has the same (or nearly same)
overhead? or any significantly different overhead is involved in handling
each of these arrays (by MPI)?

with best regards,
Amjad Ali
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091110/d33c6d80/attachment.html>

From stewart at serissa.com  Tue Nov 10 17:43:13 2009
From: stewart at serissa.com (Larry Stewart)
Date: Tue, 10 Nov 2009 20:43:13 -0500
Subject: [Beowulf] array shape difference
In-Reply-To: <428810f20911101730t45284baalcc0b5824091d8d7c@mail.gmail.com>
References: <428810f20911101730t45284baalcc0b5824091d8d7c@mail.gmail.com>
Message-ID: <77b0285f0911101743x46cbd77bl147b63efc803042c@mail.gmail.com>

On Tue, Nov 10, 2009 at 8:30 PM, amjad ali <amjad11 at gmail.com> wrote:

> HI,
>
> suppose we have four arrays with same number of elements say 60000., but
> different dimensions like:
>
> array1(1:60000)
> array2(1:2, 1:30000)
> array3(1:2, 1:300, 1:100)
> array4(1:4, 1:15, 1:10,  1:100)
>
>
> Does each of these arrays in fortran will occupy same amount of memory?
>
> For sending/receiving each of these, Does MPI has the same (or nearly same)
> overhead? or any significantly different overhead is involved in handling
> each of these arrays (by MPI)?
>
> They should take the same amount of space and have nearly identical
transfer times with MPI.
(If you send the whole thing)

-L
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091110/99a27eb0/attachment.html>

From Michael.Frese at NumerEx-LLC.com  Wed Nov 11 06:01:58 2009
From: Michael.Frese at NumerEx-LLC.com (Michael H. Frese)
Date: Wed, 11 Nov 2009 07:01:58 -0700
Subject: [Beowulf] array shape difference
In-Reply-To: <77b0285f0911101743x46cbd77bl147b63efc803042c@mail.gmail.co
 m>
References: <428810f20911101730t45284baalcc0b5824091d8d7c@mail.gmail.com>
	<77b0285f0911101743x46cbd77bl147b63efc803042c@mail.gmail.com>
Message-ID: <6.2.5.6.2.20091111065622.0a466228@NumerEx-LLC.com>

At 06:43 PM 11/10/2009, Larry Stewart wrote:


>On Tue, Nov 10, 2009 at 8:30 PM, amjad ali 
><<mailto:amjad11 at gmail.com>amjad11 at gmail.com> wrote:
>HI,
>
>suppose we have four arrays with same number of elements say 60000., 
>but different dimensions like:
>
>array1(1:60000)
>array2(1:2, 1:30000)
>array3(1:2, 1:300, 1:100)
>array4(1:4, 1:15, 1:10,  1:100)
>
>
>Does each of these arrays in fortran will occupy same amount of memory?
>
>For sending/receiving each of these, Does MPI has the same (or 
>nearly same) overhead? or any significantly different overhead is 
>involved in handling each of these arrays (by MPI)?
>
>They should take the same amount of space and have nearly identical 
>transfer times with MPI.
>(If you send the whole thing)
>
>-L

Those four array descriptors can all apply to exactly the same space, 
via an 'equivalence' statement.  They are all laid out in memory just 
like array1.

Thus, they can each be transmitted by exactly the same MPI send.


Mike
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091111/059fabd3/attachment.html>

From amjad11 at gmail.com  Wed Nov 11 09:04:25 2009
From: amjad11 at gmail.com (amjad ali)
Date: Wed, 11 Nov 2009 12:04:25 -0500
Subject: [Beowulf] MPI Derived datatype + Persistent
Message-ID: <428810f20911110904u5efcd310h215ab67ddcbf917a@mail.gmail.com>

Hi all,

I read that

MPI Derived datatypes may provide efficient way to send data non-contiguous
in the memory.
MPI Persistent communication may provide efficient way in case some
specified/fix communication is performed in an iterative code.

Can we combine both together to get some enhanced benefit/efficiency?

Better if any body can refer to some tutorial/example-code on this.

Thank you for you attention.

With best regards,
Amjad Ali.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091111/e8374b4b/attachment.html>

From peter.st.john at gmail.com  Wed Nov 11 14:40:00 2009
From: peter.st.john at gmail.com (Peter St. John)
Date: Wed, 11 Nov 2009 17:40:00 -0500
Subject: [Beowulf] array shape difference
In-Reply-To: <6.2.5.6.2.20091111065622.0a466228@NumerEx-LLC.com>
References: <428810f20911101730t45284baalcc0b5824091d8d7c@mail.gmail.com>
	<77b0285f0911101743x46cbd77bl147b63efc803042c@mail.gmail.com>
	<6.2.5.6.2.20091111065622.0a466228@NumerEx-LLC.com>
Message-ID: <e4d4fd070911111440k5736f810rabd2c2db0552771f@mail.gmail.com>

The difference between:
array1(1:60000)
array2(1:2, 1:30000)

would be reflected in the size of the executable, not the size of the data.
Right?
Peter


On Wed, Nov 11, 2009 at 9:01 AM, Michael H. Frese <
Michael.Frese at numerex-llc.com> wrote:

>  At 06:43 PM 11/10/2009, Larry Stewart wrote:
>
>
> On Tue, Nov 10, 2009 at 8:30 PM, amjad ali <amjad11 at gmail.com> wrote:
>  HI,
>
> suppose we have four arrays with same number of elements say 60000., but
> different dimensions like:
>
> array1(1:60000)
> array2(1:2, 1:30000)
> array3(1:2, 1:300, 1:100)
> array4(1:4, 1:15, 1:10,  1:100)
>
>
> Does each of these arrays in fortran will occupy same amount of memory?
>
> For sending/receiving each of these, Does MPI has the same (or nearly same)
> overhead? or any significantly different overhead is involved in handling
> each of these arrays (by MPI)?
>
> They should take the same amount of space and have nearly identical
> transfer times with MPI.
> (If you send the whole thing)
>
> -L
>
>
> Those four array descriptors can all apply to exactly the same space, via
> an 'equivalence' statement.  They are all laid out in memory just like
> array1.
>
> Thus, they can each be transmitted by exactly the same MPI send.
>
>
> Mike
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091111/f5485a11/attachment.html>

From Michael.Frese at NumerEx-LLC.com  Thu Nov 12 04:18:41 2009
From: Michael.Frese at NumerEx-LLC.com (Michael H. Frese)
Date: Thu, 12 Nov 2009 05:18:41 -0700
Subject: [Beowulf] array shape difference
In-Reply-To: <e4d4fd070911111440k5736f810rabd2c2db0552771f@mail.gmail.co
 m>
References: <428810f20911101730t45284baalcc0b5824091d8d7c@mail.gmail.com>
	<77b0285f0911101743x46cbd77bl147b63efc803042c@mail.gmail.com>
	<6.2.5.6.2.20091111065622.0a466228@NumerEx-LLC.com>
	<e4d4fd070911111440k5736f810rabd2c2db0552771f@mail.gmail.com>
Message-ID: <6.2.5.6.2.20091112051712.067d6e40@NumerEx-LLC.com>

That's correct.  The executable size would reflect the extra 
operations required to compute the offset for the doubly dimensioned array.


Mike

At 03:40 PM 11/11/2009, Peter St. John wrote:
>The difference between:
>array1(1:60000)
>array2(1:2, 1:30000)
>
>would be reflected in the size of the executable, not the size of the data.
>Right?
>Peter
>
>
>On Wed, Nov 11, 2009 at 9:01 AM, Michael H. Frese 
><<mailto:Michael.Frese at numerex-llc.com>Michael.Frese at numerex-llc.com> wrote:
>At 06:43 PM 11/10/2009, Larry Stewart wrote:
>
>
>>On Tue, Nov 10, 2009 at 8:30 PM, amjad ali 
>><<mailto:amjad11 at gmail.com>amjad11 at gmail.com> wrote:
>>HI,
>>suppose we have four arrays with same number of elements say 
>>60000., but different dimensions like:
>>array1(1:60000)
>>array2(1:2, 1:30000)
>>array3(1:2, 1:300, 1:100)
>>array4(1:4, 1:15, 1:10,  1:100)
>>
>>Does each of these arrays in fortran will occupy same amount of memory?
>>For sending/receiving each of these, Does MPI has the same (or 
>>nearly same) overhead? or any significantly different overhead is 
>>involved in handling each of these arrays (by MPI)?
>>
>>They should take the same amount of space and have nearly identical 
>>transfer times with MPI.
>>(If you send the whole thing)
>>
>>-L
>
>Those four array descriptors can all apply to exactly the same 
>space, via an 'equivalence' statement.  They are all laid out in 
>memory just like array1.
>
>Thus, they can each be transmitted by exactly the same MPI send.
>
>
>Mike
>
>_______________________________________________
>Beowulf mailing list, 
><mailto:Beowulf at beowulf.org>Beowulf at beowulf.org sponsored by Penguin Computing
>To change your subscription (digest mode or unsubscribe) visit 
><http://www.beowulf.org/mailman/listinfo/beowulf>http://www.beowulf.org/mailman/listinfo/beowulf
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091112/5a6a4644/attachment.html>

From stuartb at 4gh.net  Thu Nov 12 05:26:13 2009
From: stuartb at 4gh.net (Stuart Barkley)
Date: Thu, 12 Nov 2009 08:26:13 -0500 (EST)
Subject: [Beowulf] array shape difference
In-Reply-To: <6.2.5.6.2.20091112051712.067d6e40@NumerEx-LLC.com>
References: <428810f20911101730t45284baalcc0b5824091d8d7c@mail.gmail.com>
	<77b0285f0911101743x46cbd77bl147b63efc803042c@mail.gmail.com>
	<6.2.5.6.2.20091111065622.0a466228@NumerEx-LLC.com>
	<e4d4fd070911111440k5736f810rabd2c2db0552771f@mail.gmail.com>
	<6.2.5.6.2.20091112051712.067d6e40@NumerEx-LLC.com>
Message-ID: <alpine.BSF.2.00.0911120816580.2256@freeman.4gh.net>

At 03:40 PM 11/11/2009, Peter St. John wrote:
>       The difference between:
>       array1(1:60000)
>       array2(1:2, 1:30000)
>
>       would be reflected in the size of the executable, not the size
>       of the data.
>       Right?

On Thu, 12 Nov 2009 at 07:18 -0000, Michael H. Frese wrote:

> That's correct.? The executable size would reflect the extra operations
> required to compute the offset for the doubly dimensioned array.

Or maybe not.

<theory>
If the fortran code is doing virtual subscripts (e.g. array2(i*2 + j))
it would likely generate about the same code as the compiler would
generate for 2 dimensions.  In theory, the compiler can generate
better subscript computation but I suspect in most reasonable (or
simple testing) cases the actual code size difference is a wash.
</theory>

Go with what is most natural for expressing the algorithm.  And ease
the future maintenance.

Stuart
-- 
I've never been lost; I was once bewildered for three days, but never lost!
                                        --  Daniel Boone

From Michael.Frese at NumerEx-LLC.com  Thu Nov 12 06:50:31 2009
From: Michael.Frese at NumerEx-LLC.com (Michael H. Frese)
Date: Thu, 12 Nov 2009 07:50:31 -0700
Subject: [Beowulf] array shape difference
In-Reply-To: <alpine.BSF.2.00.0911120816580.2256@freeman.4gh.net>
References: <428810f20911101730t45284baalcc0b5824091d8d7c@mail.gmail.com>
	<77b0285f0911101743x46cbd77bl147b63efc803042c@mail.gmail.com>
	<6.2.5.6.2.20091111065622.0a466228@NumerEx-LLC.com>
	<e4d4fd070911111440k5736f810rabd2c2db0552771f@mail.gmail.com>
	<6.2.5.6.2.20091112051712.067d6e40@NumerEx-LLC.com>
	<alpine.BSF.2.00.0911120816580.2256@freeman.4gh.net>
Message-ID: <6.2.5.6.2.20091112074734.0a4a2fe8@NumerEx-LLC.com>

At 06:26 AM 11/12/2009, Stuart Barkley wrote:
>At 03:40 PM 11/11/2009, Peter St. John wrote:
> >       The difference between:
> >       array1(1:60000)
> >       array2(1:2, 1:30000)
> >
> >       would be reflected in the size of the executable, not the size
> >       of the data.
> >       Right?
>
>On Thu, 12 Nov 2009 at 07:18 -0000, Michael H. Frese wrote:
>
> > That's correct.  The executable size would reflect the extra operations
> > required to compute the offset for the doubly dimensioned array.
>
>Or maybe not.
>
><theory>
>If the fortran code is doing virtual subscripts (e.g. array2(i*2 + j))
>it would likely generate about the same code as the compiler would
>generate for 2 dimensions.  In theory, the compiler can generate
>better subscript computation but I suspect in most reasonable (or
>simple testing) cases the actual code size difference is a wash.
></theory>
>
>Go with what is most natural for expressing the algorithm.  And ease
>the future maintenance.
>
>Stuart

Agreed.  The code size differences would compiler dependent and 
minimal in any case.  Human readability should determine the choice.


Mike


From amjad11 at gmail.com  Thu Nov 12 07:02:38 2009
From: amjad11 at gmail.com (amjad ali)
Date: Thu, 12 Nov 2009 09:02:38 -0600
Subject: [Beowulf] array shape difference
In-Reply-To: <6.2.5.6.2.20091112074734.0a4a2fe8@NumerEx-LLC.com>
References: <428810f20911101730t45284baalcc0b5824091d8d7c@mail.gmail.com>
	<77b0285f0911101743x46cbd77bl147b63efc803042c@mail.gmail.com>
	<6.2.5.6.2.20091111065622.0a466228@NumerEx-LLC.com>
	<e4d4fd070911111440k5736f810rabd2c2db0552771f@mail.gmail.com>
	<6.2.5.6.2.20091112051712.067d6e40@NumerEx-LLC.com>
	<alpine.BSF.2.00.0911120816580.2256@freeman.4gh.net>
	<6.2.5.6.2.20091112074734.0a4a2fe8@NumerEx-LLC.com>
Message-ID: <428810f20911120702n47b91b8cn3ea38b055e1b244e@mail.gmail.com>

Hi all and Thanks you all.

It is making quite good sense.

On Thu, Nov 12, 2009 at 8:50 AM, Michael H. Frese <
Michael.Frese at numerex-llc.com> wrote:

> At 06:26 AM 11/12/2009, Stuart Barkley wrote:
>
>> At 03:40 PM 11/11/2009, Peter St. John wrote:
>> >       The difference between:
>> >       array1(1:60000)
>> >       array2(1:2, 1:30000)
>> >
>> >       would be reflected in the size of the executable, not the size
>> >       of the data.
>> >       Right?
>>
>> On Thu, 12 Nov 2009 at 07:18 -0000, Michael H. Frese wrote:
>>
>> > That's correct.  The executable size would reflect the extra operations
>> > required to compute the offset for the doubly dimensioned array.
>>
>> Or maybe not.
>>
>> <theory>
>> If the fortran code is doing virtual subscripts (e.g. array2(i*2 + j))
>> it would likely generate about the same code as the compiler would
>> generate for 2 dimensions.  In theory, the compiler can generate
>> better subscript computation but I suspect in most reasonable (or
>> simple testing) cases the actual code size difference is a wash.
>> </theory>
>>
>> Go with what is most natural for expressing the algorithm.  And ease
>> the future maintenance.
>>
>> Stuart
>>
>
> Agreed.  The code size differences would compiler dependent and minimal in
> any case.  Human readability should determine the choice.
>
>
>
> Mike
>
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091112/b7577324/attachment.html>

From rpnabar at gmail.com  Thu Nov 12 16:21:59 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Thu, 12 Nov 2009 18:21:59 -0600
Subject: [Beowulf] bizarre scaling behavior on a Nehalem
In-Reply-To: <b1335fe90911101031y46208314m81da924d4edf0e6c@mail.gmail.com>
References: <c4d69730908100941p482606a8r91a283d4aca95aee@mail.gmail.com>
	<a8d96dec0908101048h6ecdda91h40f179725b6c6fd2@mail.gmail.com>
	<c4d69730908101528r4695a7c8nda3857afbc48530c@mail.gmail.com>
	<4A81ACF3.60802@noaa.gov>
	<c4d69730908111157q4e10abdcka6b91b1fbcfe564b@mail.gmail.com>
	<b1335fe90911101031y46208314m81da924d4edf0e6c@mail.gmail.com>
Message-ID: <c4d69730911121621x48be9a51i470678c51dba1334@mail.gmail.com>

On Tue, Nov 10, 2009 at 12:31 PM, Tiago Marques <a28427 at ua.pt> wrote:
> Hi all,
>
> Enjoy the performance increase, if you haven't still. To us it
> increased around 33% in conjunction with running 8 CPUs. It seems to
> me that groups may be useful to run with more nodes and not just one
> machine but I haven't had the chance to test that out.

THis is very interesting and promising Tiago. I still have not solved
my VASP scaling woes. I am going to give your fix a shot now.

-- 
Rahul


From lindahl at pbm.com  Thu Nov 12 16:46:31 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Thu, 12 Nov 2009 16:46:31 -0800
Subject: [Beowulf] array shape difference
In-Reply-To: <alpine.BSF.2.00.0911120816580.2256@freeman.4gh.net>
References: <428810f20911101730t45284baalcc0b5824091d8d7c@mail.gmail.com>
	<77b0285f0911101743x46cbd77bl147b63efc803042c@mail.gmail.com>
	<6.2.5.6.2.20091111065622.0a466228@NumerEx-LLC.com>
	<e4d4fd070911111440k5736f810rabd2c2db0552771f@mail.gmail.com>
	<6.2.5.6.2.20091112051712.067d6e40@NumerEx-LLC.com>
	<alpine.BSF.2.00.0911120816580.2256@freeman.4gh.net>
Message-ID: <20091113004631.GE11974@bx9.net>

On Thu, Nov 12, 2009 at 08:26:13AM -0500, Stuart Barkley wrote:

> <theory>
> If the fortran code is doing virtual subscripts (e.g. array2(i*2 + j))
> it would likely generate about the same code as the compiler would
> generate for 2 dimensions.  In theory, the compiler can generate
> better subscript computation but I suspect in most reasonable (or
> simple testing) cases the actual code size difference is a wash.
> </theory>

Putting my "I used to work near a compiler group" hat on, I suspect a
good compiler guy would tell you that they've worked hard to make sure
both methods generate the same code for address computation. Strength
reduction and the like are elementary optimizations these days.

However, there is an issue that the compiler may have a better idea of
the dimensions of the 2-dimensional array at compile time, leading to
better optimization. That has nothing to do with the address
computations, but everything to do with loop fusion, splitting,
unrolling, pipelining, SIMDizing, cache effects, etc.

-- greg


From rpnabar at gmail.com  Thu Nov 12 16:47:42 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Thu, 12 Nov 2009 18:47:42 -0600
Subject: [Beowulf] UEFI (Unified Extensible Firmware Interface) for the BIOS
Message-ID: <c4d69730911121647m2f538e04v98780934f6b72cbd@mail.gmail.com>

Has anyone tried out UEFI (Unified Extensible Firmware Interface) in
the BIOS? The new servers I am buying come with this option in the
BIOS. Out of curiosity I googled it up.

I am not sure if there were any HPC implications of this and wanted to
double check before I switched to this from my conventional
plain-vanilla BIOS. Any sort of "industry standard" always sounds good
but I thought it safer to check on the group first....

Any advice or pitfalls?

-- 
Rahul


From christiansuhendra at gmail.com  Thu Nov 12 04:33:22 2009
From: christiansuhendra at gmail.com (christian suhendra)
Date: Thu, 12 Nov 2009 04:33:22 -0800
Subject: [Beowulf] ask about mpich
Message-ID: <c761caee0911120433u2a9f27b4u435872577bc9dbee@mail.gmail.com>

halo guys i wants to make a cluster system with mpich in ubuntu,,but i have
troubleshooting with mpich..
but when i run the example program in mpich..it doesn't work in cluster..but
i've registered the node on machine.LINUX..
but still not working
please help me..this is my thesis...
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091112/ce251fc0/attachment.html>

From lm.moreira at gmail.com  Thu Nov 12 14:01:17 2009
From: lm.moreira at gmail.com (Leonardo Machado Moreira)
Date: Thu, 12 Nov 2009 20:01:17 -0200
Subject: [Beowulf] Cluster of Linux and Windows
Message-ID: <4788ffe70911121401mcd5deaj52b9036197b4f619@mail.gmail.com>

Hi, I am new on Clusters and have some doubts about them.

I am used to work with Arch Linux. What do you think about it?

And finnaly, I would like to know if Is it possible to get a Cluster Working
with a Server on Arch Linux and the nodes Windows.

Or even better the nodes without a defined SO.

What do you think?, Does it worth?

Thanks in advance.

Leonardo Machado Moreira.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091112/5af873fa/attachment.html>

From john.hearns at mclaren.com  Fri Nov 13 01:19:35 2009
From: john.hearns at mclaren.com (Hearns, John)
Date: Fri, 13 Nov 2009 09:19:35 -0000
Subject: [Beowulf] UEFI (Unified Extensible Firmware Interface) for the
	BIOS
In-Reply-To: <c4d69730911121647m2f538e04v98780934f6b72cbd@mail.gmail.com>
References: <c4d69730911121647m2f538e04v98780934f6b72cbd@mail.gmail.com>
Message-ID: <68A57CCFD4005646957BD2D18E60667B0E3C2942@milexchmb1.mil.tagmclarengroup.com>


Has anyone tried out UEFI (Unified Extensible Firmware Interface) in
the BIOS? The new servers I am buying come with this option in the
BIOS. 


Not specifically UEFI, but EFI is the standard on Itanium systems, so
has been in use
for a long time. I use it every day. 

The contents of this email are confidential and for the exclusive use of the intended recipient.  If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy.


From coutinho at dcc.ufmg.br  Fri Nov 13 05:20:49 2009
From: coutinho at dcc.ufmg.br (Bruno Coutinho)
Date: Fri, 13 Nov 2009 11:20:49 -0200
Subject: [Beowulf] Cluster of Linux and Windows
In-Reply-To: <4788ffe70911121401mcd5deaj52b9036197b4f619@mail.gmail.com>
References: <4788ffe70911121401mcd5deaj52b9036197b4f619@mail.gmail.com>
Message-ID: <a8d96dec0911130520l8bcaab2q74399f3b48ca98a9@mail.gmail.com>

2009/11/12 Leonardo Machado Moreira <lm.moreira at gmail.com>

> Hi, I am new on Clusters and have some doubts about them.
>
> I am used to work with Arch Linux. What do you think about it?
>
> And finnaly, I would like to know if Is it possible to get a Cluster
> Working with a Server on Arch Linux and the nodes Windows.
>
> Or even better the nodes without a defined SO.
>

You should have same MPI implementation on all machines (despite windows
adds some network overhead) and choose a method to launch process in slave
machines (ssh , mpd, etc).


>
> What do you think?, Does it worth?
>
> Thanks in advance.
>
> Leonardo Machado Moreira.
>
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091113/bd895fc3/attachment.html>

From hahn at mcmaster.ca  Fri Nov 13 09:29:35 2009
From: hahn at mcmaster.ca (Mark Hahn)
Date: Fri, 13 Nov 2009 12:29:35 -0500 (EST)
Subject: [Beowulf] Cluster of Linux and Windows
In-Reply-To: <4788ffe70911121401mcd5deaj52b9036197b4f619@mail.gmail.com>
References: <4788ffe70911121401mcd5deaj52b9036197b4f619@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0911131222400.23373@coffee.psychology.mcmaster.ca>

> I am used to work with Arch Linux. What do you think about it?

the distro is basically irrelevant.  clustering is just a matter of 
your apps, middleware like mpi (may or may not be provided by the 
cluster), probably a shared filesystem, working kernel, network stack,
job-launch mechanism.  distros are mainly about desktop gunk that 
is completely irrelevant to clusters.

> And finnaly, I would like to know if Is it possible to get a Cluster Working
> with a Server on Arch Linux and the nodes Windows.

sure, but why?  windows is generally inferior as an OS platform,
so I would stay away unless you actually require your apps to run
under windows.  (remember that linux can use windows storage and
authentication just fine.)

> Or even better the nodes without a defined SO.

SO=Significant Other?  oh, maybe "OS".  generally, you want to minimize 
the number of things that can go wrong in your system.  using uniform OS
on nodes/servers is a good start.  but sure, there's no reason you can't
run a cluster where every node is a different OS.  they simply need to 
agree on the network protocol (which doesn't have to be MPI - in fact,
using something more SOA-like might help if the nodes are heterogenous)


From amjad11 at gmail.com  Sat Nov 14 06:47:07 2009
From: amjad11 at gmail.com (amjad ali)
Date: Sat, 14 Nov 2009 09:47:07 -0500
Subject: [Beowulf] Array Declaration approach difference
Message-ID: <428810f20911140647g3af84a6dg78a2ad9d7399ad3e@mail.gmail.com>

Hi All.

I have parallel PDE/CFD code in fortran.
Let we consider it consisting of two parts:

1) Startup part; that  includes input reads, splits, distributions, forming
neighborhood information arrays, grid arrays, and all related. It includes
most of the necessary array declarations.

2) Iterative part; we proceed the solution in time.


Approach One:
============
What I do is that during the Startup phase, I declare the most array
allocatable and then allocate them sizes depending upon the input reads and
domain partitioning. And then In the iterative phase I utilize those arrays.
But I "do not" allocate/deallocate new arrays in the iterative part.


Approach Two:
============
I think that,  what if I first use to run only the start -up phase of my
parallel code having allocatable like things and get the sizes-values
required for array allocations for a specific problem size and partitioning.
Then I use these values as contant in another version of my code in which I
will declare array with the contant values obtained.


So my question is that will there be any significant performance/efficiency
diffrence in the "ITERATIVE part" if the approch two is used (having arrays
declared fixed sizes/values)?


Thank You for your kind attention.

with best regards,
Amjad Ali.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091114/9721f7d1/attachment.html>

From siegert at sfu.ca  Sat Nov 14 16:43:27 2009
From: siegert at sfu.ca (Martin Siegert)
Date: Sat, 14 Nov 2009 16:43:27 -0800
Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes
Message-ID: <20091115004327.GA12781@stikine.its.sfu.ca>

Hi,

I am running into problems when sending large messages (about
180000000 doubles) over IB. A fairly trivial example program is attached.

# mpicc -g sendrecv.c
# mpiexec -machinefile m2 -n 2 ./a.out
id=1: calling irecv ...
id=0: calling isend ...
[[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 199132400 opcode 549755813  vendor error 105 qp_idx 3

This is with OpenMPI-1.3.3.
Does anybody know a solution to this problem?

If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs
and never returns.
I asked on the openmpi users list but got no response ...

Cheers,
Martin

-- 
Martin Siegert
Head, Research Computing
WestGrid Site Lead
IT Services                                phone: 778 782-4691
Simon Fraser University                    fax:   778 782-4242
Burnaby, British Columbia                  email: siegert at sfu.ca
Canada  V5A 1S6
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sendrecv.c
Type: text/x-c++src
Size: 1054 bytes
Desc: not available
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091114/8a57ca20/attachment.c>

From hahn at mcmaster.ca  Sun Nov 15 12:38:08 2009
From: hahn at mcmaster.ca (Mark Hahn)
Date: Sun, 15 Nov 2009 15:38:08 -0500 (EST)
Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes
In-Reply-To: <20091115004327.GA12781@stikine.its.sfu.ca>
References: <20091115004327.GA12781@stikine.its.sfu.ca>
Message-ID: <Pine.LNX.4.64.0911151535410.20182@coffee.psychology.mcmaster.ca>

> I am running into problems when sending large messages (about
> 180000000 doubles) over IB. A fairly trivial example program is attached.

sorry if you've already thought of this, but might you have RLIMIT_MEMLOCK
set too low?  (ulimit -l)

> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 199132400 opcode 549755813  vendor error 105 qp_idx 3

105 looks like it might be an errno to me:
#define ENOBUFS         105     /* No buffer space available */

regards, mark.


From mdidomenico4 at gmail.com  Sun Nov 15 14:29:13 2009
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Sun, 15 Nov 2009 14:29:13 -0800
Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes
In-Reply-To: <20091115004327.GA12781@stikine.its.sfu.ca>
References: <20091115004327.GA12781@stikine.its.sfu.ca>
Message-ID: <e75d22a90911151429k65154971xf0af97f6d2d50bda@mail.gmail.com>

you might want to ask on the linux-rdma list (was openfabrics).  its
been awhile since i looked at IB error messages, but what
stack/version are you running?

On Sat, Nov 14, 2009 at 4:43 PM, Martin Siegert <siegert at sfu.ca> wrote:
> Hi,
>
> I am running into problems when sending large messages (about
> 180000000 doubles) over IB. A fairly trivial example program is attached.
>
> # mpicc -g sendrecv.c
> # mpiexec -machinefile m2 -n 2 ./a.out
> id=1: calling irecv ...
> id=0: calling isend ...
> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 199132400 opcode 549755813 ?vendor error 105 qp_idx 3
>
> This is with OpenMPI-1.3.3.
> Does anybody know a solution to this problem?
>
> If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs
> and never returns.
> I asked on the openmpi users list but got no response ...
>
> Cheers,
> Martin
>
> --
> Martin Siegert
> Head, Research Computing
> WestGrid Site Lead
> IT Services ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?phone: 778 782-4691
> Simon Fraser University ? ? ? ? ? ? ? ? ? ?fax: ? 778 782-4242
> Burnaby, British Columbia ? ? ? ? ? ? ? ? ?email: siegert at sfu.ca
> Canada ?V5A 1S6
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
>


From lm.moreira at gmail.com  Fri Nov 13 05:34:13 2009
From: lm.moreira at gmail.com (Leonardo Machado Moreira)
Date: Fri, 13 Nov 2009 11:34:13 -0200
Subject: [Beowulf] Cluster of Linux and Windows
In-Reply-To: <a8d96dec0911130520l8bcaab2q74399f3b48ca98a9@mail.gmail.com>
References: <4788ffe70911121401mcd5deaj52b9036197b4f619@mail.gmail.com> 
	<a8d96dec0911130520l8bcaab2q74399f3b48ca98a9@mail.gmail.com>
Message-ID: <4788ffe70911130534x1ffc7bbdna3ce1adddf54c7c@mail.gmail.com>

Basicaly, Is a Cluster Implementation just based on these two libraries MPI
on the Server and SSH on the clients??

And a program on tcl/tk for example on server to watch the cluster?

Thanks a lot.

Leonardo Machado Moreira


On Fri, Nov 13, 2009 at 11:20 AM, Bruno Coutinho <coutinho at dcc.ufmg.br>wrote:

>
>
> 2009/11/12 Leonardo Machado Moreira <lm.moreira at gmail.com>
>
>> Hi, I am new on Clusters and have some doubts about them.
>>
>> I am used to work with Arch Linux. What do you think about it?
>>
>> And finnaly, I would like to know if Is it possible to get a Cluster
>> Working with a Server on Arch Linux and the nodes Windows.
>>
>> Or even better the nodes without a defined SO.
>>
>
> You should have same MPI implementation on all machines (despite windows
> adds some network overhead) and choose a method to launch process in slave
> machines (ssh , mpd, etc).
>
>
>>
>> What do you think?, Does it worth?
>>
>> Thanks in advance.
>>
>> Leonardo Machado Moreira.
>>
>>
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091113/15ca147e/attachment.html>

From zenabdin1988 at hotmail.com  Sat Nov 14 04:24:16 2009
From: zenabdin1988 at hotmail.com (Zain elabedin hammade)
Date: Sat, 14 Nov 2009 12:24:16 +0000
Subject: [Beowulf] mpd ..failed ..!
Message-ID: <SNT125-W16C4BBD8FAA5AA6A496BD8D6A70@phx.gbl>


Hello All.
I have a cluster with 4 machines (fedora core 11).
I installed mpich2 - 1.1.1-1.fc11.i586.rpm .
I wrote on every machine :

mpd &
mpdtrace -l

then i wrote  on thr Master :

mpd -h Worker1.cluster.net - p 56128 -n 

I got :
Master.cluster.net_38047 (connect_lhs 944): NOT OK to enter ring; one likely cause: mismatched secretwords
Master.cluster.net_38047 (enter_ring 873): lhs connect failed
Master.cluster.net_38047 (run 256): failed to enter ring

And the same was for other machines : Worker2 and Worker3 .
For information :
I have SSH works on ..

So where is the problem ?
What i have to do ?
I really need your help .

Regarded .
 		 	   		  
_________________________________________________________________
Windows Live Hotmail: Your friends can get your Facebook updates, right from Hotmail?.
http://www.microsoft.com/middleeast/windows/windowslive/see-it-in-action/social-network-basics.aspx?ocid=PID23461::T:WLMTAGL:ON:WL:en-xm:SI_SB_4:092009
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091114/18f8ef12/attachment.html>

From becker at scyld.com  Mon Nov 16 01:40:55 2009
From: becker at scyld.com (Donald Becker)
Date: Mon, 16 Nov 2009 01:40:55 -0800 (PST)
Subject: [Beowulf] Final Announcement: 11th Annual Beowulf Bash 9pm Nov 16
	2009
Message-ID: <Pine.LNX.4.44.0911160137410.6955-100000@bluewest.scyld.com>


Final Announcement: 11th Annual Beowulf Bash   9pm Nov 16 2009


  11th Annual Beowulf Bash
        And
      LECCIBG

    9pm  November 16 2009
  The Game, at the Rose Quarter

http://www.xandmarketing.com/beobash09/

It will take place, as usual, with the IEEE SC Conference.

Continuing with recent tradition, we holding the Beowulf Bash
Monday evening just after the Opening Gala.

As in previous years, the primary attraction is the conversations with
other attendees.  We will supplement this with musical entertainment.
We will have drinks and snacks, along with a few give-aways

There will be a short greeting by the sponsors about 10pm
Try to be there by then.


Again:

Monday, November 16 2009  9-11pm  (Immediately after the SC09 Opening 
Gala)
The Game, at the Rose Quarter (Close to the Convention Center)


-- 
Donald Becker				becker at scyld.com
Penguin Computing / Scyld Software
www.penguincomputing.com		www.scyld.com
Annapolis MD and San Francisco CA


From atchley at myri.com  Mon Nov 16 05:04:06 2009
From: atchley at myri.com (Scott Atchley)
Date: Mon, 16 Nov 2009 08:04:06 -0500
Subject: [Beowulf] mpd ..failed ..!
In-Reply-To: <SNT125-W16C4BBD8FAA5AA6A496BD8D6A70@phx.gbl>
References: <SNT125-W16C4BBD8FAA5AA6A496BD8D6A70@phx.gbl>
Message-ID: <9ABA6320-A238-49E7-B71E-C1D4D6D05391@myri.com>

On Nov 14, 2009, at 7:24 AM, Zain elabedin hammade wrote:

> I installed mpich2 - 1.1.1-1.fc11.i586.rpm .

You should ask this on the mpich list at:

https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

> I wrote on every machine :
>
> mpd &
> mpdtrace -l

You started stand-alone MPD rings of size one on each host. This is  
incorrect. You should use mpdboot and a machine file.

$ mpdboot -f machinefile -n <num_hosts> ...

Scott


From Michael.Frese at NumerEx-LLC.com  Mon Nov 16 09:49:23 2009
From: Michael.Frese at NumerEx-LLC.com (Michael H. Frese)
Date: Mon, 16 Nov 2009 10:49:23 -0700
Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes
In-Reply-To: <20091115004327.GA12781@stikine.its.sfu.ca>
References: <20091115004327.GA12781@stikine.its.sfu.ca>
Message-ID: <6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com>

Martin,

Could it be that your MPI library was compiled using a small memory 
model?  The 180 million doubles sounds suspiciously close to a 2 GB 
addressing limit.

This issue came up on the list recently under the topic "Fortran 
Array size question."


Mike

At 05:43 PM 11/14/2009, Martin Siegert wrote:
>Hi,
>
>I am running into problems when sending large messages (about
>180000000 doubles) over IB. A fairly trivial example program is attached.
>
># mpicc -g sendrecv.c
># mpiexec -machinefile m2 -n 2 ./a.out
>id=1: calling irecv ...
>id=0: calling isend ...
>[[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 
>error polling LP CQ with status LOCAL LENGTH ERROR status number 1 
>for wr_id 199132400 opcode 549755813  vendor error 105 qp_idx 3
>
>This is with OpenMPI-1.3.3.
>Does anybody know a solution to this problem?
>
>If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs
>and never returns.
>I asked on the openmpi users list but got no response ...
>
>Cheers,
>Martin
>
>--
>Martin Siegert
>Head, Research Computing
>WestGrid Site Lead
>IT Services                                phone: 778 782-4691
>Simon Fraser University                    fax:   778 782-4242
>Burnaby, British Columbia                  email: siegert at sfu.ca
>Canada  V5A 1S6
>
>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>To change your subscription (digest mode or unsubscribe) visit 
>http://www.beowulf.org/mailman/listinfo/beowulf


From siegert at sfu.ca  Mon Nov 16 12:56:21 2009
From: siegert at sfu.ca (Martin Siegert)
Date: Mon, 16 Nov 2009 12:56:21 -0800
Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes
In-Reply-To: <6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com>
References: <20091115004327.GA12781@stikine.its.sfu.ca>
	<6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com>
Message-ID: <20091116205621.GB21826@stikine.its.sfu.ca>

Hi Michael,

On Mon, Nov 16, 2009 at 10:49:23AM -0700, Michael H. Frese wrote:
> Martin,
>
> Could it be that your MPI library was compiled using a small memory model?  
> The 180 million doubles sounds suspiciously close to a 2 GB addressing 
> limit.
>
> This issue came up on the list recently under the topic "Fortran Array size 
> question."
>
>
> Mike

I am running MPI applications that use more than 16GB of memory - 
I do not believe that this is the problem. Also -mmodel=large
does not appear to be a valid argument for gcc under x86_64:
gcc -DNDEBUG -g -fPIC -mmodel=large   conftest.c  >&5
cc1: error: unrecognized command line option "-mmodel=large"

- Martin

> At 05:43 PM 11/14/2009, Martin Siegert wrote:
>> Hi,
>>
>> I am running into problems when sending large messages (about
>> 180000000 doubles) over IB. A fairly trivial example program is attached.
>>
>> # mpicc -g sendrecv.c
>> # mpiexec -machinefile m2 -n 2 ./a.out
>> id=1: calling irecv ...
>> id=0: calling isend ...
>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error 
>> polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 
>> 199132400 opcode 549755813  vendor error 105 qp_idx 3
>>
>> This is with OpenMPI-1.3.3.
>> Does anybody know a solution to this problem?
>>
>> If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs
>> and never returns.
>> I asked on the openmpi users list but got no response ...
>>
>> Cheers,
>> Martin
>>
>> --
>> Martin Siegert
>> Head, Research Computing
>> WestGrid Site Lead
>> IT Services                                phone: 778 782-4691
>> Simon Fraser University                    fax:   778 782-4242
>> Burnaby, British Columbia                  email: siegert at sfu.ca
>> Canada  V5A 1S6


From siegert at sfu.ca  Mon Nov 16 13:01:02 2009
From: siegert at sfu.ca (Martin Siegert)
Date: Mon, 16 Nov 2009 13:01:02 -0800
Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes
In-Reply-To: <e75d22a90911151429k65154971xf0af97f6d2d50bda@mail.gmail.com>
References: <20091115004327.GA12781@stikine.its.sfu.ca>
	<e75d22a90911151429k65154971xf0af97f6d2d50bda@mail.gmail.com>
Message-ID: <20091116210102.GC21826@stikine.its.sfu.ca>

On Sun, Nov 15, 2009 at 02:29:13PM -0800, Michael Di Domenico wrote:
> you might want to ask on the linux-rdma list (was openfabrics).  its
> been awhile since i looked at IB error messages, but what
> stack/version are you running?

This is under Scientific Linux 5.3 which is a RH 5.3 clone that comes
with OFED-1.3.2, which admittedly is quite old. Unfortunately,
upgrading this is a major forklift ... thus I must be sure that this is
really the problem. I'll do a few tests on a couple of nodes ...

Thanks!

- Martin

> On Sat, Nov 14, 2009 at 4:43 PM, Martin Siegert <siegert at sfu.ca> wrote:
> > Hi,
> >
> > I am running into problems when sending large messages (about
> > 180000000 doubles) over IB. A fairly trivial example program is attached.
> >
> > # mpicc -g sendrecv.c
> > # mpiexec -machinefile m2 -n 2 ./a.out
> > id=1: calling irecv ...
> > id=0: calling isend ...
> > [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 199132400 opcode 549755813 ?vendor error 105 qp_idx 3
> >
> > This is with OpenMPI-1.3.3.
> > Does anybody know a solution to this problem?
> >
> > If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs
> > and never returns.
> > I asked on the openmpi users list but got no response ...
> >
> > Cheers,
> > Martin
> >
> > --
> > Martin Siegert
> > Head, Research Computing
> > WestGrid Site Lead
> > IT Services ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?phone: 778 782-4691
> > Simon Fraser University ? ? ? ? ? ? ? ? ? ?fax: ? 778 782-4242
> > Burnaby, British Columbia ? ? ? ? ? ? ? ? ?email: siegert at sfu.ca
> > Canada ?V5A 1S6


From siegert at sfu.ca  Mon Nov 16 13:24:50 2009
From: siegert at sfu.ca (Martin Siegert)
Date: Mon, 16 Nov 2009 13:24:50 -0800
Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes
In-Reply-To: <Pine.LNX.4.64.0911151535410.20182@coffee.psychology.mcmaster.ca>
References: <20091115004327.GA12781@stikine.its.sfu.ca>
	<Pine.LNX.4.64.0911151535410.20182@coffee.psychology.mcmaster.ca>
Message-ID: <20091116212450.GD21826@stikine.its.sfu.ca>

Hi Mark,

On Sun, Nov 15, 2009 at 03:38:08PM -0500, Mark Hahn wrote:
>> I am running into problems when sending large messages (about
>> 180000000 doubles) over IB. A fairly trivial example program is attached.
>
> sorry if you've already thought of this, but might you have RLIMIT_MEMLOCK
> set too low?  (ulimit -l)

Good point.
By now I have played with all kinds of ulimits (the nodes have 16GB
of memory and 16GB of swap space - this program is not even coming close
to those limits). This is the current setting:
# ulimit -a
core file size          (blocks, -c) 0                            
data seg size           (kbytes, -d) unlimited                    
scheduling priority             (-e) 0                            
file size               (blocks, -f) unlimited                    
pending signals                 (-i) 139264
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) unlimited
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 139264
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

... same error :-(

>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 199132400 opcode 549755813  vendor error 105 qp_idx 3
>
> 105 looks like it might be an errno to me:
> #define ENOBUFS         105     /* No buffer space available */
>
> regards, mark.

BTW: when using Intel-MPI (MPICH2) the program segfaults with
l = 26843546 = 2^31/8 which makes me suspect that they use MPI_Byte to
transfer the data internally and multiply the variable count by 8
without checking whether the integer overflows ...

- Martin


From gus at ldeo.columbia.edu  Mon Nov 16 13:55:51 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Mon, 16 Nov 2009 16:55:51 -0500
Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes
In-Reply-To: <20091116205621.GB21826@stikine.its.sfu.ca>
References: <20091115004327.GA12781@stikine.its.sfu.ca>	<6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com>
	<20091116205621.GB21826@stikine.its.sfu.ca>
Message-ID: <4B01CA67.8020303@ldeo.columbia.edu>

Hi Martin

We didn't know which compiler you used.
So what Michael sent you ("mmodel=memory_model")
is the Intel compiler flag syntax.
(PGI uses the same syntax, IIRR.)

Gcc/gfortran use "-mcmodel=memory_model" for x86_64 architecture.
I only used this with Intel ifort, hence I am not sure,
but "medium" should work fine for large data/not-so-large program
in gcc/gfortran.
The "large" model doesn't seem to be implemented by gcc (4.1.2)
anyway.
(Maybe it is there in newer gcc versions.)
The darn thing is that gcc says "medium" doesn't support building
shared libraries,
hence you may need to build OpenMPI static libraries instead,
I would guess.
(Again, check this if you have a newer gcc version.)
Here's an excerpt of my gcc (4.1.2) man page:


        -mcmodel=small
             Generate code for the small code model: the program and its 
symbols must be linked in the lower 2 GB of the address space.  Pointers 
are 64 bits.  Pro-
            grams can be statically or dynamically linked.  This is the 
default code model.

        -mcmodel=kernel
            Generate code for the kernel code model.  The kernel runs in 
the negative 2 GB of the address space.  This model has to be used for 
Linux kernel code.

        -mcmodel=medium
            Generate code for the medium model: The program is linked in 
the lower 2 GB of the address space but symbols can be located anywhere 
in the address
            space.  Programs can be statically or dynamically linked, 
but building of shared libraries are not supported with the medium model.

        -mcmodel=large
            Generate code for the large model: This model makes no 
assumptions about addresses and sizes of sections.  Currently GCC does 
not implement this model.


If you are using OpenMPI, "ompi-info -config"
will tell the flags used to compile it.
Mine is 1.3.2 and has no explicit mcmodel flag,
which according to the gcc man page should default to "small".

Are you using 16GB per process or for the whole set of processes?

I hope this helps,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Martin Siegert wrote:
> Hi Michael,
> 
> On Mon, Nov 16, 2009 at 10:49:23AM -0700, Michael H. Frese wrote:
>> Martin,
>>
>> Could it be that your MPI library was compiled using a small memory model?  
>> The 180 million doubles sounds suspiciously close to a 2 GB addressing 
>> limit.
>>
>> This issue came up on the list recently under the topic "Fortran Array size 
>> question."
>>
>>
>> Mike
> 
> I am running MPI applications that use more than 16GB of memory - 
> I do not believe that this is the problem. Also -mmodel=large
> does not appear to be a valid argument for gcc under x86_64:
> gcc -DNDEBUG -g -fPIC -mmodel=large   conftest.c  >&5
> cc1: error: unrecognized command line option "-mmodel=large"
> 
> - Martin
> 
>> At 05:43 PM 11/14/2009, Martin Siegert wrote:
>>> Hi,
>>>
>>> I am running into problems when sending large messages (about
>>> 180000000 doubles) over IB. A fairly trivial example program is attached.
>>>
>>> # mpicc -g sendrecv.c
>>> # mpiexec -machinefile m2 -n 2 ./a.out
>>> id=1: calling irecv ...
>>> id=0: calling isend ...
>>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error 
>>> polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 
>>> 199132400 opcode 549755813  vendor error 105 qp_idx 3
>>>
>>> This is with OpenMPI-1.3.3.
>>> Does anybody know a solution to this problem?
>>>
>>> If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs
>>> and never returns.
>>> I asked on the openmpi users list but got no response ...
>>>
>>> Cheers,
>>> Martin
>>>
>>> --
>>> Martin Siegert
>>> Head, Research Computing
>>> WestGrid Site Lead
>>> IT Services                                phone: 778 782-4691
>>> Simon Fraser University                    fax:   778 782-4242
>>> Burnaby, British Columbia                  email: siegert at sfu.ca
>>> Canada  V5A 1S6
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


From hahn at mcmaster.ca  Mon Nov 16 13:58:25 2009
From: hahn at mcmaster.ca (Mark Hahn)
Date: Mon, 16 Nov 2009 16:58:25 -0500 (EST)
Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes
In-Reply-To: <20091116212450.GD21826@stikine.its.sfu.ca>
References: <20091115004327.GA12781@stikine.its.sfu.ca>
	<Pine.LNX.4.64.0911151535410.20182@coffee.psychology.mcmaster.ca>
	<20091116212450.GD21826@stikine.its.sfu.ca>
Message-ID: <Pine.LNX.4.64.0911161626130.31487@coffee.psychology.mcmaster.ca>

>>> I am running into problems when sending large messages (about
>>> 180000000 doubles) over IB. A fairly trivial example program is attached.
>>
>> sorry if you've already thought of this, but might you have RLIMIT_MEMLOCK
>> set too low?  (ulimit -l)
>
> Good point.
...
> max locked memory       (kbytes, -l) unlimited
...
> ... same error :-(

well, at this point, I'd consider running the test program under strace.


From djholm at fnal.gov  Mon Nov 16 14:24:27 2009
From: djholm at fnal.gov (Don Holmgren)
Date: Mon, 16 Nov 2009 16:24:27 -0600 (CST)
Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes
In-Reply-To: <20091116212450.GD21826@stikine.its.sfu.ca>
References: <20091115004327.GA12781@stikine.its.sfu.ca>
	<Pine.LNX.4.64.0911151535410.20182@coffee.psychology.mcmaster.ca>
	<20091116212450.GD21826@stikine.its.sfu.ca>
Message-ID: <Pine.LNX.4.62.0911161618340.19269@ccfsrv2.fnal.gov>


Be careful - ulimit's can differ between an interative shell launched with
rsh/ssh, an interactive batch shell launched with "qsub -I" and the like, the 
environment of your batch script, and the environment of the processes launched
via mpirun.  I've been burned by this before.

If you are using a TM-based launch, for example (openmpi or OSU mpiexec), the
ulimit environment on a PBS/Torque batch setup will be governed by the ulimits
of pbs_mom, which in turn is governed by your init process and/or by any of
the ulimit commands in init.d/pbs-client.

The only way to be sure of a particuar ulimit is to to a "get_rlimits()" call in 
your mpi-launched binary and check the size.

Chances are this isn't your problem, though, because usually the error messages
make it pretty clear that a memory lock failure has occurred.

Don Holmgren
Fermilab


On Mon, 16 Nov 2009, Martin Siegert wrote:

> Hi Mark,
>
> On Sun, Nov 15, 2009 at 03:38:08PM -0500, Mark Hahn wrote:
>>> I am running into problems when sending large messages (about
>>> 180000000 doubles) over IB. A fairly trivial example program is attached.
>>
>> sorry if you've already thought of this, but might you have RLIMIT_MEMLOCK
>> set too low?  (ulimit -l)
>
> Good point.
> By now I have played with all kinds of ulimits (the nodes have 16GB
> of memory and 16GB of swap space - this program is not even coming close
> to those limits). This is the current setting:
> # ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 139264
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) unlimited
> real-time priority              (-r) 0
> stack size              (kbytes, -s) unlimited
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 139264
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> ... same error :-(
>
>>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 199132400 opcode 549755813  vendor error 105 qp_idx 3
>>
>> 105 looks like it might be an errno to me:
>> #define ENOBUFS         105     /* No buffer space available */
>>
>> regards, mark.
>
> BTW: when using Intel-MPI (MPICH2) the program segfaults with
> l = 26843546 = 2^31/8 which makes me suspect that they use MPI_Byte to
> transfer the data internally and multiply the variable count by 8
> without checking whether the integer overflows ...
>
> - Martin


From siegert at sfu.ca  Mon Nov 16 15:27:57 2009
From: siegert at sfu.ca (Martin Siegert)
Date: Mon, 16 Nov 2009 15:27:57 -0800
Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes
In-Reply-To: <4B01CA67.8020303@ldeo.columbia.edu>
References: <20091115004327.GA12781@stikine.its.sfu.ca>
	<6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com>
	<20091116205621.GB21826@stikine.its.sfu.ca>
	<4B01CA67.8020303@ldeo.columbia.edu>
Message-ID: <20091116232757.GF21826@stikine.its.sfu.ca>

Hi,

On Mon, Nov 16, 2009 at 04:55:51PM -0500, Gus Correa wrote:
> Hi Martin
>
> We didn't know which compiler you used.
> So what Michael sent you ("mmodel=memory_model")
> is the Intel compiler flag syntax.
> (PGI uses the same syntax, IIRR.)

Now that was really stupid, I am using gcc-4.3.2 and even looked up
the correct syntax for the memory model, but nevertheless pasted the
Intel syntax into my configure script ... sorry.

> Gcc/gfortran use "-mcmodel=memory_model" for x86_64 architecture.
> I only used this with Intel ifort, hence I am not sure,
> but "medium" should work fine for large data/not-so-large program
> in gcc/gfortran.
> The "large" model doesn't seem to be implemented by gcc (4.1.2)
> anyway.
> (Maybe it is there in newer gcc versions.)
> The darn thing is that gcc says "medium" doesn't support building
> shared libraries,
> hence you may need to build OpenMPI static libraries instead,
> I would guess.
> (Again, check this if you have a newer gcc version.)
> Here's an excerpt of my gcc (4.1.2) man page:
>
>
>        -mcmodel=small
>             Generate code for the small code model: the program and its 
> symbols must be linked in the lower 2 GB of the address space.  Pointers 
> are 64 bits.  Pro-
>            grams can be statically or dynamically linked.  This is the 
> default code model.
>
>        -mcmodel=kernel
>            Generate code for the kernel code model.  The kernel runs in the 
> negative 2 GB of the address space.  This model has to be used for Linux 
> kernel code.
>
>        -mcmodel=medium
>            Generate code for the medium model: The program is linked in the 
> lower 2 GB of the address space but symbols can be located anywhere in the 
> address
>            space.  Programs can be statically or dynamically linked, but 
> building of shared libraries are not supported with the medium model.
>
>        -mcmodel=large
>            Generate code for the large model: This model makes no 
> assumptions about addresses and sizes of sections.  Currently GCC does not 
> implement this model.

I recompiled openmpi with -mcmodel=medium and -mcmodel=large. The program
still fails. The error message changes, however:

id=1: calling irecv ...
id=0: calling isend ...
mlx4: local QP operation err (QPN 340052, WQE index 0, vendor syndrome 70, opcode = 5e)
[[55365,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 282498416 opcode 11046  vendor error 112 qp_idx 3

(strerror(112) is "Host is down", which is certainly not correct).
This now points to system libraries - libmlx4. Am I correct in assuming that
this is either an OFED problem or OpenMPI exceeding some buffers in OFED
libraries without checking?

> If you are using OpenMPI, "ompi-info -config"
> will tell the flags used to compile it.
> Mine is 1.3.2 and has no explicit mcmodel flag,
> which according to the gcc man page should default to "small".

Are you - in fact, is anybody - able to run my test program? I am
hoping that there is some stupid misconfiguration on the cluster
that can be fixed easily, without reinstalling/recompiling all
apps ...

> Are you using 16GB per process or for the whole set of processes?

I am running the two processes on different nodes (and nothing else
on the nodes), thus each process has the full 16GB available.
>
> I hope this helps,
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------

Thanks!

- Martin

> Martin Siegert wrote:
>> Hi Michael,
>>
>> On Mon, Nov 16, 2009 at 10:49:23AM -0700, Michael H. Frese wrote:
>>> Martin,
>>>
>>> Could it be that your MPI library was compiled using a small memory 
>>> model?  The 180 million doubles sounds suspiciously close to a 2 GB 
>>> addressing limit.
>>>
>>> This issue came up on the list recently under the topic "Fortran Array 
>>> size question."
>>>
>>>
>>> Mike
>>
>> I am running MPI applications that use more than 16GB of memory - I do not 
>> believe that this is the problem. Also -mmodel=large
>> does not appear to be a valid argument for gcc under x86_64:
>> gcc -DNDEBUG -g -fPIC -mmodel=large   conftest.c  >&5
>> cc1: error: unrecognized command line option "-mmodel=large"
>>
>> - Martin
>>
>>> At 05:43 PM 11/14/2009, Martin Siegert wrote:
>>>> Hi,
>>>>
>>>> I am running into problems when sending large messages (about
>>>> 180000000 doubles) over IB. A fairly trivial example program is attached.
>>>>
>>>> # mpicc -g sendrecv.c
>>>> # mpiexec -machinefile m2 -n 2 ./a.out
>>>> id=1: calling irecv ...
>>>> id=0: calling isend ...
>>>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 
>>>> error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for 
>>>> wr_id 199132400 opcode 549755813  vendor error 105 qp_idx 3
>>>>
>>>> This is with OpenMPI-1.3.3.
>>>> Does anybody know a solution to this problem?
>>>>
>>>> If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs
>>>> and never returns.
>>>> I asked on the openmpi users list but got no response ...
>>>>
>>>> Cheers,
>>>> Martin
>>>>
>>>> --
>>>> Martin Siegert
>>>> Head, Research Computing
>>>> WestGrid Site Lead
>>>> IT Services                                phone: 778 782-4691
>>>> Simon Fraser University                    fax:   778 782-4242
>>>> Burnaby, British Columbia                  email: siegert at sfu.ca
>>>> Canada  V5A 1S6
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- 
Martin Siegert
Head, Research Computing
WestGrid Site Lead
IT Services                                phone: 778 782-4691
Simon Fraser University                    fax:   778 782-4242
Burnaby, British Columbia                  email: siegert at sfu.ca
Canada  V5A 1S6


From lindahl at pbm.com  Mon Nov 16 17:20:48 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Mon, 16 Nov 2009 17:20:48 -0800
Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes
In-Reply-To: <6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com>
References: <20091115004327.GA12781@stikine.its.sfu.ca>
	<6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com>
Message-ID: <20091117012048.GD12561@bx9.net>

On Mon, Nov 16, 2009 at 10:49:23AM -0700, Michael H. Frese wrote:

> Could it be that your MPI library was compiled using a small memory  
> model?  The 180 million doubles sounds suspiciously close to a 2 GB  
> addressing limit.
>
> This issue came up on the list recently under the topic "Fortran Array 
> size question."

If you need a memory model other than the default small, you'll get a
particular error message at link time; here's an example courtesy of
the Intel software forums, but I bet that every compiler for Linux
includes an example in their manual:

/tmp/ifort3X7vjE.o: In function `sph':
sph.f:41: relocation truncated to fit: R_X86_64_PC32 against `.bss'
sph.f:94: relocation truncated to fit: R_X86_64_PC32 against `.bss'
sph.f:94: relocation truncated to fit: R_X86_64_PC32 against `.bss'
sph.f:94: relocation truncated to fit: R_X86_64_PC32 against `.bss'

And it's only when your BSS is too big, not variables on the stack or
allocated/malloced. I really doubt this is the problem either now or
before.

-- greg


From siegert at sfu.ca  Mon Nov 16 18:38:09 2009
From: siegert at sfu.ca (Martin Siegert)
Date: Mon, 16 Nov 2009 18:38:09 -0800
Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes
In-Reply-To: <20091117012048.GD12561@bx9.net>
References: <20091115004327.GA12781@stikine.its.sfu.ca>
	<6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com>
	<20091117012048.GD12561@bx9.net>
Message-ID: <20091117023809.GA25161@stikine.its.sfu.ca>

On Mon, Nov 16, 2009 at 05:20:48PM -0800, Greg Lindahl wrote:
> On Mon, Nov 16, 2009 at 10:49:23AM -0700, Michael H. Frese wrote:
> 
> > Could it be that your MPI library was compiled using a small memory  
> > model?  The 180 million doubles sounds suspiciously close to a 2 GB  
> > addressing limit.
> >
> > This issue came up on the list recently under the topic "Fortran Array 
> > size question."
> 
> If you need a memory model other than the default small, you'll get a
> particular error message at link time; here's an example courtesy of
> the Intel software forums, but I bet that every compiler for Linux
> includes an example in their manual:
> 
> /tmp/ifort3X7vjE.o: In function `sph':
> sph.f:41: relocation truncated to fit: R_X86_64_PC32 against `.bss'
> sph.f:94: relocation truncated to fit: R_X86_64_PC32 against `.bss'
> sph.f:94: relocation truncated to fit: R_X86_64_PC32 against `.bss'
> sph.f:94: relocation truncated to fit: R_X86_64_PC32 against `.bss'
> 
> And it's only when your BSS is too big, not variables on the stack or
> allocated/malloced. I really doubt this is the problem either now or
> before.

Thanks, that's good to know - I certainly do not see any such messages
- neither with the Intel compiler nor gcc.
Furthermore, compiling openmpi with mcmodel=medium or large does not
make a difference.
(my previous email about the error message changing was a mistake:
the error message changes when l is 268435456 or larger).

Also: compiling openmpi with ofed-1.4.1 does not make a difference.
May I conclude that this just does not work? Or can anybody actually
send an array of 180000000 doubles?

- Martin


From gus at ldeo.columbia.edu  Mon Nov 16 19:40:51 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Mon, 16 Nov 2009 22:40:51 -0500
Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message	sizes
In-Reply-To: <20091116232757.GF21826@stikine.its.sfu.ca>
References: <20091115004327.GA12781@stikine.its.sfu.ca>
	<6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com>
	<20091116205621.GB21826@stikine.its.sfu.ca>
	<4B01CA67.8020303@ldeo.columbia.edu>
	<20091116232757.GF21826@stikine.its.sfu.ca>
Message-ID: <4B021B43.5000505@ldeo.columbia.edu>

Hi Martin

I tried your program with the four combinations of
IB and TCP/IP, mcmodel small and medium.
I lazily didn't recompile OpenMPI (1.3.2) with mcmodel=medium,
just the program, hence this is not a very clean test.

FYI, we have dual-socket quad-core AMD Opteron
nodes with 16GB RAM each.
OpenMPI 1.3.2, CentOS 5.2, gcc 4.1.2, OFED 1.4.

When I ran on 2 nodes and 16 processes the program would always fail
with segmentation fault / address not mapped on all four
combinations above.

However, when I ran on 2 nodes and 2 processes ( -bynode flag in
use to direct each process to a separate node) then it
worked over all four combinations!

Here is the IB+medium stderr (you printed to stderr):
id=1: calling irecv ...
id=0: calling isend ...

and the corresponding stdout:
...
id=0: isend/irecv completed 1.954140
id=1: isend/irecv completed 4.192037

This rules out a problem with memory model, I suppose.
Small is good enough for your message size,
as long as there is enough RAM for all processes,
MPI overhead, etc.

Also, as Don Holmgren already pointed out to you,
make sure your limits are properly set on the nodes.
For instance, we use Torque, and we put these settings
on the nodes' /etc/init.d/pbs_mom:

ulimit -n 32768
ulimit -s unlimited
ulimit -l unlimited

Just like Don, we've been burned by this before, when using the
vendor original setup.
Of course these limits can be set in other ways.

As a practical matter:

Would it be possible/desirable to reduce the message size,
splitting the huge message into several smaller ones?
I know the wisdom is that one big message is better
than many small ones, but here we're talking about huge,
not big, and sizable, not small.

Even your tiny test program takes a detectable time to run
(4s+ seconds on IB, 14s+ on TCP/IP).
It may be worth writing another version of it looping over
smaller messages,
and do some timing tests to compare with the huge
message version.
There may be a sweet spot for the message size vs. number of
messages, I would guess.
Big may not always be better.

In the past a user here had a program sending very large messages
(big 3D arrays).
Not so big as to hit the 2GB threshold, but big enough to
slow down the nodes and the cluster.
Rewriting the program to loop over smaller messages
(2D array slices) solved the problem.
I remember other threads in the MPICH and OpenMPI
mailing lists that reported difficulties with huge messages.

My $0.02
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Martin Siegert wrote:
> Hi,
> 
> On Mon, Nov 16, 2009 at 04:55:51PM -0500, Gus Correa wrote:
>> Hi Martin
>>
>> We didn't know which compiler you used.
>> So what Michael sent you ("mmodel=memory_model")
>> is the Intel compiler flag syntax.
>> (PGI uses the same syntax, IIRR.)
> 
> Now that was really stupid, I am using gcc-4.3.2 and even looked up
> the correct syntax for the memory model, but nevertheless pasted the
> Intel syntax into my configure script ... sorry.
> 
>> Gcc/gfortran use "-mcmodel=memory_model" for x86_64 architecture.
>> I only used this with Intel ifort, hence I am not sure,
>> but "medium" should work fine for large data/not-so-large program
>> in gcc/gfortran.
>> The "large" model doesn't seem to be implemented by gcc (4.1.2)
>> anyway.
>> (Maybe it is there in newer gcc versions.)
>> The darn thing is that gcc says "medium" doesn't support building
>> shared libraries,
>> hence you may need to build OpenMPI static libraries instead,
>> I would guess.
>> (Again, check this if you have a newer gcc version.)
>> Here's an excerpt of my gcc (4.1.2) man page:
>>
>>
>>        -mcmodel=small
>>             Generate code for the small code model: the program and its 
>> symbols must be linked in the lower 2 GB of the address space.  Pointers 
>> are 64 bits.  Pro-
>>            grams can be statically or dynamically linked.  This is the 
>> default code model.
>>
>>        -mcmodel=kernel
>>            Generate code for the kernel code model.  The kernel runs in the 
>> negative 2 GB of the address space.  This model has to be used for Linux 
>> kernel code.
>>
>>        -mcmodel=medium
>>            Generate code for the medium model: The program is linked in the 
>> lower 2 GB of the address space but symbols can be located anywhere in the 
>> address
>>            space.  Programs can be statically or dynamically linked, but 
>> building of shared libraries are not supported with the medium model.
>>
>>        -mcmodel=large
>>            Generate code for the large model: This model makes no 
>> assumptions about addresses and sizes of sections.  Currently GCC does not 
>> implement this model.
> 
> I recompiled openmpi with -mcmodel=medium and -mcmodel=large. The program
> still fails. The error message changes, however:
> 
> id=1: calling irecv ...
> id=0: calling isend ...
> mlx4: local QP operation err (QPN 340052, WQE index 0, vendor syndrome 70, opcode = 5e)
> [[55365,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 282498416 opcode 11046  vendor error 112 qp_idx 3
> 
> (strerror(112) is "Host is down", which is certainly not correct).
> This now points to system libraries - libmlx4. Am I correct in assuming that
> this is either an OFED problem or OpenMPI exceeding some buffers in OFED
> libraries without checking?
> 
>> If you are using OpenMPI, "ompi-info -config"
>> will tell the flags used to compile it.
>> Mine is 1.3.2 and has no explicit mcmodel flag,
>> which according to the gcc man page should default to "small".
> 
> Are you - in fact, is anybody - able to run my test program? I am
> hoping that there is some stupid misconfiguration on the cluster
> that can be fixed easily, without reinstalling/recompiling all
> apps ...
> 
>> Are you using 16GB per process or for the whole set of processes?
> 
> I am running the two processes on different nodes (and nothing else
> on the nodes), thus each process has the full 16GB available.
>> I hope this helps,
>> Gus Correa
>> ---------------------------------------------------------------------
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
> 
> Thanks!
> 
> - Martin
> 
>> Martin Siegert wrote:
>>> Hi Michael,
>>>
>>> On Mon, Nov 16, 2009 at 10:49:23AM -0700, Michael H. Frese wrote:
>>>> Martin,
>>>>
>>>> Could it be that your MPI library was compiled using a small memory 
>>>> model?  The 180 million doubles sounds suspiciously close to a 2 GB 
>>>> addressing limit.
>>>>
>>>> This issue came up on the list recently under the topic "Fortran Array 
>>>> size question."
>>>>
>>>>
>>>> Mike
>>> I am running MPI applications that use more than 16GB of memory - I do not 
>>> believe that this is the problem. Also -mmodel=large
>>> does not appear to be a valid argument for gcc under x86_64:
>>> gcc -DNDEBUG -g -fPIC -mmodel=large   conftest.c  >&5
>>> cc1: error: unrecognized command line option "-mmodel=large"
>>>
>>> - Martin
>>>
>>>> At 05:43 PM 11/14/2009, Martin Siegert wrote:
>>>>> Hi,
>>>>>
>>>>> I am running into problems when sending large messages (about
>>>>> 180000000 doubles) over IB. A fairly trivial example program is attached.
>>>>>
>>>>> # mpicc -g sendrecv.c
>>>>> # mpiexec -machinefile m2 -n 2 ./a.out
>>>>> id=1: calling irecv ...
>>>>> id=0: calling isend ...
>>>>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 
>>>>> error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for 
>>>>> wr_id 199132400 opcode 549755813  vendor error 105 qp_idx 3
>>>>>
>>>>> This is with OpenMPI-1.3.3.
>>>>> Does anybody know a solution to this problem?
>>>>>
>>>>> If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs
>>>>> and never returns.
>>>>> I asked on the openmpi users list but got no response ...
>>>>>
>>>>> Cheers,
>>>>> Martin
>>>>>
>>>>> --
>>>>> Martin Siegert
>>>>> Head, Research Computing
>>>>> WestGrid Site Lead
>>>>> IT Services                                phone: 778 782-4691
>>>>> Simon Fraser University                    fax:   778 782-4242
>>>>> Burnaby, British Columbia                  email: siegert at sfu.ca
>>>>> Canada  V5A 1S6
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> 


From siegert at sfu.ca  Mon Nov 16 21:04:07 2009
From: siegert at sfu.ca (Martin Siegert)
Date: Mon, 16 Nov 2009 21:04:07 -0800
Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes
In-Reply-To: <4B021B43.5000505@ldeo.columbia.edu>
References: <20091115004327.GA12781@stikine.its.sfu.ca>
	<6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com>
	<20091116205621.GB21826@stikine.its.sfu.ca>
	<4B01CA67.8020303@ldeo.columbia.edu>
	<20091116232757.GF21826@stikine.its.sfu.ca>
	<4B021B43.5000505@ldeo.columbia.edu>
Message-ID: <20091117050407.GA25626@stikine.its.sfu.ca>

Hi Gus,

On Mon, Nov 16, 2009 at 10:40:51PM -0500, Gus Correa wrote:
> Hi Martin
>
> I tried your program with the four combinations of
> IB and TCP/IP, mcmodel small and medium.
> I lazily didn't recompile OpenMPI (1.3.2) with mcmodel=medium,
> just the program, hence this is not a very clean test.
>
> FYI, we have dual-socket quad-core AMD Opteron
> nodes with 16GB RAM each.
> OpenMPI 1.3.2, CentOS 5.2, gcc 4.1.2, OFED 1.4.

We have dual-socket quad-core Intel E5430, 16GB,
OpenMPI-1.3.3, SL 5.3, gcc 4.3.2 (and a bunch of other compilers,
but gcc-4.3.2 is used to compile OpenMPI), OFED-1.3.2 (tested
OFED-1.4.1 on two test nodes).

> When I ran on 2 nodes and 16 processes the program would always fail
> with segmentation fault / address not mapped on all four
> combinations above.
>
> However, when I ran on 2 nodes and 2 processes ( -bynode flag in
> use to direct each process to a separate node) then it
> worked over all four combinations!
>
> Here is the IB+medium stderr (you printed to stderr):
> id=1: calling irecv ...
> id=0: calling isend ...
>
> and the corresponding stdout:
> ...
> id=0: isend/irecv completed 1.954140
> id=1: isend/irecv completed 4.192037

Thanks!!
Now I am surprised ... this always fails here.
What's the difference?

> This rules out a problem with memory model, I suppose.
> Small is good enough for your message size,
> as long as there is enough RAM for all processes,
> MPI overhead, etc.
>
> Also, as Don Holmgren already pointed out to you,
> make sure your limits are properly set on the nodes.
> For instance, we use Torque, and we put these settings
> on the nodes' /etc/init.d/pbs_mom:
>
> ulimit -n 32768
> ulimit -s unlimited
> ulimit -l unlimited
>
> Just like Don, we've been burned by this before, when using the
> vendor original setup.
> Of course these limits can be set in other ways.

I have been running this on the two test nodes without going through
torque to avoid exactly these kind of problems.
Anyway, I just ran the same program through torque, ran "ulimit -a"
in the pbs script (all looks fine), but the program still fails.

> As a practical matter:
>
> Would it be possible/desirable to reduce the message size,
> splitting the huge message into several smaller ones?
> I know the wisdom is that one big message is better
> than many small ones, but here we're talking about huge,
> not big, and sizable, not small.
>
> Even your tiny test program takes a detectable time to run
> (4s+ seconds on IB, 14s+ on TCP/IP).
> It may be worth writing another version of it looping over
> smaller messages,
> and do some timing tests to compare with the huge
> message version.
> There may be a sweet spot for the message size vs. number of
> messages, I would guess.
> Big may not always be better.
>
> In the past a user here had a program sending very large messages
> (big 3D arrays).
> Not so big as to hit the 2GB threshold, but big enough to
> slow down the nodes and the cluster.
> Rewriting the program to loop over smaller messages
> (2D array slices) solved the problem.
> I remember other threads in the MPICH and OpenMPI
> mailing lists that reported difficulties with huge messages.
>
> My $0.02
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------

In principle, yes ... I already wrote wrapper functions
myMPI_Isend, myMPI_Irecv that do exactly that.
However, we are talking about one of those quantum chemistry
programs: many thousands of lines ... I'd really like to avoid
this.

- Martin

> Martin Siegert wrote:
>> Hi,
>>
>> On Mon, Nov 16, 2009 at 04:55:51PM -0500, Gus Correa wrote:
>>> Hi Martin
>>>
>>> We didn't know which compiler you used.
>>> So what Michael sent you ("mmodel=memory_model")
>>> is the Intel compiler flag syntax.
>>> (PGI uses the same syntax, IIRR.)
>>
>> Now that was really stupid, I am using gcc-4.3.2 and even looked up
>> the correct syntax for the memory model, but nevertheless pasted the
>> Intel syntax into my configure script ... sorry.
>>
>>> Gcc/gfortran use "-mcmodel=memory_model" for x86_64 architecture.
>>> I only used this with Intel ifort, hence I am not sure,
>>> but "medium" should work fine for large data/not-so-large program
>>> in gcc/gfortran.
>>> The "large" model doesn't seem to be implemented by gcc (4.1.2)
>>> anyway.
>>> (Maybe it is there in newer gcc versions.)
>>> The darn thing is that gcc says "medium" doesn't support building
>>> shared libraries,
>>> hence you may need to build OpenMPI static libraries instead,
>>> I would guess.
>>> (Again, check this if you have a newer gcc version.)
>>> Here's an excerpt of my gcc (4.1.2) man page:
>>>
>>>
>>>        -mcmodel=small
>>>             Generate code for the small code model: the program and its 
>>> symbols must be linked in the lower 2 GB of the address space.  Pointers 
>>> are 64 bits.  Pro-
>>>            grams can be statically or dynamically linked.  This is the 
>>> default code model.
>>>
>>>        -mcmodel=kernel
>>>            Generate code for the kernel code model.  The kernel runs in 
>>> the negative 2 GB of the address space.  This model has to be used for 
>>> Linux kernel code.
>>>
>>>        -mcmodel=medium
>>>            Generate code for the medium model: The program is linked in 
>>> the lower 2 GB of the address space but symbols can be located anywhere 
>>> in the address
>>>            space.  Programs can be statically or dynamically linked, but 
>>> building of shared libraries are not supported with the medium model.
>>>
>>>        -mcmodel=large
>>>            Generate code for the large model: This model makes no 
>>> assumptions about addresses and sizes of sections.  Currently GCC does 
>>> not implement this model.
>>
>> I recompiled openmpi with -mcmodel=medium and -mcmodel=large. The program
>> still fails. The error message changes, however:
>>
>> id=1: calling irecv ...
>> id=0: calling isend ...
>> mlx4: local QP operation err (QPN 340052, WQE index 0, vendor syndrome 70, opcode = 5e)
>> [[55365,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 282498416 opcode 11046  vendor error 112 qp_idx 3
>>
>> (strerror(112) is "Host is down", which is certainly not correct).
>> This now points to system libraries - libmlx4. Am I correct in assuming that
>> this is either an OFED problem or OpenMPI exceeding some buffers in OFED
>> libraries without checking?
>>
>>> If you are using OpenMPI, "ompi-info -config"
>>> will tell the flags used to compile it.
>>> Mine is 1.3.2 and has no explicit mcmodel flag,
>>> which according to the gcc man page should default to "small".
>>
>> Are you - in fact, is anybody - able to run my test program? I am
>> hoping that there is some stupid misconfiguration on the cluster
>> that can be fixed easily, without reinstalling/recompiling all
>> apps ...
>>
>>> Are you using 16GB per process or for the whole set of processes?
>>
>> I am running the two processes on different nodes (and nothing else
>> on the nodes), thus each process has the full 16GB available.
>>> I hope this helps,
>>> Gus Correa
>>> ---------------------------------------------------------------------
>>> Gustavo Correa
>>> Lamont-Doherty Earth Observatory - Columbia University
>>> Palisades, NY, 10964-8000 - USA
>>> ---------------------------------------------------------------------
>>
>> Thanks!
>>
>> - Martin
>>
>>> Martin Siegert wrote:
>>>> Hi Michael,
>>>>
>>>> On Mon, Nov 16, 2009 at 10:49:23AM -0700, Michael H. Frese wrote:
>>>>> Martin,
>>>>>
>>>>> Could it be that your MPI library was compiled using a small memory 
>>>>> model?  The 180 million doubles sounds suspiciously close to a 2 GB 
>>>>> addressing limit.
>>>>>
>>>>> This issue came up on the list recently under the topic "Fortran Array 
>>>>> size question."
>>>>>
>>>>>
>>>>> Mike
>>>> I am running MPI applications that use more than 16GB of memory - I do 
>>>> not believe that this is the problem. Also -mmodel=large
>>>> does not appear to be a valid argument for gcc under x86_64:
>>>> gcc -DNDEBUG -g -fPIC -mmodel=large   conftest.c  >&5
>>>> cc1: error: unrecognized command line option "-mmodel=large"
>>>>
>>>> - Martin
>>>>
>>>>> At 05:43 PM 11/14/2009, Martin Siegert wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I am running into problems when sending large messages (about
>>>>>> 180000000 doubles) over IB. A fairly trivial example program is attached.
>>>>>>
>>>>>> # mpicc -g sendrecv.c
>>>>>> # mpiexec -machinefile m2 -n 2 ./a.out
>>>>>> id=1: calling irecv ...
>>>>>> id=0: calling isend ...
>>>>>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 
>>>>>> error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for 
>>>>>> wr_id 199132400 opcode 549755813  vendor error 105 qp_idx 3
>>>>>>
>>>>>> This is with OpenMPI-1.3.3.
>>>>>> Does anybody know a solution to this problem?
>>>>>>
>>>>>> If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs
>>>>>> and never returns.
>>>>>> I asked on the openmpi users list but got no response ...
>>>>>>
>>>>>> Cheers,
>>>>>> Martin
>>>>>>
>>>>>> --
>>>>>> Martin Siegert
>>>>>> Head, Research Computing
>>>>>> WestGrid Site Lead
>>>>>> IT Services                                phone: 778 782-4691
>>>>>> Simon Fraser University                    fax:   778 782-4242
>>>>>> Burnaby, British Columbia                  email: siegert at sfu.ca
>>>>>> Canada  V5A 1S6
>>>> _______________________________________________
>>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>>>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>>

-- 
Martin Siegert
Head, Research Computing
WestGrid Site Lead
IT Services                                phone: 778 782-4691
Simon Fraser University                    fax:   778 782-4242
Burnaby, British Columbia                  email: siegert at sfu.ca
Canada  V5A 1S6


From gus at ldeo.columbia.edu  Mon Nov 16 22:26:52 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Tue, 17 Nov 2009 01:26:52 -0500
Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes
In-Reply-To: <20091117050407.GA25626@stikine.its.sfu.ca>
References: <20091115004327.GA12781@stikine.its.sfu.ca>
	<6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com>
	<20091116205621.GB21826@stikine.its.sfu.ca>
	<4B01CA67.8020303@ldeo.columbia.edu>
	<20091116232757.GF21826@stikine.its.sfu.ca>
	<4B021B43.5000505@ldeo.columbia.edu>
	<20091117050407.GA25626@stikine.its.sfu.ca>
Message-ID: <4B02422C.9080308@ldeo.columbia.edu>

Hi Martin

Answers/comments inline below

Martin Siegert wrote:
> Hi Gus,
> 
> On Mon, Nov 16, 2009 at 10:40:51PM -0500, Gus Correa wrote:
>> Hi Martin
>>
>> I tried your program with the four combinations of
>> IB and TCP/IP, mcmodel small and medium.
>> I lazily didn't recompile OpenMPI (1.3.2) with mcmodel=medium,
>> just the program, hence this is not a very clean test.
>>
>> FYI, we have dual-socket quad-core AMD Opteron
>> nodes with 16GB RAM each.
>> OpenMPI 1.3.2, CentOS 5.2, gcc 4.1.2, OFED 1.4.
> 
> We have dual-socket quad-core Intel E5430, 16GB,
> OpenMPI-1.3.3, SL 5.3, gcc 4.3.2 (and a bunch of other compilers,
> but gcc-4.3.2 is used to compile OpenMPI), OFED-1.3.2 (tested
> OFED-1.4.1 on two test nodes).
> 
>> When I ran on 2 nodes and 16 processes the program would always fail
>> with segmentation fault / address not mapped on all four
>> combinations above.
>>
>> However, when I ran on 2 nodes and 2 processes ( -bynode flag in
>> use to direct each process to a separate node) then it
>> worked over all four combinations!
>>
>> Here is the IB+medium stderr (you printed to stderr):
>> id=1: calling irecv ...
>> id=0: calling isend ...
>>
>> and the corresponding stdout:
>> ...
>> id=0: isend/irecv completed 1.954140
>> id=1: isend/irecv completed 4.192037
> 
> Thanks!!
> Now I am surprised ... this always fails here.
> What's the difference?
> 

The software stack is not the same, neither the hardware.
But I would guess they are not so far apart to make the difference.

Have you tried to run on TCP/IP?
Say, using:

         -mca btl tcp,sm,self \

and perhaps
	-mca btl_tcp_if_exclude lo,eth[0,1]
or
	-mca btl_tcp_if_include eth[0,1]
to select the Ethernet port?

I would guess you have at least one Ethernet network
to test the program over TCP/IP.
If it works on TCP/IP,
then the problem is likely to reside within IB.
(Maybe in OFED-1.3.2?)

	
>> This rules out a problem with memory model, I suppose.
>> Small is good enough for your message size,
>> as long as there is enough RAM for all processes,
>> MPI overhead, etc.
>>
>> Also, as Don Holmgren already pointed out to you,
>> make sure your limits are properly set on the nodes.
>> For instance, we use Torque, and we put these settings
>> on the nodes' /etc/init.d/pbs_mom:
>>
>> ulimit -n 32768
>> ulimit -s unlimited
>> ulimit -l unlimited
>>
>> Just like Don, we've been burned by this before, when using the
>> vendor original setup.
>> Of course these limits can be set in other ways.
> 
> I have been running this on the two test nodes without going through
> torque to avoid exactly these kind of problems.
> Anyway, I just ran the same program through torque, ran "ulimit -a"
> in the pbs script (all looks fine), but the program still fails.
> 
>> As a practical matter:
>>
>> Would it be possible/desirable to reduce the message size,
>> splitting the huge message into several smaller ones?
>> I know the wisdom is that one big message is better
>> than many small ones, but here we're talking about huge,
>> not big, and sizable, not small.
>>
>> Even your tiny test program takes a detectable time to run
>> (4s+ seconds on IB, 14s+ on TCP/IP).
>> It may be worth writing another version of it looping over
>> smaller messages,
>> and do some timing tests to compare with the huge
>> message version.
>> There may be a sweet spot for the message size vs. number of
>> messages, I would guess.
>> Big may not always be better.
>>
>> In the past a user here had a program sending very large messages
>> (big 3D arrays).
>> Not so big as to hit the 2GB threshold, but big enough to
>> slow down the nodes and the cluster.
>> Rewriting the program to loop over smaller messages
>> (2D array slices) solved the problem.
>> I remember other threads in the MPICH and OpenMPI
>> mailing lists that reported difficulties with huge messages.
>>
>> My $0.02
>> Gus Correa
>> ---------------------------------------------------------------------
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
> 
> In principle, yes ... I already wrote wrapper functions
> myMPI_Isend, myMPI_Irecv that do exactly that.
> However, we are talking about one of those quantum chemistry
> programs: many thousands of lines ... I'd really like to avoid
> this.
> 
> - Martin
> 

A few days ago somebody posted here a tip on how to run VASP in
a more scalable/efficient way by just choosing some internal
code parameters (probably available through a mere namelist).
This was after a long discussion here on how to make VASP
more scalable by tweaking with OpenMPI MCA parameters, etc, etc.

Would your user be willing to take a look at the code documentation
and find out if there is a way to decompose his domain, or matrix,
or problem, or whatever, in a more sensible
(and hopefully scalable) way?
Often times there is.
These programs are not necessarily poorly designed,
but users need read the documentation (or articles about the method)
to find out how to use them right.
A knowledgeable user should understand what the mathematical
method and the algorithm are doing, or at least be willing
to learn the basics of them.

Unless the problem itself is huge, passing an array of 180 million
doubles doesn't sound reasonable,just a brute force approach,
particularly if only two processes are sharing the work,
if you don't mind my saying that.
And if the problem is huge, one could argue that more nodes/processes
and smaller messages could be used to get the job done better.

We're mostly a climate, atmosphere, ocean shop, but this doesn't
mean that we are protected from this type of problem either.

Just a suggestion.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

>> Martin Siegert wrote:
>>> Hi,
>>>
>>> On Mon, Nov 16, 2009 at 04:55:51PM -0500, Gus Correa wrote:
>>>> Hi Martin
>>>>
>>>> We didn't know which compiler you used.
>>>> So what Michael sent you ("mmodel=memory_model")
>>>> is the Intel compiler flag syntax.
>>>> (PGI uses the same syntax, IIRR.)
>>> Now that was really stupid, I am using gcc-4.3.2 and even looked up
>>> the correct syntax for the memory model, but nevertheless pasted the
>>> Intel syntax into my configure script ... sorry.
>>>
>>>> Gcc/gfortran use "-mcmodel=memory_model" for x86_64 architecture.
>>>> I only used this with Intel ifort, hence I am not sure,
>>>> but "medium" should work fine for large data/not-so-large program
>>>> in gcc/gfortran.
>>>> The "large" model doesn't seem to be implemented by gcc (4.1.2)
>>>> anyway.
>>>> (Maybe it is there in newer gcc versions.)
>>>> The darn thing is that gcc says "medium" doesn't support building
>>>> shared libraries,
>>>> hence you may need to build OpenMPI static libraries instead,
>>>> I would guess.
>>>> (Again, check this if you have a newer gcc version.)
>>>> Here's an excerpt of my gcc (4.1.2) man page:
>>>>
>>>>
>>>>        -mcmodel=small
>>>>             Generate code for the small code model: the program and its 
>>>> symbols must be linked in the lower 2 GB of the address space.  Pointers 
>>>> are 64 bits.  Pro-
>>>>            grams can be statically or dynamically linked.  This is the 
>>>> default code model.
>>>>
>>>>        -mcmodel=kernel
>>>>            Generate code for the kernel code model.  The kernel runs in 
>>>> the negative 2 GB of the address space.  This model has to be used for 
>>>> Linux kernel code.
>>>>
>>>>        -mcmodel=medium
>>>>            Generate code for the medium model: The program is linked in 
>>>> the lower 2 GB of the address space but symbols can be located anywhere 
>>>> in the address
>>>>            space.  Programs can be statically or dynamically linked, but 
>>>> building of shared libraries are not supported with the medium model.
>>>>
>>>>        -mcmodel=large
>>>>            Generate code for the large model: This model makes no 
>>>> assumptions about addresses and sizes of sections.  Currently GCC does 
>>>> not implement this model.
>>> I recompiled openmpi with -mcmodel=medium and -mcmodel=large. The program
>>> still fails. The error message changes, however:
>>>
>>> id=1: calling irecv ...
>>> id=0: calling isend ...
>>> mlx4: local QP operation err (QPN 340052, WQE index 0, vendor syndrome 70, opcode = 5e)
>>> [[55365,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 282498416 opcode 11046  vendor error 112 qp_idx 3
>>>
>>> (strerror(112) is "Host is down", which is certainly not correct).
>>> This now points to system libraries - libmlx4. Am I correct in assuming that
>>> this is either an OFED problem or OpenMPI exceeding some buffers in OFED
>>> libraries without checking?
>>>
>>>> If you are using OpenMPI, "ompi-info -config"
>>>> will tell the flags used to compile it.
>>>> Mine is 1.3.2 and has no explicit mcmodel flag,
>>>> which according to the gcc man page should default to "small".
>>> Are you - in fact, is anybody - able to run my test program? I am
>>> hoping that there is some stupid misconfiguration on the cluster
>>> that can be fixed easily, without reinstalling/recompiling all
>>> apps ...
>>>
>>>> Are you using 16GB per process or for the whole set of processes?
>>> I am running the two processes on different nodes (and nothing else
>>> on the nodes), thus each process has the full 16GB available.
>>>> I hope this helps,
>>>> Gus Correa
>>>> ---------------------------------------------------------------------
>>>> Gustavo Correa
>>>> Lamont-Doherty Earth Observatory - Columbia University
>>>> Palisades, NY, 10964-8000 - USA
>>>> ---------------------------------------------------------------------
>>> Thanks!
>>>
>>> - Martin
>>>
>>>> Martin Siegert wrote:
>>>>> Hi Michael,
>>>>>
>>>>> On Mon, Nov 16, 2009 at 10:49:23AM -0700, Michael H. Frese wrote:
>>>>>> Martin,
>>>>>>
>>>>>> Could it be that your MPI library was compiled using a small memory 
>>>>>> model?  The 180 million doubles sounds suspiciously close to a 2 GB 
>>>>>> addressing limit.
>>>>>>
>>>>>> This issue came up on the list recently under the topic "Fortran Array 
>>>>>> size question."
>>>>>>
>>>>>>
>>>>>> Mike
>>>>> I am running MPI applications that use more than 16GB of memory - I do 
>>>>> not believe that this is the problem. Also -mmodel=large
>>>>> does not appear to be a valid argument for gcc under x86_64:
>>>>> gcc -DNDEBUG -g -fPIC -mmodel=large   conftest.c  >&5
>>>>> cc1: error: unrecognized command line option "-mmodel=large"
>>>>>
>>>>> - Martin
>>>>>
>>>>>> At 05:43 PM 11/14/2009, Martin Siegert wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am running into problems when sending large messages (about
>>>>>>> 180000000 doubles) over IB. A fairly trivial example program is attached.
>>>>>>>
>>>>>>> # mpicc -g sendrecv.c
>>>>>>> # mpiexec -machinefile m2 -n 2 ./a.out
>>>>>>> id=1: calling irecv ...
>>>>>>> id=0: calling isend ...
>>>>>>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 
>>>>>>> error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for 
>>>>>>> wr_id 199132400 opcode 549755813  vendor error 105 qp_idx 3
>>>>>>>
>>>>>>> This is with OpenMPI-1.3.3.
>>>>>>> Does anybody know a solution to this problem?
>>>>>>>
>>>>>>> If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs
>>>>>>> and never returns.
>>>>>>> I asked on the openmpi users list but got no response ...
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Martin
>>>>>>>
>>>>>>> --
>>>>>>> Martin Siegert
>>>>>>> Head, Research Computing
>>>>>>> WestGrid Site Lead
>>>>>>> IT Services                                phone: 778 782-4691
>>>>>>> Simon Fraser University                    fax:   778 782-4242
>>>>>>> Burnaby, British Columbia                  email: siegert at sfu.ca
>>>>>>> Canada  V5A 1S6
>>>>> _______________________________________________
>>>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>>>>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> 


From jlforrest at berkeley.edu  Tue Nov 17 10:11:27 2009
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Tue, 17 Nov 2009 10:11:27 -0800
Subject: [Beowulf] How Would You Test Infiniband in New Cluster?
Message-ID: <4B02E74F.8050103@berkeley.edu>

Let's say you have a brand new cluster with
brand new Infiniband hardware, and that
you've installed OFED 1.4 and the
appropriate drivers for your IB
HCAs (i.e. you see ib0 devices
on the frontend and all compute nodes).
The cluster appears to be working
fine but you're not sure about IB.

How would you test your IB network
to make sure all is well?

Cordially,
-- 
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest at berkeley.edu


From bill at cse.ucdavis.edu  Tue Nov 17 10:33:17 2009
From: bill at cse.ucdavis.edu (Bill Broadley)
Date: Tue, 17 Nov 2009 10:33:17 -0800
Subject: [Beowulf] How Would You Test Infiniband in New Cluster?
In-Reply-To: <4B02E74F.8050103@berkeley.edu>
References: <4B02E74F.8050103@berkeley.edu>
Message-ID: <4B02EC6D.60107@cse.ucdavis.edu>

Jon Forrest wrote:
> Let's say you have a brand new cluster with
> brand new Infiniband hardware, and that
> you've installed OFED 1.4 and the
> appropriate drivers for your IB
> HCAs (i.e. you see ib0 devices
> on the frontend and all compute nodes).
> The cluster appears to be working
> fine but you're not sure about IB.
> 
> How would you test your IB network
> to make sure all is well?

My first suggest sanity test would be to test latency and bandwidth to insure
you are getting IB numbers.  So 80-100MB/sec and 30-60us for a small packet
would imply GigE.  6-8 times the bandwidth certainly would imply SDR or
better.  Latency varies quite a bit among implementation, I'd try to get
within 30-40% of advertised latency numbers.

Then I'd try a workload that kept all nodes busy with something communications
intensive.  Pathscale has a mpi_nxnlatbw which works reasonable well to
identify ports/nodes that are are slower than expected.

After that works I'd suggest a production MPI work load with a known answer.


From jlforrest at berkeley.edu  Tue Nov 17 10:58:43 2009
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Tue, 17 Nov 2009 10:58:43 -0800
Subject: [Beowulf] How Would You Test Infiniband in New Cluster?
In-Reply-To: <4B02EC6D.60107@cse.ucdavis.edu>
References: <4B02E74F.8050103@berkeley.edu> <4B02EC6D.60107@cse.ucdavis.edu>
Message-ID: <4B02F263.9030607@berkeley.edu>

Bill Broadley wrote:

> My first suggest sanity test would be to test latency and bandwidth to insure
> you are getting IB numbers.  So 80-100MB/sec and 30-60us for a small packet
> would imply GigE.  6-8 times the bandwidth certainly would imply SDR or
> better.  Latency varies quite a bit among implementation, I'd try to get
> within 30-40% of advertised latency numbers.

For those of us who aren't familiar with IB utilities,
could you give some examples of the commands you'd use
to do this?

Thanks,
Jon


From agshew at gmail.com  Tue Nov 17 11:21:05 2009
From: agshew at gmail.com (Andrew Shewmaker)
Date: Tue, 17 Nov 2009 12:21:05 -0700
Subject: [Beowulf] UEFI (Unified Extensible Firmware Interface) for the 
	BIOS
In-Reply-To: <c4d69730911121647m2f538e04v98780934f6b72cbd@mail.gmail.com>
References: <c4d69730911121647m2f538e04v98780934f6b72cbd@mail.gmail.com>
Message-ID: <f4050c9e0911171121s4ad379f7la78ccefe5d32cd28@mail.gmail.com>

On Thu, Nov 12, 2009 at 5:47 PM, Rahul Nabar <rpnabar at gmail.com> wrote:
> Has anyone tried out UEFI (Unified Extensible Firmware Interface) in
> the BIOS? The new servers I am buying come with this option in the
> BIOS. Out of curiosity I googled it up.
>
> I am not sure if there were any HPC implications of this and wanted to
> double check before I switched to this from my conventional
> plain-vanilla BIOS. Any sort of "industry standard" always sounds good
> but I thought it safer to check on the group first....
>
> Any advice or pitfalls?

Here's something on EFI I wrote up for myself in 2005.  It's a bit out
of date, but it covers some stuff that wikipedia doesn't.  In
particular, I would read the old Kernel Traffic to understand how
various developers dislike EFI.

And in case you are wondering, this post looks a bit different because
it is in moinmoin wiki markup.

== Firmware Awareness ==

You may have heard of Intel's EFI, but have wondered how does it
compare to legacy BIOSes, Open Firmware, and LinuxBIOS.  Here's some
stuff you might want to be aware of.

=== Acronyms ===

ACPI - Advanced Configuration and Power Interface

EBC - EFI Byte Code

EFI - Extensible Firmware Interface

UEFI Forum - The Unified EFI Forum is a group of companies (all of the
big PC players) responsible for the devolopment and promotion of EFI

LinuxBIOS - small, fast open source alternative to proprietary PC BIOSes

OpenBIOS - open source Open Firmware implementation

Open Firmware - defined by IEEE-1275 and used by Sun Microsystems
(since 1988), IBM, and Apple to initialize hardware and boot Operating
Systems in a largely hardware-independent manner

=== UEFI Forum ===

The UEFI Forum just announced its existence, and it looks like Intel
has convinced the major PC vendors and their rival AMD to adopt EFI as
a replacement for legacy BIOSes.  It looks like we will be seeing EFI
everywhere.

=== Overview of EFI ===

[UEFI]


Q:	Does UEFI completely replace a PC BIOS?
A:	No. While UEFI uses a different interface for boot services and
runtime services, some platform firmware must perform the functions
BIOS uses for system configuration (a.k.a. Power On Self Test or POST)
and Setup. UEFI does not specify how POST & Setup are implemented.
 	
Q:	How is UEFI implemented on a computer system?
A:	UEFI is an interface. It can be implemented on top of a traditional
BIOS (in which case it supplants the traditional INT entry points into
BIOS) or on top of non-BIOS implementations.


[Singh]


In a representative EFI system, a thin Pre-EFI Initialization Layer
(PEI) might do most of the POST-related work that is traditionally
done by the BIOS POST. This includes things like chipset
initialization, memory initialization, bus enumeration, etc. EFI
prepares a Driver Execution Environment (DXE) to provide generic
platform functions that EFI drivers may use. The drivers themselves
provide specific platform capabilities and customizations.


...


Andrew Fish invented EFI at his desk in the late 1990s, calling it
Intel Boot Initiative (IBI) at that time. He offered his 26 page
unsolicited white paper to his management. The paper was meant to be a
response to major operating system and hardware companies rejecting
legacy BIOS as the firmware for enterprise class Itanium? Processor
Platforms.

Andrew says:

"At that time, two firmware solutions were put on the table as
replacements for BIOS architectures for the Itanium: Alpha Reference
Console (ARC) and Open Firmware. It turned out that nobody really
owned the inellectual property to ARC, and in any case, it did not
have enough extensible properties to make it practical for a
horizontal industry. At this point, Open Firmware became the
frontrunner as Apple and Sun both used it. However, Open Firmware was
not without its own technical challenges. The PC had started down a
path of using the Advanced Configuration and Power Interface (ACPI) as
its runtime namespace to describe the platform to the operating
system. As I liked to say at the time, the only thing worse than one
namespace is keeping two namespaces in sync. The other problem was the
lack of third party support for Open Firmware. We invited the
FirmWorks guys to come visit us at Dupont (WA), and we had a great
talk. Given we had just gone through an exercise of inventing a
firmware base from scratch, I think we were uniquely qualified to
appreciate what Open Firmware had been able to achieve. Unfortunately,
it became clear that the infrastructure to support a transition to
Open Firmware did not exist. Given the namespace issue with Open
Firmware and the lack of industry enabling infrastructure, we decided
to go on and make EFI a reality."

"EFI is an interface specification and it really is more about how to
write an operating system loader and an Option ROM than it is about
how to make a BIOS that works. The Intel? Platform Innovation
Framework for EFI (Framework for short) is Intel's next generation
firmware architecture from the ground up. The core chunks of this code
are available under an Open Source license at www.TianoCore.org. Tiano
was the developer code name while Framework was the marketing name."

"To sum up, EFI is an industry interface specification that defines
how OS loaders and PCI Option ROMs work. The Framework defines a new
modular architecture that allows an entire firmware base to be
constructed in a modular fashion. The Framework has a nice property in
that it allows binary modules to work together in the boot process.
This allows the code from each vendor to have an arbitrary license
type. Intel? was interested in EFI from making a standard Itanium
platform (as well as IA-32 platforms of the future) to drive adoption
and enable a horizontal industry to make compatible platforms. The
Framework is more about silicon enabling, so it drills down to a much
lower level of how things work."


Fish says that PCs had already started down the path of ACPI and that
the Open Firmware namespace was incompatible.  Evidently when ACPI was
developed, the already existent Open Firmware specification/namespace
was ignored.

[Intel]

1992 APM 1.0
1993 APM 1.1, APM Energy Star
1994
1995 PCI Mobile Design Guide 1.1, PCI 2.1
1996 APM 1.2
1997 ACPI 1.0, PCI PM 1.0
1998 PCI 2.2
1999 ACPI 1.0b
2000 ACPI 2.0

                               Figure 2.1 PC Power Management
Specification Timeline


=== Linus Torvalds comments on EFI ===

Linus Torvalds.  [Brown]


EFI is doing all the wrong things. Trying to fix BIOSes by being "more
generic". It's going to be a total nightmare if you go down that path.

What will work is:

    * standard hardware interfaces. Instead of working on bytecode
interpreters, make the f*cking hardware definition instead, and make
it SANE and PUBLIC! So that we can write drivers that work, and that
come with source so that we can fix them when somebody has buggy
hardware.

      DO NOT MAKE ANOTHER FRIGGING BYTECODE INTERPRETER!

      Didn't Intel learn anything from past mistakes? ACPI was
supposed to be "simple". Codswallop.

      PCI works, because it had standard, and documented, hardware
interfaces. The interfaces aren't well specified enough to write a PCI
disk driver, of course, but they _are_ good enough to do discovery and
a lot of things.

      Intel _could_ make a "PCI disk controller interface definition",
and it will work. The way USB does actually work, and UHCI was
actually a fair standard, even if it left _way_ too much to software.
    * Source code. LinuxBIOS works today, and is a lot more flexible
than EFI will _ever_ be.
    * Compatibility. Make hardware that works with old drivers and old
BIOSes. This works. The fact that Intel forgot about that with ia-64
is not an excuse to make _more_ mistakes.


=== Intel's Reply to Linus ===

Mark Doran of Intel.  [Brown]


The trouble with the "architectural hardware" argument proved to be
that PCI is already well established and there is a vibrant industry
churning out innovative PCI cards on a regular basis. The idea of a
single interface definition for all cards of each of the network,
storage or video classes is viewed as simply too limiting and the
argument was made to us that to force such a model would be to stifle
innovation in peripherals. So effectively the feedback we got on
"architectural hardware" was therefore along the lines of "good idea
but not practical..."

...

As a practical matter carrying multiple instruction set versions of
the same code gets expensive in FLASH memory terms. Consider an EFI
compiled driver for IA-32 as the index, size: one unit. With code size
expansion, an Itanium compiled driver is going to be three to four
times that size. Total ROM container requirement: one unit for the
legacy ROM image plus one for an EFI IA-32 driver plus three to four
units for an Itanium compiled driver image; to make the card "just
work" when you plug it into a variety of systems is starting to
require a lot of FLASH on the card. More than the IHVs were willing to
countenance in most cases for cost reasons.

EFI Byte Code was born of this challenge. Its goals are pretty
straightforward: architecture neutral image, small foot print in the
add-in card ROM container and of course small footprint in the
motherboard which will have to carry an interpreter. We also insisted
that the C source for a driver should be the same regardless of
whether you build it for a native machine instruction set or EBC.

...

You may ask why we didn't just use an existing definition as opposed
to making a new one. We did actually spend quite a bit of time on that
very question. Most alternatives would have significantly swelled the
ROM container size requirement or the motherboard support overhead
requirement or had licensing, IP or other impediments to deployment
into the wider industry that we had no practical means to resolve.
With specific reference to why we chose not to use the IA-32
instruction set for this purpose, it was all about the size of an
interpreter for that instruction set. To provide compatibility 100%
for the universe of real mode option ROM binaries out there would
require a comprehensive treatment of a very rich instruction set
architecture. We could see no practical way to persuade OEMs building
systems using processors other than IA-32 to carry along that much
interpreter code in their motherboard ROM.

...

... EBC requires a small interpreter with no libraries (roughly 18k
uncompressed total on IA-32 for example) and the average add-in card
ROM image size is 1.5 units relative to native IA-32 code. And keep in
mind that using byte code for this purpose is in widespread, long time
use on other CPU architectures so we felt the technique in general was
viable based on industry experience with it. Yes, it's a compromise
but the best balance point we have been able find to date.

...

There is nothing about the definition of the EFI spec or the driver
model associated that prevents vendors from making add-in card drivers
and presenting them in Open Source form to the community. In fact
we've specifically included the ability to "late bind" a driver into a
system that speaks EFI. In practice that late binding means that code
that uses EFI services and that is GPL code can be used on systems
that also include EFI code that is not open source.

The decision on whether to make any given driver Open Source or not
therefore lies with the creator of that code. In the case of ROM
content for an add-in card that will usually be the IHV that makes the
card.

...

The patent license grant is thus in some sense a double coverage
approach...you don't really need a patent license grant since there
aren't any patents that read but to reinforce that you don't need to
worry about patents we give you the grant anyway. This helped make
some corporate entities more comfortable about implementing support
for EFI.


=== Comparing EBC to x86 bytecode ===

However, the EBC interpreter isn't that much smaller than an x86
emulator.  Add in the fact that x86 bytecode generation is much more
common and proven, and it looks like EBC isn't as big a win as Intel
believes.

[Lo]


In this paper we present our preliminary results on FreeVGA, an x86
emulator based on x86emu that can be used as such a compatibility
layer. We will show how we have successfully used FreeVGA to
initialize VGA cards from both ATI and Nvidia on a Tyan S2885
platform.

...

Integrating FreeVGA into LinuxBIOS had virtually no impact on the size
of the resulting ROM image. The compressed ROM image only increased by
16KB, but because the final ROM image is padded to the nearest power
of 2, this increase was absorbed into the existing unused space. The
runtime size of the uncompressed image was only increased by 40KB.


=== LinuxBIOS ===

Since EFI doesn't deal with POST or setup, it could actually sit on
top of something like LinuxBIOS.  In fact, the OpenBIOS project is
already planning on putting their Open Firmware on top of LinuxBIOS.
Certainly, if LinuxBIOS can use an x86 emulator, it can use a Forth or
EBC interpreter.

Interestingly, EFI might be a good way to have more vendors use
LinuxBIOS in their products.  How much can vendors differentiate
themselves in POST and Setup?  Probably not much.  They will want to
differentiate themselves in what they put on top of the abstraction
layer.  If that is the case, then it would be in everyone's best
interest to adopt LinuxBIOS for the boring, unprofitable POST and
Setup for cost sharing reasons.

The romcc compiler developed for the LinuxBIOS project is another good
reason to use it because it reduces the amount of assembly code needed
to initialize a machine.

[Minnich]


In 2002, Eric Biederman of Linux NetworX developed a compiler called
romcc. romcc is a simple optimizing C compiler-one file, 25,043 lines
of code-that uses only registers, not memory. The compiler can use
extended register sets such as MMX, SSI or 3DNOW. romcc allowed us to
junk almost all of the assembly code in LinuxBIOS, so that even the
earliest code, run with no working DRAM, can be written in C.

romcc is used only for early, pre-memory code. For code that runs
after memory comes up, we use GCC.


=== References ===

Brown, Zack Kernel Traffic #231 For 10 Sep 2003.
http://web.archive.org/web/20030926022111/http://www.kerneltraffic.org/kernel-traffic/kt20030910_231.html#7

Intel.  Power Management History and Motivation.
http://www.intel.com/intelpress/samples/ppm_chapter.pdf

Lo, Li-Ta; Watson, Gregory R.; Minnich, Ronald G.  FreeVGA:
Architecture Independent Video Graphics Initialization for LinuxBIOS.
http://www.linuxbios.org/data/vgabios/

Minnich, Ronald G.  Porting LinuxBIOS to the AMD SC520.
http://www.linuxjournal.com/article/8120

Singh, Amit.  More Power to Firmware.
http://kernelthread.com/publications/firmware/

UEFI.  About UEFI.  http://www.uefi.org/about.asp

-- 
Andrew Shewmaker


From w.a.sellers at nasa.gov  Mon Nov 16 05:44:50 2009
From: w.a.sellers at nasa.gov (Sellers, William A. (LARC-D205)[NCI])
Date: Mon, 16 Nov 2009 07:44:50 -0600
Subject: [Beowulf] mpd ..failed ..!
In-Reply-To: <SNT125-W16C4BBD8FAA5AA6A496BD8D6A70@phx.gbl>
References: <SNT125-W16C4BBD8FAA5AA6A496BD8D6A70@phx.gbl>
Message-ID: <B236CA9710FCCF4EAB144418A4E47F9A3D891B19D6@NDMSSCC06.ndc.nasa.gov>


You need to create a .mpd.conf file in your home directory - ~/.mpd.conf and it must contain a line:


MPD_SECRETWORD=change-me-to-something-else

The file must be mode 600 or it will not work.  Make sure this file is shared among all the nodes.


Also I suggest using mpdboot from the first node instead of invoking mpd directly.  I've had better success with mpdboot.

Regards,
Bill


________________________________________
From: beowulf-bounces at beowulf.org [beowulf-bounces at beowulf.org] On Behalf Of Zain elabedin hammade [zenabdin1988 at hotmail.com]
Sent: Saturday, November 14, 2009 7:24 AM
To: beowulf at beowulf.org
Subject: [Beowulf] mpd ..failed ..!

Hello All.
I have a cluster with 4 machines (fedora core 11).
I installed mpich2 - 1.1.1-1.fc11.i586.rpm .
I wrote on every machine :

mpd &
mpdtrace -l

then i wrote  on thr Master :

mpd -h Worker1.cluster.net - p 56128 -n

I got :
Master.cluster.net_38047 (connect_lhs 944): NOT OK to enter ring; one likely cause: mismatched secretwords
Master.cluster.net_38047 (enter_ring 873): lhs connect failed
Master.cluster.net_38047 (run 256): failed to enter ring

And the same was for other machines : Worker2 and Worker3 .
For information :
I have SSH works on ..

So where is the problem ?
What i have to do ?
I really need your help .

Regarded .

________________________________
Windows Live Hotmail: Your friends can get your Faceb! ook updates, right from Hotmail?.<http://www.microsoft.com/middleeast/windows/windowslive/see-it-in-action/social-network-basics.aspx?ocid=PID23461::T:WLMTAGL:ON:WL:en-xm:SI_SB_4:092009>


From sabujp at gmail.com  Tue Nov 17 11:10:55 2009
From: sabujp at gmail.com (Sabuj Pattanayek)
Date: Tue, 17 Nov 2009 13:10:55 -0600
Subject: [Beowulf] How Would You Test Infiniband in New Cluster?
In-Reply-To: <4B02EC6D.60107@cse.ucdavis.edu>
References: <4B02E74F.8050103@berkeley.edu> <4B02EC6D.60107@cse.ucdavis.edu>
Message-ID: <e64671580911171110k61c0b2c6n54c7314056d5010c@mail.gmail.com>

Hi,

The OMB package for mvapich2 (mpich2 for IB) has some great programs
that you can use to test to see if your IB network is working
properly. Here are some of my results on our QDR IB network:

% mpiexec -n 2 ./osu_latency
# OSU MPI Latency Test v3.1.1
# Size            Latency (us)
0                         1.87
1                         1.95
2                         1.97
4                         1.98
8                         1.98
16                        1.99
32                        2.04
64                        2.19
128                       3.63
256                       3.91
512                       4.38
1024                      5.25
2048                      6.80
4096                      8.12
8192                     10.94
16384                    16.35
32768                    22.13
65536                    33.28
131072                   55.09
262144                  100.01
524288                  166.54
1048576                 333.60
2097152                 636.91
4194304                1252.71

% mpiexec -n 2 ./osu_bw
# OSU MPI Bandwidth Test v3.1.1
# Size        Bandwidth (MB/s)
1                         1.44
2                         2.79
4                         5.51
8                        11.18
16                       21.17
32                       43.42
64                       82.41
128                     146.97
256                     314.42
512                     564.18
1024                   1033.38
2048                   1634.33
4096                   2168.96
8192                   2514.58
16384                  2788.07
32768                  3038.48
65536                  3213.89
131072                 3293.78
262144                 3334.07
524288                 3353.60
1048576                3355.25
2097152                3362.15
4194304                3365.81

That's 3.3GB/s or ~26.4gbps .

HTH,
Sabuj Pattanayek

On Tue, Nov 17, 2009 at 12:33 PM, Bill Broadley <bill at cse.ucdavis.edu> wrote:
> Jon Forrest wrote:
>> Let's say you have a brand new cluster with
>> brand new Infiniband hardware, and that
>> you've installed OFED 1.4 and the
>> appropriate drivers for your IB
>> HCAs (i.e. you see ib0 devices
>> on the frontend and all compute nodes).
>> The cluster appears to be working
>> fine but you're not sure about IB.
>>
>> How would you test your IB network
>> to make sure all is well?


From angelv at iac.es  Tue Nov 17 01:44:43 2009
From: angelv at iac.es (=?ISO-8859-1?Q?=C1ngel_de_Vicente?=)
Date: Tue, 17 Nov 2009 09:44:43 +0000
Subject: [Beowulf] Step by step guide for the installation and configuration
 of a cluster (with Rocks) to run ParaView
Message-ID: <4B02708B.2070908@iac.es>

Hi all,

we have recently installed a small test cluster to run ParaView 
visualization software in parallel. The configuration was not trivial, 
and I put a detailed step-by-step guide in  
http://www.iac.es/sieinvens/siepedia/pmwiki.php?n=HOWTOs.ParaviewInACluster, 
in case it can be of interest to someone else. Any comments or 
suggestions are welcome.

Cheers,
?ngel de Vicente

-- 
+---------------------------------------------+
|                                             |
| http://www.iac.es/galeria/angelv/           |
|                                             |
| High Performance Computing Support PostDoc  |
| Instituto de Astrof?sica de Canarias        |
|                                             |
+---------------------------------------------+

---------------------------------------------------------------------------------------------
ADVERTENCIA: Sobre la privacidad y cumplimiento de la Ley de Protecci?n de Datos, acceda a http://www.iac.es/disclaimer.php
WARNING: For more information on privacy and fulfilment of the Law concerning the Protection of Data, consult http://www.iac.es/disclaimer.php?lang=en


From bill at cse.ucdavis.edu  Tue Nov 17 14:46:43 2009
From: bill at cse.ucdavis.edu (Bill Broadley)
Date: Tue, 17 Nov 2009 14:46:43 -0800
Subject: [Beowulf] How Would You Test Infiniband in New Cluster?
In-Reply-To: <4B02F263.9030607@berkeley.edu>
References: <4B02E74F.8050103@berkeley.edu> <4B02EC6D.60107@cse.ucdavis.edu>
	<4B02F263.9030607@berkeley.edu>
Message-ID: <4B0327D3.2070603@cse.ucdavis.edu>

Jon Forrest wrote:
> Bill Broadley wrote:
> 
>> My first suggest sanity test would be to test latency and bandwidth to
>> insure
>> you are getting IB numbers.  So 80-100MB/sec and 30-60us for a small
>> packet
>> would imply GigE.  6-8 times the bandwidth certainly would imply SDR or
>> better.  Latency varies quite a bit among implementation, I'd try to get
>> within 30-40% of advertised latency numbers.
> 
> For those of us who aren't familiar with IB utilities,
> could you give some examples of the commands you'd use
> to do this?
> 
> Thanks,
> Jon

Here's 2 that I use:
 http://cse.ucdavis.edu/bill/relay.c
 http://cse.ucdavis.edu/bill/mpi_nxnlatbw.c

So to compile, assuming a sane environment:
mpicc -O3 relay.c -o relay

The command to run an MPI program varies by environment and mpi
implementation, and batch queue environment (especially tight integration).
It should be something close to:
mpirun -np <number of nodes> -machinefile <list of nodes> ./relay 1
mpirun -np <number of nodes> -machinefile <list of nodes> ./relay 1024
mpirun -np <number of nodes> -machinefile <list of nodes> ./relay 8192

You should see something like:
c0-8 c0-22
size=     1,  16384 hops,  2 nodes in   0.75 sec ( 45.97 us/hop)     85 KB/sec
c0-8 c0-22
size=  1024,  16384 hops,  2 nodes in   2.00 sec (121.94 us/hop)  32803 KB/sec
c0-8 c0-22
size=  8192,  16384 hops,  2 nodes in   6.21 sec (379.05 us/hop)  84421 KB/sec

So basically on a tiny packet 45us of latency (normal for gigE), and on a
large package 84MB/sec or so (normal for GigE).

I'd start with 2 nodes, then if you are happy try it with all nodes.

Now for infiniband you should see something like:

c0-5 c0-4
size=     1,  16384 hops,  2 nodes in   0.03 sec (  1.72 us/hop)   2274 KB/sec
c0-5 c0-4
size=  1024,  16384 hops,  2 nodes in   0.16 sec (  9.92 us/hop) 403324 KB/sec
c0-5 c0-4
size=  8192,  16384 hops,  2 nodes in  0.50 sec ( 30.34 us/hop) 1054606 KB/sec

Note the latency is some 25 times less and the bandwidth some 10+ times
higher.  Note the hostnames are different, don't run multiple copies on the
same node unless you intend to.  Running 4 copies on a 4 cpu node doesn't test
infiniband.

So once you get what you expect I'd suggest something a bit more
comprehensive.  Something like:
mpirun -np <number of nodes> -machinefile <list of nodes> ./mpi_nxnlatbw

I'd expect some different in latency and bandwidth between nodes, but not any
big differences.  Something like:
[0<->1]		1.85us		1398.825264 (MillionBytes/sec)
[0<->2]		1.75us		1300.812337 (MillionBytes/sec)
[0<->3]		1.76us		1396.205242 (MillionBytes/sec)
[0<->4]		1.68us		1398.647324 (MillionBytes/sec)
[1<->0]		1.82us		1375.550155 (MillionBytes/sec)
[1<->2]		1.69us		1397.936020 (MillionBytes/sec)
...

Once those numbers are consistent and where you expect them (both latency and
bandwidth) I'd follow up with a production code that produces a known answer
and is likely to provide much wider MPI coverage.


From jlforrest at berkeley.edu  Tue Nov 17 16:26:29 2009
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Tue, 17 Nov 2009 16:26:29 -0800
Subject: [Beowulf] How Would You Test Infiniband in New Cluster?
In-Reply-To: <4B032DA5.2010106@cse.ucdavis.edu>
References: <4B02E74F.8050103@berkeley.edu> <4B02EC6D.60107@cse.ucdavis.edu>
	<4B02F263.9030607@berkeley.edu> <4B032DA5.2010106@cse.ucdavis.edu>
Message-ID: <4B033F35.4020106@berkeley.edu>

For what it's worth, I'm using 10 nodes, where each node
has 12 cores. I'm also using Rocks with the Mellonox roll.
My HCA is a Mellanox Technologies MT25204 [InfiniHost III Lx HCA]
(rev 20)

> mpirun -np <number of nodes> -machinefile <list of nodes> ./relay 1
> mpirun -np <number of nodes> -machinefile <list of nodes> ./relay 1024
> mpirun -np <number of nodes> -machinefile <list of nodes> ./relay 8192
> 
> You should see something like:
> c0-8 c0-22
> size=     1,  16384 hops,  2 nodes in   0.75 sec ( 45.97 us/hop)     85 KB/sec
> c0-8 c0-22
> size=  1024,  16384 hops,  2 nodes in   2.00 sec (121.94 us/hop)  32803 KB/sec
> c0-8 c0-22
> size=  8192,  16384 hops,  2 nodes in   6.21 sec (379.05 us/hop)  84421 KB/sec
> 
> So basically on a tiny packet 45us of latency (normal for gigE), and on a
> large package 84MB/sec or so (normal for GigE).
> 
> I'd start with 2 nodes, then if you are happy try it with all nodes.

Since there are 10 nodes, I did the following, with the results shown
(I removed the node names):

$ mpirun -np 10  -machinefile hosts ./relay 1
size=     1,  16384 hops, 10 nodes in   0.20 sec ( 12.44 us/hop)    314 
KB/sec
$ mpirun -np 10  -machinefile hosts ./relay 1024
size=  1024,  16384 hops, 10 nodes in   0.33 sec ( 20.40 us/hop) 196074 
KB/sec
$ mpirun -np 10  -machinefile hosts ./relay 8192
size=  8192,  16384 hops, 10 nodes in   0.97 sec ( 59.51 us/hop) 537734 
KB/sec

I believe these are with IB.

> So once you get what you expect I'd suggest something a bit more
> comprehensive.  Something like:
> mpirun -np <number of nodes> -machinefile <list of nodes> ./mpi_nxnlatbw
> 
> I'd expect some different in latency and bandwidth between nodes, but not any
> big differences.  Something like:
> [0<->1]		1.85us		1398.825264 (MillionBytes/sec)

I did the following, with the results shown:

$ mpirun -np 2  -machinefile hosts ./mpi_nxnlatbw
[0<->1]         3.67us          1289.409397 (MillionBytes/sec)
[1<->0]         3.67us          1276.377689 (MillionBytes/sec)

I also ran this with more nodes but the point-to-point
times were about the same.

Does this look right? Based on your numbers, it looks like my
IB is slower than yours. Because of the strange way the OFED
was installed, I can't easily run over just ethernet.

Thanks for your help


-- 
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest at berkeley.edu


From jlforrest at berkeley.edu  Tue Nov 17 17:01:12 2009
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Tue, 17 Nov 2009 17:01:12 -0800
Subject: [Beowulf] How Would You Test Infiniband in New Cluster?
In-Reply-To: <4B032DA5.2010106@cse.ucdavis.edu>
References: <4B02E74F.8050103@berkeley.edu> <4B02EC6D.60107@cse.ucdavis.edu>
	<4B02F263.9030607@berkeley.edu> <4B032DA5.2010106@cse.ucdavis.edu>
Message-ID: <4B034758.6040401@berkeley.edu>

I had said "I believe these are with IB."
Now I'm not so sure. I just did a

	"ifconfig ib0"

on all the nodes and they all say

           BROADCAST MULTICAST  MTU:65520  Metric:1
           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:256
           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

So, it doesn't look like any of these tests used IB,
although I'm not sure because some of those numbers
looked better than I'd expect for just 1Gb ethernet.

I'll have to figure out how to force IB when
using OpenMPI.

-- 
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest at berkeley.edu


From tom.elken at qlogic.com  Tue Nov 17 17:35:48 2009
From: tom.elken at qlogic.com (Tom Elken)
Date: Tue, 17 Nov 2009 17:35:48 -0800
Subject: [Beowulf] How Would You Test Infiniband in New Cluster?
In-Reply-To: <4B033F35.4020106@berkeley.edu>
References: <4B02E74F.8050103@berkeley.edu> <4B02EC6D.60107@cse.ucdavis.edu>
	<4B02F263.9030607@berkeley.edu> <4B032DA5.2010106@cse.ucdavis.edu>
	<4B033F35.4020106@berkeley.edu>
Message-ID: <35AAF1E4A771E142979F27B51793A4888702F998AE@AVEXMB1.qlogic.org>

> On Behalf Of Jon Forrest

> My HCA is a Mellanox Technologies MT25204 [InfiniHost III Lx HCA]
> (rev 20)
> 
> 
> I did the following, with the results shown:
> 
> $ mpirun -np 2  -machinefile hosts ./mpi_nxnlatbw
> [0<->1]         3.67us          1289.409397 (MillionBytes/sec)
> [1<->0]         3.67us          1276.377689 (MillionBytes/sec)
> 
> I also ran this with more nodes but the point-to-point
> times were about the same.
> 
> Does this look right?

For InfiniHost III, these numbers look right, and you are using IB.

You may get somewhat higher bandwidth using OSU MPI Benchmarks or Intel MPI Benchmarks (formerly Pallas) because a fairly modest message size is used by mpi_nxnlatbw's bandwidth test.  It is written to get somewhat close to peak bandwidth and best latency and run over a fairly large cluster in a reasonable amount of time.  But as a result, the bandwidth test runs so quickly that taking an OS interrupt can skew a few of the results.  Before concluding that a link is underperforming based on mpi_nxnlatbw, re-run the test to see if the same link is slow, or use another more comprehensive benchmark like OMB or IMB.

-Tom


> Based on your numbers, it looks like my
> IB is slower than yours. Because of the strange way the OFED
> was installed, I can't easily run over just ethernet.
> 
> Thanks for your help
> 
> 
> --
> Jon Forrest
> Research Computing Support
> College of Chemistry
> 173 Tan Hall
> University of California Berkeley
> Berkeley, CA
> 94720-1460
> 510-643-1032
> jlforrest at berkeley.edu
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
> Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf


From worringen at googlemail.com  Tue Nov 17 12:14:28 2009
From: worringen at googlemail.com (Joachim Worringen)
Date: Tue, 17 Nov 2009 12:14:28 -0800
Subject: [Beowulf] Final Announcement: 11th Annual Beowulf Bash 9pm Nov 16
	2009
In-Reply-To: <Pine.LNX.4.44.0911160137410.6955-100000@bluewest.scyld.com>
References: <Pine.LNX.4.44.0911160137410.6955-100000@bluewest.scyld.com>
Message-ID: <981e81f00911171214h68a899f0me2fbb11124dc90a5@mail.gmail.com>

On Mon, Nov 16, 2009 at 1:40 AM, Donald Becker <becker at scyld.com> wrote:

>
>
> Final Announcement: 11th Annual Beowulf Bash   9pm Nov 16 2009
>
>
>  11th Annual Beowulf Bash
>        And
>      LECCIBG
>
>  Thanks for this great event - Norman Sylvester rocks!

 Joachim
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091117/ffaf1b12/attachment.html>

From siegert at sfu.ca  Tue Nov 17 18:30:00 2009
From: siegert at sfu.ca (Martin Siegert)
Date: Tue, 17 Nov 2009 18:30:00 -0800
Subject: [Beowulf] How Would You Test Infiniband in New Cluster?
In-Reply-To: <4B034758.6040401@berkeley.edu>
References: <4B02E74F.8050103@berkeley.edu> <4B02EC6D.60107@cse.ucdavis.edu>
	<4B02F263.9030607@berkeley.edu> <4B032DA5.2010106@cse.ucdavis.edu>
	<4B034758.6040401@berkeley.edu>
Message-ID: <20091118023000.GB453@stikine.its.sfu.ca>

On Tue, Nov 17, 2009 at 05:01:12PM -0800, Jon Forrest wrote:
> I had said "I believe these are with IB."
> Now I'm not so sure. I just did a
>
> 	"ifconfig ib0"
>
> on all the nodes and they all say
>
>           BROADCAST MULTICAST  MTU:65520  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:256
>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

AFAIK, ifconfig ib0 will show you the ipoib numbers. Since MPI
(hopefully) is not using this, you see zeros.

> So, it doesn't look like any of these tests used IB,
> although I'm not sure because some of those numbers
> looked better than I'd expect for just 1Gb ethernet.
>
> I'll have to figure out how to force IB when
> using OpenMPI.

Edit your ~/.openmpi/mca-params.conf file and add the line

btl = ^tcp

That will explicitly prevent openmpi using tcp (it would use ib before
tcp by default, but this way it will fail if ib is not working).

> -- 
> Jon Forrest
> Research Computing Support
> College of Chemistry
> 173 Tan Hall
> University of California Berkeley
> Berkeley, CA
> 94720-1460
> 510-643-1032
> jlforrest at berkeley.edu
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

Cheers,
Martin

-- 
Martin Siegert
Head, Research Computing
WestGrid Site Lead
IT Services                                phone: 778 782-4691
Simon Fraser University                    fax:   778 782-4242
Burnaby, British Columbia                  email: siegert at sfu.ca
Canada  V5A 1S6


From gus at ldeo.columbia.edu  Tue Nov 17 19:09:02 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Tue, 17 Nov 2009 22:09:02 -0500
Subject: [Beowulf] How Would You Test Infiniband in New Cluster?
In-Reply-To: <20091118023000.GB453@stikine.its.sfu.ca>
References: <4B02E74F.8050103@berkeley.edu>
	<4B02EC6D.60107@cse.ucdavis.edu>	<4B02F263.9030607@berkeley.edu>
	<4B032DA5.2010106@cse.ucdavis.edu>	<4B034758.6040401@berkeley.edu>
	<20091118023000.GB453@stikine.its.sfu.ca>
Message-ID: <4B03654E.4010002@ldeo.columbia.edu>

Martin Siegert wrote:
> On Tue, Nov 17, 2009 at 05:01:12PM -0800, Jon Forrest wrote:
>> I had said "I believe these are with IB."
>> Now I'm not so sure. I just did a
>>
>> 	"ifconfig ib0"
>>
>> on all the nodes and they all say
>>
>>           BROADCAST MULTICAST  MTU:65520  Metric:1
>>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:256
>>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
> 
> AFAIK, ifconfig ib0 will show you the ipoib numbers. Since MPI
> (hopefully) is not using this, you see zeros.
> 
>> So, it doesn't look like any of these tests used IB,
>> although I'm not sure because some of those numbers
>> looked better than I'd expect for just 1Gb ethernet.
>>
>> I'll have to figure out how to force IB when
>> using OpenMPI.
> 
> Edit your ~/.openmpi/mca-params.conf file and add the line
> 
> btl = ^tcp
> 
> That will explicitly prevent openmpi using tcp (it would use ib before
> tcp by default, but this way it will fail if ib is not working).
> 

Hi Jon

Martin's suggestion is the the best, particularly if you plan to
always use IB, never use TCP.

Alternatively you could
include these mca parameters on the mpiexec command
line to select IB:

-mca btl openib,sm,self

OpenMPI has several mechanisms to make these choices.
See these FAQ:
http://www.open-mpi.org/faq/?category=sysadmin#sysadmin-mca-params
http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

My $0.02
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

>> -- 
>> Jon Forrest
>> Research Computing Support
>> College of Chemistry
>> 173 Tan Hall
>> University of California Berkeley
>> Berkeley, CA
>> 94720-1460
>> 510-643-1032
>> jlforrest at berkeley.edu
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> 
> Cheers,
> Martin
> 


From bill at cse.ucdavis.edu  Tue Nov 17 19:18:58 2009
From: bill at cse.ucdavis.edu (Bill Broadley)
Date: Tue, 17 Nov 2009 19:18:58 -0800
Subject: [Beowulf] How Would You Test Infiniband in New Cluster?
In-Reply-To: <4B034758.6040401@berkeley.edu>
References: <4B02E74F.8050103@berkeley.edu> <4B02EC6D.60107@cse.ucdavis.edu>
	<4B02F263.9030607@berkeley.edu> <4B032DA5.2010106@cse.ucdavis.edu>
	<4B034758.6040401@berkeley.edu>
Message-ID: <4B0367A2.4080803@cse.ucdavis.edu>

Jon Forrest wrote:
> I had said "I believe these are with IB."
> Now I'm not so sure. I just did a

The performance numbers you showed from relay and mpi_nxnlatbw are definitely
much faster than GigE.  Unless it's multiple copies running on a single
machine (thus printing the hostname).

Assuming that it was actually using the interconnect (not multiple copies
running on a single machine)

> 
>     "ifconfig ib0"

I suspect this is for TCPIP over ib, and doesn't show MPI traffic.  You didn't
mention which controllers do you have?  I suspect that there is a tool to show
the various counters on the HCA, let alone on the switch side.

> I'll have to figure out how to force IB when
> using OpenMPI.

Looks like IB to me, might want to do the reverse to see the real
differential, I find it very handy for cost justifying IB on future clusters
based on real application performance.


From deadline at eadline.org  Tue Nov 17 21:38:53 2009
From: deadline at eadline.org (Douglas Eadline)
Date: Wed, 18 Nov 2009 00:38:53 -0500 (EST)
Subject: [Beowulf] The Limulus Case
Message-ID: <51306.140.221.229.202.1258522733.squirrel@mail.eadline.org>


If you are at SC09 stop by the SICORP booth (1209) (I managed to
wrangle a pedestal) to see the the Limulus case - four microATX motherboards
in one case. I'll be around at times to answer questions.
Jess Cannata is also helping out.

If you are not at the show or want to see what I'm talking
about, you can see some pictures here:

http://limulus.basement-supercomputing.com/wiki/LimulusCase

BTW the Beobash was huge success, I think we had over 450
people. Pictures are up at InsideHPC

http://insidehpc.com/2009/11/17/beowulf-bash-2009-success/

--
Doug


From prentice at ias.edu  Tue Nov 17 21:11:41 2009
From: prentice at ias.edu (Prentice Bisbal)
Date: Tue, 17 Nov 2009 21:11:41 -0800
Subject: [Beowulf] How Would You Test Infiniband in New Cluster?
In-Reply-To: <4B02E74F.8050103@berkeley.edu>
References: <4B02E74F.8050103@berkeley.edu>
Message-ID: <4B03820D.8090305@ias.edu>

Jon Forrest wrote:
> Let's say you have a brand new cluster with
> brand new Infiniband hardware, and that
> you've installed OFED 1.4 and the
> appropriate drivers for your IB
> HCAs (i.e. you see ib0 devices
> on the frontend and all compute nodes).
> The cluster appears to be working
> fine but you're not sure about IB.
>
> How would you test your IB network
> to make sure all is well?
>
> Cordially,
I would start with the basic IB diagnostic utilities. On RHEL-based
systems, they are in the infiniband-diags rpm. I have limited experience
of them myself, but you can check the man pages. They may not give you
performance metrics, but can definitely help you determine if everything
is connected and working properly.

Here's a list of the commands available from this package in my RHEL 5.3
rebuild:

$ rpm -ql infiniband-diags | grep bin
/usr/sbin/check_lft_balance.pl
/usr/sbin/dump_lfts.sh
/usr/sbin/dump_mfts.sh
/usr/sbin/ibaddr
/usr/sbin/ibcheckerrors
/usr/sbin/ibcheckerrs
/usr/sbin/ibchecknet
/usr/sbin/ibchecknode
/usr/sbin/ibcheckport
/usr/sbin/ibcheckportstate
/usr/sbin/ibcheckportwidth
/usr/sbin/ibcheckstate
/usr/sbin/ibcheckwidth
/usr/sbin/ibclearcounters
/usr/sbin/ibclearerrors
/usr/sbin/ibdatacounters
/usr/sbin/ibdatacounts
/usr/sbin/ibdiscover.pl
/usr/sbin/ibfindnodesusing.pl
/usr/sbin/ibhosts
/usr/sbin/ibidsverify.pl
/usr/sbin/iblinkinfo.pl
/usr/sbin/ibnetdiscover
/usr/sbin/ibnodes
/usr/sbin/ibping
/usr/sbin/ibportstate
/usr/sbin/ibprintca.pl
/usr/sbin/ibprintrt.pl
/usr/sbin/ibprintswitch.pl
/usr/sbin/ibqueryerrors.pl
/usr/sbin/ibroute
/usr/sbin/ibrouters
/usr/sbin/ibstat
/usr/sbin/ibstatus
/usr/sbin/ibswitches
/usr/sbin/ibswportwatch.pl
/usr/sbin/ibsysstat
/usr/sbin/ibtracert
/usr/sbin/perfquery
/usr/sbin/saquery
/usr/sbin/set_nodedesc.sh
/usr/sbin/sminfo
/usr/sbin/smpdump
/usr/sbin/smpquery
/usr/sbin/vendstat


From prentice at ias.edu  Tue Nov 17 21:21:27 2009
From: prentice at ias.edu (Prentice Bisbal)
Date: Tue, 17 Nov 2009 21:21:27 -0800
Subject: [Beowulf] The Limulus Case
In-Reply-To: <51306.140.221.229.202.1258522733.squirrel@mail.eadline.org>
References: <51306.140.221.229.202.1258522733.squirrel@mail.eadline.org>
Message-ID: <4B038457.4060303@ias.edu>

Douglas Eadline wrote:
> If you are at SC09 stop by the SICORP booth (1209) (I managed to
> wrangle a pedestal) to see the the Limulus case - four microATX motherboards
> in one case. I'll be around at times to answer questions.
> Jess Cannata is also helping out.
>
> If you are not at the show or want to see what I'm talking
> about, you can see some pictures here:
>
> http://limulus.basement-supercomputing.com/wiki/LimulusCase
>   
Doug,

I was wandering around looking for this today. I wish I had my laptop
with me to get this information then.
> BTW the Beobash was huge success, I think we had over 450
> people. Pictures are up at InsideHPC
>
> http://insidehpc.com/2009/11/17/beowulf-bash-2009-success/
>   
If Walt wants some accompaniment next year, I can bring my bass and Bill
Wichser from Princeton said he'd bring his harmonica. At least he said
he would last night. He might not remember this morning.

Prentice


From deadline at eadline.org  Wed Nov 18 07:50:48 2009
From: deadline at eadline.org (Douglas Eadline)
Date: Wed, 18 Nov 2009 10:50:48 -0500 (EST)
Subject: [Beowulf] The Limulus Case
In-Reply-To: <4B038457.4060303@ias.edu>
References: <51306.140.221.229.202.1258522733.squirrel@mail.eadline.org>
	<4B038457.4060303@ias.edu>
Message-ID: <35706.173.8.196.93.1258559448.squirrel@mail.eadline.org>


Look for the Appro booth, it is right next to it.
On Thursday I'll have more time, I can open the case
and play with it some more.

Today I'm continuing my totally un-professional
video interviews for Linux magazine. You know
an "HPC gone Wild" kind of thing.

If you see me with the camera, say hi share your thoughts,
or what ever.

--
Doug


> Douglas Eadline wrote:
>> If you are at SC09 stop by the SICORP booth (1209) (I managed to
>> wrangle a pedestal) to see the the Limulus case - four microATX
>> motherboards
>> in one case. I'll be around at times to answer questions.
>> Jess Cannata is also helping out.
>>
>> If you are not at the show or want to see what I'm talking
>> about, you can see some pictures here:
>>
>> http://limulus.basement-supercomputing.com/wiki/LimulusCase
>>
> Doug,
>
> I was wandering around looking for this today. I wish I had my laptop
> with me to get this information then.
>> BTW the Beobash was huge success, I think we had over 450
>> people. Pictures are up at InsideHPC
>>
>> http://insidehpc.com/2009/11/17/beowulf-bash-2009-success/
>>
> If Walt wants some accompaniment next year, I can bring my bass and Bill
> Wichser from Princeton said he'd bring his harmonica. At least he said
> he would last night. He might not remember this morning.
>
> Prentice
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>


--
Doug


From prentice at ias.edu  Wed Nov 18 08:00:46 2009
From: prentice at ias.edu (Prentice Bisbal)
Date: Wed, 18 Nov 2009 08:00:46 -0800
Subject: [Beowulf] The Limulus Case
In-Reply-To: <35706.173.8.196.93.1258559448.squirrel@mail.eadline.org>
References: <51306.140.221.229.202.1258522733.squirrel@mail.eadline.org>
	<4B038457.4060303@ias.edu>
	<35706.173.8.196.93.1258559448.squirrel@mail.eadline.org>
Message-ID: <4B041A2E.5030801@ias.edu>

Douglas Eadline wrote:
> Look for the Appro booth, it is right next to it.
> On Thursday I'll have more time, I can open the case
> and play with it some more.
>
> Today I'm continuing my totally un-professional
> video interviews for Linux magazine. You know
> an "HPC gone Wild" kind of thing.
>
> If you see me with the camera, say hi share your thoughts,
> or what ever.
>
> --
> Doug
>
>   
Can I lift up my shirt and flash the audience?

Prentice


From jbardin at bu.edu  Thu Nov 19 10:18:41 2009
From: jbardin at bu.edu (james bardin)
Date: Thu, 19 Nov 2009 13:18:41 -0500
Subject: [Beowulf] Large raid rebuild times
Message-ID: <a3b675320911191018s43cba643p55d56c5df8b759ec@mail.gmail.com>

Hello,

Has anyone here seen any numbers, or tested themselves, the rebuild
times for large raid arrays (raid 6 specifically)?
I can't seem to find anything concrete to go by, and I haven't had to
rebuild anything larger than a few TB. What happens when you loose a
2TB drive in a 20TB array for instance? Do any hardware raid solutions
help. I don't think ZFS is an option right now, so I'm looking at
Linux and/or hardware raid.


Thanks
-jim


From vanallsburg at hope.edu  Thu Nov 19 11:18:49 2009
From: vanallsburg at hope.edu (Paul Van Allsburg)
Date: Thu, 19 Nov 2009 14:18:49 -0500
Subject: [Beowulf] Large raid rebuild times
In-Reply-To: <a3b675320911191018s43cba643p55d56c5df8b759ec@mail.gmail.com>
References: <a3b675320911191018s43cba643p55d56c5df8b759ec@mail.gmail.com>
Message-ID: <4B059A19.909@hope.edu>

james bardin wrote:
> Hello,
>
> Has anyone here seen any numbers, or tested themselves, the rebuild
> times for large raid arrays (raid 6 specifically)?
> I can't seem to find anything concrete to go by, and I haven't had to
> rebuild anything larger than a few TB. What happens when you loose a
> 2TB drive in a 20TB array for instance? Do any hardware raid solutions
> help. I don't think ZFS is an option right now, so I'm looking at
> Linux and/or hardware raid.
>
>
> Thanks
> -jim
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
>   
Jim,

I have a new Infortrend a24s-g2130 with 24 x 1T drives. I setup a raid6 
with 2 spares and during the raid initialization a drive failed. The 
raid6 initialization took 20 hours and the rebuild took 16 hours.  I've 
since changed the rebuild priority from normal to high but have not had 
any more drives fail.

Cheers,
Paul


http://www.infortrend.com/main/2_product/es_a24s-g2130.asp

-- 
Paul Van Allsburg       
Scientific Computing Specialist
Natural Sciences Division, Hope College
35 E. 12th St.  Holland, Michigan 49423
616-395-7292    vanallsburg at hope.edu
http://www.hope.edu/academic/csm/


From rsandilands at authentium.com  Thu Nov 19 12:29:12 2009
From: rsandilands at authentium.com (Robert Sandilands)
Date: Thu, 19 Nov 2009 15:29:12 -0500
Subject: [Beowulf] Re: Large raid rebuild times
In-Reply-To: <200911192000.nAJK077q009800@bluewest.scyld.com>
References: <200911192000.nAJK077q009800@bluewest.scyld.com>
Message-ID: <4B05AA98.4020401@authentium.com>

I am currently rebuilding a 28 x 1 TB RAID 5 volume that is based on the 
SurfRAID TRITON 16S3 with a JBOD unit. It is 16.6% complete after 2 1/2 
hours.

I am rebuilding before copying the data and returning all the Seagate 1 
TB drives. Replacing it with Hitachi 2 TB drives and RAID 6.

The combination of a large number of drives in RAID 5 and/or Seagate 1 
TB drives is not to be recommended.

Robert

beowulf-request at beowulf.org wrote:
>
> Hello,
>
> Has anyone here seen any numbers, or tested themselves, the rebuild
> times for large raid arrays (raid 6 specifically)?
> I can't seem to find anything concrete to go by, and I haven't had to
> rebuild anything larger than a few TB. What happens when you loose a
> 2TB drive in a 20TB array for instance? Do any hardware raid solutions
> help. I don't think ZFS is an option right now, so I'm looking at
> Linux and/or hardware raid.
>
>
>   


From dimitrios.v.gerasimatos at jpl.nasa.gov  Thu Nov 19 20:37:17 2009
From: dimitrios.v.gerasimatos at jpl.nasa.gov (Gerasimatos, Dimitrios V (343K))
Date: Thu, 19 Nov 2009 20:37:17 -0800
Subject: [Beowulf] Re: Large raid rebuild times 
In-Reply-To: <200911192000.nAJK077r009800@bluewest.scyld.com>
References: <200911192000.nAJK077r009800@bluewest.scyld.com>
Message-ID: <6F127CF61C0FE143B5E46BF06036094F952A7CC595@ALTPHYEMBEVSP30.RES.AD.JPL>


If rebuild times are a problem then I recommend you go with ZFS or a Netapp.
High-performance Netapp can be expensive. ZFS doesn't have to be. A fsck
of a large filesystem can easily take > 24 hours.


Dimitri


--
Dimitrios Gerasimatos                     dimitrios.gerasimatos at jpl.nasa.gov
Section 343                                        Jet Propulsion Laboratory
4800 Oak Grove Dr.           Mail Stop 264-820           Pasadena, CA  91109
Voice: 818.354.4910           FAX: 818.393.7413           Cell: 818.726.8617


-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of beowulf-request at beowulf.org
Sent: Thursday, November 19, 2009 12:01 PM
To: beowulf at beowulf.org
Subject: Beowulf Digest, Vol 69, Issue 22

Send Beowulf mailing list submissions to
	beowulf at beowulf.org

To subscribe or unsubscribe via the World Wide Web, visit
	http://www.beowulf.org/mailman/listinfo/beowulf
or, via email, send a message with subject or body 'help' to
	beowulf-request at beowulf.org

You can reach the person managing the list at
	beowulf-owner at beowulf.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Beowulf digest..."


Today's Topics:

   1. Large raid rebuild times (james bardin)
   2. Re: Large raid rebuild times (Paul Van Allsburg)


----------------------------------------------------------------------

Message: 1
Date: Thu, 19 Nov 2009 13:18:41 -0500
From: james bardin <jbardin at bu.edu>
Subject: [Beowulf] Large raid rebuild times
To: beowulf at beowulf.org
Message-ID:
	<a3b675320911191018s43cba643p55d56c5df8b759ec at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Hello,

Has anyone here seen any numbers, or tested themselves, the rebuild
times for large raid arrays (raid 6 specifically)?
I can't seem to find anything concrete to go by, and I haven't had to
rebuild anything larger than a few TB. What happens when you loose a
2TB drive in a 20TB array for instance? Do any hardware raid solutions
help. I don't think ZFS is an option right now, so I'm looking at
Linux and/or hardware raid.


Thanks
-jim


------------------------------

Message: 2
Date: Thu, 19 Nov 2009 14:18:49 -0500
From: Paul Van Allsburg <vanallsburg at hope.edu>
Subject: Re: [Beowulf] Large raid rebuild times
To: james bardin <jbardin at bu.edu>
Cc: beowulf at beowulf.org
Message-ID: <4B059A19.909 at hope.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

james bardin wrote:
> Hello,
>
> Has anyone here seen any numbers, or tested themselves, the rebuild
> times for large raid arrays (raid 6 specifically)?
> I can't seem to find anything concrete to go by, and I haven't had to
> rebuild anything larger than a few TB. What happens when you loose a
> 2TB drive in a 20TB array for instance? Do any hardware raid solutions
> help. I don't think ZFS is an option right now, so I'm looking at
> Linux and/or hardware raid.
>
>
> Thanks
> -jim
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>
>   
Jim,

I have a new Infortrend a24s-g2130 with 24 x 1T drives. I setup a raid6 
with 2 spares and during the raid initialization a drive failed. The 
raid6 initialization took 20 hours and the rebuild took 16 hours.  I've 
since changed the rebuild priority from normal to high but have not had 
any more drives fail.

Cheers,
Paul


http://www.infortrend.com/main/2_product/es_a24s-g2130.asp

-- 
Paul Van Allsburg       
Scientific Computing Specialist
Natural Sciences Division, Hope College
35 E. 12th St.  Holland, Michigan 49423
616-395-7292    vanallsburg at hope.edu
http://www.hope.edu/academic/csm/


------------------------------

_______________________________________________
Beowulf mailing list
Beowulf at beowulf.org
http://www.beowulf.org/mailman/listinfo/beowulf


End of Beowulf Digest, Vol 69, Issue 22
***************************************


From eagles051387 at gmail.com  Fri Nov 20 02:21:26 2009
From: eagles051387 at gmail.com (Jonathan Aquilina)
Date: Fri, 20 Nov 2009 10:21:26 +0000
Subject: [Beowulf] Re: Large raid rebuild times
In-Reply-To: <6F127CF61C0FE143B5E46BF06036094F952A7CC595@ALTPHYEMBEVSP30.RES.AD.JPL>
References: <200911192000.nAJK077r009800@bluewest.scyld.com> 
	<6F127CF61C0FE143B5E46BF06036094F952A7CC595@ALTPHYEMBEVSP30.RES.AD.JPL>
Message-ID: <a31cd3860911200221k7cde6654l97efcee3627be071@mail.gmail.com>

wouldnt the limiting factor be the read and write times of the drives in
regards to rebuilding an array?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091120/d61572f8/attachment.html>

From pal at di.fct.unl.pt  Fri Nov 20 04:13:12 2009
From: pal at di.fct.unl.pt (Paulo Afonso Lopes)
Date: Fri, 20 Nov 2009 12:13:12 -0000 (WET)
Subject: [Beowulf] Re: Large raid rebuild times
Message-ID: <21910.89.180.105.138.1258719192.squirrel@webmail.fct.unl.pt>

(Sorry: I forgot to cc the mailing list)

> wouldnt the limiting factor be the read and write times of the drives in
regards to rebuilding an array?

I would expect the write BW of a single drive (the one being rebuilt) in
sequential access mode to be the limiting factor.

If, in a 5 drive RAID-5, a single drive sustains, say, 50 MB/s write BW
(and, say, 50MB/s read for simplicity):

  - time for reading 4 drives in parallel == time for reading 1 drive
   (e.g., 1TB at 50MB/s == 5,5 hours)

If the data recovery compute time is negligible (I expect it to be), then
the time to rebuild a 5 disk RAID5 with 1 TB disks should be better than
6h, provided that the "host" is able to sustain more than 250 MB/s for the
disks.

If you are using a disk-array (not an entry-level cheapo, but FC ones like
IBM DS4800, HP EVA 6000, EMC CX500 or better - these are old models) with
no activity other than the rebuilding, it should take less than 6h for a
5x 1TB/drive RAID5, IMO.

Regards,


-- 
Paulo Afonso Lopes                        | Tel: +351- 21 294 8536
Departamento de Inform?tica               | 294 8300 ext.10702
Faculdade de Ci?ncias e Tecnologia        | Fax: +351- 21 294 8541
Universidade Nova de Lisboa               | e-mail: poral at fct.unl.pt
2829-516 Caparica, PORTUGAL


From landman at scalableinformatics.com  Fri Nov 20 06:32:08 2009
From: landman at scalableinformatics.com (Joe Landman)
Date: Fri, 20 Nov 2009 09:32:08 -0500
Subject: [Beowulf] Re: Large raid rebuild times
In-Reply-To: <4B05AA98.4020401@authentium.com>
References: <200911192000.nAJK077q009800@bluewest.scyld.com>
	<4B05AA98.4020401@authentium.com>
Message-ID: <4B06A868.5050000@scalableinformatics.com>

Robert Sandilands wrote:
> I am currently rebuilding a 28 x 1 TB RAID 5 volume that is based on the 

Hmmm....

> SurfRAID TRITON 16S3 with a JBOD unit. It is 16.6% complete after 2 1/2 
> hours.
> 
> I am rebuilding before copying the data and returning all the Seagate 1 
> TB drives. Replacing it with Hitachi 2 TB drives and RAID 6.
> 
> The combination of a large number of drives in RAID 5 and/or Seagate 1 
> TB drives is not to be recommended.

Actually, you shouldn't be using RAID5 any more.  The bit error rates 
suggest that you will in all likelihood, hit an uncorrectable error 
within a very small number of rebuilds, and lose data.

This is fairly well known at this point, so I have to admit surprise to 
hear of anyone using 20+ drives of TB size in a RAID5.

On Seagate, YMMV, they have been rock solid for us and our customers 
(thousands shipped, scales of PBs of storage).  Failure rates somewhat 
above their statistical estimates, very much in line with what Google 
indicates they have observed with their drives.

We haven't used Hitachi very much in our units, so I can't comment much 
on their quality/failure rate.

Joe


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


From landman at scalableinformatics.com  Fri Nov 20 06:43:48 2009
From: landman at scalableinformatics.com (Joe Landman)
Date: Fri, 20 Nov 2009 09:43:48 -0500
Subject: [Beowulf] Large raid rebuild times
In-Reply-To: <a3b675320911191018s43cba643p55d56c5df8b759ec@mail.gmail.com>
References: <a3b675320911191018s43cba643p55d56c5df8b759ec@mail.gmail.com>
Message-ID: <4B06AB24.7020302@scalableinformatics.com>

james bardin wrote:
> Hello,
> 
> Has anyone here seen any numbers, or tested themselves, the rebuild

Hi James,

   Yes, we test this (by purposely failing a drive on the storage we 
ship to our customers).

> times for large raid arrays (raid 6 specifically)?

   Yes, RAID6 with up to 24 drives per RAID, up to 2TB drives.  Due to 
the bit error rate failure models for the uncorrectable errors on disks, 
RAID5 is *strongly* contra-indicated for storage of more than a few 
small TB, and more than a few drives (less then 5 and less than 1TB). 
The risk of a second failure during rebuild is simply unacceptably high, 
which would/does permanently take out your data in RAID5.

> I can't seem to find anything concrete to go by, and I haven't had to
> rebuild anything larger than a few TB. What happens when you loose a
> 2TB drive in a 20TB array for instance? Do any hardware raid solutions
> help. I don't think ZFS is an option right now, so I'm looking at

   We have customers with 32TB raw per RAID, and when a drive fails, it 
rebuilds.  Rebuild time is a function of how fast the card is set up to 
do rebuilds, you can tune the better cards in terms of "background" 
rebuild performance.  For low rebuild speeds, we have seen 24 hours+, 
for high rebuild speeds, we have seen 12-15 hours for the 32TB.

   ZFS is probably not what you want to do ... building a critical 
dependency upon a product that has a somewhat uncertain future ...

Joe


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


From jbardin at bu.edu  Fri Nov 20 08:56:18 2009
From: jbardin at bu.edu (james bardin)
Date: Fri, 20 Nov 2009 11:56:18 -0500
Subject: [Beowulf] Large raid rebuild times
In-Reply-To: <4B06AB24.7020302@scalableinformatics.com>
References: <a3b675320911191018s43cba643p55d56c5df8b759ec@mail.gmail.com>
	<4B06AB24.7020302@scalableinformatics.com>
Message-ID: <a3b675320911200856scfcf745y2ddc64ed8ae83df6@mail.gmail.com>

Hi Joe,

On Fri, Nov 20, 2009 at 9:43 AM, Joe Landman
<landman at scalableinformatics.com> wrote:
>
> ?We have customers with 32TB raw per RAID, and when a drive fails, it
> rebuilds. ?Rebuild time is a function of how fast the card is set up to do
> rebuilds, you can tune the better cards in terms of "background" rebuild
> performance. ?For low rebuild speeds, we have seen 24 hours+, for high
> rebuild speeds, we have seen 12-15 hours for the 32TB.
>

Thanks for that. That sounds inline with my expectations.
Any chance you've compared linux md raid6 to hardware solutions for
your devices?


> ?ZFS is probably not what you want to do ... building a critical dependency
> upon a product that has a somewhat uncertain future ...
>

I'm not too worried about zfs - it has plenty of following. I'm
personally waiting for btrfs to stabilize so we can start testing, but
that's a ways off. The issue here is that the group setting up these
storage servers is an all linux shop, and they don't want the overhead
of another OS.


Thanks
-jim


From hahn at mcmaster.ca  Fri Nov 20 13:10:23 2009
From: hahn at mcmaster.ca (Mark Hahn)
Date: Fri, 20 Nov 2009 16:10:23 -0500 (EST)
Subject: [Beowulf] Re: Large raid rebuild times
In-Reply-To: <a31cd3860911200221k7cde6654l97efcee3627be071@mail.gmail.com>
References: <200911192000.nAJK077r009800@bluewest.scyld.com> 
	<6F127CF61C0FE143B5E46BF06036094F952A7CC595@ALTPHYEMBEVSP30.RES.AD.JPL>
	<a31cd3860911200221k7cde6654l97efcee3627be071@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0911201552260.7542@coffee.psychology.mcmaster.ca>

> wouldnt the limiting factor be the read and write times of the drives in
> regards to rebuilding an array?

assuming your controller/xor engine/bus/etc are not a bottleneck,
rebuild takes 2 * singleDiskSize / singleDiskSpeed.  for a current-gen
2TB disk at 100 MB/s, that's only 6 hours.  (in particular, rebuild 
speed need not be proportional to the total volume size, since all data
blocks can be read concurrently.)  since people are talking about
much longer times than that, they must have bottlenecks (or slow disks).
online reconstruction (where the volume is available for use while
rebuilding) normally causes a significant slowdown as well.


From amjad11 at gmail.com  Sat Nov 21 17:51:26 2009
From: amjad11 at gmail.com (amjad ali)
Date: Sat, 21 Nov 2009 20:51:26 -0500
Subject: [Beowulf] Performance profiling/tuning on different systems
Message-ID: <428810f20911211751r37749319v9cd9e1cd01d021d5@mail.gmail.com>

Hi all,

Suppose a code is tuned on a specific system (e.g Intel Xeon based using
Vtune or Trace Collector). Then to how much extent this tuning will be
useful if this code is compiled and run on some other system (e.g. AMD
Opteron based)? Means whether the code tuning performed at one system
(possibly with a Profiler specific for that system)  is almost equivalently
good on another system ? Or we need to tune it again (possibly using a
Profiler specific to the new system).

Thank you for your attention.

A.Ali
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091121/ab1a7244/attachment.html>

From thakur at mcs.anl.gov  Sun Nov 22 15:55:58 2009
From: thakur at mcs.anl.gov (Rajeev Thakur)
Date: Sun, 22 Nov 2009 17:55:58 -0600
Subject: [Beowulf] FW: MPI Forum community feedback survey
Message-ID: <E2CCE9B8A7414EDF994221DBDA9EF753@thakurlaptop>


-----Original Message-----
From: Jeff Squyres
Sent: Friday, November 20, 2009 4:01 PM
Subject: MPI Forum community feedback survey

The MPI Forum announced at its SC09 BOF that they are soliciting  
community feedback to help guide the MPI-3 standards process.  A  
survey is available online at the following URL:

     http://mpi-forum.questionpro.com/
     Password: mpi3

In this survey, the MPI Forum is asking as many people as possible for  
feedback on the MPI-3 process -- what features to include, what  
features to not include, etc.

We encourage you to forward this survey on to as many interested and  
relevant parties as possible.

It will take approximately 10 minutes to complete the questionnaire.

No question in the survey is mandatory; feel free to only answer the  
questions which are relevant to you and your applications. Your  
answers will help the MPI Forum guide its process to create a  
genuinely useful MPI-3 standard.

This survey closes December 31, 2009.

Your survey responses will be strictly confidential and data from this  
research will be reported only in the aggregate. Your information will  
be coded and will remain confidential. If you have questions at any  
time about the survey or the procedures, you may contact the MPI Forum  
via email to mpi-comments at mpi-forum.org.

Thank you very much for your time and support.

-- 
Jeff Squyres


From csamuel at vpac.org  Sun Nov 22 19:57:44 2009
From: csamuel at vpac.org (Chris Samuel)
Date: Mon, 23 Nov 2009 14:57:44 +1100 (EST)
Subject: [Beowulf] Performance profiling/tuning on different systems
In-Reply-To: <428810f20911211751r37749319v9cd9e1cd01d021d5@mail.gmail.com>
Message-ID: <26188990.651258948662183.JavaMail.csamuel@sys26>


----- "amjad ali" <amjad11 at gmail.com> wrote:

> Hi all,

Hiya,

> Or we need to tune it again (possibly using a Profiler
> specific to the new system).

The only way to know for certain is to test it and see.

Best of luck!
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


From brockp at umich.edu  Mon Nov 23 07:13:57 2009
From: brockp at umich.edu (Brock Palen)
Date: Mon, 23 Nov 2009 10:13:57 -0500
Subject: [Beowulf] SC09 podcast for those who missed the show
Message-ID: <089B1C9F-F3A7-467D-A288-1EFFBE647571@umich.edu>

For those of you who did not make it to sc09, Jeff Squyres and Brock  
Palen (that's me), did a special version of our podcast (rce- 
cast.com)  on some of the things we took away from the show.

Show notes and mp3 download
http://www.rce-cast.com/index.php/Podcast/rce-21-sc09-supercomputing-09.html

iTunes Subscribe:
http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewPodcast?id=302882307

RSS Feed:
http://www.rce-cast.com/index.php/component/option,com_bca-rss-syndicator/feed_id,1/

Feel free to contact me off list with show ideas!

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985


From thakur at mcs.anl.gov  Sun Nov 22 16:06:54 2009
From: thakur at mcs.anl.gov (Rajeev Thakur)
Date: Sun, 22 Nov 2009 18:06:54 -0600
Subject: [Beowulf] [hpc-announce] FW: MPI Forum community feedback survey
Message-ID: <C1F53845AA904126AB4C65B7F313247A@thakurlaptop>

-----Original Message-----
From: Jeff Squyres
Sent: Friday, November 20, 2009 4:01 PM
Subject: MPI Forum community feedback survey

The MPI Forum announced at its SC09 BOF that they are soliciting community feedback to help guide the MPI-3 standards process.  A
survey is available online at the following URL:

     http://mpi-forum.questionpro.com/
     Password: mpi3

In this survey, the MPI Forum is asking as many people as possible for feedback on the MPI-3 process -- what features to include,
what features to not include, etc.

We encourage you to forward this survey on to as many interested and relevant parties as possible.

It will take approximately 10 minutes to complete the questionnaire.

No question in the survey is mandatory; feel free to only answer the questions which are relevant to you and your applications. Your
answers will help the MPI Forum guide its process to create a genuinely useful MPI-3 standard.

This survey closes December 31, 2009.

Your survey responses will be strictly confidential and data from this research will be reported only in the aggregate. Your
information will be coded and will remain confidential. If you have questions at any time about the survey or the procedures, you
may contact the MPI Forum via email to mpi-comments at mpi-forum.org.

Thank you very much for your time and support.

--
Jeff Squyres


From rpnabar at gmail.com  Mon Nov 23 16:31:47 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Mon, 23 Nov 2009 18:31:47 -0600
Subject: [Beowulf] UEFI (Unified Extensible Firmware Interface) for the 
	BIOS
In-Reply-To: <f4050c9e0911171121s4ad379f7la78ccefe5d32cd28@mail.gmail.com>
References: <c4d69730911121647m2f538e04v98780934f6b72cbd@mail.gmail.com>
	<f4050c9e0911171121s4ad379f7la78ccefe5d32cd28@mail.gmail.com>
Message-ID: <c4d69730911231631q6c58d43ehaeeda26112434d6d@mail.gmail.com>

On Tue, Nov 17, 2009 at 1:21 PM, Andrew Shewmaker <agshew at gmail.com> wrote:
>
> Here's something on EFI I wrote up for myself in 2005. ?It's a bit out
> of date, but it covers some stuff that wikipedia doesn't. ?In
> particular, I would read the old Kernel Traffic to understand how
> various developers dislike EFI.

Thanks Andrew! Very useful.  That explains a lot of stuff about EFI.

-- 
Rahul


From lindahl at pbm.com  Mon Nov 23 16:35:34 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Mon, 23 Nov 2009 16:35:34 -0800
Subject: [Beowulf] CentOS plus Fedora kernel?
Message-ID: <20091124003534.GA15927@bx9.net>

For reasons complicated to explain, I want to run a Fedora kernel on
CentOS 5. Does anyone have any words of wisdom or pointers to webpages
for people who've done this?

-- greg

p.s. missed you guys at SC, I was stuck racking 500 servers...
http://www.flickr.com/photos/skrenta/sets/72157622738924345/


From geoff at galitz.org  Tue Nov 24 03:05:54 2009
From: geoff at galitz.org (Geoff Galitz)
Date: Tue, 24 Nov 2009 12:05:54 +0100
Subject: [Beowulf] CentOS plus Fedora kernel?
In-Reply-To: <20091124003534.GA15927@bx9.net>
References: <20091124003534.GA15927@bx9.net>
Message-ID: <486144B3B40B401D9D45AAE089D7D5BD@geoffPC>


It would be safer to custom build a new Centos kernel.  There are also
"enhanced" kernels available for Centos in the centosplus software
repository.  They include support for technologies not normally found in
Centos.

Without knowing more about what you are looking for, I'd recommend checking
out centosplus before exploring kernels from alternate distributions.  

That is the only wisdom I have... as minor as it is.


-geoff

---------------------------------
Geoff Galitz
Blankenheim NRW, Germany
http://www.galitz.org/
http://german-way.com/blog/

> -----Original Message-----
> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On
> Behalf Of Greg Lindahl
> Sent: Dienstag, 24. November 2009 01:36
> To: beowulf at beowulf.org
> Subject: [Beowulf] CentOS plus Fedora kernel?
> 
> For reasons complicated to explain, I want to run a Fedora kernel on
> CentOS 5. Does anyone have any words of wisdom or pointers to webpages
> for people who've done this?
> 
> -- greg
> 
> p.s. missed you guys at SC, I was stuck racking 500 servers...
> http://www.flickr.com/photos/skrenta/sets/72157622738924345/
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf


From eagles051387 at gmail.com  Tue Nov 24 03:54:45 2009
From: eagles051387 at gmail.com (Jonathan Aquilina)
Date: Tue, 24 Nov 2009 12:54:45 +0100
Subject: [Beowulf] CentOS plus Fedora kernel?
In-Reply-To: <486144B3B40B401D9D45AAE089D7D5BD@geoffPC>
References: <20091124003534.GA15927@bx9.net>
	<486144B3B40B401D9D45AAE089D7D5BD@geoffPC>
Message-ID: <a31cd3860911240354n15aaa42bre862997cce71ef5@mail.gmail.com>

you also have to ask yourself what does the fedora kernel have that centos
doesnt and that you cant add with a recompilation of the kernel or as
mentioned in the previous email from the centosplus repo.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091124/e07d9d48/attachment.html>

From dnlombar at ichips.intel.com  Tue Nov 24 08:34:11 2009
From: dnlombar at ichips.intel.com (David N. Lombard)
Date: Tue, 24 Nov 2009 08:34:11 -0800
Subject: [Beowulf] UEFI (Unified Extensible Firmware Interface) for the
	BIOS
In-Reply-To: <c4d69730911121647m2f538e04v98780934f6b72cbd@mail.gmail.com>
References: <c4d69730911121647m2f538e04v98780934f6b72cbd@mail.gmail.com>
Message-ID: <20091124163411.GA13390@nlxdcldnl2.cl.intel.com>

On Thu, Nov 12, 2009 at 04:47:42PM -0800, Rahul Nabar wrote:
> Has anyone tried out UEFI (Unified Extensible Firmware Interface) in
> the BIOS? The new servers I am buying come with this option in the
> BIOS. Out of curiosity I googled it up.
> 
> I am not sure if there were any HPC implications of this and wanted to
> double check before I switched to this from my conventional
> plain-vanilla BIOS. Any sort of "industry standard" always sounds good
> but I thought it safer to check on the group first....

Just catching up w/ email.  SC derails most normal work...

I use UEFI, but, well, that shouldn't be a large surprise.  elilo is the boot loader,
it's written in C, and quite easy to manage.  I actually use an abridged version of
elilo to load a Linux kernel/initrd that provides my booting support.  You can, for
example, put the kernel/initrd normally PXE booted directly on the node and get into
that w/o having to deal with PXE, TFTP, et al.  Once you get that far, you can then
carefully tune the kernel/initrd to exquisitely control the boot process.

-- 
David N. Lombard, Intel, Irvine, CA
I do not speak for Intel Corporation; all comments are strictly my own.


From amacater at galactic.demon.co.uk  Tue Nov 24 13:40:11 2009
From: amacater at galactic.demon.co.uk (Andrew M.A. Cater)
Date: Tue, 24 Nov 2009 21:40:11 +0000
Subject: [Beowulf] CentOS plus Fedora kernel?
In-Reply-To: <20091124003534.GA15927@bx9.net>
References: <20091124003534.GA15927@bx9.net>
Message-ID: <20091124214011.GB31909@galactic.demon.co.uk>

On Mon, Nov 23, 2009 at 04:35:34PM -0800, Greg Lindahl wrote:
> For reasons complicated to explain, I want to run a Fedora kernel on
> CentOS 5. Does anyone have any words of wisdom or pointers to webpages
> for people who've done this?
> 
> -- greg
> 
> p.s. missed you guys at SC, I was stuck racking 500 servers...
> http://www.flickr.com/photos/skrenta/sets/72157622738924345/
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

I hesitate to say this, because I'm talking to someone whose 
reputation is stellar and because my own biases may show slightly :)

	Don't - under any circumstances whatever - use Fedora on a 
production system or a system on which you want to do real work.

If you _MUST_ do it - because, for example, your hardware is too new and 
not yet supported under the Red Hat Enterprise Linux / Centos 5.4
kernel - 2.6.18-164* if I recall correctly - then it _may_ work but it 
WILL cause you some degree of instability, interesting debugging 
interaction problems and some hours/days of frustration.

Red Hat 5 was based on FC6 or Fedora 8 IIRC. Both now unavailable on the 
main mirrors.

Fedora 9 has 2.6.25 which is quite a jump. It might be worth getting the 
source RPMs and building the kernels from each of 10,11,12 but on a 
CentOS machine.

RPMForge doesn't seem to have much to help, here :(

All best,

AndyC


From jack at crepinc.com  Mon Nov 23 17:17:43 2009
From: jack at crepinc.com (Jack Carrozzo)
Date: Mon, 23 Nov 2009 20:17:43 -0500
Subject: [Beowulf] CentOS plus Fedora kernel?
In-Reply-To: <20091124003534.GA15927@bx9.net>
References: <20091124003534.GA15927@bx9.net>
Message-ID: <2ad0f9f60911231717w20c2277bp7f47827b45ef19c0@mail.gmail.com>

Technically, there's no reason it wouldn't work, as it's the same to
the OS as if you were just running a vanilla kernel (you're SURE you
need the fedora kernel and a vanilla won't do?)

Get the kernel source for the Fedora one, drop on CentOS, and go to
town as you would a vanilla kernel.

Cheers,

-Jack Carrozzo

On Mon, Nov 23, 2009 at 7:35 PM, Greg Lindahl <lindahl at pbm.com> wrote:
> For reasons complicated to explain, I want to run a Fedora kernel on
> CentOS 5. Does anyone have any words of wisdom or pointers to webpages
> for people who've done this?
>
> -- greg
>
> p.s. missed you guys at SC, I was stuck racking 500 servers...
> http://www.flickr.com/photos/skrenta/sets/72157622738924345/
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>


From ralph.mason at gmail.com  Tue Nov 24 09:30:24 2009
From: ralph.mason at gmail.com (Ralph Mason)
Date: Tue, 24 Nov 2009 09:30:24 -0800
Subject: [Beowulf] Low cost Hi Density - Nehalem clusters
Message-ID: <772f86d60911240930p22a7b2ddga6bb9f2238a3532a@mail.gmail.com>

We have been building a development cluster using consumer core i7
processors with 12gb of ram each and a ide cf boot disk on a motherboard
with embedded everything.  This is very cost effective as it uses consumer
processors and the cheapest ram (ram prices skyrocket once you go higher
than 2gb modules).

Does anyone know of any commercial offering that can pack these nodes into a
high density rack or offer a similar price performance curve for the given
ram and processing power?

Thanks
Ralph
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091124/09e598f3/attachment.html>

From ed at eh3.com  Tue Nov 24 18:46:33 2009
From: ed at eh3.com (Ed Hill)
Date: Tue, 24 Nov 2009 21:46:33 -0500
Subject: [Beowulf] CentOS plus Fedora kernel?
In-Reply-To: <20091124214011.GB31909@galactic.demon.co.uk>
References: <20091124003534.GA15927@bx9.net>
	<20091124214011.GB31909@galactic.demon.co.uk>
Message-ID: <20091124214633.6b20112c@localhost.localdomain>

On Tue, 24 Nov 2009 21:40:11 +0000 "Andrew M.A. Cater" wrote:
> 
> I hesitate to say this, because I'm talking to someone whose 
> reputation is stellar and because my own biases may show slightly :)
> 
> 	Don't - under any circumstances whatever - use Fedora on a 
> production system or a system on which you want to do real work.
> 
> If you _MUST_ do it - because, for example, your hardware is too new
> and not yet supported under the Red Hat Enterprise Linux / Centos 5.4
> kernel - 2.6.18-164* if I recall correctly - then it _may_ work but
> it WILL cause you some degree of instability, interesting debugging 
> interaction problems and some hours/days of frustration.


Please take a deep breath and lay off the FUD.

Fedora is a perfectly capable cluster OS.  I know of a half-dozen
operational clusters (some with literally hundreds of CPUs) that rely on
it for software development, data analysis, production runs, etc., etc.
In my experience, a well-managed Fedora-based cluster will have the same
stability and usability as any other well-managed Linux cluster.

And yes, Fedora does ship with some {leading,bleeding} edge bits which
cuts *both* ways -- sometimes its a big help (e.g., kernel support for
just-released hardware) and sometimes (e.g.; newer library versions) it
can be a bit of a hassle.  As others have mentioned, the short Fedora
lifetimes are not for everyone.

Unless you stumble upon a batch of truly unreliable hardware (which
does occasionally happen), the overall utility of a Linux cluster
is a *direct* result of the skill and care of those who manage it.

Ed

-- 
Edward H. Hill III, PhD  |  ed at eh3.com  |  http://eh3.com/


From jlforrest at berkeley.edu  Tue Nov 24 19:41:56 2009
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Tue, 24 Nov 2009 19:41:56 -0800
Subject: [Beowulf] Low cost Hi Density - Nehalem clusters
In-Reply-To: <772f86d60911240930p22a7b2ddga6bb9f2238a3532a@mail.gmail.com>
References: <772f86d60911240930p22a7b2ddga6bb9f2238a3532a@mail.gmail.com>
Message-ID: <4B0CA784.2060402@berkeley.edu>

Ralph Mason wrote:
> We have been building a development cluster using consumer core i7 
> processors with 12gb of ram each and a ide cf boot disk on a motherboard 
> with embedded everything.  This is very cost effective as it uses 
> consumer processors and the cheapest ram (ram prices skyrocket once you 
> go higher than 2gb modules).  
> 
> Does anyone know of any commercial offering that can pack these nodes 
> into a high density rack or offer a similar price performance curve for 
> the given ram and processing power?  

I've recently used Finetec (www.finetec.com) to put together
a cluster based on the dual-motherboard Supermicro cases.
Since each motherboard can hold 2 processors, and each AMD
Istanbul processor has 6-cores, I can get 24 cores per rack
unit. That's pretty dense.

I believe that SuperMicro also makes similar motherboards
for Intel processors.

Check it out!

Cordially,
-- 
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest at berkeley.edu


From amjad11 at gmail.com  Tue Nov 24 20:42:24 2009
From: amjad11 at gmail.com (amjad ali)
Date: Tue, 24 Nov 2009 23:42:24 -0500
Subject: [Beowulf] Low cost Hi Density - Nehalem clusters
In-Reply-To: <4B0CA784.2060402@berkeley.edu>
References: <772f86d60911240930p22a7b2ddga6bb9f2238a3532a@mail.gmail.com>
	<4B0CA784.2060402@berkeley.edu>
Message-ID: <428810f20911242042w151ab679y82258f85ad15ba2c@mail.gmail.com>

Hi,
while looking for high density (fat) nodes; one should keep in mind that
more number of cores striving simultaneously to access memory, cause memory
contention. Infact today CPU cores are fast and scalable but not the memory
bandwidth. Experts say that buying a CPU is infact buying "bandwidth" not
the "speed".

Intel's specialty is QPI (quick path interconnect) --- removed Front Side
Bus (FSB), in Nehalem CPUs, as an effort to over come the bottleneck of
memory bandwidth. This bottleneck arises when more number of cpu cores, each
performing floating point oprations in HPC applications, strive to get
access to main memory simultaneously. Also Intel introduced QPI to compete
with the AMD's speciality Direct Connect Architecture (DCA) in its latest
CPUs.

Nehalem Xeon 55xx are "server" class while Nehalem Core i7/i9 are "desktop"
class cpus.

In server processors
Today in Servers latest (Nehalem) Quad core Xeon 55xx (DP/ for dual socket
boards) and Xeon 35xx ( UP/ for single socket boards) are alive.
Today in Servers Quad core Xeon 54xx (DP/ for dual socket boards) are not so
good.
And Xeon 53xx (DP/ for dual socket boards) are virtually dead.
And Xeon 33xx/32xx ( UP/ for single socket boards) are not so good.

Today in Servers latest Quad core and Six Core Opteron 83xx and 84xx
(Shanghai -- 3rd generation) are alive.
Today in Servers Opterons 13xx/23xx/24xx (Budapest/Barcelona --- 2nd
generation) are not so good.

There price differences reflect these 'facts'.


On Tue, Nov 24, 2009 at 10:41 PM, Jon Forrest <jlforrest at berkeley.edu>wrote:

> Ralph Mason wrote:
>
>> We have been building a development cluster using consumer core i7
>> processors with 12gb of ram each and a ide cf boot disk on a motherboard
>> with embedded everything.  This is very cost effective as it uses consumer
>> processors and the cheapest ram (ram prices skyrocket once you go higher
>> than 2gb modules).
>> Does anyone know of any commercial offering that can pack these nodes into
>> a high density rack or offer a similar price performance curve for the given
>> ram and processing power?
>>
>
> I've recently used Finetec (www.finetec.com) to put together
> a cluster based on the dual-motherboard Supermicro cases.
> Since each motherboard can hold 2 processors, and each AMD
> Istanbul processor has 6-cores, I can get 24 cores per rack
> unit. That's pretty dense.
>
> I believe that SuperMicro also makes similar motherboards
> for Intel processors.
>
> Check it out!
>
> Cordially,
> --
> Jon Forrest
> Research Computing Support
> College of Chemistry
> 173 Tan Hall
> University of California Berkeley
> Berkeley, CA
> 94720-1460
> 510-643-1032
> jlforrest at berkeley.edu
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091124/4cf910f0/attachment.html>

From bcostescu at gmail.com  Wed Nov 25 03:37:01 2009
From: bcostescu at gmail.com (Bogdan Costescu)
Date: Wed, 25 Nov 2009 12:37:01 +0100
Subject: [Beowulf] CentOS plus Fedora kernel?
In-Reply-To: <20091124003534.GA15927@bx9.net>
References: <20091124003534.GA15927@bx9.net>
Message-ID: <c609bc800911250337m2461809dhb33fd850a6f45f97@mail.gmail.com>

On Tue, Nov 24, 2009 at 1:35 AM, Greg Lindahl <lindahl at pbm.com> wrote:
> For reasons complicated to explain, I want to run a Fedora kernel on
> CentOS 5. Does anyone have any words of wisdom or pointers to webpages
> for people who've done this?

A much newer kernel from Fedora usually requires newer utils as well
and this is where it gets hairy. A recent Fedora kernel SRPM probably
would not even compile on CentOS 5 (haven't tried lately) and a binary
kernel RPM downloaded from a Fedora mirror will certainly not install
(but I guess that you've tried that already ;-)). You can try forcing
the installation (rpm --nodeps) and watch what breaks - chances are
that at least basic functionality will remain; you can also try
installing all the dependencies, but at that point you are probably
running Fedora with only high-level user apps from CentOS; or
something in between where you install only the Fedora kernel and the
required few dependencies for the features you are interested in and
hope that the rest will remain in some functional form.

What do you need from the Fedora kernel ? And why not running the
newly released Fedora 12 which will be the base for CentOS 6 anyway
shortly ? (for a vague definition of shortly ;-))

Bogdan


From pal at di.fct.unl.pt  Wed Nov 25 04:24:00 2009
From: pal at di.fct.unl.pt (Paulo Afonso Lopes)
Date: Wed, 25 Nov 2009 12:24:00 -0000 (WET)
Subject: [Beowulf] CentOS plus Fedora kernel?
In-Reply-To: <c609bc800911250337m2461809dhb33fd850a6f45f97@mail.gmail.com>
References: <20091124003534.GA15927@bx9.net>
	<c609bc800911250337m2461809dhb33fd850a6f45f97@mail.gmail.com>
Message-ID: <26580.193.136.122.19.1259151840.squirrel@webmail.fct.unl.pt>


> On Tue, Nov 24, 2009 at 1:35 AM, Greg Lindahl <lindahl at pbm.com> wrote:
>> For reasons complicated to explain, I want to run a Fedora kernel on
>> CentOS 5. Does anyone have any words of wisdom or pointers to webpages
>> for people who've done this?
>

Just guessing: if "reasons complicated to explain" refers to hardware
which is not supported by CentOS, installing Fedora and booting CentOS in
a VM guest would be a solution for you?

paulo

-- 
Paulo Afonso Lopes                        | Tel: +351- 21 294 8536
Departamento de Inform?tica               | 294 8300 ext.10702
Faculdade de Ci?ncias e Tecnologia        | Fax: +351- 21 294 8541
Universidade Nova de Lisboa               | e-mail: poral at fct.unl.pt
2829-516 Caparica, PORTUGAL


From h.nakashima at media.kyoto-u.ac.jp  Wed Nov 25 00:03:53 2009
From: h.nakashima at media.kyoto-u.ac.jp (Hiroshi Nakashima)
Date: Wed, 25 Nov 2009 17:03:53 +0900
Subject: [Beowulf] [hpc-announce] CFP of ICS'10: Intl. Conf. Supercomputing
Message-ID: <163C0DE85F86479B8EE11B6B8945321B@MEDUZA>

[Our apologies if you receive multiple copies of this CFP]

CALL FOR PAPERS
24th International Conference on Supercomputing (ICS'10)
http://www.ics-conference.org
June 1-4, 2010
Epochal Tsukuba (Tsukuba International Congress Center)
Tsukuba, Japan  http://www.epochal.or.jp/eng/
Sponsored by ACM/SIGARCH

ICS is the premier international forum for the presentation of
research results in high-performance computing systems.  In 2010 the
conference will be held at the Epochal Tsukuba (Tsukuba International
Congress Center) in Tsukuba City, the largest high-tech and academic
city in Japan.

Papers are solicited on all aspects of research, development, and
application of high-performance experimental and commercial systems.
Special emphasis will be given to work that leads to better
understanding of the implications of the new era of million-scale
parallelism and Exa-scale performance; including (but not limited to):

* Computationally challenging scientific and commercial applications:
studies and experiences to exploit ultra large scale parallelism, a
large number of accelerators, and/or cloud computing paradigm.

* High-performance computational and programming models: studies and
proposals of new models, paradigms and languages for scalable
application development, seamless exploitation of accelerators, and
grid/cloud computing.

* Architecture and hardware aspects: processor, accelerator, memory,
interconnection network, storage and I/O architecture to make future
systems scalable, reliable and power efficient.

* Software aspects: compilers and runtime systems, programming and
development tools, middleware and operating systems to enable us to
scale applications and systems easily, efficiently and reliably.

* Performance evaluation studies and theoretical underpinnings of any
of the above topics, especially those giving us perspective toward
future generation high-performance computing.

* Large scale installations in the Petaflop era: design, scaling,
power, and reliability, including case studies and experience reports,
to show the baselines for future systems.

In order to encourage open discussion on future directions, the
program committee will provide higher priority for papers that present
highly innovative and challenging ideas.

Papers should not exceed 6,000 words, and should be submitted
electronically, in PDF format using the ICS'10 submission web
site. Submissions should be blind.  The review process will include a
rebuttal period. Please refer to the ICS'10 web site for detailed
instructions.

Workshop and tutorial proposals are also be solicited and due by
January 18, 2010.  For further information and future updates, refer
to the ICS'10 web site at http://www.ics-conference.org or contact the
General Chair (ics10-chair at hpcs.cs.tsukuba.ac.jp) or Program Co-Chairs
(ics10-chairs at ac.upc.edu).

Important Dates
Abstract submission:  January 11, 2010
Paper submission:     January 18, 2010
Author notification:  March 22, 2010
Final papers:         April 15, 2010

For more information, please visit the conference web site at
http://www.ics-conference.org

[ICS 2010 Committee Members]
GENRAL CHAIR
    Taisuke Boku, U. Tsukuba
PROGRAM CO-CHAIRS
    Hiroshi Nakashima, Kyoto U.
    Avi Mendelson, Microsoft
FINANCE CHAIR
    Kazuki Joe, Nara Women's U.
PUBLICATION CHAIR
    Osamu Tatebe, U. Tsukuba
PUBLICITY CO-CHAIRS
    Darren Kerbyson, LANL
    Hironori Nakajo, Tokyo U. Agric. & Tech.
    Serge Petiton, CNRS/LIFL
WORKSHOP & TUTORIAL CHAIR
    Koji Inoue, Kyushu U.
WEB & SUBMISSION CO-CHAIRS
    Eduard Ayguade, BSC/UPC
    Alex Ramirez, BSC/UPC
LOCAL ARRANGEMENT CHAIR
    Daisuke Takahashi, U. Tsukuba

PROGRAM COMITTEE
    Jung Ho Ahn, Seoul NU.
    Eduard Ayguade, BSC/UPC
    Carl Beckmann, Intel
    Muli Ben-Yehuda, IBM
    Gianfranco Bilardi, U. Padova
    Greg Byrd, NCSU
    Franck Cappello, INRIA
    Marcelo Cintra, U. Edinburgh
    Luiz De Rose, Cray
    Bronis De Supinski, LLNL/CASC
    Jack Dongarra, UTenn/ORNL
    Eytan Frachtenberg, Powerset Research
    Kyle Gallivan, FSU
    Stratis Gallopoulos, ,U. Patras
    Milind Girkar, Intel
    Bill Gropp, UIUC
    Mike Heroux, SNL
    Adolfy Hoisie, LANL
    Koh Hotta, Fujitsu
    Yutaka Ishikawa, U. Tokyo
    Takeshi Iwashita, Kyoto U.
    Kazuki Joe, Nara Woman's U.
    Hironori Kasahara, U. Waseda
    Arun Kejariwal, Yahoo
    Darren Kerbyson, LANL
    Moe Khaleel, PNNL
    Bill Kramer, NCSA
    Andrew Lewis, Griffith U.
    Jose Moreira, IBM
    Walid Najjar, U.C. Riverside
    Kengo Nakajima, U. Tokyo
    Hironori Nakajo, Tokyo U. Agric. & Tech.
    Hiroshi Nakamura, U. Tokyo
    Toshio Nakatani, IBM Research Tokyo
    Michael O'Boyle, U. Edinburgh
    Lenny Oliker, LBNL
    Theodore Papatheodoro, U. Patras
    Miquel Pericas, BSC
    Keshav Pingali, U. Texas 
    Depei Qian, Beihang U.
    Alex Ramirez, BSC/UPC
    Valentina Salapura, IBM
    Mitsuhisa Sato, U. Tsukuba
    John Shalf, LBNL
    Takeshi Shimizu, Fujitsu
    Joshua Simons, Sun Microsystems
    Shinji Sumimoto, Fujitsu
    Makoto Taiji, Riken
    Toshikazu Takada, Riken
    Daisuke Takahashi, U. Tsukuba
    Guangming Tan, ICT
    Osamu Tatebe, U. Tsukuba
    Kenjiro Taura, U. Tokyo
    Rajeev Thakur, ANL
    Rong Tian, NCIC
    Robert Van Engelen, FSU
    Harry Wijshoff, Leiden
    Mitsuo Yokokawa, Riken
    Ayal Zaks, IBM 
    Yunquan Zhang, ISCAS
---------------------------------------------------------------------
Hiroshi Nakashima (h.nakashima at media.kyoto-u.ac.jp)  Professor
Academic Center for Computing and Media Studies
Kyoto University
ACCMS North Bldg., Yoshida Hon-machi, Sakyo-ku, Kyoto, 606-8501, JAPAN
+81-75-753-7457/7448(F)


From agshew at gmail.com  Wed Nov 25 10:35:02 2009
From: agshew at gmail.com (Andrew Shewmaker)
Date: Wed, 25 Nov 2009 11:35:02 -0700
Subject: [Beowulf] CentOS plus Fedora kernel?
In-Reply-To: <20091124003534.GA15927@bx9.net>
References: <20091124003534.GA15927@bx9.net>
Message-ID: <f4050c9e0911251035m437c288h9ef50c65ca139313@mail.gmail.com>

On Mon, Nov 23, 2009 at 5:35 PM, Greg Lindahl <lindahl at pbm.com> wrote:
> For reasons complicated to explain, I want to run a Fedora kernel on
> CentOS 5. Does anyone have any words of wisdom or pointers to webpages
> for people who've done this?

I'm interested in this sort of thing too.  I haven't taken the latest
Fedora 12 kernel and put it on a RHEL 5 distro yet, but I have
backported the spec file to Fedora 9.  Older distros don't checksum
the rpms the same way, so the first step is to unpack it with rpm2cpio
and then recreate the src rpm.  I used vimdiff between the Fedora 12
spec file and a previous Fedora 9 kernel's spec file, and changed
things like the list of directories included in the header subpackage.

I also remember updating grubby and mkinitrd packages, but I don't
recall if that ended up being totally necessary.

If I get this done for a RHEL5 distro, then I'll let you know.
Likewise, if you get it done, then I'd like to try it.

-- 
Andrew Shewmaker


From lindahl at pbm.com  Wed Nov 25 11:34:27 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Wed, 25 Nov 2009 11:34:27 -0800
Subject: [Beowulf] CentOS plus Fedora kernel?
In-Reply-To: <c609bc800911250337m2461809dhb33fd850a6f45f97@mail.gmail.com>
References: <20091124003534.GA15927@bx9.net>
	<c609bc800911250337m2461809dhb33fd850a6f45f97@mail.gmail.com>
Message-ID: <20091125193427.GE16652@bx9.net>

On Wed, Nov 25, 2009 at 12:37:01PM +0100, Bogdan Costescu wrote:

> A much newer kernel from Fedora usually requires newer utils as well

Yeah, last time I did this I didn't rpm-ize the kernel, and that saved
me quite a bit of work. I snagged the .config file out of Fedora, but
didn't grab any patches.

Since you guys seem determined to speculate about what feature I need,
it's the new scheduler, CFS, which appeared in 2.6.23.

> And why not running the newly released Fedora 12 which will be the
> base for CentOS 6 anyway shortly ? (for a vague definition of
> shortly ;-))

Fedora on a big production cluster? Not a chance.

-- greg


From agshew at gmail.com  Wed Nov 25 12:14:01 2009
From: agshew at gmail.com (Andrew Shewmaker)
Date: Wed, 25 Nov 2009 13:14:01 -0700
Subject: [Beowulf] CentOS plus Fedora kernel?
In-Reply-To: <20091125193427.GE16652@bx9.net>
References: <20091124003534.GA15927@bx9.net>
	<c609bc800911250337m2461809dhb33fd850a6f45f97@mail.gmail.com>
	<20091125193427.GE16652@bx9.net>
Message-ID: <f4050c9e0911251214x13541497m19d4d8b7dacf7de2@mail.gmail.com>

On Wed, Nov 25, 2009 at 12:34 PM, Greg Lindahl <lindahl at pbm.com> wrote:
> On Wed, Nov 25, 2009 at 12:37:01PM +0100, Bogdan Costescu wrote:
>
>> A much newer kernel from Fedora usually requires newer utils as well
>
> Yeah, last time I did this I didn't rpm-ize the kernel, and that saved
> me quite a bit of work. I snagged the .config file out of Fedora, but
> didn't grab any patches.
>
> Since you guys seem determined to speculate about what feature I need,
> it's the new scheduler, CFS, which appeared in 2.6.23.

In case you were considering the 2.6.25 kernel that shipped with Fedora 9,
I recommend against it.  I know there have been studies showing that it is
a nice kernel with regard to low interrupt noise, but I have regularly seen it
lock up while running MPI apps.  2.6.27 hasn't had the same issue, but we
have seen 10-20% regression in IP network performance.

-- 
Andrew Shewmaker


From Michael.Frese at NumerEx-LLC.com  Wed Nov 25 16:31:41 2009
From: Michael.Frese at NumerEx-LLC.com (Michael H. Frese)
Date: Wed, 25 Nov 2009 17:31:41 -0700
Subject: [Beowulf] CentOS plus Fedora kernel?
In-Reply-To: <f4050c9e0911251214x13541497m19d4d8b7dacf7de2@mail.gmail.co
 m>
References: <20091124003534.GA15927@bx9.net>
	<c609bc800911250337m2461809dhb33fd850a6f45f97@mail.gmail.com>
	<20091125193427.GE16652@bx9.net>
	<f4050c9e0911251214x13541497m19d4d8b7dacf7de2@mail.gmail.com>
Message-ID: <6.2.5.6.2.20091125171548.060750b8@NumerEx-LLC.com>

At 01:14 PM 11/25/2009, Andrew Shewmaker wrote:
>On Wed, Nov 25, 2009 at 12:34 PM, Greg Lindahl <lindahl at pbm.com> wrote:
> > On Wed, Nov 25, 2009 at 12:37:01PM +0100, Bogdan Costescu wrote:
> >
> >> A much newer kernel from Fedora usually requires newer utils as well
> >
> > Yeah, last time I did this I didn't rpm-ize the kernel, and that saved
> > me quite a bit of work. I snagged the .config file out of Fedora, but
> > didn't grab any patches.
> >
> > Since you guys seem determined to speculate about what feature I need,
> > it's the new scheduler, CFS, which appeared in 2.6.23.
>
>In case you were considering the 2.6.25 kernel that shipped with Fedora 9,
>I recommend against it.  I know there have been studies showing that it is
>a nice kernel with regard to low interrupt noise, but I have regularly seen it
>lock up while running MPI apps.  2.6.27 hasn't had the same issue, but we
>have seen 10-20% regression in IP network performance.

I think I'm seeing 2.6.23 lockup with a big MPI app, but I wouldn't 
have guessed the connection without Andrew's message.

We've been moving away from Fedora toward CentOS 5.4, and thus, back 
to 2.6.18, but apparently not fast enough.


Mike


From shaeffer at neuralscape.com  Wed Nov 25 20:01:22 2009
From: shaeffer at neuralscape.com (Karen Shaeffer)
Date: Wed, 25 Nov 2009 20:01:22 -0800
Subject: [Beowulf] CentOS plus Fedora kernel?
In-Reply-To: <f4050c9e0911251214x13541497m19d4d8b7dacf7de2@mail.gmail.com>
References: <20091124003534.GA15927@bx9.net>
	<c609bc800911250337m2461809dhb33fd850a6f45f97@mail.gmail.com>
	<20091125193427.GE16652@bx9.net>
	<f4050c9e0911251214x13541497m19d4d8b7dacf7de2@mail.gmail.com>
Message-ID: <20091126040122.GA689@synapse.neuralscape.com>

On Wed, Nov 25, 2009 at 01:14:01PM -0700, Andrew Shewmaker wrote:
> On Wed, Nov 25, 2009 at 12:34 PM, Greg Lindahl <lindahl at pbm.com> wrote:
> > On Wed, Nov 25, 2009 at 12:37:01PM +0100, Bogdan Costescu wrote:
> >
> In case you were considering the 2.6.25 kernel that shipped with Fedora 9,
> I recommend against it.  I know there have been studies showing that it is
> a nice kernel with regard to low interrupt noise, but I have regularly seen it
> lock up while running MPI apps.  2.6.27 hasn't had the same issue, but we
> have seen 10-20% regression in IP network performance.

Hi,
The 2.6.27 kernel had a lot of new networking code in there. And some of the
network performance issues carry forward into the 2.6.28 kernel. I would
suggest you go to the 2.6.29 kernel. BTW, centos 5 runs a modified ext3
filesystem. So, that is an issue you'll need to come to terms with in moving
to other kernels. FYI, fedora core 12 runs the 2.6.31 kernel.

Good luck with it.
Karen
-- 
 Karen Shaeffer
 Neuralscape, Palo Alto, Ca. 94306
 shaeffer at neuralscape.com  http://www.neuralscape.com


From csamuel at vpac.org  Wed Nov 25 20:20:14 2009
From: csamuel at vpac.org (Chris Samuel)
Date: Thu, 26 Nov 2009 15:20:14 +1100 (EST)
Subject: [Beowulf] CentOS plus Fedora kernel?
In-Reply-To: <20091126040122.GA689@synapse.neuralscape.com>
Message-ID: <14565064.571259209210844.JavaMail.csamuel@sys26>


----- "Karen Shaeffer" <shaeffer at neuralscape.com> wrote:

Hiya,

> BTW, centos 5 runs a modified ext3 filesystem. So,
> that is an issue you'll need to come to terms with
> in moving to other kernels.

We've not seen any issues running mainline kernels
(2.6.30.x at present) with CentOS 5, what issues have
you seen with this ?

> FYI, fedora core 12 runs the 2.6.31 kernel.

2.6.31 also introduces new (better?) support for MCE's
on the k10h family of CPUs (Barcelona, Shanghai), though
if you (like us) have scripts that parse /var/log/mcelog
you'll need to look at the data in /sys instead.

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


From d.love at liverpool.ac.uk  Thu Nov 26 09:02:05 2009
From: d.love at liverpool.ac.uk (Dave Love)
Date: Thu, 26 Nov 2009 17:02:05 +0000
Subject: [Beowulf] Re: CentOS plus Fedora kernel?
References: <20091124003534.GA15927@bx9.net>
	<c609bc800911250337m2461809dhb33fd850a6f45f97@mail.gmail.com>
	<20091125193427.GE16652@bx9.net>
Message-ID: <87zl69xlrm.fsf@liv.ac.uk>

Greg Lindahl <lindahl at pbm.com> writes:

> Since you guys seem determined to speculate about what feature I need,
> it's the new scheduler, CFS, which appeared in 2.6.23.

What's wrong with the Linux from RedHat's HPC/`Grid' offering (whatever
it's called), then?  It has a 2.6.24 base.  I can't try it because of a
pestilential proprietary driver, but it seemed really to be more
appropriate for compute nodes in various respects (given that I don't
have a choice about running a RHEL-based system).


From shaeffer at neuralscape.com  Thu Nov 26 10:32:38 2009
From: shaeffer at neuralscape.com (Karen Shaeffer)
Date: Thu, 26 Nov 2009 10:32:38 -0800
Subject: [Beowulf] CentOS plus Fedora kernel?
In-Reply-To: <14565064.571259209210844.JavaMail.csamuel@sys26>
References: <20091126040122.GA689@synapse.neuralscape.com>
	<14565064.571259209210844.JavaMail.csamuel@sys26>
Message-ID: <20091126183238.GA9765@synapse.neuralscape.com>

On Thu, Nov 26, 2009 at 03:20:14PM +1100, Chris Samuel wrote:
> 
> ----- "Karen Shaeffer" <shaeffer at neuralscape.com> wrote:
> 
> Hiya,
> 
> > BTW, centos 5 runs a modified ext3 filesystem. So,
> > that is an issue you'll need to come to terms with
> > in moving to other kernels.
> 
> We've not seen any issues running mainline kernels
> (2.6.30.x at present) with CentOS 5, what issues have
> you seen with this ?

Hi Chris,
Actually, I haven't tried it recently. But I have recently tried
to run a recompiled and reconfigured RHEL 5 kernel on a stock
ext3 filesystem. And it crashed almost immediately with filesystem
corruption. And the RHEL 5 grub can't read a stock ext3 filesystem
either.

I appreciate your feedback, because I was wondering about that
very question. And my comment only meant to suggest one needed to
be aware of the issue. Based on your comments, it appears the RH
extension of ext3 is backward compatible with kernel.org kernels.
But apparently the RH kernels require the extensions in order to
write.

Thanks,
Karen


-- 
 Karen Shaeffer
 Neuralscape, Palo Alto, Ca. 94306
 shaeffer at neuralscape.com  http://www.neuralscape.com


From ispmarin at gmail.com  Thu Nov 26 16:38:38 2009
From: ispmarin at gmail.com (Ivan Marin)
Date: Thu, 26 Nov 2009 22:38:38 -0200
Subject: [Beowulf] Parallel programming using Scalapack and OpenMPI
Message-ID: <751c63ee0911261638v6bf2e136y26df1c734413615f@mail.gmail.com>

Hello all,

I've been following the discussions here in this list for quite a while and
always enjoying the discussions, and did some admin work in beowulf
clusters. But after a long time far from parallel programming, now for my
PhD in groundwater simulation I'm trying again to implement the linear
solver pdgesv from Scalapack. I'm having some troubles with the definitions
in the function call within C++ and the best data distribution, so I would
like to ask: is there anybody on this list developing with Scalapack and
C++? Where is the proper place to ask Scalapack questions? It seems that
both the forum and the mailing list doesn't have any activity recently.

Thank you in advance!

Ivan Marin

Laborat?rio de Hidr?ulica Computacional - LHC
Departamento de Hidr?ulica e Saneamento - SHS
Escola de Engenharia de S?o Carlos - EESC
Universidade de S?o Paulo - USP

http://albatroz.shs.eesc.usp.br
+55 16 3373 8270
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091126/2c7cb791/attachment.html>

From greg.matthews at diamond.ac.uk  Fri Nov 27 02:08:42 2009
From: greg.matthews at diamond.ac.uk (Gregory Matthews)
Date: Fri, 27 Nov 2009 10:08:42 +0000
Subject: [Beowulf] CentOS plus Fedora kernel?
In-Reply-To: <20091126183238.GA9765@synapse.neuralscape.com>
References: <20091126040122.GA689@synapse.neuralscape.com>	<14565064.571259209210844.JavaMail.csamuel@sys26>
	<20091126183238.GA9765@synapse.neuralscape.com>
Message-ID: <4B0FA52A.3050901@diamond.ac.uk>

ack... meant this to go to the list:

Karen Shaeffer wrote:
> On Thu, Nov 26, 2009 at 03:20:14PM +1100, Chris Samuel wrote:
>> ----- "Karen Shaeffer" <shaeffer at neuralscape.com> wrote:
>>
>> Hiya,
>>
>>> BTW, centos 5 runs a modified ext3 filesystem. So,
>>> that is an issue you'll need to come to terms with
>>> in moving to other kernels.
>> We've not seen any issues running mainline kernels
>> (2.6.30.x at present) with CentOS 5, what issues have
>> you seen with this ?
> 
> Hi Chris,
> Actually, I haven't tried it recently. But I have recently tried
> to run a recompiled and reconfigured RHEL 5 kernel on a stock
> ext3 filesystem. And it crashed almost immediately with filesystem
> corruption. And the RHEL 5 grub can't read a stock ext3 filesystem
> either.

this is news to me. where can I find info on the modifications that RH
are using? Google hasn't turned anything up for me this morning.

GREG

> 
> I appreciate your feedback, because I was wondering about that
> very question. And my comment only meant to suggest one needed to
> be aware of the issue. Based on your comments, it appears the RH
> extension of ext3 is backward compatible with kernel.org kernels.
> But apparently the RH kernels require the extensions in order to
> write.
> 
> Thanks,
> Karen
> 
> 


-- 
Greg Matthews            01235 778658
Senior Computer Systems Administrator
Diamond Light Source, Oxfordshire, UK


From jellogum at gmail.com  Fri Nov 27 21:21:42 2009
From: jellogum at gmail.com (Jeremy Baker)
Date: Fri, 27 Nov 2009 21:21:42 -0800
Subject: [Beowulf] ask about mpich
In-Reply-To: <c761caee0911120433u2a9f27b4u435872577bc9dbee@mail.gmail.com>
References: <c761caee0911120433u2a9f27b4u435872577bc9dbee@mail.gmail.com>
Message-ID: <d2300d1c0911272121pe500df0if3780f751c8e4bf3@mail.gmail.com>

Edited revision of original post without permission from author:

"Hello folks;

I want to make a cluster system employing the command/function: mpich, in
Ubuntu, but I am not too familiar with it. I could use some advice for a
problem that a good translation might solve for me.

I have followed the instructions [assuming translated into your native
language] for mpich, the information related to the cluster that I am using,
but it does not work. The project is related to an important work that is my
thesis. Help would be greatly appreciated!"


What is your native language?


On Thu, Nov 12, 2009 at 4:33 AM, christian suhendra <
christiansuhendra at gmail.com> wrote:

> halo guys i wants to make a cluster system with mpich in ubuntu,,but i have
> troubleshooting with mpich..
> but when i run the example program in mpich..it doesn't work in
> cluster..but i've registered the node on machine.LINUX..
> but still not working
> please help me..this is my thesis...
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>


-- 
Jeremy Baker
PO 297
Johnson, VT
05656
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091127/2c1e0420/attachment.html>

From d.love at liverpool.ac.uk  Sun Nov 29 15:01:02 2009
From: d.love at liverpool.ac.uk (Dave Love)
Date: Sun, 29 Nov 2009 23:01:02 +0000
Subject: [Beowulf] Re: CentOS plus Fedora kernel?
References: <20091126040122.GA689@synapse.neuralscape.com>
	<14565064.571259209210844.JavaMail.csamuel@sys26>
	<20091126183238.GA9765@synapse.neuralscape.com>
	<4B0FA52A.3050901@diamond.ac.uk>
Message-ID: <87bpillyvl.fsf@liv.ac.uk>

Gregory Matthews <greg.matthews at diamond.ac.uk> writes:

>> Actually, I haven't tried it recently. But I have recently tried
>> to run a recompiled and reconfigured RHEL 5 kernel on a stock
>> ext3 filesystem. And it crashed almost immediately with filesystem
>> corruption. And the RHEL 5 grub can't read a stock ext3 filesystem
>> either.
>
> this is news to me. where can I find info on the modifications that RH
> are using? Google hasn't turned anything up for me this morning.

It sounds implausible, and the RH grub has no patches that mention ext3
in their name, but I'm not sure I've actually booted that way round.

>> But apparently the RH kernels require the extensions in order to
>> write.

Experimentally an ext3 filesystem made on SuSE, seems fine for i/o from
RH 5.4.


From agshew at gmail.com  Mon Nov 30 09:15:38 2009
From: agshew at gmail.com (Andrew Shewmaker)
Date: Mon, 30 Nov 2009 10:15:38 -0700
Subject: [Beowulf] Re: CentOS plus Fedora kernel?
In-Reply-To: <87zl69xlrm.fsf@liv.ac.uk>
References: <20091124003534.GA15927@bx9.net>
	<c609bc800911250337m2461809dhb33fd850a6f45f97@mail.gmail.com>
	<20091125193427.GE16652@bx9.net> <87zl69xlrm.fsf@liv.ac.uk>
Message-ID: <f4050c9e0911300915r64450155u728fbfa384b5be61@mail.gmail.com>

On Thu, Nov 26, 2009 at 10:02 AM, Dave Love <d.love at liverpool.ac.uk> wrote:
>> Since you guys seem determined to speculate about what feature I need,
>> it's the new scheduler, CFS, which appeared in 2.6.23.
>
> What's wrong with the Linux from RedHat's HPC/`Grid' offering (whatever
> it's called), then? ?It has a 2.6.24 base. ?I can't try it because of a
> pestilential proprietary driver, but it seemed really to be more
> appropriate for compute nodes in various respects (given that I don't
> have a choice about running a RHEL-based system).

I had assumed that they were using the same 2.6.18 base, just with
more patches.  I might try that out, but part of the reason I want to
use the latest kernel is that I want to do a better job of providing
feedback to the kernel developers.

-- 
Andrew Shewmaker


From gus at ldeo.columbia.edu  Mon Nov 30 10:24:09 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Mon, 30 Nov 2009 13:24:09 -0500
Subject: [Beowulf] Parallel programming using Scalapack and OpenMPI
In-Reply-To: <751c63ee0911261638v6bf2e136y26df1c734413615f@mail.gmail.com>
References: <751c63ee0911261638v6bf2e136y26df1c734413615f@mail.gmail.com>
Message-ID: <4B140DC9.5020100@ldeo.columbia.edu>

Hi Ivan

PETSc (short for
Portable Extensible Toolkit for Scientific Computation,
from Argonne Natl. Lab.)
gives you the ability to use Scalapack
and provides a number of linear and PDE solvers:

http://www.mcs.anl.gov/petsc/petsc-as/

It builds on top of MPI, BLAS and LAPACK,
but you can bind it to a variety of Linear Algebra and
other packages, including Scalapack.

IIRR, PETSc's default MPI is MPICH2, which is how
I built it here a while ago.
However, I think it can be built with OpenMPI as well.
See this FAQ:
http://www.open-mpi.org/faq/?category=mpi-apps#petsc

PETSc has C, Fortran, C++, and Python APIs.

Some people here used PETSc very successfully
(problems were solved, theses were written,
PhDs were awarded, papers were published)
on global ocean circulation inverse problems,
magma migration modeling (i.e., reactive fluid
flow in porous media, sounds familiar?), etc.

PETSc is certainly good for prototyping,
although there is a learning curve.
As for efficiency and production codes, I don't know,
but you can check their FAQ:

http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html

and mailing list,
where you may also find some useful information about
your questions and specific problem:

https://lists.mcs.anl.gov/mailman/listinfo/petsc-users
http://lists.mcs.anl.gov/pipermail/petsc-users/

In case you don't know, the (very simple)
Scalapack home page is on Netlib site:

http://www.netlib.org/scalapack/scalapack_home.html


My two cents.

Boa sorte!
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Ivan Marin wrote:
> Hello all,
> 
> I've been following the discussions here in this list for quite a while 
> and always enjoying the discussions, and did some admin work in beowulf 
> clusters. But after a long time far from parallel programming, now for 
> my PhD in groundwater simulation I'm trying again to implement the 
> linear solver pdgesv from Scalapack. I'm having some troubles with the 
> definitions in the function call within C++ and the best data 
> distribution, so I would like to ask: is there anybody on this list 
> developing with Scalapack and C++? Where is the proper place to ask 
> Scalapack questions? It seems that both the forum and the mailing list 
> doesn't have any activity recently.
> 
> Thank you in advance!
> 
> Ivan Marin
> 
> Laborat?rio de Hidr?ulica Computacional - LHC
> Departamento de Hidr?ulica e Saneamento - SHS
> Escola de Engenharia de S?o Carlos - EESC
> Universidade de S?o Paulo - USP
> 
> http://albatroz.shs.eesc.usp.br
> +55 16 3373 8270
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


From gus at ldeo.columbia.edu  Mon Nov 30 12:15:42 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Mon, 30 Nov 2009 15:15:42 -0500
Subject: [Beowulf] Parallel programming using Scalapack and OpenMPI
In-Reply-To: <428810f20911301147g20049d51x9a9606a27ec950ff@mail.gmail.com>
References: <751c63ee0911261638v6bf2e136y26df1c734413615f@mail.gmail.com>	
	<4B140DC9.5020100@ldeo.columbia.edu>
	<428810f20911301147g20049d51x9a9606a27ec950ff@mail.gmail.com>
Message-ID: <4B1427EE.3010000@ldeo.columbia.edu>

Hi Amjad

amjad ali wrote:
> Hi,
> Please explain in detail about:
> 
> PETSc is certainly good for prototyping,
> although there is a learning curve.
> 
> What is meant by learning curve.  [?]

http://en.wikipedia.org/wiki/Learning_curve

Google is your friend!
Wikipedia is your friend!

IHIH
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

> 
> 
> 
> On Mon, Nov 30, 2009 at 1:24 PM, Gus Correa <gus at ldeo.columbia.edu 
> <mailto:gus at ldeo.columbia.edu>> wrote:
> 
>     Hi Ivan
> 
>     PETSc (short for
>     Portable Extensible Toolkit for Scientific Computation,
>     from Argonne Natl. Lab.)
>     gives you the ability to use Scalapack
>     and provides a number of linear and PDE solvers:
> 
>     http://www.mcs.anl.gov/petsc/petsc-as/
> 
>     It builds on top of MPI, BLAS and LAPACK,
>     but you can bind it to a variety of Linear Algebra and
>     other packages, including Scalapack.
> 
>     IIRR, PETSc's default MPI is MPICH2, which is how
>     I built it here a while ago.
>     However, I think it can be built with OpenMPI as well.
>     See this FAQ:
>     http://www.open-mpi.org/faq/?category=mpi-apps#petsc
> 
>     PETSc has C, Fortran, C++, and Python APIs.
> 
>     Some people here used PETSc very successfully
>     (problems were solved, theses were written,
>     PhDs were awarded, papers were published)
>     on global ocean circulation inverse problems,
>     magma migration modeling (i.e., reactive fluid
>     flow in porous media, sounds familiar?), etc.
> 
>     PETSc is certainly good for prototyping,
>     although there is a learning curve.
>     As for efficiency and production codes, I don't know,
>     but you can check their FAQ:
> 
>     http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html
> 
>     and mailing list,
>     where you may also find some useful information about
>     your questions and specific problem:
> 
>     https://lists.mcs.anl.gov/mailman/listinfo/petsc-users
>     http://lists.mcs.anl.gov/pipermail/petsc-users/
> 
>     In case you don't know, the (very simple)
>     Scalapack home page is on Netlib site:
> 
>     http://www.netlib.org/scalapack/scalapack_home.html
> 
> 
>     My two cents.
> 
>     Boa sorte!
>     Gus Correa
>     ---------------------------------------------------------------------
>     Gustavo Correa
>     Lamont-Doherty Earth Observatory - Columbia University
>     Palisades, NY, 10964-8000 - USA
>     ---------------------------------------------------------------------
> 
> 
>     Ivan Marin wrote:
> 
>         Hello all,
> 
>         I've been following the discussions here in this list for quite
>         a while and always enjoying the discussions, and did some admin
>         work in beowulf clusters. But after a long time far from
>         parallel programming, now for my PhD in groundwater simulation
>         I'm trying again to implement the linear solver pdgesv from
>         Scalapack. I'm having some troubles with the definitions in the
>         function call within C++ and the best data distribution, so I
>         would like to ask: is there anybody on this list developing with
>         Scalapack and C++? Where is the proper place to ask Scalapack
>         questions? It seems that both the forum and the mailing list
>         doesn't have any activity recently.
> 
>         Thank you in advance!
> 
>         Ivan Marin
> 
>         Laborat?rio de Hidr?ulica Computacional - LHC
>         Departamento de Hidr?ulica e Saneamento - SHS
>         Escola de Engenharia de S?o Carlos - EESC
>         Universidade de S?o Paulo - USP
> 
>         http://albatroz.shs.eesc.usp.br
>         +55 16 3373 8270
> 
> 
>         ------------------------------------------------------------------------
> 
>         _______________________________________________
>         Beowulf mailing list, Beowulf at beowulf.org
>         <mailto:Beowulf at beowulf.org> sponsored by Penguin Computing
>         To change your subscription (digest mode or unsubscribe) visit
>         http://www.beowulf.org/mailman/listinfo/beowulf
> 
> 
>     _______________________________________________
>     Beowulf mailing list, Beowulf at beowulf.org
>     <mailto:Beowulf at beowulf.org> sponsored by Penguin Computing
>     To change your subscription (digest mode or unsubscribe) visit
>     http://www.beowulf.org/mailman/listinfo/beowulf
> 
> 


From amjad11 at gmail.com  Mon Nov 30 12:24:34 2009
From: amjad11 at gmail.com (amjad ali)
Date: Mon, 30 Nov 2009 15:24:34 -0500
Subject: [Beowulf] MPI Processes + Auto Vectorization
Message-ID: <428810f20911301224u64783b27q465a1bc73918849@mail.gmail.com>

Hi,
Suppose we run a parallel MPI code with 64 processes on a cluster, say of 16
nodes. The cluster nodes has multicore CPU say 4 cores on each node.

Now all the 64 cores on the cluster running a process. Program is SPMD,
means all processes has the same workload.

Now if we had done auto-vectorization while compiling the code (for example
with Intel compilers); Will there be any benefit (efficiency/scalability
improvement) of having code with the auto-vectorization? Or we will get the
same performance as without Auto-vectorization in this example case?


How can we really get benefit in performance improvement with
Auto-Vectorization?

Thank you.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091130/c69de3e6/attachment.html>

From dnlombar at ichips.intel.com  Mon Nov 30 14:50:24 2009
From: dnlombar at ichips.intel.com (David N. Lombard)
Date: Mon, 30 Nov 2009 14:50:24 -0800
Subject: [Beowulf] MPI Processes + Auto Vectorization
In-Reply-To: <428810f20911301224u64783b27q465a1bc73918849@mail.gmail.com>
References: <428810f20911301224u64783b27q465a1bc73918849@mail.gmail.com>
Message-ID: <20091130225024.GA28311@nlxdcldnl2.cl.intel.com>

On Mon, Nov 30, 2009 at 01:24:34PM -0700, amjad ali wrote:
> Hi,
> Suppose we run a parallel MPI code with 64 processes on a cluster, say of 16 nodes. The cluster nodes has multicore CPU say 4 cores on each node.
> 
> Now all the 64 cores on the cluster running a process. Program is SPMD, means all processes has the same workload.
> 
> Now if we had done auto-vectorization while compiling the code (for example with Intel compilers); Will there be any benefit (efficiency/scalability improvement) of having code with the auto-vectorization? Or we will get the same performance as without Auto-vectorization in this example case?
> 
> How can we really get benefit in performance improvement with Auto-Vectorization?

Vectorization takes advantage of the processor's vector instructions to increase data-level parallelism.
How much that benefits your code depends very much on your code; you would need to recompile your code and test.

-- 
David N. Lombard, Intel, Irvine, CA
I do not speak for Intel Corporation; all comments are strictly my own.


From amjad11 at gmail.com  Mon Nov 30 22:14:13 2009
From: amjad11 at gmail.com (amjad ali)
Date: Tue, 1 Dec 2009 01:14:13 -0500
Subject: [Beowulf] MPI Processes + Auto Vectorization
In-Reply-To: <20091130225024.GA28311@nlxdcldnl2.cl.intel.com>
References: <428810f20911301224u64783b27q465a1bc73918849@mail.gmail.com>
	<20091130225024.GA28311@nlxdcldnl2.cl.intel.com>
Message-ID: <428810f20911302214x4b85a07du4684bbb57f60a72b@mail.gmail.com>

Hi,
perhaps I could not better ask my question.

My question is that if we do not have free cpu cores in a PC or cluster (all
cores are running MPI processes), still the auto-vertorization is
beneficial? Or it is beneficial only if we have some free cpu cores locally?


thanks


On Mon, Nov 30, 2009 at 5:50 PM, David N. Lombard <dnlombar at ichips.intel.com
> wrote:

> On Mon, Nov 30, 2009 at 01:24:34PM -0700, amjad ali wrote:
> > Hi,
> > Suppose we run a parallel MPI code with 64 processes on a cluster, say of
> 16 nodes. The cluster nodes has multicore CPU say 4 cores on each node.
> >
> > Now all the 64 cores on the cluster running a process. Program is SPMD,
> means all processes has the same workload.
> >
> > Now if we had done auto-vectorization while compiling the code (for
> example with Intel compilers); Will there be any benefit
> (efficiency/scalability improvement) of having code with the
> auto-vectorization? Or we will get the same performance as without
> Auto-vectorization in this example case?
> >
> > How can we really get benefit in performance improvement with
> Auto-Vectorization?
>
> Vectorization takes advantage of the processor's vector instructions to
> increase data-level parallelism.
> How much that benefits your code depends very much on your code; you would
> need to recompile your code and test.
>
> --
> David N. Lombard, Intel, Irvine, CA
> I do not speak for Intel Corporation; all comments are strictly my own.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091201/7d576b88/attachment.html>

From h-bugge at online.no  Mon Nov 30 23:54:54 2009
From: h-bugge at online.no (=?ISO-8859-1?Q?H=E5kon_Bugge?=)
Date: Tue, 1 Dec 2009 08:54:54 +0100
Subject: [Beowulf] MPI Processes + Auto Vectorization
In-Reply-To: <428810f20911302214x4b85a07du4684bbb57f60a72b@mail.gmail.com>
References: <428810f20911301224u64783b27q465a1bc73918849@mail.gmail.com>
	<20091130225024.GA28311@nlxdcldnl2.cl.intel.com>
	<428810f20911302214x4b85a07du4684bbb57f60a72b@mail.gmail.com>
Message-ID: <B4323BC0-1FDA-4450-8BBF-61A971AF57B9@online.no>


On Dec 1, 2009, at 7:14 , amjad ali wrote:
> My question is that if we do not have free cpu cores in a PC or  
> cluster (all cores are running MPI processes), still the auto- 
> vertorization is beneficial? Or it is beneficial only if we have  
> some free cpu cores locally?


Amjad,

Vectorization is in x86_64 parlor a compilation technique where the  
compiler will utilize certain instructions which operate on short  
vectors. When you execute such a program on a particular core, these  
vector-instructions will execute on special execution unit _within_  
the core you're executing on. Hence, no additional resources or cores  
are required to use vector instructions and you will benefit from them  
independent of whether you fully use all cores in your cluster or not.

H?kon
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091201/e10c6c3a/attachment.html>

From rjtucke at gmail.com  Wed Nov 25 13:53:49 2009
From: rjtucke at gmail.com (Ross Tucker)
Date: Wed, 25 Nov 2009 14:53:49 -0700
Subject: [Beowulf] New member, upgrading our existing Beowulf cluster
Message-ID: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com>

Greetings!

I'm a new member to this list, but the research group that I work for has
had a working cluster for many years. I am now looking at upgrading our
current configuration. I was wondering if anyone has actual experience with
running more than one node from a single power supply. Even just two boards
on one PSU would be nice. We will be using barely 200W per node for 50 nodes
and it just seems like a big waste to buy 50 power supply units. I have read
the old posts but did not see any reports of success.

Best regards,
Ross Tucker
Ariz State Univ
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091125/114221c3/attachment.html>

From lm.moreira at gmail.com  Thu Nov 26 10:15:58 2009
From: lm.moreira at gmail.com (Leonardo Machado Moreira)
Date: Thu, 26 Nov 2009 16:15:58 -0200
Subject: [Beowulf] Cluster Users in Clusters Linux and Windows
Message-ID: <4788ffe70911261015t2817fcd4i55044d692b1aed64@mail.gmail.com>

Hi!

I am trying to create a cluster with only two machines.

The server will be a Linux machine, an Arch Linux distribution to be more
specific. The slave machine will be a Windows 7 machine.

I have found it is possible, but I was looking and have found that each
machine on the cluster must have the same user for the cluster.

I was wondering how would I deal with it with the windows machine ?

Do I have do implement a specific program in it? Would it found the rsh ?

Thanks in advance!

Leonardo Machado Moreira
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091126/1ab4bfaa/attachment.html>

From toon.knapen at gmail.com  Thu Nov 26 12:46:34 2009
From: toon.knapen at gmail.com (Toon Knapen)
Date: Thu, 26 Nov 2009 21:46:34 +0100
Subject: [Beowulf] rhel hpc
Message-ID: <d5bdff000911261246y6a89079co30bca6ad7082d68c@mail.gmail.com>

Dear all,

I've been working on hpux-itanium for the last 2 years (and even
unsubscribed to beowulf-ml during most of that time, my bad) but soon will
turn back to a beowulf cluster (HP DL380G6's with Xeon X5560, amcc/3ware
9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have a few questions on the
config.

1) our company is standardised on RHEL 5.1. Would sticking with rhel 5.1
instead of going to the latest make a difference.
2) What are the advantages of the hpc version of rhel. I browsed the doc but
unless having to compile mpi myself I do not see a difference or did I miss
soth.
3) which filesystem is advisable knowing that we're calculating on large
berkeley db databases

thanks in advance,

toon
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091126/c38442c3/attachment.html>

From dzaletnev at yandex.ru  Fri Nov 27 23:27:04 2009
From: dzaletnev at yandex.ru (Dmitry Zaletnev)
Date: Sat, 28 Nov 2009 10:27:04 +0300
Subject: [Beowulf] ask about mpich
Message-ID: <141481259393224@webmail111.yandex.ru>

MPICH2 condensed instructions:

mpd daemon ring setup (MPICH2 installed supposed)

1. At all nodes:
cd $HOME
touch .mpd.conf
chmod 600 .mpd.conf
nano .mpd.conf
(if there's no nano - aptitude install nano)
Enter in the file:
MPD_SECRETWORD=_secretword_
_secretword_ must be the same at all nodes
2. At all nodes in /etc/hosts as root enter all nodes IP's, for the current node 127.0.0.1 change by its actual IP.
3. At head node run:
mpd &
mpdtrace -l

Get _host_ _ _port_
4. At slave nodes run
mpd -h _host_ -p _port_ &
5. See if the daemon ring started by running
mpdtrace

Running mpiexec (2 examples)

1. mpiexec -machinefile /home/user/mfile -np 4 -wdir /home/user ./_YourCode_
The content of the file mfile:
Slave0:2
Slave1:2

2 - number of cores

2. mpiexec -genv FOO BAR -n 2 -host Slave0 a.out : -n 2 -host Slave1 b.out

FOO - environment variable
BAR - its value
a.out and b.out - executables. 

> Edited revision of original post without permission from author:
> "Hello folks;
> I want to make a cluster system employing the command/function: mpich, in Ubuntu, but I am not too familiar with it. I could use some advice for a problem that a good translation might solve for me.
> I have followed the instructions [assuming translated into your native language] for mpich, the information related to the cluster that I am using, but it does not work. The project is related to an important work that is my thesis. Help would be greatly appreciated!"
> What is your native language?
> On Thu, Nov 12, 2009 at 4:33 AM, christian suhendra  wrote:
> > halo guys i wants to make a cluster system with mpich in ubuntu,,but i have troubleshooting with mpich..
> >  but when i run the example program in mpich..it doesn't work in cluster..but i've registered the node on machine.LINUX..
> >  but still not working
> > please help me..this is my thesis...
> > 
> > _______________________________________________
> >  Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> >  To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> > 
> -- 
> Jeremy Baker
> PO 297
> Johnson, VT
> 05656
> 

??????.?????. ?????? ????. ????? - ???. http://mail.yandex.ru/nospam/sign


From dmitri.chubarov at gmail.com  Sat Nov 28 00:55:10 2009
From: dmitri.chubarov at gmail.com (Dmitri Chubarov)
Date: Sat, 28 Nov 2009 14:55:10 +0600
Subject: [Beowulf] ask about mpich
In-Reply-To: <d2300d1c0911272121pe500df0if3780f751c8e4bf3@mail.gmail.com>
References: <c761caee0911120433u2a9f27b4u435872577bc9dbee@mail.gmail.com>
	<d2300d1c0911272121pe500df0if3780f751c8e4bf3@mail.gmail.com>
Message-ID: <a7b3ee1e0911280055r33fc15c2v7557a4e5579c6d9a@mail.gmail.com>

Hello, Christian,

you probably will need some advice off the list with your mpich setup.
If you tell the list where you are located and studying, you might
receive the help you need from someone who speaks your language or is
located nearby since the audience of Beowulf list is indeed very
widely dispersed around the globe. Also you might want to send a more
detailed description of your problems to mpich-discuss mailing list.

Jeremy, judging by the name of the original poster, the native
language is probably Indonesian.

Best regards,
  Dima

On Sat, Nov 28, 2009 at 11:21 AM, Jeremy Baker <jellogum at gmail.com> wrote:
> Edited revision of original post without permission from author:
>
> "Hello folks;
>
> I want to make a cluster system employing the command/function: mpich, in
> Ubuntu, but I am not too familiar with it. I could use some advice for a
> problem that a good translation might solve for me.
>
> I have followed the instructions [assuming translated into your native
> language] for mpich, the information related to the cluster that I am using,
> but it does not work. The project is related to an important work that is my
> thesis. Help would be greatly appreciated!"
>
>
> What is your native language?
>
>
>
> On Thu, Nov 12, 2009 at 4:33 AM, christian suhendra
> <christiansuhendra at gmail.com> wrote:
>>
>> halo guys i wants to make a cluster system with mpich in ubuntu,,but i
>> have troubleshooting with mpich..
>> but when i run the example program in mpich..it doesn't work in
>> cluster..but i've registered the node on machine.LINUX..
>> but still not working
>> please help me..this is my thesis...
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>
>
>
> --
> Jeremy Baker
> PO 297
> Johnson, VT
> 05656
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
>


From toon.knapen at gmail.com  Mon Nov 30 01:35:47 2009
From: toon.knapen at gmail.com (Toon Knapen)
Date: Mon, 30 Nov 2009 10:35:47 +0100
Subject: [Beowulf] Fwd: rhel hpc
In-Reply-To: <d5bdff000911261246y6a89079co30bca6ad7082d68c@mail.gmail.com>
References: <d5bdff000911261246y6a89079co30bca6ad7082d68c@mail.gmail.com>
Message-ID: <d5bdff000911300135g2ea0c342ha83c714e28b60540@mail.gmail.com>

 Dear all,

I've been working on hpux-itanium for the last 2 years (and even
unsubscribed to beowulf-ml during most of that time, my bad) but soon will
turn back to a beowulf cluster (HP DL380G6's with Xeon X5560, amcc/3ware
9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have a few questions on the
config.

1) our company is standardised on RHEL 5.1. Would sticking with rhel 5.1
instead of going to the latest make a difference.
2) What are the advantages of the hpc version of rhel. I browsed the doc but
unless having to compile mpi myself I do not see a difference or did I miss
soth.
3) which filesystem is advisable knowing that we're calculating on large
berkeley db databases

thanks in advance,

toon
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20091130/d9ab7d20/attachment.html>